How to Perform Full-fledged RAG for Any Website Using Firecrawl and Korvus


We are excited to present a detailed guide on combining the RAG (Retrieval Augmented Generation) capabilities of Korvus with Firecrawl. Together, these tools let you quickly and easily set up a retrieval-augmented generation system over data from any website. Our approach demonstrates how to combine efficient web scraping, data processing, and modern machine learning methods in one elegant solution.

What You’ll Learn from This Guide:

  1. How to use Firecrawl for efficient and structured web content collection (we’ll be using our blog as an example)
  2. How to process and index the collected data using Korvus’s powerful Pipeline and Collection tools
  3. How to perform vector search, text generation, and reranking (RAG) within a single query, using modern open-source models

Introduction to Key Tools

Firecrawl

Firecrawl is not just a web scraper, but a powerful tool that transforms websites into clean, structured markdown data. It’s an ideal solution for creating a robust and up-to-date knowledge base for RAG applications. Firecrawl automatically cleans the content, removing unnecessary elements and leaving only important information.

Korvus

Korvus is a versatile SDK for PostgresML, available in several programming languages: Python, JavaScript, Rust, and C. It handles all the complex work of document processing, vector search, and response generation, combining these processes into one efficient query. With Korvus, you can take advantage of all the benefits of in-database machine learning without needing to work directly with SQL or manage the database.

PostgresML

PostgresML is an in-database ML/AI engine developed by experienced machine learning engineers from Instacart. It allows you to train, test, and deploy machine learning models right inside Postgres. This means you can perform complex machine learning operations in the same place where your data is stored, significantly increasing efficiency and reducing infrastructure complexity.

These three tools together provide everything needed to deploy a flexible and powerful RAG stack based on web data. The key advantage of this approach is that your data is stored right where inference is performed. This eliminates the need for a separate vector database or additional frameworks like LlamaIndex or Langchain to combine all components. As we like to say: “The fewer microservices, the fewer problems.”

Getting Started

Before we dive into the code, you’ll need to set up a few important elements. To work with our system, you need to set two key environment variables: FIRECRAWL_API_KEY and KORVUS_DATABASE_URL (a quick startup check is sketched after the list below).

  1. To obtain FIRECRAWL_API_KEY, register at firecrawl.dev. This key gives you access to Firecrawl’s powerful web scraping capabilities.
  2. As for KORVUS_DATABASE_URL, you have two options:
    • The easiest way is to sign up at postgresml.org. There you’ll get a ready-to-use database with all necessary extensions already installed.
    • For those who prefer more control, you can deploy your own Postgres instance with the pgml and pgvector extensions installed. This requires a bit more setup but gives you full control over your environment.
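
Once both values are in place, a small check at the top of your script can fail fast if something is missing. This is just a defensive sketch; the variable names are the ones used throughout this guide:

import os

# Fail fast if either required environment variable is missing.
for var in ("FIRECRAWL_API_KEY", "KORVUS_DATABASE_URL"):
    if not os.environ.get(var):
        raise RuntimeError(f"Missing required environment variable: {var}")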

Detailed Code Breakdown

Now let’s dive into the code and examine each step of the process in more detail.

1. Import and Initialization

We’ll start by importing the necessary libraries and initializing our Firecrawl application:

from korvus import Collection, Pipeline
from firecrawl import FirecrawlApp
import os, time, asyncio
from rich import print

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

Here we import key components from Korvus and Firecrawl, as well as some standard Python libraries. Note the use of the rich library for enhanced console output. We initialize the Firecrawl application using the API key stored in an environment variable.

2. Defining Pipeline and Collection

The next step is setting up our data processing pipeline and collection:

pipeline = Pipeline(
    "v0",
    {
        "markdown": {
            "splitter": {"model": "markdown"},
            "semantic_search": {
                "model": "mixedbread-ai/mxbai-embed-large-v1",
            },
        },
    },
)
collection = Collection("fire-crawl-demo-v0")

async def add_pipeline():
    await collection.add_pipeline(pipeline)

Here we define the structure of our pipeline. We specify that we’ll be working with markdown content, using a special markdown splitter and the mixedbread-ai/mxbai-embed-large-v1 model for semantic search. This configuration tells Korvus how to process our documents.

We also create a collection named “fire-crawl-demo-v0” and define an asynchronous function to add our pipeline to this collection.

3. Web Scraping with Firecrawl

Now we move on to the process of collecting data from the website:

def crawl():
    crawl_url = "https://postgresml.org/blog"
    params = {
        "crawlerOptions": {
            "excludes": [],
            "includes": ["blog/*"],  # only crawl blog pages
            "limit": 250,  # cap the crawl at 250 pages
        },
        "pageOptions": {"onlyMainContent": True},  # skip navigation and other page chrome
    }
    # Start the crawl without blocking, then poll until it is no longer active
    job = firecrawl.crawl_url(crawl_url, params=params, wait_until_done=False)
    while True:
        print("Scraping...")
        status = firecrawl.check_crawl_status(job["jobId"])
        if status["status"] != "active":
            break
        time.sleep(5)
    return status

This function demonstrates how to use Firecrawl to collect data from a website. We set the target URL (in this case, the PostgresML blog) and define the scraping parameters. We restrict the crawl to blog pages and cap it at 250 pages. The onlyMainContent option ensures that we only get the main content of each page, without navigation elements and other “noise”.

The function initiates a scraping task and then periodically checks its status until the process is complete.
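
If you would rather not manage the polling loop yourself, the same call can be made blocking. The sketch below is an assumption based on the wait_until_done parameter already used above; the exact shape of the returned data may vary between Firecrawl SDK versions:

def crawl_blocking():
    # Same target and options as crawl() above, but let the SDK poll
    # internally and return once the crawl has finished.
    return firecrawl.crawl_url(
        "https://postgresml.org/blog",
        params={
            "crawlerOptions": {"includes": ["blog/*"], "limit": 250},
            "pageOptions": {"onlyMainContent": True},
        },
        wait_until_done=True,
    )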

4. Processing and Indexing Collected Data

After collecting data from the website, we need to process and index it for efficient searching:

async def main():
    await add_pipeline()
    results = crawl()
    documents = [
        {"id": data["metadata"]["sourceURL"], "markdown": data["markdown"]}
        for data in results["data"]
    ]
    await collection.upsert_documents(documents)

This function performs the following steps:

  1. Adds the previously defined pipeline to our collection.
  2. Initiates the website scraping process using the crawl() function.
  3. Forms a list of documents from the collected data, using the source URL as the ID and markdown content as the document text.
  4. Uploads these documents to the collection using the upsert_documents method. During this process, the pipeline automatically splits the markdown into chunks and generates embeddings for each one, storing everything in Postgres (a quick sanity check of the index is sketched below).
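
Before moving on to full RAG, it can be useful to confirm that the documents were indexed as expected. The sketch below is an assumption based on Korvus’s vector_search method, reusing the same query structure that appears inside the rag call in the next section:

async def check_index(query_text):
    # A minimal sanity check: run a plain semantic search against the
    # freshly indexed collection and print whatever comes back.
    results = await collection.vector_search(
        {
            "query": {
                "fields": {
                    "markdown": {
                        "query": query_text,
                        "parameters": {
                            "prompt": "Represent this sentence for searching relevant passages: "
                        },
                    }
                },
            },
            "document": {"keys": ["id"]},
            "limit": 3,
        },
        pipeline,
    )
    print(results)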

5. Performing RAG

Now that our data is indexed, we can perform RAG:

async def do_rag(user_query):
    results = await collection.rag(
        {
            "CONTEXT": {
                # Step 1: vector search over the "markdown" field
                "vector_search": {
                    "query": {
                        "fields": {
                            "markdown": {
                                "query": user_query,
                                "parameters": {
                                    # Query prompt recommended for mxbai-embed-large-v1
                                    "prompt": "Represent this sentence for searching relevant passages: "
                                },
                            }
                        },
                    },
                    "document": {"keys": ["id"]},
                    # Step 2: rerank the top 100 matches with a cross-encoder
                    "rerank": {
                        "model": "mixedbread-ai/mxbai-rerank-base-v1",
                        "query": user_query,
                        "num_documents_to_rerank": 100,
                    },
                    "limit": 5,  # keep the 5 best chunks after reranking
                },
                # Step 3: join the selected chunks into a single context string
                "aggregate": {"join": "\n\n\n"},
            },
            # Step 4: generate the answer with the context substituted into the prompt
            "chat": {
                "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a question answering bot. Answer the user's question concisely based on the context.",
                    },
                    {
                        "role": "user",
                        "content": f"Given the context\n\n:{{CONTEXT}}\n\nAnswer the question: {user_query}",
                    },
                ],
                "max_tokens": 256,
            },
        },
        pipeline,
    )
    return results

This function combines vector search, reranking, and text generation to provide context-aware answers to user queries. It uses the Meta-Llama-3.1-405B-Instruct model for text generation.

Let’s break down this complex query into 4 main steps:

  1. Vector search is performed, finding the 100 best matching fragments for the user’s query.
  2. The vector search results are reranked using the mixedbread-ai/mxbai-rerank-base-v1 cross-encoder, and the results are limited to the top 5 matches.
  3. The reranked results are joined using the \n\n\n separator and substituted in place of the {{CONTEXT}} placeholder in the messages.
  4. Text generation is performed using the meta-llama/Meta-Llama-3.1-405B-Instruct model.

This is a complex query with many parameters that can be tuned. For more detailed information on the rag method, we recommend referring to the Korvus guide on RAG.

6. Interactive Loop for Processing Queries

Finally, we tie everything together using an interactive loop in our main() function:

async def main():
    # ... (previous code for setup and indexing)
    while True:
        user_query = input("\n\nQuery > ")
        if user_query == "q":
            break
        results = await do_rag(user_query)
        print(results)

asyncio.run(main())

This loop allows users to input queries and receive RAG-powered responses based on the indexed content from the PostgresML blog.

Conclusion

In this guide, we’ve demonstrated how to create a powerful RAG system using Firecrawl and Korvus. This is just one example of how simple RAG becomes when it runs in-database, with fewer microservices to manage.

Our solution is faster, more cost-effective, and easier to manage than the common approach to RAG (Vector DB + frameworks + moving your data to the models). But don’t take our word for it. Try out Firecrawl and Korvus on PostgresML, and see the performance benefits for yourself. As always, let us know what you think!
