Simple RAG with LlamaIndex, Qdrant & GPT-4o Mini: A Step-by-Step Guide for Beginners


Retrieval-Augmented Generation (RAG) applications are gaining significant attention in the AI world. But why?

RAG is an intelligent technique that combines two powerful concepts:

  1. The capabilities of Large Language Models (such as GPT-4o or Claude 3.5 Sonnet).
  2. Access to up-to-date, relevant, and domain-specific knowledge (through Vector Databases).

If you’ve used ChatGPT or Claude, you’re likely aware of their major limitations:

  • Knowledge cut-off: Their knowledge is not up-to-date. They don’t have access to the most current information.
  • Lack of domain-specific knowledge: GPT-4 or Claude are generalists. They struggle to provide detailed, accurate answers in specialized domains.
  • Lack of company-specific knowledge: LLMs know nothing about company-specific documents such as employee handbooks, product manuals, or customer support tickets.
  • Hallucinations: LLMs can sometimes generate false or nonsensical information, especially when asked about topics beyond their training data.
  • Difficulty with long context: While you can ask LLMs to answer based on your own sources (such as ebooks or PDFs), they struggle with very long documents, often producing inaccurate or incomplete responses.

How does RAG solve these limitations?

RAG addresses these issues by retrieving up-to-date, domain-specific, or company-specific information and passing it to the LLM in manageable chunks. Technically speaking, RAG combines the strengths of LLMs and Vector Databases.

Here are some common RAG use cases:

  • Educational tools
  • Content generation
  • Employee onboarding
  • Customer support chatbots
  • Internal knowledge management
  • Advanced question-answering systems
  • Internal chatbots (with company-specific knowledge)

RAG Pipeline

The main purpose of RAG is to “feed” LLMs with specific knowledge. Here’s a basic RAG pipeline:

  1. Ingestion: Load knowledge sources (e.g., eBooks, PDFs, text files, spreadsheets, websites), split them into smaller chunks, and use an embedding model to “translate” each chunk into a vector. Store the chunks and their vectors in a Vector Database.
  2. Retrieval: Embed the user’s query with the same model, then pass the embedded query to the Vector Database to return the most relevant chunks.
  3. Synthesis: Pass the query and the retrieved chunks to a Large Language Model and ask it to answer the question based on the provided context.
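
Before we do, here is the whole pipeline condensed into a few lines using LlamaIndex’s high-level, in-memory API (a minimal sketch, assuming OPENAI_API_KEY is set and a ./data folder with documents exists; the rest of this guide replaces the in-memory store with Qdrant):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # 1. Ingestion: load sources
index = VectorStoreIndex.from_documents(documents)       # chunk, embed, store (in memory)
query_engine = index.as_query_engine()                   # 2 & 3. Retrieval + synthesis
print(query_engine.query("What products does the company offer?"))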

Now, let’s build a simple RAG application using LlamaIndex, Qdrant, and GPT-4o Mini!

Part 1: Preparing the Project

Step 1: Set Up the Python Environment

First, create a project folder, set up a Python virtual environment, and create a .env file to store your OpenAI API key:

OPENAI_API_KEY=sk-proj-111111111111111111111111

Replace the placeholder with your actual OpenAI API key.
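
To verify the key loads correctly, you can run a quick sanity check with python-dotenv (installed in the next step):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
print("key loaded:", os.getenv("OPENAI_API_KEY") is not None)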

Step 2: Install Necessary Packages

Install the required packages using pip:

pip install llama-index qdrant-client llama-index-vector-stores-qdrant python-dotenv

Step 3: Run the Local Qdrant Client

Qdrant is a popular open-source vector database. To run a local instance with Docker:

docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

Qdrant will be accessible at:

  • REST API: localhost:6333
  • Web UI: localhost:6333/dashboard
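
To verify the container is up before moving on, you can list the (initially empty) collections with the Python client, a quick connectivity check:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # an empty collection list on a fresh instance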

Part 2: Data Ingestion to Qdrant Vector Database

Step 1: Prepare and Save Data

For this example, we’ll use a markdown document with information about a fictional company. Save it as ./data/fake_company.md in your project directory.
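
If you don’t have content handy, a short hypothetical file like the one below works. Note that the sample query later in this guide asks about a “RoboFlex Series”, so the file should mention it:

# FakeCompany

## Products

### RoboFlex Series
The RoboFlex Series is a line of robotic arms offered in industrial and
domestic variants. Industrial models focus on payload and precision, while
domestic models emphasize safety and ease of use.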

Step 2: Ingest Data into the Qdrant Collection

Create a file named ingest.py with the following content:

import os
from dotenv import load_dotenv
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

load_dotenv()

# Load the markdown file as LlamaIndex documents
md_docs = FlatReader().load_data(Path("./data/fake_company.md"))

# Split the documents into nodes along their markdown structure (headings, sections)
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(md_docs)

# Connect to the local Qdrant instance
client = QdrantClient(url="http://localhost:6333")
collection_name = "FakeCompany"

def create_index(nodes, collection_name):
    # Use the Qdrant collection as the index's storage backend
    vector_store = QdrantVectorStore(collection_name, client=client)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # Building the index embeds each node and writes it to Qdrant
    index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)
    return index

if __name__ == "__main__":
    create_index(nodes, collection_name)

Run the script with:

python ingest.py
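
To confirm the ingestion worked, you can ask Qdrant how many points the collection now holds (a quick sanity check using the same client), or browse the collection in the Web UI dashboard:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.count(collection_name="FakeCompany"))  # number of stored vectors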

Part 3: Retrieval & Response with GPT-4o Mini

Create a file named query.py with the following content:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

from dotenv import load_dotenv

load_dotenv()

# Select GPT-4o Mini model
llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
Settings.llm = llm

# Qdrant parameters
client = QdrantClient(url="http://localhost:6333")
collection_name = "FakeCompany"
vector_store = QdrantVectorStore(collection_name, client=client)

# Load existing index
index = VectorStoreIndex.from_vector_store(vector_store)

# Create retriever and query engine
retriever = VectorIndexRetriever(index=index)
query_engine = RetrieverQueryEngine(retriever=retriever)

if __name__ == "__main__":
    response = query_engine.query(
        "How does the RoboFlex Series cater to both industrial and domestic markets?"
    )
    print(response.response)

Run the script with:

python query.py
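
To see which chunks the retriever actually handed to the model, you can inspect response.source_nodes, a standard attribute of LlamaIndex’s Response object. For example, extend the __main__ block in query.py:

# Print each retrieved chunk with its similarity score
for source in response.source_nodes:
    print(f"score: {source.score}")
    print(source.node.get_content()[:200])  # first 200 characters of the chunk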

Conclusion and Next Steps

Congratulations! You now have a working RAG solution. Here are some ideas for further experimentation:

  1. Experiment with parameters like chunk size or the number of returned documents (top k); see the sketch after this list.
  2. Add a user interface (e.g., a chatbot) to interact with your RAG application.
  3. Use your own data and try different node parsers provided by LlamaIndex.
  4. Try other models or vector databases.
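
For item 1, both knobs live in LlamaIndex: chunking is configured on the node parser during ingestion, and top k on the retriever during querying. Here is a sketch with illustrative values, meant as drop-in replacements for the corresponding lines in ingest.py and query.py (not a standalone script):

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever

# Ingestion side: a size-based splitter instead of MarkdownNodeParser
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Retrieval side: return the 5 most similar chunks
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)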

Remember, the best way to understand RAG is through practice. Use this code as a foundation and start experimenting. Don’t be afraid to modify the code to suit your needs and try new ideas.

Good luck with your RAG experiments!
