AI/ML · RAG · Google Gemini · Python · LangChain

Building a Multi-Source RAG System with Google Gemini

November 15, 2024
3 min read
By Saif Ur Rehman

Retrieval-Augmented Generation (RAG) has become a cornerstone technique for building AI systems that combine the power of large language models with domain-specific knowledge. In this post, I'll walk through the architecture and implementation of a production-ready RAG system using Google Gemini.

Why RAG?

Traditional LLMs have limitations:

  • Knowledge cutoff dates
  • Hallucination on domain-specific knowledge
  • No access to private/proprietary data

RAG solves these by grounding LLM responses in retrieved, relevant documents.

System Architecture

Our RAG system consists of three main components: Document Ingestion, Vector Storage, and Query Processing.

Document Ingestion Pipeline

The first step is loading and chunking documents efficiently:

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_documents(file_path: str):
    # Choose a loader based on the file type.
    if file_path.endswith('.pdf'):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)

    documents = loader.load()
    # Overlapping chunks help preserve context across chunk boundaries.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    return text_splitter.split_documents(documents)

Vector Store with ChromaDB

ChromaDB provides efficient embedding storage and retrieval:

import chromadb

def create_vector_store(chunks):
    # Persist the index to disk so it survives restarts.
    client = chromadb.PersistentClient(path="./chroma_db")

    collection = client.get_or_create_collection(
        name="documents",
        metadata={"hnsw:space": "cosine"}
    )

    # Chroma embeds the documents with its default embedding function
    # when no embeddings are supplied explicitly.
    collection.add(
        documents=[chunk.page_content for chunk in chunks],
        metadatas=[chunk.metadata for chunk in chunks],
        ids=[f"doc_{i}" for i in range(len(chunks))]
    )
    return collection

Query Processing with Gemini

Combining retrieval with generation using Google Gemini:

import google.generativeai as genai

def query_rag_system(query: str, collection):
    # Retrieve the top matching chunks for the query.
    results = collection.query(
        query_texts=[query],
        n_results=5
    )

    # Concatenate the retrieved chunks into a single context block.
    context = "\n\n".join(results['documents'][0])

    # Assumes genai.configure(api_key=...) was called once at startup.
    model = genai.GenerativeModel('gemini-pro')

    prompt = f"""
    Based on the following context, answer the question.

    Context: {context}

    Question: {query}
    Answer:
    """

    return model.generate_content(prompt).text
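
With the three pieces in place, a minimal end-to-end run might look like this. The file path is illustrative, and GOOGLE_API_KEY is assumed to be set in the environment:

import os
import google.generativeai as genai

# Configure Gemini once at startup with an API key from the environment.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

chunks = ingest_documents("./docs/report.pdf")  # illustrative path
collection = create_vector_store(chunks)
print(query_rag_system("What are the report's main findings?", collection))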

Key Optimizations

Semantic Chunking

Instead of fixed-size chunks, use semantic boundaries for better context preservation. This approach splits text at natural paragraph breaks while maintaining a target chunk size.
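
As a rough illustration of the idea (a minimal sketch, not the exact splitter we use in production), the following splits on blank-line paragraph boundaries and greedily packs paragraphs up to a target size:

def semantic_chunk(text: str, target_size: int = 1000):
    # Split at natural paragraph breaks rather than arbitrary character offsets.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the target size.
        if current and len(current) + len(para) > target_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks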

Hybrid Search

Combining semantic and keyword search improves retrieval accuracy by capturing both meaning and specific terms.
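
One way to sketch this is to blend ChromaDB's vector-similarity scores with BM25 keyword scores from the rank_bm25 package. This is an illustration rather than our production ranker, and it assumes the doc_{i} ids assigned in create_vector_store above:

from rank_bm25 import BM25Okapi

def hybrid_search(query: str, collection, chunk_texts, n_results: int = 5, alpha: float = 0.5):
    # Semantic scores: cosine distances from ChromaDB converted to similarities.
    semantic = collection.query(query_texts=[query], n_results=len(chunk_texts))
    sem_scores = {
        doc_id: 1.0 - dist
        for doc_id, dist in zip(semantic["ids"][0], semantic["distances"][0])
    }

    # Keyword scores: BM25 over whitespace-tokenized chunks, normalized to [0, 1].
    bm25 = BM25Okapi([text.split() for text in chunk_texts])
    kw_scores = bm25.get_scores(query.split())
    max_kw = max(kw_scores) if max(kw_scores) > 0 else 1.0

    # Blend both signals; alpha weights semantic relevance against keyword relevance.
    blended = {
        f"doc_{i}": alpha * sem_scores.get(f"doc_{i}", 0.0) + (1 - alpha) * (kw_scores[i] / max_kw)
        for i in range(len(chunk_texts))
    }
    return sorted(blended, key=blended.get, reverse=True)[:n_results]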

Production Considerations

Cost Optimization

  • Cache embeddings to avoid recomputation (see the sketch after this list)
  • Use batch processing for document ingestion
  • Implement rate limiting for API calls
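
As a sketch of the first point, an in-memory embedding cache keyed by a hash of the chunk text might look like this, where embed_fn stands in for whichever embedding call you use:

import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed_fn):
    # Return a cached embedding, computing it with embed_fn only on a cache miss.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]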

Performance Metrics

Our system achieves:

  • Document ingestion: ~50 pages/second
  • Query latency: under 2 seconds end-to-end
  • Concurrent users: 100+ with proper caching

Monitoring

Implement comprehensive monitoring using Prometheus metrics to track query performance, error rates, and resource utilization.
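
A minimal version of this with the prometheus_client library might look like the following; the metric names and port are illustrative:

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; adjust labels to your deployment.
QUERY_LATENCY = Histogram("rag_query_latency_seconds", "End-to-end RAG query latency")
QUERY_ERRORS = Counter("rag_query_errors_total", "Number of failed RAG queries")

def monitored_query(query: str, collection):
    # Record latency for every query and count failures.
    with QUERY_LATENCY.time():
        try:
            return query_rag_system(query, collection)
        except Exception:
            QUERY_ERRORS.inc()
            raise

# Expose metrics for Prometheus to scrape, e.g. on port 8000.
start_http_server(8000)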

Lessons Learned

Chunk size matters: We found 800-1000 tokens optimal for technical documentation.

Metadata is crucial: Always store source information, page numbers, and timestamps for traceability.

Quality over quantity: 5 highly relevant chunks are better than 20 marginally relevant ones.

User feedback loop: Implement thumbs up/down to continuously improve retrieval quality.

Conclusion

Building a production RAG system requires careful attention to document processing, retrieval quality, and LLM integration. Google Gemini's strong reasoning capabilities combined with ChromaDB's efficient vector search make for a powerful combination.

For a complete implementation, check out my Smart Research Assistant project.


Questions or feedback? Feel free to reach out on LinkedIn or GitHub.