Retrieval-Augmented Generation (RAG)

Overview

RAG combines information retrieval with text generation, allowing LLMs to access and use external knowledge sources to produce more accurate, up-to-date, and contextually relevant responses.

Modern RAG Architecture (2024-2025)

1. Core Components

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Document      │    │   Vector Store  │    │   LLM with      │
│   Processing    │───▶│   & Retrieval   │───▶│   Generation    │
│                 │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘

2. Advanced RAG Patterns

  • Hybrid Search: Combining semantic + keyword search
  • Multi-hop RAG: Iterative retrieval for complex queries
  • Agentic RAG: LLM-driven retrieval and reasoning
  • Graph RAG: Knowledge graph enhanced retrieval

Document Processing Pipeline

1. Document Loading

# Modern document loaders
from langchain_community.document_loaders import (
    PyPDFLoader, 
    UnstructuredFileLoader,
    WebBaseLoader,
    DirectoryLoader
)

# Popular formats supported:
# - PDF, DOCX, PPTX
# - HTML, Markdown
# - CSV, JSON
# - Audio/Video (via transcription)
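
A minimal usage sketch, assuming a local ./docs folder of PDFs (the path is illustrative):

# Sketch: load every PDF under ./docs
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")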

2. Text Splitting Strategies

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter
)
from langchain_experimental.text_splitter import SemanticChunker

# Modern chunking approaches:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
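
MarkdownHeaderTextSplitter (imported above) is useful when the corpus is markdown; a brief sketch of header-aware splitting, where markdown_text stands in for your document:

# Sketch: split on markdown headers before any character-level splitting
headers_to_split_on = [("#", "h1"), ("##", "h2")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = md_splitter.split_text(markdown_text)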

3. Embedding Generation

# Modern embedding models
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# High-performance options:
# - text-embedding-3-large (OpenAI)
# - BGE-large-en-v1.5
# - E5-large-v2
# - Instructor-XL

embedder = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = embedder.encode(chunks)
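
To show how these embeddings are used downstream, a short sketch ranking the chunks above against a query:

import numpy as np

# Sketch: rank chunks by cosine similarity to a query
query_emb = embedder.encode("What is RAG?", normalize_embeddings=True)
chunk_embs = embedder.encode(chunks, normalize_embeddings=True)
scores = chunk_embs @ query_emb          # cosine similarity on normalized vectors
top3 = np.argsort(scores)[::-1][:3]      # indices of the three best chunks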

Vector Databases & Retrieval

1. Modern Vector Databases

  • Pinecone: Managed vector database
  • Weaviate: Open-source with hybrid search
  • Chroma: Lightweight, embeddable; good for local prototyping (see the sketch after this list)
  • Qdrant: High-performance, Rust-based
  • Milvus: Scalable for large datasets
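
A minimal end-to-end sketch using Chroma's native client (collection name and texts are illustrative):

import chromadb

# Sketch: in-process client; use chromadb.PersistentClient for disk storage
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(documents=["chunk one", "chunk two"], ids=["1", "2"])
hits = collection.query(query_texts=["example query"], n_results=2)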

2. Retrieval Strategies

# Hybrid search example (sketch: keyword_store is any BM25-style index
# whose results expose page_content, mirroring LangChain documents)
def hybrid_retrieval(query, vector_store, keyword_store, alpha=0.5, top_k=5):
    semantic_results = vector_store.similarity_search(query, k=10)
    keyword_results = keyword_store.search(query, k=10)

    # Weighted reciprocal-rank fusion: alpha balances semantic vs. keyword
    scores = {}
    for rank, doc in enumerate(semantic_results):
        scores[doc.page_content] = scores.get(doc.page_content, 0.0) + alpha / (rank + 1)
    for rank, doc in enumerate(keyword_results):
        scores[doc.page_content] = scores.get(doc.page_content, 0.0) + (1 - alpha) / (rank + 1)

    # Return the top_k fused results, best first
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

3. Advanced Retrieval Techniques

  • Re-ranking: Cross-encoders for better ranking (sketch after this list)
  • Query Expansion: Generate multiple query variations
  • HyDE: Hypothetical Document Embeddings
  • ColBERT: Contextualized late interaction
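
A hedged re-ranking sketch with sentence-transformers (query and candidate_docs are assumed to come from a first-stage retriever; the model name is one common choice):

from sentence_transformers import CrossEncoder

# Sketch: score (query, document) pairs jointly, then sort by score
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in candidate_docs]
scores = reranker.predict(pairs)
reranked = [doc for _, doc in sorted(zip(scores, candidate_docs),
                                     key=lambda p: p[0], reverse=True)]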

Generation & Integration

1. Prompt Engineering for RAG

# Modern RAG prompt template
RAG_PROMPT_TEMPLATE = """
You are a helpful assistant. Use the following context to answer the question.

Context:
{context}

Question: {question}

Answer the question based on the context above. If the context doesn't contain relevant information, say so and provide a general answer if possible.
"""

2. LLM Integration

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Modern LLM choices for RAG:
llm = ChatOpenAI(
    model="gpt-4-turbo-preview",
    temperature=0.1,
    max_tokens=2000
)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)
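
Querying the chain (invoke is the current entry point; run is deprecated):

# Sketch: ask a question and print the answer
answer = rag_chain.invoke({"query": "What are the key findings?"})
print(answer["result"])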

Evaluation & Monitoring

1. RAG Evaluation Metrics

  • Retrieval Metrics:
    • Hit Rate, MRR (Mean Reciprocal Rank; sketch after this list)
    • NDCG (Normalized Discounted Cumulative Gain)
  • Generation Metrics:
    • Faithfulness, Answer Relevance
    • Context Utilization
  • End-to-End Metrics:
    • User satisfaction, Task completion rate
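
To make one of these concrete, a small sketch computing MRR, assuming exactly one relevant document per query:

# Sketch: Mean Reciprocal Rank over a batch of queries
def mean_reciprocal_rank(rankings, relevant_ids):
    # rankings: one ranked list of doc ids per query
    # relevant_ids: the single relevant doc id for each query
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_ids):
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(rankings)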

2. Evaluation Frameworks

# RAGAS for automated evaluation
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)

results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy]
)

Advanced RAG Techniques

1. Query Optimization

  • Query Understanding: Intent classification
  • Query Reformulation: LLM-based query rewriting
  • Multi-query Generation: Create multiple search queries (sketch below)
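
A hedged multi-query sketch reusing the ChatOpenAI llm from earlier (the prompt wording is illustrative):

# Sketch: ask the LLM for diverse rewrites, one per line
MULTI_QUERY_PROMPT = (
    "Rewrite the following question as three diverse search queries, "
    "one per line:\n\n{question}"
)
rewrites = llm.invoke(MULTI_QUERY_PROMPT.format(question="How does RAG reduce hallucinations?"))
queries = [q.strip() for q in rewrites.content.splitlines() if q.strip()]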

2. Context Management

  • Context Compression: Summarize retrieved documents
  • Context Filtering: Remove irrelevant information
  • Context Reordering: Prioritize most relevant chunks (sketch below)
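
For reordering, LangChain ships a LongContextReorder transformer that counters the "lost in the middle" effect; a brief sketch (retrieved_docs assumed):

from langchain_community.document_transformers import LongContextReorder

# Sketch: move the most relevant documents to the start and end of the context
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(retrieved_docs)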

3. Multi-modal RAG

  • Image RAG: CLIP embeddings for visual search (sketch after this list)
  • Audio RAG: Speech-to-text + text RAG
  • Video RAG: Frame extraction + multimodal understanding
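
A minimal CLIP sketch via sentence-transformers (image path and caption are illustrative):

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Sketch: embed an image and a text query into the same vector space
clip = SentenceTransformer("clip-ViT-B-32")
img_emb = clip.encode(Image.open("product.jpg"))
txt_emb = clip.encode("a red running shoe")
score = util.cos_sim(img_emb, txt_emb)   # cross-modal similarity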

Modern RAG Frameworks & Tools

1. Open Source Frameworks

  • LangChain/LangGraph: Comprehensive RAG framework
  • LlamaIndex: Optimized for LLM data applications
  • Haystack: End-to-end NLP framework
  • DSPy: Programming model for LM applications

2. Commercial Platforms

  • OpenAI Assistants: Built-in RAG capabilities
  • Google Vertex AI Search: Enterprise RAG solution
  • Azure AI Search: Microsoft’s cognitive search
  • AWS Kendra: Intelligent document search

Implementation Best Practices

1. Data Quality

  • Clean and preprocess documents thoroughly
  • Implement proper chunking strategies
  • Handle different document types appropriately

2. Retrieval Optimization

  • Use hybrid search for better recall
  • Implement re-ranking for precision
  • Tune retrieval parameters (k, similarity threshold)

3. Generation Quality

  • Design effective prompt templates
  • Implement context window management
  • Handle cases where context is insufficient

4. Performance & Scalability

  • Use efficient embedding models
  • Implement caching for frequent queries (sketch below)
  • Monitor and optimize latency
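
A minimal in-process caching sketch for repeated query embeddings (embedder as defined earlier):

from functools import lru_cache

# Sketch: memoize query embeddings; tuple() makes the result hashable
@lru_cache(maxsize=1024)
def embed_query(text: str):
    return tuple(embedder.encode(text))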

Common Challenges & Solutions

1. Retrieval Failures

  • Problem: Relevant documents not retrieved
  • Solution: Hybrid search, query expansion, fine-tuned embeddings

2. Context Overload

  • Problem: Too much irrelevant information
  • Solution: Better chunking, context compression, re-ranking

3. Hallucinations

  • Problem: LLM generates information not in context
  • Solution: Better prompting, context verification, self-checking

4. Scalability Issues

  • Problem: Slow performance with large document sets
  • Solution: Efficient vector databases, indexing strategies, caching

Real-world Applications

1. Enterprise Knowledge Bases

  • Internal documentation search
  • Customer support automation
  • Employee onboarding

2. Research & Education

  • Academic paper analysis
  • Educational content generation
  • Research assistance

3. E-commerce & Customer Service

  • Product information retrieval
  • Customer query resolution
  • Personalized recommendations

4. Healthcare & Legal

  • Medical literature search
  • Legal document analysis
  • Compliance checking

Future Directions

1. Agentic RAG

  • Autonomous research and synthesis
  • Multi-step reasoning with retrieval
  • Self-correcting systems

2. Multimodal Expansion

  • Integration of images, audio, video
  • Cross-modal retrieval and generation
  • Unified multimodal understanding

3. Real-time RAG

  • Streaming data integration
  • Live knowledge updates
  • Dynamic context adaptation

4. Federated RAG

  • Privacy-preserving knowledge access
  • Cross-organizational knowledge sharing
  • Secure information retrieval

Code Example: Complete RAG Pipeline

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI

# 1. Load documents
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# 2. Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5"
)
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create RAG chain
llm = OpenAI(temperature=0.1)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3})
)

# 5. Query the system
response = qa_chain.invoke({"query": "What is the main topic of this document?"})
print(response["result"])

Resources

  • Frameworks: LangChain, LlamaIndex, Haystack
  • Vector Databases: Pinecone, Weaviate, Chroma, Qdrant
  • Embedding Models: OpenAI, Sentence Transformers, BGE
  • Evaluation: RAGAS, TruLens, LangSmith
  • Papers: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”