Core Foundations
What is RAG and Why It Exists
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with external knowledge sources during inference. This addresses critical limitations of plain LLMs:
Limitations of Plain LLMs
- Hallucinations: LLMs can generate plausible-sounding but factually incorrect information when they lack specific knowledge
- Stale Knowledge: Training data has a cutoff date, making LLMs unaware of recent events or updates
- No Source Attribution: Cannot cite where information came from, making verification difficult
- Limited Context: Knowledge is frozen at training time, cannot access real-time or proprietary data
- Cost of Retraining: Updating knowledge requires expensive full model retraining
RAG solves these problems by:
- Retrieving relevant documents from an external knowledge base
- Augmenting the prompt with retrieved context
- Generating answers grounded in the provided context
- Enabling source attribution and citation
- Allowing real-time knowledge updates without model retraining
Basic RAG Loop
The fundamental RAG process follows this flow:
Query
User submits a question or request
Retrieve
Search knowledge base for relevant documents
Augment
Combine query with retrieved context in prompt
Generate
LLM produces answer using augmented context
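The four steps above can be sketched end to end in a few lines. This is a toy illustration, not a real system: `retrieve` scores documents by naive keyword overlap instead of embeddings, and `generate` is a hypothetical stand-in for an actual LLM API call.

```python
# Minimal sketch of the Query -> Retrieve -> Augment -> Generate loop.

def retrieve(query, corpus, k=2):
    """Score documents by naive keyword overlap and return the top-k."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query, docs):
    """Pack retrieved documents and the query into one prompt."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    # Placeholder: a real system would send `prompt` to an LLM here.
    return f"[LLM answer grounded in {prompt.count('- ')} retrieved docs]"

corpus = [
    "RAG retrieves documents before generation.",
    "Embeddings map text to vectors.",
    "Bananas are yellow.",
]
query = "How does RAG use retrieved documents?"
docs = retrieve(query, corpus)
answer = generate(augment(query, docs))
```

In a production pipeline, `retrieve` would be a vector search and `generate` a model call, but the data flow is the same.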
RAG vs Fine-tuning vs Tool Calling
- RAG: injects external knowledge at inference time; best for freshness, attribution, and proprietary data
- Fine-tuning: bakes behavior, style, or format into model weights; a poor fit for fast-changing facts
- Tool calling: lets the model fetch live data or take actions (search, calculators, APIs) at runtime; complements RAG rather than replacing it
Prerequisites: Embeddings and Vector Similarity
Embeddings are dense vector representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search.
Vector Similarity Metrics
Cosine Similarity: cos(θ) = (A · B) / (||A|| × ||B||)
- Range: -1 to 1 (in practice most text-embedding pairs score between 0 and 1)
- Measures angle between vectors
- Best for normalized embeddings
- Ignores vector magnitude
Dot Product: A · B = Σ(Ai × Bi)
- Range: -∞ to +∞
- Considers both direction and magnitude
- Faster computation
- Requires same embedding space
Euclidean Distance: d = √(Σ(Ai - Bi)²)
- Range: 0 to +∞
- Measures straight-line distance
- Lower = more similar
- Can be converted to similarity: 1 / (1 + distance)
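The three metrics above can be implemented directly from their formulas; a small stdlib-only sketch:

```python
import math

def cosine(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||): angle only, magnitude ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    """A . B: considers both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_similarity(d):
    """The 1 / (1 + distance) conversion mentioned above."""
    return 1.0 / (1.0 + d)
```

Note that for unit-normalized vectors the dot product and cosine similarity coincide, which is why many vector stores normalize embeddings and use the cheaper dot product.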
Sparse vs Dense Retrieval
Sparse Retrieval (Keyword-based)
- Examples: BM25, TF-IDF, keyword matching
- How it works: Creates sparse vectors based on term frequency
- Pros: Fast, interpretable, good for exact matches
- Cons: Misses synonyms, no semantic understanding, vocabulary mismatch
Dense Retrieval (Semantic)
- Examples: Sentence-BERT, OpenAI embeddings, Cohere embeddings
- How it works: Neural networks create dense vectors capturing meaning
- Pros: Semantic understanding, handles synonyms, cross-lingual
- Cons: Computationally expensive, needs a trained embedding model, less interpretable
Basic NLP Concepts
- Tokenization: Breaking text into tokens (words, subwords, or characters) that models can process
- Transformers: Neural architecture using self-attention to understand relationships between tokens
- Attention: Mechanism allowing models to focus on relevant parts of input when generating output
- LLM Inference: Process of generating text from a trained model, involving forward passes through neural networks
Data & Knowledge Ingestion
Knowledge Sources and Formats
RAG systems can ingest knowledge from diverse sources:
📄 Documents
- PDFs (research papers, manuals, reports)
- Word documents (.docx, .doc)
- Markdown files
- Plain text files
🌐 Web Content
- HTML pages
- Wikis (Confluence, MediaWiki)
- Blog posts
- Documentation sites
💬 Communication
- Support tickets
- Email threads
- Slack/Teams messages
- Chat logs
🗄️ Databases
- SQL databases
- NoSQL stores
- Data warehouses
- Knowledge bases
🔌 APIs
- REST APIs
- GraphQL endpoints
- Webhooks
- Real-time streams
📊 Logs & Events
- Application logs
- Event streams
- Monitoring data
- Audit trails
Data Structure Types
Structured Data
Well-defined schema (tables, JSON with fixed structure). Examples: databases, CSV files, API responses with schemas.
Unstructured Data
No predefined format (free text, images, audio). Examples: documents, emails, social media posts.
Semi-structured Data
Some structure but flexible schema (JSON, XML, HTML). Examples: web pages, API responses, configuration files.
Ingestion Pipelines
Batch vs Streaming
- Batch: Periodic bulk ingestion (nightly/weekly); simpler to operate, but the index lags the sources
- Streaming: Continuous ingestion of changes (webhooks, CDC, event streams); near-real-time freshness at the cost of more infrastructure
ETL/ELT Patterns for RAG
Extract
Pull data from sources (APIs, databases, files)
Transform
Clean, chunk, embed, add metadata
Load
Store in vector database with metadata
Connectors: Pre-built integrations for common sources (S3, SharePoint, Confluence, Slack, etc.) that handle authentication, pagination, and format conversion.
Document Modeling
Chunking Strategies
Fixed Window Chunking
Split documents into fixed-size chunks (e.g., 512 tokens).
- Pros: Simple, predictable, easy to implement
- Cons: May break sentences/paragraphs, loses context
- Use when: Uniform document structure, simple use cases
Sliding Window Chunking
Fixed-size chunks with overlap between adjacent chunks.
- Pros: Preserves context at boundaries, reduces information loss
- Cons: More storage, potential redundancy
- Use when: Context continuity is important
- Overlap: Typically 10-20% of chunk size (e.g., 100 tokens overlap for 512-token chunks)
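Sliding-window chunking is short enough to show in full. This sketch operates on a pre-tokenized list (token IDs or words); plugging in a real tokenizer is left out for brevity:

```python
def chunk_sliding(tokens, size=512, overlap=100):
    """Fixed-size chunks where adjacent chunks share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                  # last window already covers the tail
    return chunks

tokens = list(range(10))           # stand-in for real token IDs
chunks = chunk_sliding(tokens, size=4, overlap=1)
# Each chunk's first token repeats the previous chunk's last token.
```

Setting `overlap=0` recovers plain fixed-window chunking.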
Semantic Chunking
Split based on semantic boundaries (sentences, paragraphs, sections).
- Pros: Preserves meaning, better retrieval quality
- Cons: Variable chunk sizes, more complex
- Use when: Quality is critical, documents have clear structure
- Methods: Sentence transformers, topic modeling, section detection
Chunk Size and Overlap Trade-offs
- Smaller chunks (128-256 tokens): More precise retrieval, better for specific facts, but may miss context
- Medium chunks (512-1024 tokens): Balance between precision and context, most common choice
- Larger chunks (2048+ tokens): More context, better for complex reasoning, but less precise retrieval
- Overlap considerations: Higher overlap improves context continuity but increases storage and processing costs
Metadata Design
{
  "source": "document_url_or_id",
  "timestamp": "2024-01-15T10:30:00Z",
  "doc_type": "pdf|html|ticket|api",
  "language": "en|es|fr|...",
  "access_control": {
    "roles": ["admin", "engineer"],
    "departments": ["engineering"],
    "tenants": ["company_a"]
  },
  "quality_score": 0.95,
  "chunk_index": 3,
  "total_chunks": 15,
  "section": "introduction",
  "author": "john.doe@company.com",
  "last_updated": "2024-01-10T08:00:00Z"
}
Key Metadata Fields:
- Source: Origin of the document for citation and traceability
- Timestamp: When document was created/updated for freshness tracking
- Access Control Tags: For row-level security and multi-tenancy
- Doc Type: Enables type-specific processing and filtering
- Language: For multilingual systems and language-specific retrieval
- Quality Scores: Confidence metrics for ranking and filtering
Embedding Pipelines
Embedding Model Choice
General-Purpose Models
- OpenAI text-embedding-ada-002: 1536 dimensions, solid general performance (superseded by the v3 models)
- OpenAI text-embedding-3-small/large: Latest models with better performance
- Cohere embed-english-v3.0: 1024 dimensions, strong semantic understanding
- Sentence-BERT (all-MiniLM-L6-v2): 384 dimensions, fast and efficient
- Use when: General knowledge, diverse domains, standard use cases
Domain-Specific Models
- BioBERT: Biomedical domain
- Legal-BERT: Legal documents
- SciBERT: Scientific papers
- CodeBERT: Programming code
- Use when: Specialized domain, technical jargon, domain-specific terminology
Multilingual Models
- multilingual-e5-base/large: Supports 100+ languages
- paraphrase-multilingual-MiniLM: Cross-lingual understanding
- Use when: International content, cross-lingual search, global knowledge bases
Offline vs On-the-Fly Embedding
- Offline: Embed documents ahead of time during ingestion; fast queries, but the index must be refreshed when content changes
- On-the-fly: Embed at request time (queries always; documents only in small-scale or highly dynamic setups); always current, but adds latency and cost per request
Embedding Versioning
When updating embedding models or document schemas, maintain version tracking:
- Model Version: Track which embedding model was used (e.g., "text-embedding-ada-002-v1")
- Schema Version: Track metadata schema changes
- Migration Strategy: Gradual migration, dual indexing, or full re-embedding
- Backward Compatibility: Support queries against old embeddings during transition
Retrieval & RAG Architectures
Retrieval Building Blocks
Vector Stores
FAISS (Facebook AI Similarity Search)
- Open-source library by Meta
- In-memory or disk-based
- Supports GPU acceleration
- Best for: Research, prototyping, self-hosted
Pinecone
- Managed vector database service
- Auto-scaling, high availability
- Metadata filtering
- Best for: Production, managed infrastructure
Weaviate
- Open-source vector database
- GraphQL API
- Built-in ML models
- Best for: Complex queries, graph + vector
pgvector
- PostgreSQL extension
- SQL + vector search
- ACID transactions
- Best for: Existing PostgreSQL infrastructure
Chroma
- Embedding database
- Simple Python API
- Lightweight
- Best for: Development, small-scale
Qdrant
- Vector similarity search engine
- REST API
- Payload filtering
- Best for: Production, high performance
Retrieval Techniques
Top-K Retrieval
Retrieve the K most similar documents based on similarity score. Common values: K=5 to K=20.
- Higher K: More context, but may include irrelevant docs
- Lower K: More focused, but may miss relevant information
Score Thresholds
Filter results below a similarity threshold (e.g., only return docs with similarity > 0.7).
- Prevents low-quality retrievals
- Adaptive thresholds based on query type
- Can be combined with Top-K
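Combining the two techniques is a one-liner in practice: filter by threshold first, then keep the K best survivors. A minimal sketch, assuming scores are already computed:

```python
def top_k_above_threshold(scored_docs, k=5, threshold=0.7):
    """scored_docs: list of (doc_id, similarity) pairs.
    Drop anything at or below the threshold, then return the k best."""
    kept = [(d, s) for d, s in scored_docs if s > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

results = top_k_above_threshold(
    [("a", 0.91), ("b", 0.65), ("c", 0.80), ("d", 0.72)],
    k=2, threshold=0.7,
)
# -> [("a", 0.91), ("c", 0.80)]  ("b" fails the threshold, "d" loses on K)
```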
MMR (Maximal Marginal Relevance)
Diversity-focused retrieval that balances relevance and diversity.
- Reduces redundancy in results
- Formula: MMR = argmax over unselected docs d of [λ × Sim(d, query) - (1 - λ) × max Sim(d, already-selected docs)]
- λ parameter controls relevance vs diversity trade-off
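The greedy MMR selection reads almost directly off the formula. In this sketch, similarity values are supplied as plain dictionaries (a real system would compute them from embeddings):

```python
def mmr(query_sim, doc_sims, k=3, lam=0.7):
    """query_sim: {doc: sim(doc, query)}.
    doc_sims: {frozenset((doc_i, doc_j)): sim} for symmetric doc-doc pairs.
    Greedily picks k docs, trading relevance against redundancy."""
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            # Redundancy = similarity to the most similar already-picked doc.
            redundancy = max(
                (doc_sims.get(frozenset((d, s)), 0.0) for s in selected),
                default=0.0,
            )
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# "a" and "b" are near-duplicates; with lam=0.5 MMR prefers the diverse "c".
picked = mmr(
    {"a": 0.9, "b": 0.85, "c": 0.5},
    {frozenset(("a", "b")): 0.95,
     frozenset(("a", "c")): 0.1,
     frozenset(("b", "c")): 0.2},
    k=2, lam=0.5,
)
```

With `lam=1.0` the same call degenerates to plain top-K by relevance.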
Metadata Filtering
Filter results based on metadata before or after vector search.
- Pre-filter: Filter before search (faster, but may miss results)
- Post-filter: Filter after search (slower, but more accurate)
- Examples: date ranges, document types, access permissions
Retrieval Variants
Sparse Retrieval (BM25)
Best Matching 25 (BM25) is a probabilistic ranking function for information retrieval.
- Extends TF-IDF with term-frequency saturation and document-length normalization
- Handles exact keyword matches well
- Fast and interpretable
- Use when: Keyword-heavy queries, exact matches needed, interpretability required
Dense Retrieval
Uses embedding vectors for semantic similarity search.
- Captures semantic meaning, not just keywords
- Handles synonyms and paraphrasing
- Requires embedding model
- Use when: Semantic understanding needed, natural language queries, cross-lingual search
Hybrid Search
Combines sparse and dense retrieval for best of both worlds.
- Reciprocal Rank Fusion (RRF): Combines rankings from both methods
- Weighted Combination: Weighted sum of scores (e.g., 0.3 × BM25 + 0.7 × Dense)
- Use when: Need both keyword and semantic matching, production systems
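Reciprocal Rank Fusion needs only the two ranked lists, not their raw scores, which is why it is robust to BM25 and dense scores living on different scales. A minimal sketch with the standard k = 60 constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first).
    RRF score of a doc = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d2"]    # keyword-based ranking
dense_ranking = ["d2", "d1", "d4"]   # embedding-based ranking
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# "d1" and "d2" appear high in both lists, so they rise to the top.
```

The constant k damps the influence of any single top position; larger k flattens the contribution of rank differences.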
Reranking
Cross-Encoder Reranking
Uses cross-encoder models (BERT-based) to score query-document pairs more accurately.
- More accurate than bi-encoder (embedding) models
- Slower (can't pre-compute embeddings)
- Typically rerank top 20-100 candidates
- Models: cross-encoder/ms-marco-MiniLM, bge-reranker
LLM Judge Reranking
Uses LLM to evaluate and rank retrieved documents.
- Most flexible and context-aware
- Can consider complex relevance criteria
- Expensive and slow
- Use for critical applications or complex queries
Vanilla RAG Pipeline in Detail
Query Understanding
Parse and normalize user query, extract intent, identify entities
Query Embedding
Convert query to embedding vector using same model as documents
Vector Search
Find top-K similar documents in vector store
Rerank
Optionally rerank results using cross-encoder or LLM judge
Context Packing
Combine retrieved documents into context, respect token limits
Prompt Construction
Build prompt with system instructions, context, and user query
Generation
LLM generates answer based on augmented prompt
Single-Shot QA vs Conversation
- Single-shot QA: Each query is independent; retrieve once against the raw query
- Conversation: Queries depend on chat history; rewrite the query using prior turns (e.g., resolve "it" or "that version") before retrieval, and keep history within the context budget
RAG Architectural Patterns
Single-Index Architecture
All documents in one vector index.
- Pros: Simple, single query, easy to manage
- Cons: No domain separation, harder to scale
- Use when: Small-scale, single domain, homogeneous content
Multi-Index Architecture
Separate indexes for different domains or document types.
- Pros: Domain-specific optimization, better organization, parallel queries
- Cons: More complex, need to route queries, multiple indexes to maintain
- Use when: Multiple domains, different document types, specialized retrieval needs
Multi-Store Architecture
Combines vector store with other data stores (SQL, graph, etc.).
- Pros: Best tool for each data type, flexible queries
- Cons: Complex integration, query coordination needed
- Use when: Mixed data types, structured + unstructured, complex queries
Multi-Tenancy Patterns
Per-Tenant Index
Separate vector index for each tenant.
- Pros: Complete isolation, tenant-specific optimization
- Cons: Higher cost, more indexes to manage
Shared Index with Row-Level Filters
Single index with metadata-based filtering for tenant isolation.
- Pros: Cost-effective, easier management, cross-tenant analytics possible
- Cons: Requires careful access control, potential for data leakage if misconfigured
- Implementation: Filter by tenant_id in metadata before/after retrieval
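The shared-index pattern hinges entirely on that filter, so it is worth applying both before retrieval and again on the retrieved results as defence in depth. A sketch using the `tenants` field from the metadata schema shown earlier:

```python
def tenant_filter(chunks, tenant_id):
    """Keep only chunks tagged for the caller's tenant.
    Apply pre-retrieval where the store supports metadata filters,
    and re-apply post-retrieval before anything reaches the prompt."""
    return [
        c for c in chunks
        if tenant_id in c["metadata"].get("tenants", [])
    ]

chunks = [
    {"text": "pricing for A", "metadata": {"tenants": ["company_a"]}},
    {"text": "pricing for B", "metadata": {"tenants": ["company_b"]}},
    {"text": "shared FAQ",   "metadata": {"tenants": ["company_a", "company_b"]}},
]
visible = tenant_filter(chunks, "company_a")
# company_a sees its own chunk plus the shared one, never company_b's.
```

Note the `.get(..., [])` default: an untagged chunk is denied to everyone, which fails closed rather than open.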
Advanced Retrieval & Knowledge Workflows
Advanced RAG Techniques
Multi-Hop / Multi-Step Retrieval
Iterative retrieval where each step uses information from previous retrievals.
- Step 1: Initial query retrieves relevant documents
- Step 2: Extract entities/concepts from Step 1 results
- Step 3: Query for documents related to extracted entities
- Step 4: Combine results from all steps
- Use when: Complex queries requiring multiple pieces of information, research tasks
Recursive Retrieval
Retrieves parent documents when child chunks are relevant.
- If a chunk is retrieved, also retrieve its parent document/section
- Provides broader context around relevant chunks
- Useful for hierarchical document structures
Query Planning
LLM generates a plan for breaking down complex queries into sub-queries.
- Analyze query complexity
- Generate sub-questions
- Execute sub-queries in sequence or parallel
- Synthesize final answer from sub-query results
Query Rewriting and Decomposition
Sub-Question Decomposition
Break complex questions into simpler sub-questions.
- Example: "What are the benefits and drawbacks of RAG?" → ["What are benefits of RAG?", "What are drawbacks of RAG?"]
- Retrieve for each sub-question
- Combine answers
Self-Ask Pattern
LLM asks itself follow-up questions to gather needed information.
- Model generates its own questions
- Retrieves answers to self-generated questions
- Uses answers to respond to original query
Self-RAG
Model critiques its own retrieval and generation, triggering re-retrieval if needed.
- Retrieve: Initial retrieval based on query
- Critique: Model evaluates if retrieved context is sufficient
- Re-retrieve: If insufficient, refine query and retrieve again
- Generate: Generate answer using retrieved context
- Self-reflect: Evaluate answer quality, decide if re-generation needed
Graph and Structured RAG
GraphRAG Concepts
Knowledge Graph Construction
Extract entities and relationships from documents to build a knowledge graph.
- Entity Extraction: Identify people, organizations, concepts, etc.
- Relationship Extraction: Identify connections between entities
- Graph Storage: Store in graph database (Neo4j, Amazon Neptune)
- Graph Embeddings: Create embeddings for nodes and edges
Graph Search + Embeddings
Combine graph traversal with vector similarity search.
- Graph Traversal: Follow relationships to find connected entities
- Vector Search: Find semantically similar nodes
- Hybrid: Use graph structure to filter/rank vector results
- Use when: Relationships matter, entity-centric queries, complex knowledge domains
Structured Data Integration
Vector + SQL/OLAP
Combine unstructured vector search with structured database queries.
- Vector search for semantic content
- SQL queries for structured data (numbers, dates, categories)
- Merge results in final answer
- Example: "Find documents about Q4 sales (vector) where revenue > $1M (SQL)"
Vector + API Tools
Combine retrieved context with real-time API data.
- Retrieve relevant documents
- Call APIs for real-time data (prices, weather, stock info)
- LLM synthesizes both sources
- Enables dynamic, up-to-date answers
Workflow / Agent Patterns with RAG
Agentic Flows
Multi-step workflows where agents use tools including RAG retrieval.
Search Tool
Agent uses RAG to search knowledge base
Summarizer
Summarize retrieved documents
Writer
Generate final response based on summary
Validator
Verify answer quality and fact-check
Task-Oriented Workflows
Summarization
- Retrieve relevant documents
- Extract key points
- Generate concise summary
- Use: Meeting notes, research papers, long documents
Comparison
- Retrieve documents about each item
- Extract features/attributes
- Compare side-by-side
- Use: Product comparison, policy analysis
Decision Support
- Retrieve relevant policies/guidelines
- Analyze current situation
- Recommend actions
- Use: Compliance, risk assessment
Report Generation
- Retrieve data from multiple sources
- Synthesize information
- Generate structured report
- Use: Status reports, analysis reports
Enterprise Knowledge Management Workflows
Use Cases
Internal KB Search
- Company documentation search
- Employee self-service
- Knowledge discovery
- Onboarding assistance
Policy Assistants
- HR policy queries
- Compliance checking
- Regulatory guidance
- Procedure lookup
Project Knowledge Discovery
- Find related projects
- Learn from past projects
- Identify experts
- Best practices discovery
Governance Workflows
Content Lifecycle
Ingestion → Processing → Indexing → Serving → Archival
Curation
Review, tag, categorize, quality check
Approval
Subject matter expert review, approval workflow
Archival
Version control, deprecation, removal
Quality, Evaluation, and Production
Retrieval Evaluation
Ground-Truth Datasets
Create evaluation datasets with query-answer pairs and relevant document IDs.
- QA Datasets: Question-answer pairs with source documents
- Relevance Labels: Human-annotated relevance scores for query-document pairs
- Benchmark Datasets: MS MARCO, Natural Questions, SQuAD, BEIR
Retrieval Metrics
Hit Rate
Percentage of queries where at least one relevant document is retrieved in top-K.
Formula: (Queries with at least 1 relevant doc) / (Total queries)
Recall@K
Percentage of relevant documents retrieved in top-K results.
Formula: (Relevant docs retrieved) / (Total relevant docs)
NDCG (Normalized Discounted Cumulative Gain)
Measures ranking quality, giving higher weight to top positions.
- Accounts for position of relevant documents
- Range: 0 to 1 (higher is better)
- Best for: Ranking evaluation
MRR (Mean Reciprocal Rank)
Average of reciprocal ranks of first relevant document.
Formula: (1 / rank of first relevant) averaged across queries
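Hit Rate, Recall@K, and MRR all follow directly from their formulas once you have, for each query, the retrieved list and the set of relevant document IDs:

```python
def hit_rate(results, relevant):
    """results/relevant: per-query lists of retrieved / relevant doc ids."""
    hits = sum(1 for ret, rel in zip(results, relevant) if set(ret) & set(rel))
    return hits / len(results)

def recall_at_k(retrieved, relevant, k):
    """Fraction of this query's relevant docs found in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(results, relevant):
    """Mean of 1/rank of the first relevant doc (0 if none retrieved)."""
    total = 0.0
    for ret, rel in zip(results, relevant):
        for rank, doc in enumerate(ret, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)

results  = [["d1", "d2"], ["d9", "d3"], ["d7", "d8"]]
relevant = [["d1"],       ["d3"],       ["d4"]]
# Queries 1 and 2 hit (ranks 1 and 2); query 3 misses entirely.
```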
Per-Component Evaluation
- Retriever: Evaluate retrieval quality independently (Hit Rate, Recall@K)
- Reranker: Evaluate ranking improvement (NDCG improvement)
- Generator: Evaluate answer quality given perfect retrieval (faithfulness, relevance)
Generation Evaluation
Answer Quality Metrics
Faithfulness
Answer is grounded in retrieved context, no hallucinations.
- Check if claims are supported by context
- Detect contradictions
- Methods: NLI models, LLM-as-judge, fact-checking
Relevance
Answer addresses the query.
- Does answer match query intent?
- Is information complete?
- Methods: Semantic similarity, LLM-as-judge
Completeness
Answer covers all aspects of the query.
- No missing information
- All sub-questions addressed
- Methods: Coverage analysis, LLM-as-judge
Style
Answer matches desired tone and format.
- Formal vs casual
- Technical vs simple
- Length and structure
Evaluation Methods
- Human evaluation: Annotators rate answers; highest quality, slowest and most expensive
- LLM-as-judge: A strong LLM scores answers against rubrics; scalable, but needs calibration against human labels
- Automated metrics: Similarity- and NLI-based scores; cheap and repeatable, best for regression tracking
Common Issues
Hallucination Detection
Answers contain information not in retrieved context.
- Use NLI (Natural Language Inference) models to check if claims are entailed by context
- LLM-as-judge to verify faithfulness
- Citation tracking to ensure all claims have sources
"Lost in the Middle" Problem
LLMs pay more attention to the beginning and end of context, missing information in the middle.
- Reorder retrieved documents (place the most relevant at the beginning and end of the context, where attention is strongest)
- Limit context size
- Use attention mechanisms that emphasize important parts
Context Over-/Under-Stuffing
Too much or too little context affects answer quality.
- Over-stuffing: Too many documents dilute focus, increase cost/latency
- Under-stuffing: Missing relevant information leads to incomplete answers
- Solution: Adaptive retrieval (start with K=5, expand if needed)
Prompt & Context Engineering
Context Window Budgeting
Manage limited context windows efficiently:
- System Prompt: 200-500 tokens (instructions, guidelines)
- Retrieved Context: 2000-4000 tokens (documents)
- User Query: 50-200 tokens
- Response Buffer: 500-1000 tokens (for generation)
- Total Budget: Bounded by the model's context window (4K-8K tokens on smaller models; larger windows relax the limit but raise cost and latency)
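Enforcing the budget for the retrieved-context slot is a greedy loop: add chunks in relevance order until the next one would overflow. The word-count `count_tokens` below is a crude stand-in for a real tokenizer:

```python
def pack_context(chunks, budget_tokens, count_tokens=lambda t: len(t.split())):
    """Greedily pack highest-ranked chunks into the token budget.
    `chunks` is assumed sorted most-relevant first."""
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break                      # next chunk would blow the budget
        packed.append(chunk)
        used += cost
    return packed, used

chunks = ["one two three", "four five", "six seven eight nine"]
packed, used = pack_context(chunks, budget_tokens=6)
# Keeps the first two chunks (5 tokens); the third would exceed the budget.
```

Swapping in the model's actual tokenizer for `count_tokens` makes the same loop production-accurate.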
Formatting Citations and Quotes
System: You are a helpful assistant. Always cite sources using [1], [2], etc.
Context:
[1] Source: document1.pdf, Page 5
"RAG improves answer quality by providing external context."
[2] Source: document2.pdf, Section 3.2
"Vector similarity search enables semantic retrieval."
User: How does RAG work?
Assistant: RAG works by retrieving relevant documents [1] and using
semantic search [2] to find context...
Source Attribution
- Include document IDs, URLs, or titles in context
- Request citations in system prompt
- Parse citations from LLM output
- Link citations back to original sources
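Parsing the bracketed markers back out of the model's answer is a small regex job. This sketch assumes the `[1]`, `[2]` convention from the prompt example above and a source table keyed by citation number:

```python
import re

def extract_citations(answer, sources):
    """Map bracketed markers like [1] in the answer back to source strings."""
    cited_ids = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    # Silently skip markers with no matching source (model hallucinated a number).
    return [sources[i] for i in cited_ids if i in sources]

sources = {1: "document1.pdf, Page 5", 2: "document2.pdf, Section 3.2"}
answer = "RAG retrieves documents [1] and uses semantic search [2]."
cited = extract_citations(answer, sources)
```

A dangling marker (a number not in the source table) is itself a useful faithfulness signal worth logging.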
System Prompts
Grounding
Instructions to base answers only on provided context.
Example: "Only use information from the provided context. If the answer is not in the context, say 'I don't have that information.'"
Safety
Guidelines for handling sensitive or harmful content.
Example: "Do not generate harmful, biased, or inappropriate content. Refuse to answer questions about illegal activities."
Tool Calling
Instructions for when and how to use tools.
Example: "Use the search tool when you need additional information. Use the calculator for mathematical operations."
Refusal Behaviors
When to refuse answering.
Example: "Refuse to answer if: (1) information is not in context, (2) query is harmful, (3) query violates policies."
Scaling & Operations
Latency and Cost Optimization
Caching Strategies
- Query Caching: Cache query embeddings and results for repeated queries
- Context Caching: Cache frequently retrieved document contexts
- Generation Caching: Cache LLM responses for identical queries
- TTL (Time-To-Live): Set expiration for cached content
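A query-level cache with TTL expiry fits in a small class; this sketch keys on the exact query string, while a real system might normalize the query or key on its embedding:

```python
import time

class TTLCache:
    """Tiny in-memory TTL cache for query -> answer pairs."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}            # query -> (answer, expiry_timestamp)

    def get(self, query):
        entry = self.store.get(query)
        if entry is None:
            return None            # miss
        answer, expiry = entry
        if time.monotonic() > expiry:
            del self.store[query]  # expired: evict and report a miss
            return None
        return answer

    def put(self, query, answer):
        self.store[query] = (answer, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=60)
cache.put("what is rag?", "RAG retrieves context before generation.")
```

The TTL bounds staleness: a cached answer can never outlive the freshness window you chose for the underlying index.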
Approximate Search
- Use approximate nearest neighbor (ANN) algorithms instead of exact search
- Trade slight accuracy for significant speed improvement
- Examples: HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index)
- Speedup: 10-100x faster than exact search
Batching
- Batch embedding generation for multiple documents
- Batch LLM inference for multiple queries
- Reduces API calls and improves throughput
- Trade-off: Slightly higher latency for individual requests
Index Maintenance
Re-embedding
When to re-embed documents:
- New embedding model available (better quality)
- Document updates (content changed)
- Schema changes (metadata structure changed)
- Migration strategy: Gradual migration vs full re-embedding
Deletions
Handling document removal:
- Soft delete: Mark as deleted, filter in queries
- Hard delete: Remove from index immediately
- Cascade deletes: Remove all chunks when parent deleted
- Versioning: Keep old versions for audit trail
Drift Management
Handling changes over time:
- Model Drift: New embedding models may change similarity scores
- Schema Drift: Metadata schema changes require migration
- Content Drift: Documents updated but embeddings not refreshed
- Solution: Version tracking, migration scripts, monitoring
Monitoring
Usage Analytics
- Query volume and patterns
- Popular queries and documents
- User engagement metrics
- Peak usage times
Feedback Loops
- User ratings (thumbs up/down)
- Correction submissions
- Reported issues
- Usage patterns
Drift Detection
- Retrieval quality trends
- Answer quality degradation
- Model performance changes
- Anomaly detection
Regression Tracking
- A/B testing results
- Version comparisons
- Performance benchmarks
- Quality metrics over time
Security & Compliance
Access Control in Retrieval
Row-Level Security
Filter documents based on user permissions before retrieval.
- Check user roles/permissions
- Apply metadata filters (department, team, clearance level)
- Pre-filter or post-filter based on performance needs
- Audit all access attempts
Tenant Isolation
Ensure multi-tenant systems don't leak data between tenants.
- Separate indexes per tenant (strongest isolation)
- Shared index with tenant_id filtering (cost-effective)
- Validate filters at multiple layers
- Test isolation regularly
Data Protection
PII/PHI Redaction
Remove or mask sensitive information.
- Detect PII (SSN, email, phone) and PHI (medical records)
- Redact before indexing or at query time
- Use NER (Named Entity Recognition) models
- Comply with GDPR, HIPAA, CCPA
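Regex-based detection handles the well-formatted cases; the patterns below are illustrative only, and a production pipeline would pair them with an NER model for names, addresses, and free-form identifiers:

```python
import re

# Illustrative patterns only; real PII detection needs NER plus locale-aware rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redacted = redact("Contact john.doe@company.com or 555-123-4567, SSN 123-45-6789.")
```

Redacting before indexing keeps PII out of the vector store entirely; redacting at query time preserves the raw data but relies on the filter never being bypassed.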
Data Residency
Store data in specific geographic regions.
- Choose vector store region based on requirements
- Ensure LLM API calls comply with data residency
- Track data location in metadata
- Comply with regional regulations (EU, US, etc.)
Audit and Compliance
Audit Trails
Log all access and operations.
- Query logs (who, what, when)
- Document access logs
- Modification history
- Retention policies
Policy-Aware Answering
Ensure answers comply with organizational policies.
- Check policies before generating answers
- Refuse to answer policy-violating queries
- Include policy disclaimers when needed
- Regular policy updates and compliance checks