01 Core Foundations

What is RAG and Why It Exists

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with external knowledge sources during inference. This addresses critical limitations of plain LLMs:

Limitations of Plain LLMs

  • Hallucinations: LLMs can generate plausible-sounding but factually incorrect information when they lack specific knowledge
  • Stale Knowledge: Training data has a cutoff date, making LLMs unaware of recent events or updates
  • No Source Attribution: Cannot cite where information came from, making verification difficult
  • Limited Context: Knowledge is frozen at training time, cannot access real-time or proprietary data
  • Cost of Retraining: Updating knowledge requires expensive full model retraining

RAG solves these problems by:

  • Retrieving relevant documents from an external knowledge base
  • Augmenting the prompt with retrieved context
  • Generating answers grounded in the provided context
  • Enabling source attribution and citation
  • Allowing real-time knowledge updates without model retraining

Basic RAG Loop

The fundamental RAG process follows this flow:

1. Query: User submits a question or request
2. Retrieve: Search the knowledge base for relevant documents
3. Augment: Combine the query with retrieved context in the prompt
4. Generate: The LLM produces an answer using the augmented context
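The four steps above can be sketched end to end. This is a minimal toy, not a production pipeline: word-overlap scoring stands in for vector search, and generate() is a placeholder for a real LLM call; all names are illustrative.

```python
def retrieve(query, corpus, k=2):
    """Score documents by word overlap with the query (stand-in for vector search)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def augment(query, docs):
    """Pack retrieved context and the user query into a single prompt."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Placeholder for an LLM call; a real system would send the prompt to a model."""
    return f"[LLM answer grounded in prompt of {len(prompt)} chars]"

corpus = [
    "RAG retrieves documents before generation.",
    "Fine-tuning changes model weights.",
    "Embeddings map text to vectors.",
]
docs = retrieve("how does RAG use retrieved documents", corpus)
answer = generate(augment("how does RAG use retrieved documents", docs))
```

The same four stages appear in every variant later in this guide; only the retrieval and generation components get more sophisticated.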

RAG vs Fine-tuning vs Tool Calling

RAG
  • Use case: Dynamic knowledge, citations needed, frequent updates
  • Pros: No retraining, source attribution, real-time updates
  • Cons: Retrieval latency, context limits, dependence on retrieval quality

Fine-tuning
  • Use case: Task-specific behavior, domain adaptation, style transfer
  • Pros: Better task performance, smaller models possible
  • Cons: Expensive, requires training data, knowledge frozen, no citations

Tool Calling
  • Use case: Real-time data, API calls, structured operations
  • Pros: Live data access, structured outputs, function calling
  • Cons: API dependencies, error-handling complexity, latency

Prerequisites: Embeddings and Vector Similarity

Embeddings are dense vector representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search.

Vector Similarity Metrics

Cosine Similarity: cos(θ) = (A · B) / (||A|| × ||B||)
- Range: -1 to 1 (often 0 to 1 in practice for text embeddings)
- Measures the angle between vectors
- Best for normalized embeddings
- Ignores vector magnitude

Dot Product: A · B = Σ(Aᵢ × Bᵢ)
- Range: -∞ to +∞
- Considers both direction and magnitude
- Fastest to compute
- Requires vectors from the same embedding space

Euclidean Distance: d = √(Σ(Aᵢ - Bᵢ)²)
- Range: 0 to +∞
- Measures straight-line distance
- Lower = more similar
- Can be converted to a similarity score: 1 / (1 + distance)
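All three metrics follow directly from their formulas; a small pure-Python sketch (function names are our own):

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (A · B) / (||A|| × ||B||) — angle between vectors, magnitude-invariant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    """A · B = Σ(Aᵢ × Bᵢ) — sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """d = √(Σ(Aᵢ - Bᵢ)²) — straight-line distance; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that for unit-normalized vectors, cosine similarity and dot product give identical rankings, which is why many vector stores default to one or the other interchangeably.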

Sparse vs Dense Retrieval

Sparse Retrieval (Keyword-based)

  • Examples: BM25, TF-IDF, keyword matching
  • How it works: Creates sparse vectors based on term frequency
  • Pros: Fast, interpretable, good for exact matches
  • Cons: Misses synonyms, no semantic understanding, vocabulary mismatch

Dense Retrieval (Semantic)

  • Examples: Sentence-BERT, OpenAI embeddings, Cohere embeddings
  • How it works: Neural networks create dense vectors capturing meaning
  • Pros: Semantic understanding, handles synonyms, cross-lingual
  • Cons: Computationally more expensive, requires a trained embedding model, less interpretable

Basic NLP Concepts

  • Tokenization: Breaking text into tokens (words, subwords, or characters) that models can process
  • Transformers: Neural architecture using self-attention to understand relationships between tokens
  • Attention: Mechanism allowing models to focus on relevant parts of input when generating output
  • LLM Inference: Process of generating text from a trained model, involving forward passes through neural networks
02 Data & Knowledge Ingestion

Knowledge Sources and Formats

RAG systems can ingest knowledge from diverse sources:

📄 Documents

  • PDFs (research papers, manuals, reports)
  • Word documents (.docx, .doc)
  • Markdown files
  • Plain text files

🌐 Web Content

  • HTML pages
  • Wikis (Confluence, MediaWiki)
  • Blog posts
  • Documentation sites

💬 Communication

  • Support tickets
  • Email threads
  • Slack/Teams messages
  • Chat logs

🗄️ Databases

  • SQL databases
  • NoSQL stores
  • Data warehouses
  • Knowledge bases

🔌 APIs

  • REST APIs
  • GraphQL endpoints
  • Webhooks
  • Real-time streams

📊 Logs & Events

  • Application logs
  • Event streams
  • Monitoring data
  • Audit trails

Data Structure Types

Structured Data

Well-defined schema (tables, JSON with fixed structure). Examples: databases, CSV files, API responses with schemas.

Unstructured Data

No predefined format (free text, images, audio). Examples: documents, emails, social media posts.

Semi-structured Data

Some structure but flexible schema (JSON, XML, HTML). Examples: web pages, API responses, configuration files.

Ingestion Pipelines

Batch vs Streaming

Batch Processing
  • When to use: Initial load, periodic updates, large datasets
  • Characteristics: Processes data in chunks, scheduled runs, higher throughput, simpler error handling

Streaming
  • When to use: Real-time updates, event-driven systems, low-latency requirements
  • Characteristics: Continuous processing, immediate updates, lower latency, more complex state management

ETL/ELT Patterns for RAG

1. Extract: Pull data from sources (APIs, databases, files)
2. Transform: Clean, chunk, embed, and add metadata
3. Load: Store in the vector database with metadata

Connectors: Pre-built integrations for common sources (S3, SharePoint, Confluence, Slack, etc.) that handle authentication, pagination, and format conversion.

Document Modeling

Chunking Strategies

Fixed Window Chunking

Split documents into fixed-size chunks (e.g., 512 tokens).

  • Pros: Simple, predictable, easy to implement
  • Cons: May break sentences/paragraphs, loses context
  • Use when: Uniform document structure, simple use cases

Sliding Window Chunking

Fixed-size chunks with overlap between adjacent chunks.

  • Pros: Preserves context at boundaries, reduces information loss
  • Cons: More storage, potential redundancy
  • Use when: Context continuity is important
  • Overlap: Typically 10-20% of chunk size (e.g., 100 tokens overlap for 512-token chunks)
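A minimal sketch of sliding-window chunking over a pre-tokenized document; the defaults mirror the 512/100 numbers above, but the function itself is our own illustration:

```python
def sliding_window_chunks(tokens, size=512, overlap=100):
    """Split a token list into fixed-size chunks where each chunk
    shares `overlap` tokens with the previous one."""
    assert 0 <= overlap < size, "overlap must be smaller than chunk size"
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already covers the end of the document
    return chunks
```

In practice the same logic is applied to token IDs from the embedding model's tokenizer rather than raw words, so chunk sizes line up with the model's actual context accounting.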

Semantic Chunking

Split based on semantic boundaries (sentences, paragraphs, sections).

  • Pros: Preserves meaning, better retrieval quality
  • Cons: Variable chunk sizes, more complex
  • Use when: Quality is critical, documents have clear structure
  • Methods: Sentence transformers, topic modeling, section detection

Chunk Size and Overlap Trade-offs

  • Smaller chunks (128-256 tokens): More precise retrieval, better for specific facts, but may miss context
  • Medium chunks (512-1024 tokens): Balance between precision and context, most common choice
  • Larger chunks (2048+ tokens): More context, better for complex reasoning, but less precise retrieval
  • Overlap considerations: Higher overlap improves context continuity but increases storage and processing costs

Metadata Design

Example Metadata Schema
{
  "source": "document_url_or_id",
  "timestamp": "2024-01-15T10:30:00Z",
  "doc_type": "pdf|html|ticket|api",
  "language": "en|es|fr|...",
  "access_control": {
    "roles": ["admin", "engineer"],
    "departments": ["engineering"],
    "tenants": ["company_a"]
  },
  "quality_score": 0.95,
  "chunk_index": 3,
  "total_chunks": 15,
  "section": "introduction",
  "author": "john.doe@company.com",
  "last_updated": "2024-01-10T08:00:00Z"
}

Key Metadata Fields:

  • Source: Origin of the document for citation and traceability
  • Timestamp: When document was created/updated for freshness tracking
  • Access Control Tags: For row-level security and multi-tenancy
  • Doc Type: Enables type-specific processing and filtering
  • Language: For multilingual systems and language-specific retrieval
  • Quality Scores: Confidence metrics for ranking and filtering

Embedding Pipelines

Embedding Model Choice

General-Purpose Models

  • OpenAI text-embedding-ada-002: 1536 dimensions, good general performance
  • OpenAI text-embedding-3-small/large: Latest models with better performance
  • Cohere embed-english-v3.0: 1024 dimensions, strong semantic understanding
  • Sentence-BERT (all-MiniLM-L6-v2): 384 dimensions, fast and efficient
  • Use when: General knowledge, diverse domains, standard use cases

Domain-Specific Models

  • BioBERT: Biomedical domain
  • Legal-BERT: Legal documents
  • SciBERT: Scientific papers
  • CodeBERT: Programming code
  • Use when: Specialized domain, technical jargon, domain-specific terminology

Multilingual Models

  • multilingual-e5-base/large: Supports 100+ languages
  • paraphrase-multilingual-MiniLM: Cross-lingual understanding
  • Use when: International content, cross-lingual search, global knowledge bases

Offline vs On-the-Fly Embedding

Offline Embedding
  • When to use: Batch ingestion, stable documents, large-scale systems
  • Trade-offs: Faster queries, cost-effective, but requires re-embedding for updates

On-the-Fly Embedding
  • When to use: Real-time updates, dynamic content, small-scale systems
  • Trade-offs: Always fresh, flexible, but slower queries and higher costs

Embedding Versioning

When updating embedding models or document schemas, maintain version tracking:

  • Model Version: Track which embedding model was used (e.g., "text-embedding-ada-002-v1")
  • Schema Version: Track metadata schema changes
  • Migration Strategy: Gradual migration, dual indexing, or full re-embedding
  • Backward Compatibility: Support queries against old embeddings during transition
03 Retrieval & RAG Architectures

Retrieval Building Blocks

Vector Stores

FAISS (Facebook AI Similarity Search)

  • Open-source library by Meta
  • In-memory or disk-based
  • Supports GPU acceleration
  • Best for: Research, prototyping, self-hosted

Pinecone

  • Managed vector database service
  • Auto-scaling, high availability
  • Metadata filtering
  • Best for: Production, managed infrastructure

Weaviate

  • Open-source vector database
  • GraphQL API
  • Built-in ML models
  • Best for: Complex queries, graph + vector

pgvector

  • PostgreSQL extension
  • SQL + vector search
  • ACID transactions
  • Best for: Existing PostgreSQL infrastructure

Chroma

  • Embedding database
  • Simple Python API
  • Lightweight
  • Best for: Development, small-scale

Qdrant

  • Vector similarity search engine
  • REST API
  • Payload filtering
  • Best for: Production, high performance

Retrieval Techniques

Top-K Retrieval

Retrieve the K most similar documents based on similarity score. Common values: K=5 to K=20.

  • Higher K: More context, but may include irrelevant docs
  • Lower K: More focused, but may miss relevant information

Score Thresholds

Filter results below a similarity threshold (e.g., only return docs with similarity > 0.7).

  • Prevents low-quality retrievals
  • Adaptive thresholds based on query type
  • Can be combined with Top-K

MMR (Maximal Marginal Relevance)

Diversity-focused retrieval that balances relevance and diversity.

  • Reduces redundancy in results
  • Formula: MMR(d) = λ × Sim(d, query) - (1 - λ) × max Sim(d, d') over already-selected documents d'
  • The λ parameter controls the relevance vs diversity trade-off (λ = 1 is pure relevance)
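Greedy MMR selection can be sketched as follows; the relevance scores and pairwise similarity function are toy placeholders for real query/document similarities:

```python
def mmr_select(candidates, relevance, similarity, k=3, lam=0.7):
    """Greedy MMR: repeatedly pick the document that is most relevant to the
    query while least similar to documents already selected.
    relevance: dict mapping doc -> query relevance score
    similarity: callable (doc_a, doc_b) -> pairwise similarity
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            # Penalty is the similarity to the closest already-selected doc.
            diversity_penalty = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * diversity_penalty
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With λ close to 1 this degenerates to plain top-K by relevance; lowering λ trades relevance for coverage, which helps when the top candidates are near-duplicates.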

Metadata Filtering

Filter results based on metadata before or after vector search.

  • Pre-filter: Filter before search (faster, but may miss results)
  • Post-filter: Filter after search (slower, but more accurate)
  • Examples: date ranges, document types, access permissions

Retrieval Variants

Sparse Retrieval (BM25)

BM25 (Best Matching 25) is a probabilistic ranking function for information retrieval.

  • Builds on term frequency and inverse document frequency (as in TF-IDF), adding term-frequency saturation and document-length normalization
  • Handles exact keyword matches well
  • Fast and interpretable
  • Use when: Keyword-heavy queries, exact matches needed, interpretability required
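A toy BM25 scorer using the standard Okapi weighting with a common non-negative IDF variant; whitespace tokenization is a simplification (real engines use proper analyzers):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25.
    k1 controls term-frequency saturation; b controls length normalization."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Note how a term appearing in every document gets a near-zero IDF, so common words contribute little; this is the interpretability advantage sparse retrieval is praised for above.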

Dense Retrieval

Uses embedding vectors for semantic similarity search.

  • Captures semantic meaning, not just keywords
  • Handles synonyms and paraphrasing
  • Requires embedding model
  • Use when: Semantic understanding needed, natural language queries, cross-lingual search

Hybrid Search

Combines sparse and dense retrieval for best of both worlds.

  • Reciprocal Rank Fusion (RRF): Combines rankings from both methods
  • Weighted Combination: Weighted sum of scores (e.g., 0.3 × BM25 + 0.7 × Dense)
  • Use when: Need both keyword and semantic matching, production systems
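Reciprocal Rank Fusion is only a few lines; k = 60 is the conventional smoothing constant from the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists: each doc accumulates 1 / (k + rank)
    across every list it appears in (rank is 1-based)."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the score-calibration problem of weighted combination: BM25 and cosine scores live on incomparable scales, but their ranks do not.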

Reranking

Cross-Encoder Reranking

Uses cross-encoder models (BERT-based) to score query-document pairs more accurately.

  • More accurate than bi-encoder (embedding) models
  • Slower (can't pre-compute embeddings)
  • Typically rerank top 20-100 candidates
  • Models: cross-encoder/ms-marco-MiniLM, bge-reranker

LLM Judge Reranking

Uses LLM to evaluate and rank retrieved documents.

  • Most flexible and context-aware
  • Can consider complex relevance criteria
  • Expensive and slow
  • Use for critical applications or complex queries

Vanilla RAG Pipeline in Detail

1. Query Understanding: Parse and normalize the user query, extract intent, identify entities
2. Query Embedding: Convert the query to an embedding vector using the same model as the documents
3. Vector Search: Find the top-K most similar documents in the vector store
4. Rerank: Optionally rerank results using a cross-encoder or LLM judge
5. Context Packing: Combine retrieved documents into the context, respecting token limits
6. Prompt Construction: Build the prompt from system instructions, context, and the user query
7. Generation: The LLM generates an answer from the augmented prompt

Single-Shot QA vs Conversation

Single-Shot QA
  • Characteristics: Each query is independent, no conversation history
  • Implementation: Simple prompt with query + context, stateless

Conversation with History
  • Characteristics: Maintains conversation context, supports follow-up questions
  • Implementation: Include chat history in the prompt, manage conversation state

Memory-Augmented
  • Characteristics: Persistent memory across sessions, user preferences
  • Implementation: External memory store, retrieval of relevant memories

RAG Architectural Patterns

Single-Index Architecture

All documents in one vector index.

  • Pros: Simple, single query, easy to manage
  • Cons: No domain separation, harder to scale
  • Use when: Small-scale, single domain, homogeneous content

Multi-Index Architecture

Separate indexes for different domains or document types.

  • Pros: Domain-specific optimization, better organization, parallel queries
  • Cons: More complex, need to route queries, multiple indexes to maintain
  • Use when: Multiple domains, different document types, specialized retrieval needs

Multi-Store Architecture

Combines vector store with other data stores (SQL, graph, etc.).

  • Pros: Best tool for each data type, flexible queries
  • Cons: Complex integration, query coordination needed
  • Use when: Mixed data types, structured + unstructured, complex queries

Multi-Tenancy Patterns

Per-Tenant Index

Separate vector index for each tenant.

  • Pros: Complete isolation, tenant-specific optimization
  • Cons: Higher cost, more indexes to manage

Shared Index with Row-Level Filters

Single index with metadata-based filtering for tenant isolation.

  • Pros: Cost-effective, easier management, cross-tenant analytics possible
  • Cons: Requires careful access control, potential for data leakage if misconfigured
  • Implementation: Filter by tenant_id in metadata before/after retrieval
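A sketch of the shared-index pattern: pre-filter by tenant before scoring, assuming an in-memory list of (vector, metadata) pairs rather than any particular vector store's API:

```python
def search_with_tenant_filter(index, query_vec, tenant_id, top_k=5):
    """Pre-filter: only score chunks whose metadata matches the caller's tenant.
    `index` is a list of (vector, metadata) pairs; scoring is a dot product."""
    candidates = [(vec, meta) for vec, meta in index
                  if meta.get("tenant_id") == tenant_id]
    scored = sorted(
        candidates,
        key=lambda item: sum(a * b for a, b in zip(item[0], query_vec)),
        reverse=True,
    )
    return [meta for _, meta in scored[:top_k]]
```

The important property is that documents from other tenants never enter the candidate set, so a missing post-hoc filter cannot leak them; real vector stores expose the same idea through metadata filter parameters.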
04 Advanced Retrieval & Knowledge Workflows

Advanced RAG Techniques

Multi-Hop / Multi-Step Retrieval

Iterative retrieval where each step uses information from previous retrievals.

  • Step 1: Initial query retrieves relevant documents
  • Step 2: Extract entities/concepts from Step 1 results
  • Step 3: Query for documents related to extracted entities
  • Step 4: Combine results from all steps
  • Use when: Complex queries requiring multiple pieces of information, research tasks
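The iterative loop above can be sketched with pluggable retrieve and entity-extraction callables; both are hypothetical stand-ins for real components:

```python
def multi_hop_retrieve(query, retrieve, extract_entities, hops=2):
    """Iterative retrieval: each hop queries for entities found in the
    previous hop's results, then all hops' documents are combined."""
    all_docs, current_query = [], query
    for _ in range(hops):
        docs = retrieve(current_query)
        all_docs.extend(d for d in docs if d not in all_docs)  # dedupe across hops
        entities = extract_entities(docs)
        if not entities:
            break  # nothing new to follow up on
        current_query = " ".join(entities)
    return all_docs
```

A real system would cap hops and deduplicate by document ID rather than string equality, but the shape of the loop is the same.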

Recursive Retrieval

Retrieves parent documents when child chunks are relevant.

  • If a chunk is retrieved, also retrieve its parent document/section
  • Provides broader context around relevant chunks
  • Useful for hierarchical document structures

Query Planning

LLM generates a plan for breaking down complex queries into sub-queries.

  • Analyze query complexity
  • Generate sub-questions
  • Execute sub-queries in sequence or parallel
  • Synthesize final answer from sub-query results

Query Rewriting and Decomposition

Sub-Question Decomposition

Break complex questions into simpler sub-questions.

  • Example: "What are the benefits and drawbacks of RAG?" → ["What are benefits of RAG?", "What are drawbacks of RAG?"]
  • Retrieve for each sub-question
  • Combine answers

Self-Ask Pattern

LLM asks itself follow-up questions to gather needed information.

  • Model generates its own questions
  • Retrieves answers to self-generated questions
  • Uses answers to respond to original query

Self-RAG

Model critiques its own retrieval and generation, triggering re-retrieval if needed.

  • Retrieve: Initial retrieval based on query
  • Critique: Model evaluates if retrieved context is sufficient
  • Re-retrieve: If insufficient, refine query and retrieve again
  • Generate: Generate answer using retrieved context
  • Self-reflect: Evaluate answer quality, decide if re-generation needed

Graph and Structured RAG

GraphRAG Concepts

Knowledge Graph Construction

Extract entities and relationships from documents to build a knowledge graph.

  • Entity Extraction: Identify people, organizations, concepts, etc.
  • Relationship Extraction: Identify connections between entities
  • Graph Storage: Store in graph database (Neo4j, Amazon Neptune)
  • Graph Embeddings: Create embeddings for nodes and edges

Graph Search + Embeddings

Combine graph traversal with vector similarity search.

  • Graph Traversal: Follow relationships to find connected entities
  • Vector Search: Find semantically similar nodes
  • Hybrid: Use graph structure to filter/rank vector results
  • Use when: Relationships matter, entity-centric queries, complex knowledge domains

Structured Data Integration

Vector + SQL/OLAP

Combine unstructured vector search with structured database queries.

  • Vector search for semantic content
  • SQL queries for structured data (numbers, dates, categories)
  • Merge results in final answer
  • Example: "Find documents about Q4 sales (vector) where revenue > $1M (SQL)"

Vector + API Tools

Combine retrieved context with real-time API data.

  • Retrieve relevant documents
  • Call APIs for real-time data (prices, weather, stock info)
  • LLM synthesizes both sources
  • Enables dynamic, up-to-date answers

Workflow / Agent Patterns with RAG

Agentic Flows

Multi-step workflows where agents use tools including RAG retrieval.

  • 🔍 Search Tool: Agent uses RAG to search the knowledge base
  • 📝 Summarizer: Summarizes retrieved documents
  • ✍️ Writer: Generates the final response based on the summary
  • ✅ Validator: Verifies answer quality and fact-checks

Task-Oriented Workflows

Summarization

  • Retrieve relevant documents
  • Extract key points
  • Generate concise summary
  • Use: Meeting notes, research papers, long documents

Comparison

  • Retrieve documents about each item
  • Extract features/attributes
  • Compare side-by-side
  • Use: Product comparison, policy analysis

Decision Support

  • Retrieve relevant policies/guidelines
  • Analyze current situation
  • Recommend actions
  • Use: Compliance, risk assessment

Report Generation

  • Retrieve data from multiple sources
  • Synthesize information
  • Generate structured report
  • Use: Status reports, analysis reports

Enterprise Knowledge Management Workflows

Use Cases

Internal KB Search

  • Company documentation search
  • Employee self-service
  • Knowledge discovery
  • Onboarding assistance

Policy Assistants

  • HR policy queries
  • Compliance checking
  • Regulatory guidance
  • Procedure lookup

Project Knowledge Discovery

  • Find related projects
  • Learn from past projects
  • Identify experts
  • Best practices discovery

Governance Workflows

  • 📥 Content Lifecycle: Ingestion → Processing → Indexing → Serving → Archival
  • ✏️ Curation: Review, tag, categorize, quality check
  • ✅ Approval: Subject matter expert review, approval workflow
  • 📦 Archival: Version control, deprecation, removal

05 Quality, Evaluation, and Production

Retrieval Evaluation

Ground-Truth Datasets

Create evaluation datasets with query-answer pairs and relevant document IDs.

  • QA Datasets: Question-answer pairs with source documents
  • Relevance Labels: Human-annotated relevance scores for query-document pairs
  • Benchmark Datasets: MS MARCO, Natural Questions, SQuAD, BEIR

Retrieval Metrics

Hit Rate

Percentage of queries where at least one relevant document is retrieved in top-K.

Formula: (Queries with at least 1 relevant doc) / (Total queries)

Recall@K

Percentage of relevant documents retrieved in top-K results.

Formula: (Relevant docs retrieved) / (Total relevant docs)

NDCG (Normalized Discounted Cumulative Gain)

Measures ranking quality, giving higher weight to top positions.

  • Accounts for position of relevant documents
  • Range: 0 to 1 (higher is better)
  • Best for: Ranking evaluation

MRR (Mean Reciprocal Rank)

Average of reciprocal ranks of first relevant document.

Formula: (1 / rank of first relevant) averaged across queries
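Hit Rate, Recall@K, and MRR follow directly from ranked result lists; a small sketch with our own function names:

```python
def hit_rate(results, relevant):
    """Fraction of queries with at least one relevant doc in the retrieved list."""
    hits = sum(1 for r, rel in zip(results, relevant) if set(r) & set(rel))
    return hits / len(results)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-K retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant doc per query (0 if none)."""
    total = 0.0
    for r, rel in zip(results, relevant):
        for rank, doc in enumerate(r, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```

Here `results` is a list of ranked document-ID lists (one per query) and `relevant` the matching ground-truth IDs from the evaluation dataset.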

Per-Component Evaluation

  • Retriever: Evaluate retrieval quality independently (Hit Rate, Recall@K)
  • Reranker: Evaluate ranking improvement (NDCG improvement)
  • Generator: Evaluate answer quality given perfect retrieval (faithfulness, relevance)

Generation Evaluation

Answer Quality Metrics

Faithfulness

Answer is grounded in retrieved context, no hallucinations.

  • Check if claims are supported by context
  • Detect contradictions
  • Methods: NLI models, LLM-as-judge, fact-checking

Relevance

Answer addresses the query.

  • Does answer match query intent?
  • Is information complete?
  • Methods: Semantic similarity, LLM-as-judge

Completeness

Answer covers all aspects of the query.

  • No missing information
  • All sub-questions addressed
  • Methods: Coverage analysis, LLM-as-judge

Style

Answer matches desired tone and format.

  • Formal vs casual
  • Technical vs simple
  • Length and structure

Evaluation Methods

Method
Pros
Cons
LLM-as-Judge
Flexible, understands context, scalable
Cost, potential bias, less interpretable
Human Evaluation
Most accurate, understands nuance
Expensive, slow, subjective
Automated Metrics
Fast, cheap, reproducible
May not capture quality, limited scope

Common Issues

Hallucination Detection

Answers contain information not in retrieved context.

  • Use NLI (Natural Language Inference) models to check if claims are entailed by context
  • LLM-as-judge to verify faithfulness
  • Citation tracking to ensure all claims have sources

"Lost in the Middle" Problem

LLMs pay more attention to the beginning and end of context, missing information in the middle.

  • Reorder retrieved documents (place the most relevant at the beginning and end of the context, least relevant in the middle)
  • Limit context size
  • Use attention mechanisms that emphasize important parts

Context Over-/Under-Stuffing

Too much or too little context affects answer quality.

  • Over-stuffing: Too many documents dilute focus, increase cost/latency
  • Under-stuffing: Missing relevant information leads to incomplete answers
  • Solution: Adaptive retrieval (start with K=5, expand if needed)

Prompt & Context Engineering

Context Window Budgeting

Manage limited context windows efficiently:

  • System Prompt: 200-500 tokens (instructions, guidelines)
  • Retrieved Context: 2000-4000 tokens (documents)
  • User Query: 50-200 tokens
  • Response Buffer: 500-1000 tokens (for generation)
  • Total Budget: Typically 4K-8K tokens for most models
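The budgeting above amounts to greedy packing of ranked chunks until the context allotment is spent; a sketch (whitespace token counting is an approximation of a real tokenizer):

```python
def pack_context(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily add ranked chunks until the token budget is exhausted.
    `chunks` should already be sorted by relevance; `count_tokens` is a
    whitespace approximation — real systems use the model's tokenizer."""
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; smaller later ones may fit
        packed.append(chunk)
        used += cost
    return packed, used
```

Skipping rather than stopping at the first overflow lets a short lower-ranked chunk use budget a long higher-ranked one could not, at the cost of occasionally preferring brevity over rank.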

Formatting Citations and Quotes

Example Prompt with Citations
System: You are a helpful assistant. Always cite sources using [1], [2], etc.

Context:
[1] Source: document1.pdf, Page 5
"RAG improves answer quality by providing external context."

[2] Source: document2.pdf, Section 3.2
"Vector similarity search enables semantic retrieval."

User: How does RAG work?

Assistant: RAG works by retrieving relevant documents [1] and using 
semantic search [2] to find context...

Source Attribution

  • Include document IDs, URLs, or titles in context
  • Request citations in system prompt
  • Parse citations from LLM output
  • Link citations back to original sources

System Prompts

Grounding

Instructions to base answers only on provided context.

Example: "Only use information from the provided context. If the answer is not in the context, say 'I don't have that information.'"

Safety

Guidelines for handling sensitive or harmful content.

Example: "Do not generate harmful, biased, or inappropriate content. Refuse to answer questions about illegal activities."

Tool Calling

Instructions for when and how to use tools.

Example: "Use the search tool when you need additional information. Use the calculator for mathematical operations."

Refusal Behaviors

When to refuse answering.

Example: "Refuse to answer if: (1) information is not in context, (2) query is harmful, (3) query violates policies."

Scaling & Operations

Latency and Cost Optimization

Caching Strategies

  • Query Caching: Cache query embeddings and results for repeated queries
  • Context Caching: Cache frequently retrieved document contexts
  • Generation Caching: Cache LLM responses for identical queries
  • TTL (Time-To-Live): Set expiration for cached content
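A minimal TTL cache for query results, with lazy eviction on read (illustrative only; production systems typically use Redis or a similar store with built-in expiration):

```python
import time

class TTLCache:
    """Query cache with per-entry expiration; expired entries are evicted lazily."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # stale: evict and report a miss
            return None
        return value
```

The same structure works for all three layers above: keyed on the normalized query for result caching, on document IDs for context caching, and on a hash of the full prompt for generation caching.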

Approximate Search

  • Use approximate nearest neighbor (ANN) algorithms instead of exact search
  • Trade slight accuracy for significant speed improvement
  • Examples: HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index)
  • Speedup: 10-100x faster than exact search

Batching

  • Batch embedding generation for multiple documents
  • Batch LLM inference for multiple queries
  • Reduces API calls and improves throughput
  • Trade-off: Slightly higher latency for individual requests

Index Maintenance

Re-embedding

When to re-embed documents:

  • New embedding model available (better quality)
  • Document updates (content changed)
  • Schema changes (metadata structure changed)
  • Migration strategy: Gradual migration vs full re-embedding

Deletions

Handling document removal:

  • Soft delete: Mark as deleted, filter in queries
  • Hard delete: Remove from index immediately
  • Cascade deletes: Remove all chunks when parent deleted
  • Versioning: Keep old versions for audit trail

Drift Management

Handling changes over time:

  • Model Drift: New embedding models may change similarity scores
  • Schema Drift: Metadata schema changes require migration
  • Content Drift: Documents updated but embeddings not refreshed
  • Solution: Version tracking, migration scripts, monitoring

Monitoring

Usage Analytics

  • Query volume and patterns
  • Popular queries and documents
  • User engagement metrics
  • Peak usage times

Feedback Loops

  • User ratings (thumbs up/down)
  • Correction submissions
  • Reported issues
  • Usage patterns

Drift Detection

  • Retrieval quality trends
  • Answer quality degradation
  • Model performance changes
  • Anomaly detection

Regression Tracking

  • A/B testing results
  • Version comparisons
  • Performance benchmarks
  • Quality metrics over time

Security & Compliance

Access Control in Retrieval

Row-Level Security

Filter documents based on user permissions before retrieval.

  • Check user roles/permissions
  • Apply metadata filters (department, team, clearance level)
  • Pre-filter or post-filter based on performance needs
  • Audit all access attempts

Tenant Isolation

Ensure multi-tenant systems don't leak data between tenants.

  • Separate indexes per tenant (strongest isolation)
  • Shared index with tenant_id filtering (cost-effective)
  • Validate filters at multiple layers
  • Test isolation regularly

Data Protection

PII/PHI Redaction

Remove or mask sensitive information.

  • Detect PII (SSN, email, phone) and PHI (medical records)
  • Redact before indexing or at query time
  • Use NER (Named Entity Recognition) models
  • Comply with GDPR, HIPAA, CCPA

Data Residency

Store data in specific geographic regions.

  • Choose vector store region based on requirements
  • Ensure LLM API calls comply with data residency
  • Track data location in metadata
  • Comply with regional regulations (EU, US, etc.)

Audit and Compliance

Audit Trails

Log all access and operations.

  • Query logs (who, what, when)
  • Document access logs
  • Modification history
  • Retention policies

Policy-Aware Answering

Ensure answers comply with organizational policies.

  • Check policies before generating answers
  • Refuse to answer policy-violating queries
  • Include policy disclaimers when needed
  • Regular policy updates and compliance checks