naive RAG
BASELINE
chunk docs -> embed -> vector store -> top-k retrieval -> concat to prompt -> generate. no reranking, no iteration. works for simple QA over static corpora.
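a minimal runnable sketch of exactly this pipeline. assumptions: `embed` is a hashing stand-in for a real embedding model, the "vector store" is one numpy matrix, and the final generation call is left as the returned prompt.

```python
import numpy as np

def embed(texts):
    # stand-in for a real embedding model (sentence-transformers, API, ...);
    # a hashing trick so the sketch runs with no downloads
    vecs = np.zeros((len(texts), 256))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % 256] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

def chunk(doc, size=200):
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

docs = ["...your static corpus..."]
corpus = [c for d in docs for c in chunk(d)]
index = embed(corpus)                          # "vector store": one matrix

def naive_rag(query, k=3):
    sims = index @ embed([query])[0]           # cosine similarity (rows unit-norm)
    top = [corpus[i] for i in np.argsort(-sims)[:k]]
    # concat to prompt -> hand to any LLM. no reranking, no iteration.
    return "Context:\n" + "\n---\n".join(top) + f"\n\nQ: {query}\nA:"
```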
GraphRAG
RELATIONAL
knowledge graph as retrieval substrate. entities + relations replace flat chunks. enables multi-hop reasoning across connected nodes. community detection for global summarization.
Neo4j · Microsoft GraphRAG · LlamaIndex KG · Apache Jena
multi-entity QA, legal discovery, biomedical literature
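a hedged sketch of the multi-hop step with networkx. the toy graph, relation names, and stubbed entity linking are all illustrative; real systems build the graph via LLM extraction over the corpus.

```python
import networkx as nx

# toy entity-relation store; in practice extracted from documents by an LLM
G = nx.Graph()
G.add_edge("drug_X", "protein_Y", relation="inhibits", source="paper_12")
G.add_edge("protein_Y", "disease_Z", relation="implicated_in", source="paper_7")

def extract_entities(query):
    # stand-in for entity linking / NER
    return [n for n in G.nodes if n.lower() in query.lower()]

def graph_retrieve(query, hops=2):
    facts = []
    for ent in extract_entities(query):
        sub = nx.ego_graph(G, ent, radius=hops)          # k-hop neighborhood
        for u, v, data in sub.edges(data=True):
            facts.append(f"{u} --{data['relation']}--> {v} [{data['source']}]")
    return sorted(set(facts))                            # connected evidence for the prompt

# drug_X -> protein_Y -> disease_Z: two hops a flat chunk index would miss
print(graph_retrieve("how does drug_X relate to disease_Z?"))
```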
HyDE (hypothetical document embeddings)
QUERY-SIDE
LLM generates hypothetical answer -> embed that instead of query. bridges vocabulary mismatch between questions and documents. zero-shot; no labeled data needed. cost: extra LLM call per query.
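the trick as a sketch. `llm` and `index.search` are stand-ins for whatever completion call and vector store you use.

```python
def llm(prompt): ...          # stand-in: any chat-completion call

def hyde_search(query, index, embed, k=5):
    # 1. generate a document that *answers* the query
    #    (facts may be wrong; only the wording matters)
    hypothetical = llm(
        f"Write a short passage that plausibly answers: {query}\n"
        "Invented specifics are fine."
    )
    # 2. embed the hypothetical answer, not the question: the query vector now
    #    lives in document space, closing the question/document vocabulary gap
    return index.search(embed([hypothetical])[0], k)   # cost: one extra LLM call
```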
hierarchical RAG
TREE-STRUCTURED
tree-structured retrieval: cluster chunks -> summarize clusters -> recursive tree. query matches at multiple abstraction levels. handles both specific detail and thematic questions from same corpus.
RAPTOR (Stanford) · LlamaIndex TreeIndex
long documents, book-level QA, research synthesis
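one recursion step as a sketch. assumptions: `embed` returns an (n, d) array and `summarize` is a stubbed LLM call; real RAPTOR uses soft GMM clustering, KMeans here keeps it short.

```python
from sklearn.cluster import KMeans

def summarize(texts): ...     # stand-in: LLM summary of a cluster of chunks

def build_tree(chunks, embed, branching=8):
    """cluster -> summarize -> recurse on the summaries. returns every node,
    so a query can match leaves (detail) or summaries (themes)."""
    nodes = list(chunks)
    while len(chunks) > branching:
        labels = KMeans(n_clusters=branching, n_init="auto").fit_predict(embed(chunks))
        summaries = [
            summarize([c for c, l in zip(chunks, labels) if l == cid])
            for cid in range(branching)
        ]
        nodes.extend(summaries)       # index leaves AND summaries together
        chunks = summaries            # next level up the tree
    return nodes                      # embed all of these into one index
```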
hybrid RAG
MULTI-SIGNAL
dense vectors + sparse retrieval (BM25/TF-IDF) fused via reciprocal rank fusion (RRF). dense catches semantics; sparse catches exact terms. outperforms either alone on most benchmarks. standard in production.
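reciprocal rank fusion in full; the only knob is the damping constant k (60 in the original RRF paper).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: ranked doc-id lists, e.g. [dense_ids, bm25_ids].
    RRF: score(d) = sum over rankers of 1 / (k + rank_of_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]    # semantic neighbors
sparse = ["d1", "d9", "d3"]    # BM25 exact-term matches
print(reciprocal_rank_fusion([dense, sparse]))   # d1, d3 rise: found by both signals
```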
contextual retrieval
CHUNK-AUGMENTED
prepend document-level context to each chunk before embedding. Anthropic's approach: an LLM generates a 50-100 token context prefix per chunk. on Anthropic's benchmark, contextual embeddings cut the top-20 retrieval failure rate by 35%; combined with contextual BM25, by 49%; with reranking added, by 67%.
Anthropic API · custom pipeline · LlamaIndex
any chunked corpus where chunk isolation hurts retrieval
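a sketch of the indexing pass. the prompt wording approximates Anthropic's published template rather than copying it, and `llm` is a stand-in for e.g. a Claude messages call (the published approach also leans on prompt caching to keep this cheap).

```python
def llm(prompt): ...   # stand-in: e.g. an Anthropic messages.create call

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Give a short (50-100 token) context that situates this chunk within the
overall document, to improve search retrieval. Answer with only the context."""

def contextualize(document, chunks):
    # prepend a document-aware prefix to each chunk BEFORE building
    # both the embedding index and the BM25 index
    return [
        llm(CONTEXT_PROMPT.format(document=document, chunk=c)) + "\n\n" + c
        for c in chunks
    ]
```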
>>
generation control
how the LLM uses retrieved context
self-RAG
REFLECTIVE
LLM decides when to retrieve, then self-critiques output with special tokens. trained on reflection tokens: [Retrieve], [IsRel], [IsSup], [IsUse]. skips retrieval when unnecessary. 5-20% accuracy gains on knowledge-intensive tasks.
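the control flow as a sketch. in Self-RAG proper the reflection signals come from the fine-tuned LM itself; here `critic` is a stub returning a score so the branching is visible.

```python
def generate(prompt): ...
def critic(token, text): ...   # stub: the real LM emits [Retrieve]/[IsRel]/[IsSup]/[IsUse] itself

def self_rag(query, retrieve):
    if critic("[Retrieve]", query) < 0.5:
        return generate(query)                    # skip retrieval when unnecessary
    candidates = []
    for doc in retrieve(query):
        if critic("[IsRel]", f"{query}\n{doc}") < 0.5:
            continue                              # drop irrelevant passages
        answer = generate(f"{doc}\n\n{query}")
        score = critic("[IsSup]", doc + answer) + critic("[IsUse]", query + answer)
        candidates.append((score, answer))        # self-critique each candidate
    return max(candidates)[1] if candidates else generate(query)
```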
corrective RAG
SELF-CORRECTING
evaluator scores retrieval quality -> three paths: correct (use docs), ambiguous (refine + web search), incorrect (web search only). lightweight retrieval evaluator triggers correction before generation. robust to noisy retrieval.
LangGraph · custom pipeline · Tavily Search
production QA with unreliable corpora, mixed-source systems
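the three-way branch as a sketch; all four helpers are stand-ins and the thresholds are illustrative.

```python
def evaluate(query, docs): ...   # lightweight retrieval evaluator -> score in [0, 1]
def refine(docs): ...            # decompose docs, keep only relevant strips
def web_search(query): ...       # e.g. Tavily; stand-in here
def generate(query, context): ...

def corrective_rag(query, retrieve, hi=0.7, lo=0.3):
    docs = retrieve(query)
    score = evaluate(query, docs)
    if score >= hi:                        # CORRECT: trust the corpus docs
        context = docs
    elif score <= lo:                      # INCORRECT: discard them, web only
        context = web_search(query)
    else:                                  # AMBIGUOUS: refine + augment with web
        context = refine(docs) + web_search(query)
    return generate(query, context)        # correction happens before generation
```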
speculative RAG
PARALLEL
small LM generates multiple draft answers from different doc subsets -> large LM verifies best. drafting parallelized across doc partitions. reduces latency (large model sees only candidates, not all docs). accuracy: comparable to full-context.
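a sketch of the draft/verify split. `draft` (small LM) and `verify` (large LM) are stand-ins; the partitioning strategy is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def draft(query, docs): ...     # small fast LM: one candidate answer per partition
def verify(query, drafts): ...  # large LM: pick/score the best candidate

def speculative_rag(query, retrieved_docs, n_partitions=4):
    # partition the retrieved docs so each draft is grounded in different evidence
    parts = [retrieved_docs[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:              # drafting runs in parallel
        drafts = list(pool.map(lambda p: draft(query, p), parts))
    # the large model never reads the full doc set, only the short drafts
    return verify(query, drafts)
```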
iterative RAG
LOOPED
chain of retrieval-generation loops. each step refines the query using the previous output. decomposes complex questions into sub-queries. each iteration sharpens the context window. tradeoff: latency scales linearly with steps.
LlamaIndex sub-question · DSPy · LangGraph
multi-hop reasoning, comparative analysis, research synthesis
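the loop in miniature; `llm` and `retrieve` are stand-ins, and the DONE sentinel is just one simple way to stop.

```python
def llm(prompt): ...
def retrieve(query): ...   # returns a list of text chunks

def iterative_rag(question, max_steps=3):      # latency scales linearly with steps
    context, query = "", question
    for _ in range(max_steps):
        context += "\n".join(retrieve(query)) + "\n"
        query = llm(                            # refine: next sub-query from current evidence
            f"Question: {question}\nEvidence so far:\n{context}\n"
            "Name the single sub-question to retrieve next, or say DONE."
        )
        if query.strip() == "DONE":
            break
    return llm(f"{context}\nQ: {question}\nA:")
```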
>>
system architecture
how the pipeline is organized
agentic RAG
AUTONOMOUS
agent decides: which tools, when to retrieve, how to combine. LLM as orchestrator over retrieval tools, APIs, calculators, code execution. planning + memory + tool selection. genuine decision loops, not just retrieve-then-generate.
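a decision loop in miniature. the tool names, the TOOL/ANSWER protocol, and the toy calculator are all illustrative; real agents use structured tool-calling APIs.

```python
def llm(prompt): ...   # stand-in for a tool-calling-capable model

TOOLS = {
    "search_corpus": lambda q: ...,              # vector retrieval
    "web_search":    lambda q: ...,
    "calculator":    lambda e: str(eval(e)),     # toy only; sandbox in production
}

def agent(question, max_turns=6):
    scratchpad = ""
    for _ in range(max_turns):                   # genuine decision loop
        move = llm(
            f"Question: {question}\nScratchpad:{scratchpad}\n"
            f"Tools: {list(TOOLS)}. Reply 'TOOL <name> <input>' or 'ANSWER <text>'."
        )
        if move.startswith("ANSWER"):
            return move[len("ANSWER "):]
        _, name, arg = move.split(" ", 2)        # the agent chose a tool
        scratchpad += f"\n{name}({arg}) -> {TOOLS[name](arg)}"
    return llm(f"{scratchpad}\nBest-effort answer to: {question}")
```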
modular RAG
COMPOSABLE
pipeline as interchangeable modules: retriever, reranker, compressor, generator, router. swap components without rewriting. enables A/B testing individual stages. DSPy-style programmatic optimization of each module.
DSPy · Haystack · LlamaIndex pipelines
production systems needing iterative optimization
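the module boundaries as typed interfaces; the class names here are hypothetical, the point is that `Pipeline.run` never changes when a stage is swapped.

```python
from typing import Callable, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class Reranker(Protocol):
    def rerank(self, query: str, docs: list[str]) -> list[str]: ...

class Pipeline:
    """stages are swappable: A/B test a new reranker without touching the rest."""
    def __init__(self, retriever: Retriever, reranker: Reranker,
                 generate: Callable[[str, list[str]], str]):
        self.retriever, self.reranker, self.generate = retriever, reranker, generate

    def run(self, query: str, k: int = 20) -> str:
        docs = self.retriever.retrieve(query, k)
        docs = self.reranker.rerank(query, docs)[:5]
        return self.generate(query, docs)

# Pipeline(BM25Retriever(), CohereReranker(), llm_generate)   # names hypothetical
```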
adaptive RAG
ROUTED
classifier routes query to optimal strategy: no retrieval | single-step | multi-step. simple queries skip retrieval entirely. complex queries get iterative pipeline. reduces unnecessary latency and cost on easy questions.
LangGraph · custom classifier · DSPy
high-throughput QA with mixed query complexity
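the router as a sketch. `classify` is the whole idea (a small fine-tuned classifier or a cheap LLM call); everything else is plumbing.

```python
def classify(query):
    """stand-in router -> one of 'none' | 'single' | 'multi'."""
    ...

def adaptive_rag(query, llm, retrieve, iterative_rag):
    route = classify(query)
    if route == "none":                    # easy question: skip retrieval entirely
        return llm(query)
    if route == "single":                  # standard one-shot pipeline
        return llm("\n".join(retrieve(query)) + "\n\n" + query)
    return iterative_rag(query)            # complex: hand off to the multi-step loop
```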
>>
modality & memory
what you retrieve and remember
multi-modal RAG
CROSS-MODAL
retrieve across text, images, tables, audio, video. unified embedding space (CLIP/SigLIP) or modal-specific retrievers with fusion layer. table extraction via layout models. image captioning as retrieval bridge.
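a unified-space sketch using sentence-transformers' CLIP checkpoint; the image path is hypothetical. one index holds both modalities, so a text query can surface an image.

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")            # shared text/image space

text_vecs = model.encode(["table: quarterly revenue by region"])
img_vecs = model.encode([Image.open("chart.png")])      # hypothetical local file
index = np.vstack([text_vecs, img_vecs])                # one index, two modalities

scores = util.cos_sim(model.encode(["how did revenue trend?"]), index)[0]
print(scores.argmax().item())   # may well point at the chart, not the text chunk
```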
planning RAG
PLAN-FIRST
generate retrieval plan before executing any search. LLM decomposes question -> identifies information needs -> plans retrieval sequence -> executes. avoids irrelevant retrieval and ensures coverage.
DSPy · custom pipeline · LangGraph
complex research questions, multi-source investigations
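plan-first in miniature; `llm` and `retrieve` are stand-ins, and the line-per-need plan format is just a convenient convention.

```python
def llm(prompt): ...
def retrieve(query): ...

def plan_then_retrieve(question):
    # 1. plan before touching the index: enumerate distinct information needs
    plan = llm(
        f"Question: {question}\n"
        "List, one per line, the distinct pieces of information needed to answer."
    ).splitlines()
    # 2. execute: one targeted retrieval per planned need (no speculative searches)
    evidence = {need: retrieve(need) for need in plan if need.strip()}
    # 3. generate with coverage of every planned need
    return llm(f"Evidence by need: {evidence}\n\nNow answer: {question}")
```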
reranking RAG
PRECISION
cross-encoder reranks initial retrieval results before generation. first-stage: fast bi-encoder (top-100). second-stage: slow cross-encoder (top-k from 100). dramatically improves precision@k. near-universal in production.
Cohere Rerank · ColBERT · bge-reranker · Jina Reranker
any production RAG system (should be default)
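both stages in a sketch. the cross-encoder checkpoint is a real public model but the choice is illustrative; `bi_encoder_search` stands in for your first-stage retriever.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stage-2 model

def two_stage_retrieve(query, bi_encoder_search, k=5):
    candidates = bi_encoder_search(query, top_k=100)   # stage 1: fast, wide net
    # stage 2: the cross-encoder reads each (query, doc) pair jointly --
    # slow per pair, so it only ever sees the 100 candidates
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:k]]
```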
>>
scale & distribution
production deployment patterns
federated RAG
DISTRIBUTED
retrieval across decentralized data sources without centralizing data. privacy-preserving: data stays at source. aggregation layer merges results. critical for healthcare, finance, multi-org collaborations.
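the aggregation layer as a sketch: fan out to per-source search endpoints, merge only ranked IDs (the same RRF trick as hybrid fusion). `sources` are assumed to be async callables returning ranked doc-id lists.

```python
import asyncio

async def federated_retrieve(query, sources, k=10):
    # each source searches locally; raw documents never leave their silo,
    # only ranked result IDs cross the boundary
    rankings = await asyncio.gather(*(search(query) for search in sources))
    scores = {}
    for ranking in rankings:                      # reciprocal rank fusion merge
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```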
real-time RAG
STREAMING
continuous index updates from live data streams. new documents indexed on ingest, not batch. handles time-sensitive queries where stale data = wrong answers. event-driven architecture.
Kafka + vector DB · Apache Flink · Milvus streaming
news, financial data, social monitoring, fraud detection
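the ingest side as a sketch; `embed` and `upsert` are stand-ins for your model and vector DB client, and the event schema is hypothetical.

```python
def embed(texts): ...
def upsert(ids, vectors, payloads): ...   # stand-in: Milvus/Qdrant-style upsert

def consume(stream):
    """event-driven indexing: each arriving document is embedded and upserted
    immediately, so a query seconds later already sees it. no batch rebuilds."""
    for event in stream:                  # e.g. a Kafka consumer loop
        upsert(
            [event["id"]],
            embed([event["text"]]),
            [{"text": event["text"], "ts": event["ts"]}],  # timestamp for freshness filters
        )
```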