Building a Multi-Agent RAG Pipeline: Lessons from Production
How we built a production RAG system that handles 10K+ queries daily: the architecture, the failures, and the hard-won optimizations that cut latency by roughly 60%.
The Problem
When we first deployed our RAG pipeline, it worked great in demos. Under load, it fell apart: queries timed out, context windows overflowed, and retrieval quality degraded non-linearly once the document corpus grew past 10K chunks.
We were building a document intelligence system for Nivant Labs’ internal tools. The requirement was straightforward: ingest technical documentation, answer questions about it with citations. What we didn’t anticipate was how quickly naive RAG breaks at scale.
The Failure Mode
Our initial architecture was textbook:
User Query → Embedding → Vector Search → LLM Generation → Response
Simple. Elegant. Wrong.
The first sign of trouble was latency. A single query took 4-6 seconds end-to-end. The embedding call was fast (~200ms), but the vector search on our Pinecone index started creeping up as we added documents. By 50K chunks, search alone took 1.2s.
The second sign was context poisoning. With top-K=5 retrieval, we’d often get 2-3 irrelevant chunks mixed in. The LLM would latch onto these, generating confident-sounding but wrong answers.
The third sign was cost. Every query sent 5 chunks × 4K tokens to GPT-4, roughly 20K context tokens per request. At scale, that's not sustainable.
The Investigation
We profiled every stage of the pipeline:
| Stage | P50 | P95 | Bottleneck |
|---|---|---|---|
| Query embedding | 180ms | 350ms | Model size |
| Vector search | 800ms | 2.1s | Index size |
| Reranking | n/a | n/a | Not implemented |
| LLM generation | 3.2s | 8.5s | Context length |
| Citation validation | n/a | n/a | Not implemented |
The data told a clear story: we were doing too much in one shot and not validating anything.
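A minimal sketch of how per-stage timings like these could be collected. The `timed` context manager, the `_timings` store, and the stage names are hypothetical and not taken from our codebase; percentiles come from the standard library's `statistics.quantiles`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

# Hypothetical in-memory store: stage name -> list of durations in milliseconds.
_timings: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(stage: str):
    """Record wall-clock time spent inside the `with` block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _timings[stage].append((time.perf_counter() - start) * 1000.0)


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) with linear interpolation."""
    if len(samples) < 2:
        return samples[0] if samples else 0.0
    # quantiles(n=100) returns the 99 cut points P1..P99.
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def report() -> dict[str, tuple[float, float]]:
    """Map each stage to its (P50, P95) latency in milliseconds."""
    return {s: (percentile(v, 50), percentile(v, 95)) for s, v in _timings.items()}
```

Wrapping each stage call, for example `with timed("vector_search"): ...`, and calling `report()` after a load test would produce the P50 and P95 columns of a table like the one above. The context manager also works around `await` expressions, since it only measures wall-clock time.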
The Solution: Multi-Agent Architecture
We redesigned the pipeline as a multi-agent system with specialized roles:
     User Query
          │
          ▼
 ┌─────────────────┐
 │  Query Router   │  ← Classifies query type (factual, analytical, code)
 └────────┬────────┘
          │
     ┌────┴────┐
     ▼         ▼
┌────────┐ ┌────────┐
│Retrieve│ │  Code  │  ← Parallel specialized retrievers
│  Text  │ │ Search │
└───┬────┘ └───┬────┘
    │          │
    ▼          ▼
 ┌─────────────────┐
 │    Reranker     │  ← Cross-encoder reranking (Cohere)
 └────────┬────────┘
          │
     ┌────┴────┐
     ▼         ▼
┌────────┐ ┌────────┐
│  LLM   │ │Citation│  ← Parallel generation + validation
│  Gen   │ │ Check  │
└───┬────┘ └───┬────┘
    │          │
    ▼          ▼
 ┌─────────────────┐
 │   Aggregator    │  ← Final response assembly
 └─────────────────┘
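The two parallel stages in the diagram map naturally onto `asyncio.gather`. The sketch below shows only that fan-out/fan-in wiring. Every stage function is a hypothetical stub, routing is omitted, and the check is assumed to need only the reranked chunks so that it can overlap with generation.

```python
import asyncio


# Hypothetical stubs standing in for the real services; names are illustrative.
async def retrieve_text(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate I/O
    return [f"text:{query}"]


async def retrieve_code(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"code:{query}"]


async def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    return chunks[:top_k]


async def generate(query: str, chunks: list[str]) -> str:
    await asyncio.sleep(0.01)
    return f"answer from {len(chunks)} chunks"


async def check_sources(chunks: list[str]) -> bool:
    # Assumption: this check only needs the chunks, not the generated text.
    await asyncio.sleep(0.01)
    return all(chunks)


async def answer(query: str) -> dict:
    # Fan out to both retrievers concurrently, then fan in.
    text_hits, code_hits = await asyncio.gather(
        retrieve_text(query), retrieve_code(query)
    )
    top = await rerank(query, text_hits + code_hits)
    # Generation and source checking overlap instead of running back to back.
    draft, sources_ok = await asyncio.gather(generate(query, top), check_sources(top))
    return {"answer": draft, "sources_ok": sources_ok, "chunks": top}
```

Calling `asyncio.run(answer("how do retries work"))` runs the whole flow. A check that inspects the generated text itself could not overlap with generation and would have to run after `generate` returns.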
Key Design Decisions
1. Query Router
The router classifies incoming queries into three types using a lightweight classifier (not an LLM call):
async def route_query(query: str) -> QueryType:
    embedding = await fast_embed(query)
    probs = router_model.predict(embedding)
    return QueryType(probs.argmax())
This let us dispatch to specialized retrievers instead of hammering one index.
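For illustration, a lightweight classifier of this kind could be as small as a softmax over one linear layer applied to the query embedding. The sketch below is an assumption, not our production `router_model`: the `LinearRouter` class, the `QueryType` ordering, and the toy weights are all hypothetical.

```python
import math
from enum import IntEnum


class QueryType(IntEnum):
    # Assumed ordering; the three types themselves come from the post.
    FACTUAL = 0
    ANALYTICAL = 1
    CODE = 2


class LinearRouter:
    """Softmax over a linear layer: one weight row and one bias per QueryType."""

    def __init__(self, weights: list[list[float]], biases: list[float]):
        self.weights = weights
        self.biases = biases

    def predict(self, embedding: list[float]) -> list[float]:
        logits = [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(self.weights, self.biases)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


def classify(router: LinearRouter, embedding: list[float]) -> QueryType:
    """Pick the query type with the highest predicted probability."""
    probs = router.predict(embedding)
    return QueryType(probs.index(max(probs)))


# Toy 3-dimensional example: identity weights, zero biases.
toy_router = LinearRouter([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.0, 0.0, 0.0])
```

With the toy weights, `classify(toy_router, [0.1, 0.2, 0.9])` selects `QueryType.CODE`. A real router would be trained on labeled queries and use the full embedding dimension.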
2. Specialized Retrievers
Instead of one vector index, we split the corpus into text and code indices, and added a BM25 index so that text queries combine dense and sparse results:
class HybridRetriever:
    def __init__(self):
        self.text_index = VectorIndex("text-chunks")
        self.code_index = VectorIndex("code-chunks")
        self.bm25_index = BM25Index()

    async def retrieve(self, query: str, qtype: QueryType, top_k: int = 10):
        if qtype == QueryType.CODE:
            results = await self.code_index.search(query, top_k)
        else:
            dense = await self.text_index.search(query, top_k)
            sparse = self.bm25_index.search(query, top_k // 2)
            results = self.reciprocal_rank_fusion(dense, sparse)
        return results
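The `reciprocal_rank_fusion` method called above is not shown. A minimal sketch of the standard technique follows, assuming each ranking is a list of chunk IDs ordered best-first (the real method works on chunk objects). The constant `k = 60` is a commonly used default, not a value confirmed for our system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; rank 1 is the best hit in each list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["a", "b", "c"]
sparse = ["b", "d"]
fused = reciprocal_rank_fusion(dense, sparse)  # ["b", "a", "d", "c"]
```

Chunk `b` wins because it appears in both lists, which is the property that makes rank fusion useful here: it needs no score calibration between the dense and sparse retrievers, only their rank orders.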
3. Cross-Encoder Reranking
The biggest quality win came from adding a reranking step:
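class Reranker:
    async def rerank(self, query: str, chunks: list[Chunk], top_k: int = 3):
        # Truncate each chunk to its first 256 characters to bound reranker input.
        pairs = [(query, c.text[:256]) for c in chunks]
        scores = await self.cohere.rerank(pairs=pairs)
        # Sort by score, highest first, so the cutoff and top_k keep the best chunks.
        ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
        return [c for c, s in ranked if s > 0.3][:top_k]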
This single change improved relevance from 72% to 94%.
The Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| P50 latency | 4.2s | 1.8s | 57% faster |
| P95 latency | 9.1s | 3.4s | 63% faster |
| Relevance score | 72% | 94% | +22pp |
| Cost per query | $0.042 | $0.018 | 57% cheaper |
| Citation accuracy | 68% | 96% | +28pp |
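The percentage figures in the Improvement column are relative reductions, while the `pp` rows are plain differences in percentage points (for example 94 - 72 = 22). A short check of the arithmetic, using only the numbers from the table:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


# (before, after) pairs copied from the results table.
rows = {
    "P50 latency (s)": (4.2, 1.8),
    "P95 latency (s)": (9.1, 3.4),
    "Cost per query ($)": (0.042, 0.018),
}

for name, (before, after) in rows.items():
    print(f"{name}: {percent_reduction(before, after):.0f}% lower")
```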
Trade-offs and Lessons
What we sacrificed:
- Architectural simplicity: 1 service became 5. More moving parts, more monitoring.
- Latency on simple queries: The router and reranker add ~100ms of fixed overhead, even when a query doesn't need them.
- Infra cost: Running the reranker adds compute cost, but the savings from using cheaper LLMs more than offset it.
What surprised us:
- The reranker was the single highest-impact change. Not a better LLM, not a bigger index — just ordering results better.
- The citation validator caught hallucinated citations in ~12% of generations, even with GPT-4.
- Hybrid search (dense + sparse) significantly outperformed pure vector search for technical documentation with code snippets.
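To make the citation check concrete, here is a minimal sketch of one piece of such a validator. It assumes answers cite retrieved chunks with 1-based bracketed markers such as `[2]`; that format and the function name are hypothetical. It only detects citations that point at no retrieved chunk, and does not verify that a cited chunk actually supports the claim.

```python
import re

# Matches numeric citation markers like [1] or [12].
CITATION = re.compile(r"\[(\d+)\]")


def find_invalid_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited indices (1-based) that do not map to any retrieved chunk."""
    cited = {int(m) for m in CITATION.findall(answer)}
    return sorted(i for i in cited if not 1 <= i <= num_chunks)
```

For an answer like `"Use retries [1]. Set the timeout [4]."` with three retrieved chunks, the function flags `[4]`, so the response can be regenerated or the citation stripped before it reaches the user.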