Building a Multi-Agent RAG Pipeline: Lessons from Production

How we built a production RAG system that handles 10K+ queries daily: the architecture, the failures, and the hard-won optimizations that cut latency by roughly 60%.

The Problem

When we first deployed our RAG pipeline, it worked great — in demo. Under load, it fell apart. Queries timed out, context windows overflowed, and the retrieval quality degraded non-linearly as the document corpus grew past 10K chunks.

We were building a document intelligence system for Nivant Labs’ internal tools. The requirement was straightforward: ingest technical documentation, answer questions about it with citations. What we didn’t anticipate was how quickly naive RAG breaks at scale.

The Failure Mode

Our initial architecture was textbook:

User Query → Embedding → Vector Search → LLM Generation → Response

Simple. Elegant. Wrong.

The first sign of trouble was latency. A single query took 4-6 seconds end-to-end. The embedding call was fast (~200ms), but the vector search on our Pinecone index started creeping up as we added documents. By 50K chunks, search alone took 1.2s.

The second sign was context poisoning. With top-K=5 retrieval, we’d often get 2-3 irrelevant chunks mixed in. The LLM would latch onto these, generating confident-sounding but wrong answers.

The third sign was cost. Every query sent 5 chunks × 4K tokens to GPT-4. At scale, that’s not sustainable.

The Investigation

We profiled every stage of the pipeline:

Stage               | P50   | P95   | Bottleneck
--------------------|-------|-------|----------------
Query embedding     | 180ms | 350ms | Model size
Vector search       | 800ms | 2.1s  | Index size
Reranking           | 0ms   | 0ms   | Not implemented
LLM generation      | 3.2s  | 8.5s  | Context length
Citation validation | 0ms   | 0ms   | Not implemented

The data told a clear story: we were doing too much in one shot and not validating anything.
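Per-stage P50/P95 numbers like the ones in the table can be gathered with a small timing helper. The following is a minimal sketch, not the profiler used for this project: the `timed`, `percentile`, and `report` names are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: stage name -> list of latencies in milliseconds.
_samples: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(stage: str):
    """Record the wall-clock time of the wrapped block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _samples[stage].append((time.perf_counter() - start) * 1000.0)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def report() -> dict[str, tuple[float, float]]:
    """Return {stage: (p50_ms, p95_ms)} for every stage recorded so far."""
    return {stage: (percentile(v, 50), percentile(v, 95)) for stage, v in _samples.items()}
```

Wrapping each stage in `with timed("vector_search"): ...` and calling `report()` after a load test yields one (P50, P95) pair per stage.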

The Solution: Multi-Agent Architecture

We redesigned the pipeline as a multi-agent system with specialized roles:

     User Query
         │
         ▼
┌─────────────────┐
│  Query Router   │  ← Classifies query type (factual, analytical, code)
└────────┬────────┘
         │
    ┌────┴─────┐
    ▼          ▼
┌────────┐ ┌────────┐
│Retrieve│ │  Code  │  ← Parallel specialized retrievers
│  Text  │ │ Search │
└───┬────┘ └───┬────┘
    │          │
    ▼          ▼
┌─────────────────┐
│    Reranker     │  ← Cross-encoder reranking (Cohere)
└────────┬────────┘
         │
    ┌────┴─────┐
    ▼          ▼
┌────────┐ ┌────────┐
│  LLM   │ │Citation│  ← Parallel generation + validation
│  Gen   │ │ Check  │
└───┬────┘ └───┬────┘
    │          │
    ▼          ▼
┌─────────────────┐
│   Aggregator    │  ← Final response assembly
└─────────────────┘
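The parallel branches in the diagram map naturally onto `asyncio.gather`. The sketch below shows only the fan-out pattern. `retrieve_text` and `retrieve_code` are hypothetical stand-ins that sleep instead of querying an index, so the names and return values are assumptions rather than the production code.

```python
import asyncio


async def retrieve_text(query: str) -> list[str]:
    # Stand-in for the text retriever; a real one would query the vector index.
    await asyncio.sleep(0.05)
    return [f"text hit for {query!r}"]


async def retrieve_code(query: str) -> list[str]:
    # Stand-in for the code retriever.
    await asyncio.sleep(0.05)
    return [f"code hit for {query!r}"]


async def fan_out(query: str) -> list[str]:
    # gather() runs both coroutines concurrently and returns results in argument order,
    # so total latency is roughly the slower branch rather than the sum of both.
    text_hits, code_hits = await asyncio.gather(
        retrieve_text(query),
        retrieve_code(query),
    )
    return text_hits + code_hits
```

Calling `asyncio.run(fan_out("how do I rotate API keys?"))` returns the text hits followed by the code hits. The same pattern would apply to the generation and citation-check stage.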

Key Design Decisions

1. Query Router

The router classifies incoming queries into three types using a lightweight classifier (not an LLM call):

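async def route_query(query: str) -> QueryType:
    # Embed with a small, fast model; no LLM call on the routing path.
    embedding = await fast_embed(query)
    # router_model returns one probability per query type.
    probs = router_model.predict(embedding)
    # Take the most likely class and map its index onto the QueryType enum.
    return QueryType(int(probs.argmax()))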

This let us dispatch to specialized retrievers instead of hammering one index.
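For readers who want something concrete behind `QueryType` and `router_model`, here is a minimal, dependency-free sketch. It assumes the three classes named in the diagram map to integer values 0 to 2 and uses a hypothetical `LinearRouter` (a linear layer plus softmax) in place of the real classifier, whose model type the post does not specify.

```python
import math
from enum import IntEnum


class QueryType(IntEnum):
    # The three classes from the diagram; the integer values are an assumption.
    FACTUAL = 0
    ANALYTICAL = 1
    CODE = 2


class LinearRouter:
    """Hypothetical stand-in for router_model: one weight vector per class, softmax output."""

    def __init__(self, weights: list[list[float]], biases: list[float]):
        self.weights = weights
        self.biases = biases

    def predict(self, embedding: list[float]) -> list[float]:
        logits = [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(self.weights, self.biases)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


def classify(router: LinearRouter, embedding: list[float]) -> QueryType:
    probs = router.predict(embedding)
    # The index of the highest probability maps directly onto the enum value.
    return QueryType(max(range(len(probs)), key=probs.__getitem__))
```

A classifier this small runs in microseconds once the embedding is available, which is the point of keeping an LLM call off the routing path.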

2. Specialized Retrievers

Instead of one vector index, we split into text and code indices:

class HybridRetriever:
    def __init__(self):
        self.text_index = VectorIndex("text-chunks")
        self.code_index = VectorIndex("code-chunks")
        self.bm25_index = BM25Index()  # sparse keyword (BM25) index

    async def retrieve(self, query: str, qtype: QueryType, top_k: int = 10):
        if qtype == QueryType.CODE:
            # Code queries go straight to the dedicated code index.
            results = await self.code_index.search(query, top_k)
        else:
            # Other queries combine dense (vector) and sparse (BM25) candidates.
            dense = await self.text_index.search(query, top_k)
            sparse = self.bm25_index.search(query, top_k // 2)
            # Fusion can return up to top_k + top_k // 2 items, so cap at top_k.
            results = self.reciprocal_rank_fusion(dense, sparse)[:top_k]
        return results
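The `reciprocal_rank_fusion` step is not shown in the retriever. A minimal sketch of the standard technique follows, operating on plain document IDs rather than chunk objects. The constant `k = 60` is the value commonly used in the literature, not a setting confirmed for this pipeline, and the `top_k` parameter is an illustrative addition.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    *rankings: list[str],
    k: int = 60,
    top_k: Optional[int] = None,
) -> list[str]:
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_k is None else fused[:top_k]


dense = ["a", "b", "c"]   # e.g. vector search order
sparse = ["c", "a", "d"]  # e.g. BM25 order
print(reciprocal_rank_fusion(dense, sparse))  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") outrank those found by only one retriever. Because RRF uses ranks rather than raw scores, dense similarity and BM25 scores never need to be put on a common scale.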

3. Cross-Encoder Reranking

The biggest quality win came from adding a reranking step:

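class Reranker:
    def __init__(self, cohere_client):
        self.cohere = cohere_client  # assumed wrapper: rerank(pairs) -> one score per pair

    async def rerank(self, query: str, chunks: list[Chunk], top_k: int = 3):
        pairs = [(query, c.text[:256]) for c in chunks]  # truncate to bound reranker cost
        scores = await self.cohere.rerank(pairs=pairs)
        # Sort by score, highest first, so the top_k cut keeps the best chunks.
        ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
        return [c for c, s in ranked if s > 0.3][:top_k]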

This single change improved relevance from 72% to 94%.

The Results

Metric            | Before | After  | Improvement
------------------|--------|--------|------------
P50 latency       | 4.2s   | 1.8s   | 57% faster
P95 latency       | 9.1s   | 3.4s   | 63% faster
Relevance score   | 72%    | 94%    | +22pp
Cost per query    | $0.042 | $0.018 | 57% cheaper
Citation accuracy | 68%    | 96%    | +28pp
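The percentages in the last column are relative reductions, (before - after) / before, while the "pp" rows are plain differences in percentage points. A short check of the arithmetic, using only the figures from the table:

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


rows = {
    "P50 latency (s)": (4.2, 1.8),
    "P95 latency (s)": (9.1, 3.4),
    "Cost per query ($)": (0.042, 0.018),
}
for name, (before, after) in rows.items():
    print(f"{name}: {reduction_pct(before, after):.0f}% lower")
# P50 latency (s): 57% lower
# P95 latency (s): 63% lower
# Cost per query ($): 57% lower
```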

Trade-offs and Lessons

What we sacrificed:

  • Architectural simplicity: 1 service became 5. More moving parts, more monitoring.
  • Latency on simple queries: the router and reranker add ~100ms of fixed overhead.
  • Infra cost: running the reranker adds compute cost, but the savings from using cheaper LLMs more than offset it.

What surprised us:

  • The reranker was the single highest-impact change. Not a better LLM, not a bigger index — just ordering results better.
  • The citation validator caught hallucinated citations in ~12% of generations, even with GPT-4.
  • Hybrid search (dense + sparse) significantly outperformed pure vector search for technical documentation with code snippets.
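The post does not describe how the citation validator works, so the following is only an illustrative sketch of the simplest possible check. It assumes the LLM cites chunks with 1-based bracket markers such as `[2]` and flags any marker that does not correspond to a retrieved chunk. Both function names are hypothetical.

```python
import re

# Assumed citation format: bracketed 1-based chunk numbers, e.g. "[2]".
CITATION = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited chunk numbers that do not match any retrieved chunk."""
    cited = {int(m) for m in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


def citation_accuracy(answer: str, num_chunks: int) -> float:
    """Share of distinct citations that point at a real chunk; 1.0 if nothing is cited."""
    cited = {int(m) for m in CITATION.findall(answer)}
    if not cited:
        return 1.0
    bad = len(invalid_citations(answer, num_chunks))
    return (len(cited) - bad) / len(cited)


answer = "Keys rotate every 90 days [1], and rotation is automatic [4]."
print(invalid_citations(answer, num_chunks=3))  # [4]
print(citation_accuracy(answer, num_chunks=3))  # 0.5
```

A production validator would likely also check that the cited chunk supports the claim, for example by comparing the sentence against the chunk text. This sketch only catches references to chunks that were never retrieved.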