July 2, 2026 · 8 min read

RAG for documentation: a practical guide for developers

RAG — retrieval-augmented generation — is the architecture behind most "chat with your docs" products. If you've ever used a documentation chatbot that gave you an accurate, cited answer instead of a hallucinated one, it was probably RAG. Here's how it actually works, what makes it hard, and when it's worth building vs. using a service.

The problem RAG solves

LLMs are trained on a fixed dataset. Your docs weren't in that dataset — or if they were, they were an older version, or they're too niche to have been memorized accurately. When someone asks Claude about your specific API, it either guesses or admits it doesn't know. Both responses lose trust.

RAG fixes this by giving the LLM access to the right content at query time. Instead of relying on training data, the system retrieves relevant chunks from your docs and passes them to the model as context, and the model generates an answer grounded in what's actually there. The content is as current as your index, not frozen at training time.

The pipeline in detail

1. Crawling and extraction

First you need the text. For docs sites, this usually means fetching pages and parsing the HTML with something like Cheerio, or rendering them in a headless browser with Puppeteer, then extracting the main content and discarding nav, footers, and scripts. The quality of this step matters a lot: noisy content produces noisy answers.

Key decisions: which pages to index, how deep to crawl, how to handle dynamic content (SPAs that render via JS require a headless browser), and how to handle auth-gated pages (a crawler usually can't reach them without credentials, so they often stay out of the index).
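To make the extraction step concrete, here is a minimal sketch in Python using only the standard library. It assumes the page has already been fetched as an HTML string, and the list of tags to skip is an assumption you would tune per site. A real pipeline would more likely use a dedicated parser, but the idea is the same: keep visible text, drop page chrome.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome rather than documentation.
# This skip list is an assumption; tune it for the site being crawled.
SKIP_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the text of a page with nav, footer, and script content removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = """
<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Auth</h1><p>Send the API key in the header.</p></main>
<footer>Copyright</footer>
<script>track()</script>
</body></html>
"""
print(extract_main_text(sample))
```

Only the heading and paragraph inside the main element survive. Pages that render their content via JS would need a headless browser to produce the HTML before a step like this can run.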

2. Chunking

You can't embed a whole page as one vector — the semantic meaning gets diluted and you lose precision. You split content into chunks, typically 200-500 tokens each, with overlap between adjacent chunks so context doesn't get cut at boundaries.
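A fixed-size chunker with overlap can be sketched in a few lines. This version counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; a production pipeline would count tokens with the embedding model's own tokenizer. The function name and default sizes are illustrative.

```python
def chunk_with_overlap(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_with_overlap(demo_text, chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts on the last word of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.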

The chunking strategy significantly affects retrieval quality. Splitting mid-sentence or mid-code block loses context. Some implementations chunk by heading structure (each section is a chunk) which tends to work better for docs.
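Heading-based chunking might look like the sketch below. It assumes the extracted content is Markdown-style text where headings start with "#", and it tracks code fences so that a comment line inside a code block is not mistaken for a heading. Both the input format and the output shape (a list of dicts) are assumptions for illustration.

```python
FENCE = "`" * 3  # Markdown code fence marker


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown-style text into one chunk per heading section."""
    chunks: list[dict] = []
    heading = ""            # heading of the section being collected
    body: list[str] = []    # lines collected for the current section
    in_fence = False        # True while inside a fenced code block

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and line.startswith("#"):
            flush()  # close the previous section before starting a new one
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Install\npip install x\n## Usage\n" + FENCE + "\n# a comment\nrun()\n" + FENCE + "\n"
sections = chunk_by_heading(doc)
for section in sections:
    print(section["heading"], "->", len(section["text"]), "chars")
```

Sections that come out much longer than the target chunk size would still need a second pass with a fixed-size splitter.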

3. Embedding

Each chunk gets converted to a vector, a high-dimensional numerical representation of its semantic meaning, using an embedding model. OpenAI's text-embedding-3-small is common, costs a small fraction of a cent per thousand tokens, and produces 1536-dimensional vectors by default.

Embedding your entire docs site might cost a few dollars. Re-embedding on updates costs the same. This is not the expensive part.
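A back-of-the-envelope estimate shows why. Every number in the sketch below is an illustrative assumption: the page count, the tokens per page, the chunk count, and especially the price, which you should replace with the current figure from your provider's pricing page. The storage estimate assumes 1536-dimensional vectors stored as 4-byte floats.

```python
def estimate_embedding_cost(pages: int, avg_tokens_per_page: int,
                            usd_per_million_tokens: float) -> float:
    """Cost in USD to embed every page once."""
    total_tokens = pages * avg_tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


def estimate_vector_storage_bytes(chunks: int, dimensions: int = 1536,
                                  bytes_per_float: int = 4) -> int:
    """Raw size of the stored vectors, excluding index overhead and metadata."""
    return chunks * dimensions * bytes_per_float


# Hypothetical docs site: 500 pages of about 1,500 tokens each.
# The price is a placeholder assumption, not a quoted rate.
cost = estimate_embedding_cost(pages=500, avg_tokens_per_page=1500,
                               usd_per_million_tokens=0.02)
storage = estimate_vector_storage_bytes(chunks=2500)
print(f"embedding cost: ${cost:.3f}")
print(f"vector storage: {storage / 1_000_000:.1f} MB")
```

Under these assumptions the full embedding run costs less than a coffee, and a complete re-embed on every update costs the same.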

4. Vector storage

Vectors go into a database that supports similarity search. The query at retrieval time is "find me the N vectors most similar to this query vector" — cosine similarity, usually. Options:

— Supabase pgvector: SQL database with a vector extension. Good if you're already on Postgres. Easy to combine with metadata filtering.

— Pinecone: purpose-built vector DB. Fast, managed, has a free tier.

— Weaviate / Qdrant / Chroma: open-source options, self-host or managed.
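Whichever store you pick, the query it runs is conceptually the one sketched below: score every stored vector against the query vector by cosine similarity and keep the top N. This brute-force version in plain Python uses toy 3-dimensional vectors and made-up chunk ids. Real vector databases use approximate indexes to avoid scanning everything, so treat this as an illustration of the math, not of how any of the products above work internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query: list[float], index: dict[str, list[float]], n: int = 3):
    """Return the n (score, chunk_id) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), chunk_id)
              for chunk_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:n]


# Toy index with hypothetical chunk ids and 3-dimensional vectors.
toy_index = {
    "auth": [1.0, 0.0, 0.0],
    "billing": [0.0, 1.0, 0.0],
    "auth-errors": [0.9, 0.1, 0.0],
}
results = top_n([1.0, 0.0, 0.0], toy_index, n=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```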

5. Query serving

At query time: embed the question → similarity search → retrieve top-K chunks → stuff chunks into LLM context → prompt LLM to answer the question using only the provided context → return answer with source citations.

The prompt matters. If you don't instruct the LLM to stay grounded in the context, it'll blend the retrieved content with training data and hallucinate details. Standard practice is to tell it explicitly: "Answer only using the following context. If the answer isn't in the context, say so."
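Assembling that prompt can be as simple as the sketch below. The chunk shape (a dict with "url" and "text" keys), the numbering scheme for citations, and the example URL are assumptions for illustration. The instruction text is the one quoted above, plus a line asking for citations.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that restricts the model to the retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['url']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer only using the following context. "
        "If the answer isn't in the context, say so. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunk; in a real pipeline this comes from the vector search.
retrieved = [
    {"url": "https://example.com/docs/auth",
     "text": "Send the API key in the Authorization header."},
]
prompt = build_grounded_prompt("How do I authenticate?", retrieved)
print(prompt)
```

Numbering the chunks gives the model something short to cite, and the application can map each number back to its source URL when rendering the answer.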

What makes it hard

Stale content: your docs change. Your vector index doesn't update itself. You need a re-indexing pipeline triggered on publish, or a scheduled job.
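One way to keep re-indexing cheap is to store a content hash per page and only re-embed what changed. The sketch below assumes you keep a mapping of URL to hash from the last indexing run; the function names and the page-level granularity are illustrative choices, and the same idea works per chunk.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes: dict[str, str],
                 current_pages: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare the last indexed state with a fresh crawl.

    Returns (urls to embed again, urls to delete from the index).
    """
    to_embed = sorted(
        url for url, text in current_pages.items()
        if indexed_hashes.get(url) != content_hash(text)  # new or changed page
    )
    to_delete = sorted(url for url in indexed_hashes if url not in current_pages)
    return to_embed, to_delete


# Hypothetical state: /a changed, /b is unchanged, /c is new, /gone was removed.
last_run = {
    "/a": content_hash("old a"),
    "/b": content_hash("b"),
    "/gone": content_hash("x"),
}
crawl = {"/a": "new a", "/b": "b", "/c": "c"}
to_embed, to_delete = plan_reindex(last_run, crawl)
print("embed:", to_embed)
print("delete:", to_delete)
```

The same function works whether it is triggered by a publish hook or run from a scheduled job.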

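Retrieval failures: sometimes the relevant content isn't retrieved because the query phrasing doesn't match the chunk embedding. Solutions: query expansion (searching with several rephrasings of the question), HyDE (Hypothetical Document Embeddings, where you embed an LLM-drafted answer instead of the raw question), and re-ranking the retrieved candidates with a second model. Each adds complexity.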
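As a rough illustration of query expansion, the sketch below runs several phrasings of the same question through a search function and keeps each chunk's best score. The search function is passed in as a parameter and faked here with canned results; in practice the variants would come from an LLM or a synonym list, and merging by maximum score is just one simple choice among several.

```python
from typing import Callable

# A search function takes a query string and returns (chunk_id, score) pairs.
SearchFn = Callable[[str], list[tuple[str, float]]]


def search_with_expansion(variants: list[str], search: SearchFn,
                          top_k: int = 3) -> list[tuple[str, float]]:
    """Search every query variant and keep each chunk's best score."""
    best: dict[str, float] = {}
    for query in variants:
        for chunk_id, score in search(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Canned results standing in for a real vector search (hypothetical ids and scores).
fake_results = {
    "reset password": [("account-recovery", 0.62)],
    "forgot login credentials": [("account-recovery", 0.81), ("sso-setup", 0.40)],
}


def fake_search(query: str) -> list[tuple[str, float]]:
    return fake_results.get(query, [])


merged = search_with_expansion(list(fake_results), fake_search, top_k=2)
print(merged)
```

The cost is one extra embedding call and one extra similarity search per variant, which is part of the added complexity mentioned above.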

Chunk boundary problems: the answer spans two chunks, but only one is retrieved. Overlap helps; it doesn't fully solve it.

Long-tail maintenance: crawlers break when sites restructure. Embeddings need to be regenerated when the embedding model is deprecated or swapped, because vectors from different models aren't comparable. None of this is hard individually, but it adds up.

When to build vs. use a service

Build it when: you're embedding RAG into a product feature (not just internal tooling), you have strict data residency requirements, or your retrieval needs are specialized enough that a generic solution won't work.

Use a service when: the goal is "AI agents can query our docs" and you'd rather not maintain a RAG pipeline. AgentReady handles the full pipeline — crawl, chunk, embed, index — and exposes the result via MCP so Claude Desktop, Cursor, and other MCP clients can query it without any code on your side.
