Saket's Blog

RAG Explained Properly: How AI Systems Use Your Data

2026-02-11
13 min read
AI · RAG · LLM · Embeddings · Vector Databases


One of the fastest ways to build a disappointing AI product is to assume the base model already knows everything that matters. It does not know your company docs, your pricing rules, your product edge cases, your internal policies, or the latest version of anything that changed after training. Yet this is exactly where many real-world AI applications need to operate.

That gap is why RAG became such a big deal.

Unfortunately, it is also why RAG gets explained badly. You will often hear some version of: "Just put your docs in a vector database and the model will answer from them." That description is not technically wrong, but it leaves out the part that matters most: RAG is only useful when retrieval is done well, and retrieval done well is a system design problem.

So let's explain it properly.

Why the Base Model Is Not Enough

A base LLM has broad knowledge from training data. That is useful for general language and common concepts, but it breaks down in practical scenarios:

  • Your support docs changed last week
  • Your legal policy is private
  • Your product catalog is not public
  • Your internal architecture is unique
  • Your user wants an answer grounded in your actual source material

You could try to paste all of that into a prompt, but that quickly runs into problems:

  • too much text
  • poor relevance
  • high cost
  • slow responses
  • inconsistent grounding

RAG exists because the model needs the right context at the right time, not all possible context all at once.

What RAG Actually Means

RAG stands for Retrieval-Augmented Generation.

The name sounds heavier than the idea.

It simply means:

  1. Retrieve relevant information from an external knowledge source
  2. Add that information to the prompt
  3. Let the model generate an answer using both the user query and the retrieved context

That is the core pattern.

The model still generates the final answer, but that answer is now grounded in retrieved information rather than in whatever the model happened to absorb during training.
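Those three steps fit in a few lines of code. This is a sketch, not real retrieval: `retrieve` scores by shared words as a crude stand-in for semantic search, and the final LLM call is left out, so the function simply returns the assembled prompt. All names here are illustrative, not from any specific library.

```python
# Illustrative retrieve -> augment -> generate pattern.
# A real system would send the assembled prompt to an LLM.

DOCS = [
    "Enterprise plan users can download audit history as CSV.",
    "Password reset is available from the account login page.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Toy scoring: rank documents by how many query words they share.
    q_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Step 1: retrieve; Step 2: add to the prompt; Step 3 (generation)
    # would happen when this prompt is sent to the model.
    context = "\n".join(retrieve(query, DOCS))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Can users download audit history?"))
```

Even at this toy scale, the shape is the same as in production: the model never sees the whole corpus, only the retrieved slice.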

A Good Mental Model

Here is the simplest mental model I know:

The model is the writer. The retrieval system is the librarian.

The writer is good at language. The librarian is good at finding the right material. If the librarian brings back irrelevant pages, outdated notes, or nothing useful at all, the writer cannot save the situation consistently. If the librarian returns the right context, the writer suddenly looks much smarter.

That is why people often over-credit the model and under-credit retrieval quality.

The Basic RAG Pipeline

A practical RAG system usually has two phases:

Phase 1: Indexing

This is where you prepare your data ahead of time.

Typical steps:

  1. Collect source documents
  2. Clean and normalize them
  3. Split them into chunks
  4. Turn those chunks into embeddings
  5. Store them in a searchable index, often a vector database
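The indexing steps above can be sketched end to end. The hashing "embedding" below is a toy stand-in for a real embedding model, the plain list stands in for a vector database, and the function names are mine, not a real library's.

```python
import zlib

def chunk(text: str, size: int = 40) -> list[str]:
    # Step 3: split into fixed-size word windows. Real systems often
    # split on headings, paragraphs, or sentences instead.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str, dims: int = 64) -> list[float]:
    # Step 4: toy bag-of-words vector via a stable hash. A real
    # embedding model captures meaning, not just word counts.
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dims] += 1.0
    return vec

# Steps 1, 2, 5: collect cleaned documents, then store one vector
# per chunk, keeping metadata (here, just the document ID).
documents = ["Password reset: open Settings, choose Security, click Reset."]
index = [
    {"doc_id": i, "text": c, "vector": embed(c)}
    for i, doc in enumerate(documents)
    for c in chunk(doc)
]
```

The key property to notice is that all of this happens before any user asks a question.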

Phase 2: Retrieval and Answering

This happens at query time.

Typical steps:

  1. User asks a question
  2. The question is embedded into vector form
  3. The system retrieves the most relevant chunks
  4. Those chunks are added to the prompt
  5. The LLM generates an answer grounded in that context

That is the architecture most people mean when they talk about RAG.
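Sketching the query-time steps with toy components keeps the moving parts visible. The word-count "embedding" and hand-rolled cosine similarity below are stand-ins for a real embedding model and a real vector index; none of the names come from an actual library.

```python
import math

chunks = [
    "Enterprise plan users can download audit history as CSV.",
    "Dark mode can be toggled in the display settings.",
]

# Toy embedding: count words against a fixed vocabulary built from
# the indexed chunks. A real model would capture meaning directly.
vocab = sorted({w for c in chunks for w in c.lower().split()})

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(v)) for v in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

index = [{"text": c, "vector": embed(c)} for c in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Steps 2 and 3: embed the question, rank chunks by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return [e["text"] for e in ranked[:k]]

# Step 4: the top chunk would be placed into the prompt.
top = retrieve("can users export audit logs?")
```

Swapping the toy `embed` for a real embedding model and the list for a vector database changes the quality, not the flow.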

What Are Embeddings?

An embedding is a numerical representation of text that captures semantic meaning.

You do not need to stare at the vector math to understand the idea. Just think of it this way:

An embedding converts text into coordinates in a high-dimensional space where similar meanings tend to land closer together.

So these texts may end up near one another:

  • "How do I reset my password?"
  • "I forgot my login password"
  • "Password recovery steps"

Even though the wording differs, the meaning is related.

That is why embeddings are so useful for retrieval. They help the system find chunks based on semantic similarity, not just exact keyword matches.
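The geometry can be demonstrated with a deliberately crude embedding. A real embedding model places paraphrases near each other even when they share no words; the word-overlap toy below only illustrates the mechanics (texts become vectors, similarity becomes a number), so treat it as a sketch of the idea, not of real semantic search.

```python
import math
import string

def tokenize(text: str) -> list[str]:
    # Lowercase and strip punctuation so "password?" matches "password".
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

texts = [
    "How do I reset my password?",
    "I forgot my login password",
    "What is the weather today?",
]
vocab = sorted({w for t in texts for w in tokenize(t)})

def embed(text: str) -> list[float]:
    words = tokenize(text)
    return [float(words.count(v)) for v in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

related = cosine(embed(texts[0]), embed(texts[1]))    # password questions
unrelated = cosine(embed(texts[0]), embed(texts[2]))  # weather question
```

The related pair scores higher than the unrelated pair, which is exactly the property retrieval relies on.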

Why Chunking Matters So Much

Chunking is the process of splitting documents into smaller pieces before embedding and storing them.

This sounds like a boring implementation detail. It is not. Chunking quality often determines whether your RAG system is helpful or frustrating.

If chunks are too large:

  • retrieval becomes noisy
  • irrelevant text gets pulled in
  • important details are buried
  • prompt cost grows

If chunks are too small:

  • you lose context
  • meaning gets fragmented
  • answers become incomplete

A good chunk should be large enough to preserve meaning and small enough to stay focused.

For example, imagine a product manual with a section explaining password reset. You usually want the chunk to contain the full steps and surrounding context, not half a sentence before the steps and the other half in a separate chunk.

This is also why chunk overlap often helps. A little overlap preserves continuity across boundaries.
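A minimal sketch of chunking with overlap, assuming word-based windows (production systems often split on sentences or headings instead, and the sizes here are arbitrary):

```python
def chunk_with_overlap(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    # Slide a window of `size` words forward by `size - overlap` words,
    # so each chunk repeats the last `overlap` words of the previous one.
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.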

What a Vector Database Actually Does

A vector database stores embeddings and lets you search for nearby vectors efficiently.

That is the simple version.

In practice, it usually also stores metadata such as:

  • source document ID
  • title
  • URL
  • section name
  • timestamps
  • permissions or tenant data

When a user asks a question, the system embeds that query and searches for chunks whose embeddings are closest in meaning.

Popular tools vary, but the concept is stable:

  • store vectors
  • search by similarity
  • return candidate chunks fast enough for a live system

You do not always need a specialized managed vector database for a small prototype. But you do need a retrieval mechanism that can reliably return relevant chunks at the right speed and scale.
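For a prototype, the whole "vector database" can be an in-memory list where each entry carries metadata alongside its content, and search filters on metadata before ranking. The scoring below is crude word overlap and every field name is illustrative; real stores apply the same filter-then-rank idea over actual vectors.

```python
# Toy in-memory store: each entry pairs content with metadata.
store = [
    {"text": "Acme audit logs export via the Admin page", "tenant": "acme"},
    {"text": "Globex audit settings are managed by IT", "tenant": "globex"},
]

def search(query: str, tenant: str, k: int = 1) -> list[str]:
    # Metadata filter first (e.g., permissions or tenant isolation),
    # then rank the remaining candidates by similarity to the query.
    q = set(query.lower().split())
    candidates = [e for e in store if e["tenant"] == tenant]
    ranked = sorted(
        candidates,
        key=lambda e: len(q & set(e["text"].lower().split())),
        reverse=True,
    )
    return [e["text"] for e in ranked[:k]]
```

Filtering before similarity search matters in practice: a chunk the user is not allowed to see should never even be a candidate.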

What Retrieval Looks Like in Practice

Imagine a user asks:

Can enterprise users export audit logs?

A decent RAG system does not search for only the exact words "export audit logs." It embeds the question, searches semantically, and may retrieve chunks like:

  • "Enterprise plan users can download audit history as CSV"
  • "Audit trail export is available on the Admin settings page"
  • "Log retention differs by plan tier"

Those chunks are then passed into the prompt so the model can answer in a grounded way:

Yes. Enterprise users can export audit history as a CSV from the Admin settings page. Log retention limits may vary by plan tier.

The answer is generated text, but the substance comes from retrieval.
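One way to assemble such a grounded prompt is to pair each retrieved chunk with its source and to tell the model to stay within the context. The template, the filenames, and the "say you do not know" instruction are all illustrative choices, not a fixed standard.

```python
def assemble_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # `chunks` pairs each passage with a source label so the answer
    # can cite where its claims came from.
    context = "\n".join(f"[{src}] {text}" for src, text in chunks)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

retrieved = [
    ("plans.md", "Enterprise plan users can download audit history as CSV"),
    ("admin.md", "Audit trail export is available on the Admin settings page"),
]
prompt = assemble_prompt("Can enterprise users export audit logs?", retrieved)
```

The explicit fallback instruction is what lets a grounded system decline to answer instead of inventing something when retrieval comes back empty.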

What RAG Is Good For

RAG shines when the knowledge needed is:

  • private
  • frequently updated
  • too large to paste manually
  • too specific to rely on model pretraining

Common use cases:

  • internal knowledge assistants
  • customer support systems
  • documentation search
  • contract or policy Q&A
  • research assistants over curated sources
  • product copilots grounded in product docs

In all of these, the value comes from grounding the model in the right source material.

Common RAG Mistakes

This is where many "RAG does not work" stories come from.

Mistake 1: Treating All Documents as Equal

A pricing page, a stale wiki page, and a debugging transcript should not necessarily be retrieved with the same weight or trust.

Metadata matters. Source quality matters. Freshness matters.

Mistake 2: Bad Chunking

This is probably the most common implementation problem. If your chunks are messy, retrieval quality will suffer no matter how good the model is.

Mistake 3: Retrieving Too Much

People often assume more context means better answers. Usually it means more noise. The goal is not to dump the library into the prompt. The goal is to retrieve the most relevant evidence.

Mistake 4: Skipping Retrieval Evaluation

If the system answers poorly, teams often blame the model first. But sometimes the real issue is that the retrieval step brought back weak context. You need to inspect what was retrieved, not just the final answer.

Mistake 5: Expecting RAG to Fix Everything

RAG helps with grounding. It does not automatically solve:

  • reasoning errors
  • ambiguous questions
  • permissions problems
  • weak source documents
  • bad prompt design

RAG is important. It is not a miracle patch.

A Simple Architecture View

At a system level, a basic RAG application might look like this:

Documents -> chunking -> embeddings -> vector index

User question -> query embedding -> retrieval -> prompt assembly -> LLM answer

That is the backbone.

Later, teams often add:

  • reranking
  • metadata filters
  • access control
  • citation display
  • caching
  • evaluations
  • feedback loops

But the basic flow stays the same.

RAG vs Fine-Tuning

Beginners often mix these up, so let's separate them cleanly.

RAG

Use external data at request time.

Good for:

  • changing information
  • private docs
  • traceable sources
  • reducing hallucinations over known content

Fine-Tuning

Adjust model behavior using additional training.

Good for:

  • style consistency
  • domain-specific patterns
  • structured task behavior
  • format adherence in some settings

Fine-tuning is not the main answer when your problem is "the model needs access to our current internal documents." That is a retrieval problem first.

What Good RAG Feels Like

A good RAG system usually feels boring in the best way.

It does not feel like wild intelligence. It feels accurate, grounded, and useful. It finds the right information quickly, answers in context, and avoids pretending when the source material is missing.

Signs of a good system:

  • relevant chunks are consistently retrieved
  • answers stay close to source material
  • the system can say "I do not know" when the documents do not support an answer
  • users can trace where the answer came from

That kind of boring reliability is what makes people trust AI systems in production.

Where RAG Fits in the Learning Journey

If you have already built a simple API app, RAG is the next pattern worth learning because it introduces the system thinking that modern AI products actually need.

It teaches you that:

  • the model is only part of the system
  • retrieval quality often matters more than prompt cleverness
  • external data changes product behavior dramatically
  • architecture decisions shape AI usefulness more than branding does

This is one of the biggest transitions from demo-building to serious AI engineering.

Closing Thoughts

RAG matters because most useful AI systems do not live in a vacuum. They live inside products, teams, documents, workflows, and changing business reality. A base model gives you general capability. Retrieval gives that capability something real to stand on.

In the next post, we will go one step further and look at agents, tool use, and MCP. That is where the question changes from "How does the model answer with my data?" to "How does the system decide, act, call tools, and move through a workflow?"

Next in the series: AI Agents, Tool Use, and MCP: What Actually Matters.