Lesson 36 • Advanced

Building RAG Systems

Learn how to ground a language model in your own documents — so it answers from real, retrievable facts instead of guessing or hallucinating.

What You'll Learn in This Lesson

✓ Why RAG exists: grounding, fewer hallucinations, fresh and private data
✓ The 6-step pipeline: chunk → embed → store → retrieve → augment → generate
✓ How embeddings turn meaning into vectors you can compare
✓ How cosine similarity ranks chunks and picks the top-k
✓ Chunking strategies and the trade-offs between them
✓ How to evaluate a RAG system and spot why it goes wrong

Before you start: Make sure you've completed Advanced NLP and are comfortable with Python lists, loops, and functions.

📚 Real-World Analogy: The Open-Book Exam

A language model on its own is like a student taking a closed-book exam. It answers from memory — often brilliantly, but sometimes it confidently writes down something that simply isn't true. That confident-but-wrong answer is what we call a hallucination.

RAG turns it into an open-book exam. Before the student answers, you hand them the exact pages that contain the answer. They read those pages, then write a response grounded in real text. The same question, but now the answer is backed by a source you can point to.

Your job in RAG is to be the helpful librarian: given a question, find the right pages fast and slide them across the desk. The rest of this lesson is about how to do that finding well.

1. Why RAG? The Problems It Solves

A language model only knows what it saw during training. That creates three real problems, and RAG addresses all three by feeding the model trusted text at question time.

Grounding & fewer hallucinations

The model answers from text you supplied, so it is far less likely to invent facts and can cite where each claim came from.

Fresh data

Training data is frozen in the past. RAG lets the model use today's documents: index a new file and it becomes searchable right away.

Private data

Your internal handbook, support tickets, or contracts were never in the training set. RAG reads them without retraining anything.

Key idea: RAG doesn't change the model's brain — it changes what the model gets to read right before it answers. That's why it needs no training run and updates instantly.

2. The RAG Pipeline — 6 Steps

Every RAG system is the same six steps. The first three happen once (when you load your documents); the last three happen on every question.

  Setup (once, when documents arrive)
  ┌────────────────────────────────────────────────────┐
  │ 1. CHUNK     split documents into bite-size pieces │
  │ 2. EMBED     turn each chunk into a vector         │
  │ 3. STORE     save the vectors in a vector database │
  └────────────────────────────────────────────────────┘

  Query time (every question)
  ┌────────────────────────────────────────────────────┐
  │ 4. RETRIEVE  embed the question, find top-k chunks │
  │ 5. AUGMENT   paste those chunks into the prompt    │
  │ 6. GENERATE  the model answers from that context   │
  └────────────────────────────────────────────────────┘

The clever part is step 4 — measuring which chunks are most similar in meaning to the question. The worked example below builds that whole loop by hand so you can see exactly what happens.
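As a rough map of the six steps, here is a minimal sketch you can read top to bottom. It assumes a toy `embed` function that just counts words (a stand-in for a real embedding model, so it only matches shared words, not meaning), a plain Python list in place of a vector database, and it stops at printing the prompt rather than calling a model. The names `embed`, `cosine`, and `store` are illustrative choices, not part of any library.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (an assumption, not a real model): lowercase word counts."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    len_a = math.sqrt(sum(c * c for c in a.values()))
    len_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (len_a * len_b) if len_a and len_b else 0.0

document = (
    "Python was created by Guido van Rossum. "
    "The Eiffel Tower stands in Paris. "
    "Photosynthesis turns sunlight into chemical energy."
)

# 1. CHUNK: one sentence per chunk.
chunks = [s.strip() + "." for s in document.split(".") if s.strip()]

# 2. EMBED and 3. STORE: a plain list stands in for the vector database.
store = [(embed(chunk), chunk) for chunk in chunks]

# 4. RETRIEVE: embed the question and keep the most similar chunk.
question = "Who created Python?"
q_vec = embed(question)
best_score, best_chunk = max(
    ((cosine(q_vec, vec), chunk) for vec, chunk in store),
    key=lambda pair: pair[0],
)

# 5. AUGMENT: paste the retrieved chunk into the prompt.
prompt = (
    "Answer using ONLY the context.\n\n"
    "Context:\n" + best_chunk + "\n\nQuestion: " + question
)

# 6. GENERATE: a real system would send `prompt` to a language model here.
print(prompt)
```

Swapping the toy `embed` for a trained model is what turns this word matcher into semantic search; the rest of the loop keeps the same shape.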

🔧 Worked Example: The Whole Loop by Hand

Run this. It builds a three-fact knowledge base, scores the question against each fact with cosine similarity, retrieves the best one, and stitches together the final prompt. Every line is commented and the expected output is at the bottom.

Try It: The RAG Loop

Retrieve the best fact and build a grounded prompt

Python

# A complete RAG loop by hand — no libraries, just lists and maths.
# RAG = Retrieval-Augmented Generation: fetch the right facts, then answer.

import math

# 1) Tiny "knowledge base" — each chunk is a short fact we trust.
chunks = [
    "Python was created by Guido van Rossum and released in 1991.",
    "The Eiffel Tower is 330 metres tall and stands in Paris.",
    "Photosynthesis lets plants turn sunlight into chemical energy.",
]

# 2) Toy "embeddings" — real systems use a model to make the
...

3. Embeddings — Turning Meaning Into Numbers

An embedding is a list of numbers — a vector — that captures the meaning of a piece of text. An embedding model is trained so that texts with similar meaning get vectors pointing in similar directions.

"Who created Python?" and "Python was made by Guido van Rossum" end up close together, even though they share almost no words. That's the magic: embeddings compare meaning, not spelling. This is also called semantic search — searching by sense rather than by keyword.

🧭 How do we measure "close"?

With cosine similarity. It measures the angle between two vectors and returns a score from -1 (opposite) to 1 (identical direction). A score near 1 means "very similar meaning". You already used it in the worked example — here's the formula it implements:

cosine(a, b) = dot(a, b) / (length(a) * length(b))

# dot(a, b)   = sum of a[i] * b[i]
# length(v)   = sqrt(sum of v[i] * v[i])
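
The formula translates almost word for word into Python. This is a minimal sketch using only the standard library; the function name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made here for illustration.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (length(a) * length(b)) for two equal-length lists."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # convention chosen here: a zero vector has no direction
    return dot / (length_a * length_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # perpendicular  -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite       -> -1.0
```

Notice that `[1.0, 0.0]` and `[2.0, 0.0]` score exactly 1.0: cosine similarity looks only at direction, never at length.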

In real projects you never hand-write vectors. A model like all-MiniLM-L6-v2 produces a 384-number vector for any text in milliseconds. The example below shows the real call.

🧠 Real Embeddings with sentence-transformers

This is how embeddings look in production — read it to see the shape of the real API. (It needs the sentence-transformers package installed locally, so study it here rather than running it in the browser.)

# In the real world you don't hand-write vectors — a model makes them.
# sentence-transformers turns any text into a meaningful embedding.

from sentence_transformers import SentenceTransformer, util

# A small, fast, popular embedding model (384 numbers per text).
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Python was created by Guido van Rossum.",
    "Cats are small domesticated carnivorous mammals.",
    "The Pacific is the largest and deepest ocean.",
]

# embed_documents-style call: one vector per chunk.
chunk_embeddings = model.encode(chunks)

query = "Who invented the Python language?"
query_embedding = model.encode(query)

# cos_sim returns a similarity score for the query vs every chunk.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
best = int(scores.argmax())

print("Best match:", chunks[best])

# Expected output:
# Best match: Python was created by Guido van Rossum.

Notice the pattern is identical to your hand-written loop: embed the chunks, embed the query, compare with cosine similarity, take the best. The only difference is a trained model makes the vectors instead of you.

4. Retrieval — Picking the Top-K Chunks

One chunk is rarely enough. Most questions need facts spread across several chunks, so you keep the top-k — the k highest-scoring chunks, often the top 3 to 5. You score every chunk, sort highest-first, and slice off the best few.

Choosing k is a balance: too few and you miss relevant facts; too many and you bury the answer in noise and fill up the model's limited context window. Start around 3-5 and tune it. Run the worked example to see top-k selection in action.
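
If you already have one score per chunk, the "sort and slice" step can also be written with `heapq.nlargest` from the standard library. The sketch below assumes made-up scores and placeholder chunk names; `top_k` is a hypothetical helper, not a library function.

```python
import heapq

def top_k(scores, chunks, k):
    """Return the k highest-scoring (score, chunk) pairs, best first."""
    return heapq.nlargest(k, zip(scores, chunks), key=lambda pair: pair[0])

chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
scores = [0.42, 0.91, 0.07, 0.88]  # made-up similarity scores

for score, chunk in top_k(scores, chunks, k=2):
    print(f"[{score:.2f}] {chunk}")

# Output:
# [0.91] chunk B
# [0.88] chunk D
```

Asking for more chunks than exist is safe: `nlargest` simply returns everything it has, ranked.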

Try It: Top-K Retrieval

Score every chunk and keep the best few

Python

# Retrieve the TOP-K most relevant chunks for a query.
# Real RAG rarely uses just one chunk — it stuffs the best few into the prompt.

import math

chunks = [
    "Python is a popular language for data science and AI.",
    "Pandas is a Python library for working with tables of data.",
    "The Great Barrier Reef is the world's largest coral reef system.",
    "NumPy gives Python fast arrays and maths for machine learning.",
    "Mount Everest is the highest mountain above sea level.",
]

# Toy
...

🎯 Your Turn: Finish the Retrieval Step

Fill in the three blanks: complete the cosine-similarity formula, score every chunk, and find the index of the best one. The expected output is in the comments so you can check yourself.

Your Turn: Retrieve the Best Chunk

Complete cosine similarity and pick the top match

Python

# 🎯 YOUR TURN — finish the retrieval step. Fill in each ___.
import math

chunks = [
    "Dogs are loyal companion animals kept as pets.",
    "Python is a programming language used to build software.",
    "Coffee is a popular caffeinated drink made from beans.",
]
vectors = [
    [0.0, 0.9],   # animals
    [0.9, 0.1],   # programming
    [0.1, 0.0],   # drinks (short vector; cosine ignores length, so direction is what counts)
]

query = "What language do I use to write programs?"
query_vector = [0.85, 0.15]   # leans
...

5. Chunking Strategies — The Quality Lever

Chunking is splitting documents into pieces small enough to embed and retrieve. It is often the biggest lever on RAG quality: chunk badly and even a perfect model retrieves the wrong text.

Strategy	How it splits	Trade-off
Fixed-size	Every N characters or tokens	Simple and even, but can cut mid-sentence
Sentence	On sentence boundaries	Keeps meaning intact; chunk sizes vary
Paragraph	On blank lines / paragraphs	Full context per chunk; can be too big
Recursive	Tries paragraphs, then sentences, then words	Respects structure; the common default
Semantic	Where the topic shifts (by embedding)	Best retrieval; needs extra computation
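
To make the "Sentence" row concrete, here is a minimal sketch that splits on sentence boundaries and packs whole sentences into chunks up to a word limit. It assumes a simple regular expression is good enough to find sentence ends and uses word counts as a rough stand-in for tokens; `sentence_chunks` and `max_words` are names invented for this example.

```python
import re

def sentence_chunks(text, max_words=12):
    """Pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if this sentence would push us over the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "RAG retrieves text before answering. It needs no retraining. "
    "Chunks are embedded once. Questions are embedded at query time."
)
for chunk in sentence_chunks(text, max_words=10):
    print(chunk)

# Output:
# RAG retrieves text before answering. It needs no retraining.
# Chunks are embedded once. Questions are embedded at query time.
```

No sentence is ever cut in half, which is the trade-off the table describes: meaning stays intact, but chunk sizes vary.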

💡 Sensible defaults to start with

• Chunk size: 200-500 tokens — big enough to hold a full idea, small enough to stay focused.
• Overlap: 10-20% — repeat a little text between chunks so a fact split across a boundary still survives.
• Keep metadata: store the source file, page, and section with each chunk so you can cite it later.
• Test on real questions — the only way to know your chunking works is to retrieve against actual queries.
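
The size and overlap defaults above can be sketched as a sliding window. This example assumes words stand in for tokens (a real system would count tokens with the model's tokenizer), and `fixed_size_chunks` is a hypothetical helper written for this lesson.

```python
def fixed_size_chunks(words, size, overlap):
    """Slide a window of `size` words, repeating `overlap` words each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks

words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, size=4, overlap=1):
    print(" ".join(chunk))

# Output:
# one two three four
# four five six seven
# seven eight nine ten
```

The repeated words ("four", "seven") are the overlap at work: a fact that straddles a boundary still appears whole in at least one chunk.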

6. Augment & Generate — The Production Stack

Once you've retrieved the top-k chunks, you augment the prompt (paste the chunks in as context) and ask the model to generate an answer grounded in them. The instruction "answer using only the context" is what helps keep it honest.

You don't build all six steps from scratch in real life. Libraries like LangChain and LlamaIndex handle chunking, embedding, and storage; a vector database like Chroma, FAISS, or Pinecone stores and searches the vectors fast. Read the pipeline below — it's the same six steps, just wired together with real tools and a Claude call for the final answer.

# A production-shaped RAG pipeline: chunk -> embed -> store -> retrieve -> answer.
# LangChain wires the pieces together so you write very little glue code.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from anthropic import Anthropic

# 1) CHUNK: load a document and split it into overlapping pieces.
# Note: chunk_size and chunk_overlap are counted in characters by default, not tokens.
docs = TextLoader("handbook.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)
chunks = splitter.split_documents(docs)

# 2) EMBED + STORE: turn chunks into vectors and index them in Chroma.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = Chroma.from_documents(chunks, embeddings)

# 3) RETRIEVE: pull the top-k chunks most similar to the question.
question = "How many holiday days do new employees get?"
hits = store.similarity_search(question, k=3)
context = "\n\n".join(hit.page_content for hit in hits)

# 4) AUGMENT + GENERATE: ground the model in the retrieved context.
client = Anthropic()   # reads ANTHROPIC_API_KEY from the environment
prompt = (
    "Answer using ONLY the context. If the answer is not there, say you "
    "don't know.\n\nContext:\n" + context + "\n\nQuestion: " + question
)
reply = client.messages.create(
    model="claude-opus-4-8",   # substitute a model ID available to your account
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(reply.content[0].text)

# Example output (the exact wording depends on what handbook.txt contains):
# New employees receive 25 holiday days per year, plus public holidays.
# (Grounded in the retrieved handbook chunks, not invented by the model.)

Pro tip: add a reranker (a cross-encoder, or a hosted reranking service) after the initial retrieval. It re-scores your top-k chunks more carefully and often improves answer quality for very little extra code; measure the gain on your own questions, since it varies by dataset.
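
The shape of a reranking step can be sketched without any model at all. The example below assumes a toy scorer, `overlap_score`, that counts shared words; in a real system you would pass in a cross-encoder or a hosted reranker as `score_fn` instead. Both function names and the candidate texts are made up for illustration.

```python
def overlap_score(question, chunk):
    """Toy relevance score: share of question words that appear in the chunk."""
    q_words = {w.strip(".,?!").lower() for w in question.split()}
    c_words = {w.strip(".,?!").lower() for w in chunk.split()}
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

def rerank(question, candidates, score_fn, top_n):
    """Re-score every candidate against the question and keep the best top_n."""
    scored = [(score_fn(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

# Pretend these came back from the first-pass vector search, in that order.
candidates = [
    "Office plants are watered every Monday.",
    "New employees get 25 holiday days per year.",
    "Holiday requests go through the HR portal.",
]
question = "How many holiday days do new employees get?"

for chunk in rerank(question, candidates, overlap_score, top_n=2):
    print(chunk)

# Output:
# New employees get 25 holiday days per year.
# Holiday requests go through the HR portal.
```

The pattern is retrieve wide, then rerank narrow: fetch more candidates than you need and let the slower, more careful scorer decide which few reach the prompt.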

7. Evaluating a RAG System

RAG can fail in two different places, so you measure two different things. Knowing which one is broken tells you what to fix.

Retrieval quality

Did the right chunks come back? Measure context recall (did we fetch the chunk that holds the answer?) and context precision (how much of what we fetched was actually relevant?).

Answer quality

Given the right chunks, was the answer good? Measure faithfulness (does the answer stick to the context, or did it drift?) and answer relevance (did it actually address the question?).

If retrieval is poor, improve chunking, the embedding model, or k. If retrieval is good but answers are bad, tighten the prompt and instruct the model to refuse when the context doesn't contain the answer. Tools like RAGAS automate these scores so you can track them as you change the pipeline.
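
The two retrieval metrics can be computed by hand on a small labelled set. This sketch uses simplified set-based definitions (tools like RAGAS use their own, more involved definitions), and the chunk IDs and labels in `eval_set` are invented for the example.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)

def context_precision(retrieved_ids, relevant_ids):
    """Share of the retrieved chunks that were actually relevant."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant_ids)) / len(retrieved)

# Hypothetical hand-labelled evaluation set: chunk IDs per question.
eval_set = [
    {"question": "q1", "retrieved": [3, 7, 9], "relevant": [3, 7]},
    {"question": "q2", "retrieved": [1, 4, 5], "relevant": [2]},
]

for item in eval_set:
    r = context_recall(item["retrieved"], item["relevant"])
    p = context_precision(item["retrieved"], item["relevant"])
    print(f'{item["question"]}: recall={r:.2f} precision={p:.2f}')

# Output:
# q1: recall=1.00 precision=0.67
# q2: recall=0.00 precision=0.00
```

A result like q2 points at retrieval, not the prompt: the chunk holding the answer never came back, so no amount of prompt tuning would help.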

🎯 Your Turn: Top-K and Prompt Building

Now put the back half of the pipeline together: sort the scored chunks, keep the top 2, and build a grounded prompt. Fill in each ___ and check against the expected output.

Your Turn: Build the Augmented Prompt

Sort, slice top-k, and assemble the context

Python

# 🎯 YOUR TURN — pick the top-k chunks and build the prompt. Fill in each ___.
chunks = [
    "Mars is the fourth planet from the Sun.",
    "Jupiter is the largest planet in the Solar System.",
    "A banana is a long curved yellow fruit.",
]
# Pretend we already scored each chunk against the query "Tell me about planets".
scores = [0.91, 0.88, 0.12]

# Pair each score with its chunk index, then sort highest-first.
scored = []
for i, score in enumerate(scores):
    scored.append((score, i))
...

🎯 Mini-Challenge: A Tiny Semantic Search Engine

Support is fading now — you get the data and the brief, but the logic is yours to write. Build the whole retrieval step from scratch: cosine similarity, score every chunk, sort, and print the top 2. The expected output is in the comments.

Mini-Challenge: Semantic Search

Write the retrieval logic yourself

Python

# 🎯 MINI-CHALLENGE: Build a tiny semantic search engine.
#
# You are given chunks and matching toy vectors plus a query vector.
# 1. Write cosine_similarity(a, b) using math.sqrt and zip.
# 2. Score every chunk vector against the query vector.
# 3. Sort the (score, index) pairs highest-first.
# 4. Print the top 2 chunks with their scores rounded to 3 places.
#
# ✅ Expected (top 2):
#   [0.997] Neural networks are layers of connected artificial neurons.
#   [0.949] Deep learning trains very larg
...

Common Errors (And How to Fix Them)

These four mistakes are behind many broken RAG systems. Learn to recognise them.

❌ Bad chunking

Chunks too small lose context; too large dilute the answer and overflow the context window.

# ❌ 20-character chunks (chunk_size counts characters by default): "He founded it in 2008" with no idea who "He" is
splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=0)

✅ Fix: use a sensible size and a little overlap:

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)

❌ Irrelevant retrieval

The model hallucinates because the wrong chunks came back — it never saw the answer.

# ❌ Only fetching 1 chunk — one near-miss and the answer is gone
hits = store.similarity_search(question, k=1)

✅ Fix: retrieve more candidates (and consider a reranker):

hits = store.similarity_search(question, k=5)

❌ Context overflow

Stuffing too many chunks in blows past the model's context window — the request errors or silently drops text.

# ❌ Pasting 50 huge chunks into one prompt
context = "\n".join(hit.page_content for hit in store.similarity_search(q, k=50))

✅ Fix: keep k small and chunks reasonably sized; rerank to top 3-5:

hits = store.similarity_search(q, k=5)   # a few focused chunks beat fifty
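
Another way to guard against overflow is to cap the context by size rather than by count. The sketch below keeps chunks in rank order until a budget is used up. It assumes word count is an acceptable stand-in for a real tokenizer, and `fit_to_budget` is a hypothetical helper, not part of LangChain.

```python
def fit_to_budget(ranked_chunks, max_tokens):
    """Keep chunks in rank order until the rough token budget is used up."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude stand-in for a real token count
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept

# Already sorted best-first by the retriever (made-up example text).
ranked = [
    "New employees get 25 holiday days per year.",  # 8 words
    "Holiday requests go through the HR portal.",   # 7 words
    "Office plants are watered every Monday.",      # 6 words
]

print(len(fit_to_budget(ranked, max_tokens=16)))  # prints 2
```

Because the chunks arrive best-first, whatever gets dropped is always the least relevant text.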

❌ No citations / no grounding instruction

Without telling the model to use only the context, it falls back on memory and you can't trace claims.

# ❌ No grounding rule — the model answers from memory
prompt = context + "\n" + question

✅ Fix: instruct it to stay in the context and admit when it can't:

prompt = (
    "Answer using ONLY the context below. If it isn't there, say "
    "'I don't know'. Cite the source line you used.\n\n"
    "Context:\n" + context + "\n\nQuestion: " + question
)
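
Asking for citations works best when each chunk in the context carries a label the model can point to. Here is a minimal sketch that numbers the chunks and attaches their source metadata. The `hits` shape (dicts with `text` and `source` keys), the page numbers, and the `build_prompt` helper are assumptions for illustration, not the LangChain document format.

```python
def build_prompt(question, hits):
    """Number each chunk and show its source so the model can cite it."""
    context_lines = [
        f"[{i}] ({hit['source']}) {hit['text']}"
        for i, hit in enumerate(hits, start=1)
    ]
    return (
        "Answer using ONLY the context below. If it isn't there, say "
        "'I don't know'. Cite the source number you used, like [1].\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\nQuestion: " + question
    )

# Made-up retrieved chunks with the metadata stored alongside them.
hits = [
    {"text": "New employees get 25 holiday days per year.", "source": "handbook.txt, p. 4"},
    {"text": "Holiday requests go through the HR portal.", "source": "handbook.txt, p. 5"},
]

prompt = build_prompt("How many holiday days do new employees get?", hits)
print(prompt)
```

This is where the metadata you stored at chunking time pays off: a citation like [1] can be traced back to a file and page.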

📋 Quick Reference

Term	What It Means	Typical Choice
Chunk	A small slice of a document, ready to embed	`200-500 tokens, 10-20% overlap`
Embedding	A vector capturing a text's meaning	`all-MiniLM-L6-v2 (fast)`
Cosine similarity	Score of how aligned two vectors are (−1 to 1)	`Higher = more relevant`
Top-k	The k highest-scoring chunks for a query	`k = 3 to 5`
Vector DB	Stores vectors and searches them fast	`Chroma, FAISS, Pinecone`
Reranker	Re-scores top-k chunks more precisely	`Cross-encoder or hosted service`
Augment	Paste retrieved chunks into the prompt	`"Answer using only the context"`

❓ Frequently Asked Questions

Q: What is Retrieval-Augmented Generation (RAG)?

A: RAG is a pattern that fetches relevant text from your own documents and pastes it into the prompt before the language model answers. The model reads that retrieved context and grounds its answer in it, instead of relying only on what it memorised during training.

Q: Why use RAG instead of fine-tuning the model?

A: RAG needs no training run — you just index documents, and you can add, edit, or delete them instantly. It also lets the model cite its sources and work with fresh or private data the model has never seen. Fine-tuning bakes knowledge in permanently and is slow and costly to update.

Q: What is an embedding?

A: An embedding is a list of numbers (a vector) that captures the meaning of a piece of text. Texts with similar meaning get vectors that point in similar directions, so you can measure how related two pieces of text are with cosine similarity.

Q: How big should my chunks be?

A: A common starting point is 200 to 500 tokens per chunk with 10 to 20 percent overlap. Chunks that are too small lose context; chunks that are too large dilute the relevant part and waste the model's context window. Always test retrieval quality on real questions.

Q: What does 'top-k' mean in RAG?

A: Top-k means you keep the k most similar chunks for a query — for example, the top 3. You rank every chunk by its similarity score to the question and pass only the highest-scoring few into the prompt.

Q: Why does my RAG system still hallucinate?

A: Usually the retrieval step returned the wrong chunks, so the model never saw the right facts. Check your chunking, try a better embedding model, retrieve more chunks, add a reranker, and instruct the model to answer only from the provided context and to say 'I don't know' when the context is missing.

🎉

Lesson complete — you can build knowledge-grounded AI!

You now understand why RAG exists, the six-step pipeline, how embeddings and cosine similarity power retrieval, how chunking decides quality, and how to evaluate the whole thing. You even built the retrieval loop by hand.

🚀 Up next: Vector Databases — the engine that stores millions of embeddings and finds the top-k in milliseconds.
