    Lesson 36 • Advanced

    Building RAG Systems

    Learn how to ground a language model in your own documents — so it answers from real, retrievable facts instead of guessing or hallucinating.

    What You'll Learn in This Lesson

    • Why RAG exists: grounding, fewer hallucinations, fresh and private data
    • The 6-step pipeline: chunk → embed → store → retrieve → augment → generate
    • How embeddings turn meaning into vectors you can compare
    • How cosine similarity ranks chunks and picks the top-k
    • Chunking strategies and the trade-offs between them
    • How to evaluate a RAG system and spot why it goes wrong

    📚 Real-World Analogy: The Open-Book Exam

    A language model on its own is like a student taking a closed-book exam. It answers from memory — often brilliantly, but sometimes it confidently writes down something that simply isn't true. That confident-but-wrong answer is what we call a hallucination.

    RAG turns it into an open-book exam. Before the student answers, you hand them the exact pages that contain the answer. They read those pages, then write a response grounded in real text. The same question, but now the answer is backed by a source you can point to.

    Your job in RAG is to be the helpful librarian: given a question, find the right pages fast and slide them across the desk. The rest of this lesson is about how to do that finding well.

    1. Why RAG? The Problems It Solves

    A language model only knows what it saw during training. That creates three real problems, and RAG addresses all three by feeding the model trusted text at question time.

    Grounding & fewer hallucinations

    The model answers from text you supplied, so it is far less likely to invent facts and can cite where each claim came from.

    Fresh data

    Training data is frozen in the past. RAG lets the model use today's documents: add a file, index it, and it's usable right away, with no retraining.

    Private data

    Your internal handbook, support tickets, or contracts were never in the training set. RAG reads them without retraining anything.

    2. The RAG Pipeline: 6 Steps

    Nearly every RAG system follows the same six steps. The first three run once, when you load your documents; the last three run on every question.

      Setup (once, when documents arrive)
      ┌────────────────────────────────────────────────────┐
      │ 1. CHUNK     split documents into bite-size pieces │
      │ 2. EMBED     turn each chunk into a vector         │
      │ 3. STORE     save the vectors in a vector database │
      └────────────────────────────────────────────────────┘

      Query time (every question)
      ┌────────────────────────────────────────────────────┐
      │ 4. RETRIEVE  embed the question, find top-k chunks │
      │ 5. AUGMENT   paste those chunks into the prompt    │
      │ 6. GENERATE  the model answers from that context   │
      └────────────────────────────────────────────────────┘

    The clever part is step 4 — measuring which chunks are most similar in meaning to the question. The worked example below builds that whole loop by hand so you can see exactly what happens.

    🔧 Worked Example: The Whole Loop by Hand

    Run this. It builds a three-fact knowledge base, scores the question against each fact with cosine similarity, retrieves the best one, and stitches together the final prompt. Every line is commented and the expected output is at the bottom.

    Try It: The RAG Loop

    Retrieve the best fact and build a grounded prompt

    Python
    # A complete RAG loop by hand — no libraries, just lists and maths.
    # RAG = Retrieval-Augmented Generation: fetch the right facts, then answer.
    
    import math
    
    # 1) Tiny "knowledge base" — each chunk is a short fact we trust.
    chunks = [
        "Python was created by Guido van Rossum and released in 1991.",
        "The Eiffel Tower is 330 metres tall and stands in Paris.",
        "Photosynthesis lets plants turn sunlight into chemical energy.",
    ]
    
    # 2) Toy "embeddings" — real systems use a model to make the
    ...
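
The listing above is cut off on this page, so here is a separate minimal sketch of the same loop. The three-number vectors and the question vector are made-up toy values (an assumption for illustration, not the lesson's original numbers); a real system would get them from an embedding model.

```python
import math

# Knowledge base: the same three facts as the lesson.
chunks = [
    "Python was created by Guido van Rossum and released in 1991.",
    "The Eiffel Tower is 330 metres tall and stands in Paris.",
    "Photosynthesis lets plants turn sunlight into chemical energy.",
]

# Hypothetical toy embeddings: [programming, landmarks, biology].
chunk_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]

def cosine_similarity(a, b):
    """dot(a, b) / (length(a) * length(b))"""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

question = "Who created Python?"
question_vector = [0.8, 0.2, 0.0]  # hypothetical: leans "programming"

# RETRIEVE: score every chunk and keep the index of the best one.
scores = [cosine_similarity(question_vector, v) for v in chunk_vectors]
best_index = max(range(len(chunks)), key=lambda i: scores[i])

# AUGMENT: paste the retrieved chunk into the prompt.
prompt = (
    "Answer using ONLY the context.\n\n"
    "Context:\n" + chunks[best_index] + "\n\n"
    "Question: " + question
)

# GENERATE would send `prompt` to a language model; here we just print it.
print(prompt)
```

With these toy numbers the Python fact scores highest, so it is the one pasted into the prompt.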

    3. Embeddings: Turning Meaning Into Numbers

    An embedding is a list of numbers — a vector — that captures the meaning of a piece of text. An embedding model is trained so that texts with similar meaning get vectors pointing in similar directions.

    "Who created Python?" and "Python was made by Guido van Rossum" end up close together, even though they share almost no words. That's the magic: embeddings compare meaning, not spelling. This is also called semantic search — searching by sense rather than by keyword.

    🧭 How do we measure "close"?

    With cosine similarity. It measures the angle between two vectors and returns a score from -1 (opposite) to 1 (identical direction). A score near 1 means "very similar meaning". You already used it in the worked example — here's the formula it implements:

    cosine(a, b) = dot(a, b) / (length(a) * length(b))
    
    # dot(a, b)   = sum of a[i] * b[i]
    # length(v)   = sqrt(sum of v[i] * v[i])
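
As a rough sketch, the formula translates almost line for line into Python. The function name and the sample vectors are illustrative choices, and returning 0.0 for a zero-length vector is a convention picked here, not a rule.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (length_a * length_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction, different length
right_angle = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])    # opposite directions

print(round(same, 3), round(right_angle, 3), round(opposite, 3))
# prints: 1.0 0.0 -1.0
```

Note that [1, 2] and [2, 4] score 1.0: cosine similarity compares direction only and ignores how long the vectors are.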

    In real projects you never hand-write vectors. A model like all-MiniLM-L6-v2 produces a 384-number vector for any text in milliseconds. The example below shows the real call.

    🧠 Real Embeddings with sentence-transformers

    This is how embeddings look in production — read it to see the shape of the real API. (It needs the sentence-transformers package installed locally, so study it here rather than running it in the browser.)

    # In the real world you don't hand-write vectors — a model makes them.
    # sentence-transformers turns any text into a meaningful embedding.
    
    from sentence_transformers import SentenceTransformer, util
    
    # A small, fast, popular embedding model (384 numbers per text).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    chunks = [
        "Python was created by Guido van Rossum.",
        "Cats are small domesticated carnivorous mammals.",
        "The Pacific is the largest and deepest ocean.",
    ]
    
    # embed_documents-style call: one vector per chunk.
    chunk_embeddings = model.encode(chunks)
    
    query = "Who invented the Python language?"
    query_embedding = model.encode(query)
    
    # cos_sim returns a similarity score for the query vs every chunk.
    scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
    best = int(scores.argmax())
    
    print("Best match:", chunks[best])
    
    # Expected output:
    # Best match: Python was created by Guido van Rossum.

    Notice the pattern is identical to your hand-written loop: embed the chunks, embed the query, compare with cosine similarity, take the best. The only difference is a trained model makes the vectors instead of you.

    4. Retrieval: Picking the Top-K Chunks

    One chunk is rarely enough. Most questions need facts spread across several chunks, so you keep the top-k — the k highest-scoring chunks, often the top 3 to 5. You score every chunk, sort highest-first, and slice off the best few.

    Choosing k is a balance: too few and you miss relevant facts; too many and you bury the answer in noise and fill up the model's limited context window. Start around 3-5 and tune it. Run the worked example to see top-k selection in action.

    Try It: Top-K Retrieval

    Score every chunk and keep the best few

    Python
    # Retrieve the TOP-K most relevant chunks for a query.
    # Real RAG rarely uses just one chunk — it stuffs the best few into the prompt.
    
    import math
    
    chunks = [
        "Python is a popular language for data science and AI.",
        "Pandas is a Python library for working with tables of data.",
        "The Great Barrier Reef is the world's largest coral reef system.",
        "NumPy gives Python fast arrays and maths for machine learning.",
        "Mount Everest is the highest mountain above sea level.",
    ]
    
    # Toy
    ...
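
The listing above is truncated here, so the following is an independent sketch of top-k selection over the same five chunks. The two-number vectors and the query vector are invented toy values (an assumption, not the lesson's originals), and top_k is an illustrative helper name.

```python
import math

chunks = [
    "Python is a popular language for data science and AI.",
    "Pandas is a Python library for working with tables of data.",
    "The Great Barrier Reef is the world's largest coral reef system.",
    "NumPy gives Python fast arrays and maths for machine learning.",
    "Mount Everest is the highest mountain above sea level.",
]

# Hypothetical toy vectors: [python/data topics, geography topics].
vectors = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.7, 0.1],
    [0.0, 1.0],
]
query_vector = [1.0, 0.0]  # hypothetical query about Python tools

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, vecs, k):
    """Score every vector, sort highest-first, and slice off the best k."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(vecs)]
    scored.sort(reverse=True)
    return scored[:k]

for score, i in top_k(query_vector, vectors, k=3):
    print(f"[{score:.3f}] {chunks[i]}")
```

With these numbers the three Python-related chunks come back and the two geography chunks are left out.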

    🎯 Your Turn: Finish the Retrieval Step

    Fill in the three blanks: complete the cosine-similarity formula, score every chunk, and find the index of the best one. The expected output is in the comments so you can check yourself.

    Your Turn: Retrieve the Best Chunk

    Complete cosine similarity and pick the top match

    Python
    # 🎯 YOUR TURN — finish the retrieval step. Fill in each ___.
    import math
    
    chunks = [
        "Dogs are loyal companion animals kept as pets.",
        "Python is a programming language used to build software.",
        "Coffee is a popular caffeinated drink made from beans.",
    ]
    vectors = [
        [0.0, 0.9],   # animals
        [0.9, 0.1],   # programming
        [0.1, 0.0],   # drinks (a short vector, but cosine ignores length, so it still points "programming")
    ]
    
    query = "What language do I use to write programs?"
    query_vector = [0.85, 0.15]   # leans
    ...

    5. Chunking Strategies: The Quality Lever

    Chunking is splitting documents into pieces small enough to embed and retrieve. It is often the single biggest lever on RAG quality: chunk badly and even a strong embedding model retrieves the wrong text.

    Strategy   | How it splits                                | Trade-off
    -----------+----------------------------------------------+------------------------------------------
    Fixed-size | Every N characters or tokens                 | Simple and even, but can cut mid-sentence
    Sentence   | On sentence boundaries                       | Keeps meaning intact; chunk sizes vary
    Paragraph  | On blank lines / paragraphs                  | Full context per chunk; can be too big
    Recursive  | Tries paragraphs, then sentences, then words | Respects structure; the common default
    Semantic   | Where the topic shifts (by embedding)        | Best retrieval; needs extra computation
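
To make the first two rows concrete, here is a small sketch contrasting fixed-size and sentence chunking on the same text. The helper names and the sample text are illustrative, and the sentence rule (split after ., ! or ? followed by whitespace) is deliberately naive; it would break on abbreviations such as "e.g.".

```python
import re

text = (
    "RAG retrieves text before answering. It grounds the model in sources. "
    "Chunking decides what can be retrieved."
)

def fixed_size_chunks(text, size):
    """Cut every `size` characters, regardless of sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text):
    """Split after ., ! or ? when followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(fixed_size_chunks(text, 40))  # even sizes, but cuts land mid-sentence
print(sentence_chunks(text))        # uneven sizes, but each chunk is a whole sentence
```

The fixed-size output shows the trade-off from the table: every piece is the same length, yet none of them lines up with a complete thought.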

    💡 Sensible defaults to start with

    • Chunk size: 200-500 tokens — big enough to hold a full idea, small enough to stay focused.
    • Overlap: 10-20% — repeat a little text between chunks so a fact split across a boundary still survives.
    • Keep metadata: store the source file, page, and section with each chunk so you can cite it later.
    • Test on real questions — the only way to know your chunking works is to retrieve against actual queries.
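
The overlap and metadata defaults above can be sketched as a sliding window. This is an assumption-laden toy: it counts whitespace-separated words rather than real tokens, chunk_with_overlap is a hypothetical helper, and the dictionary keys are just one possible shape for the metadata.

```python
def chunk_with_overlap(words, size, overlap, source):
    """Slide a window of `size` words forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append({
            "text": " ".join(window),
            "source": source,      # kept so the answer can cite it later
            "start_word": start,   # position inside the source document
        })
        if start + size >= len(words):
            break                  # this window already reached the end
    return chunks

words = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(words, size=4, overlap=1, source="handbook.txt"):
    print(chunk)
```

Each chunk repeats the last word of the previous one (w3, then w6), which is the overlap that lets a fact split across a boundary survive in at least one chunk.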

    6. Augment & Generate: The Production Stack

    Once you've retrieved the top-k chunks, you augment the prompt — paste the chunks in as context — and ask the model to generate an answer grounded in them. The instruction "answer using only the context" is what keeps it honest.

    You don't build all six steps from scratch in real life. Libraries like LangChain and LlamaIndex handle chunking, embedding, and storage; a vector database like Chroma, FAISS, or Pinecone stores and searches the vectors fast. Read the pipeline below — it's the same six steps, just wired together with real tools and a Claude call for the final answer.

    # A production-shaped RAG pipeline: chunk -> embed -> store -> retrieve -> answer.
    # LangChain wires the pieces together so you write very little glue code.
    
    from langchain_community.document_loaders import TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma
    from anthropic import Anthropic
    
    # 1) CHUNK: load a document and split it into overlapping pieces.
    docs = TextLoader("handbook.txt").load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)
    chunks = splitter.split_documents(docs)
    
    # 2) EMBED + STORE: turn chunks into vectors and index them in Chroma.
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    store = Chroma.from_documents(chunks, embeddings)
    
    # 3) RETRIEVE: pull the top-k chunks most similar to the question.
    question = "How many holiday days do new employees get?"
    hits = store.similarity_search(question, k=3)
    context = "\n\n".join(hit.page_content for hit in hits)
    
    # 4) AUGMENT + GENERATE: ground the model in the retrieved context.
    client = Anthropic()   # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        "Answer using ONLY the context. If the answer is not there, say you "
        "don't know.\n\nContext:\n" + context + "\n\nQuestion: " + question
    )
    reply = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.content[0].text)
    
    # Expected output:
    # New employees receive 25 holiday days per year, plus public holidays.
    # (Grounded in the retrieved handbook chunks — not invented by the model.)
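
To demystify the store step, here is a toy in-memory vector store. It is a sketch only: TinyVectorStore is a hypothetical class, its similarity_search takes a ready-made vector rather than a question string, and the texts and two-number vectors are invented. It searches by brute force, whereas real vector databases typically add approximate nearest-neighbour indexes to stay fast at scale.

```python
import math

class TinyVectorStore:
    """Toy store: normalise vectors once on add, then search by dot product."""

    def __init__(self):
        self.texts = []
        self.unit_vectors = []

    @staticmethod
    def _normalise(vector):
        length = math.sqrt(sum(x * x for x in vector))
        return [x / length for x in vector]

    def add(self, text, vector):
        self.texts.append(text)
        self.unit_vectors.append(self._normalise(vector))

    def similarity_search(self, query_vector, k=3):
        q = self._normalise(query_vector)
        # For unit-length vectors, the dot product equals the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, v)), text)
            for v, text in zip(self.unit_vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = TinyVectorStore()
store.add("New employees get 25 holiday days.", [0.9, 0.1])
store.add("The office closes at 6 pm.", [0.2, 0.8])
store.add("Holiday requests go through your manager.", [0.7, 0.3])

for score, text in store.similarity_search([1.0, 0.0], k=2):
    print(f"[{score:.3f}] {text}")
```

Normalising at insert time means the per-query work is a plain dot product, which is one reason stored embeddings are often kept at unit length.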

    7. Evaluating a RAG System

    RAG can fail in two different places, so you measure two different things. Knowing which one is broken tells you what to fix.

    Retrieval quality

    Did the right chunks come back? Measure context recall (did we fetch the chunk that holds the answer?) and context precision (how much of what we fetched was actually relevant?).
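
A simplified, set-based reading of these two scores can be sketched as follows. It assumes you have hand-labelled which chunk IDs hold the answer; the IDs are made up, and evaluation tools define and compute their versions of these metrics in their own way.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved_ids) & relevant) / len(relevant)

def context_precision(retrieved_ids, relevant_ids):
    """Share of the retrieved chunks that were actually relevant."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant_ids)) / len(retrieved)

retrieved = ["c1", "c4", "c7", "c9"]  # what the retriever returned
relevant = ["c1", "c2", "c7"]         # hand-labelled chunks that hold the answer

print(round(context_recall(retrieved, relevant), 3))     # 2 of 3 relevant chunks found
print(round(context_precision(retrieved, relevant), 3))  # 2 of 4 retrieved chunks useful
```

Low recall points at chunking, the embedding model, or a k that is too small; low precision suggests k is too large or a reranker is needed.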

    Answer quality

    Given the right chunks, was the answer good? Measure faithfulness (does the answer stick to the context, or did it drift?) and answer relevance (did it actually address the question?).
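
Faithfulness is usually judged by a model or a human, but a crude lexical proxy shows the idea. The sketch below is an assumption for illustration, not how any evaluation tool works: it counts an answer sentence as supported only if every word in it also appears in the context, so it will wrongly flag honest paraphrases.

```python
import re

def words(text):
    """Lower-cased set of word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def crude_faithfulness(answer, context):
    """Fraction of answer sentences whose words all appear in the context."""
    context_words = words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if words(s) <= context_words)
    return supported / len(sentences)

context = "New employees receive 25 holiday days per year, plus public holidays."
grounded = "New employees receive 25 holiday days per year."
drifted = "New employees receive 25 holiday days per year. They also get a free car."

print(crude_faithfulness(grounded, context))  # every sentence is backed by the context
print(crude_faithfulness(drifted, context))   # the second sentence is not in the context
```

The drifted answer scores lower because its second sentence introduces a claim the context never made, which is exactly the drift that faithfulness is meant to catch.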

    If retrieval is poor, improve chunking, the embedding model, or k. If retrieval is good but answers are bad, tighten the prompt and instruct the model to refuse when the context doesn't contain the answer. Tools like RAGAS automate these scores so you can track them as you change the pipeline.

    🎯 Your Turn: Top-K and Prompt Building

    Now put the back half of the pipeline together: sort the scored chunks, keep the top 2, and build a grounded prompt. Fill in each ___ and check against the expected output.

    Your Turn: Build the Augmented Prompt

    Sort, slice top-k, and assemble the context

    Python
    # 🎯 YOUR TURN — pick the top-k chunks and build the prompt. Fill in each ___.
    chunks = [
        "Mars is the fourth planet from the Sun.",
        "Jupiter is the largest planet in the Solar System.",
        "A banana is a long curved yellow fruit.",
    ]
    # Pretend we already scored each chunk against the query "Tell me about planets".
    scores = [0.91, 0.88, 0.12]
    
    # Pair each score with its chunk index, then sort highest-first.
    scored = []
    for i, score in enumerate(scores):
        scored.append((score, i))
    ...

    🎯 Mini-Challenge: A Tiny Semantic Search Engine

    You get less scaffolding this time: the data and the brief are provided, but the logic is yours to write. Build the whole retrieval step from scratch: write cosine similarity, score every chunk, sort, and print the top 2. The expected output is in the comments.

    Mini-Challenge: Semantic Search

    Write the retrieval logic yourself

    Python
    # 🎯 MINI-CHALLENGE: Build a tiny semantic search engine.
    #
    # You are given chunks and matching toy vectors plus a query vector.
    # 1. Write cosine_similarity(a, b) using math.sqrt and zip.
    # 2. Score every chunk vector against the query vector.
    # 3. Sort the (score, index) pairs highest-first.
    # 4. Print the top 2 chunks with their scores rounded to 3 places.
    #
    # ✅ Expected (top 2):
    #   [0.997] Neural networks are layers of connected artificial neurons.
    #   [0.949] Deep learning trains very larg
    ...

    Common Errors (And How to Fix Them)

    These four mistakes account for most broken RAG systems. Learn to recognise them.

    ❌ Bad chunking

    Chunks too small lose context; too large dilute the answer and overflow the context window.

    # ❌ 20-character chunks (chunk_size counts characters by default): "He founded it in 2008" with no idea who "He" is
    splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=0)

    ✅ Fix: use a sensible size and a little overlap:

    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)

    ❌ Irrelevant retrieval

    The model hallucinates because the wrong chunks came back — it never saw the answer.

    # ❌ Only fetching 1 chunk — one near-miss and the answer is gone
    hits = store.similarity_search(question, k=1)

    ✅ Fix: retrieve more candidates (and consider a reranker):

    hits = store.similarity_search(question, k=5)
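
The reranking idea can be sketched as a second pass over a wider candidate list. Everything here is a stand-in: overlap_score is a hypothetical shared-word scorer used in place of a real reranker model (such as a cross-encoder), and the candidate texts are invented.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question, chunk):
    """Hypothetical stand-in for a reranker model: share of question words in the chunk."""
    q = tokens(question)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def rerank(question, candidates, keep):
    """Re-score the first-pass candidates and keep the best few."""
    ranked = sorted(candidates, key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:keep]

question = "How many holiday days do new employees get?"
candidates = [  # pretend these came back from a wide first-pass search
    "Public holidays are listed on the intranet calendar.",
    "New employees get 25 holiday days per year.",
    "Employees must badge in at reception.",
    "Holiday requests need manager approval.",
]

top = rerank(question, candidates, keep=2)
print(top)
```

The pattern is the point: retrieve generously with the cheap vector search, then spend a more precise scorer on only those candidates.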

    ❌ Context overflow

    Stuffing too many chunks in blows past the model's context window — the request errors or silently drops text.

    # ❌ Pasting 50 huge chunks into one prompt
    context = "\n".join(hit.page_content for hit in store.similarity_search(q, k=50))

    ✅ Fix: keep k small and chunks reasonably sized; rerank to top 3-5:

    hits = store.similarity_search(q, k=5)   # a few focused chunks beat fifty
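
Another guard is to pack chunks against an explicit token budget. This is a sketch under a loud assumption: estimate_tokens uses a rough "about 4 characters per token" rule of thumb for English text, so swap in the model's real tokenizer when accuracy matters. Both helper names are hypothetical.

```python
def estimate_tokens(text):
    """Rough assumption: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def pack_context(ranked_chunks, token_budget):
    """Take chunks in rank order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        kept.append(chunk)
        used += cost
    return kept, used

# Three chunks of roughly 100 estimated tokens each, best first.
ranked = ["a" * 400, "b" * 400, "c" * 400]
kept, used = pack_context(ranked, token_budget=250)

print(len(kept), used)
# prints: 2 200
```

Because the chunks arrive best-first, stopping at the budget drops the least relevant text rather than an arbitrary slice.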

    ❌ No citations / no grounding instruction

    Without telling the model to use only the context, it falls back on memory and you can't trace claims.

    # ❌ No grounding rule — the model answers from memory
    prompt = context + "\n" + question

    ✅ Fix: instruct it to stay in the context and admit when it can't:

    prompt = (
        "Answer using ONLY the context below. If it isn't there, say "
        "'I don't know'. Cite the source line you used.\n\n"
        "Context:\n" + context + "\n\nQuestion: " + question
    )
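
For citations to be traceable, each chunk needs a label the model can point to. The sketch below numbers the chunks and attaches their metadata; build_cited_prompt, the dictionary keys, and the page numbers are illustrative assumptions rather than any library's format.

```python
def build_cited_prompt(question, hits):
    """Label each chunk [1], [2], ... with its source so claims can be traced."""
    blocks = []
    for number, hit in enumerate(hits, start=1):
        header = f"[{number}] ({hit['source']}, page {hit['page']})"
        blocks.append(header + "\n" + hit["text"])
    context = "\n\n".join(blocks)
    return (
        "Answer using ONLY the context below. If it isn't there, say "
        "'I don't know'. Cite the source number you used, like [1].\n\n"
        "Context:\n" + context + "\n\nQuestion: " + question
    )

# Hypothetical retrieved chunks with the metadata stored at indexing time.
hits = [
    {"text": "New employees receive 25 holiday days per year.",
     "source": "handbook.txt", "page": 4},
    {"text": "Holiday requests need manager approval.",
     "source": "handbook.txt", "page": 5},
]

prompt = build_cited_prompt("How many holiday days do new employees get?", hits)
print(prompt)
```

An answer that ends with "[1]" can then be checked against handbook.txt, page 4, instead of being taken on trust.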

    📋 Quick Reference

    Term              | What It Means                                  | Typical Choice
    ------------------+------------------------------------------------+--------------------------------
    Chunk             | A small slice of a document, ready to embed    | 200-500 tokens, 10-20% overlap
    Embedding         | A vector capturing a text's meaning            | all-MiniLM-L6-v2 (fast)
    Cosine similarity | Score of how aligned two vectors are (-1 to 1) | Higher = more relevant
    Top-k             | The k highest-scoring chunks for a query       | k = 3 to 5
    Vector DB         | Stores vectors and searches them fast          | Chroma, FAISS, Pinecone
    Reranker          | Re-scores top-k chunks more precisely          | Cross-encoder or hosted service
    Augment           | Paste retrieved chunks into the prompt         | "Answer using only the context"

    ❓ Frequently Asked Questions

    Q: What is Retrieval-Augmented Generation (RAG)?

    A: RAG is a pattern that fetches relevant text from your own documents and pastes it into the prompt before the language model answers. The model reads that retrieved context and grounds its answer in it, instead of relying only on what it memorised during training.

    Q: Why use RAG instead of fine-tuning the model?

    A: RAG needs no training run — you just index documents, and you can add, edit, or delete them instantly. It also lets the model cite its sources and work with fresh or private data the model has never seen. Fine-tuning bakes knowledge in permanently and is slow and costly to update.

    Q: What is an embedding?

    A: An embedding is a list of numbers (a vector) that captures the meaning of a piece of text. Texts with similar meaning get vectors that point in similar directions, so you can measure how related two pieces of text are with cosine similarity.

    Q: How big should my chunks be?

    A: A common starting point is 200 to 500 tokens per chunk with 10 to 20 percent overlap. Chunks that are too small lose context; chunks that are too large dilute the relevant part and waste the model's context window. Always test retrieval quality on real questions.

    Q: What does 'top-k' mean in RAG?

    A: Top-k means you keep the k most similar chunks for a query — for example, the top 3. You rank every chunk by its similarity score to the question and pass only the highest-scoring few into the prompt.

    Q: Why does my RAG system still hallucinate?

    A: Usually the retrieval step returned the wrong chunks, so the model never saw the right facts. Check your chunking, try a better embedding model, retrieve more chunks, add a reranker, and instruct the model to answer only from the provided context and to say 'I don't know' when the context is missing.

    🎉

    Lesson complete — you can build knowledge-grounded AI!

    You now understand why RAG exists, the six-step pipeline, how embeddings and cosine similarity power retrieval, how chunking decides quality, and how to evaluate the whole thing. You even built the retrieval loop by hand.

    🚀 Up next: Vector Databases — the engine that stores millions of embeddings and finds the top-k in milliseconds.
