Lesson 36 • Advanced
Building RAG Systems
Combine LLMs with knowledge retrieval — build chatbots that answer questions from your private documents without hallucinating.
✅ What You'll Learn
- RAG pipeline: embed → retrieve → generate
- Document chunking strategies and their tradeoffs
- Cosine similarity for semantic search
- RAG vs fine-tuning: when to use each
📚 Giving LLMs a Library Card
🎯 Real-World Analogy: An LLM without RAG is like a brilliant professor answering questions purely from memory — they're often right, but sometimes confidently wrong (hallucination). RAG is like giving that professor access to a library: "Before answering, check these relevant books first." The professor's answers are now grounded in actual sources, dramatically reducing errors.
RAG (introduced by Lewis et al. in 2020) is the most practical way to build AI applications over private data. Instead of expensive fine-tuning, you simply index your documents in a vector database and retrieve the most relevant chunks before each LLM call. It's the architecture behind most enterprise chatbots, customer-support bots, and knowledge assistants.
Try It: RAG Pipeline
Build a knowledge base, retrieve relevant docs, and augment LLM prompts
import numpy as np

# RAG: Retrieval-Augmented Generation
# Give LLMs access to your private knowledge base
np.random.seed(42)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def embed_text(text, dim=64):
    """Simulate text embedding (in practice: use sentence-transformers)."""
    np.random.seed(hash(text) % 2**31)
    return np.random.randn(dim)

# Build a knowledge base (toy documents; real systems index thousands)
documents = [
    "Python was created by Guido van Rossum and first released in 1991.",
    "RAG retrieves relevant chunks before each LLM call.",
    "Cosine similarity measures the angle between embedding vectors.",
]

# Retrieve: rank documents by similarity to the query
# (simulated embeddings make the ranking illustrative, not truly semantic)
query = "Who created Python?"
query_vec = embed_text(query)
scores = [cosine_similarity(query_vec, embed_text(doc)) for doc in documents]
print("Best match:", documents[int(np.argmax(scores))])
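Retrieval is only half of RAG; the "augment" step stuffs the retrieved text into the prompt so the model answers from sources rather than memory. A minimal sketch (the template wording and `build_prompt` helper are illustrative, not a fixed API):

```python
# Augment: ground the LLM prompt in retrieved text before calling the model.
def build_prompt(question, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "Who created Python?",
    ["Python was created by Guido van Rossum."],
)
print(prompt)
```

The "say so" instruction matters: it gives the model an explicit escape hatch instead of pressuring it to guess when retrieval misses.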
Try It: Document Chunking
Compare fixed-size vs sentence-based chunking strategies
import numpy as np

# Document Chunking: How to Split Documents for RAG
# The #1 factor affecting RAG quality
np.random.seed(42)

def chunk_by_size(text, chunk_size, overlap):
    """Fixed-size chunking with overlap (measured in words)."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

def chunk_by_sentence(text, max_sentences=3):
    """Sentence-based chunking: group up to max_sentences per chunk."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [". ".join(sentences[i:i + max_sentences]) + "."
            for i in range(0, len(sentences), max_sentences)]

text = ("RAG systems split documents into chunks. Each chunk is embedded "
        "separately. Retrieval returns the most similar chunks. Chunk size "
        "controls how much context each result carries.")
print(chunk_by_size(text, chunk_size=10, overlap=2))
print(chunk_by_sentence(text, max_sentences=2))
⚠️ Common Mistake: Using chunks that are too small (under 100 tokens) or too large (over 1000 tokens). Small chunks lose context — the retrieved text doesn't contain enough information. Large chunks waste the LLM's context window and dilute the relevant information. Aim for 200-500 tokens with 10-20% overlap.
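As a quick sanity check on that guidance, here is a sketch using words as a crude token proxy (the document length and 15% overlap figure are illustrative assumptions):

```python
# Rule of thumb from the lesson: 200-500 tokens per chunk, 10-20% overlap.
# Words stand in for tokens here, which is only a rough approximation.
chunk_size = 300                      # mid-range chunk, in words
overlap = int(chunk_size * 0.15)      # 15% overlap -> 45 words

def chunk_by_size(text, chunk_size, overlap):
    """Fixed-size chunking with overlap, as in the exercise above."""
    words = text.split()
    step = chunk_size - overlap       # how far the window advances
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

doc = "word " * 1000                  # a toy 1000-word document
chunks = chunk_by_size(doc, chunk_size, overlap)
print(len(chunks), "chunks; window advances", chunk_size - overlap, "words")
# -> 4 chunks; window advances 255 words
```

The overlap means each chunk repeats the last ~45 words of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.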
💡 Pro Tip: Use LangChain or LlamaIndex to build RAG systems in under 50 lines. For embeddings, use sentence-transformers/all-MiniLM-L6-v2 (fast) or OpenAI's text-embedding-3-small (accurate). Always add a reranker (Cohere Rerank or a cross-encoder) after initial retrieval — it typically improves answer quality by 15-30%.
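The retrieve-then-rerank pattern can be sketched in pure NumPy. Stage 1 scores every document cheaply; stage 2 rescores only the top candidates. The `rerank_score` function below is a stand-in for a real cross-encoder, and all data here is simulated:

```python
import numpy as np

# Two-stage retrieval: fast vector search, then a (simulated) reranker.
rng = np.random.default_rng(0)

docs = [f"doc {i}" for i in range(100)]
doc_vecs = rng.standard_normal((100, 64))
query_vec = rng.standard_normal(64)

# Stage 1: cheap cosine similarity over ALL documents, keep the top 10
sims = (doc_vecs @ query_vec) / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
top10 = np.argsort(sims)[::-1][:10]

# Stage 2: an expensive scorer runs on only those 10 candidates.
# A real system would call a cross-encoder that reads query and document
# together; here a dot product stands in for illustration.
def rerank_score(q, d):
    return float(q @ d)

reranked = sorted(top10, key=lambda i: rerank_score(query_vec, doc_vecs[i]),
                  reverse=True)
print("final order:", [docs[i] for i in reranked[:3]])
```

The design point: the reranker is too slow to run on the whole corpus, but running it on 10-100 candidates costs little and fixes ranking mistakes the cheap similarity made.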
📋 Quick Reference
| Component | Options | Recommendation |
|---|---|---|
| Embedding Model | OpenAI, sentence-transformers | all-MiniLM-L6-v2 |
| Vector DB | Pinecone, Chroma, FAISS | Chroma for prototyping |
| Chunking | Fixed, sentence, recursive | Recursive with overlap |
| Reranker | Cohere, cross-encoder | Always add one |
| LLM | GPT-4, Claude, LLaMA | GPT-4 for quality |
🎉 Lesson Complete!
You can now build knowledge-grounded AI systems! Next, learn about vector databases — the engine that powers RAG retrieval.