Lesson 37 • Advanced
Vector Databases & Embeddings
By the end of this lesson you'll be able to turn data into vectors, search them by meaning with the right similarity metric, and pick between FAISS, Pinecone, Chroma, pgvector, and Weaviate.
What You'll Learn in This Lesson
- ✓ What an embedding is and why similar things sit close together
- ✓ Cosine vs dot product vs Euclidean, and when to use each
- ✓ Write a brute-force nearest-neighbour search in plain Python
- ✓ How ANN and HNSW make search fast at scale
- ✓ Index vs brute force, and filtering search by metadata
- ✓ Choose between FAISS, Pinecone, Chroma, pgvector, Weaviate
📚 Real-World Analogy: the librarian who reads minds
A traditional database is a card catalogue: it finds a book only if you know its exact title or author. A vector database is the brilliant librarian who finds books by meaning, not by title. You say "something like Harry Potter," and they walk you to other fantasy novels with magic and boarding schools — even though none share the words "Harry" or "Potter."
How does the librarian do it? In their head, every book has a position based on its themes: fantasy books cluster in one corner, cookbooks in another. "Closeness" in that mental map means "similar in meaning." A vector database makes that map literal: each item becomes a point in space (an embedding), and search means "find the nearest points to where the query lands."
1. Embeddings: meaning as a list of numbers
An embedding is a list of numbers (a vector) that captures the meaning of something — a word, a sentence, an image. An embedding model is trained so that similar inputs get similar vectors, meaning they point in nearly the same direction in space.
Real embeddings are big. OpenAI's text-embedding-3-small has 1536 numbers per item; all-MiniLM-L6-v2 has 384. You never read these numbers by hand — you let math compare them. In the worked example below, each item is just 3 numbers so you can see the whole idea at once: the first number is "how fruity," the third is "how techy."
The key intuition: direction encodes meaning. Two fruity items point the same way; a techy item points elsewhere. Search is "whose vector points most like my query's vector?"
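To make "direction encodes meaning" concrete, here is a minimal sketch that measures the angle between toy vectors. The values and the `angle_degrees` helper are illustrative assumptions, not output from a real embedding model.

```python
import math

def angle_degrees(a, b):
    """Angle between two vectors, in degrees (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos))

apple = [0.9, 0.1, 0.0]       # mostly "fruity"
big_apple = [1.8, 0.2, 0.0]   # same direction as apple, twice the length
laptop = [0.0, 0.1, 0.9]      # mostly "techy"

# Doubling the length does not change the direction, so the angle stays near 0.
print(round(angle_degrees(apple, big_apple), 1))
# A fruity vector and a techy vector are close to perpendicular.
print(round(angle_degrees(apple, laptop), 1))
```

The first angle comes out at essentially zero and the second at just under 90 degrees, which is the picture to keep in mind: length says little, direction says a lot.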
2. Similarity metrics: cosine vs dot vs Euclidean
To find "nearest," you need a way to score how close two vectors are. There are three you'll meet:
Cosine similarity
The cosine of the angle between vectors. Ignores length, cares only about direction. Ranges from -1 to 1; typical text embeddings mostly score between 0 and 1. The safe default for text.
Dot product
Multiply matching components and sum. Factors in magnitude. Once vectors are L2-normalised (length 1), the dot product equals cosine — which is why databases normalise then use it (it's fast).
Euclidean (L2)
Straight-line distance between the points. Smaller is closer. Common for image features. Sensitive to magnitude, so normalise first.
The formula for cosine is dot(a, b) / (norm(a) * norm(b)), where norm is the vector's length. That's exactly what the worked example codes up next.
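The three metrics can disagree about which item is "nearest." The sketch below scores two made-up vectors against one query; the vectors and the names `short` and `long_` are assumptions chosen to make the disagreement visible.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0, 0.0]
short = [0.9, 0.1, 0.0]   # points almost exactly like the query
long_ = [3.0, 3.0, 0.0]   # 45 degrees off the query, but much longer

for name, vec in [("short", short), ("long", long_)]:
    print(
        f"{name:5s} cosine={cosine_similarity(query, vec):.3f} "
        f"dot={dot(query, vec):.3f} "
        f"euclidean={euclidean_distance(query, vec):.3f}"
    )
```

Cosine (higher is better) and Euclidean (lower is better) both prefer `short`, while the raw dot product prefers `long_` purely because of its magnitude.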
3. Worked example: brute-force nearest-neighbour search
Here is the heart of a vector database in pure Python — no libraries. You score the query against every item with cosine similarity, sort, and return the top-k. This is called brute force because it checks everything: simple and exact, but it does n comparisons. Read each comment, then run it.
Worked Example: Brute-Force Nearest Neighbours
Cosine similarity over toy vectors — returns the top-k closest ids
# WORKED EXAMPLE — find the most similar items by MEANING
# Plain Python, no libraries. Each item is a tiny "embedding" (a list of numbers).
# We score every item against the query with COSINE similarity, then return the top-k.
import math
# A toy collection: id -> embedding vector.
# Think of these as 3 numbers describing each thing ("fruity", "sporty", "techy").
items = {
"apple": [0.9, 0.1, 0.0], # very fruity
"banana": [0.8, 0.2, 0.0], # fruity
"laptop": [0.0, 0.1, 0.9]
...
🎯 Your Turn #1: return the top-k IDs
Fill in the ___ blanks. Check your output against the # ✅ Expected output comment.
Your Turn #1: Top-k IDs
Fill the blanks to return the nearest-neighbour ids
# 🎯 YOUR TURN #1 — return the IDs of the top-k nearest neighbours
# Fill in the blanks marked with ___
import math
items = {
"dog": [0.9, 0.1, 0.0],
"cat": [0.8, 0.2, 0.0],
"car": [0.0, 0.1, 0.9],
"truck": [0.1, 0.0, 0.9],
}
def dot(a, b):
return sum(x * y for x, y in zip(a, b))
def norm(a):
return math.sqrt(sum(x * x for x in a))
def cosine_similarity(a, b):
# 👉 cosine = dot(a, b) divided by (norm(a) * norm(b))
return ___ / (norm(a) * norm(b))
def t
...
🎯 Your Turn #2: normalise, then pick the metric
Fill in the ___ blanks.
Your Turn #2: Normalise & Search
L2-normalise vectors, then use the dot product as cosine
# 🎯 YOUR TURN #2 — normalise a vector, then search with the right metric
# Fill in the blanks marked with ___
import math
def norm(a):
return math.sqrt(sum(x * x for x in a))
def normalize(a):
n = norm(a)
# 👉 divide every component by the length so the vector has length 1.0
return [x / ___ for x in a]
# After L2-normalising both vectors, the DOT PRODUCT equals the COSINE similarity.
def dot(a, b):
return sum(x * y for x, y in zip(a, b))
items = {
"red": normalize
...
4. Going fast: ANN, indexing, and HNSW intuition
Brute force is exact, but it scores every vector, so each query costs O(n). At 10 million vectors that's 10 million comparisons per query, which is usually too slow for interactive search. The fix is approximate nearest neighbour (ANN) search: accept slightly imperfect recall (roughly 98% of the true neighbours is a typical target) in exchange for queries that are often 10–100× faster. You build an index once, then queries skip almost all the work.
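One way to check how much accuracy an approximate index gives up is to compare its results with brute force on the same queries. This is usually reported as recall@k. The helper name and the result lists below are hypothetical, just to show the arithmetic.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    hits = len(set(exact_ids) & set(approx_ids))
    return hits / len(exact_ids)

# Hypothetical results for one query: brute force vs an ANN index.
exact = ["d7", "d2", "d9", "d4", "d1"]
approx = ["d7", "d2", "d4", "d1", "d8"]   # missed d9, returned d8 instead

print(recall_at_k(exact, approx))  # 0.8
```

In practice you would average this over many queries before trusting the number.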
HNSW intuition — a "small world" graph
HNSW (Hierarchical Navigable Small World) is the most popular ANN index. Picture a stack of maps:
- The top layer is sparse — a few "express stops" that let you jump across the whole space fast.
- Each layer below is denser, with more local connections.
- You enter at the top, greedily hop to whichever neighbour is closer to the query, then drop a layer and refine.
- You arrive near the true nearest neighbours having touched only a tiny fraction of the points.
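The steps above can be sketched with a tiny hand-built graph. This is a simplified illustration under loud assumptions: the nine points, the two layers, and their links are made up, and the search is purely greedy. Real HNSW builds the links itself during insertion and keeps a list of candidates on the bottom layer (sized by a parameter such as efSearch) rather than a single current node.

```python
import math

# Hypothetical toy data: nine points on a line, ids p0..p8.
points = {f"p{i}": (float(i), 0.0) for i in range(9)}

# layers[0] is the dense bottom layer: every point links to its immediate neighbours.
# layers[1] is the sparse top layer: a few "express stops" with long links.
layers = [
    {f"p{i}": [f"p{j}" for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)},
    {"p0": ["p4"], "p4": ["p0", "p8"], "p8": ["p4"]},
]

def greedy_step(graph, query, start, touched):
    """Hop to the closest neighbour until no neighbour beats the current node."""
    current = start
    touched.add(current)
    while True:
        touched.update(graph[current])
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

def layered_search(query, entry="p0"):
    touched = set()                    # every point whose distance was computed
    current = entry
    for graph in reversed(layers):     # top layer first, then drop down
        current = greedy_step(graph, query, current, touched)
    return current, touched

nearest, touched = layered_search((6.8, 0.0))
print(nearest)           # p7
print(sorted(touched))   # ['p0', 'p4', 'p6', 'p7', 'p8']
```

The search reaches the true nearest point while computing distances to five of the nine points; on a real collection the skipped fraction is far larger.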
Other index families: IVF (inverted file) splits vectors into clusters and searches only the nearest clusters; LSH hashes similar vectors into the same bucket; PQ (product quantization) compresses vectors to shrink memory. HNSW often gives the best recall/speed tradeoff, at the cost of extra memory for the graph, which is why FAISS, Chroma, Weaviate, Qdrant, and pgvector all offer it.
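The IVF idea is small enough to sketch in plain Python. Treat this as an illustration only: the vectors are made up, the centroids are hand-picked to keep the result deterministic (real IVF learns them, typically with k-means), and `ivf_search` is a hypothetical name. The `nprobe` parameter mirrors the IVF knob of the same name.

```python
import math

# Hypothetical 2-D vectors and hand-picked centroids.
vectors = {
    "apple": (0.9, 0.1), "banana": (0.8, 0.2), "cherry": (0.7, 0.3),
    "laptop": (0.1, 0.9), "phone": (0.2, 0.8), "tablet": (0.3, 0.7),
}
centroids = [(0.8, 0.2), (0.2, 0.8)]

def nearest_centroids(vec, n):
    """Indices of the n centroids closest to vec."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], vec))
    return order[:n]

# Build step: file each vector under its single closest centroid.
inverted_lists = {i: [] for i in range(len(centroids))}
for item_id, vec in vectors.items():
    inverted_lists[nearest_centroids(vec, 1)[0]].append(item_id)

def ivf_search(query, top_k=2, nprobe=1):
    """Scan only the nprobe closest clusters instead of every vector."""
    candidates = [
        item_id
        for cluster in nearest_centroids(query, nprobe)
        for item_id in inverted_lists[cluster]
    ]
    candidates.sort(key=lambda item_id: math.dist(vectors[item_id], query))
    return candidates[:top_k]

print(ivf_search((0.88, 0.12)))   # ['apple', 'banana']
```

With `nprobe=1` the query never looks at the tech cluster at all. Raising `nprobe` scans more clusters, which costs time but recovers neighbours that sit near a cluster boundary.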
5. Metadata filtering: similarity plus structured rules
Pure similarity isn't always enough. You often need "find docs similar to this query but only from 2024" or "only in English." Vector databases let you attach metadata (a small dict of fields like topic, year, language) to each vector and filter on it during search.
In the Chroma worked example below, the where={"topic": "fruit"} clause keeps only fruit documents, then ranks those by similarity. Filtering shrinks the candidate set, which improves relevance and often helps latency too, although with an approximate index a very selective filter can cost speed or recall, depending on how the database applies it.
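Before looking at a real database, it can help to see the filter-then-rank idea in plain Python. This sketch is not how any particular database implements filtering; `filtered_search` is a hypothetical helper, and the `year` field is an added assumption for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical records: each vector carries a small metadata dict.
docs = [
    {"id": "a1", "vector": [0.9, 0.1, 0.0], "meta": {"topic": "fruit", "year": 2024}},
    {"id": "a2", "vector": [0.0, 0.1, 0.9], "meta": {"topic": "tech", "year": 2024}},
    {"id": "a3", "vector": [0.85, 0.15, 0.0], "meta": {"topic": "fruit", "year": 2023}},
]

def filtered_search(query, where, top_k=2):
    """Keep docs whose metadata matches every key in `where`, then rank by cosine."""
    matches = [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in where.items())
    ]
    matches.sort(key=lambda d: cosine_similarity(query, d["vector"]), reverse=True)
    return [d["id"] for d in matches[:top_k]]

print(filtered_search([1.0, 0.0, 0.0], {"topic": "fruit"}))                # ['a1', 'a3']
print(filtered_search([1.0, 0.0, 0.0], {"topic": "fruit", "year": 2023}))  # ['a3']
```

The tech document never gets scored in either call, because the filter removes it before any similarity is computed.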
6. The landscape: FAISS, Pinecone, Chroma, pgvector, Weaviate
The toy code above is the core of what production systems do, just optimised and persisted. These two read-only examples show the real APIs. They won't run in the box (they need installs), so each ends with an # Expected output comment showing what you'd see.
| Tool | What it is | Best for |
|---|---|---|
| FAISS | Self-hosted library (you manage storage) | Fast on-prem / research, full control |
| Pinecone | Fully managed cloud service | Production, zero ops, easy scaling |
| Chroma | Embedding DB, runs locally | Prototyping and RAG, zero setup |
| pgvector | PostgreSQL extension | You already run Postgres and have under a few million vectors |
| Weaviate | Standalone DB with filtering | Multimodal search, built-in modules |
Worked Example: FAISS (read-only)
Build an exact index, then swap in HNSW for approximate search
# FAISS — Facebook AI Similarity Search (an indexing LIBRARY, not a server)
# Read-only example: this needs 'pip install faiss-cpu numpy' and won't run in the box.
import numpy as np
import faiss
dim = 64 # embedding dimension
vectors = np.random.randn(1000, dim).astype("float32")
faiss.normalize_L2(vectors) # normalise so inner product = cosine
# IndexFlatIP = brute force (exact) with inner product.
index = faiss.IndexFlatIP(dim)
index.add(vecto
...
Worked Example: ChromaDB with metadata filtering (read-only)
Similarity search combined with a metadata where-filter
# ChromaDB — an embedding database with built-in storage + METADATA FILTERING
# Read-only example: needs 'pip install chromadb'.
import chromadb
client = chromadb.Client()
collection = client.create_collection("articles")
# Chroma can embed text for you; here we pass our own vectors to stay explicit.
collection.add(
ids=["a1", "a2", "a3"],
embeddings=[[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.85, 0.15, 0.0]],
metadatas=[{"topic": "fruit"}, {"topic": "tech"}, {"topic": "fruit"}],
do
...
7. Common Errors (and how to fix them)
❌ Wrong similarity metric → irrelevant results
Searching with Euclidean distance when your model was trained for cosine returns neighbours that look random.
✅ Fix: use the metric your embedding model expects (usually cosine / normalised inner product), and set the same metric on the index.
❌ Unnormalised vectors → magnitude dominates
Feeding raw vectors into a dot-product (inner-product) index lets long vectors win regardless of direction.
# ❌ raw vectors of different lengths
score = dot(a, b)  # magnitude skews the ranking
✅ Fix: L2-normalise every embedding (and the query) to length 1, e.g. faiss.normalize_L2(x), so dot product equals cosine.
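Here is a small plain-Python sketch of the problem and the fix. The vectors and the `best_match` helper are made up for illustration; the zero-vector guard is an extra assumption worth having, since a zero vector has no direction to normalise.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    """Scale a vector to length 1; reject zero vectors, which have no direction."""
    length = math.sqrt(dot(a, a))
    if length == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / length for x in a]

def best_match(query, items):
    """Id of the item with the highest dot product against the query."""
    return max(items, key=lambda item_id: dot(query, items[item_id]))

query = [1.0, 0.0, 0.0]
items = {
    "aligned": [0.8, 0.2, 0.0],   # points almost exactly like the query
    "long": [2.0, 2.0, 0.0],      # 45 degrees off, but much longer
}

# Raw dot product: the long vector wins on magnitude alone.
print(best_match(query, items))                    # long

# After normalising everything, the better-aligned vector wins.
unit_items = {item_id: normalize(vec) for item_id, vec in items.items()}
print(best_match(normalize(query), unit_items))    # aligned
```

Normalising once at ingest time (and once per query) is usually enough; you do not need to repeat it on every comparison.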
❌ Expecting 100% accuracy from an approximate index
ANN indexes (HNSW, IVF, LSH) may miss a true neighbour now and then — that's the speed/accuracy tradeoff, not a bug.
✅ Fix: tune recall (e.g. HNSW efSearch, IVF nprobe), or use a flat/brute-force index when exactness is mandatory and the collection is small.
❌ Dimension mismatch
AssertionError: d == self.d
# (or)
ValueError: query dim 768 != index dim 1536
Your query vector has a different length than the stored vectors — often from mixing two embedding models.
✅ Fix: embed queries and documents with the same model, and create the index with that exact dimension (e.g. 1536 for text-embedding-3-small).
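A cheap guard is to check the length yourself before a vector reaches the index, so the failure message names both dimensions. The exception class and helper below are hypothetical names for illustration, not part of any library.

```python
class DimensionMismatchError(ValueError):
    """Raised when a vector's length differs from the index dimension."""

def check_dimension(vector, index_dim):
    """Fail early, with a readable message, before the vector reaches the index."""
    if len(vector) != index_dim:
        raise DimensionMismatchError(
            f"vector has {len(vector)} dimensions, index expects {index_dim}"
        )
    return vector

INDEX_DIM = 1536   # e.g. text-embedding-3-small

check_dimension([0.0] * 1536, INDEX_DIM)      # passes silently

try:
    check_dimension([0.0] * 768, INDEX_DIM)   # e.g. a vector from a different model
except DimensionMismatchError as err:
    print(err)   # vector has 768 dimensions, index expects 1536
```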
📋 Quick Reference
| Metric | Measures | Use when |
|---|---|---|
| Cosine | Angle (direction) | Text embeddings — safe default |
| Dot product | Direction + magnitude | Normalised vectors (= cosine, fast) |
| Euclidean (L2) | Straight-line distance | Image features; normalise first |
| Search method | Speed | Typical recall (varies with tuning) | Notes |
|---|---|---|---|
| Flat (brute force) | Slow at scale | 100% exact | Great up to ~tens of thousands |
| IVF | Fast | ~95% | Cluster then search clusters |
| HNSW | Very fast | ~98% | Graph-based; best tradeoff |
| LSH | Very fast | ~85% | Hash-based, low memory |
| PQ | Fast | ~90% | Compresses vectors, tiny memory |
❓ Frequently Asked Questions
Q: What is a vector database?
A: A vector database stores embeddings (lists of numbers that capture meaning) and finds the items whose vectors are closest to a query vector. It powers semantic search, RAG, and recommendations — matching by meaning rather than exact keywords.
Q: Cosine vs dot product vs Euclidean — which similarity metric should I use?
A: Cosine similarity compares direction and ignores length, so it is the safe default for text embeddings. The dot product factors in magnitude and equals cosine once vectors are L2-normalised, which is why most databases normalise and use inner product. Euclidean (L2) distance measures straight-line distance and is common for image features. Pick the metric your embedding model was trained with.
Q: What is approximate nearest neighbour (ANN) search, and how does HNSW work?
A: ANN trades a tiny bit of accuracy for a huge speed gain by not scanning every vector. HNSW (Hierarchical Navigable Small World) builds a layered graph: you enter at a sparse top layer to jump across the space, then descend into denser layers to refine, greedily hopping to closer neighbours. It typically reaches ~98% recall while being orders of magnitude faster than brute force.
Q: When is brute-force (exact) search good enough versus an index?
A: Brute force compares the query to every vector — O(n), 100% exact, and perfectly fine up to roughly tens of thousands of vectors. Once you reach hundreds of thousands or millions, build an index (HNSW or IVF) so latency stays low. The tradeoff is exactness for speed and memory.
Q: Which vector database should I choose?
A: Use Chroma for local prototyping (zero setup). Use pgvector if you already run PostgreSQL and have under a few million vectors. Use FAISS when you want a fast self-hosted library and control the storage yourself. Use Pinecone for a fully managed cloud service, or Weaviate when you need multimodal search and built-in filtering. Always benchmark with your own data.
Q: Why must I normalise vectors before a dot-product search?
A: Cosine similarity already divides by each vector's length, so it is unaffected by magnitude. The problem arises when an index scores with the raw dot product (inner product): there, vectors with large magnitudes can outrank better-aligned ones. L2-normalising every embedding to length 1 makes the dot product equal the cosine, keeping comparisons consistent and fast.
🎯 Mini-Challenge: build a tiny semantic search engine
Support is faded now — only an outline is given. Put the whole loop together yourself: build a small collection, write cosine similarity, and return the top-k IDs for a query.
- Make a dict of at least 4 ids → 3-number embeddings.
- Write cosine_similarity(a, b) from dot() and norm().
- Write search(query, docs, top_k) returning the best ids.
- Run a query and print the result.
Mini-Challenge: Semantic Search
Build the full nearest-neighbour search from an outline
# 🎯 MINI-CHALLENGE: build a tiny semantic search engine
# Support is faded — only the outline is given. Write the code yourself.
#
# 1. Make a dict 'docs' of at least 4 ids -> 3-number embeddings.
# 2. Write cosine_similarity(a, b) using dot() and norm() (math.sqrt).
# 3. Write search(query, docs, top_k) that returns the top_k ids by cosine.
# 4. Run a query and print the result.
#
# ✅ Expected (for query [1,0,0], the fruity/red-ish ids should come first):
# e.g. ['apple', 'orange']
import
...
🎉 Lesson Complete!
You can now explain what an embedding is, pick between cosine, dot product, and Euclidean, write a brute-force nearest-neighbour search by hand, reason about ANN and HNSW, filter by metadata, and choose between FAISS, Pinecone, Chroma, pgvector, and Weaviate.
🚀 Up next: Model Evaluation — measure how good your models and retrieval pipelines actually are with the right metrics.