Lesson 37 • Advanced

Vector Databases & Embeddings

By the end of this lesson you'll be able to turn data into vectors, search them by meaning with the right similarity metric, and pick between FAISS, Pinecone, Chroma, pgvector, and Weaviate.

What You'll Learn in This Lesson

✓ What an embedding is and why similar things sit close together
✓ Cosine vs dot product vs Euclidean, and when to use each
✓ Write a brute-force nearest-neighbour search in plain Python
✓ How ANN and HNSW make search fast at scale
✓ Index vs brute force, and filtering search by metadata
✓ Choose between FAISS, Pinecone, Chroma, pgvector, Weaviate

Before you start: it helps to have finished RAG Systems, since vector databases are the retrieval engine behind RAG. Basic Python (lists, dicts, functions) is all the code here assumes.

📚 Real-World Analogy: the librarian who reads minds

A traditional database is a card catalogue: it finds a book only if you know its exact title or author. A vector database is the brilliant librarian who finds books by meaning, not by title. You say "something like Harry Potter," and they walk you to other fantasy novels with magic and boarding schools — even though none share the words "Harry" or "Potter."

How does the librarian do it? In their head, every book has a position based on its themes: fantasy books cluster in one corner, cookbooks in another. "Closeness" in that mental map means "similar in meaning." A vector database makes that map literal: each item becomes a point in space (an embedding), and search means "find the nearest points to where the query lands."

1. Embeddings: meaning as a list of numbers

An embedding is a list of numbers (a vector) that captures the meaning of something — a word, a sentence, an image. An embedding model is trained so that similar inputs get similar vectors, meaning they point in nearly the same direction in space.

Real embeddings are big. OpenAI's text-embedding-3-small has 1536 numbers per item; all-MiniLM-L6-v2 has 384. You never read these numbers by hand — you let math compare them. In the worked example below, each item is just 3 numbers so you can see the whole idea at once: the first number is "how fruity," the third is "how techy."

The key intuition: direction encodes meaning. Two fruity items point the same way; a techy item points elsewhere. Search is "whose vector points most like my query's vector?"

2. Similarity metrics: cosine vs dot vs Euclidean

To find "nearest," you need a way to score how close two vectors are. There are three you'll meet:

Cosine similarity

The cosine of the angle between vectors. Ignores length, cares only about direction. Ranges from −1 (opposite directions) to 1 (same direction); in practice, scores between text embeddings usually land between 0 and 1. The safe default for text.

Dot product

Multiply matching components and sum. Factors in magnitude. Once vectors are L2-normalised (length 1), the dot product equals cosine — which is why databases normalise then use it (it's fast).

Euclidean (L2)

Straight-line distance between the points. Smaller is closer. Common for image features. Sensitive to magnitude, so normalise first.

Match the metric to your model. Use whatever the embedding model was trained with (most text models: cosine / normalised inner product). Mixing metrics silently returns wrong neighbours.

The formula for cosine is dot(a, b) / (norm(a) * norm(b)), where norm is the vector's length. That's exactly what the worked example codes up next.
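To make the three metrics concrete, here is a small sketch that scores the same pair of toy vectors with each one. The vectors are made-up values rather than output from a real model, and the helper names (`dot`, `norm`, `cosine`, `euclidean`) are chosen for illustration.

```python
import math

def dot(a, b):
    # Multiply matching components and sum.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Length (L2 norm) of the vector.
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # Direction only: dividing by both lengths cancels out magnitude.
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    # Straight-line distance between the two points; smaller is closer.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0, 0.0]
b = [3.0, 0.0, 0.0]   # same direction as a, three times as long

print(cosine(a, b))     # 1.0 (identical direction, length ignored)
print(dot(a, b))        # 3.0 (magnitude counts)
print(euclidean(a, b))  # 2.0 (the points sit 2 apart)
```

The two vectors point the same way, so cosine calls them a perfect match, while dot product and Euclidean distance both react to the difference in length.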

3. Worked example: brute-force nearest-neighbour search

Here is the heart of a vector database in pure Python — no libraries. You score the query against every item with cosine similarity, sort, and return the top-k. This is called brute force because it checks everything: simple and exact, but it does n comparisons. Read each comment, then run it.

Worked Example: Brute-Force Nearest Neighbours

Cosine similarity over toy vectors — returns the top-k closest ids

Python

# WORKED EXAMPLE — find the most similar items by MEANING
# Plain Python, no libraries. Each item is a tiny "embedding" (a list of numbers).
# We score every item against the query with COSINE similarity, then return the top-k.

import math

# A toy collection: id -> embedding vector.
# Think of these as 3 numbers describing each thing (the first is "fruity", the third is "techy").
items = {
    "apple":   [0.9, 0.1, 0.0],   # very fruity
    "banana":  [0.8, 0.2, 0.0],   # fruity
    "laptop":  [0.0, 0.1, 0.9]
...
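The listing above is cut off in this copy. As a stand-in, here is a minimal self-contained sketch of the same brute-force idea: score every item against the query with cosine similarity, sort, and keep the top-k. The `phone` item, the query, and the `search` function name are assumptions added for illustration, not the lesson's original code.

```python
import math

# Toy collection: id -> embedding vector.
items = {
    "apple":  [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.0],
    "laptop": [0.0, 0.1, 0.9],
    "phone":  [0.1, 0.0, 0.9],   # assumed extra item, for illustration only
}

def cosine_similarity(a, b):
    # dot(a, b) / (norm(a) * norm(b))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, items, top_k=2):
    # Brute force: score EVERY item, then sort from most to least similar.
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in items.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

# A purely "fruity" query vector.
print(search([1.0, 0.0, 0.0], items))   # ['apple', 'banana']
```

Every query touches all n items, which is what makes this approach exact but slow once the collection grows.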

🎯 Your Turn #1: return the top-k IDs

Finish the cosine formula, sort by score, and return just the IDs. Fill the three ___ blanks. Check your output against the # ✅ Expected output comment.

Your Turn #1: Top-k IDs

Fill the blanks to return the nearest-neighbour ids

Python

# 🎯 YOUR TURN #1 — return the IDs of the top-k nearest neighbours
# Fill in the blanks marked with ___

import math

items = {
    "dog":   [0.9, 0.1, 0.0],
    "cat":   [0.8, 0.2, 0.0],
    "car":   [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cosine_similarity(a, b):
    # 👉 cosine = dot(a, b) divided by (norm(a) * norm(b))
    return ___ / (norm(a) * norm(b))

def t
...

🎯 Your Turn #2: normalise, then pick the metric

Normalise each vector to length 1, then search with the dot product (which now equals cosine). Fill the three ___ blanks.

Your Turn #2: Normalise & Search

L2-normalise vectors, then use the dot product as cosine

Python

# 🎯 YOUR TURN #2 — normalise a vector, then search with the right metric
# Fill in the blanks marked with ___

import math

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def normalize(a):
    n = norm(a)
    # 👉 divide every component by the length so the vector has length 1.0
    return [x / ___ for x in a]

# After L2-normalising both vectors, the DOT PRODUCT equals the COSINE similarity.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

items = {
    "red":   normalize
...

4. Going fast: ANN, indexing, and HNSW intuition

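Brute force is exact, but scoring every vector is O(n). At 10 million vectors that's 10 million comparisons per query, which is too slow for interactive use. The fix is approximate nearest neighbour (ANN) search: accept missing a small share of the true neighbours (a ballpark of ~98% recall) in exchange for queries that are often 10–100× faster. The exact figures depend on your data and on how the index is tuned. You build an index once, then queries skip almost all the work.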

HNSW intuition — a "small world" graph

HNSW (Hierarchical Navigable Small World) is the most popular ANN index. Picture a stack of maps:

The top layer is sparse — a few "express stops" that let you jump across the whole space fast.
Each layer below is denser, with more local connections.
You enter at the top, greedily hop to whichever neighbour is closer to the query, then drop a layer and refine.
You arrive near the true nearest neighbours having touched only a tiny fraction of the points.
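The greedy hop at the core of the steps above can be sketched on a single layer. This is a toy illustration rather than HNSW itself: the graph is hand-built, there is only one layer, and the names `points`, `neighbours`, and `greedy_search` are made up for the example.

```python
import math

# Five points on a line, plus a hand-made neighbour graph.
points = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (2.0, 0.0),
    "D": (3.0, 0.0),
    "E": (4.0, 0.0),
}
neighbours = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D", "E"],
    "D": ["C", "E"],
    "E": ["C", "D"],
}

def greedy_search(query, entry):
    # Start at the entry point and keep hopping to whichever neighbour is
    # closest to the query; stop when no neighbour beats the current node.
    current = entry
    path = [current]
    while True:
        best = min(neighbours[current],
                   key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current, path
        current = best
        path.append(current)

print(greedy_search((3.2, 0.0), "A"))   # ('D', ['A', 'C', 'D'])
```

The search reaches D, the true nearest point, in two hops. Real HNSW repeats this descent across several layers and keeps a list of candidates instead of a single current node, which is why it is less likely to get stuck in a poor local choice.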

Other index families: IVF (inverted file) splits vectors into clusters and searches only the nearest clusters; LSH hashes similar vectors into the same bucket; PQ (product quantization) compresses vectors to shrink memory. HNSW often gives the best quality/speed tradeoff, which is why FAISS, Qdrant, Weaviate, Chroma, and pgvector all offer it.
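The IVF idea of "cluster, then search only the nearest clusters" can be sketched in a few lines. This is a simplified illustration under stated assumptions: the centroids are hand-picked (a real IVF index learns them, typically with k-means), the vectors are 2-dimensional toys, and `ivf_search` is a hypothetical helper. The `nprobe` name mirrors the IVF tuning parameter of the same name.

```python
import math

vectors = {
    "apple":  (0.9, 0.1),
    "banana": (0.8, 0.2),
    "laptop": (0.1, 0.9),
    "phone":  (0.2, 0.8),
}
centroids = [(1.0, 0.0), (0.0, 1.0)]   # hand-picked; real IVF learns these

# Build step: put every vector in the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for vec_id, vec in vectors.items():
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(vec, centroids[i]))
    buckets[nearest].append(vec_id)

def ivf_search(query, top_k=1, nprobe=1):
    # Probe only the nprobe clusters whose centroids are closest to the query.
    probe = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    candidates = [vec_id for i in probe for vec_id in buckets[i]]
    # Exact ranking, but only inside the probed clusters.
    candidates.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidates[:top_k]

print(buckets)                             # {0: ['apple', 'banana'], 1: ['laptop', 'phone']}
print(ivf_search((0.9, 0.0), top_k=2))     # ['apple', 'banana']
```

With `nprobe=1` the query never looks at the second bucket. Raising `nprobe` scans more clusters, which improves recall at the cost of speed.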

Rule of thumb: brute force is fine up to tens of thousands of vectors. Past hundreds of thousands, build an HNSW or IVF index. You trade a little exactness for a lot of speed.

5. Metadata filtering: similarity plus structured rules

Pure similarity isn't always enough. You often need "find docs similar to this query but only from 2024" or "only in English." Vector databases let you attach metadata (a small dict of fields like topic, year, language) to each vector and filter on it during search.

In the Chroma worked example below, the where={"topic": "fruit"} clause keeps only fruit documents, then ranks those by similarity. Filtering shrinks the candidate set, which improves relevance and can also cut latency. With an approximate index, though, a very selective filter sometimes costs extra work, so measure it on your own data.
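As a plain-Python sketch of the filter-then-rank idea (not how any particular database implements it), the snippet below keeps only documents whose metadata matches, then ranks the survivors by cosine similarity. The `year` field and the `filtered_search` helper are assumptions added for illustration.

```python
import math

docs = [
    {"id": "a1", "vector": [0.9, 0.1, 0.0],   "meta": {"topic": "fruit", "year": 2024}},
    {"id": "a2", "vector": [0.0, 0.1, 0.9],   "meta": {"topic": "tech",  "year": 2024}},
    {"id": "a3", "vector": [0.85, 0.15, 0.0], "meta": {"topic": "fruit", "year": 2023}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query, docs, where, top_k=2):
    # Step 1: keep docs whose metadata equals every key/value in `where`.
    candidates = [d for d in docs
                  if all(d["meta"].get(k) == v for k, v in where.items())]
    # Step 2: rank only the surviving candidates by similarity.
    candidates.sort(key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]

query = [1.0, 0.0, 0.0]
print(filtered_search(query, docs, where={"topic": "fruit"}))                # ['a1', 'a3']
print(filtered_search(query, docs, where={"topic": "fruit", "year": 2024}))  # ['a1']
```

The tech document never gets scored at all, which is the sense in which a filter narrows the work as well as the results.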

6. The landscape: FAISS, Pinecone, Chroma, pgvector, Weaviate

The toy code above is what every production system does — just optimised and persisted. These two read-only examples show the real APIs. They won't run in the box (they need installs), so each ends with an # Expected output comment showing what you'd see.

Tool	What it is	Best for
FAISS	Self-hosted library (you manage storage)	Fast on-prem / research, full control
Pinecone	Fully managed cloud service	Production, zero ops, easy scaling
Chroma	Embedding DB, runs locally	Prototyping and RAG, zero setup
pgvector	PostgreSQL extension	You already run Postgres, under a few million vectors
Weaviate	Standalone DB with filtering	Multimodal search, built-in modules

Worked Example: FAISS (read-only)

Build an exact index, then swap in HNSW for approximate search

Python

# FAISS — Facebook AI Similarity Search (an indexing LIBRARY, not a server)
# Read-only example: this needs 'pip install faiss-cpu numpy' and won't run in the box.

import numpy as np
import faiss

dim = 64                                  # embedding dimension
vectors = np.random.randn(1000, dim).astype("float32")
faiss.normalize_L2(vectors)               # normalise so inner product = cosine

# IndexFlatIP = brute force (exact) with inner product.
index = faiss.IndexFlatIP(dim)
index.add(vecto
...

Worked Example: ChromaDB with metadata filtering (read-only)

Similarity search combined with a metadata where-filter

Python

# ChromaDB — an embedding database with built-in storage + METADATA FILTERING
# Read-only example: needs 'pip install chromadb'.

import chromadb

client = chromadb.Client()
collection = client.create_collection("articles")

# Chroma can embed text for you; here we pass our own vectors to stay explicit.
collection.add(
    ids=["a1", "a2", "a3"],
    embeddings=[[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.85, 0.15, 0.0]],
    metadatas=[{"topic": "fruit"}, {"topic": "tech"}, {"topic": "fruit"}],
    do
...

7. Common Errors (and how to fix them)

❌ Wrong similarity metric → irrelevant results

Searching with Euclidean distance when your model was trained for cosine returns neighbours that look random.

✅ Fix: use the metric your embedding model expects (usually cosine / normalised inner product), and set the same metric on the index.

❌ Unnormalised vectors → magnitude dominates

Feeding raw vectors into a dot-product (inner-product) index lets long vectors win regardless of direction.

# ❌ raw vectors of different lengths
score = dot(a, b)            # magnitude skews the ranking

✅ Fix: L2-normalise every embedding (and the query) to length 1, e.g. faiss.normalize_L2(x), so dot product equals cosine.
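A small sketch of the problem and the fix, using made-up 2-number vectors. The names `short_match` and `long_other` are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    # Divide every component by the vector's length so the result has length 1.
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

query = [1.0, 0.0]
short_match = [0.9, 0.1]   # points almost the same way as the query
long_other = [2.0, 4.0]    # points elsewhere, but is much longer

# Raw dot product: the long vector wins purely on magnitude.
print(dot(query, short_match))   # 0.9
print(dot(query, long_other))    # 2.0

# After L2-normalising, the dot product equals cosine and the ranking flips.
q = normalize(query)
print(round(dot(q, normalize(short_match)), 3))   # 0.994
print(round(dot(q, normalize(long_other)), 3))    # 0.447
```

The same normalisation has to be applied to the stored vectors and to every query, otherwise the scores are no longer comparable.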

❌ Expecting 100% accuracy from an approximate index

ANN indexes (HNSW, IVF, LSH) may miss a true neighbour now and then — that's the speed/accuracy tradeoff, not a bug.

✅ Fix: tune recall (e.g. HNSW efSearch, IVF nprobe), or use a flat/brute-force index when exactness is mandatory and the collection is small.
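Before tuning, it helps to measure how much the approximate index actually misses. One common way is recall@k: run the same query through a flat (exact) index and through the ANN index, then compare the two id lists. The sketch below assumes you already have both lists; the ids are invented and `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(exact_ids, approx_ids):
    # Fraction of the true top-k that the approximate search also returned.
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

exact_top5 = ["d1", "d2", "d3", "d4", "d5"]    # from a brute-force search
approx_top5 = ["d1", "d2", "d4", "d5", "d9"]   # made-up ANN result that missed d3

print(recall_at_k(exact_top5, approx_top5))   # 0.8
```

Averaging this over a sample of real queries gives a number you can watch while adjusting settings such as efSearch or nprobe.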

❌ Dimension mismatch

AssertionError: d == self.d
# (or) ValueError: query dim 768 != index dim 1536

Your query vector has a different length than the stored vectors — often from mixing two embedding models.

✅ Fix: embed queries and documents with the same model, and create the index with that exact dimension (e.g. 1536 for text-embedding-3-small).
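One way to catch this early is a small guard that checks the query length before it reaches the index. This is a sketch with a hypothetical helper name (`check_dim`); the error text simply mirrors the example message above.

```python
def check_dim(query, index_dim):
    # Fail fast with a readable message instead of a low-level assertion.
    if len(query) != index_dim:
        raise ValueError(f"query dim {len(query)} != index dim {index_dim}")
    return query

index_dim = 3
check_dim([0.9, 0.1, 0.0], index_dim)    # passes silently

try:
    check_dim([0.9, 0.1], index_dim)     # wrong length
except ValueError as err:
    print(err)                           # query dim 2 != index dim 3
```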

📋 Quick Reference

Metric	Measures	Use when
Cosine	Angle (direction)	Text embeddings — safe default
Dot product	Direction + magnitude	Normalised vectors (= cosine, fast)
Euclidean (L2)	Straight-line distance	Image features; normalise first

Search method	Speed	Typical recall (rough; depends on tuning)	Notes
Flat (brute force)	Slow at scale	100% exact	Great up to ~tens of thousands
IVF	Fast	~95%	Cluster then search clusters
HNSW	Very fast	~98%	Graph-based; best tradeoff
LSH	Very fast	~85%	Hash-based, low memory
PQ	Fast	~90%	Compresses vectors, tiny memory

❓ Frequently Asked Questions

Q: What is a vector database?

A: A vector database stores embeddings (lists of numbers that capture meaning) and finds the items whose vectors are closest to a query vector. It powers semantic search, RAG, and recommendations — matching by meaning rather than exact keywords.

Q: Cosine vs dot product vs Euclidean — which similarity metric should I use?

A: Cosine similarity compares direction and ignores length, so it is the safe default for text embeddings. The dot product factors in magnitude and equals cosine once vectors are L2-normalised, which is why most databases normalise and use inner product. Euclidean (L2) distance measures straight-line distance and is common for image features. Pick the metric your embedding model was trained with.

Q: What is approximate nearest neighbour (ANN) search, and how does HNSW work?

A: ANN trades a tiny bit of accuracy for a huge speed gain by not scanning every vector. HNSW (Hierarchical Navigable Small World) builds a layered graph: you enter at a sparse top layer to jump across the space, then descend into denser layers to refine, greedily hopping to closer neighbours. It typically reaches ~98% recall while being orders of magnitude faster than brute force.

Q: When is brute-force (exact) search good enough versus an index?

A: Brute force compares the query to every vector — O(n), 100% exact, and perfectly fine up to roughly tens of thousands of vectors. Once you reach hundreds of thousands or millions, build an index (HNSW or IVF) so latency stays low. The tradeoff is exactness for speed and memory.

Q: Which vector database should I choose?

A: Use Chroma for local prototyping (zero setup). Use pgvector if you already run PostgreSQL and have under a few million vectors. Use FAISS when you want a fast self-hosted library and control the storage yourself. Use Pinecone for a fully managed cloud service, or Weaviate when you need multimodal search and built-in filtering. Always benchmark with your own data.

Q: Why must I normalise vectors before cosine similarity?

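A: Cosine similarity already divides by each vector's length, so the formula itself is unaffected by magnitude. The catch is that many indexes compute a plain dot product (inner product) for speed, and there unnormalised vectors with large magnitudes skew the ranking. L2-normalising every embedding, and the query, to length 1 makes the dot product equal the cosine, keeping comparisons consistent and fast.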

🎯 Mini-Challenge: build a tiny semantic search engine

The scaffolding is gone now: only an outline is given. Put the whole loop together yourself: build a small collection, write cosine similarity, and return the top-k IDs for a query.

Make a dict of at least 4 ids → 3-number embeddings.
Write cosine_similarity(a, b) from dot() and norm().
Write search(query, docs, top_k) returning the best ids.
Run a query and print the result.

Mini-Challenge: Semantic Search

Build the full nearest-neighbour search from an outline

Python

# 🎯 MINI-CHALLENGE: build a tiny semantic search engine
# The scaffolding is gone: only the outline is given. Write the code yourself.
#
# 1. Make a dict 'docs' of at least 4 ids -> 3-number embeddings.
# 2. Write cosine_similarity(a, b) using dot() and norm() (math.sqrt).
# 3. Write search(query, docs, top_k) that returns the top_k ids by cosine.
# 4. Run a query and print the result.
#
# ✅ Expected (for query [1,0,0], the fruity/red-ish ids should come first):
#    e.g. ['apple', 'orange']

import 
...

🎉 Lesson Complete!

You can now explain what an embedding is, pick between cosine, dot product, and Euclidean, write a brute-force nearest-neighbour search by hand, reason about ANN and HNSW, filter by metadata, and choose between FAISS, Pinecone, Chroma, pgvector, and Weaviate.

🚀 Up next: Model Evaluation — measure how good your models and retrieval pipelines actually are with the right metrics.
