Lesson 37 • Advanced
Vector Databases & Embeddings
Store and search embeddings at scale — learn vector arithmetic, approximate nearest neighbors, and how to choose between FAISS, Pinecone, and ChromaDB.
✅ What You'll Learn
- Embeddings: converting text/images to vectors
- Vector arithmetic and cosine similarity
- Search algorithms: brute force, IVF, HNSW, LSH
- Choosing the right vector database for your use case
🗄️ Databases for AI
🎯 Real-World Analogy: Traditional databases are like a library organized by Dewey Decimal — great for finding a specific book by title or author. Vector databases are like a librarian who understands meaning: "Find me books similar to Harry Potter" returns fantasy novels with magic systems, even if they share no keywords. This semantic search is what powers RAG, recommendation engines, and image similarity.
Vector databases store high-dimensional embeddings and enable similarity search — finding the closest vectors to a query. This is the foundation of RAG systems, semantic search, recommendation engines, and anomaly detection.
Try It: Vector Embeddings
See how similar words have similar vectors — and do vector arithmetic
import numpy as np
# Embeddings: Converting Anything to Vectors
# Similar items have similar vectors (close in space)
np.random.seed(42)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Simulated word embeddings (in practice: use sentence-transformers)
embeddings = {
    "king": np.array([0.8, 0.2, 0.9, 0.1, 0.7]),
    "queen": np.array([0.8, 0.2, 0.9, 0.9, 0.7]),
    "man": np.array([0.5, 0.1, 0.3, 0.1, 0.4]),
    "woman": np.array([0.5, 0.1, 0.3, 0.9, 0.4]),  # values chosen so king - man + woman == queen
}

# Similar words score close to 1.0
print(f"king vs queen: {cosine_similarity(embeddings['king'], embeddings['queen']):.3f}")
print(f"king vs man:   {cosine_similarity(embeddings['king'], embeddings['man']):.3f}")

# Vector arithmetic: king - man + woman lands near queen
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(f"(king - man + woman) vs queen: {cosine_similarity(analogy, embeddings['queen']):.3f}")
Try It: Vector Search
Compare exact vs approximate nearest neighbor search algorithms
import numpy as np
# Vector Search: Finding Nearest Neighbors at Scale
# The core operation of vector databases
np.random.seed(42)
def brute_force_search(query, vectors, top_k=3):
    """Exact nearest neighbor search — O(n)"""
    similarities = [np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)) for v in vectors]
    indices = np.argsort(similarities)[-top_k:][::-1]
    return indices, [similarities[i] for i in indices]

def lsh_hash(vector, planes):
    """Locality-Sensitive Hashing: record which side of each random hyperplane the vector falls on."""
    return tuple((planes @ vector > 0).astype(int))

# 1,000 database vectors; brute force scans every one of them
vectors = np.random.randn(1000, 64)
query = np.random.randn(64)
indices, scores = brute_force_search(query, vectors)
print("Top matches:", indices, [f"{s:.3f}" for s in scores])

# LSH: vectors hashing to the same bucket are likely similar,
# so search only needs to scan one bucket instead of everything
planes = np.random.randn(8, 64)
print("Query bucket:", lsh_hash(query, planes))
⚠️ Common Mistake: Not normalizing vectors before similarity search. Inner-product search on vectors of mixed magnitudes rewards long vectors, not similar ones. L2-normalize your embeddings once at index time; then inner product and cosine similarity are identical, and search is both correct and fast.
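A minimal numpy sketch of that fix: normalize once at index time, and a single matrix-vector product then gives exactly the cosine scores. The data here is random, purely to show the equivalence.

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(size=(5, 8)) * rng.uniform(0.1, 10.0, size=(5, 1))  # wildly mixed magnitudes

# Normalize once at index time...
normed = raw / np.linalg.norm(raw, axis=1, keepdims=True)
query = rng.normal(size=8)
q = query / np.linalg.norm(query)

# ...then one matrix-vector product per query gives exact cosine scores
inner = normed @ q
cosine = np.array([np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)) for v in raw])
print(np.allclose(inner, cosine))  # True
```

This is also why many vector databases let you pick "inner product" as the distance metric: on pre-normalized embeddings it is cosine similarity, minus the per-query norm computations.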
💡 Pro Tip: Start with ChromaDB for prototyping (runs locally, zero setup). For production with <1M vectors, use pgvector (if you already have PostgreSQL). For >10M vectors, use Pinecone (managed) or Qdrant (self-hosted). Always benchmark with your actual data before choosing.
📋 Quick Reference
| Algorithm | Speed | Accuracy | Memory |
|---|---|---|---|
| Flat (brute force) | Slow | 100% exact | Low |
| IVF | Fast | ~95% | Medium |
| HNSW | Very fast | ~98% | High |
| LSH | Very fast | ~85% | Low |
| PQ (Product Quantization) | Fast | ~90% | Very low |
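To make the IVF row concrete, here is a toy inverted-file index in plain numpy: cluster the normalized vectors with a few k-means steps, then scan only the `n_probe` cells nearest the query instead of the whole database. Names like `ivf_search` and `n_probe` are illustrative, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 32)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Build: partition vectors into n_lists cells via a few k-means steps
n_lists = 40
centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
for _ in range(5):
    assign = np.argmax(vectors @ centroids.T, axis=1)
    for c in range(n_lists):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
assign = np.argmax(vectors @ centroids.T, axis=1)
inverted_lists = [np.flatnonzero(assign == c) for c in range(n_lists)]

def ivf_search(query, n_probe=4, top_k=3):
    """Approximate search: scan only the n_probe cells nearest the query."""
    q = query / np.linalg.norm(query)
    cells = np.argsort(q @ centroids.T)[-n_probe:]
    candidates = np.concatenate([inverted_lists[c] for c in cells])
    sims = vectors[candidates] @ q
    order = np.argsort(sims)[-top_k:][::-1]
    return candidates[order], sims[order]

ids, scores = ivf_search(rng.normal(size=32))
print("Approximate top-3 ids:", ids, "scores:", np.round(scores, 3))
```

Raising `n_probe` recovers accuracy at the cost of speed; that recall/latency knob is what production IVF indexes (e.g. FAISS's `IndexIVFFlat` with `nprobe`) expose.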
🎉 Lesson Complete!
You now understand vector databases — the engine behind semantic search and RAG! Next, learn comprehensive model evaluation metrics.