Sharding & Distributed SQL
Scale SQL horizontally across multiple servers with sharding strategies and distributed databases.
What You'll Learn
- Sharding vs partitioning: when single-server isn't enough
- Hash, range, and geographic shard key strategies
- Vitess for MySQL sharding at scale
- CockroachDB / YugabyteDB for auto-distributed SQL
- Solving resharding, cross-shard joins, and distributed transactions
Sharding vs Partitioning
Partitioning splits a table within one server (vertical filing cabinet). Sharding splits data across multiple servers (multiple filing cabinets in different offices). You shard when a single machine can't handle the load, typically at billions of rows or thousands of writes per second.
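To make the distinction concrete, here is what partitioning looks like in DDL (an illustrative PostgreSQL sketch; the table and partition names are made up). Sharding has no equivalent single statement, because the split spans servers:

```sql
-- PARTITIONING: one server, one table split into pieces (PostgreSQL syntax)
CREATE TABLE events (
    id BIGINT,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- SHARDING has no single DDL statement: the split across servers is
-- handled by middleware (Vitess, Citus) or by the database itself
-- (CockroachDB), not by the table definition alone.
```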
Sharding Fundamentals
Hash, range, and geographic sharding strategies
-- SHARDING: splitting data across MULTIPLE database servers
-- vs PARTITIONING: splitting within a SINGLE server
-- Why shard?
-- • Single server can't handle the write load
-- • Dataset exceeds one server's storage
-- • Need geographic distribution for latency
-- Common shard key strategies:
-- 1. HASH-BASED SHARDING
-- shard_id = hash(user_id) % num_shards
-- Pros: even distribution
-- Cons: cross-shard queries, resharding is painful
-- Example: 4 database servers
-- User 101 → hash(101)
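The modulo routing sketched above can be written out in a few lines. This is a generic illustration, not any particular system's API; the MD5-based hash and four-shard count are arbitrary choices:

```python
import hashlib

NUM_SHARDS = 4  # e.g. 4 database servers

def shard_for(user_id: int) -> int:
    """Route a user to a shard: shard_id = hash(user_id) % num_shards."""
    # A stable hash (not Python's process-randomized hash()) so routing
    # is consistent across application servers and restarts.
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Every process computes the same shard for the same user:
print(shard_for(101))                    # some shard in 0..3
print(shard_for(101) == shard_for(101))  # True
```

Any stable hash works; what matters is that every client agrees on the function and the shard count, since both together define where each row lives.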
...
Choosing a Shard Key
The shard key is the column that determines which server stores each row. Choose wrong and you get hotspots (one shard doing all the work) or constant cross-shard queries (defeating the purpose of sharding).
Pro Tip: The Co-Location Rule
Data that's queried together should live on the same shard. In a SaaS app, shard by tenant_id so all of a tenant's data is on one server. Cross-tenant analytics can use a separate analytical replica.
Shard Key Selection
Good vs bad shard keys and the co-location principle
-- Choosing a shard key is the MOST important decision
-- ✅ GOOD shard keys:
-- • user_id (most queries are per-user)
-- • tenant_id (multi-tenant SaaS)
-- • region (geographic locality)
-- ❌ BAD shard keys:
-- • created_at (all new writes hit one shard)
-- • auto-increment ID (sequential = hotspot)
-- • status (low cardinality, uneven distribution)
-- The "co-location" principle:
-- Data that's queried together should live together
-- Example: E-commerce
-- Shard by customer_id:
-- ✅ Orders
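The hotspot problem is easy to demonstrate. A hypothetical sketch using the same modulo-hash routing as earlier examples: routing by a timestamp bucket funnels every current write to one shard, while a high-cardinality user_id spreads the same writes evenly:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def hash_shard(key) -> int:
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# BAD: new rows keyed by the current time bucket all hash to the same
# value, so every write in this hour lands on one shard (a hotspot).
current_hour = "2024-06-01T10"
hot = Counter(hash_shard(current_hour) for _ in range(1000))
print(hot)  # all 1000 writes on a single shard

# GOOD: hashing a high-cardinality user_id spreads the same 1000 writes.
even = Counter(hash_shard(user_id) for user_id in range(1000))
print(even)  # roughly 250 writes per shard
```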
...
Vitess: MySQL Sharding Middleware
Vitess sits between your application and MySQL, routing queries to the correct shard automatically. It powers YouTube, Slack, and GitHub. Your application writes normal SQL; Vitess handles the complexity.
Vitess Configuration
Define VSchema and let Vitess route queries
-- Vitess: MySQL sharding middleware (used by YouTube, Slack)
-- Sits between your app and MySQL, handles routing
-- VSchema defines how tables are sharded:
-- {
-- "sharded": true,
-- "vindexes": {
-- "hash": { "type": "hash" }
-- },
-- "tables": {
-- "users": {
-- "column_vindexes": [{
-- "column": "id",
-- "name": "hash"
-- }]
-- },
-- "orders": {
-- "column_vindexes": [{
-- "column": "user_id",
-- "name": "hash"
--
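Conceptually, a vindex maps a column value to a fixed-width keyspace ID, and each shard owns a contiguous range of those IDs. A simplified stand-in (real Vitess's hash vindex uses a different function; the two-shard split "-80"/"80-" follows Vitess's shard-naming convention):

```python
import hashlib

def keyspace_id(value: int) -> bytes:
    # Stand-in hash: Vitess's "hash" vindex uses a different function,
    # but the idea is the same fixed-width keyspace ID.
    return hashlib.sha256(str(value).encode()).digest()[:8]

# With two shards named "-80" and "80-", each owns half the keyspace:
def shard_for(value: int) -> str:
    return "-80" if keyspace_id(value)[0] < 0x80 else "80-"

# Because users.id and orders.user_id use the SAME vindex, a user and
# all of that user's orders land on the same shard (co-location).
uid = 101
assert shard_for(uid) == shard_for(uid)
print(shard_for(uid))
```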
...
Distributed SQL (CockroachDB / YugabyteDB)
Distributed SQL databases handle sharding, replication, and distributed transactions automatically. You write standard PostgreSQL; the database distributes data and coordinates transactions across nodes.
Common Mistake
Assuming distributed SQL has the same latency as single-node. Network round trips between nodes add milliseconds to every transaction. Design your schema to minimize cross-node queries.
Distributed SQL
CockroachDB with geo-pinning and follower reads
-- CockroachDB / YugabyteDB: Distributed SQL
-- PostgreSQL-compatible, auto-sharding, auto-replication
-- Create a table; CockroachDB handles distribution
CREATE TABLE products (
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category VARCHAR(50),
price DECIMAL(10,2),
region VARCHAR(20)
);
-- Data is automatically split into "ranges" (~512MB each)
-- and distributed across nodes
-- Pin data to specific regions for compliance:
ALTER TABLE produc
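A hedged sketch of region pinning and follower reads, assuming an enterprise CockroachDB cluster whose nodes are started with region localities, and a primary key that leads with region (CockroachDB partitions on an index-prefix column); partition and region names are illustrative:

```sql
-- Pin European rows to European nodes, e.g. for data-residency rules:
ALTER TABLE products PARTITION BY LIST (region) (
    PARTITION eu VALUES IN ('eu'),
    PARTITION us VALUES IN ('us')
);
ALTER PARTITION eu OF TABLE products
    CONFIGURE ZONE USING constraints = '[+region=eu-west-1]';

-- Follower reads: accept slightly stale data in exchange for reading
-- from the nearest replica instead of a possibly remote leaseholder:
SELECT * FROM products
    AS OF SYSTEM TIME follower_read_timestamp()
    WHERE category = 'books';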
...
Real-World Challenges
Sharding introduces complexity that doesn't exist in single-server setups. Understanding these challenges upfront saves weeks of debugging later.
Sharding Challenges
Resharding, cross-shard joins, and distributed IDs
-- Real-world sharding challenges and solutions
-- CHALLENGE 1: Resharding (adding more shards)
-- Problem: hash(id) % 4 → hash(id) % 8
-- Solution A: Consistent hashing (minimize data movement)
-- Solution B: Double-write during migration
-- Solution C: Use Vitess/CockroachDB (automatic)
-- CHALLENGE 2: Cross-shard JOINs
-- Problem: orders on shard A, products on shard B
-- Solutions:
-- A) Denormalize: copy product_name into orders table
-- B) Reference tables: small tables replicated to all
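Solution A, consistent hashing, places shards on a hash ring so that adding a shard only moves the keys that fall on the new shard's arc. A minimal sketch (one ring point per shard; production rings use many virtual nodes per shard for smoother balance):

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring: a key belongs to the first shard
    point clockwise from the key's own hash position."""
    def __init__(self, shards):
        self.points = sorted((h(s), s) for s in shards)

    def shard_for(self, key: str) -> str:
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

before = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
after = HashRing(["shard-0", "shard-1", "shard-2", "shard-3", "shard-4"])

# Keys that don't fall on shard-4's arc keep their old shard; compare
# with naive hash(id) % N, where going 4 → 5 relocates ~80% of keys.
moved = sum(before.shard_for(str(k)) != after.shard_for(str(k))
            for k in range(10_000))
print(f"{moved / 10_000:.0%} of keys moved")
```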
...
Quick Reference
| Solution | Type | Best For |
|---|---|---|
| Vitess | Middleware | MySQL at YouTube scale |
| CockroachDB | Distributed SQL | Auto-sharding, PostgreSQL compat |
| YugabyteDB | Distributed SQL | PostgreSQL + Cassandra hybrid |
| Citus | PG Extension | Shard PostgreSQL tables |
| DuckDB | Embedded OLAP | Analytics on local data |
Lesson Complete!
You now understand how to scale SQL horizontally with sharding and distributed databases. Next, explore materialized views and caching!