Lesson 46 • Advanced Track
Performance Testing & Benchmarking
By the end of this lesson you'll measure a single query with EXPLAIN ANALYZE, load-test a whole server with pgbench or sysbench, design a benchmark whose numbers you can actually trust, and read latency percentiles (p50/p95/p99) to find the slow 1% your users notice. This is how you turn "it feels slow" into a number you can fix.
What You'll Learn
- ✓Measure one query with EXPLAIN ANALYZE — planning vs execution time
- ✓Spot stale stats by comparing estimated rows vs actual rows
- ✓Load-test a server with pgbench (Postgres) and sysbench (MySQL)
- ✓Design a fair benchmark: warm cache, real data, repeat runs
- ✓Read p50/p95/p99 latency instead of trusting averages
- ✓Locate the bottleneck — CPU, I/O, or locks
EXPLAIN ANALYZE against real data, copy your query into a free playground like db-fiddle.com or a local Postgres. The load tools (pgbench, sysbench) are command-line programs you run in a terminal, shown here as commented commands. Every example states its expected output so you can check yourself.🏎️ Why Benchmark At All?
🏁 Real-world analogy
Benchmarking is a car's crash test and dyno run combined. EXPLAIN ANALYZE is putting one part on a test bench to see exactly where it strains. A load test with pgbench is driving the whole car flat-out on a track with a stopwatch. You never ship the car because it "felt fine" in the driveway — and you never ship a query because it was fast on your tiny laptop dataset.
Benchmarking turns vague worry into evidence. It answers three questions precisely: how many transactions per second can this database sustain, which queries fall apart at production scale, and did that index I just added actually help — or did I just convince myself it did?
1. Measure One Query: EXPLAIN ANALYZE
Before you stress-test a whole server, profile the single query you suspect. EXPLAIN shows the plan the database intends to use without running it. EXPLAIN ANALYZE actually runs the query and reports two extra things: the real timings and the real row counts at every step.
Two pairs of numbers do all the work. Planning Time vs Execution Time tells you whether the database spent its time choosing a plan or running it — for normal queries it's nearly all execution. Estimated rows vs actual rows tells you whether the planner's statistics are accurate; a large gap means the planner is guessing badly and is likely picking a bad plan.
EXPLAIN ANALYZE executes the query for real — including INSERT/UPDATE/DELETE. Wrap writes in a transaction you roll back, or you'll change your data while "just measuring".EXPLAIN ANALYZE a single query
Add real timings and real row counts to a query.
-- EXPLAIN shows the PLAN the database INTENDS to use (no run, instant).
-- EXPLAIN ANALYZE actually RUNS the query and adds real timings + real row counts.
EXPLAIN ANALYZE
SELECT customer_id, SUM(total) AS lifetime_value
FROM orders
WHERE status = 'delivered'
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 10;
-- Read the output bottom-up. Two numbers matter most on every line:
-- cost=... the planner's GUESS before running
-- rows=N ESTIMATED rows
-- (actu
...Here's a real plan. Read it bottom-up — the innermost (most indented) node runs first, and its rows flow up into the node above it:
Reading an EXPLAIN ANALYZE plan
Estimated vs actual rows, Seq Scan, planning vs execution time.
-- A real EXPLAIN ANALYZE plan (read it bottom-up):
Limit (cost=18450.2..18450.3 rows=10 width=12)
(actual time=212.4..212.5 rows=10 loops=1)
-> Sort (cost=18450.2..18460.9 rows=4280 width=12)
(actual time=212.4..212.4 rows=10 loops=1)
Sort Key: (sum(total)) DESC
-> HashAggregate (cost=... rows=4280 width=12)
(actual time=205.1..210.8 rows=4192 loops=1)
-> Seq Scan on orders (cost=0..14210 rows=51000 width=12)
...The three signals to scan for
Seq Scan on a big table + many "Rows Removed by Filter" → the query read the whole table and threw most of it away. A well-chosen index usually fixes this.
Estimated rows=N far from actual rows=M → stale statistics. Run ANALYZE your_table; so the planner can choose better.
Execution Time >> Planning Time → the cost is in running the query (the normal case). If Planning Time is huge instead, you may have too many joins or partitions for the planner to reason about cheaply.
Your Turn: measure it, then read the rows
Turn the plain query into a measured one, then compare estimated vs actual rows. The expected output is in the comments so you can check yourself.
🎯 Your Turn: add EXPLAIN ANALYZE
Add the two keywords, then read estimated rows vs actual rows.
-- 🎯 YOUR TURN — turn this plain query into a measured one.
-- Goal: find out whether the planner's row estimate matches reality.
___ ___ -- 👉 add the two keywords that RUN the query and report real timings
SELECT product_id, COUNT(*) AS times_ordered
FROM order_items
WHERE quantity >= 3
GROUP BY product_id;
-- ✅ Expected: the plan now ends with "Planning Time" and "Execution Time" lines,
-- and each node shows "(actual ... rows=M)" next to its "rows=N" estimate.
-- If estimated
...2. Load-Test the Whole Server: pgbench & sysbench
One fast query proves little. Real systems fail when many clients hit the database at once and start competing for CPU, disk, and locks. A load test simulates that crowd. Two tools dominate:
pgbenchships with PostgreSQL. It builds a dataset (-i -s scale), then runs N concurrent clients (-c) for a fixed time (-T), reporting transactions per second and latency. Point it at your query with-f script.sql.sysbenchis the MySQL/MariaDB equivalent (it also covers Postgres). Itsoltp_read_writeworkload mimics a realistic mix of reads and writes and prints percentile latencies out of the box.
The headline number is TPS (transactions per second) — throughput. But the number that predicts complaints is the latency distribution, which both tools also print. You'll learn to read it in Section 4.
pgbench and sysbench commands
Initialise data, run concurrent clients, report TPS and latency.
-- LOAD TESTING = many simulated clients hammering the DB at once.
-- A single query can be fast while the server collapses under 200 concurrent users.
-- ── PostgreSQL: pgbench (ships with Postgres) ──
-- 1) Build a realistic dataset. Scale 100 ≈ 10M rows ≈ 1.5GB:
-- pgbench -i -s 100 shopdb
-- 2) Run: 20 clients, 4 threads, 60 seconds, latency report on:
-- pgbench -c 20 -j 4 -T 60 -r shopdb
-- 3) Run YOUR query, not the built-in toy one, with a custom script:
-- pgbench -c 20 -j 4 -
...3. Designing a Fair Benchmark
A benchmark is only as good as its setup. The most common mistake is measuring something that has nothing to do with production and then making decisions on it. Five rules keep your numbers honest:
- Warm the cache. The first run reads from disk; later runs read from RAM and can be 100x faster. Decide whether you're measuring warm or cold, then discard the throwaway first run so every comparison is apples-to-apples.
- Use realistic data volume. A query that's instant on 1,000 rows can crawl on 10 million, because the planner switches strategies as tables grow. Generate production-sized data so the plan you benchmark is the plan prod will run.
- Use representative queries in their real proportions — mostly reads, some writes, your actual hot queries — not
SELECT 1. - Repeat the runs (5+). A single measurement can be wrecked by a background backup or autovacuum. Report the median.
- Change one thing at a time. Add the index, re-run the same benchmark, compare. Change two things and you can't attribute the difference.
Build a trustworthy benchmark
Warm cache, generate realistic data, isolate one change.
-- DESIGN A FAIR BENCHMARK — five rules, or your numbers are fiction.
-- 1) WARM THE CACHE. The first run reads from disk (slow); later runs read from
-- RAM (fast). Pick one and be consistent. To warm Postgres deliberately:
-- SELECT pg_prewarm('orders');
-- Then discard the first run as a throwaway.
-- 2) REALISTIC DATA VOLUME. Test on production-sized data, not 100 rows.
-- Generate 5M realistic orders so query plans match what prod will choose:
INSERT INTO orders (customer_id
...4. Read Percentiles, Not Averages
An average hides your worst experiences. If 99 requests take 5ms and one takes 2,000ms, the average is still a comfortable 25ms — but one user in a hundred waited two full seconds. That's why you read percentiles:
- p50 (median) — half of requests were at least this fast. The "typical" experience.
- p95 — 95% were this fast; the slow 5% were worse.
- p99 — the tail. The slowest 1% — and at a million requests a day, that's 10,000 angry users.
Below is the kind of summary pgbench and sysbench print, and which you can compute yourself from logged timings using percentile_cont:
Result — load-test summary for the 'checkout' transaction:
| avg_ms | p50_ms | p95_ms | p99_ms | tps |
|---|---|---|---|---|
| 7.8 | 6.1 | 40 | 410 | 980 |
Look at that row. The average (7.8ms) and median (6.1ms) look excellent — but p99 is 410ms, ~50x the average. Something is going badly wrong for 1% of requests, and the average completely hid it. Always report p95/p99 next to the average.
Compute p50/p95/p99 yourself
Use percentile_cont to expose the tail an average hides.
-- AVERAGES LIE. p99 = "99% of requests were at least this fast."
-- The slowest 1% is what users complain about, and averages hide it.
-- pgbench/sysbench print percentiles for you, but you can compute them
-- from your own logged query timings too:
SELECT
ROUND(AVG(duration_ms), 1) AS avg_ms,
ROUND(percentile_cont(0.50) WITHIN GROUP (ORDER BY duration_ms), 1) AS p50_ms,
ROUND(percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms), 1) AS
...5. Find the Bottleneck: CPU vs I/O vs Locks
Once a benchmark shows a problem, you need to know which resource is the limit. There are three usual suspects, and each has a tell:
- CPU-bound — cores pinned near 100% (
top/htop), TPS flat as you add clients. Cause: heavy sorts, aggregations, or function calls. Fix: better indexes so the engine touches fewer rows, or pre-compute. - I/O-bound — high disk read wait, lots of buffer reads in
EXPLAIN (ANALYZE, BUFFERS), CPU idle while latency climbs. Cause: data doesn't fit in RAM, Seq Scans on big tables. Fix: indexes, more memory, or faster storage. - Lock-bound — latency spikes (a tall p99 tail) while CPU and disk look idle. Cause: transactions waiting on each other's row locks. Inspect with
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';. Fix: shorter transactions, the right isolation level, less contended update patterns.
The percentile shape is your fastest clue: a fat p99 tail over a calm average, with idle hardware, almost always means lock contention — not slow queries.
Your Turn: pick the slow query
Two queries were load-tested. Use the percentiles — not the averages — to decide which one to fix first.
🎯 Your Turn: read the percentiles
Spot the tail-latency query hiding behind a healthy average.
-- 🎯 YOUR TURN — two queries were load-tested. Which is the real problem?
--
-- query avg_ms p95_ms p99_ms tps
-- ---------- ------ ------ ------ ------
-- search 6.2 9.1 11.0 3100
-- checkout 7.8 40.0 410.0 980
--
-- Fill in the blank: the query to fix first is the one whose p99 spikes
-- far above its average (a long tail), even if its average looks fine.
SELECT '___' AS fix_this_first; -- 👉 'search' or 'checkout' ?
-- ✅ Ex
...Common Errors (and the fix)
- Benchmarking on a cold cache or tiny data: a 2ms result on 1,000 rows tells you nothing about 10 million rows on a warm server. Generate production-sized data and discard the first warm-up run before you measure.
- Reporting the average instead of percentiles: a healthy 8ms average can hide a 400ms p99. Always show p95/p99 — that tail is what users actually feel.
- Trusting one run: a single measurement is noise. A backup or autovacuum kicking off mid-test can double it. Run 5+ times and report the median.
- Testing on a laptop, not prod-like hardware: your SSD, RAM, and core count differ wildly from production, so plans and limits differ too. Benchmark on hardware that resembles prod, or you're measuring your laptop.
- Using
EXPLAINwhen you meantEXPLAIN ANALYZE: plainEXPLAINshows only estimates and never runs the query, so it has no real timings or actual row counts. AddANALYZEto get the truth (and remember it executes the query). - Changing two things at once: adding an index and regenerating data in the same step means you can't tell which caused the change. Isolate one variable per run.
📘 Quick Reference
| Tool / Term | What it measures / does |
|---|---|
| EXPLAIN | Show the planned strategy + estimates (does NOT run) |
| EXPLAIN ANALYZE | RUN the query; report real timings + actual rows |
| (ANALYZE, BUFFERS) | Add disk/cache buffer reads — spot I/O pressure |
| ANALYZE table | Refresh stats so estimated rows match reality |
| pgbench -c -T -f | Postgres load test: N clients, T seconds, your script |
| sysbench oltp_read_write | MySQL/multi-DB load test with percentile output |
| tps | Transactions per second (throughput) |
| p50 / p95 / p99 | Latency percentiles — read these, not the average |
| percentile_cont(0.99) | Compute p99 from your own logged timings |
Frequently Asked Questions
Q: What's the difference between EXPLAIN and EXPLAIN ANALYZE?
EXPLAIN only predicts a plan with cost estimates and never runs the query, so it's instant and safe. EXPLAIN ANALYZE actually executes it and adds real timings and actual row counts — far more useful, but it runs the query (including writes), so wrap mutations in a rolled-back transaction.
Q: Why is my p99 so much higher than my average?
A long tail usually means lock contention, a cold cache on certain inputs, or a missing index that only bites for some queries. The average smooths it away; the p99 exposes the slowest 1% of requests, which is exactly what users complain about.
Q: How much test data do I really need?
Enough that the query planner makes the same choices it will in production — usually millions of rows, sized to your real tables. The danger isn't "too much" data; it's too little, which lets a Seq Scan look fine and hides the plan switch that happens at scale.
Q: Can I trust a benchmark run on my laptop?
For rough comparisons of two queries on the same machine, yes. For predicting production capacity, no — your laptop's CPU, RAM, and disk are nothing like the server. Benchmark on prod-like hardware before you size anything.
Mini-Challenge: Prove the Index Helped
Put it all together — measure, change one thing, re-measure fairly, and compare. Write it, then run it against real data in a playground to confirm.
🎯 Mini-Challenge
EXPLAIN ANALYZE before/after an index, with repeat runs and one variable.
-- 🎯 MINI-CHALLENGE: prove an index actually helped (fairly)
-- Using only what this lesson covered:
-- 1. Run EXPLAIN ANALYZE on a SELECT ... WHERE status = 'delivered'
-- and note the Execution Time and whether it's a Seq Scan.
-- 2. Create an index: CREATE INDEX idx_orders_status ON orders(status);
-- 3. Run the SAME EXPLAIN ANALYZE again, 3+ times (ignore the first warm-up run).
-- 4. Compare the median Execution Time and the scan type before vs after.
--
-- ✅ Expected: scan c
...🎉 Lesson Complete
- ✅
EXPLAIN ANALYZEruns a query and reports real timings + actual rows - ✅ Estimated-vs-actual rows reveals stale stats; planning-vs-execution time tells you where the cost is
- ✅
pgbenchandsysbenchload-test the whole server, not just one query - ✅ A fair benchmark warms the cache, uses realistic data, repeats runs, and changes one thing
- ✅ p50/p95/p99 expose the slow tail an average hides; idle hardware + a tall p99 means locks
- ✅ Next: designing Multi-Tenant Databases
Sign up for free to track which lessons you've completed and get learning reminders.