Advanced Track
Indexing Internals: B-Trees, Hash & Bitmap Indexes
By the end of this lesson you'll know what actually happens inside an index — how a B-tree turns a million-row scan into a handful of hops, when a hash or bitmap index wins, the difference between clustered and non-clustered storage, why a covering index can skip the table entirely, and the leftmost-prefix rule that decides whether your composite index gets used at all.
What You'll Learn
- ✓How a B-tree/B+tree gives O(log n) lookups, range scans, and free sorting
- ✓Why hash indexes are equality-only and bitmap indexes suit low-cardinality columns
- ✓Clustered vs non-clustered indexes — physical order vs pointers
- ✓Covering indexes and the index-only scan that never touches the table
- ✓The leftmost-prefix rule for composite (multi-column) indexes
- ✓How to read an EXPLAIN plan to confirm an index is actually used
CREATE INDEX, primary keys, and reading a query result. The in-browser editor lets you write and edit SQL; to run it, copy your query into a free playground like sqliteonline.com or db-fiddle.com (pick PostgreSQL to see real EXPLAIN plans). Every example shows the expected plan or result so you can check yourself. Our Sample Table: employees
Picture this employees table with 1,000,000 rows. Without an index, finding one salary means reading all million rows. The examples below build indexes on it and watch the plan change.
Result:
| id | name | department_id | hire_date | salary |
|---|---|---|---|---|
| 1 | Ada Lovelace | 5 | 2021-03-14 | 92000 |
| 2 | Alan Turing | 5 | 2019-06-23 | 88000 |
| 3 | Grace Hopper | 3 | 2020-11-02 | 75000 |
| 4 | Linus Torvalds | 7 | 2023-01-09 | 81000 |
| … | … | … | … | … |
1. B-Tree Indexes — The Workhorse
An index is a separate data structure the database keeps next to your table to find rows quickly — like the index at the back of a book. The default everywhere (PostgreSQL, MySQL, SQLite, SQL Server) is the B-tree (technically a B+tree: only the leaf level holds the data, and the leaves are chained together).
📖 Real-world analogy
A B-tree is a sorted phone book. The values are kept in order, so you never read it cover to cover. You open near the middle, see you've gone too far, jump back — a few hops and you're there. Each hop roughly halves what's left, which is why lookups are logarithmic: 1,000,000 rows take only about 20 comparisons (log₂ 1,000,000 ≈ 20), not a million.
How it works under the hood: internal nodes are signposts ("salaries from 50k–60k are down this branch"); the bottom leaf nodes hold the actual entries in sorted order and are linked left-to-right in a chain. That structure gives you three wins for free:
- Equality (
= 75000): walk down the tree to one leaf. - Range (
BETWEEN 50000 AND 80000): find the first matching leaf, then follow the leaf chain until you pass the upper bound. - Sorting (
ORDER BY salary): the leaves are already sorted, so no separate sort step is needed.
CREATE INDEX — a B-tree
Build a B-tree and use it for equality, range, and sorting.
-- A B-tree index keeps the indexed values in SORTED order,
-- so the database can binary-search them instead of reading
-- every row. This is the default in PostgreSQL, MySQL, SQLite...
CREATE INDEX idx_employees_salary
ON employees (salary);
-- Now an equality lookup walks down the tree (a few hops),
-- instead of scanning all 1,000,000 rows:
SELECT id, name, salary
FROM employees
WHERE salary = 75000;
-- A range scan finds the FIRST matching leaf, then walks the
-- sorted leaves left-to-rig
...2. Reading the EXPLAIN Plan
How do you prove the index is being used? Ask the planner with EXPLAIN. It prints the plan — the steps the database will take, as text. The two lines you care about most:
Seq Scan— a full table scan; every row read. The index was not used.Index Scan— it jumped through the index. That's what you want.
EXPLAIN — confirm the index is used
Read a plan as text: Index Scan good, Seq Scan bad.
-- Ask the planner what it would do (this text is the EXPLAIN
-- "plan" — read it bottom-up; the deepest node runs first):
EXPLAIN
SELECT id, name, salary FROM employees WHERE salary = 75000;
-- ✅ Expected plan (uses the index we just created):
-- Index Scan using idx_employees_salary on employees
-- Index Cond: (salary = 75000)
--
-- "Index Scan" = good — it jumped straight to the rows.
-- If you instead see "Seq Scan on employees", the index was
-- NOT used and every row was read one by
...Your Turn: pick the index type
Choose B-tree vs hash vs bitmap for the described column and justify it. The expected answer is in the comments so you can check yourself.
🎯 Your Turn: choose & justify
Fill in the index type and the one-line reason.
-- 🎯 YOUR TURN — fill in the blanks and justify your choice.
-- Scenario: the "orders" table has a "status" column with just
-- 4 possible values (pending, shipped, delivered, cancelled).
-- It is ONLY ever filtered with status = '...' and the table is
-- read far more often than it is written.
-- 1) Which index TYPE fits a low-cardinality, read-heavy,
-- equality-only column? (btree / hash / bitmap)
-- 👉 Replace ___ with one of those three words:
CREATE ___ INDEX idx_orders_status ON ord
...3. Hash Indexes — Equality Only
A hash index runs each value through a hash function and stores it in a bucket. Looking up an exact value is O(1) on average — jump straight to the bucket. The catch: hashing scrambles order, so there is no "next" value. That means equality only — no ranges, no ORDER BY, no prefix LIKE.
Think of a hash index as a coat check: hand over your ticket number and you get the exact coat instantly — but you can't ask for "all coats with tickets between 40 and 80", because the rack isn't in number order.
Hash index — = and nothing else
O(1) equality, but no range, sort, or LIKE.
-- A HASH index stores hash(value) -> row location in buckets.
-- Lookup is O(1) average for EQUALITY, but the hash destroys
-- ordering, so it can do = and nothing else.
CREATE INDEX idx_users_email_hash
ON users USING hash (email);
-- ✅ Great — exact match jumps to one bucket:
SELECT * FROM users WHERE email = 'alice@example.com';
-- ❌ A hash index CANNOT serve any of these (no order exists):
-- WHERE email > 'a' -- no range scan
-- WHERE email LIKE 'alice%' -- no prefix search
...4. Bitmap Indexes — Low Cardinality
Cardinality means "how many distinct values a column has". email is high-cardinality (nearly unique); status with four values is low-cardinality. A bitmap index stores one bit-string per distinct value — a row of 1s and 0s where bit position N means "row N has this value". To find "shipped AND west", the database just bitwise-ANDs two bit-strings, which is blazingly fast.
Imagine a wall of light switches, one per row. The "shipped" bitmap flips on every shipped order; the "west" bitmap flips on every western order. Overlay the two and the switches still on are exactly the rows you want.
Oracle creates bitmap indexes explicitly; PostgreSQL builds equivalent bitmap scans on the fly to combine several B-tree indexes. The downside is writes: every insert or update must flip bits across whole bit-strings, so bitmaps belong on read-heavy tables (reporting, data warehouses), not busy transactional tables.
Bitmap index — combine cheap filters
One bit-string per value; AND/OR them together.
-- A BITMAP index shines on LOW-CARDINALITY columns — ones with
-- only a handful of distinct values (status, is_active, region).
-- It stores one bit-string per distinct value; bit position N
-- means "row N has this value":
-- status='shipped' -> 1 0 0 1 1 0 ...
-- status='pending' -> 0 1 1 0 0 1 ...
-- (Oracle creates these explicitly; PostgreSQL builds them
-- on the fly when combining several B-tree indexes.)
CREATE BITMAP INDEX idx_orders_status ON orders (status);
-- The magic:
...5. Clustered vs Non-Clustered
A clustered index decides the physical order of the rows on disk — the table is the index, with full rows living in the leaves. Because rows can only be stored one way, a table can have exactly one clustered index (usually the primary key). A non-clustered index is a separate sorted structure whose leaves hold the indexed value plus a pointer back to the row; you can have many of these, but each lookup pays for a pointer hop to fetch the rest of the row.
A clustered index is a dictionary: the words and their definitions are physically printed in sorted order. A non-clustered index is the index at the back of a textbook: it lists terms with page numbers you must flip to.
Clustered vs non-clustered
Physical row order vs a sorted structure of pointers.
-- CLUSTERED index = the table rows are PHYSICALLY stored in the
-- index's order. There can be only ONE per table (rows can only
-- be sorted one way). The primary key is usually the cluster key.
-- A "leaf" of the clustered index IS the full row.
-- (SQL Server: CREATE CLUSTERED INDEX; MySQL/InnoDB clusters
-- on the PRIMARY KEY automatically.)
CREATE CLUSTERED INDEX idx_orders_pk ON orders (order_id);
-- NON-CLUSTERED index = a separate sorted structure whose leaves
-- hold the indexed
...6. Covering Indexes — Skip the Table
A covering index includes every column a query touches, so the database answers it from the index alone and never visits the table — an index-only scan. You keep the filter/sort columns as the index keys and tack on the columns you only need to return as the payload (via INCLUDE in PostgreSQL/SQL Server, or just by listing them in MySQL).
Covering index — Index Only Scan
INCLUDE the returned column so the table is never read.
-- A COVERING index contains EVERY column a query needs, so the
-- database answers the query from the index alone and never
-- touches the table — an "index-only scan".
-- This query needs customer_id (to filter) and total (to return):
SELECT total FROM orders WHERE customer_id = 42;
-- Put the filter column first, then INCLUDE the payload column:
CREATE INDEX idx_orders_cust_total
ON orders (customer_id) INCLUDE (total);
-- (PostgreSQL/SQL Server use INCLUDE; MySQL just lists both
-- column
...7. Composite Indexes & the Leftmost-Prefix Rule
A composite (multi-column) index on (A, B, C) is sorted by A first, ties broken by B, then C — exactly like a phone book sorted by (last name, then first name). The leftmost-prefix rule follows directly: the index can seek only on a prefix of its columns with no gaps. It serves filters on A, on A+B, or on A+B+C — but not on B alone or C alone, because without the leading column the entries are scattered all over the index.
💡 Rule of thumb — column order is everything
Put the columns you filter with = first, the column you range/sort on next, and payload-only columns last. Get the order wrong and the index sits unused while your query falls back to a Seq Scan.
Your Turn: which queries does it serve?
Given the composite index below, mark each query ✅ (index helps) or ❌ (it can't) using the leftmost-prefix rule.
🎯 Your Turn: leftmost-prefix
Decide which of four queries the (dept, hire_date, salary) index serves.
-- 🎯 YOUR TURN — leftmost-prefix rule.
-- A composite B-tree index is sorted by its FIRST column, then
-- ties broken by the second, then the third — like a phone book
-- sorted by (last_name, first_name). You can only use it from
-- the LEFT with no gaps.
CREATE INDEX idx_emp ON employees (department_id, hire_date, salary);
-- For EACH query, mark ✅ if the index helps or ❌ if it cannot,
-- 👉 by replacing each ___ :
-- a) WHERE department_id = 5
-- ___ (leftmost column present)
-- b) W
...Common Errors (and the fix)
- Hash index for a range query: you build
USING hash (price)then runWHERE price > 100and the plan shows aSeq Scan. Hashing has no order — use a B-tree for anything but exact equality. - Wrong composite column order: an index on
(hire_date, department_id)won't helpWHERE department_id = 5becausedepartment_idisn't the leftmost column. Reorder to put the most-filtered (usually the=) column first. - Over-indexing a write-heavy table: every
INSERT/UPDATE/DELETEmust update every index, so piling indexes onto an OLTP table makes writes crawl. Index the queries that matter, then drop unused ones (checkidx_scan = 0inpg_stat_user_indexes). - Low-selectivity B-tree index: indexing
is_active(only true/false) rarely helps — when a value matches ~half the rows the planner ignores the index because a scan is cheaper. Index high-selectivity columns; consider a bitmap for low-cardinality reporting. - Function on the indexed column:
WHERE LOWER(email) = 'a@b.com'can't use a plain index onemail— the function hides the raw value. Index the expression (CREATE INDEX ... (LOWER(email))) or store it normalized.
📘 Quick Reference
| Index type | Best for | Limitation |
|---|---|---|
| B-tree | Equality, range, sort, prefix LIKE | Wider/slower to update than needed if over-used |
| Hash | Exact equality (O(1)) | No range, sort, or LIKE |
| Bitmap | Low-cardinality + AND/OR filters | Costly writes — read-heavy tables only |
| Clustered | Physical row order (one per table) | Only one allowed; defines storage order |
| Covering | Index-only scan (return from index) | Extra storage; slower writes |
Frequently Asked Questions
Q: If indexes make reads so fast, why not index every column?
Because every index must be kept up to date on every write, and each one eats storage and memory. More indexes mean slower INSERT/UPDATE/DELETE and a bigger database. Index for the queries you actually run, then drop the ones the stats show are unused.
Q: What's the difference between a B-tree and a B+tree?
In a B+tree only the leaf level stores data, and the leaves are linked in a chain — which is what makes range scans and ordered reads fast. Databases say "B-tree" but almost all of them actually implement B+trees.
Q: My index exists but the query still does a Seq Scan — why?
Common causes: the filter doesn't match the leftmost prefix of a composite index; a function wraps the column (LOWER(col)); the column has low selectivity so a scan is genuinely cheaper; or table statistics are stale (run ANALYZE). Read the EXPLAIN plan to see which.
Q: Does the order of columns in a composite index matter?
Hugely. (A, B) and (B, A) are different indexes that serve different queries. Lead with the column(s) you filter by equality, then the range/sort column. The leftmost-prefix rule means only the leading columns can be seeked.
Mini-Challenge: Design the Optimal Index
Put it all together — equality first, sort column next, payload via INCLUDE. Write the CREATE INDEX, then check it against the expected idea in the comments.
🎯 Mini-Challenge
One index that seeks, sorts, and covers a hot query.
-- 🎯 MINI-CHALLENGE: design the optimal index for this workload.
-- This ONE query runs thousands of times per minute on a huge,
-- read-heavy "events" table:
--
-- SELECT event_time, payload
-- FROM events
-- WHERE tenant_id = ?
-- AND event_type = 'click'
-- ORDER BY event_time DESC
-- LIMIT 50;
--
-- Design a single index that:
-- 1. Seeks on the two equality columns (think leftmost-prefix
-- order: which columns are matched with = ?)
-- 2. Returns rows already in
...🎉 Lesson Complete
- ✅ A B-tree/B+tree keeps values sorted for O(log n) lookups, range scans, and free ordering
- ✅ Hash indexes are equality-only; bitmap indexes suit low-cardinality, read-heavy columns
- ✅ Clustered = physical row order (one per table); non-clustered = a sorted structure of pointers
- ✅ A covering index answers a query from the index alone (index-only scan)
- ✅ The leftmost-prefix rule decides which queries a composite index serves — column order is everything
- ✅ Read the
EXPLAINplan:Index Scangood,Seq Scanmeans your index was skipped - ✅ Next: Query Optimization — turn these plans into faster queries
Sign up for free to track which lessons you've completed and get learning reminders.