Skip to main content
    Back

    Advanced Track

    Indexing Internals: B-Trees, Hash & Bitmap Indexes

    By the end of this lesson you'll know what actually happens inside an index — how a B-tree turns a million-row scan into a handful of hops, when a hash or bitmap index wins, the difference between clustered and non-clustered storage, why a covering index can skip the table entirely, and the leftmost-prefix rule that decides whether your composite index gets used at all.

    What You'll Learn

    • How a B-tree/B+tree gives O(log n) lookups, range scans, and free sorting
    • Why hash indexes are equality-only and bitmap indexes suit low-cardinality columns
    • Clustered vs non-clustered indexes — physical order vs pointers
    • Covering indexes and the index-only scan that never touches the table
    • The leftmost-prefix rule for composite (multi-column) indexes
    • How to read an EXPLAIN plan to confirm an index is actually used

    Our Sample Table: employees

    Picture this employees table with 1,000,000 rows. Without an index, finding one salary means reading all million rows. The examples below build indexes on it and watch the plan change.

    Result:

    idnamedepartment_idhire_datesalary
    1Ada Lovelace52021-03-1492000
    2Alan Turing52019-06-2388000
    3Grace Hopper32020-11-0275000
    4Linus Torvalds72023-01-0981000

    1. B-Tree Indexes — The Workhorse

    An index is a separate data structure the database keeps next to your table to find rows quickly — like the index at the back of a book. The default everywhere (PostgreSQL, MySQL, SQLite, SQL Server) is the B-tree (technically a B+tree: only the leaf level holds the data, and the leaves are chained together).

    📖 Real-world analogy

    A B-tree is a sorted phone book. The values are kept in order, so you never read it cover to cover. You open near the middle, see you've gone too far, jump back — a few hops and you're there. Each hop roughly halves what's left, which is why lookups are logarithmic: 1,000,000 rows take only about 20 comparisons (log₂ 1,000,000 ≈ 20), not a million.

    How it works under the hood: internal nodes are signposts ("salaries from 50k–60k are down this branch"); the bottom leaf nodes hold the actual entries in sorted order and are linked left-to-right in a chain. That structure gives you three wins for free:

    • Equality (= 75000): walk down the tree to one leaf.
    • Range (BETWEEN 50000 AND 80000): find the first matching leaf, then follow the leaf chain until you pass the upper bound.
    • Sorting (ORDER BY salary): the leaves are already sorted, so no separate sort step is needed.

    CREATE INDEX — a B-tree

    Build a B-tree and use it for equality, range, and sorting.

    Try it Yourself »
    SQL
    -- A B-tree index keeps the indexed values in SORTED order,
    -- so the database can binary-search them instead of reading
    -- every row. This is the default in PostgreSQL, MySQL, SQLite...
    CREATE INDEX idx_employees_salary
    ON employees (salary);
    
    -- Now an equality lookup walks down the tree (a few hops),
    -- instead of scanning all 1,000,000 rows:
    SELECT id, name, salary
    FROM employees
    WHERE salary = 75000;
    
    -- A range scan finds the FIRST matching leaf, then walks the
    -- sorted leaves left-to-rig
    ...

    2. Reading the EXPLAIN Plan

    How do you prove the index is being used? Ask the planner with EXPLAIN. It prints the plan — the steps the database will take, as text. The two lines you care about most:

    • Seq Scan — a full table scan; every row read. The index was not used.
    • Index Scan — it jumped through the index. That's what you want.

    EXPLAIN — confirm the index is used

    Read a plan as text: Index Scan good, Seq Scan bad.

    Try it Yourself »
    SQL
    -- Ask the planner what it would do (this text is the EXPLAIN
    -- "plan" — read it bottom-up; the deepest node runs first):
    EXPLAIN
    SELECT id, name, salary FROM employees WHERE salary = 75000;
    
    -- ✅ Expected plan (uses the index we just created):
    --   Index Scan using idx_employees_salary on employees
    --     Index Cond: (salary = 75000)
    --
    -- "Index Scan" = good — it jumped straight to the rows.
    -- If you instead see "Seq Scan on employees", the index was
    -- NOT used and every row was read one by
    ...

    Your Turn: pick the index type

    Choose B-tree vs hash vs bitmap for the described column and justify it. The expected answer is in the comments so you can check yourself.

    🎯 Your Turn: choose & justify

    Fill in the index type and the one-line reason.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — fill in the blanks and justify your choice.
    -- Scenario: the "orders" table has a "status" column with just
    -- 4 possible values (pending, shipped, delivered, cancelled).
    -- It is ONLY ever filtered with status = '...' and the table is
    -- read far more often than it is written.
    
    -- 1) Which index TYPE fits a low-cardinality, read-heavy,
    --    equality-only column?  (btree / hash / bitmap)
    -- 👉 Replace ___ with one of those three words:
    CREATE ___ INDEX idx_orders_status ON ord
    ...

    3. Hash Indexes — Equality Only

    A hash index runs each value through a hash function and stores it in a bucket. Looking up an exact value is O(1) on average — jump straight to the bucket. The catch: hashing scrambles order, so there is no "next" value. That means equality only — no ranges, no ORDER BY, no prefix LIKE.

    Think of a hash index as a coat check: hand over your ticket number and you get the exact coat instantly — but you can't ask for "all coats with tickets between 40 and 80", because the rack isn't in number order.

    Hash index — = and nothing else

    O(1) equality, but no range, sort, or LIKE.

    Try it Yourself »
    SQL
    -- A HASH index stores hash(value) -> row location in buckets.
    -- Lookup is O(1) average for EQUALITY, but the hash destroys
    -- ordering, so it can do = and nothing else.
    CREATE INDEX idx_users_email_hash
    ON users USING hash (email);
    
    -- ✅ Great — exact match jumps to one bucket:
    SELECT * FROM users WHERE email = 'alice@example.com';
    
    -- ❌ A hash index CANNOT serve any of these (no order exists):
    --   WHERE email > 'a'          -- no range scan
    --   WHERE email LIKE 'alice%'  -- no prefix search
    ...

    4. Bitmap Indexes — Low Cardinality

    Cardinality means "how many distinct values a column has". email is high-cardinality (nearly unique); status with four values is low-cardinality. A bitmap index stores one bit-string per distinct value — a row of 1s and 0s where bit position N means "row N has this value". To find "shipped AND west", the database just bitwise-ANDs two bit-strings, which is blazingly fast.

    Imagine a wall of light switches, one per row. The "shipped" bitmap flips on every shipped order; the "west" bitmap flips on every western order. Overlay the two and the switches still on are exactly the rows you want.

    Oracle creates bitmap indexes explicitly; PostgreSQL builds equivalent bitmap scans on the fly to combine several B-tree indexes. The downside is writes: every insert or update must flip bits across whole bit-strings, so bitmaps belong on read-heavy tables (reporting, data warehouses), not busy transactional tables.

    Bitmap index — combine cheap filters

    One bit-string per value; AND/OR them together.

    Try it Yourself »
    SQL
    -- A BITMAP index shines on LOW-CARDINALITY columns — ones with
    -- only a handful of distinct values (status, is_active, region).
    -- It stores one bit-string per distinct value; bit position N
    -- means "row N has this value":
    --   status='shipped'   -> 1 0 0 1 1 0 ...
    --   status='pending'   -> 0 1 1 0 0 1 ...
    -- (Oracle creates these explicitly; PostgreSQL builds them
    --  on the fly when combining several B-tree indexes.)
    CREATE BITMAP INDEX idx_orders_status ON orders (status);
    
    -- The magic: 
    ...

    5. Clustered vs Non-Clustered

    A clustered index decides the physical order of the rows on disk — the table is the index, with full rows living in the leaves. Because rows can only be stored one way, a table can have exactly one clustered index (usually the primary key). A non-clustered index is a separate sorted structure whose leaves hold the indexed value plus a pointer back to the row; you can have many of these, but each lookup pays for a pointer hop to fetch the rest of the row.

    A clustered index is a dictionary: the words and their definitions are physically printed in sorted order. A non-clustered index is the index at the back of a textbook: it lists terms with page numbers you must flip to.

    Clustered vs non-clustered

    Physical row order vs a sorted structure of pointers.

    Try it Yourself »
    SQL
    -- CLUSTERED index = the table rows are PHYSICALLY stored in the
    -- index's order. There can be only ONE per table (rows can only
    -- be sorted one way). The primary key is usually the cluster key.
    -- A "leaf" of the clustered index IS the full row.
    --   (SQL Server: CREATE CLUSTERED INDEX; MySQL/InnoDB clusters
    --    on the PRIMARY KEY automatically.)
    CREATE CLUSTERED INDEX idx_orders_pk ON orders (order_id);
    
    -- NON-CLUSTERED index = a separate sorted structure whose leaves
    -- hold the indexed 
    ...

    6. Covering Indexes — Skip the Table

    A covering index includes every column a query touches, so the database answers it from the index alone and never visits the table — an index-only scan. You keep the filter/sort columns as the index keys and tack on the columns you only need to return as the payload (via INCLUDE in PostgreSQL/SQL Server, or just by listing them in MySQL).

    Covering index — Index Only Scan

    INCLUDE the returned column so the table is never read.

    Try it Yourself »
    SQL
    -- A COVERING index contains EVERY column a query needs, so the
    -- database answers the query from the index alone and never
    -- touches the table — an "index-only scan".
    
    -- This query needs customer_id (to filter) and total (to return):
    SELECT total FROM orders WHERE customer_id = 42;
    
    -- Put the filter column first, then INCLUDE the payload column:
    CREATE INDEX idx_orders_cust_total
    ON orders (customer_id) INCLUDE (total);
    -- (PostgreSQL/SQL Server use INCLUDE; MySQL just lists both
    --  column
    ...

    7. Composite Indexes & the Leftmost-Prefix Rule

    A composite (multi-column) index on (A, B, C) is sorted by A first, ties broken by B, then C — exactly like a phone book sorted by (last name, then first name). The leftmost-prefix rule follows directly: the index can seek only on a prefix of its columns with no gaps. It serves filters on A, on A+B, or on A+B+C — but not on B alone or C alone, because without the leading column the entries are scattered all over the index.

    💡 Rule of thumb — column order is everything

    Put the columns you filter with = first, the column you range/sort on next, and payload-only columns last. Get the order wrong and the index sits unused while your query falls back to a Seq Scan.

    Your Turn: which queries does it serve?

    Given the composite index below, mark each query ✅ (index helps) or ❌ (it can't) using the leftmost-prefix rule.

    🎯 Your Turn: leftmost-prefix

    Decide which of four queries the (dept, hire_date, salary) index serves.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — leftmost-prefix rule.
    -- A composite B-tree index is sorted by its FIRST column, then
    -- ties broken by the second, then the third — like a phone book
    -- sorted by (last_name, first_name). You can only use it from
    -- the LEFT with no gaps.
    CREATE INDEX idx_emp ON employees (department_id, hire_date, salary);
    
    -- For EACH query, mark ✅ if the index helps or ❌ if it cannot,
    -- 👉 by replacing each ___ :
    
    -- a) WHERE department_id = 5
    --    ___   (leftmost column present)
    
    -- b) W
    ...

    Common Errors (and the fix)

    • Hash index for a range query: you build USING hash (price) then run WHERE price > 100 and the plan shows a Seq Scan. Hashing has no order — use a B-tree for anything but exact equality.
    • Wrong composite column order: an index on (hire_date, department_id) won't help WHERE department_id = 5 because department_id isn't the leftmost column. Reorder to put the most-filtered (usually the =) column first.
    • Over-indexing a write-heavy table: every INSERT/UPDATE/DELETE must update every index, so piling indexes onto an OLTP table makes writes crawl. Index the queries that matter, then drop unused ones (check idx_scan = 0 in pg_stat_user_indexes).
    • Low-selectivity B-tree index: indexing is_active (only true/false) rarely helps — when a value matches ~half the rows the planner ignores the index because a scan is cheaper. Index high-selectivity columns; consider a bitmap for low-cardinality reporting.
    • Function on the indexed column: WHERE LOWER(email) = 'a@b.com' can't use a plain index on email — the function hides the raw value. Index the expression (CREATE INDEX ... (LOWER(email))) or store it normalized.

    📘 Quick Reference

    Index typeBest forLimitation
    B-treeEquality, range, sort, prefix LIKEWider/slower to update than needed if over-used
    HashExact equality (O(1))No range, sort, or LIKE
    BitmapLow-cardinality + AND/OR filtersCostly writes — read-heavy tables only
    ClusteredPhysical row order (one per table)Only one allowed; defines storage order
    CoveringIndex-only scan (return from index)Extra storage; slower writes

    Frequently Asked Questions

    Q: If indexes make reads so fast, why not index every column?

    Because every index must be kept up to date on every write, and each one eats storage and memory. More indexes mean slower INSERT/UPDATE/DELETE and a bigger database. Index for the queries you actually run, then drop the ones the stats show are unused.

    Q: What's the difference between a B-tree and a B+tree?

    In a B+tree only the leaf level stores data, and the leaves are linked in a chain — which is what makes range scans and ordered reads fast. Databases say "B-tree" but almost all of them actually implement B+trees.

    Q: My index exists but the query still does a Seq Scan — why?

    Common causes: the filter doesn't match the leftmost prefix of a composite index; a function wraps the column (LOWER(col)); the column has low selectivity so a scan is genuinely cheaper; or table statistics are stale (run ANALYZE). Read the EXPLAIN plan to see which.

    Q: Does the order of columns in a composite index matter?

    Hugely. (A, B) and (B, A) are different indexes that serve different queries. Lead with the column(s) you filter by equality, then the range/sort column. The leftmost-prefix rule means only the leading columns can be seeked.

    Mini-Challenge: Design the Optimal Index

    Put it all together — equality first, sort column next, payload via INCLUDE. Write the CREATE INDEX, then check it against the expected idea in the comments.

    🎯 Mini-Challenge

    One index that seeks, sorts, and covers a hot query.

    Try it Yourself »
    SQL
    -- 🎯 MINI-CHALLENGE: design the optimal index for this workload.
    -- This ONE query runs thousands of times per minute on a huge,
    -- read-heavy "events" table:
    --
    --   SELECT event_time, payload
    --   FROM   events
    --   WHERE  tenant_id = ?
    --     AND  event_type = 'click'
    --   ORDER BY event_time DESC
    --   LIMIT 50;
    --
    -- Design a single index that:
    --   1. Seeks on the two equality columns (think leftmost-prefix
    --      order: which columns are matched with = ?)
    --   2. Returns rows already in 
    ...

    🎉 Lesson Complete

    • ✅ A B-tree/B+tree keeps values sorted for O(log n) lookups, range scans, and free ordering
    • Hash indexes are equality-only; bitmap indexes suit low-cardinality, read-heavy columns
    • Clustered = physical row order (one per table); non-clustered = a sorted structure of pointers
    • ✅ A covering index answers a query from the index alone (index-only scan)
    • ✅ The leftmost-prefix rule decides which queries a composite index serves — column order is everything
    • ✅ Read the EXPLAIN plan: Index Scan good, Seq Scan means your index was skipped
    • Next: Query Optimization — turn these plans into faster queries

    Sign up for free to track which lessons you've completed and get learning reminders.

    Previous

    Cookie & Privacy Settings

    We use cookies to improve your experience, analyze traffic, and show personalized ads. You can manage your preferences below.

    By clicking "Accept All", you consent to our use of cookies for analytics and personalized advertising. You can customize your preferences or reject non-essential cookies.

    Privacy PolicyTerms of Service