Skip to main content

    Lesson 48 • Advanced Track

    Data Warehousing & Star Schemas

    By the end of this lesson you'll be able to design an analytical data warehouse the way professionals do — separating fast-changing transactions (OLTP) from a read-optimised warehouse (OLAP), modelling a star schema from a central fact table and surrounding dimensions, choosing the right grain, and writing the star-join GROUP BY query that turns raw events into business numbers.

    What You'll Learn

    • Tell OLTP (transactions) from OLAP (analytics)
    • Model a star schema: one fact + dimension tables
    • Pin down the fact-table grain before you build
    • Separate measures (numbers) from dimensions (context)
    • Use surrogate keys and handle changing dimensions (SCD)
    • Write the star-join GROUP BY that produces reports

    1. OLTP vs OLAP — Two Different Jobs

    Your normal app database is OLTP (Online Transaction Processing): it handles lots of tiny, fast writes — place an order, update a profile, one row at a time. It's normalised (data split across many tables to avoid duplication) so writes stay safe and cheap.

    A data warehouse is OLAP (Online Analytical Processing): it answers big questions like "revenue by category by month for the last three years". It's loaded in batches and shaped so those huge read queries are simple and fast — even if that means duplicating descriptive data.

    🏪 Real-world analogy

    OLTP is the supermarket checkout — fast, one customer at a time, optimised for the moment of sale. OLAP is the head office, where analysts compare sales across 500 stores over 5 years. Same business, but you wouldn't want the head-office reports slowing down the tills — so you give them separate systems.

     OLTP (app DB)OLAP (warehouse)
    WorkloadMany small writesFew huge reads
    DesignNormalisedStar schema
    QueryOne order / userAggregate over millions
    FreshnessLive, instantBatched (e.g. nightly)

    2. The Star Schema — Facts at the Centre

    A star schema has one central fact table ringed by dimension tables. Sketch it and it looks like a star, hence the name. It's the default warehouse design — when in doubt, build a star.

    • A fact table holds the events you measure — sales, clicks, shipments. It's long and skinny: mostly foreign keys plus a few numbers.
    • A dimension table holds the descriptive context — product, date, customer, store. It's short and wide: lots of text columns you'll filter and group by.
    • Measures are the numbers in the fact you SUM/AVG (quantity, total_amount). Dimensions are the labels you GROUP BY (category, month_name).
    • A surrogate key is the warehouse's own meaningless id (product_key), separate from the source system's natural key (product_id). Surrogate keys let a product have several versions over time (you'll see why in SCD).

    Star schema — fact + dimensions

    Create fact_sales surrounded by dim_product and dim_date.

    Try it Yourself »
    SQL
    -- A STAR SCHEMA = one central FACT table + several DIMENSION tables.
    -- Draw it and it looks like a star: the fact in the middle, dimensions
    -- radiating out. This is the default warehouse design — start here.
    
    -- DIMENSION: who/what/when/where context. Wide, descriptive, few rows.
    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,   -- SURROGATE key (warehouse's own id)
        product_id   INT,               -- NATURAL key from the source system
        product_name VARCHAR(100),
        catego
    ...

    Our Sample Star

    Load these rows so the analytical query below has data. Notice fact_sales stores only keys and numbers — every label lives in a dimension.

    Result — dim_product:

    product_keyproduct_idproduct_namecategorybrand
    11001Wireless MouseElectronicsLogi
    21002Mechanical KeyboardElectronicsLogi
    31003Coffee MugKitchenAcme

    Result — dim_date:

    date_keyfull_dateyearmonthmonth_name
    202401152024-01-1520241January
    202402202024-02-2020242February

    Result — fact_sales (grain: one product per sale line):

    sale_keydate_keyproduct_keyquantitytotal_amount
    1202401151250
    2202401153440
    3202402201125
    42024022023240

    Load the sample rows

    Insert a few rows into each table so the join has data.

    Try it Yourself »
    SQL
    -- A few rows of each table so the join below has data to work on.
    
    INSERT INTO dim_product (product_key, product_id, product_name, category, brand) VALUES
        (1, 1001, 'Wireless Mouse',      'Electronics', 'Logi'),
        (2, 1002, 'Mechanical Keyboard', 'Electronics', 'Logi'),
        (3, 1003, 'Coffee Mug',          'Kitchen',     'Acme');
    
    INSERT INTO dim_date (date_key, full_date, year, month, month_name) VALUES
        (20240115, '2024-01-15', 2024, 1, 'January'),
        (20240220, '2024-02-20', 2024, 
    ...

    3. The Star-Join Query — Turning Events into Reports

    Almost every warehouse report follows the same shape: start at the fact table, JOIN out to each dimension you need, then GROUP BY the dimension labels and SUM the measures. You join "out from the centre of the star" — one hop per dimension.

    Star-join aggregation

    Revenue and units by category and month.

    Try it Yourself »
    SQL
    -- The classic warehouse query: join the FACT to its DIMENSIONS,
    -- then GROUP BY the dimension labels and SUM the measures.
    
    SELECT
        p.category,                       -- dimension attribute (the label)
        d.month_name,                     -- dimension attribute (the label)
        SUM(f.total_amount) AS revenue,   -- measure aggregated
        SUM(f.quantity)     AS units      -- measure aggregated
    FROM fact_sales f                     -- start from the fact (centre of star)
    JOIN dim_product p ON f
    ...

    Result — hand-verified from the sample rows:

    categorymonth_namerevenueunits
    ElectronicsJanuary502
    KitchenJanuary404
    ElectronicsFebruary2654

    Electronics in February is 25.00 + 240.00 = 265.00 and 1 + 3 = 4 units — the two February electronics sales collapse into one grouped row. That collapsing is exactly what a warehouse is built to do quickly.

    Your Turn: complete the star-join

    Fill in the three blanks to total revenue per category per month. The expected result is in the comments so you can check yourself.

    🎯 Your Turn: sales by category & month

    Replace the ___ blanks to finish the GROUP BY query.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — fill in the three blanks, then press "Try it Yourself".
    -- Goal: total revenue per category per month (a star-join aggregation).
    
    SELECT
        p.category,
        d.month_name,
        ___(f.total_amount) AS revenue   -- 👉 the function that totals a measure
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = ___   -- 👉 the matching key in dim_product
    JOIN dim_date    d ON f.date_key    = d.date_key
    GROUP BY p.category, d.month_name, ___       -- 👉 add d.month so months sort right
    
    ...

    4. Snowflake Schema — When Dimensions Are Normalised

    A snowflake schema is a star whose dimensions have been normalised — split into sub-tables to remove repeated text. In a star, dim_product repeats the word "Electronics" on every Electronics row. Snowflaking moves category into its own dim_category table and links to it by key.

    The trade-off: snowflaking saves a little storage and keeps category data in one place, but every category report now needs an extra join. Stars are simpler and faster to query, so most warehouses stay with stars and only snowflake a dimension when it's genuinely large or shared across many facts.

    Snowflaking a dimension

    Normalise category out of dim_product — and pay an extra join.

    Try it Yourself »
    SQL
    -- SNOWFLAKE SCHEMA = a star whose dimensions are NORMALISED (split up).
    -- Here dim_product no longer stores the category text; it points at a
    -- separate dim_category table. The dimension "snowflakes" outward.
    
    CREATE TABLE dim_category (
        category_key  INT PRIMARY KEY,
        category_name VARCHAR(50),
        department    VARCHAR(50)
    );
    
    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(100),
        category_key INT REFERENCES dim_category(category_key)  -- extr
    ...

    5. Slowly Changing Dimensions (SCD)

    Dimension attributes drift over time — a product gets re-categorised, a customer moves city. A Slowly Changing Dimension strategy decides what happens to history when that occurs.

    • SCD Type 1 — overwrite. Just UPDATE the value. Simple, but old reports change because history is lost. Fine for fixing typos; bad for anything you'll analyse over time.
    • SCD Type 2 — add a new row. Insert a new version with a fresh surrogate key plus effective_from/effective_to/is_current columns. Old facts keep pointing at the old version, so past reports stay correct. This is the workhorse of real warehouses.

    SCD Type 1 vs Type 2

    Overwrite (lose history) vs new row (keep history).

    Try it Yourself »
    SQL
    -- SLOWLY CHANGING DIMENSIONS (SCD): how to handle a dimension attribute
    -- that changes over time, e.g. a product moves from 'Kitchen' to 'Premium'.
    
    -- SCD TYPE 1 — OVERWRITE. Simple, but you LOSE the old value forever.
    UPDATE dim_product
    SET category = 'Premium'
    WHERE product_id = 1003;
    -- After this, every old sale now looks like it was 'Premium'. History gone.
    
    -- SCD TYPE 2 — ADD A NEW ROW. Keeps full history. The dimension needs
    -- effective_from / effective_to / is_current columns and a 
    ...

    Your Turn: name the grain, classify the columns

    No query to run — read the fact row and decide, for each column, whether it's a measure or a dimension. The answers are in the comments.

    🎯 Your Turn: grain & measures vs dimensions

    Classify each column, then check against the expected answers.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — no blanks to run; this is a "read & classify" drill.
    -- A clothing retailer logs one row every time an item is scanned at a till.
    -- Below is one fact row. Answer the three questions in the comments.
    --
    --   till_scan( store_id, product_id, scan_timestamp, qty, line_total )
    --   ( 12,       8841,        2024-03-02 14:05,    1,   19.99 )
    --
    -- 1) GRAIN — what does ONE row represent?
    --    👉 ✅ One item scanned at a till (one sale line), at one store, one moment.
    --
    -- 2) Classif
    ...

    Common Errors (and the fix)

    • Wrong (or undecided) grain: mixing "one order line" rows with "one whole order" rows in the same fact table makes every SUM wrong. Decide the grain first, write it as a comment, and keep every row to it.
    • Snowflaking everything: normalising every dimension out of habit buries simple reports under extra joins. Start with a star; snowflake only a big or shared dimension when there's a real reason.
    • Counting a measure twice (fan-out): joining the fact to a dimension that has many rows per fact row multiplies the fact rows, so SUM(total_amount) double-counts. Join facts only to dimensions at the same or coarser grain; if you must, aggregate the fact first, then join.
    • OLTP design for analytics: running heavy GROUP BY reports on the highly-normalised production database is slow and hurts the live app. Load a separate star-schema warehouse instead.
    • Reusing the natural key as the primary key: if product_id is the dimension's only key, you can't keep two versions of the same product (SCD Type 2). Add a surrogate product_key.

    📘 Quick Reference

    TermWhat it is
    Fact tableMeasurable events; keys + numbers (fact_sales)
    Dimension tableDescriptive context to group by (dim_product, dim_date)
    GrainWhat one fact row represents — decide first
    MeasureA number in the fact you SUM/AVG (total_amount)
    Dimension attributeA label you GROUP BY (category, month_name)
    Surrogate keyWarehouse's own id (product_key) ≠ natural key (product_id)
    Star schemaOne fact + flat dimensions — fast, simple
    Snowflake schemaStar with normalised dimensions — extra joins
    SCD Type 1Overwrite, lose history
    SCD Type 2New row per change, keep full history

    Frequently Asked Questions

    Q: Star or snowflake — which should I use?

    Default to a star. It has fewer joins, so reports are simpler to write and faster to run. Snowflake a single dimension only when it's very large or shared across several fact tables and the duplication genuinely hurts.

    Q: Why use a surrogate key instead of the existing product_id?

    So one real-world product can have several historical versions (SCD Type 2). Each version gets its own product_key; old facts keep pointing at the version that was current when they happened. A surrogate key also insulates the warehouse from messy or changing source ids.

    Q: Why a separate dim_date instead of just storing a date in the fact?

    A date dimension pre-computes year, quarter, month name, weekday, holiday flags and fiscal periods once. Then every report can GROUP BY them with a plain join instead of repeating fragile date maths in each query.

    Q: Do I need a special "OLAP database" to do this?

    No — a star schema is just ordinary tables and SQL; you can build one in any database. Dedicated analytical engines (column stores like BigQuery, Redshift, DuckDB) run the same star-join queries far faster on huge data, but the design is identical.

    Mini-Challenge: Revenue by Brand

    Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a playground to confirm.

    🎯 Mini-Challenge

    Total revenue per brand, highest first.

    Try it Yourself »
    SQL
    -- 🎯 MINI-CHALLENGE
    -- Using the star schema (fact_sales, dim_product, dim_date) from this lesson:
    --   1. Show each BRAND and its total revenue.
    --   2. Join fact_sales to dim_product on product_key.
    --   3. GROUP BY the brand, SUM(total_amount), and order highest revenue first.
    --
    -- ✅ Expected (2 grouped rows):
    --    Logi | 315.00   (50 + 25 + 240)
    --    Acme |  40.00
    --
    -- your query here

    🎉 Lesson Complete

    • ✅ OLTP handles small fast writes; OLAP warehouses handle huge reads
    • ✅ A star schema = one fact table + surrounding dimension tables
    • ✅ Decide the fact grain first; separate measures from dimensions
    • ✅ Surrogate keys let dimensions keep history (SCD Type 2)
    • ✅ Snowflaking normalises dimensions at the cost of extra joins
    • Next: running these star-join queries at scale with Big Data SQL tools like Hive, SparkSQL and DuckDB

    Sign up for free to track which lessons you've completed and get learning reminders.

    Previous

    Cookie & Privacy Settings

    We use cookies to improve your experience, analyze traffic, and show personalized ads. You can manage your preferences below.

    By clicking "Accept All", you consent to our use of cookies for analytics and personalized advertising. You can customize your preferences or reject non-essential cookies.

    Privacy PolicyTerms of Service

    Install LearnCodingFast

    Learn faster with the app on your home screen.