Courses/SQL/Data Warehousing

Lesson 48 • Advanced Track

Data Warehousing & Star Schemas

By the end of this lesson you'll be able to design an analytical data warehouse the way professionals do — separating fast-changing transactions (OLTP) from a read-optimised warehouse (OLAP), modelling a star schema from a central fact table and surrounding dimensions, choosing the right grain, and writing the star-join GROUP BY query that turns raw events into business numbers.

What You'll Learn

✓Tell OLTP (transactions) from OLAP (analytics)
✓Model a star schema: one fact + dimension tables
✓Pin down the fact-table grain before you build
✓Separate measures (numbers) from dimensions (context)
✓Use surrogate keys and handle changing dimensions (SCD)
✓Write the star-join GROUP BY that produces reports

Before you start: you should be comfortable with JOINs and GROUP BY & aggregates — this lesson is those two ideas applied to a deliberate table design. The in-browser editor lets you write and edit SQL; to run it, copy your query into a free playground like sqliteonline.com or db-fiddle.com. Every example below shows the expected result so you can check yourself.

1. OLTP vs OLAP — Two Different Jobs

Your normal app database is OLTP (Online Transaction Processing): it handles lots of tiny, fast writes — place an order, update a profile, one row at a time. It's normalised (data split across many tables to avoid duplication) so writes stay safe and cheap.

A data warehouse is OLAP (Online Analytical Processing): it answers big questions like "revenue by category by month for the last three years". It's loaded in batches and shaped so those huge read queries are simple and fast — even if that means duplicating descriptive data.

🏪 Real-world analogy

OLTP is the supermarket checkout — fast, one customer at a time, optimised for the moment of sale. OLAP is the head office, where analysts compare sales across 500 stores over 5 years. Same business, but you wouldn't want the head-office reports slowing down the tills — so you give them separate systems.

	OLTP (app DB)	OLAP (warehouse)
Workload	Many small writes	Few huge reads
Design	Normalised	Star schema
Query	One order / user	Aggregate over millions
Freshness	Live, instant	Batched (e.g. nightly)

2. The Star Schema — Facts at the Centre

A star schema has one central fact table ringed by dimension tables. Sketch it and it looks like a star, hence the name. It's the default warehouse design — when in doubt, build a star.

A fact table holds the events you measure — sales, clicks, shipments. It's long and skinny: mostly foreign keys plus a few numbers.
A dimension table holds the descriptive context — product, date, customer, store. It's short and wide: lots of text columns you'll filter and group by.
Measures are the numbers in the fact you SUM/AVG (quantity, total_amount). Dimensions are the labels you GROUP BY (category, month_name).
A surrogate key is the warehouse's own meaningless id (product_key), separate from the source system's natural key (product_id). Surrogate keys let a product have several versions over time (you'll see why in SCD).

Grain first. The single most important decision is the fact-table grain: what exactly does one row represent? "One order line" vs "one whole order" vs "daily total per store" are three different grains — pick one and write it down before you create the table.

Star schema — fact + dimensions

Create fact_sales surrounded by dim_product and dim_date.

Try it Yourself »

SQL

-- A STAR SCHEMA = one central FACT table + several DIMENSION tables.
-- Draw it and it looks like a star: the fact in the middle, dimensions
-- radiating out. This is the default warehouse design — start here.

-- DIMENSION: who/what/when/where context. Wide, descriptive, few rows.
CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,   -- SURROGATE key (warehouse's own id)
    product_id   INT,               -- NATURAL key from the source system
    product_name VARCHAR(100),
    catego
...

Our Sample Star

Load these rows so the analytical query below has data. Notice fact_sales stores only keys and numbers — every label lives in a dimension.

Result — dim_product:

product_key	product_id	product_name	category	brand
1	1001	Wireless Mouse	Electronics	Logi
2	1002	Mechanical Keyboard	Electronics	Logi
3	1003	Coffee Mug	Kitchen	Acme

Result — dim_date:

date_key	full_date	year	month	month_name
20240115	2024-01-15	2024	1	January
20240220	2024-02-20	2024	2	February

Result — fact_sales (grain: one product per sale line):

sale_key	date_key	product_key	quantity	total_amount
1	20240115	1	2	50
2	20240115	3	4	40
3	20240220	1	1	25
4	20240220	2	3	240

Load the sample rows

Insert a few rows into each table so the join has data.

Try it Yourself »

SQL

-- A few rows of each table so the join below has data to work on.

INSERT INTO dim_product (product_key, product_id, product_name, category, brand) VALUES
    (1, 1001, 'Wireless Mouse',      'Electronics', 'Logi'),
    (2, 1002, 'Mechanical Keyboard', 'Electronics', 'Logi'),
    (3, 1003, 'Coffee Mug',          'Kitchen',     'Acme');

INSERT INTO dim_date (date_key, full_date, year, month, month_name) VALUES
    (20240115, '2024-01-15', 2024, 1, 'January'),
    (20240220, '2024-02-20', 2024, 
...

3. The Star-Join Query — Turning Events into Reports

Almost every warehouse report follows the same shape: start at the fact table, JOIN out to each dimension you need, then GROUP BY the dimension labels and SUM the measures. You join "out from the centre of the star" — one hop per dimension.

Star-join aggregation

Revenue and units by category and month.

Try it Yourself »

SQL

-- The classic warehouse query: join the FACT to its DIMENSIONS,
-- then GROUP BY the dimension labels and SUM the measures.

SELECT
    p.category,                       -- dimension attribute (the label)
    d.month_name,                     -- dimension attribute (the label)
    SUM(f.total_amount) AS revenue,   -- measure aggregated
    SUM(f.quantity)     AS units      -- measure aggregated
FROM fact_sales f                     -- start from the fact (centre of star)
JOIN dim_product p ON f
...

Result — hand-verified from the sample rows:

category	month_name	revenue	units
Electronics	January	50	2
Kitchen	January	40	4
Electronics	February	265	4

Electronics in February is 25.00 + 240.00 = 265.00 and 1 + 3 = 4 units — the two February electronics sales collapse into one grouped row. That collapsing is exactly what a warehouse is built to do quickly.

Your Turn: complete the star-join

Fill in the three blanks to total revenue per category per month. The expected result is in the comments so you can check yourself.

🎯 Your Turn: sales by category & month

Replace the ___ blanks to finish the GROUP BY query.

Try it Yourself »

SQL

-- 🎯 YOUR TURN — fill in the three blanks, then press "Try it Yourself".
-- Goal: total revenue per category per month (a star-join aggregation).

SELECT
    p.category,
    d.month_name,
    ___(f.total_amount) AS revenue   -- 👉 the function that totals a measure
FROM fact_sales f
JOIN dim_product p ON f.product_key = ___   -- 👉 the matching key in dim_product
JOIN dim_date    d ON f.date_key    = d.date_key
GROUP BY p.category, d.month_name, ___       -- 👉 add d.month so months sort right

...

4. Snowflake Schema — When Dimensions Are Normalised

A snowflake schema is a star whose dimensions have been normalised — split into sub-tables to remove repeated text. In a star, dim_product repeats the word "Electronics" on every Electronics row. Snowflaking moves category into its own dim_category table and links to it by key.

The trade-off: snowflaking saves a little storage and keeps category data in one place, but every category report now needs an extra join. Stars are simpler and faster to query, so most warehouses stay with stars and only snowflake a dimension when it's genuinely large or shared across many facts.

Snowflaking a dimension

Normalise category out of dim_product — and pay an extra join.

Try it Yourself »

SQL

-- SNOWFLAKE SCHEMA = a star whose dimensions are NORMALISED (split up).
-- Here dim_product no longer stores the category text; it points at a
-- separate dim_category table. The dimension "snowflakes" outward.

CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category(category_key)  -- extr
...

5. Slowly Changing Dimensions (SCD)

Dimension attributes drift over time — a product gets re-categorised, a customer moves city. A Slowly Changing Dimension strategy decides what happens to history when that occurs.

SCD Type 1 — overwrite. Just UPDATE the value. Simple, but old reports change because history is lost. Fine for fixing typos; bad for anything you'll analyse over time.
SCD Type 2 — add a new row. Insert a new version with a fresh surrogate key plus effective_from/effective_to/is_current columns. Old facts keep pointing at the old version, so past reports stay correct. This is the workhorse of real warehouses.

SCD Type 1 vs Type 2

Overwrite (lose history) vs new row (keep history).

Try it Yourself »

SQL

-- SLOWLY CHANGING DIMENSIONS (SCD): how to handle a dimension attribute
-- that changes over time, e.g. a product moves from 'Kitchen' to 'Premium'.

-- SCD TYPE 1 — OVERWRITE. Simple, but you LOSE the old value forever.
UPDATE dim_product
SET category = 'Premium'
WHERE product_id = 1003;
-- After this, every old sale now looks like it was 'Premium'. History gone.

-- SCD TYPE 2 — ADD A NEW ROW. Keeps full history. The dimension needs
-- effective_from / effective_to / is_current columns and a 
...

Your Turn: name the grain, classify the columns

No query to run — read the fact row and decide, for each column, whether it's a measure or a dimension. The answers are in the comments.

🎯 Your Turn: grain & measures vs dimensions

Classify each column, then check against the expected answers.

Try it Yourself »

SQL

-- 🎯 YOUR TURN — no blanks to run; this is a "read & classify" drill.
-- A clothing retailer logs one row every time an item is scanned at a till.
-- Below is one fact row. Answer the three questions in the comments.
--
--   till_scan( store_id, product_id, scan_timestamp, qty, line_total )
--   ( 12,       8841,        2024-03-02 14:05,    1,   19.99 )
--
-- 1) GRAIN — what does ONE row represent?
--    👉 ✅ One item scanned at a till (one sale line), at one store, one moment.
--
-- 2) Classif
...

Common Errors (and the fix)

Wrong (or undecided) grain: mixing "one order line" rows with "one whole order" rows in the same fact table makes every SUM wrong. Decide the grain first, write it as a comment, and keep every row to it.
Snowflaking everything: normalising every dimension out of habit buries simple reports under extra joins. Start with a star; snowflake only a big or shared dimension when there's a real reason.
Counting a measure twice (fan-out): joining the fact to a dimension that has many rows per fact row multiplies the fact rows, so SUM(total_amount) double-counts. Join facts only to dimensions at the same or coarser grain; if you must, aggregate the fact first, then join.
OLTP design for analytics: running heavy GROUP BY reports on the highly-normalised production database is slow and hurts the live app. Load a separate star-schema warehouse instead.
Reusing the natural key as the primary key: if product_id is the dimension's only key, you can't keep two versions of the same product (SCD Type 2). Add a surrogate product_key.

📘 Quick Reference

Term	What it is
Fact table	Measurable events; keys + numbers (fact_sales)
Dimension table	Descriptive context to group by (dim_product, dim_date)
Grain	What one fact row represents — decide first
Measure	A number in the fact you SUM/AVG (total_amount)
Dimension attribute	A label you GROUP BY (category, month_name)
Surrogate key	Warehouse's own id (product_key) ≠ natural key (product_id)
Star schema	One fact + flat dimensions — fast, simple
Snowflake schema	Star with normalised dimensions — extra joins
SCD Type 1	Overwrite, lose history
SCD Type 2	New row per change, keep full history

Frequently Asked Questions

Q: Star or snowflake — which should I use?

Default to a star. It has fewer joins, so reports are simpler to write and faster to run. Snowflake a single dimension only when it's very large or shared across several fact tables and the duplication genuinely hurts.

Q: Why use a surrogate key instead of the existing product_id?

So one real-world product can have several historical versions (SCD Type 2). Each version gets its own product_key; old facts keep pointing at the version that was current when they happened. A surrogate key also insulates the warehouse from messy or changing source ids.

Q: Why a separate dim_date instead of just storing a date in the fact?

A date dimension pre-computes year, quarter, month name, weekday, holiday flags and fiscal periods once. Then every report can GROUP BY them with a plain join instead of repeating fragile date maths in each query.

Q: Do I need a special "OLAP database" to do this?

No — a star schema is just ordinary tables and SQL; you can build one in any database. Dedicated analytical engines (column stores like BigQuery, Redshift, DuckDB) run the same star-join queries far faster on huge data, but the design is identical.

Mini-Challenge: Revenue by Brand

Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a playground to confirm.

🎯 Mini-Challenge

Total revenue per brand, highest first.

Try it Yourself »

SQL

-- 🎯 MINI-CHALLENGE
-- Using the star schema (fact_sales, dim_product, dim_date) from this lesson:
--   1. Show each BRAND and its total revenue.
--   2. Join fact_sales to dim_product on product_key.
--   3. GROUP BY the brand, SUM(total_amount), and order highest revenue first.
--
-- ✅ Expected (2 grouped rows):
--    Logi | 315.00   (50 + 25 + 240)
--    Acme |  40.00
--
-- your query here

🎉 Lesson Complete

✅ OLTP handles small fast writes; OLAP warehouses handle huge reads
✅ A star schema = one fact table + surrounding dimension tables
✅ Decide the fact grain first; separate measures from dimensions
✅ Surrogate keys let dimensions keep history (SCD Type 2)
✅ Snowflaking normalises dimensions at the cost of extra joins
✅ Next: running these star-join queries at scale with Big Data SQL tools like Hive, SparkSQL and DuckDB

Data Warehousing & Star Schemas

What You'll Learn

1. OLTP vs OLAP — Two Different Jobs

2. The Star Schema — Facts at the Centre

Star schema — fact + dimensions

Our Sample Star

Load the sample rows

3. The Star-Join Query — Turning Events into Reports

Star-join aggregation

Your Turn: complete the star-join

🎯 Your Turn: sales by category & month

4. Snowflake Schema — When Dimensions Are Normalised

Snowflaking a dimension

5. Slowly Changing Dimensions (SCD)

SCD Type 1 vs Type 2

Your Turn: name the grain, classify the columns

🎯 Your Turn: grain & measures vs dimensions

Common Errors (and the fix)

📘 Quick Reference

Frequently Asked Questions

Mini-Challenge: Revenue by Brand

🎯 Mini-Challenge

🎉 Lesson Complete

Cookie & Privacy Settings