Lesson 48 • Advanced Track
Data Warehousing & Star Schemas
By the end of this lesson you'll be able to design an analytical data warehouse the way professionals do — separating fast-changing transactions (OLTP) from a read-optimised warehouse (OLAP), modelling a star schema from a central fact table and surrounding dimensions, choosing the right grain, and writing the star-join GROUP BY query that turns raw events into business numbers.
What You'll Learn
- ✓Tell OLTP (transactions) from OLAP (analytics)
- ✓Model a star schema: one fact + dimension tables
- ✓Pin down the fact-table grain before you build
- ✓Separate measures (numbers) from dimensions (context)
- ✓Use surrogate keys and handle changing dimensions (SCD)
- ✓Write the star-join GROUP BY that produces reports
1. OLTP vs OLAP — Two Different Jobs
Your normal app database is OLTP (Online Transaction Processing): it handles lots of tiny, fast writes — place an order, update a profile, one row at a time. It's normalised (data split across many tables to avoid duplication) so writes stay safe and cheap.
A data warehouse is OLAP (Online Analytical Processing): it answers big questions like "revenue by category by month for the last three years". It's loaded in batches and shaped so those huge read queries are simple and fast — even if that means duplicating descriptive data.
🏪 Real-world analogy
OLTP is the supermarket checkout — fast, one customer at a time, optimised for the moment of sale. OLAP is the head office, where analysts compare sales across 500 stores over 5 years. Same business, but you wouldn't want the head-office reports slowing down the tills — so you give them separate systems.
| OLTP (app DB) | OLAP (warehouse) | |
|---|---|---|
| Workload | Many small writes | Few huge reads |
| Design | Normalised | Star schema |
| Query | One order / user | Aggregate over millions |
| Freshness | Live, instant | Batched (e.g. nightly) |
2. The Star Schema — Facts at the Centre
A star schema has one central fact table ringed by dimension tables. Sketch it and it looks like a star, hence the name. It's the default warehouse design — when in doubt, build a star.
- A fact table holds the events you measure — sales, clicks, shipments. It's long and skinny: mostly foreign keys plus a few numbers.
- A dimension table holds the descriptive context — product, date, customer, store. It's short and wide: lots of text columns you'll filter and group by.
- Measures are the numbers in the fact you
SUM/AVG(quantity, total_amount). Dimensions are the labels youGROUP BY(category, month_name). - A surrogate key is the warehouse's own meaningless id (
product_key), separate from the source system's natural key (product_id). Surrogate keys let a product have several versions over time (you'll see why in SCD).
Star schema — fact + dimensions
Create fact_sales surrounded by dim_product and dim_date.
-- A STAR SCHEMA = one central FACT table + several DIMENSION tables.
-- Draw it and it looks like a star: the fact in the middle, dimensions
-- radiating out. This is the default warehouse design — start here.
-- DIMENSION: who/what/when/where context. Wide, descriptive, few rows.
CREATE TABLE dim_product (
product_key INT PRIMARY KEY, -- SURROGATE key (warehouse's own id)
product_id INT, -- NATURAL key from the source system
product_name VARCHAR(100),
catego
...Our Sample Star
Load these rows so the analytical query below has data. Notice fact_sales stores only keys and numbers — every label lives in a dimension.
Result — dim_product:
| product_key | product_id | product_name | category | brand |
|---|---|---|---|---|
| 1 | 1001 | Wireless Mouse | Electronics | Logi |
| 2 | 1002 | Mechanical Keyboard | Electronics | Logi |
| 3 | 1003 | Coffee Mug | Kitchen | Acme |
Result — dim_date:
| date_key | full_date | year | month | month_name |
|---|---|---|---|---|
| 20240115 | 2024-01-15 | 2024 | 1 | January |
| 20240220 | 2024-02-20 | 2024 | 2 | February |
Result — fact_sales (grain: one product per sale line):
| sale_key | date_key | product_key | quantity | total_amount |
|---|---|---|---|---|
| 1 | 20240115 | 1 | 2 | 50 |
| 2 | 20240115 | 3 | 4 | 40 |
| 3 | 20240220 | 1 | 1 | 25 |
| 4 | 20240220 | 2 | 3 | 240 |
Load the sample rows
Insert a few rows into each table so the join has data.
-- A few rows of each table so the join below has data to work on.
INSERT INTO dim_product (product_key, product_id, product_name, category, brand) VALUES
(1, 1001, 'Wireless Mouse', 'Electronics', 'Logi'),
(2, 1002, 'Mechanical Keyboard', 'Electronics', 'Logi'),
(3, 1003, 'Coffee Mug', 'Kitchen', 'Acme');
INSERT INTO dim_date (date_key, full_date, year, month, month_name) VALUES
(20240115, '2024-01-15', 2024, 1, 'January'),
(20240220, '2024-02-20', 2024,
...3. The Star-Join Query — Turning Events into Reports
Almost every warehouse report follows the same shape: start at the fact table, JOIN out to each dimension you need, then GROUP BY the dimension labels and SUM the measures. You join "out from the centre of the star" — one hop per dimension.
Star-join aggregation
Revenue and units by category and month.
-- The classic warehouse query: join the FACT to its DIMENSIONS,
-- then GROUP BY the dimension labels and SUM the measures.
SELECT
p.category, -- dimension attribute (the label)
d.month_name, -- dimension attribute (the label)
SUM(f.total_amount) AS revenue, -- measure aggregated
SUM(f.quantity) AS units -- measure aggregated
FROM fact_sales f -- start from the fact (centre of star)
JOIN dim_product p ON f
...Result — hand-verified from the sample rows:
| category | month_name | revenue | units |
|---|---|---|---|
| Electronics | January | 50 | 2 |
| Kitchen | January | 40 | 4 |
| Electronics | February | 265 | 4 |
Electronics in February is 25.00 + 240.00 = 265.00 and 1 + 3 = 4 units — the two February electronics sales collapse into one grouped row. That collapsing is exactly what a warehouse is built to do quickly.
Your Turn: complete the star-join
Fill in the three blanks to total revenue per category per month. The expected result is in the comments so you can check yourself.
🎯 Your Turn: sales by category & month
Replace the ___ blanks to finish the GROUP BY query.
-- 🎯 YOUR TURN — fill in the three blanks, then press "Try it Yourself".
-- Goal: total revenue per category per month (a star-join aggregation).
SELECT
p.category,
d.month_name,
___(f.total_amount) AS revenue -- 👉 the function that totals a measure
FROM fact_sales f
JOIN dim_product p ON f.product_key = ___ -- 👉 the matching key in dim_product
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY p.category, d.month_name, ___ -- 👉 add d.month so months sort right
...4. Snowflake Schema — When Dimensions Are Normalised
A snowflake schema is a star whose dimensions have been normalised — split into sub-tables to remove repeated text. In a star, dim_product repeats the word "Electronics" on every Electronics row. Snowflaking moves category into its own dim_category table and links to it by key.
The trade-off: snowflaking saves a little storage and keeps category data in one place, but every category report now needs an extra join. Stars are simpler and faster to query, so most warehouses stay with stars and only snowflake a dimension when it's genuinely large or shared across many facts.
Snowflaking a dimension
Normalise category out of dim_product — and pay an extra join.
-- SNOWFLAKE SCHEMA = a star whose dimensions are NORMALISED (split up).
-- Here dim_product no longer stores the category text; it points at a
-- separate dim_category table. The dimension "snowflakes" outward.
CREATE TABLE dim_category (
category_key INT PRIMARY KEY,
category_name VARCHAR(50),
department VARCHAR(50)
);
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_name VARCHAR(100),
category_key INT REFERENCES dim_category(category_key) -- extr
...5. Slowly Changing Dimensions (SCD)
Dimension attributes drift over time — a product gets re-categorised, a customer moves city. A Slowly Changing Dimension strategy decides what happens to history when that occurs.
- SCD Type 1 — overwrite. Just
UPDATEthe value. Simple, but old reports change because history is lost. Fine for fixing typos; bad for anything you'll analyse over time. - SCD Type 2 — add a new row. Insert a new version with a fresh surrogate key plus
effective_from/effective_to/is_currentcolumns. Old facts keep pointing at the old version, so past reports stay correct. This is the workhorse of real warehouses.
SCD Type 1 vs Type 2
Overwrite (lose history) vs new row (keep history).
-- SLOWLY CHANGING DIMENSIONS (SCD): how to handle a dimension attribute
-- that changes over time, e.g. a product moves from 'Kitchen' to 'Premium'.
-- SCD TYPE 1 — OVERWRITE. Simple, but you LOSE the old value forever.
UPDATE dim_product
SET category = 'Premium'
WHERE product_id = 1003;
-- After this, every old sale now looks like it was 'Premium'. History gone.
-- SCD TYPE 2 — ADD A NEW ROW. Keeps full history. The dimension needs
-- effective_from / effective_to / is_current columns and a
...Your Turn: name the grain, classify the columns
No query to run — read the fact row and decide, for each column, whether it's a measure or a dimension. The answers are in the comments.
🎯 Your Turn: grain & measures vs dimensions
Classify each column, then check against the expected answers.
-- 🎯 YOUR TURN — no blanks to run; this is a "read & classify" drill.
-- A clothing retailer logs one row every time an item is scanned at a till.
-- Below is one fact row. Answer the three questions in the comments.
--
-- till_scan( store_id, product_id, scan_timestamp, qty, line_total )
-- ( 12, 8841, 2024-03-02 14:05, 1, 19.99 )
--
-- 1) GRAIN — what does ONE row represent?
-- 👉 ✅ One item scanned at a till (one sale line), at one store, one moment.
--
-- 2) Classif
...Common Errors (and the fix)
- Wrong (or undecided) grain: mixing "one order line" rows with "one whole order" rows in the same fact table makes every
SUMwrong. Decide the grain first, write it as a comment, and keep every row to it. - Snowflaking everything: normalising every dimension out of habit buries simple reports under extra joins. Start with a star; snowflake only a big or shared dimension when there's a real reason.
- Counting a measure twice (fan-out): joining the fact to a dimension that has many rows per fact row multiplies the fact rows, so
SUM(total_amount)double-counts. Join facts only to dimensions at the same or coarser grain; if you must, aggregate the fact first, then join. - OLTP design for analytics: running heavy
GROUP BYreports on the highly-normalised production database is slow and hurts the live app. Load a separate star-schema warehouse instead. - Reusing the natural key as the primary key: if
product_idis the dimension's only key, you can't keep two versions of the same product (SCD Type 2). Add a surrogateproduct_key.
📘 Quick Reference
| Term | What it is |
|---|---|
| Fact table | Measurable events; keys + numbers (fact_sales) |
| Dimension table | Descriptive context to group by (dim_product, dim_date) |
| Grain | What one fact row represents — decide first |
| Measure | A number in the fact you SUM/AVG (total_amount) |
| Dimension attribute | A label you GROUP BY (category, month_name) |
| Surrogate key | Warehouse's own id (product_key) ≠ natural key (product_id) |
| Star schema | One fact + flat dimensions — fast, simple |
| Snowflake schema | Star with normalised dimensions — extra joins |
| SCD Type 1 | Overwrite, lose history |
| SCD Type 2 | New row per change, keep full history |
Frequently Asked Questions
Q: Star or snowflake — which should I use?
Default to a star. It has fewer joins, so reports are simpler to write and faster to run. Snowflake a single dimension only when it's very large or shared across several fact tables and the duplication genuinely hurts.
Q: Why use a surrogate key instead of the existing product_id?
So one real-world product can have several historical versions (SCD Type 2). Each version gets its own product_key; old facts keep pointing at the version that was current when they happened. A surrogate key also insulates the warehouse from messy or changing source ids.
Q: Why a separate dim_date instead of just storing a date in the fact?
A date dimension pre-computes year, quarter, month name, weekday, holiday flags and fiscal periods once. Then every report can GROUP BY them with a plain join instead of repeating fragile date maths in each query.
Q: Do I need a special "OLAP database" to do this?
No — a star schema is just ordinary tables and SQL; you can build one in any database. Dedicated analytical engines (column stores like BigQuery, Redshift, DuckDB) run the same star-join queries far faster on huge data, but the design is identical.
Mini-Challenge: Revenue by Brand
Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a playground to confirm.
🎯 Mini-Challenge
Total revenue per brand, highest first.
-- 🎯 MINI-CHALLENGE
-- Using the star schema (fact_sales, dim_product, dim_date) from this lesson:
-- 1. Show each BRAND and its total revenue.
-- 2. Join fact_sales to dim_product on product_key.
-- 3. GROUP BY the brand, SUM(total_amount), and order highest revenue first.
--
-- ✅ Expected (2 grouped rows):
-- Logi | 315.00 (50 + 25 + 240)
-- Acme | 40.00
--
-- your query here🎉 Lesson Complete
- ✅ OLTP handles small fast writes; OLAP warehouses handle huge reads
- ✅ A star schema = one fact table + surrounding dimension tables
- ✅ Decide the fact grain first; separate measures from dimensions
- ✅ Surrogate keys let dimensions keep history (SCD Type 2)
- ✅ Snowflaking normalises dimensions at the cost of extra joins
- ✅ Next: running these star-join queries at scale with Big Data SQL tools like Hive, SparkSQL and DuckDB
Sign up for free to track which lessons you've completed and get learning reminders.