Lesson 49 • Advanced Track
SQL with Big Data Tools
By the end of this lesson you'll be able to run the same SQL you already know across terabytes and petabytes of data — using distributed engines like Hive, Spark SQL, and Presto/Trino, and the laptop-sized powerhouse DuckDB — and you'll know why columnar storage and partitioning make those queries fast.
What You'll Learn
- ✓Run SQL on huge datasets with Hive, Spark SQL, Presto/Trino & DuckDB
- ✓Point an EXTERNAL table at files in a data lake (no copying)
- ✓Read Parquet/CSV files directly with DuckDB's read_parquet
- ✓Partition data so queries scan folders, not the whole lake
- ✓Why columnar storage (Parquet/ORC) beats rows for analytics
- ✓When distributed SQL beats a single database — and when it doesn't
read_parquet for real. Every example shows its expected result so you can check yourself. Our Sample Dataset: events
Imagine billions of these rows spread across thousands of files in a data lake, one folder per day. Every query in this lesson runs against this events shape — a tiny preview is shown here so you can picture the columns.
Result:
| event_id | user_id | event_type | amount | event_date |
|---|---|---|---|---|
| 9001 | 4471 | purchase | 49.99 | 2024-06-01 |
| 9002 | 4471 | view | 0 | 2024-06-01 |
| 9003 | 5520 | signup | 0 | 2024-06-01 |
| 9004 | 3310 | purchase | 8.5 | 2024-06-02 |
| … | … | … | … | … |
Big data just means data too large or fast for a single database to handle comfortably — so the work is spread across many machines. The good news: you query it with the SQL you already know.
1. The Same SQL, a Bigger Engine
A normal database (PostgreSQL, MySQL) runs your query on one machine. A distributed query engine splits the work across a cluster of machines that each scan part of the data, then combines the partial answers. You still write SELECT … GROUP BY …; the engine handles the splitting.
The big four you'll meet: Apache Hive (SQL compiled into batch jobs over Hadoop — slow but bottomless), Spark SQL (the same idea but in memory, far faster for interactive work), Presto/Trino (a "federated" MPP engine — MPP = massively parallel processing — that can join across many different sources at once), and DuckDB (not distributed at all, but astonishingly fast on a single machine for files up to ~100 GB).
📚 Real-world analogy
A single database is one librarian searching one library. A distributed engine is a thousand librarians, each searching one wing, then handing their slips to a head librarian who tallies the total. Same request ("how many books on SQL?"), wildly more shelves — and it finishes in about the same time because everyone works at once.
2. Columnar Storage (Parquet/ORC) vs Rows
A traditional database stores data row by row: all of row 1's values together, then all of row 2's. That's perfect when you fetch whole records one at a time. Analytics is the opposite — you usually want one or two columns across millions of rows (e.g. SUM(amount)).
Columnar formats like Parquet and ORC store all of one column's values together. So SUM(amount) reads only the amount column and skips the other 49 columns on disk. Columns also compress brilliantly (similar values sit next to each other), so files are smaller and faster to read.
| Layout | On disk | Best for |
|---|---|---|
| Row (CSV, OLTP DB) | row1 all-cols, row2 all-cols… | Fetching/updating whole records |
| Columnar (Parquet/ORC) | all event_type, then all amount… | Aggregating a few columns over many rows |
3. External Tables & Partition Pruning (Hive)
An EXTERNAL table is a SQL definition layered over files that already exist in storage (HDFS, S3). Hive doesn't copy or own the data — it just describes the columns so you can query the files with SQL. Drop the table and the files stay.
Partitioning physically splits the data into folders by a column — here, one folder per event_date. When your WHERE filters on that column, the engine reads only the matching folders. That's partition pruning (a form of predicate pushdown: the filter is pushed down to the storage layer so unneeded data is never read). Filtering one day out of a year can mean scanning 1 folder instead of 365.
Hive: external table + partition pruning
Define a table over data-lake files and query one partition.
-- Apache Hive: write SQL, Hive runs it as distributed jobs over Hadoop.
-- An EXTERNAL table is just a SQL "view" pointing at files already in storage.
-- Hive never copies the data — it reads the files where they live.
CREATE EXTERNAL TABLE web_events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
amount DOUBLE
)
PARTITIONED BY (event_date STRING) -- folder-per-day on disk (see below)
STORED AS PARQUET -- columnar format: only reads needed
...Result — WHERE event_date = '2024-06-01':
| event_type | events | revenue |
|---|---|---|
| purchase | 1840 | 8210.5 |
| view | 7720 | 0 |
| signup | 410 | 0 |
Your Turn: a partitioned aggregation
Fill in the blanks so the query prunes to a single day's folder and totals revenue per event_type. The expected result is in the comments so you can check yourself.
🎯 Your Turn: partition pruning
Filter and group by the partition column.
-- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself".
-- Goal: total revenue per event_type for ONE day, scanning only that day's folder.
SELECT
event_type,
ROUND(SUM(amount), 2) AS revenue
FROM web_events
WHERE ___ = '2024-06-01' -- 👉 the PARTITION column, so Hive prunes to 1 folder
GROUP BY ___ -- 👉 group by the same column you SELECTed (not amount)
ORDER BY revenue DESC;
-- ✅ Expected: 3 rows (purchase, signup, view), e.g.
-- purchase | 8
...4. Spark SQL — Distributed, In Memory
Spark SQL runs the same standard SQL but keeps data in memory across the cluster, which makes repeated, interactive queries far faster than Hive's batch jobs. You register your files as a temporary view once, then query them like any table — window functions, CTEs, and joins all work.
Notice APPROX_COUNT_DISTINCT: at billions of rows, an exact distinct count is expensive, so big-data engines offer fast approximate versions (a couple of percent error) that finish in a fraction of the time.
Spark SQL: in-memory analytics
Aggregate a week of partitions with an approximate distinct count.
-- Spark SQL: the same SQL, run IN MEMORY across a cluster.
-- Often 10-100x faster than Hive for interactive, repeated queries.
-- In Python you'd register the files as a view first:
-- spark.read.parquet('s3://acme-lake/events/').createOrReplaceTempView('events')
-- After that, plain SQL works:
SELECT
event_date,
event_type,
COUNT(*) AS events,
APPROX_COUNT_DISTINCT(user_id) AS unique_users -- fast, ~2% error
FROM events
WHERE event_date BETWEEN '202
...Result — first rows:
| event_date | event_type | events | unique_users |
|---|---|---|---|
| 2024-06-01 | view | 7720 | 5102 |
| 2024-06-01 | purchase | 1840 | 1611 |
| 2024-06-02 | view | 7610 | 5044 |
| … | … | … | … |
5. DuckDB — Files Straight to SQL
DuckDB is the "SQLite for analytics": a single embedded engine, no server, that reads CSV and Parquet files directly. There's no loading step — you point read_parquet('…') (or read_csv_auto('…')) at a file or a wildcard of files and query them immediately.
Because Parquet is columnar, DuckDB reads only the columns your query touches — exactly the storage win from Section 2, now in one line on your laptop. For datasets up to ~100 GB it's often faster than spinning up a cluster, and it costs nothing to run.
DuckDB: read_parquet over a wildcard
Aggregate a month of Parquet files with zero setup.
-- DuckDB: "SQLite for analytics". No server, runs on your laptop,
-- reads Parquet/CSV files directly — no CREATE TABLE, no loading step.
-- Read every Parquet file matching a wildcard and aggregate it:
SELECT
event_type,
COUNT(*) AS events,
ROUND(SUM(amount), 2) AS revenue
FROM read_parquet('events/2024-06-*.parquet') -- a whole month of files
GROUP BY event_type
ORDER BY revenue DESC;
-- Because Parquet is columnar, DuckDB reads ONLY the 3 columns this query
--
...Result — a month of files:
| event_type | events | revenue |
|---|---|---|
| purchase | 55210 | 248900.75 |
| view | 231880 | 0 |
| signup | 12410 | 0 |
Your Turn: read the files
One concept this time — name the DuckDB function that reads Parquet files, and the count function for the body.
🎯 Your Turn: read_parquet
Count rows and unique users across a folder of Parquet files.
-- 🎯 YOUR TURN — fill in the two blanks for a DuckDB query over files.
-- Goal: count rows and unique users across a folder of Parquet files.
SELECT
COUNT(*) AS events,
APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM ___('events/2024-06-*.parquet'); -- 👉 the DuckDB function that reads Parquet
-- (Hint: for a CSV instead, the function would be read_csv_auto('...').)
-- ✅ Expected: 1 row, 2 columns, e.g.
-- events | unique_users
-- 41980 | 91236. Presto/Trino — One Query, Many Sources
Presto/Trino is a federated MPP engine: it can join data living in different systems in a single SQL statement — a Postgres table, files in a Hive lake, even a document store — without first copying everything into one place. You reference each source by a catalog.schema.table name.
When should distributed SQL beat a single database? When the data no longer fits (or no longer fits affordably) on one machine, when it spans multiple systems, or when scans are so large that splitting them across a cluster finishes work a lone server never could. For everyday app data — frequent small reads and writes — a single relational database is still simpler, cheaper, and faster.
Common Errors (and the fix)
- Full scan with no partition pruning: if you don't filter on the partition column (or you wrap it in a function like
YEAR(event_date)), the engine reads every folder. Filter the raw partition column directly —WHERE event_date = '2024-06-01'— so the pushdown can prune. - The small-files problem: thousands of tiny files (a few KB each) cripple Hive/Spark — there's huge per-file overhead. Compact them into fewer, larger files (~128 MB+). Symptom: a trivial query runs for minutes.
- Row format for analytics: storing big tables as CSV/JSON means every query scans every column. Convert to Parquet/ORC so scans read only the columns they need.
- Treating a warehouse like OLTP: engines like Hive, Spark SQL, and BigQuery are built for big scans, not single-row lookups or constant
UPDATE/DELETE. Row-at-a-timeSELECT … WHERE id = 42belongs in PostgreSQL/MySQL, not a data lake. - Exact distinct on billions of rows:
COUNT(DISTINCT user_id)can be brutally slow at scale. UseAPPROX_COUNT_DISTINCT()when a couple of percent error is fine.
📘 Quick Reference
| Engine / Syntax | What it is / does |
|---|---|
| Apache Hive | SQL → batch jobs on Hadoop; bottomless scale, slow |
| Spark SQL | Same SQL, in-memory across a cluster; fast & interactive |
| Presto/Trino | Federated MPP — join across many sources in one query |
| DuckDB | Single-node, server-less; reads files directly, very fast (~100GB) |
| CREATE EXTERNAL TABLE | Define SQL columns over files already in storage (no copy) |
| PARTITIONED BY (col) | Split data into folders by col → enables partition pruning |
| STORED AS PARQUET | Columnar storage: reads only the columns a query needs |
| read_parquet('*.parquet') | DuckDB: query Parquet files directly, no load step |
| read_csv_auto('f.csv') | DuckDB: query a CSV with inferred column types |
| APPROX_COUNT_DISTINCT() | Fast approximate unique count for huge data |
Frequently Asked Questions
Q: Do I have to learn a new language for big data?
No — that's the whole point. Hive, Spark SQL, Presto/Trino, DuckDB, and BigQuery all speak standard SQL. Your SELECT … GROUP BY … JOIN skills transfer directly; only the execution engine changes.
Q: When should I use DuckDB instead of a cluster?
If your data fits on one machine (roughly up to ~100 GB of Parquet), DuckDB is usually faster, simpler, and free — no cluster to manage. Reach for Spark/Presto when the data outgrows a single node or spans many systems.
Q: What's the real difference between Parquet and CSV?
CSV is row-oriented text: every query reads every column and there's no compression. Parquet is columnar and compressed, so analytics read only the needed columns and far fewer bytes. For big-data SQL, prefer Parquet (or ORC).
Q: Can these engines replace my PostgreSQL?
No. They're built for big scans and aggregation, not single-row lookups or constant updates. Keep your transactional app on PostgreSQL/MySQL; use big-data SQL for the analytics on top of it.
Mini-Challenge: Daily Purchase Revenue
Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then run it in DuckDB (or any SQL playground with sample data) to confirm.
🎯 Mini-Challenge
Per-day purchase revenue from a month of Parquet files, biggest day first.
-- 🎯 MINI-CHALLENGE
-- Using ONLY what this lesson covered (read_parquet / external tables,
-- partition pruning with WHERE, GROUP BY, an aggregate, ORDER BY):
--
-- 1. Read 'events/2024-06-*.parquet' in DuckDB
-- 2. Keep only event_type = 'purchase'
-- 3. Return event_date and total revenue (SUM of amount) per day
-- 4. Show the biggest-revenue day first
--
-- ✅ Expected: one row per date, columns event_date and revenue,
-- e.g. 2024-06-14 | 3120.00 , 2024-06-01 | 2980.50 , ...
-
...🎉 Lesson Complete
- ✅ Distributed engines (Hive, Spark SQL, Presto/Trino) run the SQL you already know across clusters
- ✅ DuckDB queries Parquet/CSV files directly on one machine — no server, very fast
- ✅ Columnar storage (Parquet/ORC) reads only the columns a query needs
- ✅ Partitioning + predicate pushdown prune scans to a few folders instead of the whole lake
- ✅ Use distributed SQL when data is too big or too spread out for one database — not for OLTP
- ✅ Next: the
Final Project— build a complete, production-ready database system
Sign up for free to track which lessons you've completed and get learning reminders.