Lesson 49 • Advanced Track

SQL with Big Data Tools

By the end of this lesson you'll be able to run the same SQL you already know across terabytes and petabytes of data — using distributed engines like Hive, Spark SQL, and Presto/Trino, and the laptop-sized powerhouse DuckDB — and you'll know why columnar storage and partitioning make those queries fast.

What You'll Learn

✓Run SQL on huge datasets with Hive, Spark SQL, Presto/Trino & DuckDB
✓Point an EXTERNAL table at files in a data lake (no copying)
✓Read Parquet/CSV files directly with DuckDB's read_parquet
✓Partition data so queries scan folders, not the whole lake
✓Why columnar storage (Parquet/ORC) beats rows for analytics
✓When distributed SQL beats a single database — and when it doesn't

Before you start: finish Data Warehousing so you're comfortable with fact tables, aggregation, and OLAP-style queries. The in-browser editor lets you write and edit SQL; the real big-data engines below run on clusters or files, so treat these as read-and-understand examples. To run standard SQL and check the shapes of results, copy a query into a free playground like sqliteonline.com, or install DuckDB (a single file) to try read_parquet for real. Every example shows its expected result so you can check yourself.

Our Sample Dataset: `events`

Imagine billions of these rows spread across thousands of files in a data lake, one folder per day. Every query in this lesson runs against this events shape — a tiny preview is shown here so you can picture the columns.

Result:

event_id	user_id	event_type	amount	event_date
9001	4471	purchase	49.99	2024-06-01
9002	4471	view	0	2024-06-01
9003	5520	signup	0	2024-06-01
9004	3310	purchase	8.5	2024-06-02
…	…	…	…	…

Big data just means data too large or fast for a single database to handle comfortably — so the work is spread across many machines. The good news: you query it with the SQL you already know.

1. The Same SQL, a Bigger Engine

A normal database (PostgreSQL, MySQL) runs your query on one machine. A distributed query engine splits the work across a cluster of machines that each scan part of the data, then combines the partial answers. You still write SELECT … GROUP BY …; the engine handles the splitting.

The big four you'll meet: Apache Hive (SQL compiled into batch jobs over Hadoop — slow but bottomless), Spark SQL (the same idea but in memory, far faster for interactive work), Presto/Trino (a "federated" MPP engine — MPP = massively parallel processing — that can join across many different sources at once), and DuckDB (not distributed at all, but astonishingly fast on a single machine for files up to ~100 GB).

📚 Real-world analogy

A single database is one librarian searching one library. A distributed engine is a thousand librarians, each searching one wing, then handing their slips to a head librarian who tallies the total. Same request ("how many books on SQL?"), wildly more shelves — and it finishes in about the same time because everyone works at once.

2. Columnar Storage (Parquet/ORC) vs Rows

A traditional database stores data row by row: all of row 1's values together, then all of row 2's. That's perfect when you fetch whole records one at a time. Analytics is the opposite — you usually want one or two columns across millions of rows (e.g. SUM(amount)).

Columnar formats like Parquet and ORC store all of one column's values together. So SUM(amount) reads only the amount column and skips the other 49 columns on disk. Columns also compress brilliantly (similar values sit next to each other), so files are smaller and faster to read.

Rule of thumb: row storage for OLTP (many small reads/writes of whole records); columnar (Parquet/ORC) for analytics (big scans of a few columns). Nearly every big-data engine prefers Parquet.

Layout	On disk	Best for
Row (CSV, OLTP DB)	row1 all-cols, row2 all-cols…	Fetching/updating whole records
Columnar (Parquet/ORC)	all event_type, then all amount…	Aggregating a few columns over many rows

3. External Tables & Partition Pruning (Hive)

An EXTERNAL table is a SQL definition layered over files that already exist in storage (HDFS, S3). Hive doesn't copy or own the data — it just describes the columns so you can query the files with SQL. Drop the table and the files stay.

Partitioning physically splits the data into folders by a column — here, one folder per event_date. When your WHERE filters on that column, the engine reads only the matching folders. That's partition pruning (a form of predicate pushdown: the filter is pushed down to the storage layer so unneeded data is never read). Filtering one day out of a year can mean scanning 1 folder instead of 365.

Hive: external table + partition pruning

Define a table over data-lake files and query one partition.

Try it Yourself »

SQL

-- Apache Hive: write SQL, Hive runs it as distributed jobs over Hadoop.
-- An EXTERNAL table is just a SQL "view" pointing at files already in storage.
-- Hive never copies the data — it reads the files where they live.

CREATE EXTERNAL TABLE web_events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type STRING,
    amount     DOUBLE
)
PARTITIONED BY (event_date STRING)   -- folder-per-day on disk (see below)
STORED AS PARQUET                    -- columnar format: only reads needed 
...

Result — WHERE event_date = '2024-06-01':

event_type	events	revenue
purchase	1840	8210.5
view	7720	0
signup	410	0

Your Turn: a partitioned aggregation

Fill in the blanks so the query prunes to a single day's folder and totals revenue per event_type. The expected result is in the comments so you can check yourself.

🎯 Your Turn: partition pruning

Filter and group by the partition column.

Try it Yourself »

SQL

-- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself".
-- Goal: total revenue per event_type for ONE day, scanning only that day's folder.

SELECT
    event_type,
    ROUND(SUM(amount), 2) AS revenue
FROM web_events
WHERE ___ = '2024-06-01'      -- 👉 the PARTITION column, so Hive prunes to 1 folder
GROUP BY ___                  -- 👉 group by the same column you SELECTed (not amount)
ORDER BY revenue DESC;

-- ✅ Expected: 3 rows (purchase, signup, view), e.g.
--    purchase | 8
...

4. Spark SQL — Distributed, In Memory

Spark SQL runs the same standard SQL but keeps data in memory across the cluster, which makes repeated, interactive queries far faster than Hive's batch jobs. You register your files as a temporary view once, then query them like any table — window functions, CTEs, and joins all work.

Notice APPROX_COUNT_DISTINCT: at billions of rows, an exact distinct count is expensive, so big-data engines offer fast approximate versions (a couple of percent error) that finish in a fraction of the time.

Spark SQL: in-memory analytics

Aggregate a week of partitions with an approximate distinct count.

Try it Yourself »

SQL

-- Spark SQL: the same SQL, run IN MEMORY across a cluster.
-- Often 10-100x faster than Hive for interactive, repeated queries.

-- In Python you'd register the files as a view first:
--   spark.read.parquet('s3://acme-lake/events/').createOrReplaceTempView('events')
-- After that, plain SQL works:

SELECT
    event_date,
    event_type,
    COUNT(*)                       AS events,
    APPROX_COUNT_DISTINCT(user_id) AS unique_users   -- fast, ~2% error
FROM events
WHERE event_date BETWEEN '202
...

Result — first rows:

event_date	event_type	events	unique_users
2024-06-01	view	7720	5102
2024-06-01	purchase	1840	1611
2024-06-02	view	7610	5044
…	…	…	…

5. DuckDB — Files Straight to SQL

DuckDB is the "SQLite for analytics": a single embedded engine, no server, that reads CSV and Parquet files directly. There's no loading step — you point read_parquet('…') (or read_csv_auto('…')) at a file or a wildcard of files and query them immediately.

Because Parquet is columnar, DuckDB reads only the columns your query touches — exactly the storage win from Section 2, now in one line on your laptop. For datasets up to ~100 GB it's often faster than spinning up a cluster, and it costs nothing to run.

DuckDB: read_parquet over a wildcard

Aggregate a month of Parquet files with zero setup.

Try it Yourself »

SQL

-- DuckDB: "SQLite for analytics". No server, runs on your laptop,
-- reads Parquet/CSV files directly — no CREATE TABLE, no loading step.

-- Read every Parquet file matching a wildcard and aggregate it:
SELECT
    event_type,
    COUNT(*)               AS events,
    ROUND(SUM(amount), 2)  AS revenue
FROM read_parquet('events/2024-06-*.parquet')   -- a whole month of files
GROUP BY event_type
ORDER BY revenue DESC;

-- Because Parquet is columnar, DuckDB reads ONLY the 3 columns this query
-- 
...

Result — a month of files:

event_type	events	revenue
purchase	55210	248900.75
view	231880	0
signup	12410	0

Your Turn: read the files

One concept this time — name the DuckDB function that reads Parquet files, and the count function for the body.

🎯 Your Turn: read_parquet

Count rows and unique users across a folder of Parquet files.

Try it Yourself »

SQL

-- 🎯 YOUR TURN — fill in the two blanks for a DuckDB query over files.
-- Goal: count rows and unique users across a folder of Parquet files.

SELECT
    COUNT(*)                       AS events,
    APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM ___('events/2024-06-*.parquet');   -- 👉 the DuckDB function that reads Parquet

-- (Hint: for a CSV instead, the function would be read_csv_auto('...').)

-- ✅ Expected: 1 row, 2 columns, e.g.
--    events | unique_users
--    41980  | 9123

6. Presto/Trino — One Query, Many Sources

Presto/Trino is a federated MPP engine: it can join data living in different systems in a single SQL statement — a Postgres table, files in a Hive lake, even a document store — without first copying everything into one place. You reference each source by a catalog.schema.table name.

When should distributed SQL beat a single database? When the data no longer fits (or no longer fits affordably) on one machine, when it spans multiple systems, or when scans are so large that splitting them across a cluster finishes work a lone server never could. For everyday app data — frequent small reads and writes — a single relational database is still simpler, cheaper, and faster.

Common Errors (and the fix)

Full scan with no partition pruning: if you don't filter on the partition column (or you wrap it in a function like YEAR(event_date)), the engine reads every folder. Filter the raw partition column directly — WHERE event_date = '2024-06-01' — so the pushdown can prune.
The small-files problem: thousands of tiny files (a few KB each) cripple Hive/Spark — there's huge per-file overhead. Compact them into fewer, larger files (~128 MB+). Symptom: a trivial query runs for minutes.
Row format for analytics: storing big tables as CSV/JSON means every query scans every column. Convert to Parquet/ORC so scans read only the columns they need.
Treating a warehouse like OLTP: engines like Hive, Spark SQL, and BigQuery are built for big scans, not single-row lookups or constant UPDATE/DELETE. Row-at-a-time SELECT … WHERE id = 42 belongs in PostgreSQL/MySQL, not a data lake.
Exact distinct on billions of rows: COUNT(DISTINCT user_id) can be brutally slow at scale. Use APPROX_COUNT_DISTINCT() when a couple of percent error is fine.

📘 Quick Reference

Engine / Syntax	What it is / does
Apache Hive	SQL → batch jobs on Hadoop; bottomless scale, slow
Spark SQL	Same SQL, in-memory across a cluster; fast & interactive
Presto/Trino	Federated MPP — join across many sources in one query
DuckDB	Single-node, server-less; reads files directly, very fast (~100GB)
CREATE EXTERNAL TABLE	Define SQL columns over files already in storage (no copy)
PARTITIONED BY (col)	Split data into folders by col → enables partition pruning
STORED AS PARQUET	Columnar storage: reads only the columns a query needs
read_parquet('*.parquet')	DuckDB: query Parquet files directly, no load step
read_csv_auto('f.csv')	DuckDB: query a CSV with inferred column types
APPROX_COUNT_DISTINCT()	Fast approximate unique count for huge data

Frequently Asked Questions

Q: Do I have to learn a new language for big data?

No — that's the whole point. Hive, Spark SQL, Presto/Trino, DuckDB, and BigQuery all speak standard SQL. Your SELECT … GROUP BY … JOIN skills transfer directly; only the execution engine changes.

Q: When should I use DuckDB instead of a cluster?

If your data fits on one machine (roughly up to ~100 GB of Parquet), DuckDB is usually faster, simpler, and free — no cluster to manage. Reach for Spark/Presto when the data outgrows a single node or spans many systems.

Q: What's the real difference between Parquet and CSV?

CSV is row-oriented text: every query reads every column and there's no compression. Parquet is columnar and compressed, so analytics read only the needed columns and far fewer bytes. For big-data SQL, prefer Parquet (or ORC).

Q: Can these engines replace my PostgreSQL?

No. They're built for big scans and aggregation, not single-row lookups or constant updates. Keep your transactional app on PostgreSQL/MySQL; use big-data SQL for the analytics on top of it.

Mini-Challenge: Daily Purchase Revenue

Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then run it in DuckDB (or any SQL playground with sample data) to confirm.

🎯 Mini-Challenge

Per-day purchase revenue from a month of Parquet files, biggest day first.

Try it Yourself »

SQL

-- 🎯 MINI-CHALLENGE
-- Using ONLY what this lesson covered (read_parquet / external tables,
-- partition pruning with WHERE, GROUP BY, an aggregate, ORDER BY):
--
--   1. Read 'events/2024-06-*.parquet' in DuckDB
--   2. Keep only event_type = 'purchase'
--   3. Return event_date and total revenue (SUM of amount) per day
--   4. Show the biggest-revenue day first
--
-- ✅ Expected: one row per date, columns event_date and revenue,
--    e.g.  2024-06-14 | 3120.00 ,  2024-06-01 | 2980.50 , ...

-
...

🎉 Lesson Complete

✅ Distributed engines (Hive, Spark SQL, Presto/Trino) run the SQL you already know across clusters
✅ DuckDB queries Parquet/CSV files directly on one machine — no server, very fast
✅ Columnar storage (Parquet/ORC) reads only the columns a query needs
✅ Partitioning + predicate pushdown prune scans to a few folders instead of the whole lake
✅ Use distributed SQL when data is too big or too spread out for one database — not for OLTP
✅ Next: the Final Project — build a complete, production-ready database system

SQL with Big Data Tools

What You'll Learn

Our Sample Dataset: events

1. The Same SQL, a Bigger Engine

2. Columnar Storage (Parquet/ORC) vs Rows

3. External Tables & Partition Pruning (Hive)

Hive: external table + partition pruning

Your Turn: a partitioned aggregation

🎯 Your Turn: partition pruning

4. Spark SQL — Distributed, In Memory

Spark SQL: in-memory analytics

5. DuckDB — Files Straight to SQL

DuckDB: read_parquet over a wildcard

Your Turn: read the files

🎯 Your Turn: read_parquet

6. Presto/Trino — One Query, Many Sources

Common Errors (and the fix)

📘 Quick Reference

Frequently Asked Questions

Mini-Challenge: Daily Purchase Revenue

🎯 Mini-Challenge

🎉 Lesson Complete

Cookie & Privacy Settings

Our Sample Dataset: `events`