Skip to main content

    Lesson 49 • Advanced Track

    SQL with Big Data Tools

    By the end of this lesson you'll be able to run the same SQL you already know across terabytes and petabytes of data — using distributed engines like Hive, Spark SQL, and Presto/Trino, and the laptop-sized powerhouse DuckDB — and you'll know why columnar storage and partitioning make those queries fast.

    What You'll Learn

    • Run SQL on huge datasets with Hive, Spark SQL, Presto/Trino & DuckDB
    • Point an EXTERNAL table at files in a data lake (no copying)
    • Read Parquet/CSV files directly with DuckDB's read_parquet
    • Partition data so queries scan folders, not the whole lake
    • Why columnar storage (Parquet/ORC) beats rows for analytics
    • When distributed SQL beats a single database — and when it doesn't

    Our Sample Dataset: events

    Imagine billions of these rows spread across thousands of files in a data lake, one folder per day. Every query in this lesson runs against this events shape — a tiny preview is shown here so you can picture the columns.

    Result:

    event_iduser_idevent_typeamountevent_date
    90014471purchase49.992024-06-01
    90024471view02024-06-01
    90035520signup02024-06-01
    90043310purchase8.52024-06-02

    Big data just means data too large or fast for a single database to handle comfortably — so the work is spread across many machines. The good news: you query it with the SQL you already know.

    1. The Same SQL, a Bigger Engine

    A normal database (PostgreSQL, MySQL) runs your query on one machine. A distributed query engine splits the work across a cluster of machines that each scan part of the data, then combines the partial answers. You still write SELECT … GROUP BY …; the engine handles the splitting.

    The big four you'll meet: Apache Hive (SQL compiled into batch jobs over Hadoop — slow but bottomless), Spark SQL (the same idea but in memory, far faster for interactive work), Presto/Trino (a "federated" MPP engine — MPP = massively parallel processing — that can join across many different sources at once), and DuckDB (not distributed at all, but astonishingly fast on a single machine for files up to ~100 GB).

    📚 Real-world analogy

    A single database is one librarian searching one library. A distributed engine is a thousand librarians, each searching one wing, then handing their slips to a head librarian who tallies the total. Same request ("how many books on SQL?"), wildly more shelves — and it finishes in about the same time because everyone works at once.

    2. Columnar Storage (Parquet/ORC) vs Rows

    A traditional database stores data row by row: all of row 1's values together, then all of row 2's. That's perfect when you fetch whole records one at a time. Analytics is the opposite — you usually want one or two columns across millions of rows (e.g. SUM(amount)).

    Columnar formats like Parquet and ORC store all of one column's values together. So SUM(amount) reads only the amount column and skips the other 49 columns on disk. Columns also compress brilliantly (similar values sit next to each other), so files are smaller and faster to read.

    LayoutOn diskBest for
    Row (CSV, OLTP DB)row1 all-cols, row2 all-cols…Fetching/updating whole records
    Columnar (Parquet/ORC)all event_type, then all amount…Aggregating a few columns over many rows

    3. External Tables & Partition Pruning (Hive)

    An EXTERNAL table is a SQL definition layered over files that already exist in storage (HDFS, S3). Hive doesn't copy or own the data — it just describes the columns so you can query the files with SQL. Drop the table and the files stay.

    Partitioning physically splits the data into folders by a column — here, one folder per event_date. When your WHERE filters on that column, the engine reads only the matching folders. That's partition pruning (a form of predicate pushdown: the filter is pushed down to the storage layer so unneeded data is never read). Filtering one day out of a year can mean scanning 1 folder instead of 365.

    Hive: external table + partition pruning

    Define a table over data-lake files and query one partition.

    Try it Yourself »
    SQL
    -- Apache Hive: write SQL, Hive runs it as distributed jobs over Hadoop.
    -- An EXTERNAL table is just a SQL "view" pointing at files already in storage.
    -- Hive never copies the data — it reads the files where they live.
    
    CREATE EXTERNAL TABLE web_events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_type STRING,
        amount     DOUBLE
    )
    PARTITIONED BY (event_date STRING)   -- folder-per-day on disk (see below)
    STORED AS PARQUET                    -- columnar format: only reads needed 
    ...

    Result — WHERE event_date = '2024-06-01':

    event_typeeventsrevenue
    purchase18408210.5
    view77200
    signup4100

    Your Turn: a partitioned aggregation

    Fill in the blanks so the query prunes to a single day's folder and totals revenue per event_type. The expected result is in the comments so you can check yourself.

    🎯 Your Turn: partition pruning

    Filter and group by the partition column.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself".
    -- Goal: total revenue per event_type for ONE day, scanning only that day's folder.
    
    SELECT
        event_type,
        ROUND(SUM(amount), 2) AS revenue
    FROM web_events
    WHERE ___ = '2024-06-01'      -- 👉 the PARTITION column, so Hive prunes to 1 folder
    GROUP BY ___                  -- 👉 group by the same column you SELECTed (not amount)
    ORDER BY revenue DESC;
    
    -- ✅ Expected: 3 rows (purchase, signup, view), e.g.
    --    purchase | 8
    ...

    4. Spark SQL — Distributed, In Memory

    Spark SQL runs the same standard SQL but keeps data in memory across the cluster, which makes repeated, interactive queries far faster than Hive's batch jobs. You register your files as a temporary view once, then query them like any table — window functions, CTEs, and joins all work.

    Notice APPROX_COUNT_DISTINCT: at billions of rows, an exact distinct count is expensive, so big-data engines offer fast approximate versions (a couple of percent error) that finish in a fraction of the time.

    Spark SQL: in-memory analytics

    Aggregate a week of partitions with an approximate distinct count.

    Try it Yourself »
    SQL
    -- Spark SQL: the same SQL, run IN MEMORY across a cluster.
    -- Often 10-100x faster than Hive for interactive, repeated queries.
    
    -- In Python you'd register the files as a view first:
    --   spark.read.parquet('s3://acme-lake/events/').createOrReplaceTempView('events')
    -- After that, plain SQL works:
    
    SELECT
        event_date,
        event_type,
        COUNT(*)                       AS events,
        APPROX_COUNT_DISTINCT(user_id) AS unique_users   -- fast, ~2% error
    FROM events
    WHERE event_date BETWEEN '202
    ...

    Result — first rows:

    event_dateevent_typeeventsunique_users
    2024-06-01view77205102
    2024-06-01purchase18401611
    2024-06-02view76105044

    5. DuckDB — Files Straight to SQL

    DuckDB is the "SQLite for analytics": a single embedded engine, no server, that reads CSV and Parquet files directly. There's no loading step — you point read_parquet('…') (or read_csv_auto('…')) at a file or a wildcard of files and query them immediately.

    Because Parquet is columnar, DuckDB reads only the columns your query touches — exactly the storage win from Section 2, now in one line on your laptop. For datasets up to ~100 GB it's often faster than spinning up a cluster, and it costs nothing to run.

    DuckDB: read_parquet over a wildcard

    Aggregate a month of Parquet files with zero setup.

    Try it Yourself »
    SQL
    -- DuckDB: "SQLite for analytics". No server, runs on your laptop,
    -- reads Parquet/CSV files directly — no CREATE TABLE, no loading step.
    
    -- Read every Parquet file matching a wildcard and aggregate it:
    SELECT
        event_type,
        COUNT(*)               AS events,
        ROUND(SUM(amount), 2)  AS revenue
    FROM read_parquet('events/2024-06-*.parquet')   -- a whole month of files
    GROUP BY event_type
    ORDER BY revenue DESC;
    
    -- Because Parquet is columnar, DuckDB reads ONLY the 3 columns this query
    -- 
    ...

    Result — a month of files:

    event_typeeventsrevenue
    purchase55210248900.75
    view2318800
    signup124100

    Your Turn: read the files

    One concept this time — name the DuckDB function that reads Parquet files, and the count function for the body.

    🎯 Your Turn: read_parquet

    Count rows and unique users across a folder of Parquet files.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — fill in the two blanks for a DuckDB query over files.
    -- Goal: count rows and unique users across a folder of Parquet files.
    
    SELECT
        COUNT(*)                       AS events,
        APPROX_COUNT_DISTINCT(user_id) AS unique_users
    FROM ___('events/2024-06-*.parquet');   -- 👉 the DuckDB function that reads Parquet
    
    -- (Hint: for a CSV instead, the function would be read_csv_auto('...').)
    
    -- ✅ Expected: 1 row, 2 columns, e.g.
    --    events | unique_users
    --    41980  | 9123

    6. Presto/Trino — One Query, Many Sources

    Presto/Trino is a federated MPP engine: it can join data living in different systems in a single SQL statement — a Postgres table, files in a Hive lake, even a document store — without first copying everything into one place. You reference each source by a catalog.schema.table name.

    When should distributed SQL beat a single database? When the data no longer fits (or no longer fits affordably) on one machine, when it spans multiple systems, or when scans are so large that splitting them across a cluster finishes work a lone server never could. For everyday app data — frequent small reads and writes — a single relational database is still simpler, cheaper, and faster.

    Common Errors (and the fix)

    • Full scan with no partition pruning: if you don't filter on the partition column (or you wrap it in a function like YEAR(event_date)), the engine reads every folder. Filter the raw partition column directly — WHERE event_date = '2024-06-01' — so the pushdown can prune.
    • The small-files problem: thousands of tiny files (a few KB each) cripple Hive/Spark — there's huge per-file overhead. Compact them into fewer, larger files (~128 MB+). Symptom: a trivial query runs for minutes.
    • Row format for analytics: storing big tables as CSV/JSON means every query scans every column. Convert to Parquet/ORC so scans read only the columns they need.
    • Treating a warehouse like OLTP: engines like Hive, Spark SQL, and BigQuery are built for big scans, not single-row lookups or constant UPDATE/DELETE. Row-at-a-time SELECT … WHERE id = 42 belongs in PostgreSQL/MySQL, not a data lake.
    • Exact distinct on billions of rows: COUNT(DISTINCT user_id) can be brutally slow at scale. Use APPROX_COUNT_DISTINCT() when a couple of percent error is fine.

    📘 Quick Reference

    Engine / SyntaxWhat it is / does
    Apache HiveSQL → batch jobs on Hadoop; bottomless scale, slow
    Spark SQLSame SQL, in-memory across a cluster; fast & interactive
    Presto/TrinoFederated MPP — join across many sources in one query
    DuckDBSingle-node, server-less; reads files directly, very fast (~100GB)
    CREATE EXTERNAL TABLEDefine SQL columns over files already in storage (no copy)
    PARTITIONED BY (col)Split data into folders by col → enables partition pruning
    STORED AS PARQUETColumnar storage: reads only the columns a query needs
    read_parquet('*.parquet')DuckDB: query Parquet files directly, no load step
    read_csv_auto('f.csv')DuckDB: query a CSV with inferred column types
    APPROX_COUNT_DISTINCT()Fast approximate unique count for huge data

    Frequently Asked Questions

    Q: Do I have to learn a new language for big data?

    No — that's the whole point. Hive, Spark SQL, Presto/Trino, DuckDB, and BigQuery all speak standard SQL. Your SELECT … GROUP BY … JOIN skills transfer directly; only the execution engine changes.

    Q: When should I use DuckDB instead of a cluster?

    If your data fits on one machine (roughly up to ~100 GB of Parquet), DuckDB is usually faster, simpler, and free — no cluster to manage. Reach for Spark/Presto when the data outgrows a single node or spans many systems.

    Q: What's the real difference between Parquet and CSV?

    CSV is row-oriented text: every query reads every column and there's no compression. Parquet is columnar and compressed, so analytics read only the needed columns and far fewer bytes. For big-data SQL, prefer Parquet (or ORC).

    Q: Can these engines replace my PostgreSQL?

    No. They're built for big scans and aggregation, not single-row lookups or constant updates. Keep your transactional app on PostgreSQL/MySQL; use big-data SQL for the analytics on top of it.

    Mini-Challenge: Daily Purchase Revenue

    Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then run it in DuckDB (or any SQL playground with sample data) to confirm.

    🎯 Mini-Challenge

    Per-day purchase revenue from a month of Parquet files, biggest day first.

    Try it Yourself »
    SQL
    -- 🎯 MINI-CHALLENGE
    -- Using ONLY what this lesson covered (read_parquet / external tables,
    -- partition pruning with WHERE, GROUP BY, an aggregate, ORDER BY):
    --
    --   1. Read 'events/2024-06-*.parquet' in DuckDB
    --   2. Keep only event_type = 'purchase'
    --   3. Return event_date and total revenue (SUM of amount) per day
    --   4. Show the biggest-revenue day first
    --
    -- ✅ Expected: one row per date, columns event_date and revenue,
    --    e.g.  2024-06-14 | 3120.00 ,  2024-06-01 | 2980.50 , ...
    
    -
    ...

    🎉 Lesson Complete

    • ✅ Distributed engines (Hive, Spark SQL, Presto/Trino) run the SQL you already know across clusters
    • ✅ DuckDB queries Parquet/CSV files directly on one machine — no server, very fast
    • ✅ Columnar storage (Parquet/ORC) reads only the columns a query needs
    • ✅ Partitioning + predicate pushdown prune scans to a few folders instead of the whole lake
    • ✅ Use distributed SQL when data is too big or too spread out for one database — not for OLTP
    • Next: the Final Project — build a complete, production-ready database system

    Sign up for free to track which lessons you've completed and get learning reminders.

    Previous

    Cookie & Privacy Settings

    We use cookies to improve your experience, analyze traffic, and show personalized ads. You can manage your preferences below.

    By clicking "Accept All", you consent to our use of cookies for analytics and personalized advertising. You can customize your preferences or reject non-essential cookies.

    Privacy PolicyTerms of Service

    Install LearnCodingFast

    Learn faster with the app on your home screen.