Lesson 49 • Advanced
SQL with Big Data Tools
Run SQL at massive scale with Hive, Spark SQL, DuckDB, and Presto — the same SQL you already know.
✅ What You'll Learn
- Apache Hive: SQL on Hadoop for batch processing
- Spark SQL: in-memory distributed analytics
- DuckDB: local analytics on CSV/Parquet files
- Presto/Trino: federated queries across data sources
🌊 SQL Beyond Traditional Databases
🎯 Real-World Analogy: Traditional SQL databases are like a library with an amazing card catalog — great for finding books (rows). Big Data SQL tools are like having the same card catalog but for the entire British Library, Library of Congress, and every public library on Earth — combined. Same search skills, astronomically more data.
The key insight: you don't need to learn a new language. All these tools speak SQL. Your knowledge from this course transfers directly. The difference is in how they execute queries: distributing work across dozens or hundreds of machines.
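To see that your SQL really does transfer, the same aggregate query you would submit to Spark SQL or Presto can be run against any small local engine. A minimal sketch using Python's built-in sqlite3 module (the table and rows are made up for illustration):

```python
import sqlite3

# The same GROUP BY aggregate you'd send to Spark SQL or Presto,
# run here against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("books", 12.5), ("books", 7.5), ("games", 30.0)],
)

rows = conn.execute("""
    SELECT category, SUM(amount) AS revenue, COUNT(*) AS transactions
    FROM sales
    GROUP BY category
    ORDER BY revenue DESC
""").fetchall()

print(rows)  # → [('games', 30.0, 1), ('books', 20.0, 2)]
```

Only the connection line would change on a cluster engine; the query text itself stays the same.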
💡 Pro Tip:
Start with DuckDB for local analytics on files up to ~100GB. It requires zero setup and runs entirely on your laptop. Graduate to Spark SQL or Presto when you need cluster-scale processing.
Try It: Hive SQL on Hadoop
Query external tables and partitioned data on HDFS
-- Apache Hive: SQL on Hadoop (HDFS)
-- Write SQL, Hive translates to MapReduce/Tez/Spark jobs
-- Create an external table over files in HDFS
CREATE EXTERNAL TABLE web_logs (
ip_address STRING,
request_time TIMESTAMP,
method STRING,
url STRING,
status_code INT,
response_size BIGINT,
user_agent STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/2024/';
-- Partitioned table for efficient queries
CREATE TABLE events (
event_id BIGINT,  -- columns are illustrative
event_type STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;
Try It: Spark SQL
Run distributed analytics with window functions and CTEs
-- Apache Spark SQL: In-memory distributed SQL
-- Often orders of magnitude faster than MapReduce-based Hive for interactive queries
-- Register a DataFrame as a SQL table
-- df = spark.read.parquet('/data/sales/')
-- df.createOrReplaceTempView('sales')
-- Standard SQL works on Spark DataFrames
SELECT
DATE_TRUNC('month', sale_date) AS month,
category,
SUM(amount) AS revenue,
COUNT(*) AS transactions,
PERCENTILE_APPROX(amount, 0.5) AS median_sale
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY DATE_TRUNC('month', sale_date), category
ORDER BY month, category;
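You can prototype this query shape locally before submitting it to a cluster. A sketch using Python's stdlib sqlite3, where `strftime('%Y-%m', ...)` stands in for `DATE_TRUNC('month', ...)` (sqlite3 has neither DATE_TRUNC nor PERCENTILE_APPROX; the data is made up):

```python
import sqlite3

# Monthly revenue per category, same query shape as the Spark SQL
# version; strftime('%Y-%m', ...) plays the role of DATE_TRUNC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-05", "books", 10.0),
        ("2024-01-20", "books", 30.0),
        ("2024-02-03", "games", 25.0),
    ],
)

rows = conn.execute("""
    SELECT strftime('%Y-%m', sale_date) AS month,
           category,
           SUM(amount) AS revenue,
           COUNT(*) AS transactions
    FROM sales
    WHERE sale_date >= '2024-01-01'
    GROUP BY month, category
    ORDER BY month, category
""").fetchall()

print(rows)  # → [('2024-01', 'books', 40.0, 2), ('2024-02', 'games', 25.0, 1)]
```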
Try It: DuckDB Local Analytics
Query CSV and Parquet files with zero setup using DuckDB
-- DuckDB: The "SQLite for analytics"
-- Runs locally, no server needed, blazing fast on CSVs/Parquet
-- Query a CSV file directly (no CREATE TABLE needed!)
SELECT
country,
COUNT(*) AS total_orders,
ROUND(AVG(total), 2) AS avg_order,
ROUND(SUM(total), 2) AS revenue
FROM read_csv_auto('orders_2024.csv')
GROUP BY country
ORDER BY revenue DESC
LIMIT 10;
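To appreciate what `read_csv_auto` saves you, here is the manual version of the same job: loading a CSV into a database by hand before querying it. A sketch with Python's stdlib csv and sqlite3 modules (file name and rows are made up; the file is created inline so the example is self-contained):

```python
import csv
import os
import sqlite3
import tempfile

# Without DuckDB's read_csv_auto, querying a CSV means
# ingesting it into a table first.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    csv.writer(f).writerows(
        [["country", "total"], ["DE", "99.5"], ["DE", "0.5"], ["US", "50"]]
    )
    path = f.name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, total REAL)")
with open(path, newline="") as f:
    reader = csv.DictReader(f)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(r["country"], float(r["total"])) for r in reader],
    )
os.remove(path)

rows = conn.execute("""
    SELECT country, COUNT(*) AS total_orders,
           ROUND(SUM(total), 2) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY revenue DESC
""").fetchall()

print(rows)  # → [('DE', 2, 100.0), ('US', 1, 50.0)]
```

DuckDB collapses the whole ingestion step into `FROM read_csv_auto('...')`, inferring column names and types from the file itself.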
-- Query Parquet files (columnar, compressed)
SELECT
product_category,
DATE_TRUNC('quarter', order_date) AS quarter,
ROUND(SUM(total), 2) AS revenue
FROM read_parquet('orders_2024.parquet')  -- illustrative file name
GROUP BY product_category, quarter
ORDER BY quarter;
Try It: Presto Federated Queries
Query across PostgreSQL, Hive, and MongoDB in a single statement
-- Presto / Trino: Federated SQL engine
-- Query across multiple data sources with one SQL statement
-- Query data from different sources in one query!
SELECT
pg.user_id,
pg.email,
hive.total_pageviews,
mongo.preferences
FROM postgresql.public.users AS pg
JOIN hive.analytics.user_pageviews AS hive
ON pg.user_id = hive.user_id
JOIN mongodb.app.user_settings AS mongo
ON pg.user_id = mongo.user_id
WHERE hive.visit_date >= DATE '2024-01-01';
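A small-scale analogy for this kind of federation is SQLite's `ATTACH DATABASE`, which lets one query join tables living in two separate database files, much as Presto/Trino joins tables living in separate systems. A sketch using Python's stdlib sqlite3 (file names and rows are made up; this is an analogy, not Presto itself):

```python
import os
import sqlite3
import tempfile

# Two independent database files, standing in for two data sources.
tmp = tempfile.mkdtemp()
users_db = os.path.join(tmp, "users.db")
stats_db = os.path.join(tmp, "stats.db")

with sqlite3.connect(users_db) as c:
    c.execute("CREATE TABLE users (user_id INTEGER, email TEXT)")
    c.execute("INSERT INTO users VALUES (1, 'a@example.com')")
with sqlite3.connect(stats_db) as c:
    c.execute(
        "CREATE TABLE user_pageviews (user_id INTEGER, total_pageviews INTEGER)"
    )
    c.execute("INSERT INTO user_pageviews VALUES (1, 42)")

# ATTACH makes the second file queryable under the prefix 'stats',
# so one statement can join across both files.
conn = sqlite3.connect(users_db)
conn.execute("ATTACH DATABASE ? AS stats", (stats_db,))
rows = conn.execute("""
    SELECT u.user_id, u.email, s.total_pageviews
    FROM users AS u
    JOIN stats.user_pageviews AS s ON u.user_id = s.user_id
""").fetchall()

print(rows)  # → [(1, 'a@example.com', 42)]
```

Presto generalizes this idea from files on one disk to whole systems: each catalog prefix (`postgresql.`, `hive.`, `mongodb.`) names a connector to a live data source.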
📋 Quick Reference
| Engine | Scale | Speed | Best For |
|---|---|---|---|
| Hive | Petabytes | Batch | ETL on Hadoop |
| Spark SQL | Petabytes | Fast | Interactive analytics |
| DuckDB | ~100GB | Very fast | Local file analytics |
| Presto/Trino | Petabytes | Fast | Federated queries |
| BigQuery | Petabytes | Fast | Serverless analytics |
🎉 Lesson Complete!
You now know how to use SQL at massive scale! Next up: the Final Project where you'll build a complete production-ready database system.