Lesson 41 • Advanced
Massive Data Handling: Bulk Inserts, Batching & ETL
Load millions of rows efficiently using COPY, batch inserts, staging tables, and ETL patterns.
✅ What You'll Learn
- Bulk loading with PostgreSQL COPY and MySQL LOAD DATA
- Batch INSERT strategies for application-level loading
- ETL pipelines: staging, transformation, and error handling
- Loading data into partitioned tables efficiently
🏗️ Why Bulk Loading Matters
🎯 Real-World Analogy: Imagine moving house. You wouldn't carry items one at a time — you'd pack boxes and load them all into a truck. Bulk loading works the same way: instead of inserting one row at a time (carrying one item), you pack thousands of rows and load them in a single operation.
Single-row INSERTs create enormous overhead: each statement requires a network round trip, query parsing, transaction commit, and WAL write. Loading 10 million rows one at a time could take hours. The same data via COPY completes in seconds.
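Even before switching to COPY or multi-row VALUES, one cheap win is wrapping many inserts in a single transaction, so the commit and WAL flush happen once per batch rather than once per row. A minimal sketch (the `logs` table here is illustrative):

```sql
-- One commit (and one WAL flush) for the whole batch,
-- instead of one per row
BEGIN;
INSERT INTO logs (level, message) VALUES ('INFO', 'Started');
INSERT INTO logs (level, message) VALUES ('WARN', 'Slow query');
-- ... thousands more inserts ...
COMMIT;
```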
⚡ Speed Comparison (10M rows)
- Single-row INSERT: ~45 minutes
- Batch INSERT (1,000 rows/batch): ~3 minutes
- COPY command: ~15 seconds
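To reproduce numbers like these on your own hardware, you can synthesise test rows entirely server-side with PostgreSQL's `generate_series` — no client round trips at all (the `logs` table and row count are illustrative):

```sql
-- Generate 10 million synthetic rows in a single statement
INSERT INTO logs (level, message)
SELECT 'INFO', 'synthetic row ' || g
FROM generate_series(1, 10000000) AS g;
```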
Try It: COPY for Bulk Loading
Use PostgreSQL COPY to load and export CSV data
-- PostgreSQL: COPY command for bulk loading
-- Fastest way to load millions of rows
-- Load from CSV file
COPY products (name, price, category_id, stock)
FROM '/data/products.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
-- Load from stdin (pipe data)
COPY orders FROM STDIN WITH (FORMAT csv);
1,101,2024-01-15,99.99,shipped
2,102,2024-01-16,149.50,pending
\.
-- Export data to CSV
COPY (
SELECT o.order_id, c.email, o.total
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
) TO '/data/orders_export.csv'
WITH (FORMAT csv, HEADER true);
Try It: Batch INSERT Strategies
Compare single-row vs multi-row vs UNNEST inserts
-- Batch INSERT for application-level bulk loading
-- Much faster than single-row inserts
-- ❌ SLOW: One insert per row (1 million round trips!)
INSERT INTO logs (level, message) VALUES ('INFO', 'Started');
INSERT INTO logs (level, message) VALUES ('WARN', 'Slow query');
-- ... 999,998 more statements
-- ✅ FAST: Multi-row VALUES (one round trip per batch)
INSERT INTO logs (level, message) VALUES
('INFO', 'Started'),
('WARN', 'Slow query'),
('ERROR', 'Connection timeout'),
('INFO', 'Completed');
-- ✅ FAST: UNNEST arrays (pairs well with parameterised queries —
-- pass two arrays instead of building a giant VALUES list)
INSERT INTO logs (level, message)
SELECT * FROM UNNEST(
  ARRAY['INFO', 'WARN', 'ERROR'],
  ARRAY['Started', 'Slow query', 'Connection timeout']
) AS t(level, message);
🔄 The ETL Pipeline
ETL stands for Extract, Transform, Load — the standard pattern for moving data between systems.
📥 Extract
Pull raw data from CSV files, APIs, or other databases into a staging area.
🔧 Transform
Clean, validate, normalise, and enrich the data. Reject bad rows.
📤 Load
Insert clean data into production tables with proper constraints.
⚠️ Common Mistake: Loading raw data directly into production tables. Always use a staging table first — it lets you validate data before it touches your real tables.
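Staging also gives you somewhere to put the bad rows. One hedged pattern, assuming a `staging_orders` table like the one in this lesson's ETL example (the rejects table and reason text are hypothetical):

```sql
-- Copy rows that fail validation into a rejects table
-- for later inspection, instead of silently dropping them
CREATE TABLE staging_rejects AS
SELECT s.*, 'non-numeric quantity' AS reject_reason
FROM staging_orders s
WHERE s.quantity !~ '^[0-9]+$';
```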
Try It: ETL Pipeline
Build a complete Extract → Transform → Load pipeline with error handling
-- ETL Pattern: Extract → Transform → Load
-- Stage raw data, clean it, then load into production tables
-- Step 1: Create staging table (no constraints)
CREATE TEMPORARY TABLE staging_orders (
raw_date TEXT,
customer_email TEXT,
product_name TEXT,
quantity TEXT,
price TEXT
);
-- Step 2: Bulk load raw data
COPY staging_orders FROM '/data/raw_orders.csv'
WITH (FORMAT csv, HEADER true);
-- Step 3: Transform and load into production
INSERT INTO orders (customer_id, order_date, quantity, price)
SELECT c.customer_id,
       s.raw_date::DATE,
       s.quantity::INT,
       s.price::NUMERIC(10,2)
FROM staging_orders s
JOIN customers c ON c.email = s.customer_email
WHERE s.quantity ~ '^[0-9]+$';  -- reject rows with a non-numeric quantity
Try It: Loading into Partitioned Tables
Bulk load data that auto-routes to the correct partition
-- Loading into partitioned tables
-- Data automatically routes to correct partition
-- Table partitioned by month
CREATE TABLE events (
event_id BIGSERIAL,
event_type TEXT NOT NULL,
payload JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
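With the partitions in place, a plain COPY against the parent table is enough — PostgreSQL routes each row to the matching partition by `created_at`. A sketch (the file path is illustrative):

```sql
-- Rows land in events_2024_01 or events_2024_02
-- automatically, based on created_at
COPY events (event_type, payload, created_at)
FROM '/data/events_jan_feb.csv'
WITH (FORMAT csv);

-- A row outside every partition's range makes the load fail,
-- so create partitions before loading
```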
📋 Quick Reference
| Method | Speed | Best For |
|---|---|---|
| COPY | ⚡⚡⚡ | Server-side file loading |
| Batch INSERT | ⚡⚡ | Application-level loading |
| LOAD DATA INFILE | ⚡⚡⚡ | MySQL bulk loading |
| Staging + Transform | ⚡⚡ | Dirty data with validation |
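The table lists MySQL's LOAD DATA INFILE without showing it; for comparison, a sketch equivalent to the earlier COPY example, assuming the same `products` columns:

```sql
-- MySQL: server-side bulk load, analogous to PostgreSQL COPY
-- (requires the FILE privilege and a path permitted by secure_file_priv)
LOAD DATA INFILE '/data/products.csv'
INTO TABLE products
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(name, price, category_id, stock);
```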
💡 Pro Tip: Before a massive bulk load, drop secondary indexes and disable triggers, load the data, then re-enable the triggers and rebuild the indexes. This can make loads 10x faster because the database doesn't need to update index structures for every single row.
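A sketch of that pattern in PostgreSQL, using the `events` table from above — the index name is hypothetical, and note that DISABLE TRIGGER USER pauses only user-defined triggers, not foreign-key checks:

```sql
-- 1. Drop secondary indexes and pause user-defined triggers
DROP INDEX IF EXISTS idx_events_type;   -- hypothetical index
ALTER TABLE events DISABLE TRIGGER USER;

-- 2. Bulk load while writes are cheap
COPY events (event_type, payload, created_at)
FROM '/data/events.csv' WITH (FORMAT csv);

-- 3. Restore triggers and rebuild the index in one pass
ALTER TABLE events ENABLE TRIGGER USER;
CREATE INDEX idx_events_type ON events (event_type);
```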
🎉 Lesson Complete!
You can now load millions of rows efficiently! Next, learn when to use database triggers vs application logic.