Advanced Track
Massive Data Handling: Bulk Loads, Batching & ETL
By the end of this lesson you'll be able to load and reshape millions of rows without locking up your database — using COPY/LOAD DATA instead of row-by-row inserts, batching and sizing transactions sensibly, dropping and rebuilding indexes around big loads, running idempotent upserts, and deleting in safe chunks. These are the techniques that turn a 45-minute load into a 15-second one.
What You'll Learn
- ✓Bulk-load with COPY / LOAD DATA INFILE (100x faster than INSERTs)
- ✓Batch many rows into one multi-row INSERT
- ✓Size transactions: commit every few thousand rows
- ✓Drop and rebuild indexes around a big load
- ✓Run idempotent UPSERTs at scale (ON CONFLICT / ON DUPLICATE KEY)
- ✓Delete and update millions of rows in safe chunks
COPY from disk, LOAD DATA INFILE) need a real server with file access, so study those as worked examples.The Scenario: 10 Million Orders
Throughout this lesson, picture a nightly job that loads a fresh export of 10 million order rows into an orders table, then deletes last year's events. Every technique below is judged by one question: how do we do that in seconds instead of an hour, without freezing the live database?
🚚 Real-world analogy
Moving house, you don't carry items one at a time across town — you pack boxes and load a truck. Row-by-row INSERT is carrying one item per trip; COPY is the truck. Same boxes, a tiny fraction of the trips.
1. Why Row-by-Row INSERT Is So Slow
Each separate INSERT statement pays a fixed tax: a network round trip to the server, parsing and planning the query, and a transaction commit that forces a write to the write-ahead log (the on-disk journal the database uses to stay crash-safe). For one row that tax is invisible. For ten million rows you pay it ten million times — and that is where your minutes go.
The slow way — one INSERT per row
See why ten million separate statements is the problem.
-- ❌ The slow way: one INSERT statement per row
-- Each statement is its own round trip + parse + commit.
INSERT INTO orders (customer_id, total) VALUES (101, 99.99);
INSERT INTO orders (customer_id, total) VALUES (102, 149.50);
INSERT INTO orders (customer_id, total) VALUES (103, 12.00);
-- ... imagine 9,999,997 more of these.
-- Loading 10 million rows like this can take 30–60 MINUTES,
-- because the database pays the overhead 10 million times.2. COPY & LOAD DATA INFILE — the Fast Path
PostgreSQL's COPY and MySQL's LOAD DATA INFILE are bulk loaders: they stream an entire file into a table in a single operation, skipping the per-row overhead entirely. This is almost always the fastest way to get data in — typically orders of magnitude faster than individual inserts.
COPY reads files from the server's disk. If your file is on your own machine, use the client-side variant \copy in psql, which streams the file to the server for you.The fast way — COPY / LOAD DATA
Bulk-load a CSV in one streaming operation.
-- ✅ The fast way: bulk-load straight from a file with COPY
-- COPY streams rows in one operation — no per-row overhead.
-- PostgreSQL: read a CSV from disk on the server
COPY orders (customer_id, total, order_date)
FROM '/data/orders.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
-- MySQL does the same job with LOAD DATA INFILE:
-- LOAD DATA INFILE '/data/orders.csv'
-- INTO TABLE orders
-- FIELDS TERMINATED BY ','
-- LINES TERMINATED BY '\n'
-- IGNORE 1 ROWS; -- skip the head
...Result — what the server reports:
| command | rows loaded | time |
|---|---|---|
| COPY | 10000000 | ~15s |
| 10M single INSERTs | 10000000 | ~45m |
Your Turn: complete the COPY
Fill in the blanks to bulk-load customers.csv (which includes a header row). The expected result is in the comments so you can check yourself.
🎯 Your Turn: bulk-load with COPY
Replace the ___ blanks to finish the COPY statement.
-- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself"
-- Goal: bulk-load the customers.csv file (which HAS a header row)
-- into the customers table using PostgreSQL COPY.
___ customers (name, email, country) -- 👉 the bulk-load keyword
FROM '/data/customers.csv'
WITH (FORMAT csv, HEADER ___); -- 👉 the file has a header line
-- ✅ Expected: all rows from customers.csv loaded in one pass,
-- e.g. "COPY 500000" (Postgres reports the row count loaded).3. Multi-Row INSERT Batching
Sometimes the data lives in your application, not a file. You still don't want one statement per row — instead, list many rows in a single INSERT ... VALUES. One statement, one round trip, thousands of rows. The sweet spot is usually 1,000–10,000 rows per batch: big enough to amortise the overhead, small enough to keep memory and transaction size sane.
Multi-row INSERT
Pack many rows into one statement.
-- When the data comes from your APP (not a file), batch it.
-- ❌ One INSERT per row = one round trip per row.
-- ✅ One INSERT with many VALUES = one round trip per batch.
INSERT INTO orders (customer_id, total) VALUES
(101, 99.99),
(102, 149.50),
(103, 12.00),
(104, 45.20),
(105, 7.99);
-- Five rows, ONE round trip. Repeat in batches of
-- 1,000–10,000 rows — that is the sweet spot for most drivers.Result — fewer round trips = faster:
| approach | rows | round trips |
|---|---|---|
| one INSERT per row | 5 | 5 |
| one multi-row INSERT | 5 | 1 |
4. Transactions & Commit Size
A transaction groups statements into one all-or-nothing unit. Committing once per transaction instead of once per row removes a huge amount of disk-flushing overhead. But don't swing too far the other way: wrapping all ten million rows in a single transaction holds locks for the whole load and balloons the write-ahead log. Commit in chunks — every 5,000–50,000 rows is a good range.
Think of COMMIT as saving a document. Saving after every keystroke is slow; never saving risks losing everything. Save in sensible chunks.
Sizing transactions
Commit once per batch, not once per row.
-- Wrap a batch in a transaction so it commits ONCE, not per row.
-- A "transaction" is an all-or-nothing unit of work.
BEGIN; -- start the unit of work
INSERT INTO orders (customer_id, total) VALUES (201, 10.00);
INSERT INTO orders (customer_id, total) VALUES (202, 20.00);
-- ... thousands more inside this same transaction ...
COMMIT; -- flush everything to disk ONCE
-- One giant transaction is just as bad, though: it h
...5. Drop & Rebuild Indexes Around Big Loads
Indexes make reads fast, but they slow writes: every inserted row must also update every index on the table. During a one-off bulk load that maintenance is pure waste, because you can rebuild the whole index once at the end far more cheaply. The pattern is: drop the indexes, load, recreate the indexes, then ANALYZE so the query planner has fresh statistics.
Drop, load, rebuild
Build the index once instead of per-row.
-- Around a HUGE load, drop the indexes first, then rebuild.
-- Updating an index for every one of 10M rows is the slow part.
DROP INDEX idx_orders_customer; -- 1) remove the index
COPY orders FROM '/data/orders.csv' -- 2) bulk-load with no index upkeep
WITH (FORMAT csv, HEADER true);
CREATE INDEX idx_orders_customer -- 3) rebuild ONCE at the end
ON orders (customer_id);
ANALYZE orders; -- 4) refresh stats for the planner
-- Bui
...6. UPSERT at Scale
An upsert means "insert this row, but if a row with the same key already exists, update it instead". It makes a load idempotent — safe to run twice without creating duplicates or crashing on a duplicate-key error. PostgreSQL spells it ON CONFLICT ... DO UPDATE; MySQL spells it ON DUPLICATE KEY UPDATE. In Postgres, the special table EXCLUDED refers to the row you tried to insert.
Batched UPSERT
Insert-or-update without duplicates.
-- UPSERT = "insert, but UPDATE the row if it already exists".
-- Perfect for re-running a load without creating duplicates.
-- PostgreSQL: ON CONFLICT (the conflicting column) DO UPDATE
INSERT INTO inventory (sku, on_hand) VALUES
('A-100', 50),
('A-200', 30),
('A-300', 75)
ON CONFLICT (sku) DO UPDATE
SET on_hand = EXCLUDED.on_hand; -- EXCLUDED = the row you tried to insert
-- MySQL says the same thing as ON DUPLICATE KEY UPDATE:
-- INSERT INTO inventory (sku, on_hand) VA
...Result:
| sku | on_hand | action |
|---|---|---|
| A-100 | 50 | updated (existed) |
| A-200 | 30 | inserted (new) |
| A-300 | 75 | inserted (new) |
Your Turn: complete the UPSERT
Fill in the blanks so the load updates the price when a sku already exists, and inserts it otherwise.
🎯 Your Turn: ON CONFLICT upsert
Replace the ___ blanks to make the load idempotent.
-- 🎯 YOUR TURN — fill in the two blanks.
-- Goal: load price updates. If a sku already exists, UPDATE its price;
-- otherwise INSERT it. (sku is the unique/primary key.)
INSERT INTO prices (sku, price) VALUES
('B-1', 9.99),
('B-2', 4.50)
ON ___ (sku) DO UPDATE -- 👉 the keyword pair for "if it clashes"
SET price = ___.price; -- 👉 the alias for the row being inserted
-- ✅ Expected: B-1 and B-2 inserted if new, or their price
-- overwritten if
...7. Chunked Deletes & Updates (and the ETL Pattern)
Deleting or updating millions of rows in a single statement takes one long lock and writes a giant chunk of the write-ahead log — other queries stall and disk usage spikes. Instead, work in chunks: delete a few thousand rows, let the locks release, repeat until nothing is left. The same idea drives ETL/ELT batch jobs (Extract → Transform → Load): pull raw data into a no-constraints staging table, clean and validate it there, then load the good rows into production in sized batches.
Chunked delete
Delete in batches to avoid one long lock.
-- Deleting/updating MILLIONS of rows in ONE statement locks the
-- table for a long time and bloats the write-ahead log.
-- ❌ DELETE FROM events WHERE created_at < '2023-01-01'; -- 50M rows, one lock
-- ✅ Delete in chunks so each statement is short and releases locks.
DELETE FROM events
WHERE id IN (
SELECT id FROM events
WHERE created_at < '2023-01-01'
LIMIT 10000 -- 10k rows at a time
);
-- Run this in a loop until 0 rows are affected.
-- Same idea for
...Common Errors (and the fix)
- Row-by-row inserts in a loop: the classic "my import takes forever" mistake. Batch into multi-row
INSERTs, or better, useCOPY/LOAD DATA. The fix is almost always "do fewer, bigger statements". - One giant transaction: wrapping all 10M rows in a single
BEGIN ... COMMITholds locks for the entire load and bloats the WAL — sometimes it runs out of disk. Commit every 5k–50k rows. - Loading with indexes on: leaving every index in place forces per-row index maintenance and can make a load 5–10x slower. Drop indexes, load, then
CREATE INDEX+ANALYZE. - Deleting millions in one statement:
DELETE FROM events WHERE ...over 50M rows locks the table and bloats it with dead rows. Delete in chunks withLIMITin a loop instead. - "ERROR: duplicate key value violates unique constraint": your load hit a row that already exists. Make it an
UPSERT(ON CONFLICT DO UPDATE/ON DUPLICATE KEY UPDATE) so re-runs are safe. - "could not open file ... Permission denied": server-side
COPYneeds the database server to read the path. Use client-side\copyin psql for files on your own machine.
📘 Quick Reference
| Technique | Syntax / Idea | When to use |
|---|---|---|
| COPY (Postgres) | COPY t FROM 'f.csv' WITH (FORMAT csv) | Fastest server-side file load |
| LOAD DATA (MySQL) | LOAD DATA INFILE 'f.csv' INTO TABLE t | Fastest MySQL file load |
| Multi-row INSERT | INSERT ... VALUES (..),(..),(..) | App data, no file |
| Commit size | COMMIT every 5k–50k rows | Avoid long locks / WAL bloat |
| Drop & rebuild index | DROP INDEX … load … CREATE INDEX | One-off big loads |
| Upsert | ON CONFLICT … DO UPDATE | Idempotent, re-runnable loads |
| Chunked delete | DELETE … WHERE id IN (… LIMIT n) | Purge millions safely |
Frequently Asked Questions
Q: How big should each batch or transaction be?
There's no universal number, but 1,000–10,000 rows per multi-row INSERT and a COMMIT every 5,000–50,000 rows works well for most systems. Measure with your real data — too small wastes round trips, too large holds locks and grows the WAL.
Q: COPY or multi-row INSERT — which should I reach for?
If the data is (or can become) a file the server can read, COPY/LOAD DATA wins by a wide margin. If the data is generated in your application, batched multi-row INSERTs are the practical choice.
Q: Is it always worth dropping indexes before a load?
For large one-off loads into a table that's mostly idle, yes — rebuilding once is cheaper. For small incremental loads into a live, heavily-queried table, no: dropping indexes would slow every other query in the meantime.
Q: What does "idempotent" mean here?
Running the same load twice produces the same final state — no duplicates, no errors. UPSERT gives you that: a re-run updates existing rows instead of failing on a duplicate key.
Mini-Challenge: A Safe, Re-runnable Batch Load
Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a playground to confirm.
🎯 Mini-Challenge
A transaction + multi-row INSERT + ON CONFLICT upsert.
-- 🎯 MINI-CHALLENGE
-- Using ONLY what this lesson covered (transactions, batched multi-row
-- INSERT, and ON CONFLICT upsert):
-- 1. Open a transaction with BEGIN
-- 2. Insert THREE rows into stock (sku, qty) in ONE multi-row INSERT
-- 3. Make it an UPSERT on sku: if the sku exists, set qty to the new value
-- 4. COMMIT
--
-- ✅ Expected: three rows inserted-or-updated, committed as one unit,
-- with no duplicate-key errors if you run it twice.
-- your query here🎉 Lesson Complete
- ✅
COPY/LOAD DATA INFILEbulk-load files orders of magnitude faster than row-by-rowINSERTs - ✅ Multi-row
INSERTs and right-sized transactions cut round trips and disk flushes - ✅ Dropping and rebuilding indexes around a big load avoids per-row index maintenance
- ✅
UPSERT(ON CONFLICT/ON DUPLICATE KEY) makes loads idempotent - ✅ Chunked deletes/updates and a staging-table ETL flow keep big jobs from locking the database
- ✅ Next: when to push logic into the database with triggers vs application logic
Sign up for free to track which lessons you've completed and get learning reminders.