Courses/SQL/Massive Data Handling

Advanced Track

Massive Data Handling: Bulk Loads, Batching & ETL

By the end of this lesson you'll be able to load and reshape millions of rows without locking up your database — using COPY/LOAD DATA instead of row-by-row inserts, batching and sizing transactions sensibly, dropping and rebuilding indexes around big loads, running idempotent upserts, and deleting in safe chunks. These are the techniques that turn a 45-minute load into a 15-second one.

What You'll Learn

✓Bulk-load with COPY / LOAD DATA INFILE (100x faster than INSERTs)
✓Batch many rows into one multi-row INSERT
✓Size transactions: commit every few thousand rows
✓Drop and rebuild indexes around a big load
✓Run idempotent UPSERTs at scale (ON CONFLICT / ON DUPLICATE KEY)
✓Delete and update millions of rows in safe chunks

Before you start: finish Buffer Pool & Memory so you understand how the database moves data between disk and RAM — that's what bulk loading is really optimising. The in-browser editor lets you write and edit SQL; to run the application-level examples, copy them into a free playground like db-fiddle.com. The file-based commands (COPY from disk, LOAD DATA INFILE) need a real server with file access, so study those as worked examples.

The Scenario: 10 Million Orders

Throughout this lesson, picture a nightly job that loads a fresh export of 10 million order rows into an orders table, then deletes last year's events. Every technique below is judged by one question: how do we do that in seconds instead of an hour, without freezing the live database?

🚚 Real-world analogy

Moving house, you don't carry items one at a time across town — you pack boxes and load a truck. Row-by-row INSERT is carrying one item per trip; COPY is the truck. Same boxes, a tiny fraction of the trips.

1. Why Row-by-Row INSERT Is So Slow

Each separate INSERT statement pays a fixed tax: a network round trip to the server, parsing and planning the query, and a transaction commit that forces a write to the write-ahead log (the on-disk journal the database uses to stay crash-safe). For one row that tax is invisible. For ten million rows you pay it ten million times — and that is where your minutes go.

The slow way — one INSERT per row

See why ten million separate statements is the problem.

Try it Yourself »

SQL

-- ❌ The slow way: one INSERT statement per row
-- Each statement is its own round trip + parse + commit.
INSERT INTO orders (customer_id, total) VALUES (101, 99.99);
INSERT INTO orders (customer_id, total) VALUES (102, 149.50);
INSERT INTO orders (customer_id, total) VALUES (103, 12.00);
-- ... imagine 9,999,997 more of these.

-- Loading 10 million rows like this can take 30–60 MINUTES,
-- because the database pays the overhead 10 million times.

2. COPY & LOAD DATA INFILE — the Fast Path

PostgreSQL's COPY and MySQL's LOAD DATA INFILE are bulk loaders: they stream an entire file into a table in a single operation, skipping the per-row overhead entirely. This is almost always the fastest way to get data in — typically orders of magnitude faster than individual inserts.

Pro tip: COPY reads files from the server's disk. If your file is on your own machine, use the client-side variant \copy in psql, which streams the file to the server for you.

The fast way — COPY / LOAD DATA

Bulk-load a CSV in one streaming operation.

Try it Yourself »

SQL

-- ✅ The fast way: bulk-load straight from a file with COPY
-- COPY streams rows in one operation — no per-row overhead.

-- PostgreSQL: read a CSV from disk on the server
COPY orders (customer_id, total, order_date)
FROM '/data/orders.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');

-- MySQL does the same job with LOAD DATA INFILE:
-- LOAD DATA INFILE '/data/orders.csv'
-- INTO TABLE orders
-- FIELDS TERMINATED BY ','
-- LINES TERMINATED BY '\n'
-- IGNORE 1 ROWS;            -- skip the head
...

Result — what the server reports:

command	rows loaded	time
COPY	10000000	~15s
10M single INSERTs	10000000	~45m

Your Turn: complete the COPY

Fill in the blanks to bulk-load customers.csv (which includes a header row). The expected result is in the comments so you can check yourself.

🎯 Your Turn: bulk-load with COPY

Replace the ___ blanks to finish the COPY statement.

Try it Yourself »

SQL

-- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself"
-- Goal: bulk-load the customers.csv file (which HAS a header row)
--       into the customers table using PostgreSQL COPY.

___ customers (name, email, country)   -- 👉 the bulk-load keyword
FROM '/data/customers.csv'
WITH (FORMAT csv, HEADER ___);          -- 👉 the file has a header line

-- ✅ Expected: all rows from customers.csv loaded in one pass,
--    e.g. "COPY 500000"  (Postgres reports the row count loaded).

3. Multi-Row INSERT Batching

Sometimes the data lives in your application, not a file. You still don't want one statement per row — instead, list many rows in a single INSERT ... VALUES. One statement, one round trip, thousands of rows. The sweet spot is usually 1,000–10,000 rows per batch: big enough to amortise the overhead, small enough to keep memory and transaction size sane.

Multi-row INSERT

Pack many rows into one statement.

Try it Yourself »

SQL

-- When the data comes from your APP (not a file), batch it.
-- ❌ One INSERT per row = one round trip per row.
-- ✅ One INSERT with many VALUES = one round trip per batch.

INSERT INTO orders (customer_id, total) VALUES
    (101, 99.99),
    (102, 149.50),
    (103, 12.00),
    (104, 45.20),
    (105, 7.99);
-- Five rows, ONE round trip. Repeat in batches of
-- 1,000–10,000 rows — that is the sweet spot for most drivers.

Result — fewer round trips = faster:

approach	rows	round trips
one INSERT per row	5	5
one multi-row INSERT	5	1

4. Transactions & Commit Size

A transaction groups statements into one all-or-nothing unit. Committing once per transaction instead of once per row removes a huge amount of disk-flushing overhead. But don't swing too far the other way: wrapping all ten million rows in a single transaction holds locks for the whole load and balloons the write-ahead log. Commit in chunks — every 5,000–50,000 rows is a good range.

Think of COMMIT as saving a document. Saving after every keystroke is slow; never saving risks losing everything. Save in sensible chunks.

Sizing transactions

Commit once per batch, not once per row.

Try it Yourself »

SQL

-- Wrap a batch in a transaction so it commits ONCE, not per row.
-- A "transaction" is an all-or-nothing unit of work.
BEGIN;                                    -- start the unit of work
  INSERT INTO orders (customer_id, total) VALUES (201, 10.00);
  INSERT INTO orders (customer_id, total) VALUES (202, 20.00);
  -- ... thousands more inside this same transaction ...
COMMIT;                                   -- flush everything to disk ONCE

-- One giant transaction is just as bad, though: it h
...

5. Drop & Rebuild Indexes Around Big Loads

Indexes make reads fast, but they slow writes: every inserted row must also update every index on the table. During a one-off bulk load that maintenance is pure waste, because you can rebuild the whole index once at the end far more cheaply. The pattern is: drop the indexes, load, recreate the indexes, then ANALYZE so the query planner has fresh statistics.

Drop, load, rebuild

Build the index once instead of per-row.

Try it Yourself »

SQL

-- Around a HUGE load, drop the indexes first, then rebuild.
-- Updating an index for every one of 10M rows is the slow part.

DROP INDEX idx_orders_customer;            -- 1) remove the index

COPY orders FROM '/data/orders.csv'        -- 2) bulk-load with no index upkeep
    WITH (FORMAT csv, HEADER true);

CREATE INDEX idx_orders_customer           -- 3) rebuild ONCE at the end
    ON orders (customer_id);

ANALYZE orders;                            -- 4) refresh stats for the planner

-- Bui
...

6. UPSERT at Scale

An upsert means "insert this row, but if a row with the same key already exists, update it instead". It makes a load idempotent — safe to run twice without creating duplicates or crashing on a duplicate-key error. PostgreSQL spells it ON CONFLICT ... DO UPDATE; MySQL spells it ON DUPLICATE KEY UPDATE. In Postgres, the special table EXCLUDED refers to the row you tried to insert.

Batched UPSERT

Insert-or-update without duplicates.

Try it Yourself »

SQL

-- UPSERT = "insert, but UPDATE the row if it already exists".
-- Perfect for re-running a load without creating duplicates.

-- PostgreSQL: ON CONFLICT (the conflicting column) DO UPDATE
INSERT INTO inventory (sku, on_hand) VALUES
    ('A-100', 50),
    ('A-200', 30),
    ('A-300', 75)
ON CONFLICT (sku) DO UPDATE
    SET on_hand = EXCLUDED.on_hand;       -- EXCLUDED = the row you tried to insert

-- MySQL says the same thing as ON DUPLICATE KEY UPDATE:
-- INSERT INTO inventory (sku, on_hand) VA
...

Result:

sku	on_hand	action
A-100	50	updated (existed)
A-200	30	inserted (new)
A-300	75	inserted (new)

Your Turn: complete the UPSERT

Fill in the blanks so the load updates the price when a sku already exists, and inserts it otherwise.

🎯 Your Turn: ON CONFLICT upsert

Replace the ___ blanks to make the load idempotent.

Try it Yourself »

SQL

-- 🎯 YOUR TURN — fill in the two blanks.
-- Goal: load price updates. If a sku already exists, UPDATE its price;
--       otherwise INSERT it. (sku is the unique/primary key.)

INSERT INTO prices (sku, price) VALUES
    ('B-1', 9.99),
    ('B-2', 4.50)
ON ___ (sku) DO UPDATE                  -- 👉 the keyword pair for "if it clashes"
    SET price = ___.price;              -- 👉 the alias for the row being inserted

-- ✅ Expected: B-1 and B-2 inserted if new, or their price
--    overwritten if
...

7. Chunked Deletes & Updates (and the ETL Pattern)

Deleting or updating millions of rows in a single statement takes one long lock and writes a giant chunk of the write-ahead log — other queries stall and disk usage spikes. Instead, work in chunks: delete a few thousand rows, let the locks release, repeat until nothing is left. The same idea drives ETL/ELT batch jobs (Extract → Transform → Load): pull raw data into a no-constraints staging table, clean and validate it there, then load the good rows into production in sized batches.

Staging first: never bulk-load raw, untrusted data straight into a production table. A staging table lets you validate, deduplicate, and reject bad rows before they touch your real data.

Chunked delete

Delete in batches to avoid one long lock.

Try it Yourself »

SQL

-- Deleting/updating MILLIONS of rows in ONE statement locks the
-- table for a long time and bloats the write-ahead log.
-- ❌ DELETE FROM events WHERE created_at < '2023-01-01';  -- 50M rows, one lock

-- ✅ Delete in chunks so each statement is short and releases locks.
DELETE FROM events
WHERE id IN (
    SELECT id FROM events
    WHERE created_at < '2023-01-01'
    LIMIT 10000                            -- 10k rows at a time
);
-- Run this in a loop until 0 rows are affected.
-- Same idea for
...

Common Errors (and the fix)

Row-by-row inserts in a loop: the classic "my import takes forever" mistake. Batch into multi-row INSERTs, or better, use COPY/LOAD DATA. The fix is almost always "do fewer, bigger statements".
One giant transaction: wrapping all 10M rows in a single BEGIN ... COMMIT holds locks for the entire load and bloats the WAL — sometimes it runs out of disk. Commit every 5k–50k rows.
Loading with indexes on: leaving every index in place forces per-row index maintenance and can make a load 5–10x slower. Drop indexes, load, then CREATE INDEX + ANALYZE.
Deleting millions in one statement: DELETE FROM events WHERE ... over 50M rows locks the table and bloats it with dead rows. Delete in chunks with LIMIT in a loop instead.
"ERROR: duplicate key value violates unique constraint": your load hit a row that already exists. Make it an UPSERT (ON CONFLICT DO UPDATE / ON DUPLICATE KEY UPDATE) so re-runs are safe.
"could not open file ... Permission denied": server-side COPY needs the database server to read the path. Use client-side \copy in psql for files on your own machine.

📘 Quick Reference

Technique	Syntax / Idea	When to use
COPY (Postgres)	COPY t FROM 'f.csv' WITH (FORMAT csv)	Fastest server-side file load
LOAD DATA (MySQL)	LOAD DATA INFILE 'f.csv' INTO TABLE t	Fastest MySQL file load
Multi-row INSERT	INSERT ... VALUES (..),(..),(..)	App data, no file
Commit size	COMMIT every 5k–50k rows	Avoid long locks / WAL bloat
Drop & rebuild index	DROP INDEX … load … CREATE INDEX	One-off big loads
Upsert	ON CONFLICT … DO UPDATE	Idempotent, re-runnable loads
Chunked delete	DELETE … WHERE id IN (… LIMIT n)	Purge millions safely

Frequently Asked Questions

Q: How big should each batch or transaction be?

There's no universal number, but 1,000–10,000 rows per multi-row INSERT and a COMMIT every 5,000–50,000 rows works well for most systems. Measure with your real data — too small wastes round trips, too large holds locks and grows the WAL.

Q: COPY or multi-row INSERT — which should I reach for?

If the data is (or can become) a file the server can read, COPY/LOAD DATA wins by a wide margin. If the data is generated in your application, batched multi-row INSERTs are the practical choice.

Q: Is it always worth dropping indexes before a load?

For large one-off loads into a table that's mostly idle, yes — rebuilding once is cheaper. For small incremental loads into a live, heavily-queried table, no: dropping indexes would slow every other query in the meantime.

Q: What does "idempotent" mean here?

Running the same load twice produces the same final state — no duplicates, no errors. UPSERT gives you that: a re-run updates existing rows instead of failing on a duplicate key.

Mini-Challenge: A Safe, Re-runnable Batch Load

Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a playground to confirm.

🎯 Mini-Challenge

A transaction + multi-row INSERT + ON CONFLICT upsert.

Try it Yourself »

SQL

-- 🎯 MINI-CHALLENGE
-- Using ONLY what this lesson covered (transactions, batched multi-row
-- INSERT, and ON CONFLICT upsert):
--   1. Open a transaction with BEGIN
--   2. Insert THREE rows into stock (sku, qty) in ONE multi-row INSERT
--   3. Make it an UPSERT on sku: if the sku exists, set qty to the new value
--   4. COMMIT
--
-- ✅ Expected: three rows inserted-or-updated, committed as one unit,
--    with no duplicate-key errors if you run it twice.

-- your query here

🎉 Lesson Complete

✅ COPY / LOAD DATA INFILE bulk-load files orders of magnitude faster than row-by-row INSERTs
✅ Multi-row INSERTs and right-sized transactions cut round trips and disk flushes
✅ Dropping and rebuilding indexes around a big load avoids per-row index maintenance
✅ UPSERT (ON CONFLICT / ON DUPLICATE KEY) makes loads idempotent
✅ Chunked deletes/updates and a staging-table ETL flow keep big jobs from locking the database
✅ Next: when to push logic into the database with triggers vs application logic

Massive Data Handling: Bulk Loads, Batching & ETL

What You'll Learn

The Scenario: 10 Million Orders

1. Why Row-by-Row INSERT Is So Slow

The slow way — one INSERT per row

2. COPY & LOAD DATA INFILE — the Fast Path

The fast way — COPY / LOAD DATA

Your Turn: complete the COPY

🎯 Your Turn: bulk-load with COPY

3. Multi-Row INSERT Batching

Multi-row INSERT

4. Transactions & Commit Size

Sizing transactions

5. Drop & Rebuild Indexes Around Big Loads

Drop, load, rebuild

6. UPSERT at Scale

Batched UPSERT

Your Turn: complete the UPSERT

🎯 Your Turn: ON CONFLICT upsert

7. Chunked Deletes & Updates (and the ETL Pattern)

Chunked delete

Common Errors (and the fix)

📘 Quick Reference

Frequently Asked Questions

Mini-Challenge: A Safe, Re-runnable Batch Load

🎯 Mini-Challenge

🎉 Lesson Complete

Cookie & Privacy Settings