Skip to main content
    Courses/SQL/Massive Data Handling

    Advanced Track

    Massive Data Handling: Bulk Loads, Batching & ETL

    By the end of this lesson you'll be able to load and reshape millions of rows without locking up your database — using COPY/LOAD DATA instead of row-by-row inserts, batching and sizing transactions sensibly, dropping and rebuilding indexes around big loads, running idempotent upserts, and deleting in safe chunks. These are the techniques that turn a 45-minute load into a 15-second one.

    What You'll Learn

    • Bulk-load with COPY / LOAD DATA INFILE (100x faster than INSERTs)
    • Batch many rows into one multi-row INSERT
    • Size transactions: commit every few thousand rows
    • Drop and rebuild indexes around a big load
    • Run idempotent UPSERTs at scale (ON CONFLICT / ON DUPLICATE KEY)
    • Delete and update millions of rows in safe chunks

    The Scenario: 10 Million Orders

    Throughout this lesson, picture a nightly job that loads a fresh export of 10 million order rows into an orders table, then deletes last year's events. Every technique below is judged by one question: how do we do that in seconds instead of an hour, without freezing the live database?

    🚚 Real-world analogy

    Moving house, you don't carry items one at a time across town — you pack boxes and load a truck. Row-by-row INSERT is carrying one item per trip; COPY is the truck. Same boxes, a tiny fraction of the trips.

    1. Why Row-by-Row INSERT Is So Slow

    Each separate INSERT statement pays a fixed tax: a network round trip to the server, parsing and planning the query, and a transaction commit that forces a write to the write-ahead log (the on-disk journal the database uses to stay crash-safe). For one row that tax is invisible. For ten million rows you pay it ten million times — and that is where your minutes go.

    The slow way — one INSERT per row

    See why ten million separate statements is the problem.

    Try it Yourself »
    SQL
    -- ❌ The slow way: one INSERT statement per row
    -- Each statement is its own round trip + parse + commit.
    INSERT INTO orders (customer_id, total) VALUES (101, 99.99);
    INSERT INTO orders (customer_id, total) VALUES (102, 149.50);
    INSERT INTO orders (customer_id, total) VALUES (103, 12.00);
    -- ... imagine 9,999,997 more of these.
    
    -- Loading 10 million rows like this can take 30–60 MINUTES,
    -- because the database pays the overhead 10 million times.

    2. COPY & LOAD DATA INFILE — the Fast Path

    PostgreSQL's COPY and MySQL's LOAD DATA INFILE are bulk loaders: they stream an entire file into a table in a single operation, skipping the per-row overhead entirely. This is almost always the fastest way to get data in — typically orders of magnitude faster than individual inserts.

    The fast way — COPY / LOAD DATA

    Bulk-load a CSV in one streaming operation.

    Try it Yourself »
    SQL
    -- ✅ The fast way: bulk-load straight from a file with COPY
    -- COPY streams rows in one operation — no per-row overhead.
    
    -- PostgreSQL: read a CSV from disk on the server
    COPY orders (customer_id, total, order_date)
    FROM '/data/orders.csv'
    WITH (FORMAT csv, HEADER true, DELIMITER ',');
    
    -- MySQL does the same job with LOAD DATA INFILE:
    -- LOAD DATA INFILE '/data/orders.csv'
    -- INTO TABLE orders
    -- FIELDS TERMINATED BY ','
    -- LINES TERMINATED BY '\n'
    -- IGNORE 1 ROWS;            -- skip the head
    ...

    Result — what the server reports:

    commandrows loadedtime
    COPY10000000~15s
    10M single INSERTs10000000~45m

    Your Turn: complete the COPY

    Fill in the blanks to bulk-load customers.csv (which includes a header row). The expected result is in the comments so you can check yourself.

    🎯 Your Turn: bulk-load with COPY

    Replace the ___ blanks to finish the COPY statement.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — fill in the two blanks, then press "Try it Yourself"
    -- Goal: bulk-load the customers.csv file (which HAS a header row)
    --       into the customers table using PostgreSQL COPY.
    
    ___ customers (name, email, country)   -- 👉 the bulk-load keyword
    FROM '/data/customers.csv'
    WITH (FORMAT csv, HEADER ___);          -- 👉 the file has a header line
    
    -- ✅ Expected: all rows from customers.csv loaded in one pass,
    --    e.g. "COPY 500000"  (Postgres reports the row count loaded).

    3. Multi-Row INSERT Batching

    Sometimes the data lives in your application, not a file. You still don't want one statement per row — instead, list many rows in a single INSERT ... VALUES. One statement, one round trip, thousands of rows. The sweet spot is usually 1,000–10,000 rows per batch: big enough to amortise the overhead, small enough to keep memory and transaction size sane.

    Multi-row INSERT

    Pack many rows into one statement.

    Try it Yourself »
    SQL
    -- When the data comes from your APP (not a file), batch it.
    -- ❌ One INSERT per row = one round trip per row.
    -- ✅ One INSERT with many VALUES = one round trip per batch.
    
    INSERT INTO orders (customer_id, total) VALUES
        (101, 99.99),
        (102, 149.50),
        (103, 12.00),
        (104, 45.20),
        (105, 7.99);
    -- Five rows, ONE round trip. Repeat in batches of
    -- 1,000–10,000 rows — that is the sweet spot for most drivers.

    Result — fewer round trips = faster:

    approachrowsround trips
    one INSERT per row55
    one multi-row INSERT51

    4. Transactions & Commit Size

    A transaction groups statements into one all-or-nothing unit. Committing once per transaction instead of once per row removes a huge amount of disk-flushing overhead. But don't swing too far the other way: wrapping all ten million rows in a single transaction holds locks for the whole load and balloons the write-ahead log. Commit in chunks — every 5,000–50,000 rows is a good range.

    Think of COMMIT as saving a document. Saving after every keystroke is slow; never saving risks losing everything. Save in sensible chunks.

    Sizing transactions

    Commit once per batch, not once per row.

    Try it Yourself »
    SQL
    -- Wrap a batch in a transaction so it commits ONCE, not per row.
    -- A "transaction" is an all-or-nothing unit of work.
    BEGIN;                                    -- start the unit of work
      INSERT INTO orders (customer_id, total) VALUES (201, 10.00);
      INSERT INTO orders (customer_id, total) VALUES (202, 20.00);
      -- ... thousands more inside this same transaction ...
    COMMIT;                                   -- flush everything to disk ONCE
    
    -- One giant transaction is just as bad, though: it h
    ...

    5. Drop & Rebuild Indexes Around Big Loads

    Indexes make reads fast, but they slow writes: every inserted row must also update every index on the table. During a one-off bulk load that maintenance is pure waste, because you can rebuild the whole index once at the end far more cheaply. The pattern is: drop the indexes, load, recreate the indexes, then ANALYZE so the query planner has fresh statistics.

    Drop, load, rebuild

    Build the index once instead of per-row.

    Try it Yourself »
    SQL
    -- Around a HUGE load, drop the indexes first, then rebuild.
    -- Updating an index for every one of 10M rows is the slow part.
    
    DROP INDEX idx_orders_customer;            -- 1) remove the index
    
    COPY orders FROM '/data/orders.csv'        -- 2) bulk-load with no index upkeep
        WITH (FORMAT csv, HEADER true);
    
    CREATE INDEX idx_orders_customer           -- 3) rebuild ONCE at the end
        ON orders (customer_id);
    
    ANALYZE orders;                            -- 4) refresh stats for the planner
    
    -- Bui
    ...

    6. UPSERT at Scale

    An upsert means "insert this row, but if a row with the same key already exists, update it instead". It makes a load idempotent — safe to run twice without creating duplicates or crashing on a duplicate-key error. PostgreSQL spells it ON CONFLICT ... DO UPDATE; MySQL spells it ON DUPLICATE KEY UPDATE. In Postgres, the special table EXCLUDED refers to the row you tried to insert.

    Batched UPSERT

    Insert-or-update without duplicates.

    Try it Yourself »
    SQL
    -- UPSERT = "insert, but UPDATE the row if it already exists".
    -- Perfect for re-running a load without creating duplicates.
    
    -- PostgreSQL: ON CONFLICT (the conflicting column) DO UPDATE
    INSERT INTO inventory (sku, on_hand) VALUES
        ('A-100', 50),
        ('A-200', 30),
        ('A-300', 75)
    ON CONFLICT (sku) DO UPDATE
        SET on_hand = EXCLUDED.on_hand;       -- EXCLUDED = the row you tried to insert
    
    -- MySQL says the same thing as ON DUPLICATE KEY UPDATE:
    -- INSERT INTO inventory (sku, on_hand) VA
    ...

    Result:

    skuon_handaction
    A-10050updated (existed)
    A-20030inserted (new)
    A-30075inserted (new)

    Your Turn: complete the UPSERT

    Fill in the blanks so the load updates the price when a sku already exists, and inserts it otherwise.

    🎯 Your Turn: ON CONFLICT upsert

    Replace the ___ blanks to make the load idempotent.

    Try it Yourself »
    SQL
    -- 🎯 YOUR TURN — fill in the two blanks.
    -- Goal: load price updates. If a sku already exists, UPDATE its price;
    --       otherwise INSERT it. (sku is the unique/primary key.)
    
    INSERT INTO prices (sku, price) VALUES
        ('B-1', 9.99),
        ('B-2', 4.50)
    ON ___ (sku) DO UPDATE                  -- 👉 the keyword pair for "if it clashes"
        SET price = ___.price;              -- 👉 the alias for the row being inserted
    
    -- ✅ Expected: B-1 and B-2 inserted if new, or their price
    --    overwritten if
    ...

    7. Chunked Deletes & Updates (and the ETL Pattern)

    Deleting or updating millions of rows in a single statement takes one long lock and writes a giant chunk of the write-ahead log — other queries stall and disk usage spikes. Instead, work in chunks: delete a few thousand rows, let the locks release, repeat until nothing is left. The same idea drives ETL/ELT batch jobs (Extract → Transform → Load): pull raw data into a no-constraints staging table, clean and validate it there, then load the good rows into production in sized batches.

    Chunked delete

    Delete in batches to avoid one long lock.

    Try it Yourself »
    SQL
    -- Deleting/updating MILLIONS of rows in ONE statement locks the
    -- table for a long time and bloats the write-ahead log.
    -- ❌ DELETE FROM events WHERE created_at < '2023-01-01';  -- 50M rows, one lock
    
    -- ✅ Delete in chunks so each statement is short and releases locks.
    DELETE FROM events
    WHERE id IN (
        SELECT id FROM events
        WHERE created_at < '2023-01-01'
        LIMIT 10000                            -- 10k rows at a time
    );
    -- Run this in a loop until 0 rows are affected.
    -- Same idea for
    ...

    Common Errors (and the fix)

    • Row-by-row inserts in a loop: the classic "my import takes forever" mistake. Batch into multi-row INSERTs, or better, use COPY/LOAD DATA. The fix is almost always "do fewer, bigger statements".
    • One giant transaction: wrapping all 10M rows in a single BEGIN ... COMMIT holds locks for the entire load and bloats the WAL — sometimes it runs out of disk. Commit every 5k–50k rows.
    • Loading with indexes on: leaving every index in place forces per-row index maintenance and can make a load 5–10x slower. Drop indexes, load, then CREATE INDEX + ANALYZE.
    • Deleting millions in one statement: DELETE FROM events WHERE ... over 50M rows locks the table and bloats it with dead rows. Delete in chunks with LIMIT in a loop instead.
    • "ERROR: duplicate key value violates unique constraint": your load hit a row that already exists. Make it an UPSERT (ON CONFLICT DO UPDATE / ON DUPLICATE KEY UPDATE) so re-runs are safe.
    • "could not open file ... Permission denied": server-side COPY needs the database server to read the path. Use client-side \copy in psql for files on your own machine.

    📘 Quick Reference

    TechniqueSyntax / IdeaWhen to use
    COPY (Postgres)COPY t FROM 'f.csv' WITH (FORMAT csv)Fastest server-side file load
    LOAD DATA (MySQL)LOAD DATA INFILE 'f.csv' INTO TABLE tFastest MySQL file load
    Multi-row INSERTINSERT ... VALUES (..),(..),(..)App data, no file
    Commit sizeCOMMIT every 5k–50k rowsAvoid long locks / WAL bloat
    Drop & rebuild indexDROP INDEX … load … CREATE INDEXOne-off big loads
    UpsertON CONFLICT … DO UPDATEIdempotent, re-runnable loads
    Chunked deleteDELETE … WHERE id IN (… LIMIT n)Purge millions safely

    Frequently Asked Questions

    Q: How big should each batch or transaction be?

    There's no universal number, but 1,000–10,000 rows per multi-row INSERT and a COMMIT every 5,000–50,000 rows works well for most systems. Measure with your real data — too small wastes round trips, too large holds locks and grows the WAL.

    Q: COPY or multi-row INSERT — which should I reach for?

    If the data is (or can become) a file the server can read, COPY/LOAD DATA wins by a wide margin. If the data is generated in your application, batched multi-row INSERTs are the practical choice.

    Q: Is it always worth dropping indexes before a load?

    For large one-off loads into a table that's mostly idle, yes — rebuilding once is cheaper. For small incremental loads into a live, heavily-queried table, no: dropping indexes would slow every other query in the meantime.

    Q: What does "idempotent" mean here?

    Running the same load twice produces the same final state — no duplicates, no errors. UPSERT gives you that: a re-run updates existing rows instead of failing on a duplicate key.

    Mini-Challenge: A Safe, Re-runnable Batch Load

    Put it all together — a brief, a blank canvas, and the expected result in the comments. Write it, then copy it into a playground to confirm.

    🎯 Mini-Challenge

    A transaction + multi-row INSERT + ON CONFLICT upsert.

    Try it Yourself »
    SQL
    -- 🎯 MINI-CHALLENGE
    -- Using ONLY what this lesson covered (transactions, batched multi-row
    -- INSERT, and ON CONFLICT upsert):
    --   1. Open a transaction with BEGIN
    --   2. Insert THREE rows into stock (sku, qty) in ONE multi-row INSERT
    --   3. Make it an UPSERT on sku: if the sku exists, set qty to the new value
    --   4. COMMIT
    --
    -- ✅ Expected: three rows inserted-or-updated, committed as one unit,
    --    with no duplicate-key errors if you run it twice.
    
    -- your query here

    🎉 Lesson Complete

    • COPY / LOAD DATA INFILE bulk-load files orders of magnitude faster than row-by-row INSERTs
    • ✅ Multi-row INSERTs and right-sized transactions cut round trips and disk flushes
    • ✅ Dropping and rebuilding indexes around a big load avoids per-row index maintenance
    • UPSERT (ON CONFLICT / ON DUPLICATE KEY) makes loads idempotent
    • ✅ Chunked deletes/updates and a staging-table ETL flow keep big jobs from locking the database
    • Next: when to push logic into the database with triggers vs application logic

    Sign up for free to track which lessons you've completed and get learning reminders.

    Previous

    Cookie & Privacy Settings

    We use cookies to improve your experience, analyze traffic, and show personalized ads. You can manage your preferences below.

    By clicking "Accept All", you consent to our use of cookies for analytics and personalized advertising. You can customize your preferences or reject non-essential cookies.

    Privacy PolicyTerms of Service