
    Lesson 41 • Advanced

    Massive Data Handling: Bulk Inserts, Batching & ETL

    Load millions of rows efficiently using COPY, batch inserts, staging tables, and ETL patterns.

    ✅ What You'll Learn

    • Bulk loading with PostgreSQL COPY and MySQL LOAD DATA
    • Batch INSERT strategies for application-level loading
    • ETL pipelines: staging, transformation, and error handling
    • Loading data into partitioned tables efficiently

    🏗️ Why Bulk Loading Matters

    🎯 Real-World Analogy: Imagine moving house. You wouldn't carry items one at a time — you'd pack boxes and load them all into a truck. Bulk loading works the same way: instead of inserting one row at a time (carrying one item), you pack thousands of rows and load them in a single operation.

    Single-row INSERTs create enormous overhead: each statement needs its own network round trip, query parse, transaction commit, and write-ahead log (WAL) flush. Loading 10 million rows one at a time can take hours; the same data loaded via COPY completes in seconds.

    ⚡ Speed Comparison (10M rows)

    • Single INSERT: ~45 minutes
    • Batch INSERT (1000/batch): ~3 minutes
    • COPY command: ~15 seconds
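
    When COPY isn't an option (for example, rows are generated inside the application), simply wrapping the inserts in one transaction already removes the per-statement commit and WAL-flush cost. A minimal sketch, reusing the logs table from the batching example below:

    ```sql
    -- One transaction for the whole batch: one commit, one WAL flush
    BEGIN;
    INSERT INTO logs (level, message) VALUES ('INFO', 'Started');
    INSERT INTO logs (level, message) VALUES ('WARN', 'Slow query');
    -- ... thousands more single-row inserts ...
    COMMIT;
    ```

    Each statement still costs a round trip, so multi-row VALUES batches remain faster; the transaction wrapper is just the cheapest first step.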

    Try It: COPY for Bulk Loading

    Use PostgreSQL COPY to load and export CSV data

    SQL
    -- PostgreSQL: COPY command for bulk loading
    -- Fastest way to load millions of rows
    
    -- Load from CSV file
    COPY products (name, price, category_id, stock)
    FROM '/data/products.csv'
    WITH (FORMAT csv, HEADER true, DELIMITER ',');
    
    -- Load from stdin (pipe data)
    COPY orders FROM STDIN WITH (FORMAT csv);
    1,101,2024-01-15,99.99,shipped
    2,102,2024-01-16,149.50,pending
    \.
    
    -- Export data to CSV
    COPY (
        SELECT o.order_id, c.email, o.total
        FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    ) TO '/data/orders_export.csv'   -- output path is illustrative
    WITH (FORMAT csv, HEADER true);
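
    One caveat worth knowing: COPY ... FROM '/path' reads files on the database server's filesystem and requires superuser or pg_read_server_files rights. For a file on the client machine, psql's \copy meta-command streams it over the connection instead (file name is illustrative):

    ```sql
    -- Client-side variant: runs in psql, file lives on your machine
    \copy products (name, price, category_id, stock) FROM 'products.csv' WITH (FORMAT csv, HEADER true)
    ```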

    Try It: Batch INSERT Strategies

    Compare single-row vs multi-row vs UNNEST inserts

    SQL
    -- Batch INSERT for application-level bulk loading
    -- Much faster than single-row inserts
    
    -- ❌ SLOW: One insert per row (1 million round trips!)
    INSERT INTO logs (level, message) VALUES ('INFO', 'Started');
    INSERT INTO logs (level, message) VALUES ('WARN', 'Slow query');
    -- ... 999,998 more statements
    
    -- ✅ FAST: Multi-row VALUES (one round trip per batch)
    INSERT INTO logs (level, message) VALUES
        ('INFO', 'Started'),
        ('WARN', 'Slow query'),
        ('ERROR', 'Connection timeout'),
        ('INFO', 'Request completed');
    -- Batches of 500-1000 rows per statement are a good starting point
    
    -- ✅ ALSO FAST: UNNEST arrays into rows (useful when a driver
    -- binds whole arrays as single parameters)
    INSERT INTO logs (level, message)
    SELECT * FROM UNNEST(
        ARRAY['INFO', 'WARN'],
        ARRAY['Started', 'Slow query']
    ) AS t(level, message);

    🔄 The ETL Pipeline

    ETL stands for Extract, Transform, Load — the standard pattern for moving data between systems.

    📥 Extract

    Pull raw data from CSV files, APIs, or other databases into a staging area.

    🔧 Transform

    Clean, validate, normalise, and enrich the data. Reject bad rows.

    📤 Load

    Insert clean data into production tables with proper constraints.

    ⚠️ Common Mistake: Loading raw data directly into production tables. Always use a staging table first — it lets you validate data before it touches your real tables.

    Try It: ETL Pipeline

    Build a complete Extract → Transform → Load pipeline with error handling

    SQL
    -- ETL Pattern: Extract → Transform → Load
    -- Stage raw data, clean it, then load into production tables
    
    -- Step 1: Create staging table (no constraints)
    CREATE TEMPORARY TABLE staging_orders (
        raw_date TEXT,
        customer_email TEXT,
        product_name TEXT,
        quantity TEXT,
        price TEXT
    );
    
    -- Step 2: Bulk load raw data
    COPY staging_orders FROM '/data/raw_orders.csv'
    WITH (FORMAT csv, HEADER true);
    
    -- Step 3: Transform and load into production
    INSERT INTO orders (customer_id, order_date, quantity, price)
    SELECT c.customer_id,
           s.raw_date::DATE,
           s.quantity::INT,
           s.price::NUMERIC(10,2)
    FROM staging_orders s
    JOIN customers c ON c.email = s.customer_email
    WHERE s.quantity ~ '^\d+$'           -- reject non-numeric quantities
      AND s.price ~ '^\d+(\.\d+)?$';     -- reject malformed prices
    -- (target columns are illustrative; match your orders schema)
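
    The error-handling half of the pipeline can be sketched as a validation step that copies unconvertible rows into a rejects table before the clean load (the rejects table and rules here are illustrative):

    ```sql
    -- Step 4 (sketch): keep bad rows for inspection instead of losing them
    CREATE TABLE staging_rejects AS
    SELECT s.*, NOW() AS rejected_at
    FROM staging_orders s
    WHERE s.quantity !~ '^\d+$'          -- non-numeric quantity
       OR s.price !~ '^\d+(\.\d+)?$';    -- malformed price

    -- Sanity check before trusting the load
    SELECT COUNT(*) AS rejected FROM staging_rejects;
    ```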

    Try It: Loading into Partitioned Tables

    Bulk load data that auto-routes to the correct partition

    SQL
    -- Loading into partitioned tables
    -- Data automatically routes to correct partition
    
    -- Table partitioned by month
    CREATE TABLE events (
        event_id BIGSERIAL,
        event_type TEXT NOT NULL,
        payload JSONB,
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    ) PARTITION BY RANGE (created_at);
    
    CREATE TABLE events_2024_01 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE events_2024_02 PARTITION OF events
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    -- COPY routes each row to the matching partition automatically
    COPY events (event_type, payload, created_at)
    FROM '/data/events.csv'              -- file path is illustrative
    WITH (FORMAT csv, HEADER true);
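
    A related trick for very large loads: fill a standalone table first, then attach it as a partition, so COPY skips per-row partition routing entirely (table and file names are illustrative):

    ```sql
    -- Load March into a plain table, then attach it in one step
    CREATE TABLE events_2024_03 (LIKE events INCLUDING DEFAULTS);
    COPY events_2024_03 FROM '/data/events_march.csv' WITH (FORMAT csv, HEADER true);

    -- A matching CHECK constraint lets ATTACH skip its validation scan
    ALTER TABLE events_2024_03
        ADD CONSTRAINT march_only
        CHECK (created_at >= '2024-03-01' AND created_at < '2024-04-01');
    ALTER TABLE events
        ATTACH PARTITION events_2024_03
        FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
    ```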

    📋 Quick Reference

    Method                Speed   Best For
    COPY                  ⚡⚡⚡     Server-side file loading
    Batch INSERT          ⚡⚡      Application-level loading
    LOAD DATA INFILE      ⚡⚡⚡     MySQL bulk loading
    Staging + Transform   ⚡⚡      Dirty data with validation
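
    The table lists MySQL's LOAD DATA INFILE, which hasn't been shown yet; it is MySQL's counterpart to COPY. A minimal sketch (file path is illustrative):

    ```sql
    -- MySQL: server-side bulk load; add LOCAL to read from the client instead
    LOAD DATA INFILE '/data/products.csv'
    INTO TABLE products
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (name, price, category_id, stock);
    ```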

    💡 Pro Tip: Before a massive bulk load, disable indexes and triggers, load the data, then re-enable and rebuild indexes. This can make loads 10x faster because the database doesn't need to update index structures for every single row.
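
    In PostgreSQL terms, that tip looks roughly like this; indexes can't be switched off, so they are dropped and recreated (index and table names are illustrative):

    ```sql
    -- 1. Drop secondary indexes and pause user-defined triggers
    DROP INDEX IF EXISTS idx_products_category;
    ALTER TABLE products DISABLE TRIGGER USER;

    -- 2. Bulk load while the table is "light"
    COPY products (name, price, category_id, stock)
    FROM '/data/products.csv' WITH (FORMAT csv, HEADER true);

    -- 3. Rebuild the index and re-enable triggers
    CREATE INDEX idx_products_category ON products (category_id);
    ALTER TABLE products ENABLE TRIGGER USER;
    ```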

    🎉 Lesson Complete!

    You can now load millions of rows efficiently! Next, learn when to use database triggers vs application logic.
