Lesson 3 • Beginner

    Data Preprocessing

    Clean, transform, and prepare raw data — because garbage in means garbage out.

    ✅ What You'll Learn

    • Handling missing data (mean, median, drop)
    • Feature scaling (Min-Max and standardisation)
    • Encoding categorical data (label and one-hot)
    • Train/test splitting and why it exposes overfitting

    🧹 Why Preprocessing Matters

    🎯 Real-World Analogy: Imagine cooking with unwashed vegetables, unsorted spices, and ingredients still in packaging. The result would be terrible. Data preprocessing is like washing, peeling, and chopping before cooking — it's not glamorous, but it determines the quality of the final dish.

    Data scientists spend 60-80% of their time on data preparation. Real data is messy: missing values, inconsistent formats, different scales, and text categories that algorithms can't read. Preprocessing transforms this chaos into clean, model-ready data.

    ⚠️ Golden Rule: "Garbage in, garbage out." The best algorithm in the world can't fix bad data. Always clean your data before training.

    Try It: Handling Missing Data

    Learn three strategies for dealing with missing values in datasets

    Python
    import numpy as np
    
    # Handling Missing Data — the #1 real-world data problem
    # Real datasets are NEVER perfectly clean
    
    # Simulate a dataset with missing values
    data = {
        'age':    [25, None, 35, 28, None, 42, 31, None, 38, 29],
        'salary': [50000, 60000, None, 55000, 70000, None, 65000, 48000, 72000, 51000],
        'rating': [4.2, 3.8, 4.5, None, 3.9, 4.1, None, 3.7, 4.8, 4.0]
    }
    
    # Count missing values
    for col, values in data.items():
        missing = sum(1 for v in values if v is None)
        print(f"{col}: {missing} missing out of {len(values)}")
    
    # Strategy 1: drop rows with missing values (simple, but loses data)
    # Strategy 2: fill with the mean (numerical, roughly normal distribution)
    ages = [v for v in data['age'] if v is not None]
    mean_age = sum(ages) / len(ages)
    filled_ages = [v if v is not None else round(mean_age, 1) for v in data['age']]
    print(f"Ages after mean fill: {filled_ages}")
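    When a numerical column is skewed or has outliers, the median is a safer fill value than the mean. A minimal sketch (the salary values are invented for illustration):

```python
import numpy as np

# One extreme outlier drags the mean far away from a "typical" value
salaries = np.array([48000, 50000, 51000, 55000, 500000], dtype=float)

mean_fill = salaries.mean()        # pulled up by the 500k outlier
median_fill = np.median(salaries)  # robust: simply the middle value

print(f"Fill with mean:   {mean_fill:.0f}")    # 140800, not a typical salary!
print(f"Fill with median: {median_fill:.0f}")  # 51000, far more representative
```

    This is exactly why the quick reference recommends the median for skewed distributions.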

    Try It: Feature Scaling

    Normalise features to the same scale with Min-Max and Z-score

    Python
    import numpy as np
    
    # Feature Scaling — CRITICAL for most ML algorithms
    # Without scaling, features with large ranges dominate
    
    # Raw data: age (20-60) vs salary (30000-150000)
    ages = np.array([25, 30, 35, 40, 45, 50, 55, 60])
    salaries = np.array([35000, 45000, 55000, 70000, 85000, 95000, 110000, 140000])
    
    print("=== Before Scaling ===")
    print(f"Age range: {ages.min()} - {ages.max()}")
    print(f"Salary range: {salaries.min()} - {salaries.max()}")
    print("⚠️ Salary values are 1000x larger — they'll dominate distance-based models")
    
    # Min-Max scaling: (x - min) / (max - min) → values in [0, 1]
    ages_mm = (ages - ages.min()) / (ages.max() - ages.min())
    salaries_mm = (salaries - salaries.min()) / (salaries.max() - salaries.min())
    
    # Standardisation (Z-score): (x - mean) / std → mean 0, std 1
    ages_z = (ages - ages.mean()) / ages.std()
    salaries_z = (salaries - salaries.mean()) / salaries.std()
    
    print("=== After Min-Max Scaling ===")
    print(f"Ages:     {np.round(ages_mm, 2)}")
    print(f"Salaries: {np.round(salaries_mm, 2)}")
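    To see why unscaled features mislead distance-based models like k-NN, compare the Euclidean distance between two people before and after Min-Max scaling. A small sketch; the two people and the feature ranges are assumed for illustration:

```python
import numpy as np

a = np.array([25, 35000.0])   # [age, salary] of person A
b = np.array([60, 36000.0])   # person B: very different age, similar salary

# Raw distance: the 1000-unit salary gap swamps the 35-year age gap
d_raw = np.linalg.norm(a - b)

# Min-Max scale each feature with assumed ranges: age 20-60, salary 30k-150k
lo = np.array([20.0, 30000.0])
hi = np.array([60.0, 150000.0])
d_scaled = np.linalg.norm((a - lo) / (hi - lo) - (b - lo) / (hi - lo))

print(f"Raw distance:    {d_raw:.1f}")    # ~1000.6, dominated by salary
print(f"Scaled distance: {d_scaled:.3f}") # ~0.875, driven by the real gap: age
```

    After scaling, the large age difference is what drives the distance, as it should.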

    Try It: Encoding Categories

    Convert text categories to numbers with label and one-hot encoding

    Python
    import numpy as np
    
    # Encoding Categorical Data — ML only understands numbers
    # Convert text categories to numerical values
    
    # Example: Customer data with categorical features
    customers = [
        {"city": "London", "plan": "Premium", "active": True},
        {"city": "Paris", "plan": "Basic", "active": False},
        {"city": "Berlin", "plan": "Premium", "active": True},
        {"city": "London", "plan": "Pro", "active": True},
        {"city": "Paris", "plan": "Basic", "active": False},
    ]
    
    # Method 1: Label Encoding — map each category to an integer
    plans = sorted({c["plan"] for c in customers})
    plan_to_int = {p: i for i, p in enumerate(plans)}
    print("Label encoding:", plan_to_int)
    
    # Method 2: One-Hot Encoding — one binary column per category
    cities = sorted({c["city"] for c in customers})
    for c in customers:
        one_hot = [1 if c["city"] == city else 0 for city in cities]
        print(f"{c['city']:>6} → {one_hot}")
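    Label encoding is the right tool when categories have a natural order. A short sketch with an invented risk column, plus the pitfall to avoid:

```python
# Ordinal categories carry order, so integer codes preserve real information
order = {"low": 0, "medium": 1, "high": 2}

risk_levels = ["low", "high", "medium", "low", "high"]
encoded = [order[level] for level in risk_levels]

print(encoded)  # [0, 2, 1, 0, 2]

# Caution: label-encoding NOMINAL categories (e.g. cities) invents a fake
# ordering like London < Paris, which misleads many models. Use one-hot there.
```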

    Try It: Train/Test Split

    Split data properly to prevent your model from cheating

    Python
    import numpy as np
    
    # Train/Test Split — Prevent your model from "cheating"
    # If you test on training data, you measure memorisation, not learning
    
    np.random.seed(42)
    # Generate 100 samples
    X = np.random.rand(100, 2)  # 100 samples, 2 features
    y = (X[:, 0] + X[:, 1] > 1).astype(int)  # binary target
    
    print(f"Total dataset: {len(X)} samples")
    print(f"  Class 0: {sum(y == 0)} samples")
    print(f"  Class 1: {sum(y == 1)} samples")
    print()
    
    # Standard split: 80% train, 20% test (shuffle first!)
    split_idx = 80
    indices = np.random.permutation(len(X))
    train_idx, test_idx = indices[:split_idx], indices[split_idx:]
    
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    
    print(f"Train: {len(X_train)} samples | Test: {len(X_test)} samples")
    print("Rule: the model NEVER sees the test set during training")
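    With imbalanced classes, a plain random split can leave the test set with too few minority samples. A stratified split draws the same fraction from each class; below is a hand-rolled NumPy sketch with an assumed 80/20 class imbalance (scikit-learn's train_test_split offers the same idea via its stratify parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)  # imbalanced: only 20% are class 1

# Draw 20% from EACH class so the test set mirrors the full class balance
test_idx = np.concatenate([
    rng.choice(np.where(y == c)[0], size=int(0.2 * (y == c).sum()), replace=False)
    for c in (0, 1)
])
train_mask = np.ones(len(y), dtype=bool)
train_mask[test_idx] = False

print(f"Test size: {len(test_idx)}")                   # 20 samples
print(f"Class 1 in test: {(y[test_idx] == 1).sum()}")  # exactly 4 (20% of 20)
```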

    📋 Quick Reference

    Step         | Method         | When to Use
    Missing data | fillna(mean)   | Numerical, normal distribution
    Missing data | fillna(median) | Numerical, skewed distribution
    Scaling      | MinMaxScaler   | Neural networks, bounded range
    Scaling      | StandardScaler | Linear models, SVM, k-NN
    Encoding     | LabelEncoder   | Ordinal categories (low/med/high)
    Encoding     | OneHotEncoder  | Nominal categories (red/blue/green)

    🎉 Lesson Complete!

    Your data is clean and ready! Next, build your first real ML model with Linear Regression.


