Lesson 3 • Beginner
Data Preprocessing
Clean, transform, and prepare raw data — because garbage in means garbage out.
✅ What You'll Learn
- Handling missing data (mean, median, drop)
- Feature scaling (Min-Max and Standardisation)
- Encoding categorical data (label and one-hot)
- Train/test splitting and why it prevents overfitting
🧹 Why Preprocessing Matters
🎯 Real-World Analogy: Imagine cooking with unwashed vegetables, unsorted spices, and ingredients still in packaging. The result would be terrible. Data preprocessing is like washing, peeling, and chopping before cooking — it's not glamorous, but it determines the quality of the final dish.
Data scientists spend 60-80% of their time on data preparation. Real data is messy: missing values, inconsistent formats, different scales, and text categories that algorithms can't read. Preprocessing transforms this chaos into clean, model-ready data.
⚠️ Golden Rule: "Garbage in, garbage out." The best algorithm in the world can't fix bad data. Always clean your data before training.
Try It: Handling Missing Data
Learn three strategies for dealing with missing values in datasets
import numpy as np
# Handling Missing Data — the #1 real-world data problem
# Real datasets are NEVER perfectly clean
# Simulate a dataset with missing values
data = {
    'age': [25, None, 35, 28, None, 42, 31, None, 38, 29],
    'salary': [50000, 60000, None, 55000, 70000, None, 65000, 48000, 72000, 51000],
    'rating': [4.2, 3.8, 4.5, None, 3.9, 4.1, None, 3.7, 4.8, 4.0]
}
# Count missing values
for col, values in data.items():
    missing = sum(1 for v in values if v is None)
    print(f"{col}: {missing} missing out of {len(values)}")

# Strategy 1: Fill with the mean (good for roughly normal data)
known_ages = [v for v in data['age'] if v is not None]
mean_filled = [v if v is not None else round(np.mean(known_ages), 1) for v in data['age']]
print(f"Mean-filled ages: {mean_filled}")

# Strategy 2: Fill with the median (robust to outliers and skew)
median_filled = [v if v is not None else float(np.median(known_ages)) for v in data['age']]

# Strategy 3: Drop any row with a missing value
complete = [i for i in range(10) if all(data[c][i] is not None for c in data)]
print(f"Rows with no missing values: {len(complete)} of 10")
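In practice the same three strategies are one-liners with pandas, whose `fillna` method the Quick Reference table below mentions. A minimal sketch, assuming pandas is installed (pandas stores the `None` values as `NaN`):

```python
import pandas as pd

# Same age/salary data as above, as a DataFrame
df = pd.DataFrame({
    'age': [25, None, 35, 28, None, 42, 31, None, 38, 29],
    'salary': [50000, 60000, None, 55000, 70000, None, 65000, 48000, 72000, 51000],
})

mean_filled = df['age'].fillna(df['age'].mean())      # Strategy 1: mean
median_filled = df['age'].fillna(df['age'].median())  # Strategy 2: median
dropped = df.dropna()                                 # Strategy 3: drop rows

print(f"Rows after dropna: {len(dropped)} of {len(df)}")
```

Note that `dropna` removes a row if *any* column is missing, so it can discard a lot of data; filling is usually preferred when you can't afford to lose rows.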
Try It: Feature Scaling
Normalise features to the same scale with Min-Max and Z-score
import numpy as np
# Feature Scaling — CRITICAL for most ML algorithms
# Without scaling, features with large ranges dominate
# Raw data: age (20-60) vs salary (30000-150000)
ages = np.array([25, 30, 35, 40, 45, 50, 55, 60])
salaries = np.array([35000, 45000, 55000, 70000, 85000, 95000, 110000, 140000])
print("=== Before Scaling ===")
print(f"Age range: {ages.min()} - {ages.max()}")
print(f"Salary range: {salaries.min()} - {salaries.max()}")
print("⚠️ Salary values are 1000x larger — they'll dominate distance-based models")

# Min-Max scaling: squeeze each feature into [0, 1]
ages_minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardisation (Z-score): rescale to mean 0, standard deviation 1
salaries_zscore = (salaries - salaries.mean()) / salaries.std()

print("=== After Scaling ===")
print(f"Age (min-max): {ages_minmax.round(2)}")
print(f"Salary (z-score): {salaries_zscore.round(2)}")
Try It: Encoding Categories
Convert text categories to numbers with label and one-hot encoding
import numpy as np
# Encoding Categorical Data — ML only understands numbers
# Convert text categories to numerical values
# Example: Customer data with categorical features
customers = [
    {"city": "London", "plan": "Premium", "active": True},
    {"city": "Paris", "plan": "Basic", "active": False},
    {"city": "Berlin", "plan": "Premium", "active": True},
    {"city": "London", "plan": "Pro", "active": True},
    {"city": "Paris", "plan": "Basic", "active": False},
]
# Method 1: Label Encoding assigns each category one integer
# Use for ordinal categories with a natural order (Basic < Pro < Premium)
plan_order = {"Basic": 0, "Pro": 1, "Premium": 2}
plan_encoded = [plan_order[c["plan"]] for c in customers]
print(f"Plan (label-encoded): {plan_encoded}")

# Method 2: One-Hot Encoding creates one binary column per category
# Use for nominal categories with no order (cities)
city_names = sorted({c["city"] for c in customers})
city_onehot = [[int(c["city"] == name) for name in city_names] for c in customers]
print(f"City columns: {city_names}")
print(f"City (one-hot): {city_onehot}")
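Scikit-learn packages both methods as the `LabelEncoder` and `OneHotEncoder` classes named in the Quick Reference table below. One caveat worth knowing: `LabelEncoder` numbers categories alphabetically, so it does *not* respect a custom order like Basic < Pro < Premium. A sketch, assuming scikit-learn is installed:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

plans = ["Premium", "Basic", "Premium", "Pro", "Basic"]
cities = [["London"], ["Paris"], ["Berlin"], ["London"], ["Paris"]]

# Alphabetical mapping: Basic=0, Premium=1, Pro=2 — NOT the business ordering!
plan_labels = LabelEncoder().fit_transform(plans)

# One binary column per city (columns sorted: Berlin, London, Paris)
city_onehot = OneHotEncoder().fit_transform(cities).toarray()

print(f"Plan labels: {list(plan_labels)}")   # [1, 0, 1, 2, 0]
print(city_onehot)
```

If the order matters, keep the explicit dictionary mapping from the snippet above, or pass an ordered `categories` list to scikit-learn's `OrdinalEncoder`.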
Try It: Train/Test Split
Split data properly to prevent your model from cheating
import numpy as np
# Train/Test Split — Prevent your model from "cheating"
# If you test on training data, you measure memorisation, not learning
np.random.seed(42)
# Generate 100 samples
X = np.random.rand(100, 2) # 100 samples, 2 features
y = (X[:, 0] + X[:, 1] > 1).astype(int) # binary target
print(f"Total dataset: {len(X)} samples")
print(f" Class 0: {sum(y == 0)} samples")
print(f" Class 1: {sum(y == 1)} samples")
print()
# Standard split: 80% train, 20% test
split_idx = 80
indices = np.random.permutation(len(X))  # shuffle BEFORE slicing
train_idx, test_idx = indices[:split_idx], indices[split_idx:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(f"Train: {len(X_train)} samples | Test: {len(X_test)} samples")
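The shuffle-and-slice pattern above is exactly what scikit-learn's `train_test_split` does in one call, with the bonus of a `stratify` argument that keeps the class balance the same in both halves. A sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.random.rand(100, 2)               # 100 samples, 2 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Train: {len(X_train)} | Test: {len(X_test)}")  # Train: 80 | Test: 20
```

`random_state` makes the split reproducible, which matters when you want to compare models fairly on the same held-out data.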
📋 Quick Reference
| Step | Method | When to Use |
|---|---|---|
| Missing data | fillna(mean) | Numerical, normal distribution |
| Missing data | fillna(median) | Numerical, skewed distribution |
| Scaling | MinMaxScaler | Neural networks, bounded range |
| Scaling | StandardScaler | Linear models, SVM, k-NN |
| Encoding | LabelEncoder | Ordinal categories (low/med/high) |
| Encoding | OneHotEncoder | Nominal categories (red/blue/green) |
🎉 Lesson Complete!
Your data is clean and ready! Next, build your first real ML model with Linear Regression.