Lesson 3 • Beginner

Data Preprocessing

Turn messy, raw data into clean, model-ready numbers — handle missing values, scale features, encode categories, and split safely without leaking the answers.

What You'll Learn in This Lesson

✓ You'll be able to find and fill missing values with the mean or median
✓ You'll be able to scale features with min-max (0–1) by hand
✓ You'll be able to standardise features to a z-score (mean 0, std 1)
✓ You'll be able to one-hot encode nominal categories with lists
✓ You'll be able to pick label vs one-hot encoding correctly
✓ You'll be able to split train/test data without leaking it

Before you start: Make sure you've completed Python for ML — you'll use lists, dictionaries, loops, and list comprehensions throughout this lesson.

🍳 Real-World Analogy: Prepping Ingredients

Imagine cooking with unwashed vegetables, ingredients still in their packaging, and spices measured in totally different units — pinches, kilograms, cups. The meal would be a disaster, no matter how good the recipe is.

Data preprocessing is the prep work before cooking. Filling missing values is like replacing the one carrot that went bad. Scaling is putting every ingredient on the same measuring scale so none overwhelms the dish. Encoding is translating "a handful of basil" into an exact number the recipe can use. And splitting train/test is tasting with a clean spoon you never dipped back in the pot — so your judgement stays honest.

⚠️ Golden Rule: "Garbage in, garbage out." Data scientists are often said to spend 60–80% of their time here, because even the best algorithm cannot rescue badly prepared data.

1. Handling Missing Values

A missing value is a hole in your data — a survey question left blank, a sensor that dropped out, a field that was never filled in. In Python you'll see these as None (or NaN in pandas). Most ML algorithms crash or misbehave if you feed them holes, so you must deal with every one.

You have three everyday strategies:

Fill with the mean — the average of the values you do have. Good for roughly symmetric numeric data.
Fill with the median — the middle value. Better when the data is skewed (a few huge salaries pull the mean upward).
Drop the row — delete records that have holes. Only safe when very few rows are affected.

Read the worked example below line by line, then run it. Every line states the result in a comment.

Worked Example: Handling Missing Data

Count, mean-fill, median-fill, and drop missing values using plain lists

Python

# Handling Missing Data — the #1 real-world data problem
# Real datasets are NEVER perfectly clean. A "missing value" is a hole
# in your data — here we mark holes with Python's None.

# A small table stored as plain lists (no pandas needed)
age    = [25, None, 35, 28, None, 42, 31, None, 38, 29]
salary = [50000, 60000, None, 55000, 70000, None, 65000, 48000, 72000, 51000]

# 1) Count the holes in each column
def count_missing(col):
    return sum(1 for v in col if v is None)   # None counts as 
...
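
The three strategies above can be sketched in a few lines. This is a minimal sketch, not the lesson's own listing: the helper names `fill_missing` and `drop_incomplete` are hypothetical, and it leans on the standard-library `statistics` module rather than hand-written arithmetic.

```python
from statistics import mean, median

def fill_missing(col, strategy="mean"):
    """Return a copy of col with each None replaced by the mean or
    median of the values that ARE present."""
    present = [v for v in col if v is not None]
    fill = mean(present) if strategy == "mean" else median(present)
    return [fill if v is None else v for v in col]

def drop_incomplete(rows):
    """Keep only the rows that contain no None at all."""
    return [row for row in rows if None not in row]

age    = [25, None, 35, 28, None, 42, 31, None, 38, 29]
salary = [50000, 60000, None, 55000, 70000, None, 65000, 48000, 72000, 51000]

age_mean_filled      = fill_missing(age, "mean")       # holes become ~32.57
salary_median_filled = fill_missing(salary, "median")  # holes become 57500.0
complete_rows        = drop_incomplete(list(zip(age, salary)))

print("age (mean fill)     :", [round(v, 2) for v in age_mean_filled])
print("salary (median fill):", salary_median_filled)
print("rows kept after drop:", len(complete_rows), "of", len(age))  # 5 of 10
```

Notice that dropping keeps only 5 of the 10 rows here, which is why it is only safe when very few rows have holes.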

2. Scaling & Normalisation

Scaling means rewriting every feature so they all cover a similar range. Why bother? Imagine one column is age (20–60) and another is salary (35,000–140,000). Many algorithms measure distance between numbers — and salary's huge values would completely dominate, drowning out age entirely. Scaling levels the playing field.

Two methods cover almost everything:

Min-Max scaling → range 0 to 1

(x - min) / (max - min)

The smallest value becomes 0, the largest becomes 1. Great for neural networks and any bounded input.

Standardisation → mean 0, std 1

(x - mean) / std

Re-centres data around 0. The "std" (standard deviation) measures spread. Great for linear models, SVMs, and k-NN.

Worked Example: Min-Max & Z-Score Scaling

Implement both scaling formulas from scratch with plain functions

Python

# Feature Scaling — put every feature on the SAME scale
# Without it, big-number features (salary) drown out small ones (age).

ages    = [25, 30, 35, 40, 45, 50, 55, 60]
salaries = [35000, 45000, 55000, 70000, 85000, 110000, 125000, 140000]

# --- Min-Max scaling: squashes every value into the range 0..1 ---
# formula:  (x - min) / (max - min)
def min_max(col):
    lo, hi = min(col), max(col)
    return [round((x - lo) / (hi - lo), 2) for x in col]

print("ages  min-max  :", min_max(ages))     
...
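
For the second formula, (x - mean) / std, here is one possible sketch. It assumes the population standard deviation (`statistics.pstdev`), which is also what scikit-learn's StandardScaler uses; the function name `z_score` and the guard for a constant column are illustrative choices, not taken from the lesson's own listing.

```python
from statistics import mean, pstdev

def z_score(col):
    """Standardise a column: subtract the mean, divide by the std."""
    mu = mean(col)
    sigma = pstdev(col)            # population standard deviation
    if sigma == 0:                 # constant column: avoid dividing by zero
        return [0.0 for _ in col]
    return [round((x - mu) / sigma, 2) for x in col]

ages = [25, 30, 35, 40, 45, 50, 55, 60]
ages_z = z_score(ages)

print("ages z-score:", ages_z)
# The smallest age maps to -1.53 and the largest to 1.53;
# values below the mean come out negative, values above it positive.
```

The same zero-range guard is worth adding to min-max scaling too, since (max - min) is 0 when every value in a column is identical.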

🎯 Your Turn: Min-Max Scaling

Fill in the blanks to scale a list of scores into the 0–1 range

Python

# 🎯 YOUR TURN — implement min-max scaling yourself
# Fill in the blanks marked with ___ to squash every score into 0..1.

scores = [40, 55, 70, 85, 100]

# 1) Find the smallest and largest values
lo = ___        # 👉 replace ___ with min(scores)
hi = ___        # 👉 replace ___ with max(scores)

# 2) Apply the min-max formula to each score:  (x - lo) / (hi - lo)
scaled = [round((x - lo) / (hi - lo), 2) for x in scores]

print("scaled:", scaled)

# ✅ Expected output:
# scaled: [0.0, 0.25, 0.5, 0
...

3. Encoding Categorical Data

A categorical feature is text that names a group — a city, a plan, a colour. ML maths needs numbers, so you must translate categories into numbers. How you translate depends on whether the categories have an order.

Ordinal categories have a real order: Basic < Pro < Premium, or low < medium < high. Use label encoding — one number per category (0, 1, 2…). The order you assign actually means something.
Nominal categories have no order: London, Paris, Berlin. Use one-hot encoding — one column per category with a single 1 and the rest 0. This avoids implying a fake ranking.

The classic mistake: label-encoding nominal data. If London=0 and Paris=1, the model thinks Paris is "greater than" London — a relationship that does not exist. That silently corrupts your results.
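
To see the fake ranking in numbers, here is a small sketch that measures the "distance" between cities under each encoding. The helper names `label_gap` and `one_hot_gap` are hypothetical, and `math.dist` needs Python 3.8 or newer.

```python
import math

cities = ["London", "Paris", "Berlin"]

# Label encoding: one arbitrary number per city
label = {"London": 0, "Paris": 1, "Berlin": 2}

# One-hot encoding: one slot per city, a single 1 in the matching slot
one_hot = {city: [1 if city == c else 0 for c in cities] for city in cities}

def label_gap(a, b):
    """Distance between two cities as a distance-based model would see it."""
    return abs(label[a] - label[b])

def one_hot_gap(a, b):
    """Euclidean distance between the two one-hot rows."""
    return math.dist(one_hot[a], one_hot[b])

print(label_gap("London", "Paris"))    # 1
print(label_gap("London", "Berlin"))   # 2  (Berlin looks "twice as far")
print(one_hot_gap("London", "Paris") == one_hot_gap("London", "Berlin"))  # True
```

With label codes, Berlin appears twice as far from London as Paris does, purely because of the numbers you happened to pick. With one-hot rows, every pair of cities is equally far apart, so no ranking is implied.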

Worked Example: Label & One-Hot Encoding

Build both encoders by hand with dictionaries and list comprehensions

Python

# Encoding Categorical Data — ML maths only understands NUMBERS
# So every text category has to become a number first.

cities = ["London", "Paris", "Berlin", "London", "Paris"]
plans  = ["Basic", "Pro", "Premium", "Pro", "Basic"]

# --- Label encoding: ONE number per category. Use for ORDINAL data ---
# "ordinal" = the categories have a real order (Basic < Pro < Premium).
plan_rank = {"Basic": 0, "Pro": 1, "Premium": 2}   # you choose the order
plan_codes = [plan_rank[p] for p in plans]
print("
...

🎯 Your Turn: One-Hot Encoding

Fill in the comprehension that turns each colour into a one-hot row

Python

# 🎯 YOUR TURN — build a one-hot encoder
# Each colour should become a list with a single 1 and the rest 0.

colours = ["red", "blue", "green", "red"]

# 1) Get the sorted list of unique categories
categories = sorted(set(colours))   # ['blue', 'green', 'red']

# 2) For each colour, put a 1 in the matching slot, 0 everywhere else
for colour in colours:
    # 👉 replace ___ with: 1 if colour == c else 0
    one_hot = [___ for c in categories]
    print(colour.ljust(6), "->", one_hot)

# ✅ Expecte
...

4. Train/Test Split & Avoiding Data Leakage

To know if a model actually learned (rather than just memorised), you hold back some data. You train on most of it (commonly 80%) and test on the rest (20%) — data the model has never seen. Always shuffle first so the split isn't accidentally ordered by date or class.

Data leakage is the silent killer of ML projects. It happens whenever information from the test set sneaks into training. The most common cause is scaling before splitting: if you compute the min and max over the whole dataset, those numbers already "know" about your test rows.

The safe order, every time: 1) split → 2) fit the scaler/encoder on the training set only → 3) apply that fitted transform to the test set. Never fit on the test data.

Worked Example: Safe Train/Test Split

Shuffle, split 80/20, and scale the right way to avoid leakage

Python

# Train/Test Split — and the trap that ruins real projects: LEAKAGE
# Rule: the model must NEVER see the test data while it is learning.

import random
random.seed(42)                       # makes the shuffle repeatable

data = list(range(1, 11))             # 10 samples: 1..10
random.shuffle(data)                  # ALWAYS shuffle before splitting
print("shuffled:", data)

# 80% train, 20% test
cut = int(len(data) * 0.8)            # 8
train, test = data[:cut], data[cut:]
print("train   :", tr
...
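
The safe order (split, fit on train, apply to test) could look like this in plain Python. Treat it as a sketch under assumptions: the `values` list is made-up data with one deliberately large number, and `fit_min_max` / `apply_min_max` are hypothetical helper names.

```python
import random

def fit_min_max(col):
    """'Fit' step: learn the min and max from a column."""
    return min(col), max(col)

def apply_min_max(col, lo, hi):
    """'Transform' step: scale a column with an already-learned lo/hi."""
    return [round((x - lo) / (hi - lo), 2) for x in col]

random.seed(0)
values = [12, 15, 18, 20, 22, 25, 30, 35, 40, 100]   # made-up data
random.shuffle(values)

# 1) Split FIRST
cut = int(len(values) * 0.8)
train, test = values[:cut], values[cut:]

# 2) Fit on the training set ONLY
lo, hi = fit_min_max(train)

# 3) Apply the train-fitted lo/hi to both sets
train_scaled = apply_min_max(train, lo, hi)
test_scaled  = apply_min_max(test, lo, hi)

# For comparison: the leaky version would learn from every row
leaky_lo, leaky_hi = fit_min_max(values)

print("train lo/hi:", (lo, hi), "| leaky lo/hi:", (leaky_lo, leaky_hi))
print("train scaled:", train_scaled)
print("test scaled :", test_scaled)
```

The training values always land in 0..1. Test values may fall slightly outside that range when the test set holds a value smaller or larger than anything in training. That is expected, and it is the honest picture of how the model will meet new data.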

🐼 The Real-World Version: pandas + scikit-learn

You just implemented every step by hand so you understand exactly what happens. In real projects you'd reach for pandas and scikit-learn, which do the same work in a handful of lines. Crucially, their separate fit and transform methods make the "fit on train, transform on test" rule easy to follow, though it is still up to you to call them on the right data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

df = pd.read_csv("customers.csv")

# 1) Split FIRST — before any scaling or fitting (prevents leakage)
X = df[["age", "salary", "city"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_test = X_train.copy(), X_test.copy()   # independent copies, safe to assign into
# Expected: X_train has ~80% of rows, X_test the other ~20%

# 2) Fill missing numbers with the column mean — FIT on train only
imputer = SimpleImputer(strategy="mean")
X_train[["age", "salary"]] = imputer.fit_transform(X_train[["age", "salary"]])
X_test[["age", "salary"]]  = imputer.transform(X_test[["age", "salary"]])
# Expected: no NaNs remain in age/salary

# 3) Standardise numbers — FIT on train, then reuse on test
scaler = StandardScaler()
X_train[["age", "salary"]] = scaler.fit_transform(X_train[["age", "salary"]])
X_test[["age", "salary"]]  = scaler.transform(X_test[["age", "salary"]])
# Expected: train columns now have mean ~0 and std ~1

# 4) One-hot encode the nominal "city" column
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # sparse_output needs scikit-learn >= 1.2
city_train = encoder.fit_transform(X_train[["city"]])
city_test  = encoder.transform(X_test[["city"]])
# Expected: a NumPy array with one 0/1 column per city seen in training;
# encoder.get_feature_names_out() names them, e.g. city_Berlin, city_London, city_Paris

Notice the pattern repeated three times (imputer, scaler, encoder): fit_transform on train, then transform on test. That single discipline is what keeps your evaluation honest.
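
If the fit / transform split feels abstract, it can help to build a tiny version yourself. The class below is a hypothetical stand-in written for illustration (it is not scikit-learn's code); it only mimics the shape of the pattern for mean imputation on a plain list.

```python
class MeanImputerSketch:
    """Toy imputer that mimics the fit / transform pattern."""

    def fit(self, col):
        # Learn the fill value from the values that are present
        present = [v for v in col if v is not None]
        self.fill_ = sum(present) / len(present)
        return self

    def transform(self, col):
        # Reuse the learned fill value; nothing new is learned here
        return [self.fill_ if v is None else v for v in col]

    def fit_transform(self, col):
        return self.fit(col).transform(col)

train_age = [25, None, 35, 28, 42]
test_age  = [None, 60]

imputer = MeanImputerSketch()
train_filled = imputer.fit_transform(train_age)  # learns 32.5 from train
test_filled  = imputer.transform(test_age)       # reuses 32.5 on test

print("train:", train_filled)   # [25, 32.5, 35, 28, 42]
print("test :", test_filled)    # [32.5, 60]
```

The hole in the test column is filled with 32.5, the training mean. The test value 60 never influences the fill value, which is exactly the point.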

🎯 Mini-Challenge: Median Imputation

Time to fly with less support. Fill in the missing temperatures using the median of the values that are present. The starter below is just a comment outline — write the logic yourself.

Mini-Challenge: Median Imputation

Write median-fill from scratch using the comment outline

Python

# 🎯 MINI-CHALLENGE: fill missing values with the MEDIAN
#
# temps below has two holes (None). Replace each None with the median
# of the values that are NOT None.
#
# Steps:
#   1. Build a list of only the non-None values
#   2. Sort it
#   3. The median is the middle item:  sorted_list[len(sorted_list) // 2]
#      (this shortcut works here because the count of present values is odd)
#   4. Build a new list, swapping each None for that median
#   5. print the filled list
#
# ✅ Expected output:
# filled: [18, 21, 20, 24, 20, 20, 19]
# (present values sorted: [18, 19, 20
...

Common Errors (And How to Fix Them)

These four mistakes quietly wreck more ML projects than any algorithm bug. Learn to spot them.

❌ Scaling before the split (data leakage)

Fitting the scaler on the whole dataset lets the test set influence training. Scores look great, then collapse in production.

# ❌ WRONG — min/max computed over ALL data, then split
# (sketch: data is assumed to be a NumPy array, split() stands in for your own split)
scaled = (data - data.min()) / (data.max() - data.min())
train, test = split(scaled)

✅ Fix: split first, fit on train, apply to test:

train, test = split(data)             # split FIRST
lo, hi = train.min(), train.max()     # learn from train only
train_s = (train - lo) / (hi - lo)
test_s  = (test  - lo) / (hi - lo)    # reuse train's lo/hi

❌ Fitting the scaler on the test set

Calling fit (or recomputing min/max) on test data is leakage in disguise — the test set is supposed to be unseen.

scaler.fit(X_test)        # ❌ never fit on test
X_test = scaler.transform(X_test)

✅ Fix: only ever transform the test set:

scaler.fit(X_train)              # ✅ fit on train
X_test = scaler.transform(X_test)  # ✅ transform only

❌ Dropping every row with a missing value

If 30% of rows have one hole each, dropping them throws away nearly a third of your data, and may bias what's left.

clean = [r for r in rows if None not in r]  # ❌ drops every row that has any hole

✅ Fix: impute (fill) instead, drop only when <5% affected:

# Fill numeric holes with the mean/median, keep the rows
filled = [v if v is not None else mean for v in column]
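
Before choosing between dropping and filling, it helps to measure how many rows are actually affected. A minimal sketch, assuming rows are tuples and using the lesson's rough 5% rule of thumb; the `rows` data and the name `missing_row_fraction` are made up for illustration.

```python
def missing_row_fraction(rows):
    """Fraction of rows that contain at least one None."""
    incomplete = sum(1 for row in rows if None in row)
    return incomplete / len(rows)

# Made-up (age, salary) rows; two of the ten have a hole
rows = [(25, 50000), (None, 60000), (35, None), (28, 55000), (31, 65000),
        (38, 72000), (29, 51000), (42, 58000), (33, 61000), (27, 49000)]

fraction = missing_row_fraction(rows)
decision = "drop" if fraction < 0.05 else "impute"

print(f"{fraction:.0%} of rows have a hole -> {decision}")   # 20% ... -> impute
```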

❌ Label-encoding nominal categories

Mapping unordered categories to 0,1,2 invents a fake ranking the model will believe.

{"London": 0, "Paris": 1, "Berlin": 2}  # ❌ implies Berlin > London

✅ Fix: one-hot encode anything without a natural order:

# London -> [0,1,0], Paris -> [0,0,1], Berlin -> [1,0,0]
[1 if city == c else 0 for c in sorted(set(cities))]

Reserve label encoding for genuinely ordinal data (low/medium/high).

📋 Quick Reference

Step	Method (sklearn)	When to Use
Missing data	SimpleImputer(strategy="mean")	Numeric, roughly symmetric
Missing data	SimpleImputer(strategy="median")	Numeric, skewed distribution
Scaling	MinMaxScaler	Neural nets, bounded 0–1 range
Scaling	StandardScaler	Linear models, SVM, k-NN
Encoding	OrdinalEncoder (explicit category order)	Ordinal features (low/medium/high)
Encoding	OneHotEncoder	Nominal (red/blue/green)
Splitting	train_test_split	Always; shuffles by default, add stratify=y for classification
Order	split → fit → transform	Always, to avoid leakage

❓ Frequently Asked Questions

Q: Should I scale my data before or after the train/test split?

A: After. Split first, fit the scaler on the training set only, then apply that same scaler to the test set. Fitting on all the data before splitting lets test information leak into training and inflates your scores.

Q: What is the difference between min-max scaling and standardisation?

A: Min-max scaling squashes values into a fixed 0-to-1 range using (x - min) / (max - min). Standardisation (z-score) re-centres values so the mean is 0 and the standard deviation is 1 using (x - mean) / std. Use min-max when the model expects bounded inputs, as many neural networks do; use z-score for linear models, SVMs, and k-NN.

Q: When do I use label encoding versus one-hot encoding?

A: Use label encoding for ordinal categories that have a real order (low/medium/high). Use one-hot encoding for nominal categories with no order (London/Paris/Berlin), because label encoding would wrongly imply Paris > London.

Q: Is it OK to just delete rows that have missing values?

A: Only if very few rows are affected (roughly under 5%). Dropping rows blindly throws away real signal and can bias your dataset. For numeric columns, filling with the mean (roughly symmetric data) or the median (skewed data) usually keeps more information.

Q: Do I need pandas and scikit-learn to do preprocessing?

A: No — every step here is plain arithmetic you can do with lists and dictionaries, which is why this lesson implements them by hand. In real projects you reach for pandas (fillna) and scikit-learn (MinMaxScaler, StandardScaler, OneHotEncoder, train_test_split) because they are faster and battle-tested.

🎉 Lesson Complete!

You can now spot and fill missing values, scale features with min-max and z-score, encode categories the right way, and split data without leaking it into your evaluation. You implemented every step in plain Python, so the pandas and scikit-learn versions will feel like shortcuts rather than magic.

🚀 Up next: Linear Regression — feed your freshly cleaned data into your first real machine-learning model and watch it make predictions.
