Lesson 31 • Advanced
Computer Vision Pipelines
Follow an image all the way from raw photo to live prediction — collecting and labelling data, augmenting and preprocessing it, transfer-learning a pretrained backbone, running the train/validate loop, evaluating honestly, and deploying for inference.
What You'll Learn in This Lesson
- ✓ How to collect and label data, then split it into train and validation sets
- ✓ Why augmentation (flips, crops, colour) only ever touches the training set
- ✓ How to preprocess images by resizing and normalizing to a fixed range
- ✓ How transfer learning reuses a pretrained backbone for your own classes
- ✓ How the train/validate loop works and how to read its numbers
- ✓ How to evaluate a model and deploy it for inference on new images
🏭 Real-World Analogy: An Assembly Line
A CV pipeline is an assembly line that turns raw photos into predictions. Picture a factory floor with stations in a row:
- Loading dock — photos arrive and get labelled (data collection & labeling).
- Copy station — make varied copies by flipping, cropping and re-colouring (augmentation, training only).
- Standardisation — every photo is resized and its pixels rescaled to the same range (preprocessing).
- Expert inspector — a pretrained backbone that already knows edges and textures examines each photo (transfer learning).
- Sorter — a small classifier head drops each photo into a labelled bin (prediction).
- Quality control — a held-back sample is scored honestly (evaluation), then the line ships (deployment).
If one station is mis-calibrated — say standardisation differs between the factory and the field — every later station produces junk. The whole point of a pipeline is that each stage is reliable and consistent from training through to deployment.
1. Collect, Label, and Split Your Data
Everything starts with labelled data — images paired with the correct answer (the label). Before any training, you split that data into a train set the model learns from and a validation set you hold back to check progress honestly. A common split is 80/20.
The split below is the simplest possible version — slice a list. Run it and watch which samples land where.
Worked Example: Train / Validation Split
Slice a labelled dataset into train and validation piles
# Before training you split your labelled data into two piles:
#   train -> the model learns from these
#   val   -> held back, used ONLY to check progress honestly
# A common split is 80% train, 20% validation.
samples = ["img0", "img1", "img2", "img3", "img4",
           "img5", "img6", "img7", "img8", "img9"]

def train_val_split(data, val_fraction=0.2):
    n_val = int(len(data) * val_fraction)  # 10 * 0.2 = 2 go to val
    val = data[:n_val]                     # first slice -> validati
...
2. Augmentation: Free Variety for Training
Augmentation creates safe, label-preserving copies of training images so the model sees more variety: horizontal flips, random crops, and colour jitter (brightness/contrast). A flipped cat is still a cat, so the label stays the same while the pixels change.
Here's the simplest augmentation — a horizontal flip — written in plain Python so you can see exactly what changes.
Worked Example: Horizontal Flip
Mirror a nested-list image left to right
# Augmentation = make safe copies of training images so the model sees
# more variety. A horizontal flip mirrors the image left<->right, which
# teaches the model that a cat facing left is still a cat.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

def horizontal_flip(img):
    # Reverse the pixels WITHIN each row (columns flip), rows stay in order.
    return [list(reversed(row)) for row in img]

flipped = horizontal_flip(image)
for row in flipped:
    print(row)

# Expected output:
# [3, 2, 1]
#
...
3. Preprocessing: Resize and Normalize
Preprocessing makes every image look the same to the model. Two steps dominate: resize (so all images share one width and height the model expects) and normalize (rescale pixels from 0..255 down to a small range like 0..1). Normalizing keeps training stable.
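To make the resize step as concrete as the normalize step, here is a minimal sketch of nearest-neighbour resizing on a nested-list image. The function name `resize_nearest` and the sample pixel values are made up for illustration; real libraries usually default to smoother interpolation such as bilinear, so treat this as a picture of the idea and not as what torchvision does internally.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a nested-list image by picking the nearest source pixel."""
    old_h, old_w = len(img), len(img[0])
    resized = []
    for y in range(new_h):
        src_y = y * old_h // new_h          # source row for this output row
        row = []
        for x in range(new_w):
            src_x = x * old_w // new_w      # source column for this output column
            row.append(img[src_y][src_x])
        resized.append(row)
    return resized

# A hypothetical 4x4 grayscale image shrunk to 2x2.
image = [
    [0, 50, 100, 150],
    [10, 60, 110, 160],
    [20, 70, 120, 170],
    [30, 80, 130, 180],
]
small = resize_nearest(image, 2, 2)
for row in small:
    print(row)
# Prints:
# [0, 100]
# [20, 120]
```

Whatever the interpolation method, the point is the same: every image leaves this stage with one fixed height and width, so the model always sees the input shape it expects.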
Run the example below to normalize a tiny "image" by hand — every pixel divided by 255.0.
Worked Example: Normalize to 0..1
Scale a nested-list image from 0..255 down to 0..1
# A camera gives pixel values from 0 (black) to 255 (white).
# Models train best when inputs sit in a small, fixed range like 0..1.
# "Normalizing" = divide every pixel by 255.0 so 0->0.0 and 255->1.0.
# A tiny 2x3 grayscale "image" stored as a nested list (rows of pixels).
image = [
    [0, 128, 255],
    [64, 192, 32],
]

def normalize(img):
    # Walk each row, then each pixel, dividing by 255.0.
    return [[pixel / 255.0 for pixel in row] for row in img]

normalized = normalize(image)
for
...
In real projects you don't do this by hand: torchvision.transforms composes resize, crop, tensor-conversion and normalization into one reusable pipeline. The same resize and normalization recipe must apply at training, validation, and inference time.
# In production you don't normalize by hand; torchvision composes the
# whole preprocessing pipeline for you. This is the standard recipe for a
# model pretrained on ImageNet.
import torch
from torchvision import transforms

# Resize -> crop to the model's input size -> to tensor -> normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),   # pixels become 0..1 floats
    transforms.Normalize(    # then ImageNet mean/std
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Pretend we already loaded a 224x224 RGB image as a tensor of zeros.
fake_image = torch.zeros(3, 224, 224)
# This pipeline starts from a PIL image, so convert the tensor first,
# then run the actual preprocess pipeline on it.
pil_image = transforms.ToPILImage()(fake_image)
print("input shape :", tuple(fake_image.shape))
print("output shape:", tuple(preprocess(pil_image).shape))

# Expected output:
# input shape : (3, 224, 224)
# output shape: (3, 224, 224)
4. Transfer Learning with a Pretrained Backbone
Training a vision model from scratch typically needs a very large labelled dataset (ImageNet, the usual pretraining set, has over a million images). Transfer learning avoids that: you take a backbone (like ResNet-50) that has already learned generic features (edges, textures, shapes) from ImageNet, and you replace only its final layer (the head) so it outputs your classes.
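A quick piece of arithmetic shows how small the new head is compared with the backbone it sits on. The sketch below counts the parameters of a 2048 -> 5 linear layer (the shape used for ResNet-50's head in this lesson). The backbone size of roughly 23.5 million parameters is an approximate figure assumed for illustration, and `linear_param_count` is a hypothetical helper, not a library function.

```python
def linear_param_count(in_features, out_features):
    """Parameters in a fully connected layer: one weight per input/output
    pair, plus one bias per output."""
    return in_features * out_features + out_features

# The new classifier head: 2048 backbone features -> 5 classes.
head_params = linear_param_count(2048, 5)

# ASSUMPTION: rough size of the ResNet-50 backbone without its final layer.
APPROX_BACKBONE_PARAMS = 23_500_000

share = head_params / (APPROX_BACKBONE_PARAMS + head_params)
print("head parameters:", head_params)          # -> head parameters: 10245
print(f"share of the whole model: {share:.4%}")
```

Under that assumption the head is well under a tenth of a percent of the model, which is why reusing the pretrained weights saves so much data and compute.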
Below, Albumentations builds the augmentation pipelines (note: train augments, val does not) and a pretrained ResNet-50 has its head swapped for 5 classes.
# Albumentations is the go-to library for fast image augmentation, and
# torchvision.models gives you a pretrained backbone for transfer learning.
import albumentations as A
import torchvision.models as models
import torch.nn as nn

# Build a TRAINING augmentation pipeline (flips, crops, colour jitter).
train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),  # mirror half the time
    # Note: newer Albumentations releases take size=(224, 224) here instead
    # of height/width; check the signature for the version you have installed.
    A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),
    A.ColorJitter(brightness=0.2, contrast=0.2, p=0.5),
    A.Normalize(),            # 0..1 then ImageNet stats
])

# The VALIDATION pipeline must NOT augment: only resize, crop, and normalize.
val_aug = A.Compose([
    A.Resize(256, 256),
    A.CenterCrop(224, 224),
    A.Normalize(),
])

# Transfer learning: take a pretrained ResNet-50 and swap its final layer
# for one that outputs YOUR number of classes (here: 5).
model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 5)  # 2048 -> 5 classes
print("ready:", type(model).__name__, "with", model.fc.out_features, "classes")

# Expected output:
# ready: ResNet with 5 classes
5. The Train / Validate Loop
Training runs in epochs — one full pass over the training data. Each epoch you let the model learn from the train set (compute loss, back-propagate, update weights), then validate on the held-back set without learning from it. Watching train loss fall while val accuracy rises tells you it's working; if val accuracy stalls or drops while train keeps improving, the model is overfitting.
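The overfitting signal described above can be written down as a small check over the per-epoch history. This is only a sketch of one possible rule of thumb: the function names and the `patience` parameter are hypothetical, the first three epochs reuse the numbers from this lesson's sample printout, and the later epochs are invented to show a run going wrong.

```python
def epochs_since_best(val_acc_history):
    """How many epochs have passed since validation accuracy peaked."""
    best_epoch = max(range(len(val_acc_history)),
                     key=lambda i: val_acc_history[i])
    return len(val_acc_history) - 1 - best_epoch

def looks_overfit(train_loss_history, val_acc_history, patience=2):
    """True when train loss is still falling but validation accuracy
    has not improved for `patience` epochs."""
    if len(val_acc_history) <= patience:
        return False  # not enough history to judge yet
    train_improving = train_loss_history[-1] < train_loss_history[-1 - patience]
    return train_improving and epochs_since_best(val_acc_history) >= patience

# Healthy run: train loss falls and val accuracy keeps rising.
healthy_loss = [1.42, 0.88, 0.55]
healthy_acc = [0.61, 0.74, 0.81]

# Invented continuation: train loss keeps falling, val accuracy slips back.
overfit_loss = [1.42, 0.88, 0.55, 0.31, 0.18]
overfit_acc = [0.61, 0.74, 0.81, 0.80, 0.78]

print(looks_overfit(healthy_loss, healthy_acc))   # -> False
print(looks_overfit(overfit_loss, overfit_acc))   # -> True
```

A check like this is the basic idea behind early stopping: keep the weights from the epoch with the best validation score and stop once it has not improved for a while.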
# The training loop: for each epoch, learn from train data, then check
# val data WITHOUT learning from it. Watch train vs val to spot overfitting.
import torch

def train_one_epoch(model, loader, optimizer, loss_fn):
    model.train()                  # dropout on, batch-norm updates its stats
    total_loss, seen = 0.0, 0
    for images, labels in loader:
        optimizer.zero_grad()      # reset old gradients
        preds = model(images)
        loss = loss_fn(preds, labels)
        loss.backward()            # compute gradients
        optimizer.step()           # nudge the weights
        # Track the running loss (assumes loss_fn averages over the batch).
        total_loss += loss.item() * labels.size(0)
        seen += labels.size(0)
    return total_loss / seen       # average train loss for this epoch

@torch.no_grad()                   # no gradients = faster, less memory
def validate(model, loader):
    model.eval()                   # dropout off, batch-norm uses running stats
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# A typical printout across 3 epochs:
# Epoch 1 train_loss=1.42 val_acc=0.61
# Epoch 2 train_loss=0.88 val_acc=0.74
# Epoch 3 train_loss=0.55 val_acc=0.81
model.train() and model.eval() switch behaviours like dropout and batch-norm. Always call model.eval() before validating or deploying, or your scores will be inconsistent.
6. Evaluate: Beyond Plain Accuracy
Accuracy alone lies on imbalanced data. If 95% of images are "not cancer", a model that always says "not cancer" scores 95% yet catches nothing. So you also look at precision (of the things I flagged, how many were right?), recall (of the things I should have caught, how many did I?), and F1 (their balance). A confusion matrix shows exactly which classes get mixed up.
| Metric | Use when… |
|---|---|
| Accuracy | Classes are balanced |
| Precision | False positives are costly (spam filter) |
| Recall | False negatives are costly (medical) |
| F1 | You want one balanced number |
| mAP | Object detection (IoU-based) |
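The "always says not cancer" example from this section is easy to check by hand. The sketch below computes accuracy, precision, recall, and F1 from raw counts; `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a ratio is undefined (nothing flagged, or nothing to catch) is one common convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 100 images: 5 are "cancer" (1), 95 are "not cancer" (0).
labels = [1] * 5 + [0] * 95
preds = [0] * 100              # a lazy model that always predicts "not cancer"

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print("accuracy :", accuracy)    # -> accuracy : 0.95
print("precision:", precision)   # -> precision: 0.0
print("recall   :", recall)      # -> recall   : 0.0
print("f1       :", f1)          # -> f1       : 0.0
```

Accuracy looks excellent while recall is zero, which is exactly the failure the table is warning about. In real projects sklearn.metrics provides these scores, so a hand-rolled version like this is mainly useful for building intuition.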
7. Deployment and Inference
Deployment means running the trained model on one new image at a time. The golden rule: apply the exact same preprocessing you used for validation — never the training augmentation — switch to model.eval(), and run a single forward pass. A softmax turns the raw scores into probabilities so you can report a confidence.
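To see what the softmax step does to raw scores, here is a plain-Python sketch. The logits are made-up numbers, not the output of any real model, and subtracting the maximum before exponentiating is a standard trick for numerical stability that does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)                           # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "car", "house", "tree"]
logits = [1.0, 3.0, 0.5, 0.2, 0.1]              # hypothetical raw model scores

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])

# Prints the winning class ("dog" here) with its probability as a percentage.
print(f"{classes[best]} ({probs[best]:.1%})")
print("sum of probabilities:", round(sum(probs), 6))
```

The largest logit always wins, so softmax never changes the predicted class; it only rescales the scores so the winner can be reported with a confidence.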
# Deployment = use the trained model on ONE new image. The golden rule:
# apply the EXACT same preprocessing you used for validation, never the
# training augmentation. Then run a single forward pass in eval mode.
import torch
import torch.nn.functional as F

classes = ["cat", "dog", "car", "house", "tree"]

@torch.no_grad()
def predict(model, image_tensor):
    model.eval()
    logits = model(image_tensor.unsqueeze(0))  # add a batch dimension
    probs = F.softmax(logits, dim=1)[0]        # turn scores into 0..1
    idx = int(probs.argmax())
    return classes[idx], float(probs[idx])

# label, confidence = predict(model, preprocessed_image)
# print(f"{label} ({confidence:.1%})")

# Example output (the exact label and number depend on your model and image):
# dog (92.3%)
🎯 Your Turn 1: Normalize the Pixels
Fill in the blank so every pixel is scaled to the 0..1 range. Use the expected output to check yourself.
Your Turn: Normalize to 0..1
Replace ___ so pixels divide by 255.0
# 🎯 YOUR TURN: fill in the blanks marked with ___
# Goal: scale every pixel from 0..255 down to 0..1.
image = [
    [0, 255],
    [51, 204],
]

def normalize(img):
    # 👉 divide each pixel by 255.0 (in Python 3, / always returns a float;
    #    writing 255.0 just makes that intent explicit)
    return [[pixel / ___ for pixel in row] for row in img]

for row in normalize(image):
    print([round(v, 2) for v in row])

# ✅ Expected output:
# [0.0, 1.0]
# [0.2, 0.8]
🎯 Your Turn 2: Split 75 / 25
Fill in the two slice indices so the first 25% becomes validation and the rest becomes training.
Your Turn: Train / Val Split
Replace the ___ slice points to split 75/25
# 🎯 YOUR TURN: fill in the blanks marked with ___
# Goal: put the FIRST 25% of the data in validation, the rest in training.
samples = ["a", "b", "c", "d", "e", "f", "g", "h"]

def train_val_split(data, val_fraction):
    n_val = int(len(data) * val_fraction)  # 8 * 0.25 = 2
    val = data[:___]                       # 👉 first n_val items -> val
    train = data[___:]                     # 👉 everything after -> train
    return train, val

train, val = train_val_split(samples, 0.25)
p
...
Common Errors (And How to Fix Them)
❌ Train/test preprocessing mismatch
You normalize with one mean/std (or resize differently) at training but another at inference. The model sees inputs it was never trained on, so accuracy quietly collapses in production.
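✅ Fix: define the deterministic preprocessing (resize, crop, normalize) once and reuse that identical transform for training, validation, and deployment. Training may add random augmentation in front of it, but the final resize and normalization must match.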
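A few lines of arithmetic show how different the model's input becomes when the statistics drift. This sketch standardizes one mid-grey RGB pixel twice: once with the ImageNet mean/std used earlier in this lesson, and once with a hypothetical mismatched setting of 0.5 for every channel. `standardize` is an illustrative helper, not a library call.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def standardize(rgb_pixel, mean, std):
    """Scale 0..255 to 0..1, then subtract the mean and divide by the std,
    channel by channel."""
    return [((value / 255.0) - m) / s
            for value, m, s in zip(rgb_pixel, mean, std)]

pixel = (128, 128, 128)   # one mid-grey RGB pixel

# What the model saw during training.
train_input = standardize(pixel, IMAGENET_MEAN, IMAGENET_STD)

# HYPOTHETICAL mistake at inference time: different statistics.
mismatched_input = standardize(pixel, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print("training stats  :", [round(v, 3) for v in train_input])
print("mismatched stats:", [round(v, 3) for v in mismatched_input])
```

The same pixel arrives at the network as two quite different sets of numbers, and nothing raises an error, which is why this bug tends to show up only as a drop in accuracy.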
❌ Augmenting the validation/test set
Flips and crops on your val set make every run report a different, unrealistic score. You can no longer trust the number.
✅ Fix: keep a separate val transform with only deterministic steps (Resize, CenterCrop, Normalize) and no random ops.
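One way to keep the two pipelines apart is to build them from a single factory with an explicit `train` flag. The sketch below does this in plain Python with the flip and normalize functions from earlier in the lesson; `make_transform` and its `rng` parameter are hypothetical names, and a real project would build the equivalent with torchvision or Albumentations.

```python
import random

def horizontal_flip(img):
    return [list(reversed(row)) for row in img]

def normalize(img):
    return [[pixel / 255.0 for pixel in row] for row in img]

def make_transform(train, rng=None):
    """Return a transform function. Only the training version applies
    a random flip; both versions end with the same normalization."""
    if rng is None:
        rng = random.Random()

    def transform(img):
        if train and rng.random() < 0.5:   # random op: training only
            img = horizontal_flip(img)
        return normalize(img)              # shared deterministic step

    return transform

image = [
    [0, 255],
    [51, 204],
]

val_tf = make_transform(train=False)
train_tf = make_transform(train=True, rng=random.Random(0))  # seeded for the demo

# Validation output is identical on every call.
print(val_tf(image) == val_tf(image))      # -> True

# Training output varies between the original and the mirrored image.
train_outputs = {tuple(map(tuple, train_tf(image))) for _ in range(20)}
print(len(train_outputs))                  # -> 2
```

Because both transforms share the same final normalization step, there is no room for the two pipelines to drift apart on anything except the random augmentation.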
❌ Data leakage
Near-duplicate frames of the same scene land in both train and val, or you compute normalization statistics over the whole dataset before splitting. Offline scores look amazing; real-world performance is poor.
✅ Fix: split first, then compute stats only on the train set; group related images so they never straddle the split.
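Both parts of that fix can be sketched in a few lines: split by group so related images stay together, then compute statistics from the training side only. Everything in the example is hypothetical (the clip names, the brightness values, and the helper names `grouped_split` and `mean_std`); in a real project a tool such as scikit-learn's GroupShuffleSplit covers the splitting part.

```python
import random

def grouped_split(samples, val_fraction=0.25, seed=0):
    """Split (image_id, group_id) pairs so that every group lands
    entirely in train or entirely in val."""
    groups = sorted({group for _, group in samples})
    random.Random(seed).shuffle(groups)            # shuffle groups, not images
    n_val = max(1, int(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [s for s in samples if s[1] not in val_groups]
    val = [s for s in samples if s[1] in val_groups]
    return train, val

def mean_std(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5

# HYPOTHETICAL data: two near-duplicate frames from each of four video clips.
samples = [(f"{clip}_frame{i}", clip)
           for clip in ("clipA", "clipB", "clipC", "clipD")
           for i in range(2)]

# HYPOTHETICAL per-image mean brightness on a 0..1 scale.
brightness = {image_id: 0.1 * (n + 1)
              for n, (image_id, _) in enumerate(samples)}

train, val = grouped_split(samples)

# Statistics come from the TRAIN side only, after the split.
train_mean, train_std = mean_std([brightness[image_id] for image_id, _ in train])

print("train groups:", sorted({g for _, g in train}))
print("val groups  :", sorted({g for _, g in val}))
print("train-only mean/std:", round(train_mean, 3), round(train_std, 3))
```

The validation images are then normalized with `train_mean` and `train_std` as well, so no information about the held-back set reaches the training pipeline.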
❌ Not normalizing at all
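Feeding raw 0..255 pixels can make the loss spike to NaN or stall, because activations and gradients become very large. With a pretrained backbone it also breaks the input statistics the weights were trained on.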
✅ Fix: always scale pixels to a small range (0..1, then ImageNet mean/std) before the model.
📋 Quick Reference
| Pipeline Stage | Tools | Key Decisions |
|---|---|---|
| Collect & label | Label Studio, CVAT | Class balance, label quality |
| Split | sklearn, slicing | Train/val ratio, no leakage |
| Augment (train only) | albumentations, torchvision | Flip, crop, colour jitter |
| Preprocess | torchvision.transforms | Resize, normalize (same everywhere) |
| Backbone | timm, torchvision.models | ResNet, ViT, EfficientNet |
| Train/validate | torch, optimizer, loss_fn | Epochs, watch overfitting |
| Evaluate | sklearn.metrics | F1, mAP, confusion matrix |
| Deploy | ONNX, TensorRT | Val transform, eval mode, latency |
❓ Frequently Asked Questions
Q: What is a computer vision pipeline?
A: It is the full assembly line that turns raw photos into predictions: collect and label data, augment and preprocess the images, feed them through a model (usually a pretrained backbone plus a small classifier head), then evaluate and deploy. Each stage feeds the next, so a problem early on quietly corrupts everything downstream.
Q: Why must I normalize pixel values?
A: Cameras give pixels in the 0..255 range, but models train far more stably when inputs sit in a small fixed range like 0..1 (often followed by subtracting a mean and dividing by a standard deviation). If you skip normalization, gradients can explode or vanish and training stalls. Crucially, use the SAME normalization at train, validation and inference time.
Q: Should I augment my validation and test sets?
A: No. Augmentation (flips, crops, colour jitter) belongs to training only — it teaches the model variety. Your validation and test sets must stay fixed and realistic so the score you read is honest. Augmenting them gives you a number that does not reflect real-world performance.
Q: What is transfer learning and why use a pretrained backbone?
A: A backbone like ResNet-50 has already learned generic visual features (edges, textures, shapes) from over a million ImageNet images. Transfer learning reuses those weights and only retrains a small new head for your classes, so you need far less data and compute than training from scratch, and on small datasets you usually get better accuracy too.
Q: What is data leakage in a CV pipeline?
A: Leakage is when information from your validation or test set sneaks into training. Common causes: computing normalization statistics over the whole dataset before splitting, putting near-duplicate frames of the same scene in both train and val, or tuning on the test set. The symptom is great offline scores that collapse in production.
🎯 Mini Challenge: Flip an Image
Now with the support faded — only a comment outline is given. Write the augmentation yourself: mirror each row of a nested-list image left to right.
Mini Challenge: Horizontal Flip
Write flip(img) from the outline and match the expected output
# 🎯 MINI-CHALLENGE: horizontal-flip an image
# 1. Define a nested-list image, e.g. [[10, 20, 30], [40, 50, 60]]
# 2. Write flip(img) that mirrors each ROW left<->right
#    (hint: list(reversed(row)) reverses one row)
# 3. Print each flipped row
#
# ✅ Expected output:
# [30, 20, 10]
# [60, 50, 40]
# your code here
🎉 Lesson Complete!
You can now walk an image down the whole assembly line: collect and split data, augment the training set, preprocess by resizing and normalizing, transfer-learn a pretrained backbone, run the train/validate loop, evaluate beyond plain accuracy, and deploy for inference — all while keeping preprocessing consistent and avoiding leakage.
🚀 Up next: Object Detection — go from "what is in this image?" to "what is where?", drawing labelled boxes around every object.