Lesson 31 • Advanced

Computer Vision Pipelines

Follow an image all the way from raw photo to live prediction — collecting and labelling data, augmenting and preprocessing it, transfer-learning a pretrained backbone, running the train/validate loop, evaluating honestly, and deploying for inference.

What You'll Learn in This Lesson

✓How to collect and label data, then split it into train and validation sets
✓Why augmentation (flips, crops, colour) only ever touches the training set
✓How to preprocess images by resizing and normalizing to a fixed range
✓How transfer learning reuses a pretrained backbone for your own classes
✓How the train/validate loop works and how to read its numbers
✓How to evaluate a model and deploy it for inference on new images

Before you start: Make sure you've completed Computer Vision Basics so you're comfortable that an image is just a grid of pixel numbers.

🏭 Real-World Analogy: An Assembly Line

A CV pipeline is an assembly line that turns raw photos into predictions. Picture a factory floor with stations in a row:

Loading dock — photos arrive and get labelled (data collection & labeling).
Copy station — make varied copies by flipping, cropping and re-colouring (augmentation, training only).
Standardisation — every photo is resized and its pixels rescaled to the same range (preprocessing).
Expert inspector — a pretrained backbone that already knows edges and textures examines each photo (transfer learning).
Sorter — a small classifier head drops each photo into a labelled bin (prediction).
Quality control — a held-back sample is scored honestly (evaluation), then the line ships (deployment).

If one station is mis-calibrated — say standardisation differs between the factory and the field — every later station produces junk. The whole point of a pipeline is that each stage is reliable and consistent from training through to deployment.

1. Collect, Label, and Split Your Data

Everything starts with labelled data — images paired with the correct answer (the label). Before any training, you split that data into a train set the model learns from and a validation set you hold back to check progress honestly. A common split is 80/20.

The split below is the simplest possible version — slice a list. Run it and watch which samples land where.

Worked Example: Train / Validation Split

Slice a labelled dataset into train and validation piles


Python

# Before training you split your labelled data into two piles:
#   train -> the model learns from these
#   val   -> held back, used ONLY to check progress honestly
# A common split is 80% train, 20% validation.

samples = ["img0", "img1", "img2", "img3", "img4",
           "img5", "img6", "img7", "img8", "img9"]

def train_val_split(data, val_fraction=0.2):
    n_val = int(len(data) * val_fraction)   # 10 * 0.2 = 2 go to val
    val = data[:n_val]                       # first slice -> validati
...
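Here is a complete, runnable version of the split. It goes one step beyond the slice-only example, and that step is our assumption rather than part of the lesson's code: it shuffles with a fixed seed before slicing. If the files happen to be sorted by class, taking the first 20% without shuffling would put a single class into validation. `shuffled_train_val_split` is a hypothetical helper name.

```python
import random

def shuffled_train_val_split(data, val_fraction=0.2, seed=42):
    # Hypothetical helper: copy first so the caller's list is not reordered in place.
    items = list(data)
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    val = items[:n_val]      # first n_val shuffled items -> validation
    train = items[n_val:]    # everything after -> training
    return train, val

samples = [f"img{i}" for i in range(10)]
train, val = shuffled_train_val_split(samples)
print(len(train), "train /", len(val), "val")
print("overlap:", set(train) & set(val))

# Expected output:
# 8 train / 2 val
# overlap: set()
```

Because of the seed, rerunning the script gives the same split. Your validation score stays comparable from one experiment to the next.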

2. Augmentation: Free Variety for Training

Augmentation creates safe, label-preserving copies of training images so the model sees more variety: horizontal flips, random crops, and colour jitter (brightness/contrast). A flipped cat is still a cat, so the label stays the same while the pixels change.

Training only. Never augment your validation or test set — those must stay fixed so the score you read reflects reality.

Here's the simplest augmentation — a horizontal flip — written in plain Python so you can see exactly what changes.

Worked Example: Horizontal Flip

Mirror a nested-list image left to right


Python

# Augmentation = make safe copies of training images so the model sees
# more variety. A horizontal flip mirrors the image left<->right, which
# teaches the model that a cat facing left is still a cat.

image = [
    [1, 2, 3],
    [4, 5, 6],
]

def horizontal_flip(img):
    # Reverse the pixels WITHIN each row (columns flip), rows stay in order.
    return [list(reversed(row)) for row in img]

flipped = horizontal_flip(image)

for row in flipped:
    print(row)

# Expected output:
# [3, 2, 1]
#
...
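The other two augmentations named above, random crops and colour jitter, can be written by hand in the same way. This sketch assumes a grayscale nested-list image with values 0..255. `random_crop` and `jitter_brightness` are hypothetical helper names. The brightness shift is clamped so pixels stay in range. Libraries such as Albumentations also resize the crop back to the model's input size, and that step is left out here.

```python
import random

image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]

def random_crop(img, crop_h, crop_w, rng):
    # Pick a random top-left corner so the crop fits inside the image.
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]

def jitter_brightness(img, max_delta, rng):
    # Add one random offset to every pixel, clamped to the valid 0..255 range.
    delta = rng.randint(-max_delta, max_delta)
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

rng = random.Random(0)   # seeded so the run is repeatable
crop = random_crop(image, 2, 2, rng)
bright = jitter_brightness(image, 30, rng)

# The label does not change: a cropped or brighter cat is still a cat.
print("crop:", crop)
print("brightness-jittered first row:", bright[0])
```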

3. Preprocessing: Resize and Normalize

Preprocessing makes every image look the same to the model. Two steps dominate: resize (so all images share one width and height the model expects) and normalize (rescale pixels from 0..255 down to a small range like 0..1). Normalizing keeps training stable.

Run the example below to normalize a tiny "image" by hand — every pixel divided by 255.0.

Worked Example: Normalize to 0..1

Scale a nested-list image from 0..255 down to 0..1


Python

# A camera gives pixel values from 0 (black) to 255 (white).
# Models train best when inputs sit in a small, fixed range like 0..1.
# "Normalizing" = divide every pixel by 255.0 so 0->0.0 and 255->1.0.

# A tiny 2x3 grayscale "image" stored as a nested list (rows of pixels).
image = [
    [0, 128, 255],
    [64, 192, 32],
]

def normalize(img):
    # Walk each row, then each pixel, dividing by 255.0.
    return [[pixel / 255.0 for pixel in row] for row in img]

normalized = normalize(image)

for
...
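Resizing is the other preprocessing step, and it can also be sketched by hand. This version uses nearest-neighbour sampling, the simplest method, chosen here only for illustration (torchvision's `Resize` defaults to bilinear interpolation). Each output pixel copies the input pixel at the same relative position. `resize_nearest` is a hypothetical helper name.

```python
def resize_nearest(img, new_h, new_w):
    old_h, old_w = len(img), len(img[0])
    # For each output (row, column), pick the source pixel at the same relative position.
    return [
        [img[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

image = [
    [0, 128, 255],
    [64, 192, 32],
]

for row in resize_nearest(image, 4, 6):
    print(row)

# Expected output:
# [0, 0, 128, 128, 255, 255]
# [0, 0, 128, 128, 255, 255]
# [64, 64, 192, 192, 32, 32]
# [64, 64, 192, 192, 32, 32]
```

Whatever method you choose, every image ends up the same size. The model's first layer requires that.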

In real projects you don't do this by hand — torchvision.transforms composes resize, crop, tensor-conversion and normalization into one reusable pipeline. The same recipe must run at training, validation, and inference time.

# In production you don't normalize by hand: torchvision composes the
# whole preprocessing pipeline for you. This is the standard recipe for a
# model pretrained on ImageNet.
from PIL import Image
from torchvision import transforms

# Resize -> crop to the model's input size -> to tensor -> normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),                      # shorter side -> 256, aspect ratio kept
    transforms.CenterCrop(224),                  # 224x224 centre patch
    transforms.ToTensor(),                       # pixels become 0..1 floats, shape (C, H, W)
    transforms.Normalize(                        # then ImageNet mean/std
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Stand-in for a loaded 640x480 RGB photo: a solid grey PIL image.
raw_photo = Image.new("RGB", (640, 480), color=(128, 128, 128))
tensor = preprocess(raw_photo)
print("input size  :", raw_photo.size)          # PIL reports (width, height)
print("output shape:", tuple(tensor.shape))

# Expected output:
# input size  : (640, 480)
# output shape: (3, 224, 224)
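After `ToTensor`, `transforms.Normalize` applies a per-channel standardisation: each value x in channel c becomes (x - mean[c]) / std[c]. The plain-Python sketch below applies the ImageNet statistics from the recipe above to single RGB pixels. It assumes each pixel has already been scaled to 0..1. `standardize_pixel` is a hypothetical helper name.

```python
MEAN = [0.485, 0.456, 0.406]   # ImageNet per-channel means (R, G, B)
STD = [0.229, 0.224, 0.225]    # ImageNet per-channel standard deviations

def standardize_pixel(rgb01):
    # rgb01 holds one pixel's R, G, B values, already divided by 255.
    return [(x - m) / s for x, m, s in zip(rgb01, MEAN, STD)]

mid_grey = [128 / 255.0] * 3
black = [0.0, 0.0, 0.0]
print([round(v, 3) for v in standardize_pixel(mid_grey)])
print([round(v, 3) for v in standardize_pixel(black)])

# Expected output:
# [0.074, 0.205, 0.426]
# [-2.118, -2.036, -1.804]
```

After standardisation, values are centred near zero instead of sitting in 0..1. This is the input range the ImageNet-pretrained weights saw during their own training.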

4. Transfer Learning with a Pretrained Backbone

Training a vision model from scratch typically needs a very large labelled dataset; ImageNet-1k alone has about 1.28 million training images. Transfer learning avoids that. You take a backbone (like ResNet-50) that has already learned generic features (edges, textures, shapes) from ImageNet, and you replace only its final layer (the head) so it outputs your classes.

Below, Albumentations builds the augmentation pipelines (note: train augments, val does not) and a pretrained ResNet-50 has its head swapped for 5 classes.

# Albumentations is the go-to library for fast image augmentation, and
# torchvision.models gives you a pretrained backbone for transfer learning.
import albumentations as A
import torchvision.models as models
import torch.nn as nn

# Build a TRAINING augmentation pipeline (flips, crops, colour jitter).
train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),                 # mirror half the time
    # Albumentations 1.x signature; newer releases expect size=(224, 224) instead.
    A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),
    A.ColorJitter(brightness=0.2, contrast=0.2, p=0.5),
    A.Normalize(),                           # 0..1 then ImageNet stats
])

# The VALIDATION pipeline must NOT augment — only resize + normalize.
val_aug = A.Compose([
    A.SmallestMaxSize(max_size=256),         # shorter side -> 256, aspect ratio kept
    A.CenterCrop(224, 224),
    A.Normalize(),
])

# Transfer learning: take a pretrained ResNet-50 and swap its final layer
# for one that outputs YOUR number of classes (here: 5).
model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 5)   # 2048 -> 5 classes
print("ready:", type(model).__name__, "with", model.fc.out_features, "classes")

# Expected output:
# ready: ResNet with 5 classes

5. The Train / Validate Loop

Training runs in epochs — one full pass over the training data. Each epoch you let the model learn from the train set (compute loss, back-propagate, update weights), then validate on the held-back set without learning from it. Watching train loss fall while val accuracy rises tells you it's working; if val accuracy stalls or drops while train keeps improving, the model is overfitting.

# The training loop: for each epoch, learn from train data, then check
# val data WITHOUT learning from it. Watch train vs val to spot overfitting.
import torch

def train_one_epoch(model, loader, optimizer, loss_fn):
    model.train()                            # dropout on, batchnorm uses batch stats
    total_loss, seen = 0.0, 0
    for images, labels in loader:
        optimizer.zero_grad()                # reset old gradients
        preds = model(images)
        loss = loss_fn(preds, labels)        # assumes a batch-mean loss (nn.CrossEntropyLoss default)
        loss.backward()                      # compute gradients
        optimizer.step()                     # nudge the weights
        total_loss += loss.item() * labels.size(0)
        seen += labels.size(0)
    return total_loss / seen                 # mean train loss for this epoch

@torch.no_grad()                             # no gradients = faster, safer
def validate(model, loader):
    model.eval()                             # dropout off, batchnorm uses running stats
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# fit() is a small driver for illustration: train, validate, print, repeat.
def fit(model, train_loader, val_loader, optimizer, loss_fn, epochs=3):
    for epoch in range(1, epochs + 1):
        train_loss = train_one_epoch(model, train_loader, optimizer, loss_fn)
        val_acc = validate(model, val_loader)
        print(f"Epoch {epoch}  train_loss={train_loss:.2f}  val_acc={val_acc:.2f}")

# A typical printout across 3 epochs (exact numbers depend on data and seed):
# Epoch 1  train_loss=1.42  val_acc=0.61
# Epoch 2  train_loss=0.88  val_acc=0.74
# Epoch 3  train_loss=0.55  val_acc=0.81

Key insight: model.train() and model.eval() switch behaviours like dropout and batch-norm. Always call model.eval() before validating or deploying, or your scores will be inconsistent.
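The overfitting rule of thumb from this section says: training loss keeps falling while validation accuracy stops rising. You can write it as a small check over the per-epoch history. `find_overfitting_epoch` is a hypothetical helper, and the five-epoch history is invented for illustration.

```python
def find_overfitting_epoch(train_losses, val_accs):
    # Return the first epoch (1-based) where train loss fell but val accuracy did not rise.
    for i in range(1, len(val_accs)):
        train_improved = train_losses[i] < train_losses[i - 1]
        val_stalled = val_accs[i] <= val_accs[i - 1]
        if train_improved and val_stalled:
            return i + 1
    return None   # no sign of overfitting in this history

# Invented history: the first three epochs match the printout above, then val accuracy slips.
train_losses = [1.42, 0.88, 0.55, 0.31, 0.18]
val_accs = [0.61, 0.74, 0.81, 0.80, 0.78]
print("overfitting starts at epoch:", find_overfitting_epoch(train_losses, val_accs))

# Expected output:
# overfitting starts at epoch: 4
```

A single noisy epoch can trip this check. In practice people usually wait a few epochs before acting and keep the checkpoint with the best validation accuracy (epoch 3 here).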

6. Evaluate: Beyond Plain Accuracy

Accuracy alone lies on imbalanced data. If 95% of images are "not cancer", a model that always says "not cancer" scores 95% yet catches nothing. So you also look at precision (of the things I flagged, how many were right?), recall (of the things I should have caught, how many did I?), and F1 (their balance). A confusion matrix shows exactly which classes get mixed up.

Metric	Use when…
`Accuracy`	Classes are balanced
`Precision`	False positives are costly (spam filter)
`Recall`	False negatives are costly (medical)
`F1`	You want one balanced number
`mAP`	Object detection (IoU-based)
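The imbalanced-data trap is easy to reproduce. The sketch below builds the paragraph's scenario: 100 images, 5 of them "cancer". It scores two invented models: one that always answers "healthy", and one that flags 6 images and catches 4 of the 5 real cases. The helper names are hypothetical. scikit-learn's `precision_score`, `recall_score`, `f1_score` and `confusion_matrix` compute the same quantities.

```python
def binary_counts(y_true, y_pred, positive):
    # The four cells of a 2x2 confusion matrix for one "positive" class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def report(y_true, y_pred, positive):
    tp, fp, fn, tn = binary_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of the flagged, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of the real positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["cancer"] * 5 + ["healthy"] * 95
always_healthy = ["healthy"] * 100
flags_some = ["cancer"] * 4 + ["healthy"] + ["cancer"] * 2 + ["healthy"] * 93

for name, y_pred in [("always healthy", always_healthy), ("flags some", flags_some)]:
    acc, prec, rec, f1 = report(y_true, y_pred, "cancer")
    print(f"{name}: acc={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")

# Expected output:
# always healthy: acc=0.95 precision=0.00 recall=0.00 f1=0.00
# flags some: acc=0.97 precision=0.67 recall=0.80 f1=0.73
```

The two accuracies differ by only two points. Recall and F1 show that the first model is useless for the task.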

7. Deployment and Inference

Deployment means running the trained model on one new image at a time. The golden rule: apply the exact same preprocessing you used for validation — never the training augmentation — switch to model.eval(), and run a single forward pass. A softmax turns the raw scores into probabilities so you can report a confidence.

# Deployment = use the trained model on ONE new image. The golden rule:
# apply the EXACT same preprocessing you used for validation — never the
# training augmentation. Then run a single forward pass in eval mode.
import torch
import torch.nn.functional as F

classes = ["cat", "dog", "car", "house", "tree"]

@torch.no_grad()
def predict(model, image_tensor):
    model.eval()
    logits = model(image_tensor.unsqueeze(0))   # add a batch dimension
    probs = F.softmax(logits, dim=1)[0]         # turn scores into 0..1
    idx = int(probs.argmax())
    return classes[idx], float(probs[idx])

# label, confidence = predict(model, preprocessed_image)
# print(f"{label} ({confidence:.1%})")

# Example output (illustrative; label and confidence depend on your model and image):
# dog (92.3%)
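In `predict`, `F.softmax` is the step that turns raw scores into a confidence. In plain Python it exponentiates each logit and divides by the total. Subtracting the largest logit first is the usual guard against overflow and does not change the result. The logits below are invented numbers for the same five classes.

```python
import math

classes = ["cat", "dog", "car", "house", "tree"]

def softmax(logits):
    biggest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.1, 2.0, 1.0, -1.0, 0.0]             # invented raw scores from a model
probs = softmax(logits)
idx = probs.index(max(probs))
print(f"{classes[idx]} ({probs[idx]:.1%})")
print("sum of probabilities:", round(sum(probs), 6))

# Expected output:
# dog (58.7%)
# sum of probabilities: 1.0
```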

🎯 Your Turn 1: Normalize the Pixels

Fill in the blank so every pixel is scaled to the 0..1 range. Use the expected output to check yourself.

Your Turn: Normalize to 0..1

Replace ___ so pixels divide by 255.0


Python

# 🎯 YOUR TURN — fill in the blanks marked with ___

# Goal: scale every pixel from 0..255 down to 0..1.
image = [
    [0, 255],
    [51, 204],
]

def normalize(img):
    # 👉 divide each pixel by 255.0 (Python 3's / already returns a float; 255.0 makes that explicit)
    return [[pixel / ___ for pixel in row] for row in img]

for row in normalize(image):
    print([round(v, 2) for v in row])

# ✅ Expected output:
# [0.0, 1.0]
# [0.2, 0.8]

🎯 Your Turn 2: Split 75 / 25

Fill in the two slice indices so the first 25% becomes validation and the rest becomes training.

Your Turn: Train / Val Split

Replace the ___ slice points to split 75/25


Python

# 🎯 YOUR TURN — fill in the blanks marked with ___

# Goal: put the FIRST 25% of the data in validation, the rest in training.
samples = ["a", "b", "c", "d", "e", "f", "g", "h"]

def train_val_split(data, val_fraction):
    n_val = int(len(data) * val_fraction)   # 8 * 0.25 = 2
    val = data[:___]                         # 👉 first n_val items -> val
    train = data[___:]                       # 👉 everything after  -> train
    return train, val

train, val = train_val_split(samples, 0.25)
p
...

Common Errors (And How to Fix Them)

❌ Train/test preprocessing mismatch

You normalize with one mean/std (or resize differently) at training but another at inference. The model sees inputs it was never trained on, so accuracy quietly collapses in production.

✅ Fix: define preprocessing once and reuse the identical transform everywhere — train, validate, and deploy.
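One way to make "define once, reuse everywhere" concrete: build the preprocessing as a single composed callable, and have both the validation code and the inference code call that same object. This plain-Python sketch uses hypothetical names (`compose`, `crop_width`, `scale_to_unit`, `validation_input`, `inference_input`). In PyTorch, the `transforms.Compose` pipeline shown earlier plays this role.

```python
def compose(*steps):
    # Chain preprocessing steps into one callable, applied left to right.
    def run(img):
        for step in steps:
            img = step(img)
        return img
    return run

def crop_width(img, width=2):
    # Stand-in for a fixed crop: keep the first `width` columns of each row.
    return [row[:width] for row in img]

def scale_to_unit(img):
    # Same normalization as earlier in the lesson: 0..255 -> 0..1.
    return [[p / 255.0 for p in row] for row in img]

PREPROCESS = compose(crop_width, scale_to_unit)   # defined exactly once

def validation_input(img):
    return PREPROCESS(img)

def inference_input(img):
    return PREPROCESS(img)   # deployment reuses the very same object

photo = [[0, 255, 51], [204, 102, 153]]
assert validation_input(photo) == inference_input(photo)
print(validation_input(photo))

# Expected output:
# [[0.0, 1.0], [0.8, 0.4]]
```

Both code paths share one definition, so a change to the crop or the normalization reaches validation and deployment together. The two can never drift apart.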

❌ Augmenting the validation/test set

Random flips and crops on your val set make every run report a different, unrealistic score, so you can no longer trust the number.

✅ Fix: keep a separate val transform with only Resize + Normalize — no random ops.

❌ Data leakage

Near-duplicate frames of the same scene land in both train and val, or you compute normalization statistics over the whole dataset before splitting. Offline scores look amazing; real-world performance is poor.

✅ Fix: split first, then compute stats only on the train set; group related images so they never straddle the split.
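The "group related images" fix can be sketched by splitting on a group key (here a hypothetical scene ID) instead of on individual images, so every frame from one scene lands on the same side. The records and the seeded shuffle of scene IDs are assumptions for illustration, and `group_split` is a hypothetical helper. scikit-learn's `GroupShuffleSplit` implements the same idea.

```python
import random

# (filename, scene_id) pairs: several near-duplicate frames per scene.
records = [
    ("s1_f1.jpg", "scene1"), ("s1_f2.jpg", "scene1"),
    ("s2_f1.jpg", "scene2"), ("s2_f2.jpg", "scene2"),
    ("s3_f1.jpg", "scene3"),
    ("s4_f1.jpg", "scene4"), ("s4_f2.jpg", "scene4"),
    ("s5_f1.jpg", "scene5"),
]

def group_split(records, val_fraction=0.2, seed=0):
    # Split the *scenes*, not the frames, so no scene straddles train and val.
    scenes = sorted({scene for _, scene in records})
    random.Random(seed).shuffle(scenes)
    n_val = max(1, int(len(scenes) * val_fraction))
    val_scenes = set(scenes[:n_val])
    train = [r for r in records if r[1] not in val_scenes]
    val = [r for r in records if r[1] in val_scenes]
    return train, val

train, val = group_split(records)
train_scenes = {scene for _, scene in train}
val_scenes = {scene for _, scene in val}
print("val scenes:", sorted(val_scenes))
print("shared scenes:", train_scenes & val_scenes)
```

Once the split exists, compute any normalization statistics from `train` only, as the fix above says.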

❌ Not normalizing at all

Feeding raw 0..255 pixels often destabilises training: large inputs produce large activations and gradients, so the loss can spike to NaN or stall.

✅ Fix: always scale pixels to a small range (0..1, then the mean/std the backbone was trained with, e.g. ImageNet's for an ImageNet-pretrained model) before the model.

📋 Quick Reference

Pipeline Stage	Tools	Key Decisions
Collect & label	Label Studio, CVAT	Class balance, label quality
Split	sklearn, slicing	Train/val ratio, no leakage
Augment (train only)	albumentations, torchvision	Flip, crop, colour jitter
Preprocess	torchvision.transforms	Resize, normalize (same everywhere)
Backbone	timm, torchvision.models	ResNet, ViT, EfficientNet
Train/validate	torch, optimizer, loss_fn	Epochs, watch overfitting
Evaluate	sklearn.metrics	F1, mAP, confusion matrix
Deploy	ONNX, TensorRT	Val transform, eval mode, latency

❓ Frequently Asked Questions

Q: What is a computer vision pipeline?

A: It is the full assembly line that turns raw photos into predictions: collect and label data, augment and preprocess the images, feed them through a model (usually a pretrained backbone plus a small classifier head), then evaluate and deploy. Each stage feeds the next, so a problem early on quietly corrupts everything downstream.

Q: Why must I normalize pixel values?

A: Cameras give pixels in the 0..255 range, but models train far more stably when inputs sit in a small fixed range like 0..1 (often followed by subtracting a mean and dividing by a standard deviation). If you skip normalization, gradients can explode or vanish and training stalls. Crucially, use the SAME normalization at train, validation and inference time.

Q: Should I augment my validation and test sets?

A: No. Augmentation (flips, crops, colour jitter) belongs to training only — it teaches the model variety. Your validation and test sets must stay fixed and realistic so the score you read is honest. Augmenting them gives you a number that does not reflect real-world performance.

Q: What is transfer learning and why use a pretrained backbone?

A: A backbone like ResNet-50 has already learned generic visual features (edges, textures, shapes) from millions of ImageNet images. Transfer learning reuses those weights and only retrains a small new head for your classes, so you need far less data and compute than training from scratch — and you usually get better accuracy too.

Q: What is data leakage in a CV pipeline?

A: Leakage is when information from your validation or test set sneaks into training. Common causes: computing normalization statistics over the whole dataset before splitting, putting near-duplicate frames of the same scene in both train and val, or tuning on the test set. The symptom is great offline scores that collapse in production.

🎯 Mini Challenge: Flip an Image

Now with the support faded — only a comment outline is given. Write the augmentation yourself: mirror each row of a nested-list image left to right.

Mini Challenge: Horizontal Flip

Write flip(img) from the outline and match the expected output


Python

# 🎯 MINI-CHALLENGE: horizontal-flip an image
# 1. Define a nested-list image, e.g. [[10, 20, 30], [40, 50, 60]]
# 2. Write flip(img) that mirrors each ROW left<->right
#    (hint: list(reversed(row)) reverses one row)
# 3. Print each flipped row
#
# ✅ Expected output:
# [30, 20, 10]
# [60, 50, 40]

# your code here

🎉 Lesson Complete!

You can now walk an image down the whole assembly line: collect and split data, augment the training set, preprocess by resizing and normalizing, transfer-learn a pretrained backbone, run the train/validate loop, evaluate beyond plain accuracy, and deploy for inference — all while keeping preprocessing consistent and avoiding leakage.

🚀 Up next: Object Detection — go from "what is in this image?" to "what is where?", drawing labelled boxes around every object.
