
    Lesson 31 • Advanced

    Computer Vision Pipelines

    Follow an image all the way from raw photo to live prediction — collecting and labelling data, augmenting and preprocessing it, transfer-learning a pretrained backbone, running the train/validate loop, evaluating honestly, and deploying for inference.

    What You'll Learn in This Lesson

    • How to collect and label data, then split it into train and validation sets
    • Why augmentation (flips, crops, colour) only ever touches the training set
    • How to preprocess images by resizing and normalizing to a fixed range
    • How transfer learning reuses a pretrained backbone for your own classes
    • How the train/validate loop works and how to read its numbers
    • How to evaluate a model and deploy it for inference on new images

    🏭 Real-World Analogy: An Assembly Line

    A CV pipeline is an assembly line that turns raw photos into predictions. Picture a factory floor with stations in a row:

    • Loading dock — photos arrive and get labelled (data collection & labeling).
    • Copy station — make varied copies by flipping, cropping and re-colouring (augmentation, training only).
    • Standardisation — every photo is resized and its pixels rescaled to the same range (preprocessing).
    • Expert inspector — a pretrained backbone that already knows edges and textures examines each photo (transfer learning).
    • Sorter — a small classifier head drops each photo into a labelled bin (prediction).
    • Quality control — a held-back sample is scored honestly (evaluation), then the line ships (deployment).

    If one station is mis-calibrated — say standardisation differs between the factory and the field — every later station produces junk. The whole point of a pipeline is that each stage is reliable and consistent from training through to deployment.

    1. Collect, Label, and Split Your Data

    Everything starts with labelled data — images paired with the correct answer (the label). Before any training, you split that data into a train set the model learns from and a validation set you hold back to check progress honestly. A common split is 80/20.

    The split below is the simplest possible version — slice a list. Run it and watch which samples land where.

    Worked Example: Train / Validation Split

    Slice a labelled dataset into train and validation piles

    # Before training you split your labelled data into two piles:
    #   train -> the model learns from these
    #   val   -> held back, used ONLY to check progress honestly
    # A common split is 80% train, 20% validation.
    
    samples = ["img0", "img1", "img2", "img3", "img4",
               "img5", "img6", "img7", "img8", "img9"]
    
    def train_val_split(data, val_fraction=0.2):
        n_val = int(len(data) * val_fraction)   # 10 * 0.2 = 2 go to val
        val = data[:n_val]                       # first slice -> validati
    ...
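
    Slicing a list in its original order only gives a fair split if the data is already in random order. The sketch below shuffles a copy before slicing. The function name shuffled_split and the seed parameter are illustrative assumptions, not part of the lesson's own code.

```python
import random

def shuffled_split(data, val_fraction=0.2, seed=0):
    """Return (train, val) after shuffling a copy of the data."""
    items = list(data)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(items)      # fixed seed -> same split on every run
    n_val = int(len(items) * val_fraction)  # e.g. 10 * 0.2 = 2
    return items[n_val:], items[:n_val]     # train is everything after the val slice

samples = [f"img{i}" for i in range(10)]
train, val = shuffled_split(samples)
print(len(train), "train /", len(val), "val")

# Expected output:
# 8 train / 2 val
```

    Fixing the seed keeps the split reproducible, so scores from different training runs stay comparable.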

    2. Augmentation: Free Variety for Training

    Augmentation creates safe, label-preserving copies of training images so the model sees more variety: horizontal flips, random crops, and colour jitter (brightness/contrast). A flipped cat is still a cat, so the label stays the same while the pixels change.

    Here's the simplest augmentation — a horizontal flip — written in plain Python so you can see exactly what changes.

    Worked Example: Horizontal Flip

    Mirror a nested-list image left to right

    # Augmentation = make safe copies of training images so the model sees
    # more variety. A horizontal flip mirrors the image left<->right, which
    # teaches the model that a cat facing left is still a cat.
    
    image = [
        [1, 2, 3],
        [4, 5, 6],
    ]
    
    def horizontal_flip(img):
        # Reverse the pixels WITHIN each row (columns flip), rows stay in order.
        return [list(reversed(row)) for row in img]
    
    flipped = horizontal_flip(image)
    
    for row in flipped:
        print(row)
    
    # Expected output:
    # [3, 2, 1]
    #
    ...
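
    The other two augmentations named above, random crops and brightness jitter, can be sketched in the same nested-list style. This is a minimal illustration, assuming 8-bit grayscale pixels. The helper names random_crop and adjust_brightness are made up for this example.

```python
import random

def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in the image."""
    top = rng.randint(0, len(img) - crop_h)       # randint includes both ends
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]

def adjust_brightness(img, factor):
    """Scale every pixel by factor and clamp to the valid 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

rng = random.Random(42)                           # seeded so runs are repeatable
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]

crop = random_crop(image, 2, 3, rng)
brighter = adjust_brightness(image, 1.2)
print(len(crop), "x", len(crop[0]))

# Expected output:
# 2 x 3
```

    The crop position changes with the seed, but the label of the image does not, which is what makes these transforms safe for training.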

    3. Preprocessing: Resize and Normalize

    Preprocessing makes every image look the same to the model. Two steps dominate: resize (so all images share one width and height the model expects) and normalize (rescale pixels from 0..255 down to a small range like 0..1). Normalizing keeps training stable.

    Run the example below to normalize a tiny "image" by hand — every pixel divided by 255.0.

    Worked Example: Normalize to 0..1

    Scale a nested-list image from 0..255 down to 0..1

    # A camera gives pixel values from 0 (black) to 255 (white).
    # Models train best when inputs sit in a small, fixed range like 0..1.
    # "Normalizing" = divide every pixel by 255.0 so 0->0.0 and 255->1.0.
    
    # A tiny 2x3 grayscale "image" stored as a nested list (rows of pixels).
    image = [
        [0, 128, 255],
        [64, 192, 32],
    ]
    
    def normalize(img):
        # Walk each row, then each pixel, dividing by 255.0.
        return [[pixel / 255.0 for pixel in row] for row in img]
    
    normalized = normalize(image)
    
    for
    ...
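
    The resize step can be sketched by hand as well. The example below uses nearest-neighbour sampling because it is the easiest to follow. Real libraries usually default to smoother interpolation such as bilinear. The function name resize_nearest is an assumption for illustration.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a nested-list image by copying the nearest source pixel."""
    old_h, old_w = len(img), len(img[0])
    return [
        # Map each output (r, c) back to a source pixel with integer scaling.
        [img[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

image = [
    [0, 255],
    [128, 64],
]

for row in resize_nearest(image, 4, 4):
    print(row)

# Expected output:
# [0, 0, 255, 255]
# [0, 0, 255, 255]
# [128, 128, 64, 64]
# [128, 128, 64, 64]
```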

    In real projects you don't do this by hand — torchvision.transforms composes resize, crop, tensor-conversion and normalization into one reusable pipeline. The same recipe must run at training, validation, and inference time.

    # In production you don't normalize by hand — torchvision composes the
    # whole preprocessing pipeline for you. This is the standard recipe for a
    # model pretrained on ImageNet.
    import torch
    from torchvision import transforms
    
    # Resize -> crop to the model's input size -> to tensor -> normalize.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),                       # pixels become 0..1 floats
        transforms.Normalize(                        # then ImageNet mean/std
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        ),
    ])
    
    # Stand-in for a loaded photo: a black 224x224 RGB image built from zeros.
    fake_image = transforms.ToPILImage()(torch.zeros(3, 224, 224))
    output = preprocess(fake_image)              # run the full pipeline
    print("input size  :", fake_image.size)      # PIL reports (width, height)
    print("output shape:", tuple(output.shape))  # tensor is (channels, H, W)
    
    # Expected output:
    # input size  : (224, 224)
    # output shape: (3, 224, 224)
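
    To make the Normalize step less of a black box, here is the same arithmetic done by hand for a single RGB pixel: scale to 0..1, then subtract the per-channel mean and divide by the per-channel standard deviation. The mean and std values are the ones used in the recipe above. The function name standardize is an assumption for this sketch.

```python
MEAN = [0.485, 0.456, 0.406]   # per-channel ImageNet means (R, G, B)
STD = [0.229, 0.224, 0.225]    # per-channel ImageNet standard deviations

def standardize(pixel_rgb):
    """Map one (r, g, b) pixel in 0..255 to mean/std-normalized floats."""
    scaled = [v / 255 for v in pixel_rgb]                  # 0..255 -> 0..1
    return [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]

black = standardize((0, 0, 0))
white = standardize((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])

# Expected output:
# [-2.118, -2.036, -1.804]
# [2.249, 2.429, 2.64]
```

    After this step the inputs are roughly centred on zero, which is the range an ImageNet-pretrained backbone saw during its own training.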

    4. Transfer Learning with a Pretrained Backbone

    Training a vision model from scratch typically needs a very large labelled dataset. Transfer learning avoids that: you take a backbone (like ResNet-50) that has already learned generic features (edges, textures, shapes) from ImageNet, and you replace only its final layer (the head) so it outputs your classes.

    Below, Albumentations builds the augmentation pipelines (note: train augments, val does not) and a pretrained ResNet-50 has its head swapped for 5 classes.

    # Albumentations is the go-to library for fast image augmentation, and
    # torchvision.models gives you a pretrained backbone for transfer learning.
    import albumentations as A
    import torchvision.models as models
    import torch.nn as nn
    
    # Build a TRAINING augmentation pipeline (flips, crops, colour jitter).
    train_aug = A.Compose([
        A.HorizontalFlip(p=0.5),                 # mirror half the time
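        # Depending on your Albumentations version, this call may need
        # size=(224, 224) instead of height=/width=.
        A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),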
        A.ColorJitter(brightness=0.2, contrast=0.2, p=0.5),
        A.Normalize(),                           # 0..1 then ImageNet stats
    ])
    
    # The VALIDATION pipeline must NOT augment — only resize + normalize.
    val_aug = A.Compose([
        A.Resize(256, 256),
        A.CenterCrop(224, 224),
        A.Normalize(),
    ])
    
    # Transfer learning: take a pretrained ResNet-50 and swap its final layer
    # for one that outputs YOUR number of classes (here: 5).
    model = models.resnet50(weights="IMAGENET1K_V2")
    model.fc = nn.Linear(model.fc.in_features, 5)   # 2048 -> 5 classes
    print("ready:", type(model).__name__, "with", model.fc.out_features, "classes")
    
    # Expected output:
    # ready: ResNet with 5 classes
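
    The core idea of keeping the backbone fixed and training only the head can be shown without any deep-learning library. The toy below is a pure-Python sketch, not a ResNet: BACKBONE_W, backbone, train_head and predict are hypothetical names, and the "images" are just two-pixel inputs. In PyTorch the analogous step is usually to set requires_grad = False on the backbone parameters before training the new head.

```python
import math

# Hypothetical frozen "backbone": fixed weights that turn 2 pixels into 2 features.
BACKBONE_W = [[1.0, -1.0], [0.5, 0.5]]

def backbone(x):
    """Frozen feature extractor (linear layer + ReLU). Its weights never change."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in BACKBONE_W]

def train_head(data, epochs=200, lr=0.5):
    """Fit a small logistic-regression head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x)                       # backbone is only read
            logit = sum(wi * fi for wi, fi in zip(w, feats)) + b
            p = 1 / (1 + math.exp(-logit))            # sigmoid
            err = p - y                               # gradient of log loss w.r.t. logit
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
            b -= lr * err                             # only the head is updated
    return w, b

def predict(x, w, b):
    logit = sum(wi * fi for wi, fi in zip(w, backbone(x))) + b
    return 1 if logit > 0 else 0

# Toy task: label 1 when the first pixel is brighter than the second.
data = [((1.0, 0.0), 1), ((0.0, 1.0), 0), ((0.9, 0.2), 1), ((0.1, 0.8), 0)]
w, b = train_head(data)
print([predict(x, w, b) for x, _ in data])
```

    Only w and b change during training, which is why a new head can often be fitted with far less data than the backbone needed.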

    5. The Train / Validate Loop

    Training runs in epochs — one full pass over the training data. Each epoch you let the model learn from the train set (compute loss, back-propagate, update weights), then validate on the held-back set without learning from it. Watching train loss fall while val accuracy rises tells you it's working; if val accuracy stalls or drops while train keeps improving, the model is overfitting.

    # The training loop: for each epoch, learn from train data, then check
    # val data WITHOUT learning from it. Watch train vs val to spot overfitting.
    import torch
    
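    def train_one_epoch(model, loader, optimizer, loss_fn):
        model.train()                            # enable dropout/batchnorm updates
        running_loss, n_batches = 0.0, 0
        for images, labels in loader:
            optimizer.zero_grad()                # reset old gradients
            preds = model(images)
            loss = loss_fn(preds, labels)
            loss.backward()                      # compute gradients
            optimizer.step()                     # nudge the weights
            running_loss += loss.item()          # track loss for logging
            n_batches += 1
        return running_loss / max(n_batches, 1)  # mean train loss this epoch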
    
    @torch.no_grad()                             # no gradients = faster, safer
    def validate(model, loader):
        model.eval()                             # dropout off, batchnorm uses running stats
        correct = total = 0
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
        return correct / total
    
    # Illustrative printout across 3 epochs, as a driver loop that logs
    # train loss and validation accuracy each epoch might show (the numbers
    # are examples, not guaranteed results):
    # Epoch 1  train_loss=1.42  val_acc=0.61
    # Epoch 2  train_loss=0.88  val_acc=0.74
    # Epoch 3  train_loss=0.55  val_acc=0.81
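
    Spotting overfitting can be automated by watching the validation accuracy history. The sketch below is one simple early-stopping rule, assuming you stop once validation accuracy has failed to improve for patience epochs in a row. The function name best_epoch is an assumption, and the history values after the third epoch are invented for illustration.

```python
def best_epoch(val_history, patience=2):
    """Return (index of best epoch, whether training would stop early)."""
    best_idx, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(val_history):
        if acc > best_acc:
            best_idx, best_acc = epoch, acc      # new best: remember it
        elif epoch - best_idx >= patience:
            return best_idx, True                # no improvement for `patience` epochs
    return best_idx, False

# First three values match the printout above; the rest are hypothetical.
history = [0.61, 0.74, 0.81, 0.80, 0.79, 0.78]
print(best_epoch(history))

# Expected output:
# (2, True)
```

    Indices are zero-based, so index 2 is the third epoch. In practice you would also save the model weights at that epoch and restore them after stopping.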

    6. Evaluate: Beyond Plain Accuracy

    Accuracy alone is misleading on imbalanced data. If 95% of images are "not cancer", a model that always says "not cancer" scores 95% accuracy yet catches nothing. So you also look at precision (of the things I flagged, how many were right?), recall (of the things I should have caught, how many did I?), and F1 (the harmonic mean of precision and recall). A confusion matrix shows exactly which classes get mixed up.

    Metric     Use when…
    Accuracy   Classes are balanced
    Precision  False positives are costly (spam filter)
    Recall     False negatives are costly (medical)
    F1         You want one balanced number
    mAP        Object detection (IoU-based)
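
    The 95% example can be checked with a few lines of plain Python. This sketch counts the four confusion-matrix cells for a binary problem and derives the metrics from them. The function name binary_metrics is an assumption, and metrics with a zero denominator are reported as 0.0 here, which is one common convention.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # caught positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # correct negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1] * 5 + [0] * 95        # 5 positive cases, 95 negative
always_negative = [0] * 100        # a model that never flags anything
print(binary_metrics(y_true, always_negative))

# Expected output:
# {'accuracy': 0.95, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

    The accuracy looks strong while recall is zero, which is exactly the failure the paragraph above warns about.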

    7. Deployment and Inference

    Deployment means running the trained model on one new image at a time. The golden rule: apply the exact same preprocessing you used for validation — never the training augmentation — switch to model.eval(), and run a single forward pass. A softmax turns the raw scores into probabilities so you can report a confidence.

    # Deployment = use the trained model on ONE new image. The golden rule:
    # apply the EXACT same preprocessing you used for validation — never the
    # training augmentation. Then run a single forward pass in eval mode.
    import torch
    import torch.nn.functional as F
    
    classes = ["cat", "dog", "car", "house", "tree"]
    
    @torch.no_grad()
    def predict(model, image_tensor):
        model.eval()
        logits = model(image_tensor.unsqueeze(0))   # add a batch dimension
        probs = F.softmax(logits, dim=1)[0]         # turn scores into 0..1
        idx = int(probs.argmax())
        return classes[idx], float(probs[idx])
    
    # label, confidence = predict(model, preprocessed_image)
    # print(f"{label} ({confidence:.1%})")
    
    # Example output (the label and confidence depend on your model and image):
    # dog (92.3%)
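
    What softmax does to the raw scores can be sketched in plain Python. This version subtracts the largest logit before exponentiating, a standard trick to avoid overflow. The logit values below are made up for illustration, not produced by a real model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                                # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "car", "house", "tree"]
logits = [1.0, 3.0, 0.5, 0.2, 0.1]                 # hypothetical model outputs

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the top class
print(classes[best], f"({probs[best]:.1%})")
```

    The largest logit always wins the argmax, and softmax only rescales the scores so the winner comes with a confidence value.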

    🎯 Your Turn 1: Normalize the Pixels

    Fill in the blank so every pixel is scaled to the 0..1 range. Use the expected output to check yourself.

    Your Turn: Normalize to 0..1

    Replace ___ so pixels divide by 255.0

    # 🎯 YOUR TURN — fill in the blanks marked with ___
    
    # Goal: scale every pixel from 0..255 down to 0..1.
    image = [
        [0, 255],
        [51, 204],
    ]
    
    def normalize(img):
        # 👉 divide each pixel by 255.0 (in Python 3, / always returns a float, so 255 works too)
        return [[pixel / ___ for pixel in row] for row in img]
    
    for row in normalize(image):
        print([round(v, 2) for v in row])
    
    # ✅ Expected output:
    # [0.0, 1.0]
    # [0.2, 0.8]

    🎯 Your Turn 2: Split 75 / 25

    Fill in the two slice indices so the first 25% becomes validation and the rest becomes training.

    Your Turn: Train / Val Split

    Replace the ___ slice points to split 75/25

    # 🎯 YOUR TURN — fill in the blanks marked with ___
    
    # Goal: put the FIRST 25% of the data in validation, the rest in training.
    samples = ["a", "b", "c", "d", "e", "f", "g", "h"]
    
    def train_val_split(data, val_fraction):
        n_val = int(len(data) * val_fraction)   # 8 * 0.25 = 2
        val = data[:___]                         # 👉 first n_val items -> val
        train = data[___:]                       # 👉 everything after  -> train
        return train, val
    
    train, val = train_val_split(samples, 0.25)
    p
    ...

    Common Errors (And How to Fix Them)

    ❌ Train/test preprocessing mismatch

    You normalize with one mean/std (or resize differently) at training but another at inference. The model sees inputs it was never trained on, so accuracy quietly collapses in production.

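    ✅ Fix: define the deterministic preprocessing (resize, crop, normalize) once and reuse it unchanged at training, validation, and deployment; only the random augmentation steps should differ for the training set.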

    ❌ Augmenting the validation/test set

    Flips and crops on your val set make every run report a different, unrealistic score. You can no longer trust the number.

    ✅ Fix: keep a separate val transform with only Resize + Normalize — no random ops.

    ❌ Data leakage

    Near-duplicate frames of the same scene land in both train and val, or you compute normalization statistics over the whole dataset before splitting. Offline scores look amazing; real-world performance is poor.

    ✅ Fix: split first, then compute stats only on the train set; group related images so they never straddle the split.
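
    Grouping related images before splitting can be sketched as follows. This is a minimal illustration, assuming the scene can be read from the filename prefix. The function name group_split and the filenames are hypothetical, and a real project would normally shuffle the groups first or use a library helper such as scikit-learn's GroupShuffleSplit.

```python
def group_split(items, group_of, val_fraction=0.2):
    """Split so that every group lands entirely in train or entirely in val."""
    groups = sorted({group_of(item) for item in items})
    n_val = max(1, int(len(groups) * val_fraction))   # at least one val group
    val_groups = set(groups[:n_val])
    train = [item for item in items if group_of(item) not in val_groups]
    val = [item for item in items if group_of(item) in val_groups]
    return train, val

# Hypothetical frame names: the part before "_" identifies the scene.
frames = [
    "sceneA_001", "sceneA_002", "sceneB_001", "sceneB_002",
    "sceneC_001", "sceneD_001", "sceneE_001",
]

train, val = group_split(frames, group_of=lambda name: name.split("_")[0])
print(val)

# Expected output:
# ['sceneA_001', 'sceneA_002']
```

    Because the split is made over scenes rather than individual frames, near-duplicate frames can never end up on both sides.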

    ❌ Not normalizing at all

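    Feeding raw 0..255 pixels can make the loss spike to NaN or stall, because inputs that large tend to produce oversized activations and gradients. A pretrained backbone also expects the input range it was trained on.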

    ✅ Fix: always scale pixels to a small range (0..1, then ImageNet mean/std) before the model.

    📋 Quick Reference

    Pipeline Stage        Tools                        Key Decisions
    Collect & label       Label Studio, CVAT           Class balance, label quality
    Split                 sklearn, slicing             Train/val ratio, no leakage
    Augment (train only)  albumentations, torchvision  Flip, crop, colour jitter
    Preprocess            torchvision.transforms       Resize, normalize (same everywhere)
    Backbone              timm, torchvision.models     ResNet, ViT, EfficientNet
    Train/validate        torch, optimizer, loss_fn    Epochs, watch overfitting
    Evaluate              sklearn.metrics              F1, mAP, confusion matrix
    Deploy                ONNX, TensorRT               Val transform, eval mode, latency

    ❓ Frequently Asked Questions

    Q: What is a computer vision pipeline?

    A: It is the full assembly line that turns raw photos into predictions: collect and label data, augment and preprocess the images, feed them through a model (usually a pretrained backbone plus a small classifier head), then evaluate and deploy. Each stage feeds the next, so a problem early on quietly corrupts everything downstream.

    Q: Why must I normalize pixel values?

    A: Cameras give pixels in the 0..255 range, but models train far more stably when inputs sit in a small fixed range like 0..1 (often followed by subtracting a mean and dividing by a standard deviation). If you skip normalization, gradients can explode or vanish and training stalls. Crucially, use the SAME normalization at train, validation and inference time.

    Q: Should I augment my validation and test sets?

    A: No. Augmentation (flips, crops, colour jitter) belongs to training only — it teaches the model variety. Your validation and test sets must stay fixed and realistic so the score you read is honest. Augmenting them gives you a number that does not reflect real-world performance.

    Q: What is transfer learning and why use a pretrained backbone?

    A: A backbone like ResNet-50 has already learned generic visual features (edges, textures, shapes) from millions of ImageNet images. Transfer learning reuses those weights and only retrains a small new head for your classes, so you need far less data and compute than training from scratch — and you usually get better accuracy too.

    Q: What is data leakage in a CV pipeline?

    A: Leakage is when information from your validation or test set sneaks into training. Common causes: computing normalization statistics over the whole dataset before splitting, putting near-duplicate frames of the same scene in both train and val, or tuning on the test set. The symptom is great offline scores that collapse in production.

    🎯 Mini Challenge: Flip an Image

    Now with the support faded — only a comment outline is given. Write the augmentation yourself: mirror each row of a nested-list image left to right.

    Mini Challenge: Horizontal Flip

    Write flip(img) from the outline and match the expected output

    # 🎯 MINI-CHALLENGE: horizontal-flip an image
    # 1. Define a nested-list image, e.g. [[10, 20, 30], [40, 50, 60]]
    # 2. Write flip(img) that mirrors each ROW left<->right
    #    (hint: list(reversed(row)) reverses one row)
    # 3. Print each flipped row
    #
    # ✅ Expected output:
    # [30, 20, 10]
    # [60, 50, 40]
    
    # your code here

    🎉 Lesson Complete!

    You can now walk an image down the whole assembly line: collect and split data, augment the training set, preprocess by resizing and normalizing, transfer-learn a pretrained backbone, run the train/validate loop, evaluate beyond plain accuracy, and deploy for inference — all while keeping preprocessing consistent and avoiding leakage.

    🚀 Up next: Object Detection — go from "what is in this image?" to "what is where?", drawing labelled boxes around every object.
