Lesson 33 • Advanced

Semantic Segmentation

Label every pixel in an image. By the end you'll understand encoder-decoder networks like U-Net, compute IoU and Dice by hand, and know when each segmentation flavour and metric is the right tool.

What You'll Learn in This Lesson

✓ What pixel-level classification means and how it differs from detection
✓ Semantic vs instance vs panoptic segmentation, and when to use each
✓ How encoder-decoder networks (FCN, U-Net) turn an image into a mask
✓ Why skip connections and transposed convolutions sharpen boundaries
✓ How to compute IoU and Dice between two masks, pixel by pixel
✓ Why mean IoU beats pixel accuracy when classes are imbalanced

Before you start: Make sure you've completed Object Detection — segmentation builds on the idea of locating objects, but pushes it down to the level of individual pixels.

🎨 Real-World Analogy: Colouring Every Pixel by What It Is

Object detection draws rectangles around things. Segmentation goes further: it hands you a colouring book of the photo and asks you to colour every single pixel by what it is — sky blue, road grey, car red, person yellow. Nothing is left uncoloured, and the colour has to follow the object's exact outline, not a loose box.

A self-driving car needs that precision to know exactly where the road ends and the kerb begins. A radiologist needs it to trace the precise boundary of a tumour. That is the job of semantic segmentation: assign a class label to each pixel so the whole image becomes a clean, colour-coded map.

1. Pixel-Level Classification

In image classification you predict one label for the whole image ("this is a cat"). In segmentation you predict one label for every pixel. The output is a grid the same height and width as the input, where each cell holds a class id: a mask.

Under the hood the model produces, for each pixel, a score per class. You then take the argmax (the index of the highest score) to get that pixel's predicted class. That single operation is what turns a stack of score maps into the final colour-coded picture.

Here is that whole idea in plain Python — no libraries — so you can see the argmax happen on a tiny grid:

Worked Example: Argmax a Class Per Pixel

Turn per-pixel class scores into a labelled class map with argmax

Python

# Turning model SCORES into a class map with argmax — PLAIN Python
# A segmentation model outputs, for EVERY pixel, one score per class.
# argmax picks the index of the highest score = the predicted class.

class_names = ["background", "road", "car"]

# scores[row][col] is a list [score_bg, score_road, score_car] for that pixel
scores = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],   # row 0: bg, road
    [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]],   # row 1: car, car
]

def argmax(values):
    """Return th
...

2. Semantic vs Instance vs Panoptic

There are three flavours of segmentation. They answer different questions:

Semantic

Every pixel gets a class, but all objects of a class merge. Three cars are just one big "car" region. Question: what is this pixel?

Instance

Separates individual objects (car #1, car #2, car #3) with their own masks, but usually ignores background "stuff". Question: which object is this?

Panoptic

Both at once: every pixel gets a class and countable objects get a unique id. Question: what is it, and which one?

Jargon check: classes like road, sky and grass that you can't count are called "stuff"; countable classes like cars and people are "things". Semantic handles stuff well; instance handles things; panoptic does both.
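
To make the three flavours concrete, the sketch below encodes one tiny scene three ways. The grid, the class ids and the `(class_id, instance_id)` pair encoding are assumptions chosen for illustration, not a standard dataset format:

```python
# Illustrative only: a 2x4 scene with two separate cars on a road.
# Assumed class ids: 0 = road ("stuff"), 1 = car ("thing").
semantic = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]

# Instance ids cover "things" only: 0 = no instance, 1 = car #1, 2 = car #2.
instance = [
    [1, 0, 0, 2],
    [1, 0, 0, 2],
]

# Panoptic: pair every pixel's class with its instance id (0 for stuff).
panoptic = [
    [(cls, inst) for cls, inst in zip(sem_row, inst_row)]
    for sem_row, inst_row in zip(semantic, instance)
]

semantic_classes = {cls for row in semantic for cls in row}
car_instances = {inst for row in instance for inst in row if inst != 0}

print("semantic classes present:", sorted(semantic_classes))  # [0, 1]
print("separate cars found:     ", len(car_instances))        # 2
print("panoptic top row:        ", panoptic[0])
```

The semantic mask alone cannot tell you there are two cars: both are just label 1. The instance ids recover the count, and the panoptic pairs carry both answers for every pixel.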

3. Encoder-Decoder Architectures (FCN & U-Net)

A normal classifier ends in dense layers that throw away spatial layout — fine for one label, useless for a per-pixel map. The Fully Convolutional Network (FCN) fixed this by replacing those dense layers with convolutions, so the output stays a 2-D grid.

Modern segmentation networks share an encoder-decoder shape:

The encoder repeatedly downsamples the image (via pooling or strided convolutions). It captures what is present but loses precise location.
The decoder repeatedly upsamples back to full resolution. It rebuilds where everything is, ending in a 1×1 convolution that outputs one score per class per pixel.
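
As a rough sketch of that shape, the helper below tracks only the spatial size through the network. `trace_resolution` is a hypothetical name, and it assumes every encoder stage halves the size and every decoder stage doubles it, which real architectures may not do exactly:

```python
def trace_resolution(size, num_stages):
    """Return the side length after each encoder and decoder stage."""
    sizes = [size]
    for _ in range(num_stages):      # encoder: halve at each stage
        size //= 2
        sizes.append(size)
    for _ in range(num_stages):      # decoder: double at each stage
        size *= 2
        sizes.append(size)
    return sizes

print(trace_resolution(256, 4))
# [256, 128, 64, 32, 16, 32, 64, 128, 256]
```

The smallest value in the middle is the bottleneck, and the last value matches the input so the network can emit one prediction per pixel.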

U-Net (2015) is the famous example. Its big idea is skip connections: at each decoder stage it concatenates the matching high-resolution feature map copied straight from the encoder. That restores the fine edge detail the encoder threw away, so boundaries come out crisp instead of blurry. The diagram below shows the symmetric "U" shape:

Input ─[Conv↓]─[Conv↓]─[Conv↓]─┐ (encoder: captures WHAT)
          │       │       │     │
        (skip)  (skip)  (skip)  ▼
          │       │       │  Bottleneck
          ▼       ▼       ▼     │
Output ─[Conv↑]─[Conv↑]─[Conv↑]─┘ (decoder: recovers WHERE)

The worked example below builds a tiny U-Net in PyTorch so you can see the skip connection (torch.cat) and the upsampling step in real code. It's read-only here — run it locally with PyTorch installed:

🔧 Worked Example: A Tiny U-Net in PyTorch

# A tiny U-Net forward pass in PyTorch (encoder-decoder + skip connections)
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        # Encoder: conv layers extract features (WHAT is in the image); the pool between them downsamples
        self.enc1 = nn.Conv2d(in_ch, 16, 3, padding=1)
        self.enc2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2)            # halve height & width
        # Decoder: transposed conv UPSAMPLES (recovers WHERE things are)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # After concatenating the skip connection we have 16 + 16 = 32 channels
        self.dec1 = nn.Conv2d(32, 16, 3, padding=1)
        # 1x1 conv produces one score per class, per pixel
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        s1 = torch.relu(self.enc1(x))          # skip feature, full resolution
        x = torch.relu(self.enc2(self.pool(s1)))
        x = self.up(x)                         # back to full resolution
        x = torch.cat([x, s1], dim=1)          # SKIP CONNECTION (concatenate)
        x = torch.relu(self.dec1(x))
        return self.head(x)                    # logits: (N, classes, H, W)

model = TinyUNet(num_classes=3)
image = torch.randn(1, 3, 64, 64)             # 1 RGB image, 64x64
logits = model(image)
pred = logits.argmax(dim=1)                    # class id per pixel

print("logits shape:", tuple(logits.shape))
print("pred shape:  ", tuple(pred.shape))

# Expected output:
# logits shape: (1, 3, 64, 64)
# pred shape:   (1, 64, 64)

Notice nn.ConvTranspose2d doing the learnable upsampling and torch.cat([x, s1], dim=1) performing the skip connection. The 1×1 head maps features to one score per class per pixel.

4. Upsampling & Transposed Convolution

The bottleneck feature map is tiny — maybe 16×16 for a 256×256 input. The decoder must grow it back to full size. There are two common ways:

Transposed convolution (ConvTranspose2d, sometimes called "deconvolution" or fractionally-strided conv): a learnable upsampling that inserts spacing between inputs and convolves. With stride 2 it doubles the resolution while the network learns how to fill the gaps.
Bilinear upsampling followed by an ordinary convolution: a non-learned resize, then a conv to clean it up. Cheaper and avoids the checkerboard artefacts transposed conv can produce.

Either way, the goal is the same: recover the original height and width so you can output one prediction per input pixel. DeepLab takes a different route and uses dilated (atrous) convolutions — convolutions with gaps between weights — to widen the receptive field without downsampling as aggressively, keeping resolution high throughout.
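
The transposed-convolution idea is easier to follow in one dimension. The sketch below is plain Python, not PyTorch: `transposed_conv1d` is a hypothetical helper that assumes no padding and fixed (not learned) kernel weights, and it lets each input value "stamp" a scaled copy of the kernel into the output:

```python
def transposed_conv1d(signal, kernel, stride=2):
    """1-D transposed convolution, no padding: each input stamps the kernel."""
    out_len = (len(signal) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out

# Kernel size 2, stride 2: stamps never overlap, so the length exactly doubles.
clean = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
print(clean)    # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]

# Kernel size 3, stride 2: stamps overlap unevenly on a constant input.
uneven = transposed_conv1d([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], stride=2)
print(uneven)   # [1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0]
```

The second output alternates between 1 and 2 even though the input is flat. That uneven overlap, when the kernel size is not a multiple of the stride, is one source of the checkerboard artefacts mentioned above.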

5. Metrics: IoU and Dice

How do you score a mask? You compare the predicted pixels for a class against the true pixels. Two measures dominate:

IoU (Intersection over Union), a.k.a. the Jaccard index: |predicted ∩ truth| / |predicted ∪ truth|, i.e. the overlapping pixels divided by the pixels in either mask. The standard benchmark metric.
Dice coefficient (the F1 score for pixels): 2 × |predicted ∩ truth| / (|predicted| + |truth|). Never lower than IoU for the same masks (the two agree only at 0 and 1), and popular as a loss in medical imaging.

Mean IoU (mIoU) is just the IoU averaged across all classes — the headline number on datasets like Cityscapes and Pascal VOC. The runnable example below computes both IoU and Dice on two tiny masks, entirely in plain Python so you can follow every pixel:

Worked Example: IoU & Dice on Tiny Masks

Compute intersection, union, IoU and Dice between two nested-list masks

Python

# Pixel-wise IoU and Dice between two small masks — PLAIN Python, no libraries
# A "mask" here is a grid (nested list) of class labels, one label per pixel.
# 0 = background, 1 = object (e.g. a car).

prediction = [          # what the model THINKS each pixel is
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

ground_truth = [        # what each pixel REALLY is (the labelled answer)
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# We measure 
...
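
Working from pixel counts instead of full masks makes the link between the two metrics easy to see: for any pair of masks, Dice = 2 × IoU / (1 + IoU). The sketch below checks that identity on a few overlaps. The helper name `iou_and_dice` and the counts (two 10-pixel masks) are made up for illustration:

```python
def iou_and_dice(intersection, pred_count, truth_count):
    """Compute IoU and Dice from pixel counts for one class."""
    union = pred_count + truth_count - intersection
    iou = intersection / union
    dice = 2 * intersection / (pred_count + truth_count)
    return iou, dice

results = []
for intersection in (0, 2, 5, 10):
    iou, dice = iou_and_dice(intersection, pred_count=10, truth_count=10)
    results.append((iou, dice))
    print(f"overlap={intersection:2d}  IoU={iou:.3f}  Dice={dice:.3f}  "
          f"2*IoU/(1+IoU)={2 * iou / (1 + iou):.3f}")
```

The last two columns agree on every row. Dice is never below IoU, and the gap can be sizeable in the middle of the range: an IoU of about 0.333 corresponds to a Dice of 0.5.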

🌍 In Practice: Pretrained U-Net with segmentation-models-pytorch

You almost never hand-build U-Net for real work. Libraries like segmentation_models_pytorch give you battle-tested architectures with pretrained encoders in a few lines. Read-only here; run it locally:

# In practice you rarely build U-Net by hand — use a pretrained library.
import segmentation_models_pytorch as smp
import torch

# A U-Net with a pretrained ResNet encoder, 5 output classes
model = smp.Unet(
    encoder_name="resnet34",       # backbone pretrained on ImageNet
    encoder_weights="imagenet",
    in_channels=3,                 # RGB input
    classes=5,                     # one score map per class
)

image = torch.randn(1, 3, 256, 256)
logits = model(image)              # (N, classes, H, W) at FULL resolution
masks = logits.argmax(dim=1)       # (N, H, W) — class id per pixel

print("logits shape:", tuple(logits.shape))
print("masks shape: ", tuple(masks.shape))

# Expected output:
# logits shape: (1, 5, 256, 256)
# masks shape:  (1, 256, 256)

Swap encoder_name for a deeper backbone, or smp.Unet for smp.DeepLabV3Plus, and the rest of your training loop stays identical.

🎯 Your Turn #1: Pixel Accuracy

Fill in the two blanks so the loop counts how many pixels were labelled correctly, then divides by the total. The expected output is in the comments — match it.

Your Turn: Pixel Accuracy

Finish the comparison and the counter to compute pixel accuracy

Python

# 🎯 YOUR TURN — finish pixel accuracy: the fraction of pixels labelled correctly.

prediction = [
    [0, 1, 1],
    [2, 1, 0],
    [0, 0, 2],
]

ground_truth = [
    [0, 1, 2],   # last pixel wrong
    [2, 1, 0],
    [0, 1, 2],   # middle pixel wrong
]

correct = 0
total = 0

for row in range(len(ground_truth)):
    for col in range(len(ground_truth[0])):
        total += 1
        # 👉 add 1 to "correct" only when the two labels match
        if prediction[row][col] ___ ground_truth[row][col]
...

🎯 Your Turn #2: Argmax Per Pixel

Complete the argmax comparison and the list comprehension so each pixel's highest-scoring class becomes its label. Match the expected output.

Your Turn: Argmax Per Pixel

Finish argmax and the per-pixel comprehension to build a class map

Python

# 🎯 YOUR TURN — pick the winning class for each pixel with argmax.

class_names = ["background", "person", "dog"]

# One [bg, person, dog] score list per pixel, laid out as a 2x2 grid
scores = [
    [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.2, 0.6], [0.7, 0.2, 0.1]],
]

def argmax(values):
    best = 0
    for i in range(1, len(values)):
        # 👉 keep "best" pointing at the index of the LARGEST score
        if values[i] ___ values[best]:   # 👉 replace ___ with the comparison that m
...

🎯 Mini-Challenge: Mean IoU Over Three Classes

Time to fly with less support. Using only the comment outline, compute the IoU for each of the three classes and then their mean. The expected numbers are in the comments so you can self-check.

Mini-Challenge: Mean IoU

Compute per-class IoU and the mean IoU from scratch

Python

# 🎯 MINI-CHALLENGE: Mean IoU over three classes
#
# You are given a 3x3 prediction and ground-truth mask with labels 0, 1, 2.
# 1. For EACH class c in (0, 1, 2):
#       - intersection = pixels where pred == c AND truth == c
#       - union        = pixels where pred == c OR  truth == c
#       - class IoU     = intersection / union   (skip the class if union == 0)
# 2. Mean IoU (mIoU) = average of the per-class IoUs you computed
# 3. print each class IoU and the final mIoU rounded to 3 decimals

...

Common Errors (And How to Fix Them)

❌ Ignoring class imbalance

Background often fills 80–95% of an image. Train with plain cross-entropy and the model can learn to predict "background" almost everywhere, so small classes vanish.

✅ Fix: use Dice loss or a weighted/focal cross-entropy so rare classes count more, and always report per-class IoU, not just the average.
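
A minimal sketch of why Dice loss helps, for a single foreground class in plain Python. `soft_dice_loss` is a hypothetical helper, the smoothing term `eps` is an assumed choice, and real training code would use a tensor library instead of lists:

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """probs: predicted foreground probabilities in [0, 1]; targets: 0/1 labels."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + eps) / (total + eps)

# Assumed toy image: 10 pixels, only the last one is foreground.
targets = [0] * 9 + [1]
all_background = [0.0] * 10          # model ignores the rare class
perfect = [0.0] * 9 + [1.0]          # model finds it

loss_lazy = soft_dice_loss(all_background, targets)
loss_perfect = soft_dice_loss(perfect, targets)
print(round(loss_lazy, 3))     # 1.0
print(round(loss_perfect, 3))  # 0.0
```

The all-background prediction is 90% accurate pixel by pixel, yet its Dice loss is essentially at the maximum, because the loss depends on overlap with the rare class rather than on the count of easy background pixels.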

❌ Trusting pixel accuracy over IoU

A model scoring 92% pixel accuracy can still miss every pedestrian if pedestrians are a tiny fraction of pixels.

✅ Fix: judge models by mean IoU (mIoU), which weights every class equally regardless of how many pixels it covers.
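
The gap between the two metrics shows up on a toy example. The scene below is an assumption made for illustration (100 pixels flattened into a list, 8 of them pedestrian), and the helper names are hypothetical:

```python
# Assumed toy scene: 92 background pixels (0) and 8 pedestrian pixels (1).
truth = [0] * 92 + [1] * 8
pred = [0] * 100                     # lazy model: background everywhere

def pixel_accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def class_iou(pred, truth, cls):
    inter = sum(p == cls and t == cls for p, t in zip(pred, truth))
    union = sum(p == cls or t == cls for p, t in zip(pred, truth))
    return inter / union if union else None   # None: class absent everywhere

accuracy = pixel_accuracy(pred, truth)
ious = [class_iou(pred, truth, c) for c in (0, 1)]
valid = [v for v in ious if v is not None]
miou = sum(valid) / len(valid)

print("pixel accuracy:", round(accuracy, 2))  # 0.92
print("per-class IoU: ", ious)                # [0.92, 0.0]
print("mean IoU:      ", round(miou, 2))      # 0.46
```

Pixel accuracy looks healthy at 0.92, while the pedestrian IoU of 0.0 drags mIoU down to 0.46 and exposes the failure.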

❌ Blurry, wrong boundaries

Edges of objects come out smeared and the mask leaks across object borders — usually because spatial detail was lost during heavy downsampling.

✅ Fix: use an architecture with skip connections (U-Net) or dilated convolutions (DeepLab) so high-resolution detail reaches the output.

❌ No skip connections

A pure encoder-decoder with no skips upsamples only from the tiny bottleneck, so it never recovers fine structure and masks look like soft blobs.

✅ Fix: concatenate (or add) the matching encoder feature map into each decoder stage — that is exactly what makes U-Net work.

Pro tip: For medical imaging start with U-Net + Dice loss — it handles class imbalance naturally. For general scenes, DeepLabV3+ with a strong backbone is a great accuracy/speed trade-off, and SAM (Segment Anything) can segment almost any object from a simple prompt with zero training.

📋 Quick Reference

Architectures

Architecture	Key Feature	Best For
FCN	No dense layers, all conv	The original per-pixel net
U-Net	Skip connections	Medical, small datasets
DeepLabV3+	Atrous (dilated) conv	General segmentation
Mask R-CNN	Per-instance masks	Instance segmentation
SAM	Zero-shot prompts	Universal segmentation

Metrics & terms

Term	Formula / Meaning
IoU (Jaccard)	`intersection / union`
Dice (F1)	`2 × intersection / (pred + truth)`
mIoU	Mean of per-class IoU (primary metric)
Pixel accuracy	Correct pixels / total (misleading when classes are imbalanced)
argmax	Index of the highest class score per pixel
Transposed conv	Learnable upsampling in the decoder

❓ Frequently Asked Questions

Q: What is the difference between semantic, instance, and panoptic segmentation?

A: Semantic segmentation labels every pixel with a class but treats all objects of the same class as one blob (all cars share one 'car' label). Instance segmentation separates individual objects (car #1 vs car #2) but usually only covers 'things'. Panoptic segmentation combines both: every pixel gets a class, and each countable object also gets a unique instance id.

Q: Why does U-Net use skip connections?

A: The encoder shrinks the image to capture context (WHAT is present) but loses fine spatial detail. Skip connections copy high-resolution feature maps from the encoder straight across to the matching decoder stage, so the decoder can recover exact boundaries (WHERE things are). Without them, masks come out blurry along edges.

Q: What is the difference between IoU and Dice?

A: Both measure mask overlap. IoU (Jaccard) is intersection / union. Dice is 2*intersection / (predicted + truth). Dice is never lower than IoU for the same pair of masks (the two are equal only at 0 and 1) because it counts the overlap twice. Dice is popular as a loss in medical imaging, while IoU is the standard reporting metric on benchmarks like Cityscapes and Pascal VOC.

Q: Why is pixel accuracy a misleading metric?

A: If 80% of an image is background, a model that predicts 'background everywhere' scores 80% pixel accuracy while being useless for the classes you care about. Mean IoU averages the IoU of each class equally, so a tiny but important class (like a tumour) counts just as much as the background.

Q: What is transposed convolution used for in segmentation?

A: A transposed (a.k.a. 'deconv' or fractionally-strided) convolution is a learnable way to upsample a feature map back toward the input resolution. The decoder uses it to grow the small, deep feature maps from the bottleneck back into a full-size, per-pixel prediction. Bilinear upsampling is a non-learned alternative.

🎉

Lesson 33 complete — you can label every pixel!

You now know the three segmentation flavours, how encoder-decoder networks like FCN and U-Net rebuild a full-resolution mask, why skip connections and transposed convolutions matter, and how to compute IoU, Dice and mIoU by hand.

🚀 Up next: Speech Recognition — switch from pixels to audio and learn how models turn sound waves into text.
