Lesson 33 • Advanced
Semantic Segmentation
Label every pixel in an image. By the end you'll understand encoder-decoder networks like U-Net, compute IoU and Dice by hand, and know when each segmentation flavour and metric is the right tool.
What You'll Learn in This Lesson
- ✓ What pixel-level classification means and how it differs from detection
- ✓ Semantic vs instance vs panoptic segmentation, and when to use each
- ✓ How encoder-decoder networks (FCN, U-Net) turn an image into a mask
- ✓ Why skip connections and transposed convolutions sharpen boundaries
- ✓ How to compute IoU and Dice between two masks, pixel by pixel
- ✓ Why mean IoU beats pixel accuracy when classes are imbalanced
🎨 Real-World Analogy: Colouring Every Pixel by What It Is
Object detection draws rectangles around things. Segmentation goes further: it hands you a colouring book of the photo and asks you to colour every single pixel by what it is — sky blue, road grey, car red, person yellow. Nothing is left uncoloured, and the colour has to follow the object's exact outline, not a loose box.
A self-driving car needs that precision to know exactly where the road ends and the kerb begins. A radiologist needs it to trace the precise boundary of a tumour. That is the job of semantic segmentation: assign a class label to each pixel so the whole image becomes a clean, colour-coded map.
1. Pixel-Level Classification
In image classification you predict one label for the whole image ("this is a cat"). In segmentation you predict one label for every pixel. The output is a grid the same height and width as the input, where each cell holds a class id: a mask.
Under the hood the model produces, for each pixel, a score per class. You then take the argmax (the index of the highest score) to get that pixel's predicted class. That single operation is what turns a stack of score maps into the final colour-coded picture.
Here is that whole idea in plain Python — no libraries — so you can see the argmax happen on a tiny grid:
Worked Example: Argmax a Class Per Pixel
Turn per-pixel class scores into a labelled class map with argmax
# Turning model SCORES into a class map with argmax — PLAIN Python
# A segmentation model outputs, for EVERY pixel, one score per class.
# argmax picks the index of the highest score = the predicted class.
class_names = ["background", "road", "car"]
# scores[row][col] is a list [score_bg, score_road, score_car] for that pixel
scores = [
[[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]], # row 0: bg, road
[[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]], # row 1: car, car
]
def argmax(values):
"""Return th
...
2. Semantic vs Instance vs Panoptic
There are three flavours of segmentation. They answer different questions:
Semantic
Every pixel gets a class, but all objects of a class merge. Three cars are just one big "car" region. Question: what is this pixel?
Instance
Separates individual objects (car #1, car #2, car #3) with their own masks, but usually ignores background "stuff". Question: which object is this?
Panoptic
Both at once: every pixel gets a class and countable objects get a unique id. Question: what is it, and which one?
Jargon check: classes like road, sky and grass that you can't count are called "stuff"; countable classes like cars and people are "things". Semantic handles stuff well; instance handles things; panoptic does both.
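To make the three flavours concrete, here is a minimal plain-Python sketch. The tiny scene, the `THING_CLASSES` set and the `label_instances` helper are all hypothetical, invented for illustration. It derives instance ids from a semantic mask with a simple flood fill, which only works when objects of the same class do not touch; real instance and panoptic models predict the ids directly rather than post-processing a semantic mask like this.

```python
# Hypothetical tiny scene: 0 = road ("stuff"), 1 = car ("thing").
semantic = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
THING_CLASSES = {1}  # assumption: only class 1 is countable


def label_instances(mask, thing_classes):
    """Give each 4-connected blob of a 'thing' class its own id (0 = no instance)."""
    rows, cols = len(mask), len(mask[0])
    instance = [[0] * cols for _ in range(rows)]
    next_id = 1
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] not in thing_classes or instance[r][c] != 0:
                continue
            # Flood fill outward from this unlabelled 'thing' pixel.
            stack = [(r, c)]
            instance[r][c] = next_id
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == mask[y][x]
                            and instance[ny][nx] == 0):
                        instance[ny][nx] = next_id
                        stack.append((ny, nx))
            next_id += 1
    return instance


instance = label_instances(semantic, THING_CLASSES)

# Panoptic view: a (class, instance_id) pair for every pixel.
panoptic = [list(zip(sem_row, ins_row))
            for sem_row, ins_row in zip(semantic, instance)]

print("semantic:", semantic)   # both cars share label 1
print("instance:", instance)   # car #1 and car #2 get ids 1 and 2
print("panoptic pixel (1, 4):", panoptic[1][4])
```

The semantic mask cannot tell the two cars apart, the instance map can but says nothing about the road, and the panoptic pairs carry both pieces of information.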
3. Encoder-Decoder Architectures (FCN & U-Net)
A normal classifier ends in dense layers that throw away spatial layout — fine for one label, useless for a per-pixel map. The Fully Convolutional Network (FCN) fixed this by replacing those dense layers with convolutions, so the output stays a 2-D grid.
Modern segmentation networks share an encoder-decoder shape:
- The encoder repeatedly downsamples the image (via pooling or strided convolutions). It captures what is present but loses precise location.
- The decoder repeatedly upsamples back to full resolution. It rebuilds where everything is, ending in a 1×1 convolution that outputs one score per class per pixel.
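The trade-off in those two bullets can be seen on a 4×4 grid. This is a minimal sketch, assuming 2×2 max pooling for the encoder step and nearest-neighbour upsampling for the decoder step; the helper names `max_pool_2x2` and `upsample_nearest_2x` are made up for this illustration, and a real decoder would use learned layers instead.

```python
def max_pool_2x2(grid):
    """Downsample: keep the max of each non-overlapping 2x2 block."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]


def upsample_nearest_2x(grid):
    """Upsample: repeat every value into a 2x2 block (nothing is learned)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


image = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]

pooled = max_pool_2x2(image)            # 4x4 -> 2x2
restored = upsample_nearest_2x(pooled)  # 2x2 -> 4x4

print("pooled:  ", pooled)
for row in restored:
    print("restored:", row)
```

After pooling, the network still knows a 9 and a 5 are present (the "what"), but the restored grid smears each value over a whole 2×2 block: the exact pixel they came from (the "where") is gone. That lost detail is what skip connections are meant to bring back.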
U-Net (2015) is the famous example. Its big idea is skip connections: at each decoder stage it concatenates the matching high-resolution feature map copied straight from the encoder. That restores the fine edge detail the encoder threw away, so boundaries come out crisp instead of blurry. The diagram below shows the symmetric "U" shape:
Input ─[Conv↓]─[Conv↓]─[Conv↓]─┐ (encoder: captures WHAT)
│ │ │ │
(skip) (skip) (skip) ▼
│ │ │ Bottleneck
▼ ▼ ▼ │
Output ─[Conv↑]─[Conv↑]─[Conv↑]─┘ (decoder: recovers WHERE)
The worked example below builds a tiny U-Net in PyTorch so you can see the skip connection (torch.cat) and the upsampling step in real code. It's read-only here; run it locally with PyTorch installed:
🔧 Worked Example: A Tiny U-Net in PyTorch
# A tiny U-Net forward pass in PyTorch (encoder-decoder + skip connections)
import torch
import torch.nn as nn
class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        # Encoder: convs extract features, pooling downsamples (captures WHAT is in the image)
        self.enc1 = nn.Conv2d(in_ch, 16, 3, padding=1)
        self.enc2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2)  # halve height & width
        # Decoder: transposed conv UPSAMPLES (recovers WHERE things are)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # After concatenating the skip connection we have 16 + 16 = 32 channels
        self.dec1 = nn.Conv2d(32, 16, 3, padding=1)
        # 1x1 conv produces one score per class, per pixel
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        s1 = torch.relu(self.enc1(x))             # skip feature, full resolution
        x = torch.relu(self.enc2(self.pool(s1)))  # half resolution
        x = self.up(x)                            # back to full resolution
        x = torch.cat([x, s1], dim=1)             # SKIP CONNECTION (concatenate)
        x = torch.relu(self.dec1(x))
        return self.head(x)                       # logits: (N, classes, H, W)

model = TinyUNet(num_classes=3)
image = torch.randn(1, 3, 64, 64) # 1 RGB image, 64x64
logits = model(image)
pred = logits.argmax(dim=1) # class id per pixel
print("logits shape:", tuple(logits.shape))
print("pred shape: ", tuple(pred.shape))
# Expected output:
# logits shape: (1, 3, 64, 64)
# pred shape: (1, 64, 64)
Notice nn.ConvTranspose2d doing the learnable upsampling and torch.cat([x, s1], dim=1) performing the skip connection. The 1×1 head maps features to one score per class per pixel.
4. Upsampling & Transposed Convolution
The bottleneck feature map is tiny — maybe 16×16 for a 256×256 input. The decoder must grow it back to full size. There are two common ways:
- Transposed convolution (ConvTranspose2d, sometimes called "deconvolution" or fractionally-strided conv): a learnable upsampling that inserts spacing between inputs and convolves, doubling the resolution while the network decides how to fill the gaps.
- Bilinear upsampling followed by an ordinary convolution: a non-learned resize, then a conv to clean it up. Cheaper and avoids the checkerboard artefacts transposed conv can produce.
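A 1-D version makes the transposed-convolution bullet easier to follow. The sketch below is plain Python with a hypothetical helper, `conv_transpose_1d`, and hand-picked kernel weights (in a real network the weights are learned, and the operation runs in 2-D with no padding assumed here). Each input value stamps a scaled copy of the kernel into the output, `stride` positions apart.

```python
def conv_transpose_1d(signal, kernel, stride):
    """Each input value 'stamps' a scaled copy of the kernel into the output."""
    out_len = (len(signal) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


# Kernel size == stride: stamps sit side by side, so the length exactly doubles.
doubled = conv_transpose_1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)

# Kernel size 3 with stride 2: neighbouring stamps overlap on every other
# position, so a perfectly flat input comes out uneven.
uneven = conv_transpose_1d([1.0, 1.0, 1.0], kernel=[1.0, 1.0, 1.0], stride=2)

print("doubled:", doubled)
print("uneven: ", uneven)
```

The alternating pattern in the second result is the 1-D analogue of the checkerboard artefact: positions covered by two stamps receive more signal than positions covered by one. Resizing first and convolving afterwards avoids that uneven overlap.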
Either way, the goal is the same: recover the original height and width so you can output one prediction per input pixel. DeepLab takes a different route and uses dilated (atrous) convolutions — convolutions with gaps between weights — to widen the receptive field without downsampling as aggressively, keeping resolution high throughout.
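The effect of dilation can also be sketched in 1-D. This is an illustrative plain-Python helper (`dilated_conv_1d` is a made-up name, not DeepLab code) that applies the same three weights with different gaps between them, using "valid" positions only and no padding.

```python
def dilated_conv_1d(signal, kernel, dilation):
    """'Valid' 1-D convolution with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1  # input width one output can see
    outputs = [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]
    return outputs, span


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

dense, dense_span = dilated_conv_1d(signal, kernel, dilation=1)
atrous, atrous_span = dilated_conv_1d(signal, kernel, dilation=2)

print("dilation 1:", dense, "receptive field:", dense_span)
print("dilation 2:", atrous, "receptive field:", atrous_span)
```

With dilation 2 the same three weights cover five input positions instead of three, so context grows without adding parameters and without a pooling step that would shrink the feature map.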
5. Metrics: IoU and Dice
How do you score a mask? You compare the predicted pixels for a class against the true pixels. Two measures dominate:
- IoU (Intersection over Union), a.k.a. the Jaccard index: overlap / (predicted ∪ truth). The standard benchmark metric.
- Dice coefficient (the F1 score for pixels): 2 × overlap / (predicted + truth), where "predicted" and "truth" are pixel counts. Never lower than IoU (the two are equal only at 0 and 1), and popular as a loss in medical imaging.
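The two formulas are tied together by Dice = 2 × IoU / (1 + IoU), which is why Dice can never fall below IoU. The sketch below checks that on a small flattened binary mask; the helper name `iou_and_dice` and the six-pixel masks are invented for illustration, and the function assumes at least one object pixel appears in either mask so the denominators are non-zero.

```python
def iou_and_dice(pred, truth):
    """IoU and Dice for two flat binary masks (1 = object, 0 = background)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    pred_area = sum(pred)    # number of predicted object pixels
    truth_area = sum(truth)  # number of true object pixels
    union = pred_area + truth_area - intersection
    iou = intersection / union
    dice = 2 * intersection / (pred_area + truth_area)
    return iou, dice


pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]

iou, dice = iou_and_dice(pred, truth)
print("IoU: ", round(iou, 3))
print("Dice:", round(dice, 3))
print("2*IoU/(1+IoU):", round(2 * iou / (1 + iou), 3))

# A perfect prediction scores 1.0 on both metrics.
perfect_iou, perfect_dice = iou_and_dice(truth, truth)
```

Here two of the four pixels in the union overlap, so IoU is 0.5 while Dice is about 0.667. Because one metric is a monotonic function of the other, they always rank two predictions on the same mask in the same order.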
Mean IoU (mIoU) is just the IoU averaged across all classes — the headline number on datasets like Cityscapes and Pascal VOC. The runnable example below computes both IoU and Dice on two tiny masks, entirely in plain Python so you can follow every pixel:
Worked Example: IoU & Dice on Tiny Masks
Compute intersection, union, IoU and Dice between two nested-list masks
# Pixel-wise IoU and Dice between two small masks — PLAIN Python, no libraries
# A "mask" here is a grid (nested list) of class labels, one label per pixel.
# 0 = background, 1 = object (e.g. a car).
prediction = [ # what the model THINKS each pixel is
[0, 0, 1, 1],
[0, 1, 1, 1],
[0, 1, 1, 0],
[0, 0, 0, 0],
]
ground_truth = [ # what each pixel REALLY is (the labelled answer)
[0, 1, 1, 1],
[0, 1, 1, 0],
[0, 1, 1, 0],
[0, 0, 0, 0],
]
# We measure
...
🌍 In Practice: Pretrained U-Net with segmentation-models-pytorch
You almost never hand-build U-Net for real work. Libraries like segmentation_models_pytorch give you battle-tested architectures with pretrained encoders in a few lines. Read-only here; run it locally:
# In practice you rarely build U-Net by hand — use a pretrained library.
import segmentation_models_pytorch as smp
import torch
# A U-Net with a pretrained ResNet encoder, 5 output classes
model = smp.Unet(
    encoder_name="resnet34",     # backbone pretrained on ImageNet
    encoder_weights="imagenet",
    in_channels=3,               # RGB input
    classes=5,                   # one score map per class
)
image = torch.randn(1, 3, 256, 256)
logits = model(image) # (N, classes, H, W) at FULL resolution
masks = logits.argmax(dim=1) # (N, H, W) — class id per pixel
print("logits shape:", tuple(logits.shape))
print("masks shape: ", tuple(masks.shape))
# Expected output:
# logits shape: (1, 5, 256, 256)
# masks shape: (1, 256, 256)
Swap encoder_name for a deeper backbone, or smp.Unet for smp.DeepLabV3Plus, and the rest of your training loop stays identical.
🎯 Your Turn #1: Pixel Accuracy
Fill in the two blanks so the loop counts how many pixels were labelled correctly, then divides by the total. The expected output is in the comments — match it.
Your Turn: Pixel Accuracy
Finish the comparison and the counter to compute pixel accuracy
# 🎯 YOUR TURN — finish pixel accuracy: the fraction of pixels labelled correctly.
prediction = [
[0, 1, 1],
[2, 1, 0],
[0, 0, 2],
]
ground_truth = [
[0, 1, 2], # last pixel wrong
[2, 1, 0],
[0, 1, 2], # middle pixel wrong
]
correct = 0
total = 0
for row in range(len(ground_truth)):
    for col in range(len(ground_truth[0])):
        total += 1
        # 👉 add 1 to "correct" only when the two labels match
        if prediction[row][col] ___ ground_truth[row][col]
...
🎯 Your Turn #2: Argmax Per Pixel
Complete the argmax comparison and the list comprehension so each pixel's highest-scoring class becomes its label. Match the expected output.
Your Turn: Argmax Per Pixel
Finish argmax and the per-pixel comprehension to build a class map
# 🎯 YOUR TURN — pick the winning class for each pixel with argmax.
class_names = ["background", "person", "dog"]
# One [bg, person, dog] score list per pixel, laid out as a 2x2 grid
scores = [
[[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]],
[[0.2, 0.2, 0.6], [0.7, 0.2, 0.1]],
]
def argmax(values):
    best = 0
    for i in range(1, len(values)):
        # 👉 keep "best" pointing at the index of the LARGEST score
        if values[i] ___ values[best]: # 👉 replace ___ with the comparison that m
...
🎯 Mini-Challenge: Mean IoU Over Three Classes
Time to fly with less support. Using only the comment outline, compute the IoU for each of the three classes and then their mean. The expected numbers are in the comments so you can self-check.
Mini-Challenge: Mean IoU
Compute per-class IoU and the mean IoU from scratch
# 🎯 MINI-CHALLENGE: Mean IoU over three classes
#
# You are given a 3x3 prediction and ground-truth mask with labels 0, 1, 2.
# 1. For EACH class c in (0, 1, 2):
# - intersection = pixels where pred == c AND truth == c
# - union = pixels where pred == c OR truth == c
# - class IoU = intersection / union (skip the class if union == 0)
# 2. Mean IoU (mIoU) = average of the per-class IoUs you computed
# 3. print each class IoU and the final mIoU rounded to 3 decimals
...
Common Errors (And How to Fix Them)
❌ Ignoring class imbalance
Background often fills 80–95% of an image. Train with plain cross-entropy and the model learns to predict "background" everywhere — small classes vanish.
✅ Fix: use Dice loss or a weighted/focal cross-entropy so rare classes count more, and always report per-class IoU, not just the average.
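To see why an overlap-based loss helps, the sketch below compares plain cross-entropy with a soft Dice loss on a deliberately imbalanced 20-pixel mask. Everything here is illustrative: the function names, the `smooth=1.0` constant and the probability values are assumptions for the demo, not a reference implementation of any library's loss.

```python
import math


def binary_cross_entropy(probs, targets):
    """Mean per-pixel cross-entropy: every pixel counts the same."""
    eps = 1e-7
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - soft Dice on the foreground: driven by overlap, not pixel count."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1 - (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)


# 20 pixels, only 2 of them foreground (10%).
targets = [1, 1] + [0] * 18
lazy = [0.05] * 20               # predicts "background" everywhere
good = [0.9, 0.9] + [0.05] * 18  # actually finds the object

for name, probs in (("lazy", lazy), ("good", good)):
    print(name,
          "cross-entropy:", round(binary_cross_entropy(probs, targets), 2),
          "dice loss:", round(soft_dice_loss(probs, targets), 2))
```

The lazy prediction gets a cross-entropy of only about 0.35 because the 18 easy background pixels dominate the average, while its Dice loss is about 0.70 because almost none of the foreground is covered. Finding the two object pixels cuts the Dice loss to roughly 0.19, so the rare class drives the gradient.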
❌ Trusting pixel accuracy over IoU
A model scoring 92% pixel accuracy can still miss every pedestrian if pedestrians are a tiny fraction of pixels.
✅ Fix: judge models by mean IoU (mIoU), which weights every class equally regardless of how many pixels it covers.
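A quick numeric check of that gap, as a plain-Python sketch on a made-up flattened 10-pixel image (the helper names `pixel_accuracy` and `mean_iou` are hypothetical):

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the true label."""
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)


def mean_iou(pred, truth, num_classes):
    """Average of per-class IoU, skipping classes absent from both masks."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# 9 background pixels (0) and a single pedestrian pixel (1).
truth = [0] * 9 + [1]
lazy_pred = [0] * 10  # predicts background everywhere

acc = pixel_accuracy(lazy_pred, truth)
miou = mean_iou(lazy_pred, truth, num_classes=2)
print("pixel accuracy:", acc)
print("mean IoU:      ", miou)
```

The model that never finds the pedestrian scores 0.9 pixel accuracy but only 0.45 mIoU: the background IoU is 0.9, the pedestrian IoU is 0, and the mean treats them as equals.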
❌ Blurry, wrong boundaries
Edges of objects come out smeared and the mask leaks across object borders — usually because spatial detail was lost during heavy downsampling.
✅ Fix: use an architecture with skip connections (U-Net) or dilated convolutions (DeepLab) so high-resolution detail reaches the output.
❌ No skip connections
A pure encoder-decoder with no skips upsamples only from the tiny bottleneck, so it never recovers fine structure and masks look like soft blobs.
✅ Fix: concatenate (or add) the matching encoder feature map into each decoder stage — that is exactly what makes U-Net work.
📋 Quick Reference
Architectures
| Architecture | Key Feature | Best For |
|---|---|---|
| FCN | No dense layers, all conv | The original per-pixel net |
| U-Net | Skip connections | Medical, small datasets |
| DeepLabV3+ | Atrous (dilated) conv | General segmentation |
| Mask R-CNN | Per-instance masks | Instance segmentation |
| SAM | Zero-shot prompts | Universal segmentation |
Metrics & terms
| Term | Formula / Meaning |
|---|---|
| IoU (Jaccard) | intersection / union |
| Dice (F1) | 2 × intersection / (pred + truth) |
| mIoU | Mean of per-class IoU (primary metric) |
| Pixel accuracy | Correct pixels / total (misleading when classes are imbalanced) |
| argmax | Index of the highest class score per pixel |
| Transposed conv | Learnable upsampling in the decoder |
❓ Frequently Asked Questions
Q: What is the difference between semantic, instance, and panoptic segmentation?
A: Semantic segmentation labels every pixel with a class but treats all objects of the same class as one blob (all cars share one 'car' label). Instance segmentation separates individual objects (car #1 vs car #2) but usually only covers 'things'. Panoptic segmentation combines both: every pixel gets a class, and each countable object also gets a unique instance id.
Q: Why does U-Net use skip connections?
A: The encoder shrinks the image to capture context (WHAT is present) but loses fine spatial detail. Skip connections copy high-resolution feature maps from the encoder straight across to the matching decoder stage, so the decoder can recover exact boundaries (WHERE things are). Without them, masks come out blurry along edges.
Q: What is the difference between IoU and Dice?
A: Both measure mask overlap. IoU (Jaccard) is intersection / union. Dice is 2*intersection / (predicted + truth). Dice is never lower than IoU (the two agree only at 0 and 1) and weights the overlap more, which makes it popular as a loss in medical imaging. IoU is the standard reporting metric on benchmarks like Cityscapes and Pascal VOC.
Q: Why is pixel accuracy a misleading metric?
A: If 80% of an image is background, a model that predicts 'background everywhere' scores 80% pixel accuracy while being useless for the classes you care about. Mean IoU averages the IoU of each class equally, so a tiny but important class (like a tumour) counts just as much as the background.
Q: What is transposed convolution used for in segmentation?
A: A transposed (a.k.a. 'deconv' or fractionally-strided) convolution is a learnable way to upsample a feature map back toward the input resolution. The decoder uses it to grow the small, deep feature maps from the bottleneck back into a full-size, per-pixel prediction. Bilinear upsampling is a non-learned alternative.
Lesson 33 complete — you can label every pixel!
You now know the three segmentation flavours, how encoder-decoder networks like FCN and U-Net rebuild a full-resolution mask, why skip connections and transposed convolutions matter, and how to compute IoU, Dice and mIoU by hand.
🚀 Up next: Speech Recognition — switch from pixels to audio and learn how models turn sound waves into text.