Lesson 32 • Advanced
Object Detection: YOLO, SSD & Faster R-CNN
Go from "is there a cat?" to "draw a box around every cat, dog, and car." You'll measure box overlap with IoU, clean up duplicates with NMS, and know which detector to reach for.
What You'll Learn in This Lesson
- ✓ How detection = classification + localisation (bounding boxes)
- ✓ How to compute IoU (Intersection over Union) in plain Python
- ✓ What anchor boxes are and why detectors use them
- ✓ One-stage (YOLO, SSD) vs two-stage (Faster R-CNN) trade-offs
- ✓ How Non-Max Suppression removes duplicate boxes
- ✓ How mAP scores and compares detectors
🎯 Real-World Analogy: Spotting and Boxing Objects in a Photo
Imagine handing a friend a busy holiday photo and a marker pen. Classification is asking "is there a dog in this photo?" — one yes/no answer. Object detection is asking your friend to go further: "draw a box around every dog, every person, and every car, and write what each one is." That is the whole job — find the objects and box them.
Now imagine your friend gets over-excited and scribbles five boxes around the same dog. You'd keep the neatest box and cross out the rest. That clean-up step is exactly what Non-Max Suppression does, and the way you decide two boxes are "the same dog" is by measuring their overlap: that's IoU.
1. Detection = Classification + Localisation
A classifier outputs one label for the whole image: "cat". A detector must output a list of objects, and for each one it predicts three things:
- Class — what it is ("cat", "car", "person").
- Bounding box — where it is, as four numbers.
- Confidence — how sure the model is (0 to 1).
A bounding box is the rectangle drawn around an object. The most common format is the two corners: [x1, y1, x2, y2], meaning the top-left corner (x1, y1) and the bottom-right corner (x2, y2), all in pixels. (YOLO often uses a centre format [cx, cy, w, h] instead: same rectangle, different numbers.)
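To make "same rectangle, different numbers" concrete, here is a minimal sketch of converting between the two formats. The function names `corners_to_centre` and `centre_to_corners` are made up for this illustration and are not from any library. YOLO-style label files typically also divide by the image width and height; that normalisation step is left out here.

```python
def corners_to_centre(box):
    """Convert [x1, y1, x2, y2] to [cx, cy, w, h] (all in pixels)."""
    x1, y1, x2, y2 = box
    w = x2 - x1
    h = y2 - y1
    return [x1 + w / 2, y1 + h / 2, w, h]


def centre_to_corners(box):
    """Convert [cx, cy, w, h] back to [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


corner_box = [50, 50, 200, 200]
centre_box = corners_to_centre(corner_box)
round_trip = centre_to_corners(centre_box)

print(centre_box)   # [125.0, 125.0, 150, 150]
print(round_trip)   # [50.0, 50.0, 200.0, 200.0]
```

The round trip returns the original corners, which is a quick sanity check worth running whenever you mix boxes from different tools.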
# One detection from a model, in corner format:
detection = {
    "class": "cat",
    "box": [50, 50, 200, 200],  # x1, y1, x2, y2 in pixels
    "conf": 0.93,               # 93% confident
}
2. IoU: Measuring Box Overlap
How do you score a predicted box against the true box? You use IoU (Intersection over Union): the area where the two boxes overlap, divided by the total area they cover together.
IoU in one line:
IoU = overlap_area / (area_a + area_b - overlap_area)
- IoU = 0 — the boxes don't touch at all.
- IoU = 1 — the boxes are identical.
- IoU ≥ 0.5 — the usual cut-off for "correct detection".
Here is the full, commented version. Read it, then run it — the output is at the bottom.
# Intersection over Union (IoU): the core detection metric
# IoU = overlap area / combined area. Ranges 0 (no overlap) to 1 (perfect).
# Each box is [x1, y1, x2, y2] = top-left corner and bottom-right corner.
def compute_iou(box_a, box_b):
    # Coordinates of the overlapping rectangle
    x1 = max(box_a[0], box_b[0])  # left edge of overlap: the rightmost left edge
    y1 = max(box_a[1], box_b[1])  # top edge of overlap: the lower of the two top edges
    x2 = min(box_a[2], box_b[2])  # right edge of overlap: the leftmost right edge
    y2 = min(box_a[3], box_b[3])  # bottom edge of overlap: the higher of the two bottom edges
    # Width/height of overlap (0 if the boxes do not touch)
    overlap_w = max(0, x2 - x1)
    overlap_h = max(0, y2 - y1)
    intersection = overlap_w * overlap_h  # shared area
    # Area of each box, then the union (avoid double-counting overlap)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = [50, 50, 200, 200]  # the real object's box
prediction = [60, 60, 210, 210]    # what the model guessed
iou = compute_iou(ground_truth, prediction)
print("IoU:", round(iou, 3))
print("Match (IoU >= 0.5)?", iou >= 0.5)
# Expected output:
# IoU: 0.772
# Match (IoU >= 0.5)? True
The two max(0, ...) calls are important: if the boxes don't overlap, x2 - x1 or y2 - y1 goes negative, and clamping to 0 keeps the intersection at 0. Without the clamp, two negative values would multiply into a fake positive area.
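A short sketch of what the max(0, ...) clamp protects against, using two boxes chosen for this illustration that are separated on both axes. The variable names `unclamped` and `clamped` are only for the demo.

```python
box_a = [0, 0, 10, 10]
box_b = [20, 20, 30, 30]  # entirely below and to the right of box_a

x1 = max(box_a[0], box_b[0])  # 20
y1 = max(box_a[1], box_b[1])  # 20
x2 = min(box_a[2], box_b[2])  # 10
y2 = min(box_a[3], box_b[3])  # 10

# Both differences are negative, so their product is positive.
unclamped = (x2 - x1) * (y2 - y1)            # (-10) * (-10)
clamped = max(0, x2 - x1) * max(0, y2 - y1)  # 0 * 0

print("Without clamping:", unclamped)  # Without clamping: 100
print("With clamping:", clamped)       # With clamping: 0
```

The unclamped version reports 100 square pixels of overlap for boxes that never touch, which would give them a non-zero IoU.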
Try It: Compute IoU of Two Boxes
Run the worked IoU function and check the result against the expected output
🎯 Your Turn: Finish the IoU Function
Fill in the blanks marked ___ to complete the IoU calculation
# 🎯 YOUR TURN: finish the IoU calculation
# Fill in each ___ . Each box is [x1, y1, x2, y2].
def compute_iou(box_a, box_b):
    # Overlap rectangle: inner edges of the two boxes
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # 👉 overlap width and height (clamp negatives to 0 with max(0, ...))
    overlap_w = max(0, ___)  # 👉 use x2 - x1
    overlap_h = max(0, ___)  # 👉 use y2 - y1
    i
...
3. Anchor Boxes: Pre-Set Shapes to Refine
Predicting box coordinates from scratch is hard. Instead, many detectors start from anchor boxes: a fixed set of reference rectangles in a range of shapes and sizes (a tall one for people, a wide one for cars, a square one, and so on). The model doesn't invent a box from nothing: for each anchor it predicts a confidence score plus small offsets that nudge the anchor's size and position, and the anchor that best matches an object ends up as that object's box.
Think of anchors as templates printed on tracing paper. The model slides the best-matching template over the object and tweaks it slightly, which is far easier than drawing freehand.
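The "nudge" can be sketched in a few lines. This assumes a Faster R-CNN-style parameterisation (centre shifts scaled by the anchor size, width and height scaled through an exponential); other detectors, including YOLO versions, use different formulas. The function name `decode_anchor` and the anchor values are made up for this illustration.

```python
import math


def decode_anchor(anchor, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor in [cx, cy, w, h] form."""
    a_cx, a_cy, a_w, a_h = anchor
    dx, dy, dw, dh = offsets
    cx = a_cx + dx * a_w    # shift the centre by a fraction of the anchor width
    cy = a_cy + dy * a_h    # shift the centre by a fraction of the anchor height
    w = a_w * math.exp(dw)  # exp() keeps the width positive
    h = a_h * math.exp(dh)  # exp() keeps the height positive
    return [cx, cy, w, h]


tall_anchor = [100.0, 100.0, 40.0, 120.0]  # a "person-shaped" template

unchanged = decode_anchor(tall_anchor, [0.0, 0.0, 0.0, 0.0])
shifted = decode_anchor(tall_anchor, [0.25, -0.25, 0.0, 0.0])

print(unchanged)  # [100.0, 100.0, 40.0, 120.0]
print(shifted)    # [110.0, 70.0, 40.0, 120.0]
```

All-zero offsets return the anchor untouched, so the network only has to learn small corrections rather than absolute pixel coordinates.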
4. One-Stage vs Two-Stage Detectors
There are two broad families of detector, and they trade speed against accuracy:
One-stage — YOLO, SSD
Predict every box and class in a single pass over the image. "You Only Look Once." Very fast (real-time video, mobile, edge devices), slightly weaker on tiny or crowded objects.
Two-stage — Faster R-CNN
First propose regions that might hold an object, then classify and refine each one. Slower, but typically more accurate, especially on small and overlapping objects.
A good default: reach for YOLO when you need speed or real-time, and Faster R-CNN when accuracy on hard images matters more than frame rate.
🌍 Worked Example: Real Detection with YOLO (Ultralytics)
In practice you rarely write IoU and NMS by hand; a library does it. This is read-only (it needs pip install ultralytics and an image), but it shows how little code real detection takes. Ultralytics runs NMS for you, so the boxes come back already de-duplicated.
# Real detection with Ultralytics YOLO (run locally: pip install ultralytics)
# YOLO = "You Only Look Once": one forward pass over the whole image.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                # tiny pretrained model (COCO, 80 classes)
results = model("street.jpg", conf=0.25)  # conf = confidence threshold

# Ultralytics applies NMS for you, so you get clean, de-duplicated boxes.
for box in results[0].boxes:
    cls_id = int(box.cls[0])
    label = model.names[cls_id]            # e.g. "person", "car"
    conf = float(box.conf[0])
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
    print(f"{label:8} {conf:.2f} box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
# Expected output (depends on the image):
# person   0.91 box=(34,58,121,402)
# car      0.88 box=(220,180,540,360)
# dog      0.76 box=(410,260,505,395)
5. Non-Max Suppression (NMS): Removing Duplicates
A raw detector fires dozens of overlapping boxes around each object. Non-Maximum Suppression (NMS) tidies them up with a simple rule: sort boxes by confidence, keep the most confident one, then throw away every other box that overlaps it too much (IoU above a threshold, usually 0.5). Repeat with the most confident box that remains until none are left.
Here is NMS written out in plain Python, reusing the IoU function. Run it and watch four boxes become two.
# Non-Maximum Suppression (NMS): keep the best box, drop overlapping duplicates.
# A detector often fires several boxes for ONE object. NMS cleans that up.
def compute_iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Each detection: (box, confidence, class)
detections = [
    ([48, 48, 202, 202], 0.95, "cat"),
    ([50, 52, 198, 200], 0.87, "cat"),  # overlaps the cat above
    ([52, 46, 205, 198], 0.72, "cat"),  # also the same cat
    ([300, 100, 400, 250], 0.91, "dog"),
]

def nms(dets, iou_threshold=0.5):
    # 1) Sort by confidence, highest first
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf, cls in dets:
        # 2) Keep this box only if it does not overlap a kept box of the same class
        duplicate = any(
            cls == k_cls and compute_iou(box, k_box) > iou_threshold
            for k_box, k_conf, k_cls in kept
        )
        if not duplicate:
            kept.append((box, conf, cls))
    return kept

print("Before NMS:", len(detections), "boxes")
result = nms(detections)
print("After NMS: ", len(result), "boxes")
for box, conf, cls in result:
    print(f" {cls}: conf={conf:.2f} box={box}")
# Expected output:
# Before NMS: 4 boxes
# After NMS:  2 boxes
#  cat: conf=0.95 box=[48, 48, 202, 202]
#  dog: conf=0.91 box=[300, 100, 400, 250]
Try It: Simple NMS Over a Few Boxes
Run NMS and watch overlapping duplicates get removed
🎯 Your Turn: Finish the NMS Keep/Drop Test
Fill in the blanks to decide whether a box is a duplicate
# 🎯 YOUR TURN: finish the NMS "keep or drop" decision
# Keep a box only if it does NOT overlap an already-kept box too much.
def compute_iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter
...
6. mAP: Scoring and Comparing Detectors
One number summarises a detector's quality: mAP (mean Average Precision). For each class you measure precision across recall levels to get its Average Precision, then average across all classes. A box counts as correct only if its IoU with the truth clears a threshold.
- mAP50 — uses a single IoU threshold of 0.50 (more forgiving).
- mAP50-95 — averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05; the headline COCO score. Higher means tighter boxes.
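To see where the number comes from, here is a minimal sketch of Average Precision for one class at a single IoU threshold, then the mean across classes. It assumes each detection has already been marked as a hit or a miss against the ground truth, and it uses all-point interpolation of the precision-recall curve (Pascal VOC style); COCO's official evaluator samples the curve differently. The function name `average_precision` and the toy data are invented for this illustration.

```python
def average_precision(scored_hits, num_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs.

    Assumes num_ground_truth > 0.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _conf, is_hit in ranked:
        if is_hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Precision envelope: each point takes the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the stepped precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


per_class = {
    # cat: hit, miss, hit against 2 real cats
    "cat": average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2),
    # dog: one confident hit against 1 real dog
    "dog": average_precision([(0.95, True)], num_ground_truth=1),
}
mean_ap = sum(per_class.values()) / len(per_class)

print({cls: round(ap, 3) for cls, ap in per_class.items()})  # {'cat': 0.833, 'dog': 1.0}
print("mAP:", round(mean_ap, 3))                             # mAP: 0.917
```

Repeating this with the hit/miss labels recomputed at each IoU threshold, then averaging the results, gives a number in the spirit of mAP50-95.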
This is read-only (it needs the library and a dataset), but it shows how you'd measure mAP in practice:
# Evaluating a detector: mAP (mean Average Precision) with Ultralytics.
from ultralytics import YOLO
model = YOLO("yolov8n.pt")
metrics = model.val(data="coco128.yaml") # run validation on a labelled set
print(f"mAP50: {metrics.box.map50:.3f}") # IoU threshold 0.50 only
print(f"mAP50-95: {metrics.box.map:.3f}") # averaged over IoU 0.50..0.95
# Expected output (approximate):
# mAP50: 0.61
# mAP50-95: 0.45
# Rule of thumb when picking a detector:
# 1-stage (YOLO, SSD)     -> fast, great for real-time / edge / video
# 2-stage (Faster R-CNN)  -> slower, stronger on small + crowded objects
# mAP50-95 is the headline number on the COCO leaderboard: higher = tighter boxes.
Common Errors (And How to Fix Them)
❌ Wrong IoU threshold
Setting the match threshold too low counts sloppy boxes as correct; too high rejects boxes that are actually fine.
✅ Fix:
# Use 0.5 as the standard "is this a correct detection" cut-off.
match = iou >= 0.5  # COCO AP50; report mAP50-95 for the full picture
# Don't reuse the SAME number for the NMS overlap test without thinking:
# they're different jobs (matching vs. de-duplicating).
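A small sketch of how much the verdict depends on the threshold: the same predicted box is scored against the same ground truth at three cut-offs. The boxes and the three thresholds are chosen only for this illustration.

```python
def compute_iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = [0, 0, 100, 100]
prediction = [10, 0, 110, 100]  # same size, shifted 10 px to the right

iou = compute_iou(ground_truth, prediction)  # 9000 / 11000, about 0.818
matches = {t: iou >= t for t in (0.5, 0.75, 0.9)}

for threshold, is_match in matches.items():
    print(f"threshold {threshold:.2f}: match = {is_match}")
# threshold 0.50: match = True
# threshold 0.75: match = True
# threshold 0.90: match = False
```

A 10-pixel shift on a 100-pixel box already fails the strictest cut-off, which is why a detector's mAP50-95 is always lower than its mAP50.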
❌ Forgetting NMS entirely
Raw model output has many overlapping boxes. Without NMS you draw five boxes on one object.
✅ Fix:
# Always run NMS on raw outputs. Libraries do it for you:
results = model("img.jpg", iou=0.5, conf=0.25) # iou = NMS overlap threshold
# Writing your own loop? Sort by confidence, then suppress high-IoU overlaps.
❌ Class imbalance
If 95% of your training boxes are "car" and 5% are "bicycle", the model learns to ignore bicycles and your per-class mAP for the rare class collapses — even though overall accuracy looks fine.
✅ Fix:
# Gather more examples of rare classes, oversample them, or weight the loss.
# Always read PER-CLASS mAP, not just the overall average, to catch this.
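One way to spot the imbalance and derive loss weights is to count boxes per class. This is a sketch with made-up label data matching the 95% / 5% example; the inverse-frequency formula shown is one common heuristic, not the only option, and how the weights are passed to a loss depends on your training framework.

```python
from collections import Counter

# Hypothetical training labels: one class name per annotated box.
box_labels = ["car"] * 95 + ["bicycle"] * 5

counts = Counter(box_labels)
total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weight: total / (num_classes * count).
# A perfectly balanced dataset would give every class a weight of 1.0.
class_weights = {cls: total / (num_classes * n) for cls, n in counts.items()}

for cls, n in counts.most_common():
    print(cls, "boxes:", n, "weight:", round(class_weights[cls], 2))
# car boxes: 95 weight: 0.53
# bicycle boxes: 5 weight: 10.0
```

Here each bicycle box counts roughly 19 times as much as a car box, which offsets the 19-to-1 difference in how often the model sees them.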
❌ Small objects get missed
Tiny objects (distant faces, far-off signs) shrink to a few pixels after downsampling and vanish, or no anchor shape fits them.
✅ Fix:
# Train and infer at a higher resolution, e.g.:
results = model("img.jpg", imgsz=1280) # bigger input keeps small objects alive
# Add anchor sizes (or feature levels) suited to small boxes; tile large images.
📋 Quick Reference
| Concept | What It Does | Key Number |
|---|---|---|
| Bounding box | Locates an object | [x1, y1, x2, y2] |
| IoU | Measures box overlap | ≥ 0.5 = match |
| Anchor box | Pre-set shape to refine | matched by IoU |
| NMS | Removes duplicate boxes | IoU > 0.5 suppressed |
| One-stage | YOLO / SSD — fast | real-time FPS |
| Two-stage | Faster R-CNN — accurate | higher mAP |
| mAP50-95 | Overall detector quality | higher = better |
❓ Frequently Asked Questions
Q: What is the difference between image classification and object detection?
A: Classification answers 'what is in this image?' with a single label for the whole picture. Object detection answers 'what objects are here AND where is each one?' by drawing a bounding box around every object and labelling it. Detection is classification plus localisation.
Q: What is IoU (Intersection over Union)?
A: IoU measures how much a predicted box overlaps the true box. It is the area where the two boxes overlap divided by the total area they cover together. It ranges from 0 (no overlap) to 1 (perfect overlap). A prediction usually counts as correct when IoU is at least 0.5.
Q: Why do I need Non-Maximum Suppression (NMS)?
A: A detector outputs many overlapping boxes for the same object. NMS keeps the box with the highest confidence and removes every other box that overlaps it too much (high IoU). Without NMS you get five boxes drawn around one cat instead of one.
Q: What is the difference between one-stage and two-stage detectors?
A: One-stage detectors (YOLO, SSD) predict boxes and classes in a single pass over the image — fast, great for real-time video. Two-stage detectors (Faster R-CNN) first propose regions that might contain objects, then classify each one — slower but usually more accurate on small or crowded objects.
Q: What does mAP mean when comparing detectors?
A: mAP (mean Average Precision) is the standard score for detectors. For each class you measure precision across recall levels to get its Average Precision, then average across all classes. COCO reports mAP averaged over IoU thresholds from 0.5 to 0.95, so a higher mAP means better and tighter boxes.
🎯 Mini-Challenge: Count the Hits
Time to fly solo. Using IoU, count how many predicted boxes actually hit the ground-truth object. The starter block gives you the brief and the data — write the logic yourself.
Mini-Challenge
Count predictions whose IoU with the ground truth is at least 0.5
# 🎯 MINI-CHALLENGE: count how many predictions "hit" the ground truth
#
# You are given one ground-truth box and a list of predicted boxes.
# 1. Write (or reuse) a compute_iou(box_a, box_b) function.
# 2. A prediction is a HIT when its IoU with the ground truth is >= 0.5.
# 3. Print how many predictions hit, e.g. "Hits: 2 / 4".
#
# Starter data:
# ground_truth = [50, 50, 150, 150]
# predictions = [[52, 48, 150, 152], [60, 60, 160, 160],
# [0, 0, 40, 40], [200, 200, 260, 260]]
#
...
Lesson complete: you can detect and localise objects!
You can describe a detection as class + box + confidence, compute IoU by hand, run NMS to remove duplicates, choose between one-stage and two-stage detectors, and read an mAP score. These are the building blocks behind every modern detector.
🚀 Up next: Semantic Segmentation — go beyond boxes and label every single pixel in the image.