    Lesson 32 • Advanced

    Object Detection: YOLO, SSD & Faster R-CNN

    Go from "is there a cat?" to "draw a box around every cat, dog, and car." You'll measure box overlap with IoU, clean up duplicates with NMS, and know which detector to reach for.

    What You'll Learn in This Lesson

    • How detection = classification + localisation (bounding boxes)
    • How to compute IoU (Intersection over Union) in plain Python
    • What anchor boxes are and why detectors use them
    • One-stage (YOLO, SSD) vs two-stage (Faster R-CNN) trade-offs
    • How Non-Max Suppression removes duplicate boxes
    • How mAP scores and compares detectors

    🎯 Real-World Analogy: Spotting and Boxing Objects in a Photo

    Imagine handing a friend a busy holiday photo and a marker pen. Classification is asking "is there a dog in this photo?" — one yes/no answer. Object detection is asking your friend to go further: "draw a box around every dog, every person, and every car, and write what each one is." That is the whole job — find the objects and box them.

    Now imagine your friend gets over-excited and scribbles five boxes around the same dog. You'd keep the neatest box and cross out the rest. That clean-up step is exactly what Non-Max Suppression does, and the way you decide two boxes are "the same dog" is by measuring their overlap: that's IoU.

    1. Detection = Classification + Localisation

    A classifier outputs one label for the whole image: "cat". A detector must output a list of objects, and for each one it predicts three things:

    • Class — what it is ("cat", "car", "person").
    • Bounding box — where it is, as four numbers.
    • Confidence — how sure the model is (0 to 1).

    A bounding box is the rectangle drawn around an object. The most common format is the two corners, [x1, y1, x2, y2]: the top-left corner (x1, y1) and the bottom-right corner (x2, y2), all in pixels. (YOLO often uses a centre format [cx, cy, w, h] instead: same rectangle, different numbers.)
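Because the corner and centre formats describe the same rectangle, converting between them is plain arithmetic. The sketch below is illustrative only: the helper names `corners_to_centre` and `centre_to_corners` are made up for this example and do not come from any library. It stays in pixels, whereas YOLO label files usually store the centre format normalised by image width and height.

```python
def corners_to_centre(box):
    """Convert [x1, y1, x2, y2] to [cx, cy, w, h], all in pixels."""
    x1, y1, x2, y2 = box
    w = x2 - x1
    h = y2 - y1
    return [x1 + w / 2, y1 + h / 2, w, h]


def centre_to_corners(box):
    """Convert [cx, cy, w, h] back to [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


corner_box = [50, 50, 200, 200]
centre_box = corners_to_centre(corner_box)
round_trip = centre_to_corners(centre_box)

print(centre_box)   # [125.0, 125.0, 150, 150]
print(round_trip)   # [50.0, 50.0, 200.0, 200.0]
```

Always check which format a model or dataset expects before computing IoU; feeding centre-format numbers into a corner-format function gives silently wrong overlaps.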

    # One detection from a model, in corner format:
    detection = {
        "class": "cat",
        "box":   [50, 50, 200, 200],   # x1, y1, x2, y2 in pixels
        "conf":  0.93,                 # 93% confident
    }
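A detector returns a list of such records, and the confidence field is typically used to discard weak guesses before anything else happens. A minimal sketch, assuming the same dictionary layout as above; the three detections and the helper name `filter_by_confidence` are invented for illustration.

```python
# Made-up detections in the same layout as the single example above.
detections = [
    {"class": "cat", "box": [50, 50, 200, 200],   "conf": 0.93},
    {"class": "dog", "box": [300, 100, 400, 250], "conf": 0.41},
    {"class": "car", "box": [10, 220, 120, 300],  "conf": 0.08},
]


def filter_by_confidence(dets, min_conf=0.25):
    """Keep only detections whose confidence is at least min_conf."""
    return [d for d in dets if d["conf"] >= min_conf]


confident = filter_by_confidence(detections)
for d in confident:
    print(d["class"], d["conf"])
# cat 0.93
# dog 0.41
```

Raising `min_conf` trades missed objects for fewer false alarms; lowering it does the opposite.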

    2. IoU: Measuring Box Overlap

    How do you score a predicted box against the true box? You use IoU (Intersection over Union): the area where the two boxes overlap, divided by the total area they cover together.

    IoU in one line:

    IoU = overlap_area / (area_a + area_b - overlap_area)

    • IoU = 0 — the boxes don't touch at all.
    • IoU = 1 — the boxes are identical.
    • IoU ≥ 0.5 — the usual cut-off for "correct detection".

    Here is the full, commented version. Read it, then run it — the output is at the bottom.

    # Intersection over Union (IoU) — the core detection metric
    # IoU = overlap area / combined area. Ranges 0 (no overlap) to 1 (perfect).
    # Each box is [x1, y1, x2, y2] = top-left corner and bottom-right corner.
    
    def compute_iou(box_a, box_b):
        # Coordinates of the overlapping rectangle
        x1 = max(box_a[0], box_b[0])      # rightmost of the two left edges
        y1 = max(box_a[1], box_b[1])      # lower of the two top edges
        x2 = min(box_a[2], box_b[2])      # leftmost of the two right edges
        y2 = min(box_a[3], box_b[3])      # higher of the two bottom edges
    
        # Width/height of overlap (0 if the boxes do not touch)
        overlap_w = max(0, x2 - x1)
        overlap_h = max(0, y2 - y1)
        intersection = overlap_w * overlap_h          # shared area
    
        # Area of each box, then the union (avoid double-counting overlap)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - intersection
    
        return intersection / union if union > 0 else 0.0
    
    ground_truth = [50, 50, 200, 200]     # the real object's box
    prediction   = [60, 60, 210, 210]     # what the model guessed
    
    iou = compute_iou(ground_truth, prediction)
    print("IoU:", round(iou, 3))
    print("Match (IoU >= 0.5)?", iou >= 0.5)
    
    # Expected output:
    # IoU: 0.772
    # Match (IoU >= 0.5)? True

    The two max(0, ...) calls are important: when the boxes don't overlap, x2 - x1 or y2 - y1 goes negative. Clamping each to 0 keeps the intersection at 0; without it, two negative sides would multiply into a fake positive area.
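To see why the clamp matters, the sketch below takes two boxes that are far apart and computes the overlap area both ways. The box coordinates are made up for illustration.

```python
box_a = [0, 0, 10, 10]
box_b = [20, 20, 30, 30]   # entirely below and to the right of box_a

x1 = max(box_a[0], box_b[0])   # 20
y1 = max(box_a[1], box_b[1])   # 20
x2 = min(box_a[2], box_b[2])   # 10
y2 = min(box_a[3], box_b[3])   # 10

# Without clamping: (-10) * (-10) = 100, a positive "overlap" that does not exist.
unclamped = (x2 - x1) * (y2 - y1)

# With clamping: both sides become 0, so the overlap is correctly 0.
clamped = max(0, x2 - x1) * max(0, y2 - y1)

print(unclamped, clamped)   # 100 0
```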

    🎯 Your Turn: Finish the IoU Function

    Fill in the blanks marked ___ to complete the IoU calculation

    Try it Yourself »
    Python
    # 🎯 YOUR TURN: finish the IoU calculation
    # Fill in each ___ . Each box is [x1, y1, x2, y2].
    
    def compute_iou(box_a, box_b):
        # Overlap rectangle: inner edges of the two boxes
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
    
        # 👉 overlap width and height (clamp negatives to 0 with max(0, ...))
        overlap_w = max(0, ___)          # 👉 use x2 - x1
        overlap_h = max(0, ___)          # 👉 use y2 - y1
        i
    ...

    3. Anchor Boxes: Pre-Set Shapes to Refine

    Predicting box coordinates from scratch is hard. Instead, many detectors start from anchor boxes: a fixed set of reference rectangles in a range of shapes and sizes (a tall one for people, a wide one for cars, a square one, and so on). The model doesn't invent a box — it picks the closest anchor and nudges its size and position to fit the object.

    Think of anchors as templates printed on tracing paper. The model slides the best-matching template over the object and tweaks it slightly, which is far easier than drawing freehand.
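The "pick a template, then tweak it" idea can be sketched in a few lines. This is a simplified illustration, not how any particular detector encodes its outputs: the anchor shapes, the `shape_iou` and `best_anchor` helpers, and the plain width/height scale factors are all assumptions made for the example (real models usually learn encoded offsets for position and size).

```python
# Hypothetical anchor shapes as (width, height) in pixels.
ANCHORS = {"tall": (60, 160), "wide": (160, 60), "square": (100, 100)}


def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share a centre, so only width and height matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(obj_wh):
    """Name of the anchor whose shape overlaps the object's shape the most."""
    return max(ANCHORS, key=lambda name: shape_iou(ANCHORS[name], obj_wh))


obj_wh = (70, 180)                    # e.g. a standing person
name = best_anchor(obj_wh)
anchor_w, anchor_h = ANCHORS[name]

# The "nudge": how much the chosen anchor must stretch to fit the object.
scale_w = obj_wh[0] / anchor_w
scale_h = obj_wh[1] / anchor_h

print(name, round(scale_w, 3), round(scale_h, 3))   # tall 1.167 1.125
```

The scale factors end up close to 1, which is the point: small corrections to a sensible starting shape are easier to learn than raw coordinates.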

    4. One-Stage vs Two-Stage Detectors

    There are two broad families of detector, and they trade speed against accuracy:

    One-stage — YOLO, SSD

    Predict every box and class in a single pass over the image. "You Only Look Once." Very fast (real-time video, mobile, edge devices), slightly weaker on tiny or crowded objects.

    Two-stage — Faster R-CNN

    First propose regions that might hold an object, then classify and refine each one. Slower, but typically more accurate, especially on small and overlapping objects.

    A good default: reach for YOLO when you need speed or real-time, and Faster R-CNN when accuracy on hard images matters more than frame rate.

    🌍 Worked Example: Real Detection with YOLO (Ultralytics)

    In practice you rarely write IoU and NMS by hand; a library does it. This is read-only (it needs pip install ultralytics and an image), but it shows how little code real detection takes. Ultralytics runs NMS for you, so the boxes come back already de-duplicated.

    # Real detection with Ultralytics YOLO (run locally: pip install ultralytics)
    # YOLO = "You Only Look Once" — one forward pass over the whole image.
    from ultralytics import YOLO
    
    model = YOLO("yolov8n.pt")            # tiny pretrained model (COCO, 80 classes)
    results = model("street.jpg", conf=0.25)   # conf = confidence threshold
    
    # Ultralytics applies NMS for you, so you get clean, de-duplicated boxes.
    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        label  = model.names[cls_id]      # e.g. "person", "car"
        conf   = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # corner coordinates in pixels
        print(f"{label:8} {conf:.2f}  box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
    
    # Expected output (depends on the image):
    # person   0.91  box=(34,58,121,402)
    # car      0.88  box=(220,180,540,360)
    # dog      0.76  box=(410,260,505,395)

    5. Non-Max Suppression (NMS): Removing Duplicates

    A raw detector fires dozens of overlapping boxes around each object. Non-Maximum Suppression (NMS) tidies them up with a simple rule: sort boxes by confidence, keep the most confident one, then throw away every other box that overlaps it too much (IoU above a threshold, usually 0.5). Repeat with the remaining boxes until none are left.

    Here is NMS written out in plain Python, reusing the IoU function. Run it and watch four boxes become two.

    # Non-Maximum Suppression (NMS): keep the best box, drop overlapping duplicates.
    # A detector often fires several boxes for ONE object. NMS cleans that up.
    
    def compute_iou(box_a, box_b):
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        intersection = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - intersection
        return intersection / union if union > 0 else 0.0
    
    # Each detection: (box, confidence, class)
    detections = [
        ([48, 48, 202, 202], 0.95, "cat"),
        ([50, 52, 198, 200], 0.87, "cat"),   # overlaps the cat above
        ([52, 46, 205, 198], 0.72, "cat"),   # also the same cat
        ([300, 100, 400, 250], 0.91, "dog"),
    ]
    
    def nms(dets, iou_threshold=0.5):
        # 1) Sort by confidence, highest first
        dets = sorted(dets, key=lambda d: d[1], reverse=True)
        kept = []
        for box, conf, cls in dets:
            # 2) Keep this box only if it does not overlap a kept box of the same class
            duplicate = any(
                cls == k_cls and compute_iou(box, k_box) > iou_threshold
                for k_box, k_conf, k_cls in kept
            )
            if not duplicate:
                kept.append((box, conf, cls))
        return kept
    
    print("Before NMS:", len(detections), "boxes")
    result = nms(detections)
    print("After NMS: ", len(result), "boxes")
    for box, conf, cls in result:
        print(f"  {cls}: conf={conf:.2f} box={box}")
    
    # Expected output:
    # Before NMS: 4 boxes
    # After NMS:  2 boxes
    #   cat: conf=0.95 box=[48, 48, 202, 202]
    #   dog: conf=0.91 box=[300, 100, 400, 250]
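The `cls == k_cls` check in the NMS loop is easy to overlook, so here is a sketch of what it protects against. Two different objects can legitimately overlap a lot (a dog on someone's lap), and class-agnostic suppression would delete one of them. The `class_aware` switch and the two detections are hypothetical, added only for this comparison.

```python
def compute_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, iou_threshold=0.5, class_aware=True):
    """Greedy NMS; class_aware=False suppresses across classes too."""
    kept = []
    for box, conf, cls in sorted(dets, key=lambda d: d[1], reverse=True):
        duplicate = any(
            (cls == k_cls or not class_aware)
            and compute_iou(box, k_box) > iou_threshold
            for k_box, _, k_cls in kept
        )
        if not duplicate:
            kept.append((box, conf, cls))
    return kept


# Two real objects whose boxes overlap heavily (IoU is about 0.86).
detections = [
    ([100, 100, 300, 300], 0.90, "person"),
    ([110, 120, 300, 300], 0.80, "dog"),
]

per_class = nms(detections, class_aware=True)
agnostic = nms(detections, class_aware=False)
print(len(per_class), len(agnostic))   # 2 1
```

With the class check, both objects survive; without it, the dog is suppressed as if it were a duplicate of the person.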

    🎯 Your Turn: Finish the NMS Keep/Drop Test

    Fill in the blanks to decide whether a box is a duplicate

    Try it Yourself »
    Python
    # 🎯 YOUR TURN: finish the NMS "keep or drop" decision
    # Keep a box only if it does NOT overlap an already-kept box too much.
    
    def compute_iou(box_a, box_b):
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter
    ...

    6. mAP: Scoring and Comparing Detectors

    One number summarises a detector's quality: mAP (mean Average Precision). For each class you measure precision across recall levels to get its Average Precision, then average across all classes. A box counts as correct only if its IoU with the truth clears a threshold.

    • mAP50: uses a single IoU threshold of 0.50 (more forgiving).
    • mAP50-95: averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05; the headline COCO score. Higher means tighter boxes.
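The "precision across recall levels" step can be sketched in plain Python. This is one common variant (area under the precision-recall curve after making precision non-increasing), written under simplifying assumptions: each prediction has already been marked hit or miss at a single IoU threshold, and the function name, inputs, and sample data are invented for illustration. Benchmark tools differ in how they interpolate the curve, so their numbers will not match this sketch exactly.

```python
def average_precision(scored_hits, num_ground_truth):
    """AP for ONE class from (confidence, is_hit) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)

    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        true_positives += int(is_hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Precision envelope: each point takes the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the stepped precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Made-up results: (confidence, was it a hit at IoU >= 0.5?)
cat_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
dog_ap = average_precision([(0.95, True), (0.6, True)], num_ground_truth=2)
mean_ap = (cat_ap + dog_ap) / 2

print(round(cat_ap, 3), round(dog_ap, 3), round(mean_ap, 3))   # 0.833 1.0 0.917
```

Repeating this at each IoU threshold from 0.50 to 0.95 and averaging the results is the idea behind mAP50-95.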

    This is read-only (it needs the library and a dataset), but it shows how you'd measure mAP in practice:

    # Evaluating a detector: mAP (mean Average Precision) with Ultralytics.
    from ultralytics import YOLO
    
    model = YOLO("yolov8n.pt")
    metrics = model.val(data="coco128.yaml")   # run validation on a labelled set
    
    print(f"mAP50:    {metrics.box.map50:.3f}")   # IoU threshold 0.50 only
    print(f"mAP50-95: {metrics.box.map:.3f}")      # averaged over IoU 0.50..0.95
    
    # Expected output (approximate, rounded here to two decimals):
    # mAP50:    0.61
    # mAP50-95: 0.45
    
    # Rule of thumb when picking a detector:
    #   1-stage (YOLO, SSD)   -> fast, great for real-time / edge / video
    #   2-stage (Faster R-CNN)-> slower, stronger on small + crowded objects
    # mAP50-95 is the headline number on the COCO leaderboard: higher = tighter boxes.

    Common Errors (And How to Fix Them)

    ❌ Wrong IoU threshold

    Setting the match threshold too low counts sloppy boxes as correct; too high rejects boxes that are actually fine.

    ✅ Fix:

    # Use 0.5 as the standard "is this a correct detection" cut-off.
    match = iou >= 0.5        # COCO AP50; report mAP50-95 for the full picture
    # Don't reuse the SAME number for the NMS overlap test without thinking —
    # they're different jobs (matching vs. de-duplicating).
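How much the threshold matters is easiest to see by scoring one prediction at several cut-offs. A small sketch with made-up boxes; the thresholds 0.5, 0.75, and 0.9 are chosen only to show the effect.

```python
def compute_iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = [0, 0, 100, 100]
prediction = [10, 0, 110, 100]     # shifted 10 px to the right

iou = compute_iou(ground_truth, prediction)
verdicts = {}
for threshold in (0.5, 0.75, 0.9):
    verdicts[threshold] = "hit" if iou >= threshold else "miss"
    print(f"IoU {iou:.3f} at threshold {threshold}: {verdicts[threshold]}")

# IoU 0.818 at threshold 0.5: hit
# IoU 0.818 at threshold 0.75: hit
# IoU 0.818 at threshold 0.9: miss
```

The same box is a success or a failure depending on the cut-off, which is why results should always state the threshold they were scored at.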

    ❌ Forgetting NMS entirely

    Raw model output has many overlapping boxes. Without NMS you draw five boxes on one object.

    ✅ Fix:

    # Always run NMS on raw outputs. Libraries do it for you:
    results = model("img.jpg", iou=0.5, conf=0.25)   # iou = NMS overlap threshold
    # Writing your own loop? Sort by confidence, then suppress high-IoU overlaps.

    ❌ Class imbalance

    If 95% of your training boxes are "car" and 5% are "bicycle", the model learns to ignore bicycles and your per-class mAP for the rare class collapses — even though overall accuracy looks fine.

    ✅ Fix:

    # Gather more examples of rare classes, oversample them, or weight the loss.
    # Always read PER-CLASS mAP, not just the overall average, to catch this.
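As a rough illustration of "weight the loss", the sketch below derives inverse-frequency class weights from label counts, using the 95% / 5% split described above. The formula `total / (num_classes * count)` is one common heuristic, not the only option, and how the weights are passed to a training loop depends on the framework, which is not shown here.

```python
from collections import Counter

# Made-up training labels matching the 95% car / 5% bicycle example.
labels = ["car"] * 95 + ["bicycle"] * 5

counts = Counter(labels)
total = sum(counts.values())
num_classes = len(counts)

# Rare classes get large weights, common classes get small ones.
weights = {cls: total / (num_classes * n) for cls, n in counts.items()}

for cls, weight in weights.items():
    print(f"{cls}: count={counts[cls]} weight={weight:.3f}")
# car: count=95 weight=0.526
# bicycle: count=5 weight=10.000
```

With these weights, a mistake on a bicycle costs roughly 19 times as much as a mistake on a car, which pushes the model to stop ignoring the rare class.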

    ❌ Small objects get missed

    Tiny objects (distant faces, far-off signs) shrink to a few pixels after downsampling and vanish, or no anchor shape fits them.

    ✅ Fix:

    # Train and infer at a higher resolution, e.g.:
    results = model("img.jpg", imgsz=1280)   # bigger input keeps small objects alive
    # Add anchor sizes (or feature levels) suited to small boxes; tile large images.
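Tiling a large image means cutting it into overlapping windows, running the detector on each window at full resolution, then shifting the boxes back by the window's offset. The sketch below covers only the window geometry; `tile_windows`, the 640-pixel tile size, and the 20% overlap are illustrative assumptions rather than settings from any library.

```python
def tile_windows(img_w, img_h, tile=640, overlap=0.2):
    """Return [x1, y1, x2, y2] windows that cover the image with overlap."""
    stride = int(tile * (1 - overlap))

    def starts(size):
        # An image smaller than one tile needs a single window.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)   # last window sits flush with the edge
        return positions

    return [
        [x, y, min(x + tile, img_w), min(y + tile, img_h)]
        for y in starts(img_h)
        for x in starts(img_w)
    ]


windows = tile_windows(1920, 1080)
print(len(windows))    # 8
print(windows[0])      # [0, 0, 640, 640]
print(windows[-1])     # [1280, 440, 1920, 1080]
```

The overlap is there so that an object cut by one window's edge appears whole in a neighbouring window; the resulting duplicate detections are then removed by running NMS over the merged boxes.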

    📋 Quick Reference

    Concept        What It Does               Key Number
    Bounding box   Locates an object          [x1, y1, x2, y2]
    IoU            Measures box overlap       ≥ 0.5 = match
    Anchor box     Pre-set shape to refine    matched by IoU
    NMS            Removes duplicate boxes    IoU > 0.5 suppressed
    One-stage      YOLO / SSD (fast)          real-time FPS
    Two-stage      Faster R-CNN (accurate)    higher mAP
    mAP50-95       Overall detector quality   higher = better

    ❓ Frequently Asked Questions

    Q: What is the difference between image classification and object detection?

    A: Classification answers 'what is in this image?' with a single label for the whole picture. Object detection answers 'what objects are here AND where is each one?' by drawing a bounding box around every object and labelling it. Detection is classification plus localisation.

    Q: What is IoU (Intersection over Union)?

    A: IoU measures how much a predicted box overlaps the true box. It is the area where the two boxes overlap divided by the total area they cover together. It ranges from 0 (no overlap) to 1 (perfect overlap). A prediction usually counts as correct when IoU is at least 0.5.

    Q: Why do I need Non-Maximum Suppression (NMS)?

    A: A detector outputs many overlapping boxes for the same object. NMS keeps the box with the highest confidence and removes every other box that overlaps it too much (high IoU). Without NMS you get five boxes drawn around one cat instead of one.

    Q: What is the difference between one-stage and two-stage detectors?

    A: One-stage detectors (YOLO, SSD) predict boxes and classes in a single pass over the image — fast, great for real-time video. Two-stage detectors (Faster R-CNN) first propose regions that might contain objects, then classify each one — slower but usually more accurate on small or crowded objects.

    Q: What does mAP mean when comparing detectors?

    A: mAP (mean Average Precision) is the standard score for detectors. For each class you measure precision across recall levels to get its Average Precision, then average across all classes. COCO reports mAP averaged over IoU thresholds from 0.5 to 0.95, so a higher mAP means better and tighter boxes.

    🎯 Mini-Challenge: Count the Hits

    Time to fly solo. Using IoU, count how many predicted boxes actually hit the ground-truth object. The starter block gives you the brief and the data — write the logic yourself.

    Mini-Challenge

    Count predictions whose IoU with the ground truth is at least 0.5

    Try it Yourself »
    Python
    # 🎯 MINI-CHALLENGE: count how many predictions "hit" the ground truth
    #
    # You are given one ground-truth box and a list of predicted boxes.
    # 1. Write (or reuse) a compute_iou(box_a, box_b) function.
    # 2. A prediction is a HIT when its IoU with the ground truth is >= 0.5.
    # 3. Print how many predictions hit, e.g. "Hits: 2 / 4".
    #
    # Starter data:
    # ground_truth = [50, 50, 150, 150]
    # predictions  = [[52, 48, 150, 152], [60, 60, 160, 160],
    #                 [0, 0, 40, 40], [200, 200, 260, 260]]
    #
    ...
    🎉

    Lesson complete — you can detect and localise objects!

    You can describe a detection as class + box + confidence, compute IoU by hand, run NMS to remove duplicates, choose between one-stage and two-stage detectors, and read an mAP score. These are the building blocks behind every modern detector.

    🚀 Up next: Semantic Segmentation — go beyond boxes and label every single pixel in the image.
