    Lesson 10 • Intermediate

    Computer Vision Basics

    Teach a computer to "see" — turn images into numbers, sharpen and blur them by hand, then understand how convolutional neural networks (CNNs) recognise objects.

    What You'll Learn in This Lesson

    • How images are stored as grids of pixel numbers (grayscale vs RGB)
    • Reading and editing individual pixels with image[row][col]
    • Two core operations: brighten (with clipping) and threshold
    • Convolution by hand — slide a small filter to blur or find edges
    • CNN building blocks: convolution, pooling, and feature maps
    • The three core CV tasks: classification, detection, segmentation

    👁️ How Your Eyes and Brain "See"

    When you look at a dog, your eyes do not send "DOG" to your brain. They send millions of tiny light measurements. Your visual system then builds meaning in layers: first it spots edges (where light meets dark), then it groups edges into shapes (an ear, a snout), then it recognises textures (fur), and only at the end does it conclude "that's a dog."

    A computer starts from much the same place: an image arrives as a grid of brightness numbers. A convolutional neural network (CNN) then rebuilds a similar ladder: early layers find edges, middle layers find shapes, deep layers find whole objects. The whole lesson is about understanding that ladder, one rung at a time, starting from raw numbers.

    1. An Image Is Just a Grid of Numbers

    A pixel ("picture element") is one dot of an image. In a grayscale image each pixel is a single number from 0 (black) to 255 (white), with greys in between. You store the whole image as a list of rows — a nested list — and read any pixel with image[row][col].

    A colour image adds a third dimension: every pixel becomes three numbers — Red, Green and Blue (RGB). That is why a 224×224 colour photo is 224 × 224 × 3 = 150,528 numbers. Run the worked example below to read pixels and measure an image's size.

    Worked Example: Images as Numbers

    Store a 3x3 image as a nested list and read its pixels

    Python
    # An image is just a grid of numbers. No libraries needed.
    # Each number is a "pixel": 0 = black, 255 = white, in between = grey.
    
    # A tiny 3x3 grayscale image stored as a nested list (a list of rows)
    image = [
        [  0, 128, 255],   # row 0: black, grey, white
        [128, 255, 128],   # row 1
        [255, 128,   0],   # row 2
    ]
    
    # Print it like a picture so you can "see" the numbers
    print("=== Pixel values ===")
    for row in image:
        print(row)
    
    # Read one pixel: image[row][col]
    print()
    print("Top-l
    ...
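The grayscale-versus-RGB arithmetic can also be sketched on its own. This is a minimal illustration, not the lesson's original example: the tiny `rgb_image` grid and the helper name `count_values` are made up here, and each colour pixel is assumed to be stored as a `(red, green, blue)` tuple.

```python
# Illustrative sketch: grayscale pixels are single numbers,
# colour pixels are assumed here to be (R, G, B) tuples.
gray_image = [
    [0, 128, 255],
    [128, 255, 128],
    [255, 128, 0],
]
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],      # row 0: red, green
    [(0, 0, 255), (255, 255, 255)],  # row 1: blue, white
]

def count_values(height, width, channels):
    """Total numbers needed to store an image of the given shape."""
    return height * width * channels

# Read pixels with image[row][col]
top_right = gray_image[0][2]          # row 0, col 2 -> 255 (white)
red, green, blue = rgb_image[1][0]    # row 1, col 0 -> pure blue

print("Grayscale top-right:", top_right)                  # 255
print("RGB pixel at [1][0]:", (red, green, blue))         # (0, 0, 255)
print("3x3 grayscale values:", count_values(3, 3, 1))     # 9
print("224x224 RGB values:", count_values(224, 224, 3))   # 150528
```

Note the index order: the row comes first, then the column, so `gray_image[0][2]` is the top-right pixel, not the bottom-left one.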

    2. Basic Operations: Brightness and Threshold

    Once an image is numbers, editing it is just arithmetic. Brightening adds a fixed amount to every pixel — but you must clip the result to the 0–255 range, because a pixel can never be darker than black or brighter than white. Thresholding turns the image pure black-and-white: any pixel at or above a cutoff becomes 255, everything else becomes 0. That is the simplest way to separate a bright object from a dark background.

    Worked Example: Brighten and Threshold

    Adjust brightness with clipping, then threshold to black & white

    Python
    # Two of the most common image operations, written by hand.
    
    image = [
        [  0, 128, 255],
        [128, 255, 128],
        [255, 128,   0],
    ]
    
    # 1) BRIGHTEN: add a value to every pixel, then "clip" to the 0-255 range.
    #    Clipping matters: pixels can never go below 0 or above 255.
    def brighten(img, amount):
        out = []
        for row in img:
            new_row = []
            for pixel in row:
                value = pixel + amount
                value = max(0, min(255, value))   # clip into 0..255
                new_
    ...
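For reference, here is one compact way both operations could be written with list comprehensions. It is a sketch under assumptions rather than the lesson's own code: the brightness amount of 50 and the cutoff of 128 are arbitrary values chosen for illustration.

```python
# Sketch: brighten with clipping, then threshold, on the same 3x3 image.
image = [
    [0, 128, 255],
    [128, 255, 128],
    [255, 128, 0],
]

def brighten(img, amount):
    """Add `amount` to every pixel and clip the result into 0..255."""
    return [[max(0, min(255, pixel + amount)) for pixel in row] for row in img]

def threshold(img, cutoff):
    """Pixels at or above `cutoff` become 255; everything else becomes 0."""
    return [[255 if pixel >= cutoff else 0 for pixel in row] for row in img]

brighter = brighten(image, 50)     # assumed amount: +50
binary = threshold(image, 128)     # assumed cutoff: 128

print(brighter)  # [[50, 178, 255], [178, 255, 178], [255, 178, 50]]
print(binary)    # [[0, 255, 255], [255, 255, 255], [255, 255, 0]]
```

A negative amount darkens the image, and the same clipping keeps values from dropping below 0: `brighten(image, -200)` gives `[[0, 0, 55], [0, 55, 0], [55, 0, 0]]`.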

    3. Convolution and Filters (Blur and Edges)

    Convolution is the single most important idea in computer vision, and it is much simpler than it sounds: slide a small grid of numbers (a "filter" or "kernel") across the image and combine the pixels underneath it (multiply each pixel by the filter value sitting on top of it, then add up the results). A 2×2 average filter replaces each 2×2 region with the average of its four pixels, which softens the image (a blur). The result is smaller than the input (a 3×3 image becomes 2×2), because the window cannot hang off the edge.

    Worked Example: A 2x2 Average Filter

    Slide a tiny averaging filter across a 3x3 image to blur it

    Python
    # Convolution sounds scary; it is just "slide a small grid over the image
    # and combine the numbers underneath." Here is a 2x2 AVERAGE (blur) filter.
    
    image = [
        [ 10,  20,  30],
        [ 40,  50,  60],
        [ 70,  80,  90],
    ]
    
    # Slide a 2x2 window across the image. For a 3x3 image, the window fits in
    # 2 positions across and 2 down -> the result is 2x2 (it shrinks at the edges).
    def average_2x2(img):
        out = []
        for i in range(len(img) - 1):          # rows 0,1
            new_row = []
            
    ...
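The sliding-window idea can be written out in full as follows. This is a sketch of one possible implementation, not necessarily how the lesson's own function continues; in particular, using true division (`/`) rather than integer division is an assumption.

```python
# Sketch: a 2x2 average (blur) filter over a nested-list image.
image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

def average_2x2(img):
    """Return a blurred image one row and one column smaller than the input."""
    out = []
    for i in range(len(img) - 1):            # top row of each window
        new_row = []
        for j in range(len(img[0]) - 1):     # left column of each window
            window_sum = (img[i][j] + img[i][j + 1]
                          + img[i + 1][j] + img[i + 1][j + 1])
            new_row.append(window_sum / 4)   # average of the four pixels
        out.append(new_row)
    return out

blurred = average_2x2(image)
for row in blurred:
    print(row)
# [30.0, 40.0]
# [60.0, 70.0]
```

Each output value sits between its four source pixels, which is why the differences between neighbouring values shrink after blurring.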

    Swap the filter's numbers and the same sliding machinery does something else entirely. An edge-detection kernel gives a strong response wherever brightness changes sharply, and near zero across flat regions. That is the intuition behind "finding edges." The numpy example below applies a real edge kernel to a 5×5 image.

    Worked Example: Edge Detection with numpy

    Apply a real 3x3 edge kernel and read the feature map

    Python
    import numpy as np   # the real world uses numpy, not nested lists
    
    # A 5x5 image with a bright square in the middle
    image = np.array([
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ], dtype=float)
    
    # A 3x3 edge-detection kernel: it fires where the centre differs from neighbours
    edge_kernel = np.array([
        [-1, -1, -1],
        [-1,  8, -1],
        [-1, -1, -1],
    ], dtype=float)
    
    def convolve(img, kernel):
        kh, kw = kernel.shape
        out = np
    ...
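A self-contained version of the edge-detection idea might look like the sketch below. The function name `convolve_valid` is hypothetical (chosen to avoid implying it matches the lesson's `convolve`), and it assumes no padding and a stride of 1. Strictly speaking it does not flip the kernel, which makes it cross-correlation; for a symmetric kernel like this one the result is the same.

```python
import numpy as np

# Same 5x5 image and 3x3 edge kernel as in the lesson
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

edge_kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
], dtype=float)

def convolve_valid(img, kernel):
    """Slide `kernel` over `img` with no padding (output shrinks at the edges)."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = img[i:i + kh, j:j + kw]      # pixels under the kernel
            out[i, j] = np.sum(window * kernel)   # multiply, then add up
    return out

feature_map = convolve_valid(image, edge_kernel)
print(feature_map)
# [[5. 3. 5.]
#  [3. 0. 3.]
#  [5. 3. 5.]]
```

The centre of the feature map is 0 because that window is completely flat (all ones), while the corners of the bright square give the strongest response.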

    🎯 Your Turn: Invert an Image

    Fill in the invert formula so black becomes white

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    
    image = [
        [  0, 128, 255],
        [255,  64,   0],
    ]
    
    # Invert means: a black pixel becomes white and vice-versa.
    # The rule for an 8-bit pixel is:  new_value = 255 - old_value
    def invert(img):
        out = []
        for row in img:
            new_row = []
            for pixel in row:
                new_row.append(___)   # 👉 replace ___ with the invert formula
            out.append(new_row)
        return out
    
    for row in invert(image):
        print(row)
    
    # ✅ Expecte
    ...

    🎯 Your Turn: One Step of an Average Filter

    Complete the top-left value of a 2x2 average filter

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    
    image = [
        [ 4,  8, 12],
        [16, 20, 24],
        [28, 32, 36],
    ]
    
    # Compute just the TOP-LEFT value of a 2x2 average filter.
    # The 2x2 window covers image[0][0], image[0][1], image[1][0], image[1][1].
    top_left = image[0][0] + image[0][1] + ___ + ___   # 👉 add the two bottom pixels
    average  = top_left ___ 4                            # 👉 use integer division //
    
    print("Window sum:", top_left)
    print("Average:   ", average)
    
    # ✅ Expected outpu
    ...

    4. From Filters to CNNs (Conv, Pool, Feature Maps)

    A CNN stacks the idea you just built. A convolution layer applies many filters at once — but instead of you choosing the numbers, the network learns them during training. Each filter produces a feature map: a grid showing where that pattern was found. Early layers learn edge filters, deeper layers learn shape and object filters — exactly the eye/brain ladder from the analogy.

    A pooling layer (usually MaxPool 2×2) then shrinks each feature map by keeping only the strongest value in each 2×2 block. This throws away precise positions but keeps "was the feature here, roughly?", which makes the network smaller and more robust. Conv → Pool → Conv → Pool repeats until a Flatten turns the maps into a vector for a final Dense classifier.
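Max pooling is simple enough to write by hand. The sketch below uses a made-up 4×4 feature map and a hypothetical helper `max_pool_2x2`; it assumes non-overlapping 2×2 blocks (stride 2) and drops any leftover row or column when a dimension is odd.

```python
# Sketch: 2x2 max pooling on a nested-list feature map (illustrative values).
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

def max_pool_2x2(fmap):
    """Keep only the largest value in each non-overlapping 2x2 block."""
    out = []
    for i in range(0, len(fmap) - 1, 2):          # step two rows at a time
        new_row = []
        for j in range(0, len(fmap[0]) - 1, 2):   # step two columns at a time
            block = (fmap[i][j], fmap[i][j + 1],
                     fmap[i + 1][j], fmap[i + 1][j + 1])
            new_row.append(max(block))
        out.append(new_row)
    return out

pooled = max_pool_2x2(feature_map)
for row in pooled:
    print(row)
# [4, 2]
# [2, 8]
```

The 4×4 map becomes 2×2: height and width are halved, and each surviving number only says that the feature was strong somewhere inside its block.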

    In real frameworks you describe this stack in a few lines. The example below sketches an OpenCV pre-processing step (note the BGR→RGB fix) and a small Keras CNN. The shapes shrink layer by layer — read the # Expected output to see how.

    Worked Example: OpenCV + a Keras CNN

    Pre-process an image, then build a small CNN classifier

    Python
    # ── OpenCV reads images as BGR, not RGB! A classic bug. ──
    import cv2                       # pip install opencv-python
    img = cv2.imread("cat.jpg")      # shape (H, W, 3) in B, G, R order
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # convert before showing/feeding
    img = cv2.resize(img, (224, 224))            # CNNs need a fixed input size
    img = img / 255.0                            # normalise pixels to 0..1
    
    # ── A small image classifier in Keras (TensorFlow) ──
    from tensorflow import keras
    
    ...
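To see how shapes shrink through a Conv → Pool stack without installing a framework, the arithmetic can be traced in plain Python. Everything specific here is an assumption for illustration, not taken from the lesson's model: 3×3 filters with no padding and stride 1, 2×2 pooling, and filter counts of 32 and 64.

```python
# Sketch: trace tensor shapes through Conv -> Pool -> Conv -> Pool -> Flatten.

def conv_output(h, w, kernel=3):
    """Spatial size after a conv with no padding and stride 1."""
    return h - kernel + 1, w - kernel + 1

def pool_output(h, w, size=2):
    """Spatial size after non-overlapping pooling (leftovers are dropped)."""
    return h // size, w // size

h, w, channels = 224, 224, 3
shapes = [("input", (h, w, channels))]

for filters in (32, 64):                  # assumed filter counts
    h, w = conv_output(h, w)
    channels = filters                    # one feature map per filter
    shapes.append((f"conv 3x3, {filters} filters", (h, w, channels)))
    h, w = pool_output(h, w)
    shapes.append(("maxpool 2x2", (h, w, channels)))

flattened = h * w * channels

for name, shape in shapes:
    print(f"{name:24s} {shape}")
print("flatten ->", flattened)
# input                    (224, 224, 3)
# conv 3x3, 32 filters     (222, 222, 32)
# maxpool 2x2              (111, 111, 32)
# conv 3x3, 64 filters     (109, 109, 64)
# maxpool 2x2              (54, 54, 64)
# flatten -> 186624
```

Each convolution trims a one-pixel border (for a 3×3 filter) and each pooling step halves the height and width, while the channel count follows the number of filters.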

    🗂️ The Three Core Vision Tasks

    Most computer-vision products build on one of these three jobs, listed roughly in order of increasing difficulty:

    Classification

    "What is in this image?" — one label for the whole picture (cat vs dog). The CNN you just saw does this.

    Object Detection

    "What objects are here, and where?" — draws a box around each object and labels it (e.g. YOLO, Faster R-CNN).

    Segmentation

    "Which pixels belong to which object?" — labels every single pixel, giving an exact outline (e.g. U-Net, Mask R-CNN).

    5. Common Errors (And How to Fix Them)

    These four mistakes trip up nearly every beginner. Spotting them saves hours.

    ❌ Forgetting to normalise pixels

    Feeding raw 0–255 pixels into a network can make training unstable and slow, because large input values tend to produce large activations and gradients.

    ✅ Fix: scale to 0–1 before training:

    image = image / 255.0   # now every pixel is between 0.0 and 1.0
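A small check of what normalisation does to the values and the data type, sketched with numpy. The array contents are made up, and converting to `float32` first is a common convention assumed here rather than something the lesson prescribes.

```python
import numpy as np

# Sketch: raw 8-bit pixels as they typically come out of an image loader
raw = np.array([
    [0, 128, 255],
    [64, 32, 16],
], dtype=np.uint8)

# Convert to float first, then scale into the 0..1 range
normalised = raw.astype(np.float32) / 255.0

print(raw.dtype, raw.min(), raw.max())                       # uint8 0 255
print(normalised.dtype, normalised.min(), normalised.max())  # float32 0.0 1.0
```

Black stays at 0.0, white becomes 1.0, and every grey lands proportionally in between.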

    ❌ Wrong channel order (BGR vs RGB)

    OpenCV's cv2.imread returns pixels in BGR order, but most models and display libraries expect RGB. Colours come out swapped (blue skies look orange).

    ✅ Fix: convert right after loading:

    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
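To see what the channel swap actually does, it can be reproduced with plain numpy by reversing the last axis. This is an illustrative sketch on a made-up 1×2 image; for a 3-channel image the reversed channel order is the same ordering that BGR→RGB conversion produces.

```python
import numpy as np

# Sketch: a 1x2 image in BGR order, as OpenCV would load it
bgr = np.array([[[255, 0, 0],      # pixel 0: B=255 -> a blue pixel
                 [0, 0, 255]]],    # pixel 1: R=255 -> a red pixel
               dtype=np.uint8)

# Reverse the channel axis: (B, G, R) becomes (R, G, B)
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # [  0   0 255] -> still blue, now in RGB order
print(rgb[0, 1])  # [255   0   0] -> still red, now in RGB order
```

If the swap is skipped, an RGB-expecting library reads the first pixel as `(255, 0, 0)`, which it shows as red instead of blue.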

    ❌ Not resizing to a fixed input size

    A CNN that ends in Flatten and Dense layers expects every input to be the same shape (e.g. 224×224×3). Passing mixed sizes raises a shape error like expected (224,224,3), got (480,640,3).

    ✅ Fix: resize every image first:

    img = cv2.resize(img, (224, 224))
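For intuition about what a resize does, here is a numpy-only sketch using nearest-neighbour sampling. It is a stand-in for illustration, not what `cv2.resize` does by default (OpenCV uses bilinear interpolation unless told otherwise), and the helper name `resize_nearest` is hypothetical. Also note that `cv2.resize` takes its target size as (width, height), whereas this helper takes (height, width).

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize: each output pixel copies one input pixel."""
    old_h, old_w = img.shape[:2]
    row_idx = np.arange(new_h) * old_h // new_h   # source row per output row
    col_idx = np.arange(new_w) * old_w // new_w   # source column per output column
    return img[row_idx[:, None], col_idx]

# A made-up 480x640 colour image brought down to the fixed CNN input size
photo = np.zeros((480, 640, 3), dtype=np.uint8)
resized = resize_nearest(photo, 224, 224)
print(photo.shape, "->", resized.shape)   # (480, 640, 3) -> (224, 224, 3)

# Tiny example where the sampling is easy to follow
small = resize_nearest(np.arange(16).reshape(4, 4), 2, 2)
print(small)
# [[ 0  2]
#  [ 8 10]]
```

Squashing a 480×640 image into a square also changes its aspect ratio, so objects look stretched; cropping or padding before resizing avoids that.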

    ❌ Giant dense layers instead of convolution

    Flattening a raw image straight into a Dense layer creates millions of weights (a 224×224×3 input feeding 1,000 units needs about 150 million) and tends to overfit quickly.

    ✅ Fix: use Conv2D + MaxPooling2D first to shrink and share weights, and flatten only near the end.
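The difference in parameter counts is easy to verify with arithmetic. The layer sizes below (1,000 dense units, 32 filters of size 3×3) are assumptions picked for illustration.

```python
# Sketch: count trainable parameters for a dense layer vs a conv layer.

def dense_params(num_inputs, num_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return num_inputs * num_units + num_units

def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """Each filter has kernel_h * kernel_w * in_channels weights, plus one bias."""
    return kernel_h * kernel_w * in_channels * num_filters + num_filters

flat_inputs = 224 * 224 * 3                  # 150,528 numbers per image

dense_total = dense_params(flat_inputs, 1000)
conv_total = conv_params(3, 3, 3, 32)

print(f"Dense(1000) on the raw image: {dense_total:,}")   # 150,529,000
print(f"Conv2D(32, 3x3) on the image: {conv_total:,}")    # 896
```

The convolution layer is so much smaller because the same 3×3 filter weights are reused at every position in the image, rather than learning a separate weight for every pixel.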

    📋 Quick Reference

    Term         | What It Means                  | Example / Note
    Pixel        | One dot of an image            | 0=black … 255=white
    Grayscale    | One value per pixel            | shape H×W×1
    RGB          | Red, Green, Blue per pixel     | shape H×W×3
    Normalise    | Scale pixels to 0–1            | img / 255.0
    Threshold    | Make black & white             | 255 if p >= c else 0
    Convolution  | Slide a filter over the image  | blur, sharpen, edges
    Conv2D       | Learns filters → feature maps  | shares weights
    MaxPool      | Shrinks, keeps strongest value | 2×2 halves H and W
    Flatten      | 2D maps → 1D vector            | feeds Dense layer
    CV tasks     | Three core jobs                | classify · detect · segment

    ❓ Frequently Asked Questions

    Q: Why is an image just a grid of numbers?

    A: A camera sensor measures light at millions of tiny points. Each measurement becomes a number (a pixel). Grayscale uses one number per pixel (0=black to 255=white); colour uses three numbers per pixel for Red, Green and Blue.

    Q: What is the difference between grayscale and RGB?

    A: Grayscale stores one brightness value per pixel, so a 28x28 image is 28x28x1 numbers. RGB stores three values per pixel (red, green, blue), so a 28x28 colour image is 28x28x3 numbers. RGB has three 'channels'; grayscale has one.

    Q: What does convolution actually do?

    A: It slides a small grid of numbers (a filter or kernel) across the image and combines the pixels underneath to produce a new value. Different filters highlight different things: averaging blurs, while an edge kernel lights up wherever brightness changes sharply.

    Q: Why use a CNN instead of a regular neural network for images?

    A: A 224x224x3 image is over 150,000 numbers. A plain dense layer over that needs millions of weights and ignores the 2D structure. CNNs share a small filter across the whole image, so they learn with far fewer parameters and respect spatial layout.

    Q: What are classification, detection and segmentation?

    A: Classification answers 'what is in this image?' with one label. Object detection draws boxes around each object and labels them. Segmentation labels every individual pixel, giving the exact outline of each object.

    🎯 Mini Challenge: Posterise an Image

    Time to fly solo. Snap every pixel to the nearest of three levels (0, 128, 255) — a classic poster-art effect. The starter below gives only the brief and the data; you write the logic. Check yourself against the expected output in the comments.

    Mini Challenge: Posterise

    Write the posterise function from scratch

    Try it Yourself »
    Python
    # 🎯 MINI-CHALLENGE: posterise an image
    # A 3x3 grayscale image is given. Write a "posterise" function that snaps
    # every pixel to the NEAREST of three levels: 0, 128, or 255.
    #
    # Rule:
    #   pixel < 64           -> 0
    #   64 <= pixel < 192    -> 128
    #   pixel >= 192         -> 255
    #
    # Steps:
    # 1. Loop over every row, then every pixel.
    # 2. Apply the rule above to choose 0, 128 or 255.
    # 3. Build and print the new image (a list of rows).
    #
    # ✅ Expected (for the image below):
    # [0, 128, 255]
    # [128,
    ...
    🎉

    Lesson 10 complete — you can make a computer see!

    You now know that images are grids of numbers, you can brighten, threshold, blur and find edges by hand, and you understand how convolution, pooling and feature maps stack into a CNN that classifies, detects, or segments. The "magic" of computer vision is just arithmetic on pixels, repeated in layers.

    🚀 Up next: Advanced Neural Networks — regularisation, batch normalisation, and the tricks that make deep models train reliably.
