
Lesson 10 • Intermediate

Computer Vision Basics

Teach a computer to "see" — turn images into numbers, sharpen and blur them by hand, then understand how convolutional neural networks (CNNs) recognise objects.

What You'll Learn in This Lesson

✓ How images are stored as grids of pixel numbers (grayscale vs RGB)
✓ Read and edit individual pixels with image[row][col]
✓ Two core operations: brighten (with clipping) and threshold
✓ Convolution by hand: slide a small filter to blur or find edges
✓ CNN building blocks: convolution, pooling, and feature maps
✓ The three core CV tasks: classification, detection, segmentation

Before you start: You should be comfortable with Python lists and loops, and have met neural networks in the Neural Networks lesson. Most of the runnable exercises here use plain Python with no libraries to install; the edge-detection example uses numpy, and the final example needs OpenCV and TensorFlow.

👁️ How Your Eyes and Brain "See"

When you look at a dog, your eyes do not send "DOG" to your brain. They send millions of tiny light measurements. Your visual system then builds meaning in layers: first it spots edges (where light meets dark), then it groups edges into shapes (an ear, a snout), then it recognises textures (fur), and only at the end does it conclude "that's a dog."

A computer starts in exactly the same place: an image arrives as a grid of brightness numbers. A convolutional neural network (CNN) then rebuilds the same ladder — early layers find edges, middle layers find shapes, deep layers find whole objects. The whole lesson is about understanding that ladder, one rung at a time, starting from raw numbers.

1. An Image Is Just a Grid of Numbers

A pixel ("picture element") is one dot of an image. In a grayscale image each pixel is a single number from 0 (black) to 255 (white), with greys in between. You store the whole image as a list of rows — a nested list — and read any pixel with image[row][col].

A colour image adds a third dimension: every pixel becomes three numbers — Red, Green and Blue (RGB). That is why a 224×224 colour photo is 224 × 224 × 3 = 150,528 numbers. Run the worked example below to read pixels and measure an image's size.
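
To make that arithmetic concrete, here is a minimal sketch that stores a tiny colour image as a nested list and counts its numbers. The 2x2 image and the helper name count_values are made up for illustration; they are not part of the lesson's own example.

```python
# A hypothetical 2x2 colour image: each pixel is a list [R, G, B].
rgb_image = [
    [[255, 0, 0], [0, 255, 0]],      # row 0: red, green
    [[0, 0, 255], [255, 255, 255]],  # row 1: blue, white
]

def count_values(img):
    """Return height * width * channels for a nested-list colour image."""
    height = len(img)
    width = len(img[0])
    channels = len(img[0][0])
    return height * width * channels

print(rgb_image[0][1])          # pixel at row 0, col 1 -> [0, 255, 0]
print(rgb_image[0][1][1])       # its green channel -> 255
print(count_values(rgb_image))  # 2 * 2 * 3 = 12
print(224 * 224 * 3)            # 150528, the figure quoted above
```

The only change from grayscale is one extra index: image[row][col] now returns three numbers, and image[row][col][channel] returns one.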

Worked Example: Images as Numbers

Store a 3x3 image as a nested list and read its pixels

Python

# An image is just a grid of numbers. No libraries needed.
# Each number is a "pixel": 0 = black, 255 = white, in between = grey.

# A tiny 3x3 grayscale image stored as a nested list (a list of rows)
image = [
    [  0, 128, 255],   # row 0: black, grey, white
    [128, 255, 128],   # row 1
    [255, 128,   0],   # row 2
]

# Print it like a picture so you can "see" the numbers
print("=== Pixel values ===")
for row in image:
    print(row)

# Read one pixel: image[row][col]
print()
print("Top-l
...
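
Reading is only half of image[row][col]: the same index on the left of = edits a pixel. The following is a small self-contained sketch on the same 3x3 image; the variable names height and width are assumptions chosen for illustration.

```python
image = [
    [  0, 128, 255],
    [128, 255, 128],
    [255, 128,   0],
]

height = len(image)      # number of rows
width = len(image[0])    # number of columns in the first row
print(height, width)     # 3 3

print(image[0][0])       # top-left pixel -> 0 (black)
print(image[1][1])       # centre pixel   -> 255 (white)

image[1][1] = 0          # edit: paint the centre pixel black
print(image[1])          # [128, 0, 128]
```

Row comes first and column second, so image[2][0] is the bottom-left pixel, not the top-right one.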

2. Basic Operations: Brightness and Threshold

Once an image is numbers, editing it is just arithmetic. Brightening adds a fixed amount to every pixel — but you must clip the result to the 0–255 range, because a pixel can never be darker than black or brighter than white. Thresholding turns the image pure black-and-white: any pixel at or above a cutoff becomes 255, everything else becomes 0. That is the simplest way to separate a bright object from a dark background.

Worked Example: Brighten and Threshold

Adjust brightness with clipping, then threshold to black & white

Python

# Two of the most common image operations, written by hand.

image = [
    [  0, 128, 255],
    [128, 255, 128],
    [255, 128,   0],
]

# 1) BRIGHTEN: add a value to every pixel, then "clip" to the 0-255 range.
#    Clipping matters: pixels can never go below 0 or above 255.
def brighten(img, amount):
    out = []
    for row in img:
        new_row = []
        for pixel in row:
            value = pixel + amount
            value = max(0, min(255, value))   # clip into 0..255
            new_
...
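
For reference, both operations can also be sketched compactly with list comprehensions. This is one possible way to write them, not necessarily how the lesson's own code continues; the default cutoff of 128 is an assumption.

```python
def brighten(img, amount):
    """Add `amount` to every pixel, clipping the result into 0..255."""
    return [[max(0, min(255, p + amount)) for p in row] for row in img]

def threshold(img, cutoff=128):
    """Pixels at or above `cutoff` become 255; everything else becomes 0."""
    return [[255 if p >= cutoff else 0 for p in row] for row in img]

image = [
    [  0, 128, 255],
    [128, 255, 128],
    [255, 128,   0],
]

print(brighten(image, 100))   # [[100, 228, 255], [228, 255, 228], [255, 228, 100]]
print(brighten(image, -200))  # [[0, 0, 55], [0, 55, 0], [55, 0, 0]]
print(threshold(image))       # [[0, 255, 255], [255, 255, 255], [255, 255, 0]]
```

A negative amount darkens the image, and the same clip keeps those values from dropping below 0.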

3. Convolution and Filters (Blur and Edges)

Convolution is the single most important idea in computer vision, and it is much simpler than it sounds: slide a small grid of numbers (a "filter" or "kernel") across the image and combine the pixels underneath it. A 2×2 average filter replaces each region with the average of its four pixels — that softens the image (a blur). The result is slightly smaller than the input, because the window cannot hang off the edge.

Worked Example: A 2x2 Average Filter

Slide a tiny averaging filter across a 3x3 image to blur it

Python

# Convolution sounds scary; it is just "slide a small grid over the image
# and combine the numbers underneath." Here is a 2x2 AVERAGE (blur) filter.

image = [
    [ 10,  20,  30],
    [ 40,  50,  60],
    [ 70,  80,  90],
]

# Slide a 2x2 window across the image. For a 3x3 image, the window fits in
# 2 positions across and 2 down -> the result is 2x2 (it shrinks at the edges).
def average_2x2(img):
    out = []
    for i in range(len(img) - 1):          # rows 0,1
        new_row = []
        
...
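
A minimal, self-contained sketch of a 2x2 average filter is given here as one possible way to write the sliding loop. It assumes integer division for the average, so results are rounded down.

```python
def average_2x2(img):
    """Slide a 2x2 window over img and return the grid of window averages."""
    out = []
    for i in range(len(img) - 1):            # top row of each window
        new_row = []
        for j in range(len(img[0]) - 1):     # left column of each window
            total = (img[i][j] + img[i][j + 1]
                     + img[i + 1][j] + img[i + 1][j + 1])
            new_row.append(total // 4)       # integer average of 4 pixels
        out.append(new_row)
    return out

image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
for row in average_2x2(image):
    print(row)
# [30, 40]
# [60, 70]
```

The 3x3 input gives a 2x2 output because a 2x2 window has only two valid positions in each direction.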

Swap the filter's numbers and the same sliding machinery does something else entirely. An edge-detection kernel gives a strong response wherever brightness changes sharply, and near zero across flat regions. That is the intuition behind "finding edges." The numpy example below applies a real edge kernel to a 5×5 image.
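
The "same sliding machinery" can be sketched in plain Python by passing the kernel in as an argument. The function name slide_filter and the small test image are assumptions made for illustration. Like most deep-learning libraries, the sketch multiplies the kernel against the window without flipping it first.

```python
def slide_filter(img, kernel):
    """Slide `kernel` over `img` (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        new_row = []
        for j in range(len(img[0]) - kw + 1):
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += img[i + a][j + b] * kernel[a][b]
            new_row.append(total)
        out.append(new_row)
    return out

# Hypothetical 3x6 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]            # sums the window (a blur before dividing by 9)
edge_kernel = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]   # centre minus its neighbours

print(slide_filter(image, box_kernel))    # [[0, 27, 54, 81]]
print(slide_filter(image, edge_kernel))   # [[0, -27, 27, 0]]
```

The box kernel turns the hard step into a gradual ramp. The edge kernel returns 0 over both flat regions and a strong positive or negative value only where dark meets bright.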

Worked Example: Edge Detection with numpy

Apply a real 3x3 edge kernel and read the feature map

Python

import numpy as np   # the real world uses numpy, not nested lists

# A 5x5 image with a bright square in the middle
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

# A 3x3 edge-detection kernel: it fires where the centre differs from neighbours
edge_kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
], dtype=float)

def convolve(img, kernel):
    kh, kw = kernel.shape
    out = np
...

🎯 Your Turn: Invert an Image

Fill in the invert formula so black becomes white

Python

# 🎯 YOUR TURN — fill in the blanks marked with ___

image = [
    [  0, 128, 255],
    [255,  64,   0],
]

# Invert means: a black pixel becomes white and vice-versa.
# The rule for an 8-bit pixel is:  new_value = 255 - old_value
def invert(img):
    out = []
    for row in img:
        new_row = []
        for pixel in row:
            new_row.append(___)   # 👉 replace ___ with the invert formula
        out.append(new_row)
    return out

for row in invert(image):
    print(row)

# ✅ Expecte
...

🎯 Your Turn: One Step of an Average Filter

Complete the top-left value of a 2x2 average filter

Python

# 🎯 YOUR TURN — fill in the blanks marked with ___

image = [
    [ 4,  8, 12],
    [16, 20, 24],
    [28, 32, 36],
]

# Compute just the TOP-LEFT value of a 2x2 average filter.
# The 2x2 window covers image[0][0], image[0][1], image[1][0], image[1][1].
top_left = image[0][0] + image[0][1] + ___ + ___   # 👉 add the two bottom pixels
average  = top_left ___ 4                            # 👉 use integer division //

print("Window sum:", top_left)
print("Average:   ", average)

# ✅ Expected outpu
...

4. From Filters to CNNs (Conv, Pool, Feature Maps)

A CNN stacks the idea you just built. A convolution layer applies many filters at once — but instead of you choosing the numbers, the network learns them during training. Each filter produces a feature map: a grid showing where that pattern was found. Early layers learn edge filters, deeper layers learn shape and object filters — exactly the eye/brain ladder from the analogy.

A pooling layer (usually MaxPool 2×2) then shrinks each feature map by keeping only the strongest value in each 2×2 block. This throws away precise positions but keeps "was the feature here, roughly?", which makes the network smaller and more robust. Conv → Pool → Conv → Pool repeats until a Flatten turns the maps into a vector for a final Dense classifier.
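
Max pooling is simple enough to write by hand. The sketch below is a plain-Python illustration, not a framework implementation. It assumes non-overlapping 2x2 blocks and drops any leftover row or column when the size is odd; the feature map values are made up.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    out = []
    for i in range(0, len(feature_map) - 1, 2):          # step 2 rows at a time
        new_row = []
        for j in range(0, len(feature_map[0]) - 1, 2):   # step 2 columns at a time
            block = (feature_map[i][j],     feature_map[i][j + 1],
                     feature_map[i + 1][j], feature_map[i + 1][j + 1])
            new_row.append(max(block))
        out.append(new_row)
    return out

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(feature_map))   # [[4, 2], [2, 8]]
```

The 4x4 map becomes 2x2: each output says how strongly the feature fired somewhere in its block, without recording exactly where.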

Why CNNs win: a convolution layer shares each small filter across the whole image, so it can learn with hundreds or thousands of weights rather than millions. A dense layer over a raw 224×224×3 image needs 150,528 weights for every single unit, and it ignores the picture's 2D structure entirely.
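
The difference is easy to check with arithmetic. The layer sizes below (128 dense units, 32 filters of 3x3) are hypothetical values chosen only to make the comparison concrete.

```python
# Hypothetical layer sizes, chosen only for illustration.
height, width, channels = 224, 224, 3
inputs = height * width * channels                 # numbers per image

dense_units = 128
dense_params = inputs * dense_units + dense_units  # weights + biases

filters, kernel_size = 32, 3
conv_params = kernel_size * kernel_size * channels * filters + filters

print(inputs)        # 150528
print(dense_params)  # 19267712
print(conv_params)   # 896
```

The convolution layer's count does not depend on the image size at all, because the same 3x3 filters are reused at every position.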

In real frameworks you describe this stack in a few lines. The example below sketches an OpenCV pre-processing step (note the BGR→RGB fix) and a small Keras CNN. The shapes shrink layer by layer — read the # Expected output to see how.

Worked Example: OpenCV + a Keras CNN

Pre-process an image, then build a small CNN classifier

Python

# ── OpenCV reads images as BGR, not RGB! A classic bug. ──
import cv2                       # pip install opencv-python
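img = cv2.imread("cat.jpg")      # shape (H, W, 3) in B, G, R order
if img is None:                  # imread returns None (no exception) if the file can't be read
    raise FileNotFoundError("cat.jpg")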
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # convert before showing/feeding
img = cv2.resize(img, (224, 224))            # CNNs need a fixed input size
img = img / 255.0                            # normalise pixels to 0..1

# ── A small image classifier in Keras (TensorFlow) ──
from tensorflow import keras

...
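
To see how shapes shrink through a Conv → Pool stack without installing a framework, the bookkeeping can be sketched in plain Python. The stack below (two unpadded 3x3 convolutions with 32 and 64 filters, each followed by 2x2 pooling) is an assumed example, not necessarily the model in the lesson's listing. The helper names conv_shape and pool_shape are hypothetical.

```python
def conv_shape(shape, filters, kernel=3):
    """Output shape of an unpadded convolution with stride 1."""
    h, w, _ = shape
    return (h - kernel + 1, w - kernel + 1, filters)

def pool_shape(shape, size=2):
    """Output shape of non-overlapping pooling (leftover rows/cols dropped)."""
    h, w, c = shape
    return (h // size, w // size, c)

shape = (224, 224, 3)
print("input  ", shape)            # (224, 224, 3)
shape = conv_shape(shape, 32)
print("conv 1 ", shape)            # (222, 222, 32)
shape = pool_shape(shape)
print("pool 1 ", shape)            # (111, 111, 32)
shape = conv_shape(shape, 64)
print("conv 2 ", shape)            # (109, 109, 64)
shape = pool_shape(shape)
print("pool 2 ", shape)            # (54, 54, 64)
flat = shape[0] * shape[1] * shape[2]
print("flatten", flat)             # 186624
```

Height and width shrink at every step while the channel count grows with the number of filters: the network trades spatial detail for richer features.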

🗂️ The Three Core Vision Tasks

Most computer-vision products build on one of these three jobs, listed in roughly increasing order of difficulty:

Classification

"What is in this image?" — one label for the whole picture (cat vs dog). The CNN you just saw does this.

Object Detection

"What objects are here, and where?" — draws a box around each object and labels it (e.g. YOLO, Faster R-CNN).

Segmentation

"Which pixels belong to which object?" — labels every single pixel, giving an exact outline (e.g. U-Net, Mask R-CNN).

5. Common Errors (And How to Fix Them)

These four mistakes trip up nearly every beginner. Spotting them saves hours.

❌ Forgetting to normalise pixels

Feeding raw 0–255 pixels into a network. The large values make training unstable and slow.

✅ Fix: scale to 0–1 before training:

image = image / 255.0   # now every pixel is between 0.0 and 1.0

❌ Wrong channel order (BGR vs RGB)

OpenCV's cv2.imread returns pixels in BGR order, but most models and display libraries expect RGB. Colours come out swapped (blue skies look orange).

✅ Fix: convert right after loading:

img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

❌ Not resizing to a fixed input size

CNNs expect every input to be the same shape (e.g. 224×224). Passing mixed sizes raises a shape error like expected (224,224,3), got (480,640,3).

✅ Fix: resize every image first:

img = cv2.resize(img, (224, 224))

❌ Giant dense layers instead of convolution

Flattening a raw image straight into a Dense layer creates millions of weights, which makes the model slow to train and prone to overfitting.

✅ Fix: use Conv2D + MaxPooling2D first to shrink and share weights, and flatten only near the end.

📋 Quick Reference

Term	What It Means	Example / Note
Pixel	One dot of an image	`0`=black … `255`=white
Grayscale	One value per pixel	shape H×W (or H×W×1 with a channel axis)
RGB	Red, Green, Blue per pixel	shape H×W×3
Normalise	Scale pixels to 0–1	`img / 255.0`
Threshold	Make black & white	`255 if p >= c else 0`
Convolution	Slide a filter over the image	blur, sharpen, edges
Conv2D	Learns filters → feature maps	shares weights
MaxPool	Shrinks, keeps strongest value	2×2 halves H and W
Flatten	2D maps → 1D vector	feeds Dense layer
CV tasks	Three core jobs	classify · detect · segment

❓ Frequently Asked Questions

Q: Why is an image just a grid of numbers?

A: A camera sensor measures light at millions of tiny points. Each measurement becomes a number (a pixel). Grayscale uses one number per pixel (0=black to 255=white); colour uses three numbers per pixel for Red, Green and Blue.

Q: What is the difference between grayscale and RGB?

A: Grayscale stores one brightness value per pixel, so a 28x28 image is 28x28x1 numbers. RGB stores three values per pixel (red, green, blue), so a 28x28 colour image is 28x28x3 numbers. RGB has three 'channels'; grayscale has one.

Q: What does convolution actually do?

A: It slides a small grid of numbers (a filter or kernel) across the image and combines the pixels underneath to produce a new value. Different filters highlight different things: averaging blurs, while an edge kernel lights up wherever brightness changes sharply.

Q: Why use a CNN instead of a regular neural network for images?

A: A 224x224x3 image is over 150,000 numbers. A plain dense layer over that needs millions of weights and ignores the 2D structure. CNNs share a small filter across the whole image, so they learn with far fewer parameters and respect spatial layout.

Q: What are classification, detection and segmentation?

A: Classification answers 'what is in this image?' with one label. Object detection draws boxes around each object and labels them. Segmentation labels every individual pixel, giving the exact outline of each object.

🎯 Mini Challenge: Posterise an Image

Time to fly solo. Snap every pixel to the nearest of three levels (0, 128, 255) — a classic poster-art effect. The starter below gives only the brief and the data; you write the logic. Check yourself against the expected output in the comments.

Mini Challenge: Posterise

Write the posterise function from scratch

Python

# 🎯 MINI-CHALLENGE: posterise an image
# A 3x3 grayscale image is given. Write a "posterise" function that snaps
# every pixel to the NEAREST of three levels: 0, 128, or 255.
#
# Rule:
#   pixel < 64           -> 0
#   64 <= pixel < 192    -> 128
#   pixel >= 192         -> 255
#
# Steps:
# 1. Loop over every row, then every pixel.
# 2. Apply the rule above to choose 0, 128 or 255.
# 3. Build and print the new image (a list of rows).
#
# ✅ Expected (for the image below):
# [0, 128, 255]
# [128,
...

🎉

Lesson 10 complete — you can make a computer see!

You now know that images are grids of numbers, you can brighten, threshold, blur and find edges by hand, and you understand how convolution, pooling and feature maps stack into a CNN that classifies, detects, or segments. The "magic" of computer vision is just arithmetic on pixels, repeated in layers.

🚀 Up next: Advanced Neural Networks — regularisation, batch normalisation, and the tricks that make deep models train reliably.
