Lesson 8 • Intermediate

Deep Learning Fundamentals

By the end of this lesson you'll be able to explain how a deep network turns numbers into predictions, how it learns from its mistakes with backpropagation and gradient descent, and how to build one in Keras without getting lost.

What You'll Learn in This Lesson

✓ What makes a network "deep": stacking many hidden layers
✓ Run a forward pass by hand through a tiny 2-layer network
✓ The intuition behind backpropagation and gradient descent
✓ How loss functions measure "how wrong" a model is
✓ What epochs, batches, and the learning rate actually control
✓ Build a small Keras model and fight overfitting with dropout

Before you start: Make sure you've completed Lesson 7: Neural Networks so you're comfortable with neurons, weights, and activation functions. The "Try it Yourself" exercises here use plain Python only — no installs needed.

🎯 Real-World Analogy: A Factory Assembly Line

Picture a factory assembly line. Raw materials enter at one end; each station does one small job and passes the result on. By the last station, scattered parts have become a finished car. A deep network works the same way — each layer is a station that refines the data a little before passing it along.

The first layers detect simple things (edges in an image, fragments of a word). Middle layers combine those into shapes or phrases. The final layer makes the call: "that's a golden retriever" or "this review is positive." Deep just means many of these stations stacked in a row — and more stations means the network can learn more abstract ideas.

1. What Makes a Network "Deep"?

A layer is a row of neurons. A network with one or two layers is called shallow; a network with many layers is deep. The data flows forward: the output of one layer becomes the input to the next. This step — running data through every layer to get a prediction — is called a forward pass.

Two breakthroughs made deep networks practical. The ReLU activation (which keeps positive numbers and zeroes out negatives) eased the vanishing-gradient problem, where gradients shrink toward zero as they travel back through a deep stack of layers. GPUs then made it fast enough to train millions of weights. Below, you'll do a forward pass entirely by hand so the layers stop being a mystery.

Read every comment in this worked example, then run it. Each hidden neuron is just a weighted sum of the inputs, plus a bias, passed through an activation.

Worked Example: A 2-Layer Forward Pass (Plain Python)

Run data through two layers by hand and watch a prediction appear

Python

# A tiny "deep" network by hand: 2 inputs -> 2 hidden neurons -> 1 output
# No frameworks. Just lists and arithmetic so you can SEE what a layer does.

import math

def relu(x):
    return x if x > 0 else 0.0          # ReLU: keep positives, zero the rest

def sigmoid(x):
    return 1 / (1 + math.exp(-x))       # squashes any number into 0..1

# --- The input the network "sees" ---
inputs = [0.5, 0.9]                      # 2 features (e.g. brightness, size)

# --- Layer 1: 2 inputs -> 2 hidden 
...
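A complete forward pass of this shape (2 inputs -> 2 hidden neurons -> 1 output) can be sketched as follows. The weights and biases are hypothetical values picked purely for illustration, and `dense` is a helper name introduced here, not a library function.

```python
import math

def relu(x):
    return x if x > 0 else 0.0          # ReLU: keep positives, zero the rest

def sigmoid(x):
    return 1 / (1 + math.exp(-x))       # squashes any number into 0..1

def dense(inputs, weights, biases, activation):
    """One layer: each neuron = weighted sum of inputs + bias, then activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = bias
        for x, w in zip(inputs, neuron_weights):
            total += x * w
        outputs.append(activation(total))
    return outputs

inputs = [0.5, 0.9]                      # the same 2 features used above

# ASSUMPTION: illustrative weights and biases, not taken from the lesson
hidden_weights = [[0.4, -0.6],           # weights feeding hidden neuron 1
                  [0.3,  0.8]]           # weights feeding hidden neuron 2
hidden_biases  = [0.1, -0.2]
output_weights = [[1.2, -0.7]]           # one output neuron, one weight per hidden neuron
output_biases  = [0.05]

# Layer 1 output becomes layer 2 input: that is the whole forward pass
hidden = dense(inputs, hidden_weights, hidden_biases, relu)
prediction = dense(hidden, output_weights, output_biases, sigmoid)[0]

print("hidden layer:", [round(h, 3) for h in hidden])
print("prediction:", round(prediction, 3))
```

Notice that the first hidden neuron's weighted sum is negative, so ReLU zeroes it out and only the second neuron contributes to the final prediction.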

🎯 Your Turn: Finish a Single Neuron

Fill in the blanks to complete one neuron's weighted sum and activation

Python

# 🎯 YOUR TURN — finish this single neuron's forward pass.
# Fill in the blanks marked ___ . One neuron = weighted sum + bias + activation.

import math

def relu(x):
    return x if x > 0 else 0.0

inputs  = [1.0, 2.0, 3.0]          # 3 features
weights = [0.5, -0.2, 0.1]         # one weight per feature
bias    = 0.4

total = bias
for i in range(3):
    # 👉 add each input multiplied by its matching weight
    total += inputs[i] * ___        # 👉 replace ___ with weights[i]

# 👉 pass the weig
...

2. How Networks Learn: Backprop & Gradient Descent

A fresh network guesses badly. Learning is the process of nudging every weight so the guesses get better. There are two parts:

Backpropagation answers "which weights are to blame, and in which direction?" It sends the error backwards through the layers (using the chain rule from calculus) to compute a gradient — the slope of the loss — for every weight.
Gradient descent uses those gradients to take a step. Each weight moves a little in the direction that lowers the loss. Repeat thousands of times and the network gets good.
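To make the two parts concrete, here is a minimal sketch of backpropagation on the smallest possible "deep" network: one input, one hidden ReLU neuron, one linear output, no biases. All numbers are hypothetical, and a squared-error loss is assumed. Each backward line is one application of the chain rule.

```python
def relu(x):
    return x if x > 0 else 0.0

def forward(w1, w2, x):
    z = w1 * x            # hidden neuron's weighted sum
    h = relu(z)           # hidden activation
    pred = w2 * h         # output neuron (linear)
    return z, h, pred

def loss_fn(pred, target):
    return (pred - target) ** 2

# ASSUMPTION: illustrative input, target, weights and learning rate
x, target = 2.0, 1.0
w1, w2 = 0.5, 1.5
learning_rate = 0.1

# Forward pass
z, h, pred = forward(w1, w2, x)
loss_before = loss_fn(pred, target)

# Backward pass: start at the loss and walk back one layer at a time
d_pred = 2 * (pred - target)              # d loss / d pred
d_w2 = d_pred * h                         # blame assigned to w2
d_h = d_pred * w2                         # error passed back to the hidden neuron
d_z = d_h * (1.0 if z > 0 else 0.0)       # ReLU passes gradient only where z > 0
d_w1 = d_z * x                            # blame assigned to w1

# Gradient descent: step each weight opposite its gradient
w1_new = w1 - learning_rate * d_w1
w2_new = w2 - learning_rate * d_w2
loss_after = loss_fn(forward(w1_new, w2_new, x)[2], target)

print("gradients:", d_w1, d_w2)
print("loss before:", round(loss_before, 4), "loss after:", round(loss_after, 4))
```

A single update lowers the loss here. Frameworks repeat exactly this bookkeeping for every weight in every layer, automatically.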

The clearest way to feel gradient descent is on a single number. Below, you minimise f(x) = (x - 3)². You already know the answer is x = 3 — watch the algorithm discover it by always stepping downhill, opposite the gradient.

Worked Example: One Function, Many Gradient-Descent Steps

Watch x walk downhill toward the minimum, step by step

Python

# Gradient descent: the rule that lets a network LEARN.
# Goal: find the x that makes f(x) = (x - 3)**2 as small as possible.
# The minimum is obviously x = 3 — watch the algorithm walk towards it.

def f(x):
    return (x - 3) ** 2                  # the "loss" we want to minimise

def gradient(x):
    return 2 * (x - 3)                   # slope of f at x (calculus gives this)

x = 0.0                                  # a bad starting guess
learning_rate = 0.1                      # how big ea
...
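A full version of this loop can be sketched as follows. The starting guess and learning rate match the values shown above; the step count of 20 is an assumption made for illustration.

```python
def f(x):
    return (x - 3) ** 2                  # the "loss" we want to minimise

def gradient(x):
    return 2 * (x - 3)                   # slope of f at x

x = 0.0                                  # a bad starting guess
learning_rate = 0.1                      # how big each step is
steps = 20                               # ASSUMPTION: illustrative step count

losses = []
for step in range(1, steps + 1):
    x = x - learning_rate * gradient(x)  # step downhill, opposite the gradient
    losses.append(f(x))
    if step % 5 == 0:
        print(f"step {step:2d}: x = {x:.4f}, loss = {f(x):.6f}")

print("final x:", round(x, 3))
```

Each step shrinks the gap to 3 by the same factor (0.8 with this learning rate), so the loss falls on every step and x settles just below 3.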

🎯 Your Turn: Take One Step Downhill

Apply the gradient-descent update rule for a single step

Python

# 🎯 YOUR TURN — take ONE gradient-descent step by hand.
# Same function as before: f(x) = (x - 3)**2, gradient = 2*(x - 3).

def gradient(x):
    return 2 * (x - 3)

x = 5.0                  # current guess (too high — the minimum is at 3)
learning_rate = 0.1

g = gradient(x)          # slope at x = 5  ->  4.0
# 👉 move x DOWNHILL: subtract learning_rate * gradient
new_x = x - ___ * g      # 👉 replace ___ with learning_rate

print("gradient:", g)
print("new x:", round(new_x, 3))

# ✅ Expected 
...

3. Loss Functions: Measuring "How Wrong"

Before a network can improve, it needs a single number that says how bad its current guesses are. That number is the loss. Gradient descent's whole job is to make this number smaller.

Different tasks use different loss functions:

Mean Squared Error (MSE) — for predicting numbers (regression). It squares the gap between prediction and truth, so big mistakes hurt a lot.
Cross-entropy — for classification (cat vs dog, spam vs not). It heavily punishes being confidently wrong.

# Loss = one number that says "how wrong are we?"
predictions = [2.8, 5.3, 6.5]
targets     = [3.0, 5.0, 7.0]

# Mean Squared Error: average of the squared gaps
errors  = [(p - t) for p, t in zip(predictions, targets)]
squared = [e * e for e in errors]
mse     = sum(squared) / len(squared)

print("errors:", [round(e, 2) for e in errors])  # [-0.2, 0.3, -0.5]
print("MSE:", round(mse, 4))                       # 0.1267
# Lower MSE = better predictions. Training drives this toward 0.
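Cross-entropy can be sketched the same way. This is a minimal binary version for a single example, assuming the model outputs a probability between 0 and 1. The function name and the small `eps` clamp (which avoids taking log of 0) are choices made here for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss for one example: y_true is 0 or 1, p_pred is the predicted P(class = 1)."""
    p = min(max(p_pred, eps), 1 - eps)   # clamp so log() never sees 0
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# The true label is 1 ("spam"). Compare three hypothetical predictions.
confident_right = binary_cross_entropy(1, 0.9)
unsure          = binary_cross_entropy(1, 0.6)
confident_wrong = binary_cross_entropy(1, 0.1)

print("confident and right:", round(confident_right, 3))
print("unsure:             ", round(unsure, 3))
print("confident and wrong:", round(confident_wrong, 3))
```

The confidently wrong prediction costs roughly twenty times more than the confidently right one, which is the "heavily punishes" behaviour described above.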

4. Epochs, Batches, and the Learning Rate

These three dials control how training runs. They trip up beginners, so here they are in plain English:

Epoch

One full pass over ALL your training data. 10 epochs = the network sees every example 10 times.

Batch

A small chunk (e.g. 32 samples) used for one weight update. Smaller batches = more frequent, noisier updates.

Learning rate

How big each weight step is. Too high overshoots; too low crawls. A common default is 0.001.

Rule of thumb: if loss explodes to a huge number or NaN, your learning rate is probably too high — divide it by 10. If loss barely moves, it may be too low.
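How epochs and batches fit together is easiest to see by counting weight updates. The sketch below assumes a hypothetical dataset of 100 samples, a batch size of 32, and 3 epochs; `iterate_batches` is a helper written here, not a library function, and the weight update itself is left as a comment.

```python
import math

def iterate_batches(data, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# ASSUMPTION: illustrative sizes
num_samples = 100
batch_size = 32
epochs = 3
data = list(range(num_samples))          # stand-in for real training examples

updates = 0
for epoch in range(epochs):              # one epoch = one full pass over the data
    for batch in iterate_batches(data, batch_size):
        updates += 1                     # a real loop would do one weight update here

batch_sizes = [len(b) for b in iterate_batches(data, batch_size)]
updates_per_epoch = math.ceil(num_samples / batch_size)

print("batch sizes in one epoch:", batch_sizes)
print("updates per epoch:", updates_per_epoch)
print("total updates:", updates)
```

The last batch is smaller because 100 does not divide evenly by 32. Real training loops usually also shuffle the data at the start of each epoch, which this sketch leaves out.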

5. Frameworks & Overfitting (TensorFlow/Keras, PyTorch)

You'd never hand-code backprop for a real model. Frameworks do the calculus and GPU work for you. The two most popular are TensorFlow/Keras (a model is a short list of layers, which makes it the gentlest start) and PyTorch (flexible and Pythonic, favoured in research). Both are built on the same ideas you just learned.

A big risk with deep models is overfitting: the network memorises the training data and then flops on new data. Regularisation fights this. The most common trick is dropout — during training it randomly switches off a fraction of neurons, forcing the network to spread its knowledge out instead of relying on a few memorised paths.
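Dropout itself is only a few lines. The sketch below is a plain-Python version of the common "inverted dropout" scheme, where surviving activations are scaled up by 1 / (1 - rate) so the layer's expected output stays the same. The function name, the seeded random generator, and the activation values are assumptions for illustration.

```python
import random

def dropout(activations, rate, training, rng):
    """Randomly zero a fraction `rate` of activations, but only while training."""
    if not training or rate == 0.0:
        return list(activations)         # at prediction time, nothing is dropped
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                   # seeded so the run is repeatable
hidden = [0.8, 0.3, 1.5, 0.6, 1.1]       # hypothetical hidden-layer outputs

train_out = dropout(hidden, rate=0.2, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.2, training=False, rng=rng)

print("training: ", [round(v, 3) for v in train_out])
print("inference:", infer_out)
```

Because a different random set of neurons is silenced on every training step, no single neuron can be relied on, which is what pushes the network to spread its knowledge out.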

The Keras model below uses the same layers, ReLU, sigmoid, Adam optimiser, and binary cross-entropy loss from this lesson — plus one Dropout layer. Run it where TensorFlow is installed; here the expected summary is shown so you can read it as a reference.

Worked Example: A Small Keras Model (with Dropout)

The same concepts in a real framework — read the expected summary

Python

# A real deep network in Keras — the SAME ideas, just fewer lines.
# (Run this where TensorFlow is installed; here it is a worked reference.)

import tensorflow as tf
from tensorflow import keras

# Build: 2 inputs -> 16 hidden (ReLU) -> 8 hidden (ReLU) -> 1 output (sigmoid)
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    keras.layers.Dropout(0.2),               # regularisation — fights overfitting
    keras.layers.Dense(8, activation="relu"),
   
...

6. Common Errors (And How to Fix Them)

These four problems trip up almost every beginner. Here's how to spot and fix them:

🔥 Learning rate too high

The loss jumps around, balloons to a giant number, or becomes NaN. The steps are so big that gradient descent overshoots the minimum every time.

✅ Fix: lower the learning rate (try 0.001, then divide by 10 if it still diverges).
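You can watch this happen on the same f(x) = (x - 3)² used earlier. The sketch below runs 20 steps at three learning rates; the specific rates and the `run` helper are assumptions chosen to show crawling, converging, and diverging.

```python
def gradient(x):
    return 2 * (x - 3)                   # slope of f(x) = (x - 3)**2

def run(learning_rate, steps=20, x=0.0):
    """Run gradient descent from x and return where it ends up."""
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

too_low  = run(0.001)    # barely moves from the starting point
good     = run(0.1)      # lands close to the minimum at 3
too_high = run(1.1)      # every step overshoots further than the last

print("lr = 0.001 ->", round(too_low, 3))
print("lr = 0.1   ->", round(good, 3))
print("lr = 1.1   ->", round(too_high, 3))
```

With the rate set to 1.1, each step lands on the far side of the minimum and further away than before, so the distance from 3 grows instead of shrinking.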

📏 No input normalisation

One feature ranges 0–1 and another ranges 0–100,000. Training is unstable or painfully slow because the large feature dominates every gradient.

✅ Fix: scale features to a similar range (e.g. subtract the mean and divide by the standard deviation) before training.
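The fix above (subtract the mean, divide by the standard deviation) can be sketched in plain Python. The feature values are hypothetical, and `standardise` is a helper name introduced here.

```python
import statistics

def standardise(values):
    """Rescale a feature to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)      # population standard deviation
    if std == 0:
        return [0.0 for _ in values]     # constant feature: nothing to scale
    return [(v - mean) / std for v in values]

# ASSUMPTION: two illustrative features on very different scales
small_feature = [0.1, 0.4, 0.5, 0.9]              # ranges roughly 0 to 1
large_feature = [12000, 45000, 52000, 98000]      # ranges up to 100,000

small_scaled = standardise(small_feature)
large_scaled = standardise(large_feature)

print("small:", [round(v, 2) for v in small_scaled])
print("large:", [round(v, 2) for v in large_scaled])
```

After scaling, both features sit in a similar range, so neither dominates the gradient. In practice, compute the mean and standard deviation on the training data only and reuse those same numbers for any new data.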

🧠 Overfitting with no regularisation

Training accuracy hits 99% but new-data accuracy is poor. The model memorised the training set instead of the pattern.

✅ Fix: add Dropout, get more data, or stop training earlier (early stopping).
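Early stopping is simple enough to sketch by hand: track the loss on held-out validation data and stop once it has failed to improve for a set number of epochs (the "patience"). The validation losses below are made-up numbers, and the function is an illustration rather than a framework API.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch where training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch     # stop: no improvement for `patience` epochs
    return len(val_losses), best_epoch       # never triggered: ran every epoch

# ASSUMPTION: hypothetical validation loss per epoch (improves, then gets worse)
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]

stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=2)
print("stopped at epoch:", stopped_at)
print("best epoch:", best_epoch)
```

Validation loss rising while training loss keeps falling is the classic sign of overfitting, so stopping near the best validation epoch keeps the model from memorising further.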

📉 Too little data

With only a handful of examples, a deep network has nothing to generalise from and overfits instantly — it can't tell signal from noise.

✅ Fix: gather more data, use data augmentation, or pick a smaller/simpler model that needs fewer examples.

📋 Quick Reference

Term	What It Means	Typical Choice
Forward pass	Run data through every layer to get a prediction	—
Backpropagation	Send error backwards to compute each weight's gradient	Automatic in frameworks
Loss (regression)	Measures error on number predictions	MSE
Loss (classification)	Measures error on category predictions	Cross-entropy
Learning rate	Step size of each weight update	0.001 (Adam)
Epoch / Batch	One full data pass / one update chunk	10–100 / 32
Activation (hidden)	Adds non-linearity between layers	ReLU
Regularisation	Fights overfitting	Dropout (0.2–0.5)

❓ Frequently Asked Questions

Q: What makes a neural network "deep"?

A: Depth means stacking many hidden layers between the input and output. Each layer transforms the previous layer's output, so early layers learn simple patterns (edges, word fragments) and later layers combine them into complex concepts (faces, meaning). One or two layers is "shallow"; many layers is "deep".

Q: What is backpropagation in plain English?

A: Backpropagation is how the network figures out which weights to blame for its mistakes. After a forward pass produces an answer, the error is sent backwards layer by layer using the chain rule from calculus, giving each weight a gradient. Gradient descent then nudges every weight in the direction that lowers the loss.

Q: What's the difference between an epoch, a batch, and the learning rate?

A: An epoch is one full pass over all your training data. A batch is a small chunk of that data the network updates from at a time (e.g. 32 samples). The learning rate controls how big each weight update is — too high and training diverges, too low and it crawls.

Q: Should I use TensorFlow/Keras or PyTorch?

A: Both are excellent and teach the same concepts. Keras (on top of TensorFlow) is the gentlest start, since a model is a short list of layers. PyTorch is favoured in research for its flexible, Pythonic feel. Pick one, learn the ideas, and the other becomes easy.

Q: What is overfitting and how does dropout help?

A: Overfitting is when a model memorises the training data instead of learning the general pattern, so it does well on data it has seen but poorly on new data. Dropout randomly switches off a fraction of neurons during training, forcing the network to spread its learning out rather than rely on a few memorised paths.

🎯 Mini-Challenge: Gradient Descent in a Loop

You've taken one step by hand and watched a worked loop. Now write the loop yourself from the outline below — no filled-in logic this time, just the plan.

Mini-Challenge

Minimise f(x) = (x - 10)² with your own gradient-descent loop

Python

# 🎯 MINI-CHALLENGE: run gradient descent in a loop.
# Minimise f(x) = (x - 10)**2 . The answer should march toward x = 10.
#
# 1. Define gradient(x) that returns 2 * (x - 10)
# 2. Start x at 0.0 and set learning_rate to 0.1
# 3. Loop 20 times: each pass, do  x = x - learning_rate * gradient(x)
# 4. After the loop, print the final x rounded to 2 places
#
# ✅ Expected (roughly): final x is close to 9.88 after 20 steps
#    (more steps -> even closer to 10)

# your code here

🎉

Lesson 8 complete — you understand deep learning's engine!

You can run a forward pass by hand, explain backpropagation and gradient descent, choose the right loss function, set epochs/batches/learning rate sensibly, build a small Keras model, and fight overfitting with dropout. These ideas sit at the core of modern deep learning systems.

🚀 Up next: Natural Language Processing — teaching computers to understand and generate human text.
