Lesson 8 • Intermediate
Deep Learning Fundamentals
By the end of this lesson you'll be able to explain how a deep network turns numbers into predictions, how it learns from its mistakes with backpropagation and gradient descent, and how to build one in Keras without getting lost.
What You'll Learn in This Lesson
- ✓ What makes a network "deep": stacking many hidden layers
- ✓ Run a forward pass by hand through a tiny 2-layer network
- ✓ The intuition behind backpropagation and gradient descent
- ✓ How loss functions measure "how wrong" a model is
- ✓ What epochs, batches, and the learning rate actually control
- ✓ Build a small Keras model and fight overfitting with dropout
🎯 Real-World Analogy: A Factory Assembly Line
Picture a factory assembly line. Raw materials enter at one end; each station does one small job and passes the result on. By the last station, scattered parts have become a finished car. A deep network works the same way — each layer is a station that refines the data a little before passing it along.
The first layers detect simple things (edges in an image, fragments of a word). Middle layers combine those into shapes or phrases. The final layer makes the call: "that's a golden retriever" or "this review is positive." Deep just means many of these stations stacked in a row — and more stations means the network can learn more abstract ideas.
1. What Makes a Network "Deep"?
A layer is a row of neurons. A network with one or two layers is called shallow; a network with many layers is deep. The data flows forward: the output of one layer becomes the input to the next. This step — running data through every layer to get a prediction — is called a forward pass.
Two breakthroughs made deep networks practical. The ReLU activation (which keeps positive numbers and zeroes out negatives) greatly eased the vanishing-gradient problem, where gradients shrink toward zero as they travel back through a deep stack. GPUs then made it fast enough to train millions of weights. Below, you'll do a forward pass entirely by hand so the layers stop being a mystery.
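To make the vanishing-gradient remark concrete, here is a minimal sketch. It assumes a simplified picture in which the gradient reaching the first layer is just the product of one activation slope per layer (real backpropagation also multiplies by the weights). The depth of 10, the pre-activation values, and the helper names are illustrative assumptions, not part of the lesson's worked example.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_slope(x):
    s = sigmoid(x)
    return s * (1 - s)               # derivative of sigmoid; never above 0.25

def relu_slope(x):
    return 1.0 if x > 0 else 0.0     # derivative of ReLU for a given input

depth = 10                           # hypothetical number of stacked layers

# Best case for sigmoid: its slope peaks at x = 0, where it equals 0.25.
sigmoid_chain = sigmoid_slope(0.0) ** depth

# ReLU on any positive input has slope exactly 1, so nothing shrinks.
relu_chain = relu_slope(1.0) ** depth

print("sigmoid slope product over", depth, "layers:", sigmoid_chain)
print("ReLU slope product over", depth, "layers:", relu_chain)
```

Even in sigmoid's best case the product collapses to roughly a millionth after ten layers, while the ReLU product stays at 1 for active neurons. That is the intuition behind ReLU's popularity in deep stacks.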
Read every comment in this worked example, then run it. Each hidden neuron is just a weighted sum of the inputs, plus a bias, passed through an activation.
Worked Example: A 2-Layer Forward Pass (Plain Python)
Run data through two layers by hand and watch a prediction appear
# A tiny "deep" network by hand: 2 inputs -> 2 hidden neurons -> 1 output
# No frameworks. Just lists and arithmetic so you can SEE what a layer does.
import math
def relu(x):
    return x if x > 0 else 0.0 # ReLU: keep positives, zero the rest
def sigmoid(x):
    return 1 / (1 + math.exp(-x)) # squashes any number into 0..1
# --- The input the network "sees" ---
inputs = [0.5, 0.9] # 2 features (e.g. brightness, size)
# --- Layer 1: 2 inputs -> 2 hidden
...
🎯 Your Turn: Finish a Single Neuron
Fill in the blanks to complete one neuron's weighted sum and activation
# 🎯 YOUR TURN — finish this single neuron's forward pass.
# Fill in the blanks marked ___ . One neuron = weighted sum + bias + activation.
import math
def relu(x):
    return x if x > 0 else 0.0
inputs = [1.0, 2.0, 3.0] # 3 features
weights = [0.5, -0.2, 0.1] # one weight per feature
bias = 0.4
total = bias
for i in range(3):
    # 👉 add each input multiplied by its matching weight
    total += inputs[i] * ___ # 👉 replace ___ with weights[i]
# 👉 pass the weig
...
2. How Networks Learn: Backprop & Gradient Descent
A fresh network guesses badly. Learning is the process of nudging every weight so the guesses get better. There are two parts:
- Backpropagation answers "which weights are to blame, and in which direction?" It sends the error backwards through the layers (using the chain rule from calculus) to compute a gradient — the slope of the loss — for every weight.
- Gradient descent uses those gradients to take a step. Each weight moves a little in the direction that lowers the loss. Repeat thousands of times and the network gets good.
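The two parts above can be sketched on the smallest possible "network": one neuron with one weight and one bias. This is an illustration under assumed values (the input, target, starting weights, and learning rate are all made up), and it uses a squared-error loss for simplicity. The chain rule gives the gradient, a finite-difference estimate double-checks it, and one gradient-descent step lowers the loss.

```python
def predict(w, b, x):
    return w * x + b                      # a single linear neuron

def loss(w, b, x, target):
    return (predict(w, b, x) - target) ** 2

# Hypothetical values chosen only for illustration
x, target = 2.0, 1.0
w, b = 0.5, 0.1
learning_rate = 0.05

# --- Backpropagation: chain rule, from the loss back to each parameter ---
pred = predict(w, b, x)
d_loss_d_pred = 2 * (pred - target)       # how the loss changes with the prediction
grad_w = d_loss_d_pred * x                # d_pred/d_w = x
grad_b = d_loss_d_pred * 1.0              # d_pred/d_b = 1

# --- Sanity check: estimate the same slope numerically ---
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)

# --- Gradient descent: step each parameter against its gradient ---
loss_before = loss(w, b, x, target)
new_w = w - learning_rate * grad_w
new_b = b - learning_rate * grad_b
loss_after = loss(new_w, new_b, x, target)

print("grad_w (chain rule):", grad_w, "| numeric estimate:", numeric_grad_w)
print("loss before:", loss_before, "| loss after one step:", loss_after)
```

A framework does exactly this bookkeeping for every weight in every layer, which is why you rarely write it by hand.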
The clearest way to feel gradient descent is on a single number. Below, you minimise f(x) = (x - 3)². You already know the answer is x = 3 — watch the algorithm discover it by always stepping downhill, opposite the gradient.
Worked Example: One Function, Many Gradient-Descent Steps
Watch x walk downhill toward the minimum, step by step
# Gradient descent: the rule that lets a network LEARN.
# Goal: find the x that makes f(x) = (x - 3)**2 as small as possible.
# The minimum is obviously x = 3 — watch the algorithm walk towards it.
def f(x):
    return (x - 3) ** 2 # the "loss" we want to minimise
def gradient(x):
    return 2 * (x - 3) # slope of f at x (calculus gives this)
x = 0.0 # a bad starting guess
learning_rate = 0.1 # how big ea
...
🎯 Your Turn: Take One Step Downhill
Apply the gradient-descent update rule for a single step
# 🎯 YOUR TURN — take ONE gradient-descent step by hand.
# Same function as before: f(x) = (x - 3)**2, gradient = 2*(x - 3).
def gradient(x):
    return 2 * (x - 3)
x = 5.0 # current guess (too high — the minimum is at 3)
learning_rate = 0.1
g = gradient(x) # slope at x = 5 -> 4.0
# 👉 move x DOWNHILL: subtract learning_rate * gradient
new_x = x - ___ * g # 👉 replace ___ with learning_rate
print("gradient:", g)
print("new x:", round(new_x, 3))
# ✅ Expected
...
3. Loss Functions: Measuring "How Wrong"
Before a network can improve, it needs a single number that says how bad its current guesses are. That number is the loss. Gradient descent's whole job is to make this number smaller.
Different tasks use different loss functions:
- Mean Squared Error (MSE) — for predicting numbers (regression). It squares the gap between prediction and truth, so big mistakes hurt a lot.
- Cross-entropy — for classification (cat vs dog, spam vs not). It heavily punishes being confidently wrong.
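To see what "heavily punishes being confidently wrong" looks like in numbers, here is a small sketch of binary cross-entropy for a single example. The function name, the clipping constant, and the three probabilities are assumptions made for illustration.

```python
import math

def binary_cross_entropy(target, predicted_prob):
    # Clip the probability so log(0) can never occur.
    eps = 1e-12
    p = min(max(predicted_prob, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# The true label is 1 ("cat") in all three cases; only the model's confidence changes.
confident_right = binary_cross_entropy(1, 0.95)
unsure = binary_cross_entropy(1, 0.50)
confident_wrong = binary_cross_entropy(1, 0.05)

print("confident and right:", round(confident_right, 3))
print("unsure:", round(unsure, 3))
print("confident and wrong:", round(confident_wrong, 3))
```

The loss grows slowly while the model is roughly right and steeply once it commits to the wrong answer, so the largest gradients come from the worst mistakes.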
# Loss = one number that says "how wrong are we?"
predictions = [2.8, 5.3, 6.5]
targets = [3.0, 5.0, 7.0]
# Mean Squared Error: average of the squared gaps
errors = [(p - t) for p, t in zip(predictions, targets)]
squared = [e * e for e in errors]
mse = sum(squared) / len(squared)
print("errors:", [round(e, 2) for e in errors]) # [-0.2, 0.3, -0.5]
print("MSE:", round(mse, 4)) # 0.1267
# Lower MSE = better predictions. Training drives this toward 0.
4. Epochs, Batches, and the Learning Rate
These three dials control how training runs. They trip up beginners, so here they are in plain English:
Epoch
One full pass over ALL your training data. 10 epochs = the network sees every example 10 times.
Batch
A small chunk (e.g. 32 samples) used for one weight update. Smaller batches = more frequent, noisier updates.
Learning rate
How big each weight step is. Too high overshoots; too low crawls. A common default is 0.001.
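The relationship between epochs and batches is easiest to see by counting weight updates. The sketch below assumes a made-up dataset of 1,000 samples; the variable names are illustrative.

```python
import math

num_samples = 1000   # hypothetical dataset size
batch_size = 32
epochs = 10

# One weight update per batch; the last batch may be smaller than the rest.
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs

# Slicing the data into batches shows where the partial batch comes from.
data = list(range(num_samples))
batches = [data[i:i + batch_size] for i in range(0, num_samples, batch_size)]

print("updates per epoch:", updates_per_epoch)
print("total updates over", epochs, "epochs:", total_updates)
print("size of the final batch:", len(batches[-1]))
```

Halving the batch size doubles the number of updates per epoch, which is why smaller batches mean more frequent (and noisier) steps.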
If the loss becomes NaN, your learning rate is probably too high, so divide it by 10. If loss barely moves, it may be too low.
5. Frameworks & Overfitting (TensorFlow/Keras, PyTorch)
You'd never hand-code backprop for a real model. Frameworks do the calculus and GPU work for you. The two most popular are TensorFlow/Keras (a model is a short list of layers, which makes it the gentlest start) and PyTorch (flexible and Pythonic, favoured in research). Both are built on the same ideas you just learned.
A big risk with deep models is overfitting: the network memorises the training data and then flops on new data. Regularisation fights this. The most common trick is dropout — during training it randomly switches off a fraction of neurons, forcing the network to spread its knowledge out instead of relying on a few memorised paths.
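Dropout itself is only a few lines. The sketch below is a plain-Python illustration of the "inverted dropout" variant, in which the surviving activations are scaled up by 1 / (1 - rate) during training so nothing needs to change at prediction time. The function name, the seeded random generator, and the sample activations are assumptions for illustration; a framework's Dropout layer handles this for you.

```python
import random

def dropout(activations, rate, training, rng):
    # At prediction time dropout is switched off: values pass through unchanged.
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # Keep each value with probability keep_prob and scale it up; otherwise zero it.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)       # seeded so repeated runs give the same mask
hidden = [1.0] * 10          # hypothetical hidden-layer outputs

train_out = dropout(hidden, rate=0.2, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.2, training=False, rng=rng)

print("training:  ", train_out)
print("prediction:", eval_out)
```

Because a different random subset is silenced on every training step, no single neuron can be relied on, which pushes the network toward redundant, more general features.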
The Keras model below uses the same layers, ReLU, sigmoid, Adam optimiser, and binary cross-entropy loss from this lesson — plus one Dropout layer. Run it where TensorFlow is installed; here the expected summary is shown so you can read it as a reference.
Worked Example: A Small Keras Model (with Dropout)
The same concepts in a real framework — read the expected summary
# A real deep network in Keras — the SAME ideas, just fewer lines.
# (Run this where TensorFlow is installed; here it is a worked reference.)
import tensorflow as tf
from tensorflow import keras
# Build: 2 inputs -> 16 hidden (ReLU) -> 8 hidden (ReLU) -> 1 output (sigmoid)
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    keras.layers.Dropout(0.2), # regularisation: fights overfitting
    keras.layers.Dense(8, activation="relu"),
...
6. Common Errors (And How to Fix Them)
These four problems trip up almost every beginner. Here's how to spot and fix them:
🔥 Learning rate too high
The loss jumps around, balloons to a giant number, or becomes nan. The steps are so big that gradient descent overshoots the minimum every time.
✅ Fix: lower the learning rate (try 0.001, then divide by 10 if it still diverges).
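You can reproduce this failure on the lesson's own function, f(x) = (x - 3)². The sketch below runs the same 20-step loop with three learning rates; the helper name `run` and the specific rates are illustrative choices.

```python
def gradient(x):
    return 2 * (x - 3)               # slope of f(x) = (x - 3)**2

def run(learning_rate, steps=20, start=0.0):
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

good = run(0.1)          # steady progress toward the minimum at 3
too_high = run(1.1)      # every step overshoots further than the last
too_low = run(0.0001)    # barely leaves the starting point

print("lr = 0.1    ->", round(good, 3))
print("lr = 1.1    ->", round(too_high, 3))
print("lr = 0.0001 ->", round(too_low, 3))
```

With the rate at 1.1, each step multiplies the distance from the minimum by 1.2 and flips its sign, so the error grows without bound. In a real network, that growth is what eventually shows up as `nan`.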
📏 No input normalisation
One feature ranges 0–1 and another ranges 0–100,000. Training is unstable or painfully slow because the large feature dominates every gradient.
✅ Fix: scale features to a similar range (e.g. subtract the mean and divide by the standard deviation) before training.
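The "subtract the mean and divide by the standard deviation" fix can be sketched with the standard library alone. The two feature columns below are made-up values, and the function name is an assumption. The sketch also assumes the feature is not constant (a standard deviation of zero would need special handling).

```python
import statistics

def standardise(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)      # population standard deviation
    return [(v - mean) / std for v in values]

# Two hypothetical features on wildly different scales
small_feature = [0.1, 0.4, 0.5, 0.9]
large_feature = [12_000.0, 45_000.0, 51_000.0, 98_000.0]

scaled_small = standardise(small_feature)
scaled_large = standardise(large_feature)

print("small, scaled:", [round(v, 2) for v in scaled_small])
print("large, scaled:", [round(v, 2) for v in scaled_large])
```

After scaling, both features have mean 0 and standard deviation 1, so neither dominates the gradient. In practice, compute the mean and standard deviation on the training data only and reuse those same numbers for new data.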
🧠 Overfitting with no regularisation
Training accuracy hits 99% but new-data accuracy is poor. The model memorised the training set instead of the pattern.
✅ Fix: add Dropout, get more data, or stop training earlier (early stopping).
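Early stopping is simple enough to sketch by hand. The version below assumes you record a validation loss after each epoch and stop once it has failed to improve for `patience` epochs in a row. The loss values are invented for illustration, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch where training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch     # never triggered: ran every epoch

# Made-up validation losses: they improve, then creep back up as overfitting sets in.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]

stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=2)
print("stopped at epoch", stopped_at, "| best epoch was", best_epoch)
```

Keras provides the same behaviour through its `EarlyStopping` callback, so in a real project you would normally use that rather than rolling your own.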
📉 Too little data
With only a handful of examples, a deep network has nothing to generalise from and overfits instantly — it can't tell signal from noise.
✅ Fix: gather more data, use data augmentation, or pick a smaller/simpler model that needs fewer examples.
📋 Quick Reference
| Term | What It Means | Typical Choice |
|---|---|---|
| Forward pass | Run data through every layer to get a prediction | — |
| Backpropagation | Send error backwards to compute each weight's gradient | Automatic in frameworks |
| Loss (regression) | Measures error on number predictions | MSE |
| Loss (classification) | Measures error on category predictions | Cross-entropy |
| Learning rate | Step size of each weight update | 0.001 (Adam) |
| Epoch / Batch | One full data pass / one update chunk | 10–100 / 32 |
| Activation (hidden) | Adds non-linearity between layers | ReLU |
| Regularisation | Fights overfitting | Dropout (0.2–0.5) |
❓ Frequently Asked Questions
Q: What makes a neural network "deep"?
A: Depth means stacking many hidden layers between the input and output. Each layer transforms the previous layer's output, so early layers learn simple patterns (edges, word fragments) and later layers combine them into complex concepts (faces, meaning). One or two layers is "shallow"; many layers is "deep".
Q: What is backpropagation in plain English?
A: Backpropagation is how the network figures out which weights to blame for its mistakes. After a forward pass produces an answer, the error is sent backwards layer by layer using the chain rule from calculus, giving each weight a gradient. Gradient descent then nudges every weight in the direction that lowers the loss.
Q: What's the difference between an epoch, a batch, and the learning rate?
A: An epoch is one full pass over all your training data. A batch is a small chunk of that data the network updates from at a time (e.g. 32 samples). The learning rate controls how big each weight update is — too high and training diverges, too low and it crawls.
Q: Should I use TensorFlow/Keras or PyTorch?
A: Both are excellent and teach the same concepts. Keras (on top of TensorFlow) is the gentlest start: a model is a short list of layers. PyTorch is favoured in research for its flexible, Pythonic feel. Pick one, learn the ideas, and the other becomes easy.
Q: What is overfitting and how does dropout help?
A: Overfitting is when a model memorises the training data instead of learning the general pattern, so it does well on data it has seen but poorly on new data. Dropout randomly switches off a fraction of neurons during training, forcing the network to spread its learning out rather than rely on a few memorised paths.
🎯 Mini-Challenge: Gradient Descent in a Loop
You've taken one step by hand and watched a worked loop. Now write the loop yourself from the outline below — no filled-in logic this time, just the plan.
Mini-Challenge
Minimise f(x) = (x - 10)² with your own gradient-descent loop
# 🎯 MINI-CHALLENGE: run gradient descent in a loop.
# Minimise f(x) = (x - 10)**2 . The answer should march toward x = 10.
#
# 1. Define gradient(x) that returns 2 * (x - 10)
# 2. Start x at 0.0 and set learning_rate to 0.1
# 3. Loop 20 times: each pass, do x = x - learning_rate * gradient(x)
# 4. After the loop, print the final x rounded to 2 places
#
# ✅ Expected (roughly): final x is close to 9.88 after 20 steps
# (more steps -> even closer to 10)
# your code here
Lesson 8 complete: you understand deep learning's engine!
You can run a forward pass by hand, explain backpropagation and gradient descent, choose the right loss function, set epochs/batches/learning rate sensibly, build a small Keras model, and fight overfitting with dropout. That's the core of every deep learning system in the world.
🚀 Up next: Natural Language Processing — teaching computers to understand and generate human text.