Lesson 21 • Advanced
Residual Networks & DenseNets
Learn the one idea — the skip connection — that let neural networks go from a few dozen layers to hundreds, and how to put a pretrained ResNet to work on your own images in minutes.
What You'll Learn in This Lesson
- ✓ Why very deep plain networks degrade and train worse
- ✓ How a skip connection makes output = F(x) + x
- ✓ Why learning the residual makes the identity mapping easy to represent
- ✓ How DenseNet concatenates features and what growth rate means
- ✓ What a bottleneck block is and why it saves computation
- ✓ How to use and fine-tune a pretrained ResNet (transfer learning)
🛣️ Real-World Analogy: Express Lanes for Information
Picture a 100-stop bus route across a city. A regular passenger has to get off and back on at every single stop — by the time they reach the end, the journey is exhausting and a lot has been forgotten. That is a deep plain network: information is reprocessed at every layer, and the original signal slowly gets garbled.
Now imagine an express lane running alongside the route — a shortcut that carries the original passenger straight through, untouched, while the local bus still makes its stops. At each stop you simply add whatever the local route figured out to the passenger who took the express lane. That shortcut is a skip connection. The express lane keeps the original information intact, so the network can be enormously deep without losing the thread.
ResNet builds one express lane per block (it adds the shortcut). DenseNet is even more generous: every stop gets its own dedicated lane to every later stop, and the lanes are bundled together (it concatenates) so nothing is ever thrown away.
1. The Degradation Problem: Deeper Got Worse
You'd expect a deeper network to be at least as good as a shallow one: worst case, the extra layers could just copy their input and do no harm. In practice, before ResNet arrived in 2015, the opposite happened. Stack enough plain layers and accuracy got worse, on the training set as well as the test set. In the ResNet paper's CIFAR-10 experiments, a 56-layer plain network had higher training and test error than a 20-layer one.
This is the degradation problem, and it is not overfitting (overfitting would mean great training accuracy, poor test accuracy). It's an optimisation problem: the deeper network is simply harder to train. The usual intuition is gradient flow. As a gradient travels backwards through dozens of layers during training, it gets multiplied at every step. Multiply by numbers smaller than 1 enough times and it shrinks to almost nothing (the vanishing-gradient problem), so the early layers receive almost no signal telling them how to improve.
2. The ResNet Fix: Learn the Residual
ResNet's idea is almost embarrassingly simple. Instead of asking a block to produce the full desired output H(x), ask it to produce only the residual — the change — and then add the original input back:
output = F(x) + x # F is the block; x arrives via the skip connection
# rearranged: F(x) = output - x -> F learns the "residual" (the change)
The + x term is the skip connection (also called a shortcut or identity connection). It costs no extra parameters; it's just addition. Why does it help so much? Because if the best thing a block can do is leave its input alone (an identity mapping), it just needs to drive F(x) towards zero. Pushing weights towards zero is easy; learning to perfectly reproduce the input with raw weight layers is hard. So a deeper residual network can, in principle, always match a shallower one: the extra blocks can fall back to doing nothing.
The example below builds one residual block in plain Python so you can see F(x) + x directly:
Worked Example: A Residual Block (F(x) + x)
Build one residual block in plain Python and watch the skip connection add the input back
# A residual block in plain Python: output = F(x) + x
# No libraries needed — we use small lists and simple functions
# so you can watch the skip connection do its job.
def relu(v):
    # ReLU keeps positives, zeroes out negatives
    return [max(0.0, n) for n in v]
def scale(v, factor):
    # Stand-in for "a weight layer": multiply every number
    return [n * factor for n in v]
def F(x):
    # The block's learned transformation (two tiny "layers")
    h = relu(scale(x, 0.5)) # layer 1 +
...
3. Gradient Flow: With vs Without the Skip
Here's the heart of why ResNet works. When the training signal flows backwards, a plain layer multiplies the gradient by some factor. Across many layers those factors compound, and if they're below 1, the gradient vanishes. A skip connection changes the maths: because output = F(x) + x, the derivative of each block with respect to its input is F'(x) + 1. The "+1" is an identity path that passes the gradient back unchanged, alongside whatever flows through F, so the signal no longer has to survive a long chain of small multiplications.
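The "+1" can be written out with the chain rule. The sketch below treats every quantity as a scalar for readability; the loss symbol and the indexed names x_i and F_i are notation introduced here for illustration, not taken from the lesson.

```latex
% One residual block, y = F(x) + x:
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial F}{\partial x} + 1 \right)

% N stacked plain layers, x_i = F_i(x_{i-1}):
\frac{\partial \mathcal{L}}{\partial x_0}
  = \frac{\partial \mathcal{L}}{\partial x_N}
    \prod_{i=1}^{N} \frac{\partial F_i}{\partial x_{i-1}}

% N stacked residual blocks, x_i = F_i(x_{i-1}) + x_{i-1}:
\frac{\partial \mathcal{L}}{\partial x_0}
  = \frac{\partial \mathcal{L}}{\partial x_N}
    \prod_{i=1}^{N} \left( 1 + \frac{\partial F_i}{\partial x_{i-1}} \right)
```

In the plain product, every factor below 1 shrinks the result. Multiplying out the residual product always leaves one term equal to 1, which is the gradient that travelled back along the skip connections alone without passing through any F.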
The runnable example below contrasts the two. The plain column shrinks towards zero as depth grows; the residual column stays healthy:
Worked Example: Vanishing vs Surviving Gradients
Compare how the gradient signal fares in a plain net vs a residual net as depth increases
# Why skip connections rescue deep networks: GRADIENT FLOW.
# When gradients travel back through N layers, a PLAIN net multiplies
# them by each layer's factor (they can shrink to ~0 = "vanishing").
# A residual net ADDS 1 at every block, so the signal never dies.
layer_factor = 0.5 # each plain layer weakens the gradient (factor < 1)
depths = [1, 5, 10, 20, 50]
print(f"{'depth':>6} {'plain':>14} {'residual':>12}")
for n in depths:
    plain = layer_factor ** n # 0.5 * 0.5 * ... (n
...
🎯 Your Turn: Complete the Skip Connection
Fill in the blank so the block returns F(x) + x
# 🎯 YOUR TURN: complete the residual block
# Make the block return F(x) + x by adding the input back.
def F(x):
    return [n * 0.5 for n in x] # the block's transformation
def residual_block(x):
    fx = F(x)
    # 👉 add each fx value to the matching x value (the skip connection)
    out = [a + b for a, b in zip(fx, ___)] # replace ___ with the input list
    return out
x = [4.0, 8.0, 2.0]
print("F(x) :", F(x))
print("residual :", residual_block(x))
# ✅ Expected output:
# F(x)
...
4. Bottleneck Blocks: Going Deep Affordably
A 3x3 convolution on many channels is expensive. To build really deep ResNets (50, 101, 152 layers) without the cost exploding, ResNet uses a bottleneck block: a three-step sandwich.
- 1x1 conv — squeeze: shrink the channel count (e.g. 256 → 64). Cheap.
- 3x3 conv — work: do the real spatial processing on the small 64-channel tensor.
- 1x1 conv — expand: grow the channels back (64 → 256) so the skip can add cleanly.
The narrow middle is the "bottleneck". By doing the heavy 3x3 work on far fewer channels, the block needs only a fraction of the computation of a full-width 3x3 layer, which is how the deeper ResNets (50, 101, 152 layers) stay affordable. The skip connection still wraps the whole sandwich: output = bottleneck(x) + x.
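To put a rough number on the saving, the sketch below counts convolution weights for the 256 → 64 → 256 sizes quoted above and compares them with a single full-width 3x3 convolution. It ignores biases and batch-norm parameters, and the helper name `conv_weights` is made up for this illustration.

```python
# Weight counts only (biases and batch-norm parameters are ignored).
# Channel sizes follow the 256 -> 64 -> 256 example in the text.

def conv_weights(in_ch, out_ch, kernel):
    """Number of weights in a kernel x kernel convolution."""
    return in_ch * out_ch * kernel * kernel

wide, narrow = 256, 64

# Comparison point: one 3x3 conv that keeps all 256 channels
plain = conv_weights(wide, wide, 3)

# Bottleneck: 1x1 squeeze, 3x3 on the narrow tensor, 1x1 expand
bottleneck = (
    conv_weights(wide, narrow, 1)
    + conv_weights(narrow, narrow, 3)
    + conv_weights(narrow, wide, 1)
)

print(f"single 3x3 at 256 channels: {plain:,} weights")
print(f"bottleneck 1x1-3x3-1x1:     {bottleneck:,} weights")
print(f"ratio: {plain / bottleneck:.1f}x")
```

With these sizes the three-layer bottleneck uses 69,632 weights against 589,824 for the single wide 3x3 layer, roughly 8.5 times fewer. At a fixed feature-map size the multiply-accumulate count scales by the same factor.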
🔑 Dimensions must match to add
The skip's x and the block's output must be the same shape to add them. When a block changes the channel count or downsamples (stride 2), ResNet puts a 1x1 "projection" convolution on the skip path so both sides line up. Forgetting this is one of the most common build errors (see Common Errors).
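The shape rule can be checked with plain arithmetic before any tensors are involved. This is a minimal sketch that assumes the main path downsamples with a 3x3 convolution (padding 1) and the skip path uses a 1x1 projection with the same stride; the function names and the 64-channel, 56x56 input are illustrative choices, not part of the lesson.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def main_path_shape(in_shape, out_ch, stride):
    """Shape after the main path: a 3x3 conv with padding 1 and the given stride."""
    _, h, w = in_shape
    return (out_ch, conv_out_size(h, 3, stride, 1), conv_out_size(w, 3, stride, 1))

def shortcut_shape(in_shape, out_ch, stride):
    """Shape after the skip path: identity if nothing changes, else a 1x1 projection."""
    in_ch, h, w = in_shape
    if in_ch == out_ch and stride == 1:
        return in_shape  # plain identity skip, no projection needed
    return (out_ch, conv_out_size(h, 1, stride, 0), conv_out_size(w, 1, stride, 0))

x_shape = (64, 56, 56)  # (channels, height, width), illustrative values
main = main_path_shape(x_shape, out_ch=128, stride=2)
skip = shortcut_shape(x_shape, out_ch=128, stride=2)

print("main path:", main)
print("skip path:", skip)
print("shapes match:", main == skip)
print("raw input matches main path:", x_shape == main)
```

Here both paths come out as (128, 28, 28), so they can be added, while the untouched input (64, 56, 56) could not be.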
5. DenseNet: Concatenate Instead of Add
DenseNet (2017) pushed the skip idea to its limit. Where ResNet adds the input (F(x) + x), DenseNet concatenates: inside a dense block, every layer receives the stacked feature maps of all earlier layers as its input. Layer 4 sees the outputs of layers 1, 2, and 3 bundled together.
Each layer contributes only a small fixed number of new feature maps — the growth rate k (often 12 to 32). Because every layer can reach back to all earlier features, the layers themselves can be narrow, which keeps the total parameter count surprisingly low. Concatenation also gives every layer a direct line to the loss, so gradients flow even more freely than in ResNet.
# Inside a dense block (growth rate k), channels grow layer by layer:
# layer 1 input: 16        output adds k -> running total 16 + k
# layer 2 input: 16 + k    output adds k -> running total 16 + 2k
# layer 3 input: 16 + 2k   output adds k -> running total 16 + 3k
# ResNet would keep the channel count fixed by ADDING instead.
Because channels keep growing, DenseNet inserts transition layers (a 1x1 conv plus pooling) between dense blocks to compress the feature maps back down and keep memory in check.
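The channel arithmetic above is easy to reproduce in a few lines. The sketch below is bookkeeping only, with no real convolutions; the function names are hypothetical, and the compression factor of 0.5 for the transition layer is an assumed setting (it is the value used in the DenseNet-BC variant, but the lesson does not specify one).

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Running channel total after each layer of a dense block."""
    totals = []
    channels = in_channels
    for _ in range(num_layers):
        channels += growth_rate  # each layer concatenates k new feature maps
        totals.append(channels)
    return totals

def transition_channels(channels, compression=0.5):
    """Channels after a transition layer's 1x1 conv (compression is an assumption)."""
    return int(channels * compression)

k = 12  # growth rate, within the 12 to 32 range mentioned in the text
totals = dense_block_channels(in_channels=16, growth_rate=k, num_layers=3)

print("after each layer:", totals)
print("after transition:", transition_channels(totals[-1]))
```

With 16 input channels and k = 12 the running totals are 28, 40 and 52, and the assumed transition halves the last figure to 26.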
| | ResNet | DenseNet |
|---|---|---|
| Skip operation | Addition | Concatenation |
| Channels | Stay fixed | Grow by k each layer |
| Feature reuse | Previous block | All earlier layers |
| Parameters (at similar accuracy) | Typically more | Typically fewer |
| Memory use during training | Typically lower | Typically higher |
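The first two rows of the table can be seen directly with plain Python lists standing in for feature maps. This is a toy sketch: F is a made-up transformation that halves each value, and it returns as many values as it receives, whereas a real dense layer would add only k new feature maps.

```python
def F(x):
    # Stand-in for a block's transformation (hypothetical: halve each value)
    return [n * 0.5 for n in x]

def resnet_combine(x):
    # Element-wise addition: the result has the same length as x
    return [a + b for a, b in zip(F(x), x)]

def densenet_combine(x):
    # Concatenation: earlier features are kept and the new ones appended
    return x + F(x)

x = [4.0, 8.0, 2.0]
print("ResNet  :", resnet_combine(x))
print("DenseNet:", densenet_combine(x))
```

The ResNet result keeps three values ([6.0, 12.0, 3.0]), while the DenseNet result grows to six ([4.0, 8.0, 2.0, 2.0, 4.0, 1.0]) and still contains the original input untouched.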
🎯 Your Turn: Prove the Identity Shortcut
Make F(x) return zeros so the residual block output equals its input exactly
# 🎯 YOUR TURN: show the identity shortcut
# A residual block can "do nothing harmful": if F(x) outputs all zeros,
# then F(x) + x must equal x exactly. Fill in the blank so F returns zeros.
def F(x):
    # 👉 return a list of 0.0 the SAME length as x
    return [___ for _ in x] # replace ___ with 0.0
def residual_block(x):
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]
x = [5.0, -1.0, 9.0]
print("output:", residual_block(x))
print("equals input?", residual_block(x) == x)
# ✅
...
6. Using a Pretrained ResNet (Transfer Learning)
You almost never train a ResNet from scratch. A ResNet trained on ImageNet (1.2 million images, 1000 classes) has already learned reusable visual features — edges, textures, shapes, object parts. Transfer learning reuses those learned features: load the pretrained weights, swap out the final classification layer for your own classes, and fine-tune. The two read-only examples below run locally with torchvision installed (pip install torch torchvision).
Load a pretrained model and classify an image:
# Loading a PRETRAINED ResNet-50 with torchvision (run locally).
# It already knows 1000 ImageNet classes — no training needed.
import torch
from torchvision.models import resnet50, ResNet50_Weights
weights = ResNet50_Weights.DEFAULT # best available ImageNet weights
model = resnet50(weights=weights)
model.eval() # inference mode (no dropout/BN updates)
preprocess = weights.transforms() # the EXACT transforms it was trained with
from PIL import Image
img = Image.open("dog.jpg").convert("RGB") # force 3 channels (handles greyscale/RGBA files)
batch = preprocess(img).unsqueeze(0) # shape: [1, 3, 224, 224]
with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top_p, top_id = probs.max(dim=1)
label = weights.meta["categories"][top_id.item()]
print(f"{label}: {top_p.item():.1%}")
# Example output (the label and confidence depend on your image):
# Labrador retriever: 87.3%
Adapt it to your own classes by replacing only the final layer:
# TRANSFER LEARNING: reuse a pretrained ResNet on YOUR classes.
# Freeze the learned features, replace only the final layer.
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
model = resnet50(weights=ResNet50_Weights.DEFAULT)
# 1) Freeze every pretrained layer so training won't disturb them
for param in model.parameters():
    param.requires_grad = False
# 2) Replace the 1000-class head with one for YOUR task (say 3 classes)
num_features = model.fc.in_features # 2048 for ResNet-50
model.fc = nn.Linear(num_features, 3) # only THIS layer will train
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}") # tiny vs. 25M total
# Expected output:
# Trainable parameters: 6147
Freeze the backbone (requires_grad = False) and train only the new head: about six thousand parameters (2048 × 3 weights + 3 biases = 6147) instead of 25 million. That's why transfer learning can work from a few hundred images and finish in minutes, not days. Always feed images through weights.transforms() so they're normalised exactly as the model expects.
Common Errors (And How to Fix Them)
❌ Degradation: stacking plain layers without skips
Building a very deep network out of plain conv layers and finding it trains worse than a shallow one. The gradient vanishes before it reaches the early layers.
✅ Fix:
# Wrap every block with a skip connection so the gradient has a direct path:
out = block(x) + x # residual form, never plain block(x) alone for deep nets
# In PyTorch you literally write: return self.block(x) + identity
❌ Dimension mismatch in the skip path
You change channels or downsample inside the block, then try to add the original input — and PyTorch raises a shape error because the two tensors don't match.
# RuntimeError: The size of tensor a (256) must match the size of
# tensor b (64) at non-singleton dimension 1
✅ Fix:
# Project the skip with a 1x1 conv (and matching stride) so shapes line up:
self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
out = self.block(x) + self.shortcut(x) # now both are [N, out_ch, H, W]
❌ Training from scratch instead of using pretrained weights
Calling resnet50() with no weights gives a randomly initialised network. On a small dataset it trains slowly and underperforms because it has to relearn basic vision from nothing.
✅ Fix:
# Always start from ImageNet weights for transfer learning:
model = resnet50(weights=ResNet50_Weights.DEFAULT) # NOT resnet50(weights=None)
❌ Overfitting a big model on a small dataset
Fine-tuning all 25M parameters on a few hundred images: training accuracy hits 100% while validation accuracy stalls. The model memorised the data instead of learning it.
✅ Fix:
# Freeze the backbone, train only the new head, and add data augmentation:
for p in model.parameters():
    p.requires_grad = False # freeze features
model.fc = nn.Linear(model.fc.in_features, num_classes) # train head only
# Also: use augmentation (flips/crops), early stopping, and a small LR.
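Early stopping, mentioned in the fix above, needs very little code. The sketch below is one common way to do it, in plain Python so it runs without a training loop; the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all invented for illustration and are not a PyTorch or torchvision API.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest drop that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improving at first, then drifting upwards
val_losses = [0.90, 0.70, 0.62, 0.63, 0.65, 0.66, 0.70]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print("best validation loss:", stopper.best)
print("stopped at epoch:", stopped_at)
```

With these numbers the best loss (0.62) occurs at epoch 3 and the loop stops at epoch 6, after three epochs without improvement. In a real run you would usually also save the model weights whenever the best loss improves, so you can restore them after stopping.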
📋 Quick Reference
| Concept | What It Does | Key Idea |
|---|---|---|
| Degradation problem | Deep plain nets train worse | optimisation, not overfitting |
| Skip connection | Adds input to output | output = F(x) + x |
| Residual F(x) | The change a block learns | F(x) = H(x) - x |
| Identity mapping | Block does no harm | push F(x) → 0 |
| Bottleneck block | Cheap deep block | 1x1 → 3x3 → 1x1 |
| DenseNet | Connects all layers | concatenate, growth rate k |
| Transfer learning | Reuse pretrained features | freeze backbone, swap head |
❓ Frequently Asked Questions
Q: What problem do skip connections actually solve?
A: They solve the degradation problem. Before ResNet, stacking more layers made very deep plain networks HARDER to train, so a 56-layer net could score worse than a 20-layer one — not from overfitting, but because the gradient signal weakened (vanished) on its long trip back through the layers. A skip connection adds the block's input straight to its output, giving gradients a direct path back and letting each block learn only a small change instead of a whole mapping.
Q: What is a residual, and why is learning it easier?
A: A residual is the change a block makes: F(x) = H(x) - x, where H(x) is the full mapping you want and x is the input. The block computes F(x) and the skip adds x back, so output = F(x) + x. If the best thing a block can do is leave the input alone (an identity mapping), it just has to push F(x) towards zero — far easier than learning the identity from scratch with raw weight layers.
Q: How is DenseNet different from ResNet?
A: ResNet ADDS the input to the output (output = F(x) + x), so the channel count stays the same. DenseNet CONCATENATES: each layer receives the feature maps of every earlier layer stacked together, so channels grow by a fixed 'growth rate' k each layer. DenseNet maximises feature reuse and often needs fewer parameters; ResNet uses less memory and is the simpler default.
Q: What is a bottleneck block?
A: A bottleneck block uses a 1x1 convolution to shrink the channel count, then a 3x3 convolution to do the real work on the cheaper representation, then another 1x1 convolution to expand the channels back. The narrow middle (the 'bottleneck') cuts the amount of computation dramatically, which is how ResNet-50/101/152 stay affordable despite their depth.
Q: Should I build my own ResNet or use a pretrained one?
A: Use a pretrained one. A ResNet trained on ImageNet has already learned generic visual features (edges, textures, shapes) from over a million images. With transfer learning you load those weights, replace only the final classification layer, and fine-tune on your data — getting strong results from a few hundred images in minutes instead of training from scratch for days.
🎯 Mini-Challenge: Stack 10 Residual Blocks
Time to fly solo. Build a residual block from scratch and pass a signal through it ten times. The starter block gives you the brief only — write the logic yourself, then check it against the expected output.
Mini-Challenge
Stack residual blocks and watch the signal survive instead of vanishing
# 🎯 MINI-CHALLENGE: stack residual blocks and keep the signal alive
# 1. Write F(x): multiply every number in the list by 0.5
# 2. Write residual_block(x): return F(x) + x (use zip)
# 3. Start with x = [1.0, 1.0, 1.0]
# 4. Pass x through the residual block 10 times in a loop
# 5. Print the result after all 10 blocks
#
# ✅ Expected output (the signal GROWS instead of vanishing):
# after 10 residual blocks: [57.665039..., 57.665039..., 57.665039...]
# (each value is 1.5 ** 10 = 57.66...)
# your
...
Lesson complete: you understand what made deep networks possible!
You can explain the degradation problem, build a residual block (output = F(x) + x), describe why the identity shortcut and bottleneck make 100+ layer networks trainable, contrast ResNet's addition with DenseNet's concatenation, and fine-tune a pretrained ResNet with transfer learning.
🚀 Up next: Training Stability Techniques — the methods (batch norm, learning-rate schedules, and more) that keep these deep networks training smoothly.