Lesson 22 • Advanced
Training Stability Techniques
Master normalization, weight initialization, and gradient clipping — the essential techniques that make deep learning actually work.
✅ What You'll Learn
- Batch Norm vs Layer Norm vs Group Norm
- Xavier and He weight initialization
- Gradient clipping by value vs by norm
- Learning rate warmup and scheduling
⚡ Why Training Fails
🎯 Real-World Analogy: Training a deep neural network is like tuning a 100-string guitar. If one string is wildly out of tune, the whole instrument sounds terrible. Normalization keeps every "string" (layer) in the right range. Proper initialization starts them close to the right pitch. Gradient clipping prevents any string from snapping.
Three problems plague deep network training: internal covariate shift (each layer's input distribution keeps changing as earlier layers update), vanishing/exploding gradients (signals die out or blow up as they propagate through many layers), and sensitivity to initialization (bad starting weights mean training never converges).
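To make the vanishing/exploding gradient problem concrete, here is a minimal sketch (not from the lesson's own demos; the layer count and width are arbitrary) that backpropagates a gradient signal through a stack of random linear layers and tracks its norm:

```python
import numpy as np

np.random.seed(0)

def backprop_norm(weight_scale, n_layers=50, n=64):
    """Push a gradient back through n_layers random linear layers."""
    g = np.random.randn(n)               # gradient arriving at the last layer
    for _ in range(n_layers):
        W = np.random.randn(n, n) * weight_scale
        g = W.T @ g                      # chain rule through one linear layer
    return np.linalg.norm(g)

# Too small a scale -> gradient vanishes; too large -> it explodes.
# Around 1/sqrt(n) the norm stays roughly stable.
for scale in (0.05, 1.0 / np.sqrt(64), 0.5):
    print(f"scale={scale:.3f} -> final gradient norm {backprop_norm(scale):.3e}")
```

Each layer multiplies the gradient's norm by roughly `weight_scale * sqrt(n)`, so after 50 layers even a modest mismatch compounds into astronomically small or large numbers.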
Try It: Normalization Techniques
Compare Batch Norm and Layer Norm — see how they tame wild activations
```python
import numpy as np

# Batch Normalization vs Layer Normalization
# See how normalization stabilizes activations during training
np.random.seed(42)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize across the BATCH dimension (one mean/var per feature)."""
    mean = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize across the FEATURE dimension (one mean/var per sample)."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

x = np.random.randn(4, 8) * 50 + 10   # wild activations
print(batch_norm(x).std(axis=0))      # ~1 per feature
print(layer_norm(x).std(axis=1))      # ~1 per sample
```

Try It: Weight Initialization
Watch activations vanish or explode with bad init — then fix it
```python
import numpy as np

# Weight Initialization: Xavier vs He vs Random
# Bad init = dead neurons or exploding gradients
np.random.seed(42)

def simulate_forward(n_layers, n_neurons, init_method):
    """Simulate a forward pass through multiple ReLU layers."""
    x = np.random.randn(1, n_neurons)
    activations = [np.std(x)]
    for _ in range(n_layers):
        if init_method == "random":
            W = np.random.randn(n_neurons, n_neurons) * 1.0
        elif init_method == "small":
            W = np.random.randn(n_neurons, n_neurons) * 0.01
        elif init_method == "xavier":
            W = np.random.randn(n_neurons, n_neurons) * np.sqrt(1.0 / n_neurons)
        elif init_method == "he":
            W = np.random.randn(n_neurons, n_neurons) * np.sqrt(2.0 / n_neurons)
        x = np.maximum(0, x @ W)  # ReLU
        activations.append(np.std(x))
    return activations

for method in ("small", "xavier", "he"):
    std_20 = simulate_forward(20, 256, method)[-1]
    print(f"{method:>6}: layer-20 activation std = {std_20:.4f}")
```

⚠️ Common Mistake: Using Batch Norm with small batch sizes (e.g., batch_size=2). The statistics become noisy and training destabilizes. Switch to Layer Norm or Group Norm when batch sizes are small.
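Group Norm, the small-batch fallback mentioned above, has no demo in this lesson. Here is a hypothetical sketch, assuming a simple (batch, channels) layout: it normalizes over groups of channels within each sample, so its statistics are independent of batch size.

```python
import numpy as np

np.random.seed(42)

def group_norm(x, n_groups, gamma=1.0, beta=0.0, eps=1e-5):
    """x has shape (batch, channels); normalize within channel groups."""
    batch, channels = x.shape
    xg = x.reshape(batch, n_groups, channels // n_groups)
    mean = xg.mean(axis=2, keepdims=True)
    var = xg.var(axis=2, keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return gamma * xg.reshape(batch, channels) + beta

x = np.random.randn(2, 8) * 10 + 5   # tiny batch, wild activations
out = group_norm(x, n_groups=2)
print(out.mean(), out.std())         # roughly 0 and 1, even at batch_size=2
```

With `n_groups=1` this reduces to Layer Norm; with one group per channel it becomes Instance Norm, so Group Norm interpolates between the two.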
Try It: Gradient Clipping
Prevent exploding gradients while preserving gradient direction
```python
import numpy as np

# Gradient Clipping: Preventing Exploding Gradients
# Essential for RNNs, Transformers, and any deep network
np.random.seed(42)

def clip_by_value(gradients, min_val, max_val):
    """Clip each gradient independently (can distort direction)."""
    return [np.clip(g, min_val, max_val) for g in gradients]

def clip_by_norm(gradients, max_norm):
    """Scale all gradients together to preserve direction."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]  # total norm = 13
clipped = clip_by_norm(grads, max_norm=5.0)            # scaled down to norm 5
print([g.round(2) for g in clipped])
```

💡 Pro Tip: Always use learning rate warmup for the first 5-10% of training steps, especially with the Adam optimizer. Start with a tiny LR and linearly increase to your target. This prevents early gradient explosions when weights are still random.
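The warmup tip can be sketched as a schedule function. This is a hedged example: the 10% warmup fraction and the cosine decay that follows are common choices, not requirements from the lesson.

```python
import numpy as np

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_frac=0.1):
    """Linear warmup to base_lr, then cosine decay toward 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear warmup: tiny LR -> base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr down to ~0 after warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + np.cos(np.pi * progress))

total = 1000
lrs = [lr_schedule(s, total) for s in range(total)]
print(f"start {lrs[0]:.2e}, peak {max(lrs):.2e}, end {lrs[-1]:.2e}")
```

In practice you would call `lr_schedule(step, total)` once per optimizer step and feed the result to your optimizer; the shape (ramp up, then smooth decay) matters more than the exact decay formula.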
📋 Quick Reference
| Technique | Fixes | Use With |
|---|---|---|
| Batch Norm | Covariate shift | CNNs, large batches |
| Layer Norm | Covariate shift | Transformers, RNNs |
| He Init | Vanishing activations | ReLU networks |
| Xavier Init | Vanishing activations | Tanh, sigmoid |
| Gradient Clipping | Exploding gradients | RNNs, Transformers |
| LR Warmup | Early instability | All architectures |
🎉 Lesson Complete!
You now have the toolkit to train deep networks stably. Next, learn how to build models that generate entirely new data!