Lesson 22 • Advanced

    Training Stability Techniques

    Master normalization, weight initialization, and gradient clipping — the essential techniques that make deep learning actually work.

    ✅ What You'll Learn

    • Batch Norm vs Layer Norm vs Group Norm
    • Xavier and He weight initialization
    • Gradient clipping by value vs by norm
    • Learning rate warmup and scheduling

    ⚡ Why Training Fails

    🎯 Real-World Analogy: Training a deep neural network is like tuning a 100-string guitar. If one string is wildly out of tune, the whole instrument sounds terrible. Normalization keeps every "string" (layer) in the right range. Proper initialization starts them close to the right pitch. Gradient clipping prevents any string from snapping.

    Three problems plague deep network training: internal covariate shift (each layer's input distribution keeps changing as earlier layers update), vanishing/exploding gradients (signals die out or blow up as they propagate through many layers), and sensitivity to initialization (bad starting weights mean training never converges).
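    To see why vanishing and exploding gradients compound so quickly, here is a minimal sketch (the per-layer factors 0.9 and 1.1 are illustrative assumptions, not from the lesson): backpropagation multiplies the gradient by roughly one factor per layer, so a factor even slightly below or above 1 shrinks or grows exponentially with depth.

    ```python
    # Backprop multiplies the gradient by roughly one factor per layer.
    # A factor < 1 compounds toward zero; a factor > 1 compounds toward infinity.
    grad = 1.0
    for factor in (0.9, 1.1):  # illustrative per-layer scaling factors
        g = grad
        for _ in range(100):   # a 100-layer network
            g *= factor
        print(f"factor {factor}: gradient after 100 layers = {g:.2e}")
    ```

    After 100 layers, 0.9 per layer leaves almost nothing of the gradient, while 1.1 per layer inflates it by four orders of magnitude — which is exactly why the techniques below exist.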

    Try It: Normalization Techniques

    Compare Batch Norm and Layer Norm — see how they tame wild activations

    import numpy as np
    
    # Batch Normalization vs Layer Normalization
    # See how normalization stabilizes activations during training
    
    np.random.seed(42)
    
    def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        """Normalize across the BATCH dimension"""
        mean = np.mean(x, axis=0)
        var = np.var(x, axis=0)
        x_norm = (x - mean) / np.sqrt(var + eps)
        return gamma * x_norm + beta
    
    def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        """Normalize across the FEATURE dimension (per sample)"""
        mean = np.mean(x, axis=1, keepdims=True)
        var = np.var(x, axis=1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + eps)
        return gamma * x_norm + beta
    
    # Wild activations: large scale, shifted mean
    x = np.random.randn(8, 4) * 50 + 10
    
    print("Std before:      ", x.std().round(2))
    print("After Batch Norm:", batch_norm(x).std().round(2))
    print("After Layer Norm:", layer_norm(x).std().round(2))

    Try It: Weight Initialization

    Watch activations vanish or explode with bad init — then fix it

    import numpy as np
    
    # Weight Initialization: Xavier vs He vs Random
    # Bad init = dead neurons or exploding gradients
    
    np.random.seed(42)
    
    def simulate_forward(n_layers, n_neurons, init_method):
        """Simulate forward pass through multiple layers"""
        x = np.random.randn(1, n_neurons)
        activations = [np.std(x)]
        
        for _ in range(n_layers):
            if init_method == "random":
                W = np.random.randn(n_neurons, n_neurons) * 1.0
            elif init_method == "small":
                W = np.random.randn(n_neurons, n_neurons) * 0.01
            elif init_method == "xavier":
                W = np.random.randn(n_neurons, n_neurons) * np.sqrt(1.0 / n_neurons)
            elif init_method == "he":
                W = np.random.randn(n_neurons, n_neurons) * np.sqrt(2.0 / n_neurons)
            x = np.maximum(0, x @ W)  # ReLU
            activations.append(np.std(x))
        return activations
    
    for method in ["random", "small", "xavier", "he"]:
        stds = simulate_forward(n_layers=10, n_neurons=100, init_method=method)
        print(f"{method:>7}: layer-10 activation std = {stds[-1]:.4f}")

    ⚠️ Common Mistake: Using Batch Norm with small batch sizes (e.g., batch_size=2). The batch statistics become noisy and training destabilizes. Switch to Layer Norm or Group Norm when batch sizes are small.
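    As a sketch of the Group Norm alternative mentioned above (the group count and tensor shapes here are illustrative assumptions): each sample's features are split into groups and normalized within each group, so the statistics never depend on the batch size.

    ```python
    import numpy as np

    np.random.seed(42)

    def group_norm(x, n_groups, eps=1e-5):
        """Normalize within feature groups of each sample (batch-size independent)."""
        n, c = x.shape                          # assumes c is divisible by n_groups
        g = x.reshape(n, n_groups, c // n_groups)
        mean = g.mean(axis=2, keepdims=True)    # per-sample, per-group statistics
        var = g.var(axis=2, keepdims=True)
        return ((g - mean) / np.sqrt(var + eps)).reshape(n, c)

    # Works identically whether the batch holds 2 samples or 200
    x = np.random.randn(2, 8) * 30 + 5
    y = group_norm(x, n_groups=2)
    print("per-sample std after Group Norm:", y.std(axis=1).round(2))
    ```

    Because the mean and variance are computed per sample, a batch of 2 gives exactly the same result for each sample as a batch of 200 — the failure mode in the warning above simply cannot occur.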

    Try It: Gradient Clipping

    Prevent exploding gradients while preserving gradient direction

    import numpy as np
    
    # Gradient Clipping: Preventing Exploding Gradients
    # Essential for RNNs, Transformers, and any deep network
    
    np.random.seed(42)
    
    def clip_by_value(gradients, min_val, max_val):
        """Clip each gradient independently"""
        return np.clip(gradients, min_val, max_val)
    
    def clip_by_norm(gradients, max_norm):
        """Scale all gradients together to preserve direction"""
        total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            gradients = [g * scale for g in gradients]
        return gradients
    
    # One layer's gradient has exploded
    grads = [np.array([0.1, -0.2]), np.array([50.0, -80.0])]
    
    clipped = clip_by_norm(grads, max_norm=1.0)
    new_norm = np.sqrt(sum(np.sum(g**2) for g in clipped))
    print("Norm after clipping:", new_norm.round(2))  # capped at max_norm

    💡 Pro Tip: Always use learning rate warmup for the first 5-10% of training steps, especially with Adam optimizer. Start with a tiny LR and linearly increase to your target. This prevents early gradient explosions when weights are still random.
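    The warmup described above can be sketched as a simple schedule function (the step count and target learning rate here are illustrative assumptions, not prescribed values):

    ```python
    def warmup_lr(step, target_lr=1e-3, warmup_steps=500):
        """Linearly ramp the LR from ~0 to target_lr over the first warmup_steps."""
        if step < warmup_steps:
            return target_lr * (step + 1) / warmup_steps
        return target_lr  # hand off to your main schedule here (e.g., cosine decay)

    for step in (0, 250, 499, 5000):
        print(f"step {step:>4}: lr = {warmup_lr(step):.6f}")
    ```

    The early steps see a tiny learning rate, so the noisy gradients from randomly initialized weights cannot cause large destructive updates; after warmup the schedule simply returns the target rate.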

    📋 Quick Reference

    Technique           Fixes                   Use With
    -----------------   ---------------------   --------------------
    Batch Norm          Covariate shift         CNNs, large batches
    Layer Norm          Covariate shift         Transformers, RNNs
    He Init             Vanishing activations   ReLU networks
    Xavier Init         Vanishing activations   Tanh, sigmoid
    Gradient Clipping   Exploding gradients     RNNs, Transformers
    LR Warmup           Early instability       All architectures

    🎉 Lesson Complete!

    You now have the toolkit to train deep networks stably. Next, learn how to build models that generate entirely new data!
