Lesson 39 • Advanced

    Model Compression

    Make models smaller and faster — quantization, pruning, and knowledge distillation to deploy AI on any device.

    ✅ What You'll Learn

    • Quantization: FP32 → INT8 → INT4 precision reduction
    • Pruning: removing unimportant weights
    • Knowledge distillation: training small models from large ones
    • Combining techniques for maximum compression

    📦 Shrinking AI Models

    🎯 Real-World Analogy: Model compression is like packing for a flight. Quantization is switching from hardcover to paperback books (same content, less space). Pruning is leaving behind items you won't use (70% of neural network weights can be removed with minimal impact). Knowledge distillation is having an expert write you a concise summary instead of carrying the full encyclopaedia.

    A LLaMA-70B model needs 280GB in FP32 — that's 4× A100 GPUs. With INT4 quantization it drops to roughly 35GB, fitting on a single 48GB GPU (or a pair of 24GB cards) instead of a multi-GPU cluster. Compression makes the difference between "research prototype" and "production deployment."
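    The arithmetic behind those numbers is just parameter count × bytes per weight. A quick sketch (the 70B parameter count comes from the text; the rest is basic unit conversion):

```python
# Memory needed to hold a 70B-parameter model at each precision
params = 70e9
bytes_per_weight = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_weight.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```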

    Try It: Quantization

    See how reducing precision compresses models with minimal quality loss

    Python
    import numpy as np

    # Quantization: Reduce Precision to Shrink Models
    # FP32 → FP16 → INT8 → INT4

    np.random.seed(42)

    def quantize(weights, bits):
        """Simulate uniform quantization to N bits"""
        n_levels = 2 ** bits
        w_min, w_max = weights.min(), weights.max()
        scale = (w_max - w_min) / (n_levels - 1)

        quantized = np.round((weights - w_min) / scale).astype(int)
        dequantized = quantized * scale + w_min

        error = np.mean(np.abs(weights - dequantized))
        max_error = np.max(np.abs(weights - dequantized))
        return dequantized, error, max_error

    # Compare precisions on a simulated weight tensor
    weights = np.random.randn(10_000) * 0.5

    print(f"{'Bits':>4} {'Size vs FP32':>12} {'Mean err':>10} {'Max err':>10}")
    for bits in (16, 8, 4, 2):
        _, err, max_err = quantize(weights, bits)
        print(f"{bits:>4} {bits / 32:>11.0%} {err:>10.6f} {max_err:>10.6f}")
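    Quantized codes must also be stored compactly to realise the savings. A minimal sketch of packing two 4-bit codes into each byte (illustrative only — real INT4 kernels such as GPTQ's use more elaborate layouts):

```python
import numpy as np

# Pack pairs of 4-bit codes (values 0-15) into single bytes
codes = np.array([3, 12, 7, 0, 15, 9], dtype=np.uint8)

packed = (codes[0::2] << 4) | codes[1::2]   # high nibble | low nibble
print(packed.nbytes, "bytes for", codes.size, "weights")  # 3 bytes for 6 weights

# Unpack to verify the round trip is lossless
unpacked = np.empty_like(codes)
unpacked[0::2] = packed >> 4
unpacked[1::2] = packed & 0x0F
assert np.array_equal(unpacked, codes)
```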

    Try It: Pruning & Distillation

    Remove unimportant weights and distill knowledge from large models

    Python
    import numpy as np

    # Pruning & Knowledge Distillation
    # Remove unnecessary weights or train smaller models to mimic larger ones

    np.random.seed(42)

    def magnitude_pruning(weights, sparsity):
        """Remove weights with smallest absolute values"""
        threshold = np.percentile(np.abs(weights), sparsity * 100)
        mask = np.abs(weights) > threshold
        pruned = weights * mask
        return pruned, mask

    # Simulate a layer's weights
    n = 1000
    weights = np.random.randn(n) * 0.5

    print("=== Weight Pruning ===")
    for sparsity in (0.5, 0.7, 0.9):
        pruned, mask = magnitude_pruning(weights, sparsity)
        print(f"Sparsity {sparsity:.0%}: kept {mask.sum()}/{n} weights, "
              f"norm retained {np.linalg.norm(pruned) / np.linalg.norm(weights):.1%}")
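    Distillation is named in the snippet's comment but not demonstrated. A minimal numpy sketch of the standard soft-target loss — temperature-scaled KL divergence between teacher and student logits — where the temperature and logit values are purely illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.exp((logits - logits.max()) / T)   # subtract max for stability
    return z / z.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)   # soft teacher targets
    q = softmax(student_logits, T)   # student predictions
    return float(np.sum(p * np.log(p / q))) * T ** 2

teacher = np.array([4.0, 1.0, 0.5])
student = np.array([3.0, 1.5, 0.2])
print(f"Distillation loss: {distillation_loss(teacher, student):.4f}")
```

    Training the student minimises this loss (usually mixed with the ordinary cross-entropy on hard labels); the softened targets carry the teacher's "dark knowledge" about relative class similarities.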

    ⚠️ Common Mistake: Quantizing to INT4 without checking accuracy on your specific task. Some tasks (maths reasoning, code generation) are more sensitive to quantization than others (text classification, summarization). Always benchmark before and after on your actual use case.
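    A before/after check can be as simple as measuring how often the quantized model's predictions agree with the original's on held-out inputs. A toy sketch with a random linear classifier (data and model are purely illustrative):

```python
import numpy as np

np.random.seed(0)

def quantize_int4(w):
    """Uniform 4-bit quantize/dequantize (16 levels)"""
    scale = (w.max() - w.min()) / 15
    return np.round((w - w.min()) / scale) * scale + w.min()

# Toy linear classifier: predictions = argmax(X @ W)
X = np.random.randn(500, 64)
W = np.random.randn(64, 10) * 0.1

preds_fp32 = (X @ W).argmax(axis=1)
preds_int4 = (X @ quantize_int4(W)).argmax(axis=1)

agreement = (preds_fp32 == preds_int4).mean()
print(f"Prediction agreement after INT4: {agreement:.1%}")
```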

    💡 Pro Tip: For LLMs, use GPTQ or AWQ for 4-bit quantization with minimal quality loss. For CNNs, TensorRT's INT8 calibration is the gold standard. Combine pruning + quantization + distillation for up to 50× compression on edge devices.

    📋 Quick Reference

    Technique        Compression   Quality Loss   Ease
    FP16             2×            ~0%            Trivial
    INT8 (PTQ)       4×            ~1%            Easy
    INT4 (GPTQ)      8×            ~3%            Moderate
    Pruning (50%)    2×            ~1%            Moderate
    Distillation     2-10×         ~3-5%          Complex

    🎉 Lesson Complete!

    You can now compress models for efficient deployment! Next, learn how to optimise inference for specific hardware targets.
