Lesson 39 • Advanced
Model Compression
Make models smaller and faster — quantization, pruning, and knowledge distillation to deploy AI on any device.
✅ What You'll Learn
- Quantization: FP32 → INT8 → INT4 precision reduction
- Pruning: removing unimportant weights
- Knowledge distillation: training small models from large ones
- Combining techniques for maximum compression
📦 Shrinking AI Models
🎯 Real-World Analogy: Model compression is like packing for a flight. Quantization is switching from hardcover to paperback books (same content, less space). Pruning is leaving behind items you won't use (70% of neural network weights can be removed with minimal impact). Knowledge distillation is having an expert write you a concise summary instead of carrying the full encyclopaedia.
A LLaMA-70B model needs 280GB in FP32 — that's 4× A100 GPUs. With INT4 quantization it shrinks to roughly 35GB, fitting on a single 48GB GPU instead of a multi-GPU cluster. Compression makes the difference between "research prototype" and "production deployment."
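Those numbers are just parameter count times bytes per weight. A quick sketch of the arithmetic:

```python
# Memory needed to store model weights at different precisions
PARAMS = 70e9  # LLaMA-70B parameter count

bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision}: {gb:.0f} GB")
# FP32 → 280 GB, INT4 → 35 GB
```

This counts weights only; activations, KV cache, and framework overhead add to the real footprint at inference time.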
Try It: Quantization
See how reducing precision compresses models with minimal quality loss
import numpy as np

# Quantization: Reduce Precision to Shrink Models
# FP32 → FP16 → INT8 → INT4
np.random.seed(42)

def quantize(weights, bits):
    """Simulate uniform quantization to N bits"""
    n_levels = 2 ** bits
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (n_levels - 1)
    quantized = np.round((weights - w_min) / scale).astype(int)
    dequantized = quantized * scale + w_min
    error = np.mean(np.abs(weights - dequantized))
    max_error = np.max(np.abs(weights - dequantized))
    return dequantized, error, max_error

# Compare bit widths on simulated layer weights
weights = np.random.randn(10_000) * 0.5
for bits in [16, 8, 4]:
    _, error, max_error = quantize(weights, bits)
    print(f"{bits}-bit: mean error {error:.5f}, max error {max_error:.5f}")
Try It: Pruning & Distillation
Remove unimportant weights and distill knowledge from large models
import numpy as np

# Pruning & Knowledge Distillation
# Remove unnecessary weights or train smaller models to mimic larger ones
np.random.seed(42)

def magnitude_pruning(weights, sparsity):
    """Remove weights with smallest absolute values"""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) > threshold
    pruned = weights * mask
    return pruned, mask

# Simulate a layer's weights
n = 1000
weights = np.random.randn(n) * 0.5

print("=== Weight Pruning ===")
for sparsity in [0.5, 0.7, 0.9]:
    pruned, mask = magnitude_pruning(weights, sparsity)
    print(f"{sparsity:.0%} sparsity: {mask.sum()}/{n} weights kept")
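The distillation half of this exercise can be sketched as a loss that pushes a student's output distribution toward a temperature-softened teacher distribution. The function names and the temperature value `T=2.0` below are illustrative choices, not part of the lesson's original code:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = logits / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return T**2 * np.sum(p * np.log(p / q))

teacher = np.array([4.0, 1.0, 0.5])
student = np.array([3.0, 1.5, 0.2])
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```

The T² factor keeps gradient magnitudes comparable across temperatures; in practice this KL term is mixed with the ordinary cross-entropy loss on hard labels.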
⚠️ Common Mistake: Quantizing to INT4 without checking accuracy on your specific task. Some tasks (maths reasoning, code generation) are more sensitive to quantization than others (text classification, summarization). Always benchmark before and after on your actual use case.
💡 Pro Tip: For LLMs, use GPTQ or AWQ for 4-bit quantization with minimal quality loss. For CNNs, TensorRT's INT8 calibration is the gold standard. Combine pruning + quantization + distillation for up to 50× compression on edge devices.
📋 Quick Reference
| Technique | Compression | Quality Loss | Ease |
|---|---|---|---|
| FP16 | 2× | ~0% | Trivial |
| INT8 (PTQ) | 4× | ~1% | Easy |
| INT4 (GPTQ) | 8× | ~3% | Moderate |
| Pruning (50%) | 2× | ~1% | Moderate |
| Distillation | 2-10× | ~3-5% | Complex |
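Combining rows of the table multiplies their ratios: prune first, then quantize what survives. A minimal sketch, where the compression figure is the nominal storage ratio (it ignores the bookkeeping overhead of storing a sparse mask):

```python
import numpy as np
np.random.seed(0)

def prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    thr = np.percentile(np.abs(w), sparsity * 100)
    return w * (np.abs(w) > thr)

def fake_quantize(w, bits):
    """Uniformly quantize then dequantize, to measure precision loss."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((w - lo) / scale) * scale + lo

w = np.random.randn(10_000) * 0.5
compressed = fake_quantize(prune(w, 0.5), 4)

err = np.mean(np.abs(w - compressed))
ratio = (1 / 0.5) * (32 / 4)  # 2× from 50% pruning × 8× from FP32→INT4
print(f"mean |error|: {err:.4f}, nominal compression: {ratio:.0f}×")
```

Order matters: quantizing after pruning means the quantization grid only has to cover the surviving weights, and in real pipelines a short fine-tuning pass between steps recovers most of the lost accuracy.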
🎉 Lesson Complete!
You can now compress models for efficient deployment! Next, learn how to optimise inference for specific hardware targets.