Lesson 41 • Advanced
Distributed Training
Train large models across multiple GPUs — data parallelism, model parallelism, pipeline parallelism, and FSDP for trillion-parameter models.
✅ What You'll Learn
- DDP: same model on each GPU, different data batches
- Tensor and pipeline parallelism for huge models
- FSDP and DeepSpeed ZeRO for memory efficiency
- How to calculate GPU requirements for any model
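Before choosing a parallelism strategy, it helps to estimate how much memory training actually needs. A minimal sketch using the common rule of thumb of ~16 bytes per parameter for mixed-precision Adam training (the function names and the 80 GB GPU size are illustrative assumptions; activations add more on top of this floor):

```python
import math

def training_memory_gb(n_params, bytes_per_param=16):
    """Rough model-state memory for mixed-precision Adam training.

    16 bytes/param = fp16 weights (2) + fp16 grads (2)
    + fp32 master weights (4) + Adam m (4) + Adam v (4).
    Activations and buffers are extra, so treat this as a floor.
    """
    return n_params * bytes_per_param / 1e9

def min_gpus(n_params, gpu_mem_gb=80):
    """Minimum GPU count if model state is sharded evenly (ZeRO-3 style)."""
    return math.ceil(training_memory_gb(n_params) / gpu_mem_gb)

for params in [1e9, 7e9, 70e9]:
    print(f"{params/1e9:>4.0f}B params → {training_memory_gb(params):>6.0f} GB "
          f"→ ≥{min_gpus(params)} × 80GB GPUs")
```

This explains why even a 7B model cannot be trained on a single 80 GB GPU without sharding: the optimizer state alone dominates the weights.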
🖥️ Scaling Beyond One GPU
🎯 Real-World Analogy: Data parallelism is like having multiple chefs cook the same recipe with different ingredients — they all follow the same technique but work faster together. Model parallelism is like an assembly line where each worker handles one step (layers 1-8 on GPU 0, layers 9-16 on GPU 1). FSDP combines both: each worker knows the full recipe but only carries the tools they need at each moment.
Training GPT-4 required thousands of GPUs running for months. Understanding distributed training is essential for anyone working with models larger than ~1B parameters, and it's increasingly relevant as even fine-tuning requires multi-GPU setups.
Try It: Data Parallel Training
See how splitting data across GPUs achieves near-linear speedup
import numpy as np
# Data Parallel Training: Split data across GPUs
# Each GPU processes a different batch, then sync gradients
np.random.seed(42)
def simulate_ddp(n_gpus, batch_size, model_params, epochs):
    """Simulate Distributed Data Parallel training throughput"""
    effective_batch = batch_size * n_gpus
    samples = 50000
    steps_per_epoch = samples // effective_batch
    # Per-step cost = compute + gradient all-reduce (grows slowly with GPU count)
    compute = 1.0  # arbitrary time units per step
    comm = 0.05 * np.log2(n_gpus) if n_gpus > 1 else 0.0
    total_time = epochs * steps_per_epoch * (compute + comm)
    print(f"  GPUs: {n_gpus}")
    print(f"  Per-GPU batch: {batch_size}")
    print(f"  Effective batch: {effective_batch}")
    print(f"  Steps/epoch: {steps_per_epoch}, total time: {total_time:.0f} units")
    return total_time

baseline = simulate_ddp(1, 64, model_params=125e6, epochs=3)
for n in [2, 4, 8]:
    t = simulate_ddp(n, 64, model_params=125e6, epochs=3)
    print(f"  → Speedup vs 1 GPU: {baseline / t:.2f}x")
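The "sync gradients" step above is an all-reduce: every GPU's local gradients are summed and divided by the world size, so all workers end up with the identical averaged gradient. A minimal numpy sketch of that operation (the simulated gradients are illustrative):

```python
import numpy as np

np.random.seed(0)
n_gpus, n_params = 4, 6

# Each "GPU" computes gradients on its own mini-batch (simulated here)
local_grads = [np.random.randn(n_params) for _ in range(n_gpus)]

# All-reduce: sum across workers, divide by world size.
# After this, every GPU applies the same update, keeping replicas in sync.
avg_grad = np.mean(local_grads, axis=0)

print("Averaged gradient shared by all GPUs:")
print(avg_grad.round(3))
```

In real DDP this averaging happens via NCCL collectives overlapped with the backward pass, but the math is exactly this mean.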
Try It: Model Parallelism
Compare DDP, tensor parallel, pipeline parallel, and FSDP
import numpy as np
# Model Parallelism: When the Model Doesn't Fit on One GPU
# Split model layers across multiple GPUs
np.random.seed(42)
print("=== Model Parallelism Strategies ===")
print()
print("1. DATA PARALLEL (DDP)")
print("   Each GPU: full model + different data")
print("   GPU 0: [Full Model] ← Batch 1")
print("   GPU 1: [Full Model] ← Batch 2")
print("   GPU 2: [Full Model] ← Batch 3")
print("   → Sync gradients after each step")
print()
print("2. TENSOR PARALLEL")
print("   Each GPU: a slice of every weight matrix")
print("   GPU 0: [Left half of each layer]")
print("   GPU 1: [Right half of each layer]")
print("   → Sync activations within each layer")
print()
print("3. PIPELINE PARALLEL")
print("   Each GPU: a contiguous block of layers")
print("   GPU 0: [Layers 1-8] → GPU 1: [Layers 9-16]")
print("   → Micro-batches keep the pipeline busy")
print()
print("4. FSDP / ZeRO-3")
print("   Each GPU: a shard of params, grads, optimizer state")
print("   → Gather full params per layer only when needed")
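These strategies trade communication for memory. A rough per-GPU comparison using the same 16-bytes-per-parameter Adam mixed-precision rule of thumb (the function and the ZeRO stage breakdown are a simplified sketch; activations are excluded):

```python
def per_gpu_memory_gb(n_params, n_gpus, strategy):
    """Approximate per-GPU model-state memory by sharding strategy.

    Mixed-precision Adam: fp16 weights (2) + fp16 grads (2)
    + fp32 master weights and Adam m, v (12) = 16 bytes/param.
    """
    weights, grads, optim = 2, 2, 12  # bytes per parameter
    if strategy == "ddp":        # everything replicated on every GPU
        b = weights + grads + optim
    elif strategy == "zero1":    # optimizer state sharded
        b = weights + grads + optim / n_gpus
    elif strategy == "zero2":    # + gradients sharded
        b = weights + (grads + optim) / n_gpus
    elif strategy == "zero3":    # + weights sharded (FSDP full shard)
        b = (weights + grads + optim) / n_gpus
    return n_params * b / 1e9

n_params, n_gpus = 7e9, 8  # a 7B model on 8 GPUs (illustrative)
for s in ["ddp", "zero1", "zero2", "zero3"]:
    print(f"{s:>6}: {per_gpu_memory_gb(n_params, n_gpus, s):6.1f} GB/GPU")
```

The jump from DDP to ZeRO-3 is what turns a 7B model from "does not fit on an 80 GB GPU" into "fits comfortably with room for activations."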
⚠️ Common Mistake: Not adjusting the learning rate when scaling to multiple GPUs. With DDP, you're effectively using a larger batch size. The linear scaling rule says: if you double the batch size, double the learning rate. Always add LR warmup for the first 5-10% of training to prevent divergence.
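The linear scaling rule plus warmup can be sketched as a simple schedule (the base LR, batch sizes, and warmup fraction are illustrative values, not recommendations):

```python
def scaled_lr(base_lr, base_batch, effective_batch):
    """Linear scaling rule: LR grows in proportion to effective batch size."""
    return base_lr * effective_batch / base_batch

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.05):
    """Linear warmup over the first ~5% of steps, then hold at peak."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# 8 GPUs × per-GPU batch 32 → effective batch 256 → 8x the base LR
peak = scaled_lr(base_lr=1e-4, base_batch=32, effective_batch=32 * 8)
print(f"Scaled peak LR: {peak:.0e}")
for step in [0, 250, 500, 5000]:
    print(f"step {step:>5}: lr = {lr_at_step(step, 10000, peak):.2e}")
```

Warmup matters precisely because the scaled peak LR would destabilize the randomly initialized model if applied from step zero.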
💡 Pro Tip: Start with PyTorch DDP (simplest, handles most cases). Move to FSDP when the model doesn't fit on one GPU. Use DeepSpeed ZeRO-3 for maximum memory efficiency. For LLM pretraining at scale, Megatron-LM combines all three parallelism types (3D parallelism).
📋 Quick Reference
| Strategy | Splits | When | Framework |
|---|---|---|---|
| DDP | Data | Model fits 1 GPU | PyTorch DDP |
| FSDP | Params + data | Model barely fits | PyTorch FSDP |
| ZeRO-3 | Everything | Max memory saving | DeepSpeed |
| Pipeline | Layers | Very deep models | Megatron-LM |
| Tensor | Weights | Wide layers | Megatron-LM |
🎉 Lesson Complete!
You now understand how to train models across multiple GPUs! Next, learn how to serve ML models in production reliably.