Lesson 41 • Advanced
Distributed Training
Learn how to scale model training across many GPUs — by the end you'll be able to explain data vs model parallelism, hand-compute the gradient all-reduce, simulate gradient accumulation, and pick the right framework (PyTorch DDP, FSDP, DeepSpeed) for a given model size.
What You'll Learn in This Lesson
- ✓How data parallelism (DDP) averages gradients with all-reduce
- ✓When to switch to model, pipeline, or tensor parallelism
- ✓The difference between synchronous and asynchronous training
- ✓How gradient accumulation fakes a big batch on a small GPU
- ✓Mixed precision (FP16/BF16) and why large batches need LR scaling
- ✓Which framework to reach for: PyTorch DDP, FSDP, DeepSpeed, Horovod
🍳 Real-World Analogy: Many Cooks Splitting a Huge Meal Prep
Imagine a giant banquet that's too much for one cook. You hire a team and split the work — but there's more than one way to split it.
- Data parallelism (DDP): every cook knows the whole recipe and gets a different pile of vegetables. They all chop at once. Then they compare notes and agree on one set of seasonings before the next course — that "compare and average" is the all-reduce of gradients.
- Model / pipeline parallelism: the recipe is too big to memorise, so it becomes an assembly line — cook 1 preps, cook 2 sautés, cook 3 plates. Each handles only their station (different layers on different GPUs).
- Gradient accumulation: a single cook with a tiny pan cooks four small batches and pools them before serving one big dish — a big "effective batch" without a bigger pan.
Training GPT-4-scale models took thousands of GPUs running for months. The same ideas scale down: even fine-tuning a 7B model usually needs more than one GPU.
1Data Parallelism — Same Model, Different Data
Data parallelism is the most common way to scale. You put a full copy of the model on every GPU (each copy is a worker), hand each worker a different slice of the batch, and let them all compute gradients at the same time.
The catch: if every worker just applied its own gradient, the copies would drift apart. So after the backward pass, the workers run an all-reduce — they sum every gradient across all workers and divide by the number of workers. Each GPU then applies that same averaged gradient, so all copies stay identical. The example below does the all-reduce by hand so you can see the math.
Try It: Average ('all-reduce') Gradients From N Workers
Run the gradient averaging that keeps every model copy in sync
# === Data Parallelism: averaging gradients across N workers ===
# In Data Parallel training, every worker (GPU) holds a FULL copy of the
# model. Each worker sees a DIFFERENT slice of the batch, computes its own
# gradients, and then all workers AVERAGE their gradients so every copy
# applies the SAME update and stays identical. That averaging step is the
# "all-reduce".
# Pretend each worker computed a gradient for the same 3 parameters.
# (Real gradients come from backprop; here we hand-pick
...2Synchronous vs Asynchronous
There are two ways workers can coordinate. In synchronous training, every worker finishes its step and waits at a barrier for the all-reduce before anyone moves on — all copies stay identical, but the slowest worker (a "straggler") holds everyone up.
In asynchronous training, workers push gradients to a central parameter server and grab the latest weights without waiting. It avoids stragglers but workers can train on slightly stale weights, which can hurt accuracy. Modern deep learning almost always uses synchronous all-reduce — it's simpler and more accurate, and fast GPU interconnects (like NVLink) make the waiting cheap.
Synchronous (default)
- All workers average gradients every step
- Copies stay identical — reproducible
- Slowest worker sets the pace
Asynchronous
- No waiting — no straggler problem
- Workers can use stale weights
- Harder to tune, can lose accuracy
3Gradient Accumulation — A Big Batch on a Small GPU
Big batches train more smoothly, but they don't always fit in GPU memory. Gradient accumulation is the workaround: run several small micro-batches, add up their gradients, and only call the optimizer once at the end. Four micro-batches of 8 give the same effective batch of 32 as one big batch — you trade extra time for less memory.
Try It: Simulate Gradient Accumulation Over Micro-Batches
Add up micro-batch gradients to fake a larger effective batch
# === Gradient accumulation: simulate a big batch on a small GPU ===
# Goal: train as if the batch were 32, but your GPU can only hold 8 at a time.
# Trick: run 4 "micro-batches" of 8, ADD UP their gradients, and only update
# the weights once. 4 micro-batches x 8 = an EFFECTIVE batch of 32.
micro_batch_size = 8
accum_steps = 4 # how many micro-batches to add before updating
effective_batch = micro_batch_size * accum_steps
print(f"Micro-batch: {micro_batch_size} Accum ste
...Notice you divide the accumulated gradient by the number of micro-batches so the update has the same scale as a real batch-of-32 step. The same idea stacks with data parallelism: effective batch = micro_batch × accum_steps × num_gpus.
4Model, Pipeline, and Tensor Parallelism
Data parallelism assumes the whole model fits on one GPU. When it doesn't — say a 70B-parameter model that needs hundreds of gigabytes — you have to split the model itself.
Pipeline parallelism (split by layer)
Different layers live on different GPUs — GPU 0 holds layers 1–8, GPU 1 holds layers 9–16, and the activations flow forward through them like an assembly line. To keep GPUs busy, the batch is sliced into micro-batches that are fed in a staggered pipeline.
Tensor parallelism (split inside a layer)
A single huge layer is split across GPUs — each GPU holds part of the weight matrix and they sync activations within the layer. Used for very wide layers (the attention and MLP blocks in large transformers).
FSDP / ZeRO (shard everything, still data parallel)
FSDP (Fully Sharded Data Parallel) and DeepSpeed ZeRO keep the data-parallel idea but shard the model weights, gradients, and optimizer states so each GPU stores only 1/N of everything. Params are gathered just in time for each layer's forward pass, then released — letting you train models far larger than one GPU's memory.
A rough memory rule for training in FP16: you need about 4× the model size (weights + gradients + optimizer states + activations). A 7B model is ~14 GB in FP16, so ~56 GB to train — close to filling one 80 GB A100 before you've added the batch.
5Mixed Precision and Learning-Rate Scaling
Mixed precision stores most numbers in 16-bit instead of 32-bit. That halves memory and roughly doubles throughput on modern GPUs. Two 16-bit formats matter:
- FP16 — half precision, fast, but a narrow range, so it needs loss scaling to stop tiny gradients underflowing to zero.
- BF16 (bfloat16) — same wide range as FP32 with less mantissa, so it trains stably without loss scaling. It's the default on newer hardware (A100/H100/TPU).
- FP32 — full precision, kept for the master copy of weights and the optimizer step for stability.
Scaling out (more GPUs, more accumulation) grows the effective batch. Larger batches produce smoother, lower-variance gradients, so you can — and must — take bigger steps. The linear scaling rule: when you multiply the batch by k, multiply the learning rate by k too, and add a few hundred steps of LR warmup so the bigger steps don't blow up early in training.
6The Real Thing: PyTorch DDP
In practice you rarely write the all-reduce yourself — the framework does it. The most common entry point is PyTorch DistributedDataParallel (DDP): wrap your model in DDP(...) and it averages gradients automatically during loss.backward(). You launch one process per GPU with torchrun. Read this through — it ties together every idea from the lesson.
# === The real thing: PyTorch DistributedDataParallel (DDP) ===
# This is what production code looks like. You launch one process per GPU
# (e.g. "torchrun --nproc_per_node=4 train.py"). DDP does the all-reduce
# of gradients for you, automatically, during backward().
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
def main():
dist.init_process_group(backend="nccl") # set up GPU comms
rank = dist.get_rank() # which process am I? (0..N-1)
world_size = dist.get_world_size() # how many processes total?
torch.cuda.set_device(rank)
model = nn.Linear(1024, 10).cuda(rank)
model = DDP(model, device_ids=[rank]) # wrap -> auto gradient all-reduce
# DistributedSampler gives each process a DIFFERENT, non-overlapping shard.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
# Linear scaling rule: effective batch = 32 * world_size, so scale the LR.
base_lr = 0.1
opt = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)
for epoch in range(epochs):
sampler.set_epoch(epoch) # reshuffle shards each epoch
for x, y in loader:
x, y = x.cuda(rank), y.cuda(rank)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward() # <-- DDP all-reduces grads HERE
opt.step() # every GPU applies the SAME update
dist.destroy_process_group()
# Expected output (4 GPUs, conceptually):
# rank 0..3 each train on 1/4 of the data
# gradients are averaged across all 4 every step (all-reduce)
# all 4 model copies stay byte-for-byte identicalOther frameworks fill different gaps: FSDP (built into PyTorch) and DeepSpeed ZeRO shard the model when it no longer fits; Horovod is an older Uber library that adds ring all-reduce to TensorFlow/PyTorch and is still seen in multi-node clusters.
🎯 Your Turn #1: Finish the All-Reduce
Two workers each computed a gradient for the same 2 parameters. Fill in the blanks so you sum across workers and divide by the worker count — the all-reduce.
Your Turn: All-Reduce Two Workers
Fill in the ___ blanks to average the gradients
# 🎯 YOUR TURN — finish the all-reduce (average the gradients)
# Two workers each computed a gradient for the same 2 parameters.
# Average them element-by-element so both model copies get the same update.
worker_grads = [
[4.0, 10.0], # worker 0
[6.0, 2.0], # worker 1
]
n_workers = len(worker_grads)
n_params = len(worker_grads[0])
averaged = []
for p in range(n_params):
total = 0.0
for w in range(n_workers):
total += ___ # 👉 add this worker's value for pa
...🎯 Your Turn #2: Accumulate and Scale the LR
Accumulate four micro-batches into an effective batch of 16, then apply the linear scaling rule to the learning rate. Fill in the blanks.
Your Turn: Accumulation + LR Scaling
Fill in the ___ blanks for the effective batch and scaled LR
# 🎯 YOUR TURN — accumulate micro-batches, then scale the learning rate
# Your GPU holds 4 samples, but you want an effective batch of 16.
micro_batch_size = 4
accum_steps = ___ # 👉 how many micro-batches make an effective batch of 16?
micro_grads = [2.0, 4.0, 6.0, 8.0] # one gradient per micro-batch
accumulated = 0.0
for g in micro_grads:
accumulated += ___ # 👉 add each micro-batch's gradient
avg_grad = accumulated / accum_steps
# Linear scaling rule: bigger effecti
...Common Errors (And How to Fix Them)
Distributed bugs are sneaky — the code runs but the model trains wrong. Here are the four that bite everyone.
❌ Not scaling the learning rate with the batch
8 GPUs means an 8× larger effective batch, but the LR is still tuned for one GPU:
opt = SGD(model.parameters(), lr=0.1) # ❌ same LR on 8 GPUs -> trains far too slowly
✅ Fix: apply the linear scaling rule (and warmup):
opt = SGD(model.parameters(), lr=0.1 * world_size) # ✅ scale LR by #GPUs
❌ Gradient sync bugs (copies drift apart)
Stepping the optimizer on the raw model skips DDP's gradient all-reduce, so each GPU diverges:
model = nn.Linear(1024, 10).cuda() ddp_model = DDP(model, device_ids=[rank]) loss = loss_fn(model(x), y) # ❌ used the UN-wrapped model -> no all-reduce
✅ Fix: always run forward/backward through the DDP-wrapped model:
loss = loss_fn(ddp_model(x), y) # ✅ backward() now all-reduces grads
❌ Uneven / overlapping data sharding
Giving every worker the full dataset means each sample is trained on N times — and the "extra" data is wasted compute:
loader = DataLoader(dataset, batch_size=32) # ❌ every GPU sees ALL the data
✅ Fix: use a DistributedSampler so each GPU gets a disjoint shard:
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank) loader = DataLoader(dataset, batch_size=32, sampler=sampler) # remember sampler.set_epoch(epoch) so shuffling differs each epoch
❌ Communication bottleneck (GPUs idle, waiting)
If the all-reduce of gradients takes longer than the compute, adding GPUs barely helps — you scale poorly:
# Symptom: 8 GPUs give only ~3x speedup, GPUs sit at 40% utilisation
✅ Fix: reduce comms relative to compute:
# - use a larger per-GPU batch (more compute per all-reduce) # - use gradient accumulation to all-reduce less often # - use NCCL + fast interconnect (NVLink/InfiniBand), not Ethernet # - keep grads in FP16/BF16 so there's half as much to send
📋 Quick Reference
| Strategy | Splits | Use when | Framework |
|---|---|---|---|
| Data Parallel (DDP) | The data batch | Model fits on 1 GPU | PyTorch DDP, Horovod |
| FSDP | Params + grads + data | Model barely fits | PyTorch FSDP |
| ZeRO-3 | Everything (sharded) | Max memory saving | DeepSpeed |
| Pipeline Parallel | By layer | Very deep models | Megatron-LM, DeepSpeed |
| Tensor Parallel | Inside a layer | Very wide layers | Megatron-LM |
| Concept | What it does |
|---|---|
| all-reduce | Sum gradients across workers, divide by N, broadcast back |
| Gradient accumulation | Add micro-batch grads, step optimizer once → bigger effective batch |
| Linear scaling rule | Multiply LR by the same factor you multiply the batch by |
| FP16 vs BF16 | FP16 needs loss scaling; BF16 has FP32's range and doesn't |
❓ Frequently Asked Questions
Q: What is the difference between data parallelism and model parallelism?
A: Data parallelism puts a full copy of the model on every GPU and feeds each GPU a different slice of the batch, then averages (all-reduces) the gradients so every copy stays identical. Model parallelism splits the model itself across GPUs — used when the model is too big to fit on one device. Tensor parallelism splits individual layers across GPUs; pipeline parallelism puts different layers on different GPUs.
Q: What is all-reduce and why does data parallel training need it?
A: All-reduce sums each gradient across every worker and divides by the number of workers, giving every GPU the same averaged gradient. Without it, each GPU would apply a different update from its own data slice and the model copies would drift apart. All-reduce keeps all copies byte-for-byte identical after every step.
Q: What is gradient accumulation?
A: Gradient accumulation lets you simulate a large batch on a small GPU. You run several small micro-batches, add up their gradients, and only run the optimizer once. Four micro-batches of 8 give the same effective batch of 32 as one big batch — you trade extra time for less memory.
Q: Why do I have to scale the learning rate when I add more GPUs?
A: More GPUs (or more accumulation steps) means a larger effective batch, which averages over more samples and produces a smoother, smaller-variance gradient. The linear scaling rule says to multiply the learning rate by the same factor you multiplied the batch by, and to add a few hundred steps of LR warmup so the larger steps do not diverge early in training.
Q: When should I use FSDP or DeepSpeed instead of plain DDP?
A: Use plain PyTorch DDP when the full model fits on one GPU — it is the simplest and fastest. Move to FSDP (Fully Sharded Data Parallel) or DeepSpeed ZeRO when the model, gradients, and optimizer states no longer fit, because they shard those across GPUs so each device stores only 1/N of everything. Horovod is an older all-reduce library still seen in some TensorFlow and multi-node setups.
Q: What is the difference between FP16, BF16, and FP32 in mixed-precision training?
A: FP32 is full 32-bit precision (accurate but memory-heavy). FP16 is 16-bit — half the memory and much faster on modern GPUs, but it has a narrow range so it needs loss scaling to avoid underflow. BF16 (bfloat16) is also 16-bit but keeps FP32's wide exponent range, so it trains stably without loss scaling and is the default on newer hardware.
🎯 Mini-Challenge: A Synchronous Data-Parallel Step
Put it together with no hints. Simulate one synchronous step across 3 workers: all-reduce the gradients, then update the weights. The expected output is in the comments so you can self-check.
Mini-Challenge: One Sync Step
Average 3 workers' gradients and apply the update from scratch
# 🎯 MINI-CHALLENGE: synchronous data-parallel step from scratch
# Simulate ONE synchronous training step across 3 workers.
#
# 1. Start with worker_grads = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# 2. All-reduce: average the gradients element-by-element across the 3 workers
# 3. Start weights = [0.5, 0.5] and learning_rate = 0.1
# 4. Update: new_weight = weight - learning_rate * averaged_grad (per parameter)
# 5. print the averaged gradient and the new weights
#
# ✅ Expected output:
# Averaged gr
...Lesson Complete — you can scale training now!
You can explain data parallelism and its gradient all-reduce, contrast it with model/pipeline/tensor parallelism and FSDP, simulate gradient accumulation, reason about FP16/BF16 mixed precision, and apply the linear LR scaling rule. You also know the four classic bugs to avoid.
🚀 Up next: Model Serving — once a model is trained across many GPUs, learn how to serve it in production reliably.
Sign up for free to track which lessons you've completed and get learning reminders.