Courses/AI & ML/Distributed Training

Lesson 41 • Advanced

Distributed Training

Learn how to scale model training across many GPUs — by the end you'll be able to explain data vs model parallelism, hand-compute the gradient all-reduce, simulate gradient accumulation, and pick the right framework (PyTorch DDP, FSDP, DeepSpeed) for a given model size.

What You'll Learn in This Lesson

✓How data parallelism (DDP) averages gradients with all-reduce
✓When to switch to model, pipeline, or tensor parallelism
✓The difference between synchronous and asynchronous training
✓How gradient accumulation fakes a big batch on a small GPU
✓Mixed precision (FP16/BF16) and why large batches need LR scaling
✓Which framework to reach for: PyTorch DDP, FSDP, DeepSpeed, Horovod

Before you start: You should be comfortable with how a single model trains — forward pass, loss, and the Hardware Optimization lesson. This lesson is about running that same loop across many GPUs at once.

🍳 Real-World Analogy: Many Cooks Splitting a Huge Meal Prep

Imagine a giant banquet that's too much for one cook. You hire a team and split the work — but there's more than one way to split it.

Data parallelism (DDP): every cook knows the whole recipe and gets a different pile of vegetables. They all chop at once. Then they compare notes and agree on one set of seasonings before the next course — that "compare and average" is the all-reduce of gradients.
Model / pipeline parallelism: the recipe is too big to memorise, so it becomes an assembly line — cook 1 preps, cook 2 sautés, cook 3 plates. Each handles only their station (different layers on different GPUs).
Gradient accumulation: a single cook with a tiny pan cooks four small batches and pools them before serving one big dish — a big "effective batch" without a bigger pan.

Training GPT-4-scale models took thousands of GPUs running for months. The same ideas scale down: even fine-tuning a 7B model usually needs more than one GPU.

1Data Parallelism — Same Model, Different Data

Data parallelism is the most common way to scale. You put a full copy of the model on every GPU (each copy is a worker), hand each worker a different slice of the batch, and let them all compute gradients at the same time.

The catch: if every worker just applied its own gradient, the copies would drift apart. So after the backward pass, the workers run an all-reduce — they sum every gradient across all workers and divide by the number of workers. Each GPU then applies that same averaged gradient, so all copies stay identical. The example below does the all-reduce by hand so you can see the math.

Try It: Average ('all-reduce') Gradients From N Workers

Run the gradient averaging that keeps every model copy in sync

Try it Yourself »

Python

# === Data Parallelism: averaging gradients across N workers ===
# In Data Parallel training, every worker (GPU) holds a FULL copy of the
# model. Each worker sees a DIFFERENT slice of the batch, computes its own
# gradients, and then all workers AVERAGE their gradients so every copy
# applies the SAME update and stays identical. That averaging step is the
# "all-reduce".

# Pretend each worker computed a gradient for the same 3 parameters.
# (Real gradients come from backprop; here we hand-pick
...

Key insight: all-reduce = sum across workers, then divide by N. Because every worker ends up with the same averaged gradient, every model copy stays byte-for-byte identical.

2Synchronous vs Asynchronous

There are two ways workers can coordinate. In synchronous training, every worker finishes its step and waits at a barrier for the all-reduce before anyone moves on — all copies stay identical, but the slowest worker (a "straggler") holds everyone up.

In asynchronous training, workers push gradients to a central parameter server and grab the latest weights without waiting. It avoids stragglers but workers can train on slightly stale weights, which can hurt accuracy. Modern deep learning almost always uses synchronous all-reduce — it's simpler and more accurate, and fast GPU interconnects (like NVLink) make the waiting cheap.

Synchronous (default)

All workers average gradients every step
Copies stay identical — reproducible
Slowest worker sets the pace

Asynchronous

No waiting — no straggler problem
Workers can use stale weights
Harder to tune, can lose accuracy

3Gradient Accumulation — A Big Batch on a Small GPU

Big batches train more smoothly, but they don't always fit in GPU memory. Gradient accumulation is the workaround: run several small micro-batches, add up their gradients, and only call the optimizer once at the end. Four micro-batches of 8 give the same effective batch of 32 as one big batch — you trade extra time for less memory.

Try It: Simulate Gradient Accumulation Over Micro-Batches

Add up micro-batch gradients to fake a larger effective batch

Try it Yourself »

Python

# === Gradient accumulation: simulate a big batch on a small GPU ===
# Goal: train as if the batch were 32, but your GPU can only hold 8 at a time.
# Trick: run 4 "micro-batches" of 8, ADD UP their gradients, and only update
# the weights once. 4 micro-batches x 8 = an EFFECTIVE batch of 32.

micro_batch_size = 8
accum_steps = 4                       # how many micro-batches to add before updating
effective_batch = micro_batch_size * accum_steps
print(f"Micro-batch: {micro_batch_size}  Accum ste
...

Notice you divide the accumulated gradient by the number of micro-batches so the update has the same scale as a real batch-of-32 step. The same idea stacks with data parallelism: effective batch = micro_batch × accum_steps × num_gpus.

4Model, Pipeline, and Tensor Parallelism

Data parallelism assumes the whole model fits on one GPU. When it doesn't — say a 70B-parameter model that needs hundreds of gigabytes — you have to split the model itself.

Pipeline parallelism (split by layer)

Different layers live on different GPUs — GPU 0 holds layers 1–8, GPU 1 holds layers 9–16, and the activations flow forward through them like an assembly line. To keep GPUs busy, the batch is sliced into micro-batches that are fed in a staggered pipeline.

Tensor parallelism (split inside a layer)

A single huge layer is split across GPUs — each GPU holds part of the weight matrix and they sync activations within the layer. Used for very wide layers (the attention and MLP blocks in large transformers).

FSDP / ZeRO (shard everything, still data parallel)

FSDP (Fully Sharded Data Parallel) and DeepSpeed ZeRO keep the data-parallel idea but shard the model weights, gradients, and optimizer states so each GPU stores only 1/N of everything. Params are gathered just in time for each layer's forward pass, then released — letting you train models far larger than one GPU's memory.

A rough memory rule for training in FP16: you need about 4× the model size (weights + gradients + optimizer states + activations). A 7B model is ~14 GB in FP16, so ~56 GB to train — close to filling one 80 GB A100 before you've added the batch.

5Mixed Precision and Learning-Rate Scaling

Mixed precision stores most numbers in 16-bit instead of 32-bit. That halves memory and roughly doubles throughput on modern GPUs. Two 16-bit formats matter:

FP16 — half precision, fast, but a narrow range, so it needs loss scaling to stop tiny gradients underflowing to zero.
BF16 (bfloat16) — same wide range as FP32 with less mantissa, so it trains stably without loss scaling. It's the default on newer hardware (A100/H100/TPU).
FP32 — full precision, kept for the master copy of weights and the optimizer step for stability.

Scaling out (more GPUs, more accumulation) grows the effective batch. Larger batches produce smoother, lower-variance gradients, so you can — and must — take bigger steps. The linear scaling rule: when you multiply the batch by k, multiply the learning rate by k too, and add a few hundred steps of LR warmup so the bigger steps don't blow up early in training.

Forgetting to scale the LR is the single most common distributed-training bug — your loss curve looks fine but the model trains far too slowly, because an 8-GPU batch is being nudged with a 1-GPU learning rate.

6The Real Thing: PyTorch DDP

In practice you rarely write the all-reduce yourself — the framework does it. The most common entry point is PyTorch DistributedDataParallel (DDP): wrap your model in DDP(...) and it averages gradients automatically during loss.backward(). You launch one process per GPU with torchrun. Read this through — it ties together every idea from the lesson.

# === The real thing: PyTorch DistributedDataParallel (DDP) ===
# This is what production code looks like. You launch one process per GPU
# (e.g. "torchrun --nproc_per_node=4 train.py"). DDP does the all-reduce
# of gradients for you, automatically, during backward().

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # set up GPU comms
    rank = dist.get_rank()                            # which process am I? (0..N-1)
    world_size = dist.get_world_size()                # how many processes total?
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])             # wrap -> auto gradient all-reduce

    # DistributedSampler gives each process a DIFFERENT, non-overlapping shard.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Linear scaling rule: effective batch = 32 * world_size, so scale the LR.
    base_lr = 0.1
    opt = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                           # <-- DDP all-reduces grads HERE
            opt.step()                                # every GPU applies the SAME update

    dist.destroy_process_group()

# Expected output (4 GPUs, conceptually):
#   rank 0..3 each train on 1/4 of the data
#   gradients are averaged across all 4 every step (all-reduce)
#   all 4 model copies stay byte-for-byte identical

Other frameworks fill different gaps: FSDP (built into PyTorch) and DeepSpeed ZeRO shard the model when it no longer fits; Horovod is an older Uber library that adds ring all-reduce to TensorFlow/PyTorch and is still seen in multi-node clusters.

🎯 Your Turn #1: Finish the All-Reduce

Two workers each computed a gradient for the same 2 parameters. Fill in the blanks so you sum across workers and divide by the worker count — the all-reduce.

Your Turn: All-Reduce Two Workers

Fill in the ___ blanks to average the gradients

Try it Yourself »

Python

# 🎯 YOUR TURN — finish the all-reduce (average the gradients)
# Two workers each computed a gradient for the same 2 parameters.
# Average them element-by-element so both model copies get the same update.

worker_grads = [
    [4.0, 10.0],   # worker 0
    [6.0, 2.0],    # worker 1
]
n_workers = len(worker_grads)
n_params = len(worker_grads[0])

averaged = []
for p in range(n_params):
    total = 0.0
    for w in range(n_workers):
        total += ___          # 👉 add this worker's value for pa
...

🎯 Your Turn #2: Accumulate and Scale the LR

Accumulate four micro-batches into an effective batch of 16, then apply the linear scaling rule to the learning rate. Fill in the blanks.

Your Turn: Accumulation + LR Scaling

Fill in the ___ blanks for the effective batch and scaled LR

Try it Yourself »

Python

# 🎯 YOUR TURN — accumulate micro-batches, then scale the learning rate
# Your GPU holds 4 samples, but you want an effective batch of 16.

micro_batch_size = 4
accum_steps = ___            # 👉 how many micro-batches make an effective batch of 16?
micro_grads = [2.0, 4.0, 6.0, 8.0]   # one gradient per micro-batch

accumulated = 0.0
for g in micro_grads:
    accumulated += ___       # 👉 add each micro-batch's gradient

avg_grad = accumulated / accum_steps

# Linear scaling rule: bigger effecti
...

Common Errors (And How to Fix Them)

Distributed bugs are sneaky — the code runs but the model trains wrong. Here are the four that bite everyone.

❌ Not scaling the learning rate with the batch

8 GPUs means an 8× larger effective batch, but the LR is still tuned for one GPU:

opt = SGD(model.parameters(), lr=0.1)   # ❌ same LR on 8 GPUs -> trains far too slowly

✅ Fix: apply the linear scaling rule (and warmup):

opt = SGD(model.parameters(), lr=0.1 * world_size)   # ✅ scale LR by #GPUs

❌ Gradient sync bugs (copies drift apart)

Stepping the optimizer on the raw model skips DDP's gradient all-reduce, so each GPU diverges:

model = nn.Linear(1024, 10).cuda()
ddp_model = DDP(model, device_ids=[rank])
loss = loss_fn(model(x), y)   # ❌ used the UN-wrapped model -> no all-reduce

✅ Fix: always run forward/backward through the DDP-wrapped model:

loss = loss_fn(ddp_model(x), y)   # ✅ backward() now all-reduces grads

❌ Uneven / overlapping data sharding

Giving every worker the full dataset means each sample is trained on N times — and the "extra" data is wasted compute:

loader = DataLoader(dataset, batch_size=32)   # ❌ every GPU sees ALL the data

✅ Fix: use a DistributedSampler so each GPU gets a disjoint shard:

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
# remember sampler.set_epoch(epoch) so shuffling differs each epoch

❌ Communication bottleneck (GPUs idle, waiting)

If the all-reduce of gradients takes longer than the compute, adding GPUs barely helps — you scale poorly:

# Symptom: 8 GPUs give only ~3x speedup, GPUs sit at 40% utilisation

✅ Fix: reduce comms relative to compute:

# - use a larger per-GPU batch (more compute per all-reduce)
# - use gradient accumulation to all-reduce less often
# - use NCCL + fast interconnect (NVLink/InfiniBand), not Ethernet
# - keep grads in FP16/BF16 so there's half as much to send

📋 Quick Reference

Strategy	Splits	Use when	Framework
Data Parallel (DDP)	The data batch	Model fits on 1 GPU	PyTorch DDP, Horovod
FSDP	Params + grads + data	Model barely fits	PyTorch FSDP
ZeRO-3	Everything (sharded)	Max memory saving	DeepSpeed
Pipeline Parallel	By layer	Very deep models	Megatron-LM, DeepSpeed
Tensor Parallel	Inside a layer	Very wide layers	Megatron-LM

Concept	What it does
all-reduce	Sum gradients across workers, divide by N, broadcast back
Gradient accumulation	Add micro-batch grads, step optimizer once → bigger effective batch
Linear scaling rule	Multiply LR by the same factor you multiply the batch by
FP16 vs BF16	FP16 needs loss scaling; BF16 has FP32's range and doesn't

❓ Frequently Asked Questions

Q: What is the difference between data parallelism and model parallelism?

A: Data parallelism puts a full copy of the model on every GPU and feeds each GPU a different slice of the batch, then averages (all-reduces) the gradients so every copy stays identical. Model parallelism splits the model itself across GPUs — used when the model is too big to fit on one device. Tensor parallelism splits individual layers across GPUs; pipeline parallelism puts different layers on different GPUs.

Q: What is all-reduce and why does data parallel training need it?

A: All-reduce sums each gradient across every worker and divides by the number of workers, giving every GPU the same averaged gradient. Without it, each GPU would apply a different update from its own data slice and the model copies would drift apart. All-reduce keeps all copies byte-for-byte identical after every step.

Q: What is gradient accumulation?

A: Gradient accumulation lets you simulate a large batch on a small GPU. You run several small micro-batches, add up their gradients, and only run the optimizer once. Four micro-batches of 8 give the same effective batch of 32 as one big batch — you trade extra time for less memory.

Q: Why do I have to scale the learning rate when I add more GPUs?

A: More GPUs (or more accumulation steps) means a larger effective batch, which averages over more samples and produces a smoother, smaller-variance gradient. The linear scaling rule says to multiply the learning rate by the same factor you multiplied the batch by, and to add a few hundred steps of LR warmup so the larger steps do not diverge early in training.

Q: When should I use FSDP or DeepSpeed instead of plain DDP?

A: Use plain PyTorch DDP when the full model fits on one GPU — it is the simplest and fastest. Move to FSDP (Fully Sharded Data Parallel) or DeepSpeed ZeRO when the model, gradients, and optimizer states no longer fit, because they shard those across GPUs so each device stores only 1/N of everything. Horovod is an older all-reduce library still seen in some TensorFlow and multi-node setups.

Q: What is the difference between FP16, BF16, and FP32 in mixed-precision training?

A: FP32 is full 32-bit precision (accurate but memory-heavy). FP16 is 16-bit — half the memory and much faster on modern GPUs, but it has a narrow range so it needs loss scaling to avoid underflow. BF16 (bfloat16) is also 16-bit but keeps FP32's wide exponent range, so it trains stably without loss scaling and is the default on newer hardware.

🎯 Mini-Challenge: A Synchronous Data-Parallel Step

Put it together with no hints. Simulate one synchronous step across 3 workers: all-reduce the gradients, then update the weights. The expected output is in the comments so you can self-check.

Mini-Challenge: One Sync Step

Average 3 workers' gradients and apply the update from scratch

Try it Yourself »

Python

# 🎯 MINI-CHALLENGE: synchronous data-parallel step from scratch
# Simulate ONE synchronous training step across 3 workers.
#
# 1. Start with worker_grads = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# 2. All-reduce: average the gradients element-by-element across the 3 workers
# 3. Start weights = [0.5, 0.5] and learning_rate = 0.1
# 4. Update: new_weight = weight - learning_rate * averaged_grad  (per parameter)
# 5. print the averaged gradient and the new weights
#
# ✅ Expected output:
# Averaged gr
...

🎉

Lesson Complete — you can scale training now!

You can explain data parallelism and its gradient all-reduce, contrast it with model/pipeline/tensor parallelism and FSDP, simulate gradient accumulation, reason about FP16/BF16 mixed precision, and apply the linear LR scaling rule. You also know the four classic bugs to avoid.

🚀 Up next: Model Serving — once a model is trained across many GPUs, learn how to serve it in production reliably.

Distributed Training

What You'll Learn in This Lesson

🍳 Real-World Analogy: Many Cooks Splitting a Huge Meal Prep

1Data Parallelism — Same Model, Different Data

Try It: Average ('all-reduce') Gradients From N Workers

2Synchronous vs Asynchronous

3Gradient Accumulation — A Big Batch on a Small GPU

Try It: Simulate Gradient Accumulation Over Micro-Batches

4Model, Pipeline, and Tensor Parallelism

5Mixed Precision and Learning-Rate Scaling

6The Real Thing: PyTorch DDP

🎯 Your Turn #1: Finish the All-Reduce

Your Turn: All-Reduce Two Workers

🎯 Your Turn #2: Accumulate and Scale the LR

Your Turn: Accumulation + LR Scaling

Common Errors (And How to Fix Them)

📋 Quick Reference

❓ Frequently Asked Questions

🎯 Mini-Challenge: A Synchronous Data-Parallel Step

Mini-Challenge: One Sync Step

Lesson Complete — you can scale training now!

Cookie & Privacy Settings