Lesson 40 • Advanced

Hardware Optimization for ML

Make a trained model run fast and cheap in production — choose the right chip (CPU vs GPU vs TPU vs accelerators), beat the memory-bandwidth bottleneck, batch for throughput, drop precision (FP32 → FP16 → INT8), fuse and compile the graph with TensorRT/ONNX Runtime, and profile to find the real bottleneck instead of guessing.

What You'll Learn in This Lesson

✓ Match the workload to the right hardware: CPU, GPU, TPU, or accelerator
✓ See why memory bandwidth, not FLOPs, caps LLM inference speed
✓ Batch inputs to maximize throughput, and find the best batch size
✓ Trade precision (FP32 → FP16 → INT8) for memory and speed
✓ Fuse operators and compile graphs with TensorRT, ONNX Runtime, and torch.compile
✓ Profile first to find the real bottleneck instead of optimizing blind

Before you start: You should have a trained model ready to deploy and have completed Model Compression, since quantization and pruning set up the precision and memory savings you'll tune here. Basic Python loops and arithmetic are all you need for the exercises.

🚚 Real-World Analogy: The Right Vehicle for the Job

Moving goods isn't about owning the fastest engine — it's about picking the right vehicle and loading it well. Hardware optimization is the same: the model is the cargo, and your job is to deliver it as fast and cheaply as possible.

The vehicle — a CPU is a nimble motorbike (fast for one small errand, hopeless for bulk). A GPU is a fleet of vans running in parallel. A TPU is a freight train purpose-built for one route. Match the vehicle to the load.
Batching — sending one parcel per trip wastes the whole van. Fill it up (batch the inputs) and each trip delivers far more for almost the same fuel.
The road, not the engine — a faster engine is useless on a narrow road. Memory bandwidth is the road width; for big models it, not raw compute, decides your speed.
Lighter cargo — pack the same goods in lighter boxes (FP16, INT8) and you move more per trip on the same road.
One smooth run — fewer stops mean faster delivery. Operator fusion and compilation merge many little steps into one continuous run instead of unloading and reloading at every junction.

And before you change anything, you check the traffic report: profiling tells you whether you're stuck on the road (bandwidth), idling at the depot (data loading), or actually limited by the engine (compute). Optimizing the wrong thing is just driving faster down the wrong street.

1. CPU vs GPU vs TPU vs Accelerators

A neural network is mostly one operation done billions of times: multiply a matrix of inputs by a matrix of weights. Which chip you run that on changes your speed by orders of magnitude, because the chips are built for different jobs.

CPU — a handful of very fast, flexible cores. Brilliant at branchy control code and small models, but it does those huge matrix multiplies one small chunk at a time. Use it for tiny models, preprocessing, or when no GPU is available.
GPU — thousands of simpler cores that crunch the same matrix multiply in parallel. This is the default for both training and inference. NVIDIA GPUs add Tensor Cores, dedicated units for fast FP16/INT8 matrix maths.
TPU — Google's chip built almost entirely around a giant matrix-multiply array plus very high-bandwidth memory. Superb on large transformers and CNNs and very power-efficient, but less flexible than a GPU and tied to Google's stack.
Accelerators / NPUs — purpose-built inference chips (AWS Inferentia, Apple Neural Engine, edge NPUs). They sacrifice generality for efficiency on a fixed set of operations, ideal on phones and at the edge.

Rule of thumb: small/branchy work → CPU; almost everything in deep learning → GPU; very large fixed transformer/CNN workloads at scale → TPU or a dedicated accelerator.
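
To make the "orders of magnitude" claim concrete, here is a rough sketch that counts the floating-point operations in one dense layer and divides by a sustained throughput. The function name and the two throughput figures are illustrative assumptions, not measurements of any real chip, and the estimate ignores memory traffic entirely.

```python
# Rough cost of one dense layer: (batch x in_features) @ (in_features x out_features).
# Each output element needs in_features multiplies and in_features adds.

def matmul_flops(batch, in_features, out_features):
    """Floating-point operations for one matrix multiply (multiply + add = 2 ops)."""
    return 2 * batch * in_features * out_features

# ASSUMPTION: round, illustrative sustained rates, not benchmarks of real hardware.
ASSUMED_RATES = {
    "CPU-class (1 TFLOP/s)": 1e12,
    "GPU-class (100 TFLOP/s)": 100e12,
}

flops = matmul_flops(batch=64, in_features=1024, out_features=1024)
print(f"FLOPs for one layer: {flops:,}")

estimates = {}
for name, rate in ASSUMED_RATES.items():
    estimates[name] = flops / rate          # seconds, if compute were the only limit
    print(f"{name}: {estimates[name] * 1e6:.2f} microseconds")
```

Swap in the sustained rate you actually measure on your hardware; the ratio between chips matters more than either absolute number.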

2. The Memory-Bandwidth Bottleneck

Beginners assume a model is slow because the chip can't do enough maths (not enough FLOPS, floating-point operations per second). For large-model inference the opposite is usually true: the chip is starving, waiting for weights to arrive from memory. This is the memory-bandwidth bottleneck.

Here's why. To generate one token, an LLM has to read every weight in the model from memory, but only does a tiny bit of arithmetic with each one. So the limit isn't compute — it's how many bytes per second you can stream out of memory (the bandwidth). A useful back-of-the-envelope estimate for single-stream speed is:

# Single-stream LLM speed is bounded by memory bandwidth.
tokens_per_sec  ≈  memory_bandwidth  /  model_size_in_bytes

# Example: a 7B model in FP16 is ~14 GB. On a GPU with 2 TB/s:
#   2,000 GB/s  /  14 GB  ≈  143 tokens/sec

Two big consequences follow. First, shrinking the model (lower precision, Section 4) directly raises speed because there are fewer bytes to read. Second, batching (Section 3) helps because you read each weight once and reuse it for the whole batch — turning a bandwidth-bound job into a compute-bound one that keeps the cores busy.
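
The shift from bandwidth-bound to compute-bound can be sketched with a roofline-style estimate: one decoding step takes as long as the slower of streaming the weights and doing the arithmetic for the batch. The model size and bandwidth come from the 7B example above; the peak compute figure and the "2 FLOPs per weight per token" rule are simplifying assumptions, and the sketch ignores the KV cache and activations.

```python
# Roofline-style sketch: a decoding step is limited by whichever is slower,
# reading the weights from memory or doing the maths for the whole batch.

PARAMS = 7_000_000_000            # 7B weights (from the example above)
BYTES_PER_WEIGHT = 2              # FP16
BANDWIDTH = 2_000_000_000_000     # 2 TB/s (from the example above)
PEAK_FLOPS = 300e12               # ASSUMPTION: illustrative 300 TFLOP/s, not a real spec

def step_time(batch):
    """Return (seconds for one decoding step, which resource limits it)."""
    memory_time = PARAMS * BYTES_PER_WEIGHT / BANDWIDTH   # weights are read once per step
    compute_time = 2 * PARAMS * batch / PEAK_FLOPS        # ~2 FLOPs per weight per token
    bound = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), bound

for batch in [1, 8, 64, 256]:
    seconds, bound = step_time(batch)
    print(f"batch={batch:>3}  tokens/sec={batch / seconds:>9.0f}  bound by {bound}")
```

With these assumed numbers the step stays memory-bound until the batch is large enough that arithmetic takes longer than the weight read, after which adding more items no longer raises tokens/sec.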

3. Batching: Throughput vs Latency

Running one input at a time wastes the hardware: you pay the cost of reading the weights but only process a single item. Batching stacks many inputs and runs them together, so the expensive weight read is shared across the whole batch. Two metrics matter, and they pull in opposite directions:

Throughput — items processed per second. Batching raises it sharply, because the per-batch overhead is amortized over more items.
Latency — time for one request to finish. Batching increases it, because each request waits for the batch to fill and complete.

The model is simple: a batch costs a fixed overhead (kernel launch, reading weights) plus a small per-item compute cost. Throughput is batch / time_per_batch. It climbs as the batch grows — until the hardware saturates, after which throughput flattens while latency keeps rising. The job is to pick the batch size that maximizes throughput within your latency budget.

Real-time apps (chat, voice) optimize for p95 latency (the time within which 95% of requests finish) and use small batches or continuous batching. Offline jobs (data pipelines) optimize for throughput and use the largest batch that fits in memory.
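
Since real-time targets are stated as p95 rather than as an average, it helps to see how the two differ. Below is a minimal sketch using the nearest-rank definition of a percentile; the `percentile` helper and the latency samples are made up for illustration.

```python
# Nearest-rank percentile: the smallest sample such that at least p% of all
# samples are less than or equal to it.

def percentile(samples, p):
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100      # ceil(p/100 * n) using integers only
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds (made up for illustration).
latencies_ms = [12, 15, 11, 14, 13, 90, 12, 16, 13, 14,
                15, 12, 11, 13, 14, 250, 12, 13, 15, 14]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
print(f"mean={mean_ms:.2f} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

Two slow requests barely move the median but dominate the tail, which is why a latency budget is usually checked against p95 and not the mean.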

4. Precision: FP32 vs FP16 vs INT8

Every weight is stored in a number format, and the format's bit width decides how many bytes you move and how fast the hardware runs. Cutting precision is the single easiest large speedup in inference.

FP32 (4 bytes)

Full precision. The training default, the most accurate, the slowest and largest. Keep it only for numerically sensitive layers.

FP16 / BF16 (2 bytes)

Half the memory and bandwidth, ~2x faster on Tensor Cores, almost no accuracy loss. The sensible default for inference.

INT8 (1 byte)

A quarter of the memory, often 2-4x faster. Needs calibration and an accuracy check, but huge for serving at scale.

When inference is bandwidth-bound, speed scales with bytes read, so going FP32 → FP16 roughly doubles tokens/sec and FP16 → INT8 roughly doubles it again. The catch is accuracy: lower precision rounds more, so always benchmark the quantized model on your task before shipping. You'll feel this effect directly in Your Turn #2.
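
To see what "rounds more" means in practice, here is a small sketch of symmetric per-tensor INT8 quantization: scale the weights so the largest maps to 127, round to integers, then convert back. This scheme and the sample weights are assumptions chosen for simplicity; real toolkits add calibration data, per-channel scales, and other refinements.

```python
# Symmetric per-tensor INT8 quantization of a handful of weights.
# ASSUMPTION: a deliberately simple scheme for illustration only.

weights = [0.12, -0.5, 0.33, 1.27, -0.98]      # made-up FP32 weights

scale = max(abs(w) for w in weights) / 127      # map the largest weight to +/-127

def quantize(w):
    q = round(w / scale)
    return max(-127, min(127, q))               # clamp into the symmetric INT8 range

def dequantize(q):
    return q * scale

quantized = [quantize(w) for w in weights]
restored = [dequantize(q) for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", quantized)
print(f"largest rounding error: {max_error:.6f} (at most scale/2 = {scale / 2:.6f})")
```

Each weight can be off by up to half a quantization step. Whether that matters depends on the layer, which is why the accuracy check on your own task is not optional.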

5. Operator Fusion & Graph Compilation

A model is a graph of small operations — a matrix multiply, then add a bias, then a ReLU activation. Run naively, each operation launches its own GPU kernel (a unit of GPU work) and writes its result back to slow memory before the next one reads it. Those round-trips dominate the time.

Operator fusion merges a chain of ops into a single kernel that keeps the intermediate values in fast registers — e.g. matmul + bias + ReLU becomes one fused kernel, eliminating the memory round-trips. You rarely write fusion by hand; graph compilers do it for you:

torch.compile — one line in PyTorch. Traces, fuses, and picks fast kernels. ~1.3-2x for free.
ONNX Runtime — export to ONNX (the universal model format), then run with cross-platform graph optimization on CPU or GPU.
TensorRT — NVIDIA's inference compiler. Aggressive fusion, FP16/INT8 kernels, and kernel auto-tuning for ~3-5x on NVIDIA GPUs.
OpenVINO / CoreML — the equivalents for Intel CPUs and Apple Silicon.

The usual pipeline: PyTorch/TF model → export to ONNX → compile with the right backend for your hardware → deploy. The two worked examples below show torch.compile and a TensorRT FP16 export.
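
Before the library examples, a toy picture of what fusion saves may help. In this sketch Python lists stand in for GPU memory, and every full list written counts as one round-trip; the function names are hypothetical and nothing here runs on a GPU.

```python
# Toy picture of fusion for the elementwise tail of a layer (bias add + ReLU).
# Python lists stand in for GPU memory; each full list written is one
# "round-trip" that a fused kernel avoids.

def unfused(values, bias):
    writes = 0
    added = [v + bias for v in values]          # "kernel" 1 writes an intermediate
    writes += len(added)
    activated = [max(0.0, a) for a in added]    # "kernel" 2 reads it back, writes output
    writes += len(activated)
    return activated, writes

def fused(values, bias):
    # One pass: the intermediate sum only ever lives in a temporary value.
    out = [max(0.0, v + bias) for v in values]
    return out, len(out)

values = [-2.0, -0.5, 0.0, 1.5, 3.0]
out_a, writes_a = unfused(values, bias=0.5)
out_b, writes_b = fused(values, bias=0.5)

print("same result:", out_a == out_b)
print("element writes, unfused vs fused:", writes_a, "vs", writes_b)
```

The result is identical, but the fused version writes half as many elements. A real compiler applies the same idea across longer chains of ops, including the matmul itself.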

6. Profile First: Find the Real Bottleneck

The most expensive mistake in this whole lesson is optimizing by guessing. The bottleneck is rarely where you assume — a slow model is often waiting on data loading or Python overhead, not compute. Profile first.

# PyTorch profiler — see where the time actually goes.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).eval()
x = torch.randn(64, 1024)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

# Print the ops that took the most total time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))

# The table shows each op's total time — read it to find the real
# hotspot before you change a single line of model code.

For GPU work, NVIDIA Nsight Systems shows whether the GPU is busy or idle and whether transfers overlap compute. Low GPU utilization with slow inference almost always means the GPU is waiting: tiny batches, data-loading stalls, or per-call Python overhead. Fix what the profiler points at — then measure again.
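
When the suspect is outside the model (data loading, preprocessing), a coarse stage timer is often enough to confirm it before reaching for a full profiler. The sketch below uses only the standard library; `load_batch` and `run_model` are hypothetical stand-ins that sleep to simulate work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)   # accumulated seconds per stage name

@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the with-block to stage_totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start

# Hypothetical stand-ins for a real pipeline (sleep simulates the work).
def load_batch():
    time.sleep(0.020)     # pretend data loading is slow

def run_model():
    time.sleep(0.005)     # pretend the model itself is fast

for _ in range(5):
    with timed("data loading"):
        load_batch()
    with timed("model forward"):
        run_model()

total = sum(stage_totals.values())
for stage, seconds in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
    print(f"{stage:<14} {seconds * 1000:7.1f} ms  ({seconds / total:.0%})")
```

One caveat if you adapt this to GPU code: CUDA kernels launch asynchronously, so call torch.cuda.synchronize() before reading the clock, or the forward pass will look faster than it is.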

🧩 Worked Example: Graph Compilation with torch.compile

One line wraps your model and gives you fused kernels. Outputs should match the uncompiled model up to small floating-point differences, so task accuracy is unaffected. The first call compiles (slow); every call after runs the optimized graph.

# Graph compilation with torch.compile — one line, no accuracy change.
# It traces the model, fuses ops into bigger kernels, and picks the
# best kernel for your GPU. Typical inference speedup: 1.3-2x.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
).eval()

x = torch.randn(64, 1024)

# Compile the graph. The FIRST call is slow (it compiles); every call
# after that runs the optimized, fused kernels.
fast_model = torch.compile(model)

with torch.no_grad():
    out = fast_model(x)        # warm-up: triggers compilation
    out = fast_model(x)        # now running the fused kernels

print("output shape:", tuple(out.shape))

# Expected output:
#   output shape: (64, 10)

⚙️ Worked Example: TensorRT FP16 Export

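The first steps of the production pipeline in code: export to ONNX, then configure a TensorRT builder with FP16 enabled. Parsing the ONNX file and building the engine are left out to keep the example short. The FP16 flag is the speed switch that turns on half-precision kernels.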

# TensorRT: NVIDIA's inference compiler. Export to ONNX, then let
# TensorRT fuse ops, pick fast kernels, and run in FP16 for ~3-5x.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
dummy = torch.randn(1, 512)

# 1) Export the PyTorch model to ONNX (the universal interchange format).
torch.onnx.export(model, dummy, "model.onnx", input_names=["x"],
                  output_names=["y"], dynamic_axes={"x": {0: "batch"}})

# 2) Configure a TensorRT builder with FP16 enabled.
#    (Partial sketch: parsing model.onnx into a network and building the
#    engine from this config are omitted here.)
import tensorrt as trt
logger  = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config  = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # <- the speed switch

print("ONNX exported, TensorRT FP16 engine configured")

# Expected output:
#   ONNX exported, TensorRT FP16 engine configured

▶️ Worked Example: Throughput vs Batch Size (run it)

This is the batching trade-off with no libraries — just a per-batch overhead, a per-item cost, and the throughput each batch size delivers. Read the comments, then press run and watch throughput climb as the batch grows.

Worked Example: Throughput vs Batch Size

Compute items/second for each batch size and pick the fastest

Python

# Worked example: find the batch size that maximizes THROUGHPUT.
# throughput = items processed / total time. We model the time for a
# batch as a fixed per-batch overhead plus a small per-item cost.

batch_sizes = [1, 2, 4, 8, 16, 32, 64]

fixed_overhead = 0.010   # 10 ms to launch a batch (kernel launch, memory read)
per_item_cost  = 0.002   # 2 ms of compute per item once the batch is running

best_batch = None
best_throughput = 0.0

for batch in batch_sizes:
    # Time for ONE batch = fixed 
...

🎯 Your Turn #1: Maximize Throughput

Fill in the two blanks marked ___ so the loop computes throughput (items / time) and remembers the batch size that wins. Check your output against the # ✅ Expected output comment.

Your Turn #1: Best Batch Size

Finish the throughput formula and track the winning batch size

Python

# 🎯 YOUR TURN #1 — compute throughput and pick the best batch size
# Fill in the two blanks marked ___ so the loop finds the batch size
# that processes the most items per second.

batch_sizes = [1, 4, 8, 16, 32]

fixed_overhead = 0.020   # 20 ms per-batch overhead
per_item_cost  = 0.004   # 4 ms per item

best_batch = None
best_throughput = 0.0

for batch in batch_sizes:
    time_per_batch = fixed_overhead + per_item_cost * batch
    throughput = batch / ___          # 👉 items per second = ba
...

🎯 Your Turn #2: Precision & Speed

Fill in the two blanks so the loop turns parameter count plus bytes-per-weight into model size, then estimates tokens/sec from memory bandwidth. Watch the numbers double as precision halves.

Your Turn #2: Precision and Speed

Turn precision into memory footprint and a tokens/sec estimate

Python

# 🎯 YOUR TURN #2 — lower precision saves memory AND boosts tokens/sec
# A 7-billion-parameter model. Each precision uses a different number of
# BYTES per weight. Single-stream speed ≈ bandwidth / model_size_bytes.

params = 7_000_000_000          # 7B weights
bandwidth = 2_000_000_000_000   # 2 TB/s of GPU memory bandwidth

precisions = [
    ("FP32", 4),    # 4 bytes per weight
    ("FP16", 2),    # 2 bytes per weight
    ("INT8", 1),    # 1 byte per weight
]

for name, bytes_per_weight in pr
...

Common Errors (And How to Fix Them)

These five mistakes sink most first attempts at optimizing inference:

❌ Running on CPU when the work needs a GPU

Serving a deep model on a CPU and wondering why each request takes seconds — the CPU does those huge matrix multiplies a tiny chunk at a time.

✅ Fix: move the model to a GPU (model.to("cuda")) and the inputs with it. Keep the CPU for tiny models, branchy logic, and preprocessing.

❌ Batch size of 1 in an offline job

Looping over inputs one at a time in a data pipeline, leaving the GPU 90% idle while it reads the weights again for every single item.

✅ Fix: batch the inputs so the weight read is amortized. Use the largest batch that fits memory for offline work; use continuous batching for serving.
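
A minimal sketch of that fix for an offline job: slice the inputs into fixed-size chunks and call the model once per chunk. The `batched` helper and the `run_model` stand-in are hypothetical; in a real pipeline `run_model` would be your forward pass.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

calls = []   # records the size of every batch the "model" receives

def run_model(batch):
    # Hypothetical stand-in for a real forward pass.
    calls.append(len(batch))
    return [x * 2 for x in batch]

inputs = list(range(10))
outputs = []
for batch in batched(inputs, batch_size=4):
    outputs.extend(run_model(batch))

print("model calls:", len(calls), "batch sizes:", calls)
```

Ten inputs now cost three model calls instead of ten, and the last, smaller batch is handled without special-casing.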

❌ Memory overflow (CUDA out of memory)

Pushing the batch size or sequence length until the GPU throws CUDA out of memory — activations and the KV cache grow with batch and length, not just the weights.

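✅ Fix: lower the batch size, drop precision (FP16/INT8) to free VRAM, run inference under torch.no_grad() so activations are not kept for a backward pass, use paged attention for the KV cache, and leave headroom for it. Gradient/activation checkpointing helps only when training.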
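
To judge how much headroom the KV cache needs, a back-of-the-envelope estimate is usually enough: every token stores one key and one value vector per layer and per KV head. The model shape below is an illustrative assumption for a 7B-class transformer with an FP16 cache, not the specification of any particular model.

```python
# KV cache size: for every token, every layer stores one key vector and one
# value vector per KV head.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # 2 = key + value
    return per_token * seq_len * batch

# ASSUMPTION: illustrative shape for a 7B-class transformer, FP16 cache.
shape = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)

for batch in [1, 8, 32]:
    size = kv_cache_bytes(seq_len=4096, batch=batch, **shape)
    print(f"batch={batch:>2}  KV cache = {size / 2**30:.1f} GiB")
```

Under these assumptions a single 4096-token sequence already needs about 2 GiB, and the total grows linearly with both batch size and sequence length, which is why a batch that fits at short lengths can run out of memory at long ones.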

❌ Optimizing without profiling first

Rewriting model code for a week to speed it up, when a profile would have shown the GPU was simply starved by a slow data loader.

✅ Fix: profile before touching anything (PyTorch profiler, Nsight Systems), fix the single biggest hotspot, then measure again. Never optimize blind.

❌ Precision dropped too low

Quantizing straight to INT8 (or INT4) to chase speed and shipping it without checking — accuracy quietly collapses on maths, code, or rare classes.

✅ Fix: step down gradually (FP16 → INT8), calibrate properly, and benchmark accuracy on your real task at each step. Keep sensitive layers in higher precision.

📋 Quick Reference

Lever	What it does	Typical gain	Watch out for
GPU instead of CPU	Parallel matrix multiplies	10-100x	Model + inputs must be on the device
Batching	Amortize the weight read	2-10x throughput	Adds latency per request
FP16 / BF16	16-bit weights	~2x	Tiny accuracy loss
INT8	8-bit weights	2-4x	Needs calibration + accuracy check
torch.compile	Fuse + pick fast kernels	1.3-2x	Slow first (compile) call
TensorRT (FP16/INT8)	NVIDIA fusion compiler	3-5x	NVIDIA GPUs only
ONNX Runtime	Cross-platform graph opt	1.5-2x	Export to ONNX first
Profiling	Find the real bottleneck	Avoids wasted work	Do it before optimizing

❓ Frequently Asked Questions

Q: What is the difference between a CPU, a GPU, and a TPU for ML?

A: A CPU has a few very fast, flexible cores and is great for control-heavy code and small models, but slow on the giant matrix multiplications inside a neural network. A GPU has thousands of simpler cores that run those matrix multiplications in parallel, so it is the default for training and most inference. A TPU is a Google-designed accelerator built almost entirely around matrix-multiply units and high-bandwidth memory, so it is extremely fast and efficient on large transformer and CNN workloads but less flexible than a GPU.

Q: Why is LLM inference limited by memory bandwidth instead of compute?

A: When you generate one token at a time, the GPU must read every weight in the model from memory for each token, but does only a little arithmetic with each weight. So the bottleneck is how fast you can stream weights out of memory (bandwidth), not how many FLOPs the chip can do. A rough estimate of single-stream speed is tokens/sec ≈ memory_bandwidth / model_size_in_bytes. This is also why batching and quantization help so much: they get more useful work out of each byte you read.

Q: How does batching improve throughput, and what is the cost?

A: Batching runs many inputs through the model in one pass. Because the weights are loaded once and reused across the whole batch, you amortize the expensive memory reads and the hardware stays busy, so throughput (items per second) climbs sharply at first. The cost is latency: each individual request waits for the batch to fill and finish, and once the batch is large enough to saturate the hardware, throughput stops improving while latency keeps rising. You pick the batch size that maximizes throughput without breaking your latency budget.

Q: What do FP32, FP16, and INT8 mean, and when should I use lower precision?

A: They are number formats with different bit widths: FP32 is 32-bit float (full precision), FP16/BF16 is 16-bit float (half the memory and bandwidth, ~2x faster), and INT8 is an 8-bit integer (a quarter of the memory, often 2-4x faster). Lower precision moves fewer bytes and runs on faster hardware units, so it is the easiest large speedup. Use FP16/BF16 as the default for inference, drop to INT8 when you need more speed and have verified accuracy holds, and keep FP32 only for parts of the model that are numerically sensitive.

Q: What are operator fusion and graph compilation (TensorRT, ONNX Runtime, torch.compile)?

A: A model is a graph of small operations (matmul, bias add, activation). Run naively, each op launches its own GPU kernel and writes its result back to slow memory before the next op reads it. Operator fusion combines a chain of ops into a single kernel that keeps intermediate values in fast registers, eliminating those round-trips to memory. Graph compilers like TensorRT, ONNX Runtime, and torch.compile analyze the whole graph and apply fusion, constant folding, and hardware-specific kernel selection automatically, typically giving 1.5-5x speedups. Accuracy is unaffected as long as precision is left unchanged; FP16 and INT8 modes still need an accuracy check.

Q: Why is my GPU only 10% utilized even though inference feels slow?

A: Low GPU utilization with slow inference almost always means the GPU is waiting — on data loading, on tiny batches that do not fill the cores, on Python overhead between kernel launches, or on memory transfers. The fix is to profile first: use the PyTorch profiler or Nsight Systems to see where time actually goes, then increase batch size, fuse/compile the graph, move preprocessing off the critical path, and overlap data transfer with compute. Never optimize by guessing — the bottleneck is rarely where you assume.

🎯 Mini-Challenge: Batch Within a Latency Budget

Now combine throughput and latency with only a comment outline — no filled-in logic. Find the largest batch whose latency fits a 100 ms budget, and report its throughput.

Mini-Challenge: Best Batch Within Budget

Pick the highest-throughput batch that still meets the latency budget

Python

# 🎯 MINI-CHALLENGE: pick the batch size that fits BOTH limits
# A request must finish within a latency budget AND you want the most
# throughput. Find the largest batch whose latency is within budget,
# and report its throughput.
#
# 1. batch_sizes = [1, 2, 4, 8, 16, 32, 64]
# 2. fixed_overhead = 0.015   # 15 ms
#    per_item_cost  = 0.003   # 3 ms per item
#    latency_budget = 0.100   # 100 ms — a batch may not take longer than this
# 3. For each batch: latency = fixed_overhead + per_item_cos
...

🎉

Lesson 40 complete — you can make a model fly in production!

You can match a workload to the right chip (CPU, GPU, TPU, accelerator), explain why memory bandwidth — not FLOPs — caps LLM inference, batch inputs to maximize throughput and pick the best batch size, trade precision (FP32 → FP16 → INT8) for speed, fuse and compile graphs with torch.compile/ONNX Runtime/TensorRT, and profile to find the real bottleneck instead of guessing — while avoiding CPU-for-GPU work, batch-size-1 stalls, memory overflow, blind optimization, and precision dropped too low.

🚀 Up next: Distributed Training — scale a model across many GPUs and machines when one chip is no longer enough.
