Lesson 40 • Advanced
Hardware Optimization for ML
Optimise ML inference for GPUs, CPUs, and TPUs — ONNX export, TensorRT, operator fusion, and KV caching for maximum throughput.
✅ What You'll Learn
- GPU vs CPU vs TPU performance characteristics
- ONNX export and TensorRT optimization
- Operator fusion, Flash Attention, and KV caching
- Memory bandwidth as the inference bottleneck
⚡ Speed Matters in Production
🎯 Real-World Analogy: Running a PyTorch model directly in production is like driving a Formula 1 car on regular tyres — it works, but you're leaving 80% of performance on the table. TensorRT is like fitting race tyres, ONNX is the universal pit stop, and Flash Attention is a turbocharger. Each optimization compounds: 2× from TensorRT, 2× from batching, 2× from FP16 = 8× total speedup.
The difference between "demo on a laptop" and "production at scale" is inference optimization. A model that takes 500ms per request on PyTorch can serve at 60ms with TensorRT FP16 — turning a toy into a real-time product.
Try It: Hardware Comparison
Compare GPU, CPU, and TPU performance for different model sizes
import numpy as np

# GPU vs CPU vs TPU: Understanding Hardware for ML
# Why GPUs are essential and how to use them efficiently
np.random.seed(42)

def simulate_matmul(size, device_tflops):
    """Simulate matrix multiplication time at a device's peak throughput."""
    flops = 2 * size ** 3  # FLOPs for an N x N matrix multiply
    time_seconds = flops / (device_tflops * 1e12)
    return flops, time_seconds

print("=== Hardware Performance Comparison ===")
print()

# Peak throughput in TFLOPS (rough, illustrative figures)
devices = [
    ("CPU (i9-13900K)", 0.8),   # ~800 GFLOPS
    ("GPU (RTX 4090)", 82.0),   # ~82 TFLOPS FP32
    ("TPU (v4 chip)", 275.0),   # ~275 TFLOPS BF16
]

for name, tflops in devices:
    flops, secs = simulate_matmul(4096, tflops)
    print(f"{name:18s} {flops / 1e12:6.2f} TFLOP  {secs * 1000:9.2f} ms")
Try It: Inference Optimization
See how ONNX, TensorRT, and fusion techniques stack up
import numpy as np
# ONNX & TensorRT: Optimizing Inference Speed
# Convert models to optimized formats for 2-5x speedup
np.random.seed(42)
print("=== Model Optimization Pipeline ===")
print()
print(" PyTorch/TF Model")
print(" ↓ export")
print(" ONNX (Open Neural Network Exchange)")
print(" ↓ optimize")
print(" TensorRT / OpenVINO / CoreML")
print(" ↓ deploy")
print(" Production Inference")
print()
# Simulate optimization effects
# (speedup factors are illustrative, in line with the reference table below)
optimizations = [
    ("Original PyTorch", 1.0),
    ("+ ONNX export", 1.7),     # graph-level cleanup
    ("+ TensorRT", 3.0),        # kernel auto-tuning and fusion
    ("+ FP16 precision", 5.0),  # half-precision tensor cores
]

baseline_ms = 100.0  # hypothetical baseline latency
for name, speedup in optimizations:
    print(f"{name:20s} {baseline_ms / speedup:6.1f} ms  ({speedup:.1f}x faster)")
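Operator fusion, one of the main tricks TensorRT applies in the pipeline above, merges several elementwise ops into a single kernel so the tensor is read from memory and written back once instead of once per op. A minimal numpy sketch of the idea (numpy cannot actually fuse kernels, so the pass counts in the output are just bookkeeping):

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(1024, 1024).astype(np.float32)
bias, scale = 0.1, 2.0

def unfused(x):
    """Three separate ops: each reads a full tensor and writes one back."""
    t1 = x + bias            # pass 1: read x, write t1
    t2 = np.maximum(t1, 0)   # pass 2: read t1, write t2
    return t2 * scale        # pass 3: read t2, write result

def fused(x):
    """Same math in one expression; a real fused kernel (e.g. what
    TensorRT emits) makes a single pass over memory."""
    return np.maximum(x + bias, 0) * scale

out_unfused = unfused(x)
out_fused = fused(x)

# Fusion changes memory traffic, not the result:
print("identical:", np.allclose(out_unfused, out_fused))
print("tensor passes  unfused: 6 (3 reads + 3 writes)  fused: 2 (1 read + 1 write)")
```

The math is identical either way; the win is purely in memory traffic, which is why fusion matters most for bandwidth-bound workloads.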
⚠️ Common Mistake: Optimising for throughput when latency matters (or vice versa). Batching increases throughput but adds latency. For real-time applications (chat, voice), optimise for p95 latency. For batch processing (data pipelines), optimise for throughput (requests/second).
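The latency/throughput tension is easy to see with a few numbers. A toy model, assuming a hypothetical per-item GPU cost and a sub-linear batch-scaling curve (the 0.7 exponent and all timings are made up for illustration):

```python
per_item_ms = 10.0       # hypothetical GPU time for one request
batch_overhead_ms = 5.0  # hypothetical fixed cost per batch launch

results = []
for batch_size in [1, 4, 16, 64]:
    # GPUs process batches more efficiently than one-at-a-time;
    # the 0.7 exponent is an assumed sub-linear scaling curve.
    batch_ms = batch_overhead_ms + per_item_ms * batch_size ** 0.7
    throughput = batch_size / (batch_ms / 1000.0)  # requests/second
    results.append((batch_size, batch_ms, throughput))
    print(f"batch={batch_size:3d}  latency={batch_ms:7.1f} ms  "
          f"throughput={throughput:8.1f} req/s")
```

Both columns grow with batch size: every request in the batch waits for the whole batch to finish, which is exactly the tradeoff described above.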
💡 Pro Tip: For LLM inference, use vLLM or TGI (Text Generation Inference) — they implement PagedAttention, continuous batching, and KV cache management automatically. For CNN inference, TensorRT with FP16 gives the best results on NVIDIA GPUs.
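KV caching, which vLLM and TGI manage for you, can be sketched in a few lines of numpy: during autoregressive decoding, each step appends one new key/value pair and reuses all previous ones, so K/V projections are computed once per token instead of once per token per step. A toy single-head version, with made-up projection stand-ins:

```python
import numpy as np

np.random.seed(42)
d = 8  # head dimension (toy size)

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
projections_with_cache = 0

n_steps = 6
for step in range(n_steps):
    x = np.random.randn(d)      # current token's hidden state
    k, v = 0.5 * x, 0.25 * x    # stand-ins for the real K/V projections
    K_cache = np.vstack([K_cache, k])  # append once, never recompute
    V_cache = np.vstack([V_cache, v])
    projections_with_cache += 1  # only the newest token is projected
    out = attend(x, K_cache, V_cache)

# Without a cache, step t re-projects all t tokens: 1 + 2 + ... + n
projections_without = n_steps * (n_steps + 1) // 2
print(f"K/V projections  cached: {projections_with_cache}  uncached: {projections_without}")
```

The cached count grows linearly with sequence length while the uncached count grows quadratically, which is why KV caching is non-negotiable for LLM serving.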
📋 Quick Reference
| Tool | Hardware | Speedup | Use Case |
|---|---|---|---|
| ONNX Runtime | CPU/GPU | 1.5-2× | Cross-platform |
| TensorRT | NVIDIA GPU | 3-5× | Max GPU perf |
| OpenVINO | Intel CPU | 2-3× | Intel hardware |
| CoreML | Apple Silicon | 2-4× | iOS/macOS |
| vLLM | GPU | 3-10× | LLM serving |
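The "memory bandwidth is the bottleneck" point deserves a back-of-the-envelope check. During LLM decoding every weight must be read once per generated token, so model-bytes divided by bytes-per-second gives a hard floor on per-token latency. The bandwidth figures below are approximate published peaks, used for illustration:

```python
params = 7e9          # a 7B-parameter model
bytes_per_param = 2   # FP16 weights
model_bytes = params * bytes_per_param  # ~14 GB read per token

bandwidth_gbs = {     # approximate peak memory bandwidth (GB/s)
    "RTX 4090": 1008,
    "A100 80GB": 2039,
}

floor_ms = {}
for gpu, bw in bandwidth_gbs.items():
    floor_ms[gpu] = model_bytes / (bw * 1e9) * 1000  # ms per token
    print(f"{gpu:10s}  floor ~ {floor_ms[gpu]:5.1f} ms/token  "
          f"(~ {1000 / floor_ms[gpu]:4.0f} tok/s max)")
```

No kernel optimization can beat this floor at batch size 1; the only levers are smaller weights (quantization), faster memory, or batching so each weight read serves many requests.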
🎉 Lesson Complete!
You can now optimise models for any hardware target! Next, learn distributed training to scale across multiple GPUs.