Lesson 42 • Advanced
Serving ML Models
Turn a trained model into a reliable production service — REST and gRPC inference endpoints, request batching, autoscaling, model versioning, and safe rollouts with canary and A/B deploys.
What You'll Learn in This Lesson
- ✓ Expose a model as a REST or gRPC inference endpoint
- ✓ Pick a serving framework: TF Serving, TorchServe, Triton, or BentoML
- ✓ Use request batching to raise GPU throughput
- ✓ Read latency vs throughput with p50/p99 percentiles
- ✓ Autoscale replicas and avoid cold starts
- ✓ Ship safely with model versioning, canary, and A/B deploys
🍽️ Real-World Analogy: A Restaurant Kitchen
A trained model in a notebook is like a chef who only cooks at home. Serving it is opening a restaurant: the same cooking skill, but now you must take orders, cook many at once, and never keep a table waiting.
- The order ticket is the request — REST/JSON for walk-ins, gRPC for the kitchen-to-kitchen hotline.
- The chef cooking 8 steaks on one grill is batching — almost the same effort as cooking one.
- How long a diner waits is latency; meals served per hour is throughput.
- Hiring more cooks at the dinner rush is autoscaling; a cook arriving cold is a cold start.
- Trialing a new recipe on a few tables first is a canary deploy; the printed recipe version is model versioning.
The whole lesson is about running that kitchen well: fast tickets, full grills, enough cooks, and a safe way to change the menu.
1. Inference Endpoints: REST and gRPC
Serving means putting your model behind a network address so other programs send an input and get a prediction back. The address is an inference endpoint.
- REST/JSON: human-readable, works from a browser or curl, easy to debug. Best default for public APIs.
- gRPC: binary protocol, lower latency and smaller payloads for tensors. Best for fast service-to-service calls inside your cluster.
A common setup is REST on the edge for callers, gRPC between internal services. Whichever you pick, two endpoints are non-negotiable: a /predict for inference and a /health the load balancer can poll.
Worked example — a minimal REST endpoint with FastAPI:
# A minimal REST inference endpoint with FastAPI.
# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model ONCE at startup, not inside the handler.
# 'model' lives for the life of the process and is reused by every request.
model = None

# Note: newer FastAPI releases prefer a lifespan handler; on_event still works.
@app.on_event("startup")
def load_model():
    global model
    # In real code: model = torch.load("model.pt"); model.eval()
    model = lambda text: {"label": "positive", "score": 0.97}
    # Warmup: run one dummy inference so the first real request is fast.
    model("warmup")

# Pydantic validates the request body and rejects bad input early.
class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    out = model(req.text)  # reuse the loaded model
    return PredictResponse(label=out["label"], confidence=out["score"])

@app.get("/health")  # used by the load balancer
def health():
    return {"status": "ok", "model_loaded": model is not None}

# Expected output (POST /predict with {"text": "great movie"}):
# {"label": "positive", "confidence": 0.97}
# Expected output (GET /health):
# {"status": "ok", "model_loaded": true}

Notice the model loads once at startup and is reused by every request. The /health route is how the kitchen tells the maitre d' (load balancer) it is ready for orders.
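To see the same two routes from the caller's side, here is a minimal sketch that needs only the Python standard library. The stand-in server, the `fake_model` function, and the `call` helper are assumptions made for illustration; only the /predict and /health route names come from the lesson.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_model(text):
    # Hypothetical stand-in for a real model; always returns the same answer.
    return {"label": "positive", "confidence": 0.97}


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        self._send_json(200, fake_model(request["text"]))

    def log_message(self, *args):
        pass  # keep the demo output quiet


# An opener with no proxies, so calls to 127.0.0.1 never leave the machine.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def call(url, payload=None):
    # Sends a GET when payload is None, otherwise POSTs the payload as JSON.
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

health = call(base + "/health")                            # what a load balancer polls
prediction = call(base + "/predict", {"text": "great movie"})
print(health)       # {'status': 'ok'}
print(prediction)   # {'label': 'positive', 'confidence': 0.97}

server.shutdown()
server.server_close()
```

A real client would point `base` at the deployed service and add retries and error handling; the request and response shapes stay the same.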
2. Serving Frameworks: Don't Build the Plumbing Yourself
FastAPI is great for one model, but a dedicated serving framework gives you batching, model versioning, multi-model hosting, and GPU scheduling for free. Pick by your stack:
| Framework | Built for | Reach for it when… |
|---|---|---|
| TensorFlow Serving | TensorFlow | You have SavedModel files and want versioned TF serving |
| TorchServe | PyTorch | You have PyTorch models and want built-in dynamic batching |
| Triton | Any framework | Multi-model, multi-GPU, mixed TF/PyTorch/ONNX on one server |
| BentoML | Any framework | You want to package model + code + deps as one deployable unit |
3. Request Batching, Latency vs Throughput
For many models, a GPU runs a batch of 8 inputs almost as fast as a single one, because one input alone leaves most of its parallel hardware idle. Request batching groups incoming requests into one forward pass, so throughput (requests per second) rises sharply. The cost: a request waits a few milliseconds for the batch to fill, so its latency rises slightly. Frameworks expose two dials, max batch size and max wait, to balance the two.
You measure latency in percentiles, not averages. p50 is the median request; p99 is the slow tail — 1 in 100 requests is at least this slow. Users feel the tail, so p99 is the number that matters for an SLA.
Worked example — the batching loop a framework runs for you:
# Why serving frameworks batch requests: one GPU pass handles many inputs.
# This sketches the loop TorchServe / Triton / vLLM run for you.
import time

MAX_BATCH = 8    # never let a batch grow unbounded
MAX_WAIT_MS = 5  # don't make request #1 wait too long for a full batch

def run_batch(texts):
    # One forward pass over the whole batch, ~same cost as a single item.
    return ["positive"] * len(texts)

def serve(queue):
    results = []
    start = time.time()
    batch = []
    for item in queue:  # incoming requests
        batch.append(item)
        waited_ms = (time.time() - start) * 1000
        # Flush when the batch is full OR we've waited long enough.
        if len(batch) >= MAX_BATCH or waited_ms >= MAX_WAIT_MS:
            results += run_batch(batch)  # 1 GPU pass for the whole batch
            batch = []
            start = time.time()
    if batch:
        results += run_batch(batch)  # flush the leftovers
    return results

print(serve(["a", "b", "c"]))
# Expected output:
# ['positive', 'positive', 'positive']

The batch flushes when it is full or the wait timer fires. That timer is what keeps p99 from exploding under light traffic.
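The sketch above only checks its timer when a new item arrives. A server that waits on a live queue needs the timer to fire even when nothing arrives. The following is one possible way to do that with the standard library's `queue.Queue`; the function name `collect_batch` and its parameters are hypothetical, not taken from any serving framework.

```python
import queue
import time


def collect_batch(requests, max_batch=8, max_wait_s=0.005):
    """Block for the first request, then gather more until full or timed out."""
    batch = [requests.get()]                  # wait for at least one request
    deadline = time.monotonic() + max_wait_s  # the clock starts at request #1
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # timer fired: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                             # nothing else arrived in time
    return batch


pending = queue.Queue()
for i in range(10):
    pending.put(f"r{i}")

first = collect_batch(pending, max_wait_s=0.05)   # fills up at once: 8 items
second = collect_batch(pending, max_wait_s=0.05)  # 2 left, so the timer ends the wait
print(len(first), len(second))
# Output: 8 2
```

The second call returns a partial batch after roughly `max_wait_s`, which is the behavior that bounds latency when traffic is light.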
▶️ Worked Example: Latency Percentiles
This example shows how one slow request pulls p99 far above p50, even when most requests are fast.
Worked Example: Compute p50 and p99
See how the slow tail (p99) hides behind a healthy median (p50)
# Worked example: measure serving latency from a list of timings.
# Latency is usually reported as percentiles, not an average, because a
# few slow requests (the "tail") matter most to real users.
def percentile(times, p):
    # p is a fraction, e.g. 0.50 for p50 (median), 0.99 for p99.
    ordered = sorted(times)                 # sort smallest -> largest
    k = int(round((len(ordered) - 1) * p))  # index for that percentile
    return ordered[k]

# Latency of 10 requests, in milliseconds.
...
🎯 Your Turn #1: Batch the Requests
Fill in the two blanks marked ___. Group 7 requests into batches of 4 and count how many GPU passes that takes. Check your output against the # ✅ Expected output comment.
Your Turn #1: Request Batching
Group requests into batches and count the GPU passes
# 🎯 YOUR TURN #1 — simulate request batching
# A GPU pass costs 8ms no matter how many items are in the batch (up to 8).
# Group the incoming requests into batches, then count the GPU passes.
PASS_MS = 8 # cost of one GPU forward pass
MAX_BATCH = ___ # 👉 set the max batch size to 4
requests = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
batches = 0
i = 0
while i < len(requests):
    batch = requests[i:i + MAX_BATCH]  # take up to MAX_BATCH requests
    batches += ___
...
🎯 Your Turn #2: Measure the Tail
Fill in the two blanks so percentile() sorts the timings and returns the right value. A single 150ms request should send p99 sky-high while p50 stays calm.
Your Turn #2: p50 and p99 Latency
Finish the percentile helper and read the latency tail
# 🎯 YOUR TURN #2 — compute p50 and p99 latency
# Sort the timings and pick the value at each percentile index.
def percentile(times, p):
    ordered = ___  # 👉 sort the list smallest -> largest
    k = int(round((len(ordered) - 1) * p))
    return ordered[k]
latencies_ms = [20, 22, 19, 21, 23, 20, 150, 21, 22, 20]
p50 = percentile(latencies_ms, ___) # 👉 use 0.50 for the median
p99 = percentile(latencies_ms, 0.99)
print("p50:", p50, "ms")
print("p99:", p9
...
4. Autoscaling and Cold Starts
Autoscaling adds or removes server replicas as traffic changes — more cooks at the dinner rush, fewer at 3am. You scale on a signal like GPU utilisation, queue depth, or requests per second.
The catch is the cold start: a freshly added replica is slow on its first request because the model still has to load into memory and the GPU has to warm up. Two fixes: keep a minimum number of replicas always running, and run a warmup inference before the replica accepts traffic (you saw the warmup call in Section 1).
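As a rough illustration of how a scaling signal turns into a replica count, the sketch below applies a proportional rule of the kind the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current replicas × current metric / target metric). The function name, the default limits, and the example numbers are assumptions; the floor of two replicas stands in for the "minimum replicas" fix for cold starts.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=2, max_replicas=10):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    current_load and target_load are per-replica values of the same signal,
    for example average queue depth or requests per second per replica.
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("need at least one replica and a positive target")
    wanted = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_load=90, target_load=60))   # 6: scale up
print(desired_replicas(4, current_load=10, target_load=60))   # 2: floor keeps warm replicas
print(desired_replicas(4, current_load=300, target_load=60))  # 10: capped at the maximum
```

Production autoscalers add smoothing and cooldown windows on top of a rule like this so that brief spikes do not cause replicas to flap.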
5. Model Versioning, Canary & A/B Deploys
Never overwrite a live model. Give every model a version (v1, v2, …) so you can serve a specific one, compare them, and roll back instantly if v2 misbehaves.
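One way to picture versioning is a registry that keeps every version and a pointer to the live one. This is a toy in-memory sketch; the `ModelRegistry` class and its method names are hypothetical, and real frameworks store versions on disk or in a model store instead.

```python
class ModelRegistry:
    """Keeps every registered version plus a history of which one was live."""

    def __init__(self):
        self._models = {}    # version name -> model callable
        self._history = []   # versions that have been live, oldest first

    def register(self, version, model):
        if version in self._models:
            raise ValueError(f"{version} already exists; never overwrite a version")
        self._models[version] = model

    def promote(self, version):
        if version not in self._models:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self):
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self):
        return self._history[-1]

    def predict(self, text, version=None):
        # Serve the live version unless the caller pins a specific one.
        return self._models[version or self.live](text)


registry = ModelRegistry()
registry.register("v1", lambda text: "positive")
registry.register("v2", lambda text: "negative")
registry.promote("v1")
registry.promote("v2")
print(registry.live, registry.predict("great movie"))   # v2 negative
registry.rollback()
print(registry.live, registry.predict("great movie"))   # v1 positive
print(registry.predict("great movie", version="v2"))    # negative (pinned)
```

Because v2 stays registered after the rollback, it can still be called by name for offline comparison.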
- Canary deploy — route a small slice (e.g. 5%) of traffic to v2. If error rate and latency stay healthy, ramp to 100%; if not, roll back. This is a safety mechanism.
- A/B deploy — split traffic between v1 and v2 on purpose and measure which scores better on a business metric. This is a measurement mechanism.
- Shadow mode — send v2 a copy of real traffic but don't return its answers, so you can compare offline with zero user risk.
Trial the new recipe on a few tables (canary), or serve two recipes to see which sells better (A/B) — either way, the old recipe is one switch away.
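The traffic split behind a canary or A/B deploy can be sketched as a deterministic router. This example assumes each request carries a user ID and hashes it into one of 100 buckets; the function name `pick_version` and the version labels are illustrative, not any framework's API.

```python
import hashlib


def pick_version(user_id, canary_percent=5, stable="v1", canary="v2"):
    """Deterministically route a fixed slice of users to the canary version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return canary if bucket < canary_percent else stable


users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_version(u) == "v2" for u in users) / len(users)
print(f"canary share: {share:.1%}")                         # close to 5%
print(pick_version("user-42") == pick_version("user-42"))   # True: routing is sticky
```

Hashing the user ID keeps each user on the same version across requests, which matters when comparing a business metric. Raising `canary_percent` ramps the rollout, setting it to 50 gives an even A/B split, and setting it to 0 is the rollback switch.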
Common Errors (And How to Fix Them)
These five mistakes sink most first serving deployments:
❌ No batching — wasting the GPU
One request per forward pass can leave the GPU mostly idle and caps your throughput.
✅ Fix: enable dynamic batching (set max batch size + max wait), or use a framework that does it for you.
❌ Cold starts on every request
Loading the model inside the handler reloads 500MB+ from disk on every call, so each request can take on the order of 10 seconds.
✅ Fix: load once at startup, run a warmup inference, and keep a minimum replica count.
❌ No versioning — can't roll back
Overwriting the live model means a bad deploy has no undo and no way to compare.
✅ Fix: tag every model with a version and deploy via canary so rollback is one switch.
❌ Blocking I/O on the request path
A synchronous DB or network call inside the handler stalls the whole worker under load.
✅ Fix: use async handlers, move slow work off the hot path, and set request timeouts.
❌ Unbounded queues — silent meltdown
An infinite request queue hides overload: latency climbs forever and memory blows up instead of failing fast.
✅ Fix: cap the queue length and reject extra requests with HTTP 429 so callers can back off.
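The bounded-queue fix can be sketched with the standard library's `queue.Queue`. The cap of 3 and the `enqueue` helper are assumptions for illustration; 202 and 429 are the HTTP status codes a handler would return.

```python
import queue

MAX_QUEUE = 3                          # hypothetical cap; size it from load tests
pending = queue.Queue(maxsize=MAX_QUEUE)


def enqueue(request):
    """Return an HTTP-style status: 202 if queued, 429 if the queue is full."""
    try:
        pending.put_nowait(request)    # never block the handler waiting for space
        return 202
    except queue.Full:
        return 429                     # fail fast so the caller can back off


statuses = [enqueue(f"r{i}") for i in range(5)]
print(statuses)
# Output: [202, 202, 202, 429, 429]
```

Rejecting the fourth and fifth requests immediately keeps memory bounded and makes overload visible to callers, instead of letting latency grow without limit.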
📋 Quick Reference
| Concept | What it is | Why it matters |
|---|---|---|
| REST vs gRPC | JSON over HTTP vs binary RPC | REST to debug, gRPC for speed |
| Batching | Group requests into one pass | 5–10× throughput |
| Latency (p50/p99) | Median vs slow-tail wait | p99 is what users feel |
| Throughput | Requests served per second | Capacity of the service |
| Autoscaling | Add/remove replicas on load | Match cost to demand |
| Cold start | Slow first request after boot | Warmup + min replicas fix it |
| Versioning | Tag every model v1, v2… | Compare and roll back |
| Canary / A/B | Slow rollout vs split test | Safe change vs measured change |
❓ Frequently Asked Questions
Q: What is model serving?
A: Model serving is wrapping a trained model behind a network endpoint (usually REST or gRPC) so other programs can send inputs and get predictions back over the network, instead of calling the model from inside a notebook.
Q: REST or gRPC for inference?
A: Use REST/JSON when you want something easy to debug and call from any client, including browsers. Use gRPC when you need lower latency and smaller payloads for binary data like tensors, and your callers are other services. Many teams expose REST publicly and gRPC internally.
Q: Why does request batching make serving faster?
A: A GPU runs one batch almost as fast as one item, so grouping many small requests into a single forward pass dramatically raises throughput (requests per second). The trade-off is that a request may wait a few milliseconds for the batch to fill, so latency per request rises slightly.
Q: What is a cold start and how do I avoid it?
A: A cold start is the slow first request after a server boots or scales up, because the model still has to load into memory and the GPU has to warm up. Avoid it by loading the model once at startup and running a dummy 'warmup' inference before accepting traffic, and by keeping a minimum number of replicas always running.
Q: What is the difference between a canary and an A/B deploy?
A: A canary sends a small slice of traffic (say 5%) to a new model version to check it is healthy before rolling it out to everyone, so it is a safety mechanism. An A/B test deliberately splits traffic between versions to measure which performs better on a metric, so it is a measurement mechanism.
Q: Should I build serving myself or use a framework?
A: Start with FastAPI when you have a single model and want full control. Move to a dedicated framework (TorchServe, TensorFlow Serving, Triton, or BentoML) once you need built-in batching, model versioning, multi-model hosting, or GPU scheduling without writing that plumbing yourself.
🎯 Mini-Challenge: SLA Checker
Now write it yourself with only a comment outline. Build a tiny SLA checker that flags when your p99 latency breaches a 200ms budget. The starter has just the steps — no filled-in logic.
Mini-Challenge: SLA Checker
Compute p99 and decide OK vs BREACH against a 200ms budget
# 🎯 MINI-CHALLENGE: a tiny SLA checker for your serving endpoint
# 1. Make a list of latencies (ms): [30, 28, 31, 29, 32, 30, 220, 29, 31, 30]
# 2. Write a percentile(times, p) helper (sort, then index at (n-1)*p rounded)
# 3. Compute p99
# 4. Set a budget: SLA_MS = 200
# 5. Print "BREACH" if p99 > SLA_MS, otherwise print "OK", plus the p99 value
#
# ✅ Expected output:
# p99: 220 ms -> BREACH
# your code here

Lesson 42 complete: you can serve a model in production!
You can expose a model over REST or gRPC, pick a serving framework, batch requests for throughput, read p50/p99 latency, autoscale without cold starts, and roll out new versions safely with canary and A/B deploys.
🚀 Up next: Model Monitoring — watch your live model for drift, bias, and degradation before your users notice.