Lesson 42 • Advanced

Serving ML Models

Turn a trained model into a reliable production service — REST and gRPC inference endpoints, request batching, autoscaling, model versioning, and safe rollouts with canary and A/B deploys.

What You'll Learn in This Lesson

✓ Expose a model as a REST or gRPC inference endpoint
✓ Pick a serving framework: TF Serving, TorchServe, Triton, or BentoML
✓ Use request batching to raise GPU throughput
✓ Read latency vs throughput with p50/p99 percentiles
✓ Autoscale replicas and avoid cold starts
✓ Ship safely with model versioning, canary, and A/B deploys

Before you start: You should have a trained model and be comfortable with Distributed Training. Knowing basic Python functions and lists is enough for the exercises below.

🍽️ Real-World Analogy: A Restaurant Kitchen

A trained model in a notebook is like a chef who only cooks at home. Serving it is opening a restaurant: the same cooking skill, but now you must take orders, cook many at once, and never keep a table waiting.

The order ticket is the request — REST/JSON for walk-ins, gRPC for the kitchen-to-kitchen hotline.
The chef cooking 8 steaks on one grill is batching — almost the same effort as cooking one.
How long a diner waits is latency; meals served per hour is throughput.
Hiring more cooks at the dinner rush is autoscaling; a cook arriving cold is a cold start.
Trialing a new recipe on a few tables first is a canary deploy; the printed recipe version is model versioning.

The whole lesson is about running that kitchen well: fast tickets, full grills, enough cooks, and a safe way to change the menu.

1. Inference Endpoints: REST and gRPC

Serving means putting your model behind a network address so other programs send an input and get a prediction back. The address is an inference endpoint.

REST/JSON — human-readable, works from a browser or curl, easy to debug. Best default for public APIs.
gRPC — binary protocol, lower latency and smaller payloads for tensors. Best for fast service-to-service calls inside your cluster.

A common setup is REST on the edge for callers, gRPC between internal services. Whichever you pick, two endpoints are non-negotiable: a /predict for inference and a /health the load balancer can poll.

Worked example — a minimal REST endpoint with FastAPI:

# A minimal REST inference endpoint with FastAPI.
# Run with:  uvicorn app:app --host 0.0.0.0 --port 8000

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model ONCE at startup, not inside the handler.
# 'model' lives for the life of the process and is reused by every request.
model = None

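# Note: recent FastAPI releases deprecate on_event in favor of a lifespan
# handler; on_event still works and keeps this example short.
@app.on_event("startup")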
def load_model():
    global model
    # In real code: model = torch.load("model.pt"); model.eval()
    model = lambda text: {"label": "positive", "score": 0.97}
    # Warmup: run one dummy inference so the first real request is fast.
    model("warmup")

# Pydantic validates the request body and rejects bad input early.
class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    out = model(req.text)                 # reuse the loaded model
    return PredictResponse(label=out["label"], confidence=out["score"])

@app.get("/health")                       # used by the load balancer
def health():
    return {"status": "ok", "model_loaded": model is not None}

# Expected output (POST /predict with {"text": "great movie"}):
#   {"label": "positive", "confidence": 0.97}
# Expected output (GET /health):
#   {"status": "ok", "model_loaded": true}

Notice the model loads once at startup and is reused by every request. The /health route is how the kitchen tells the maitre d' (load balancer) it is ready for orders.
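To see the request and response shapes without installing anything, here is a rough standard-library sketch that mirrors the same /predict and /health contract and then calls both routes the way a client or load balancer would. It is not FastAPI: `fake_model`, `InferenceHandler`, and `call` are hypothetical names made up for this illustration, and a real service would still use a proper web framework.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_model(text):
    # Hypothetical stand-in with the same output shape as the lesson's stub.
    return {"label": "positive", "score": 0.97}

class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"status": "ok", "model_loaded": True})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        out = fake_model(request["text"])
        self._reply(200, {"label": out["label"], "confidence": out["score"]})

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

def call(port, method, path, payload=None):
    # Send one request to the local server and return (status, parsed JSON).
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, path, body=body, headers={"Content-Type": "application/json"})
    response = conn.getresponse()
    data = json.loads(response.read())
    conn.close()
    return response.status, data

server = HTTPServer(("127.0.0.1", 0), InferenceHandler)   # port 0: OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

health = call(server.server_port, "GET", "/health")
prediction = call(server.server_port, "POST", "/predict", {"text": "great movie"})
print(health)       # (200, {'status': 'ok', 'model_loaded': True})
print(prediction)   # (200, {'label': 'positive', 'confidence': 0.97})

server.shutdown()
server.server_close()
```

A load balancer does the same thing as the `/health` call here, just on a schedule: it keeps sending traffic only while the route answers with a 200.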

2. Serving Frameworks: Don't Build the Plumbing Yourself

FastAPI is great for one model, but a dedicated serving framework gives you batching, model versioning, multi-model hosting, and GPU scheduling for free. Pick by your stack:

Framework	Built for	Reach for it when…
TensorFlow Serving	TensorFlow	You have SavedModel files and want versioned TF serving
TorchServe	PyTorch	You have PyTorch models and want built-in dynamic batching
Triton	Any framework	Multi-model, multi-GPU, mixed TF/PyTorch/ONNX on one server
BentoML	Any framework	You want to package model + code + deps as one deployable unit

Rule of thumb: start on FastAPI to learn the shape of serving, then graduate to Triton (general models) or TorchServe (PyTorch) once you need batching and versioning without writing that code yourself.

3. Request Batching, Latency vs Throughput

A GPU runs a batch of 8 inputs almost as fast as a single one. Request batching groups incoming requests into one forward pass, so throughput (requests per second) shoots up. The cost: a request waits a few milliseconds for the batch to fill, so its latency rises slightly. Frameworks expose two dials — max batch size and max wait — to balance the two.
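The trade-off is easier to see with numbers. The sketch below is back-of-the-envelope arithmetic only, under two loud assumptions that are not measurements: a forward pass costs a flat 8 ms for any batch of up to 8 items, and a batched request waits at most 5 ms for the batch to fill. Queueing delay and network time are ignored, and the function names are made up for this example.

```python
PASS_MS = 8        # assumed flat cost of one forward pass (batch of up to 8)
MAX_WAIT_MS = 5    # assumed longest time a request waits for the batch to fill

def throughput_rps(batch_size, pass_ms=PASS_MS):
    # Requests finished per second if the GPU runs passes back to back.
    return batch_size * 1000 / pass_ms

def worst_case_latency_ms(batch_size, pass_ms=PASS_MS, max_wait_ms=MAX_WAIT_MS):
    # An unbatched request never waits; a batched one may wait the full timer.
    wait = 0 if batch_size == 1 else max_wait_ms
    return wait + pass_ms

for size in (1, 4, 8):
    print(size, throughput_rps(size), worst_case_latency_ms(size))

# Output:
#   1 125.0 8
#   4 500.0 13
#   8 1000.0 13
```

Under these assumptions, going from batch size 1 to 8 multiplies throughput by 8 while the worst-case latency grows from 8 ms to 13 ms. Real GPUs are rarely this tidy, so treat the numbers as the shape of the trade-off, not a prediction.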

You measure latency in percentiles, not averages. p50 is the median request; p99 is the slow tail — 1 in 100 requests is at least this slow. Users feel the tail, so p99 is the number that matters for an SLA.

Worked example — the batching loop a framework runs for you:

# Why serving frameworks batch requests: one GPU pass handles many inputs.
# This sketches the loop TorchServe / Triton / vLLM run for you.

import time

MAX_BATCH = 8        # never let a batch grow unbounded
MAX_WAIT_MS = 5      # don't make request #1 wait too long for a full batch

def run_batch(texts):
    # One forward pass over the whole batch — ~same cost as a single item.
    return ["positive"] * len(texts)

def serve(queue):
    results = []
    start = time.time()
    batch = []
    for item in queue:                         # incoming requests
        batch.append(item)
        waited_ms = (time.time() - start) * 1000
        # Flush when the batch is full OR we've waited long enough.
        if len(batch) >= MAX_BATCH or waited_ms >= MAX_WAIT_MS:
            results += run_batch(batch)         # 1 GPU pass for the whole batch
            batch = []
            start = time.time()
    if batch:
        results += run_batch(batch)             # flush the leftovers
    return results

print(serve(["a", "b", "c"]))

# Expected output:
#   ['positive', 'positive', 'positive']

The batch flushes when it is full or the wait timer fires — that timer is what keeps p99 from exploding under light traffic.
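One detail the loop above glosses over: it only checks the clock when a new request arrives, so a lone request could sit there until the next one shows up. A real timer has to fire even when nothing else comes in. Here is one way that might look with the standard library's `queue.Queue`, as a sketch only; `collect_batch` is a hypothetical helper, not the code any particular framework runs.

```python
import queue
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.005   # 5 ms, same budget as MAX_WAIT_MS above

def collect_batch(q, max_batch=MAX_BATCH, max_wait_s=MAX_WAIT_S):
    """Block for the first request, then gather more until full or out of time."""
    batch = [q.get()]                           # wait as long as needed for request #1
    deadline = time.monotonic() + max_wait_s    # the timer starts with the first request
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                               # timer fired
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                               # timer fired while waiting on an empty queue
    return batch

q = queue.Queue()
for request_id in ["r1", "r2", "r3"]:
    q.put(request_id)

print(collect_batch(q))   # ['r1', 'r2', 'r3'] after waiting about 5 ms for more
```

Because `q.get(timeout=remaining)` gives up when the deadline passes, a partial batch is flushed after at most 5 ms even if no further request ever arrives.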

▶️ Worked Example: Latency Percentiles (run it)

Run this to see how one slow request pulls p99 far above p50, even when most requests are fast. Read the comments, then press run.

Worked Example: Compute p50 and p99

See how the slow tail (p99) hides behind a healthy median (p50)

Python

# Worked example: measure serving latency from a list of timings.
# Latency is usually reported as percentiles, not an average, because a
# few slow requests (the "tail") matter most to real users.

def percentile(times, p):
    # p is a fraction, e.g. 0.50 for p50 (median), 0.99 for p99.
    ordered = sorted(times)                  # sort smallest -> largest
    k = int(round((len(ordered) - 1) * p))   # index for that percentile
    return ordered[k]

# Latency of 10 requests, in milliseconds.
...

🎯 Your Turn #1: Batch the Requests

Fill in the two blanks marked ___. Group 7 requests into batches of 4 and count how many GPU passes that takes. Check your output against the # ✅ Expected output comment.

Your Turn #1: Request Batching

Group requests into batches and count the GPU passes

Python

# 🎯 YOUR TURN #1 — simulate request batching
# A GPU pass costs 8ms no matter how many items are in the batch (up to 8).
# Group the incoming requests into batches, then count the GPU passes.

PASS_MS = 8          # cost of one GPU forward pass
MAX_BATCH = ___      # 👉 set the max batch size to 4

requests = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]

batches = 0
i = 0
while i < len(requests):
    batch = requests[i:i + MAX_BATCH]   # take up to MAX_BATCH requests
    batches += ___           
...

🎯 Your Turn #2: Measure the Tail

Fill in the two blanks so percentile() sorts the timings and returns the right value. A single 150ms request should send p99 sky-high while p50 stays calm.

Your Turn #2: p50 and p99 Latency

Finish the percentile helper and read the latency tail

Python

# 🎯 YOUR TURN #2 — compute p50 and p99 latency
# Sort the timings and pick the value at each percentile index.

def percentile(times, p):
    ordered = ___                            # 👉 sort the list smallest -> largest
    k = int(round((len(ordered) - 1) * p))
    return ordered[k]

latencies_ms = [20, 22, 19, 21, 23, 20, 150, 21, 22, 20]

p50 = percentile(latencies_ms, ___)          # 👉 use 0.50 for the median
p99 = percentile(latencies_ms, 0.99)

print("p50:", p50, "ms")
print("p99:", p9
...

4. Autoscaling and Cold Starts

Autoscaling adds or removes server replicas as traffic changes — more cooks at the dinner rush, fewer at 3am. You scale on a signal like GPU utilisation, queue depth, or requests per second.

The catch is the cold start: a freshly added replica is slow on its first request because the model still has to load into memory and the GPU has to warm up. Two fixes: keep a minimum number of replicas always running, and run a warmup inference before the replica accepts traffic (you saw the warmup call in Section 1).

Scaling to zero replicas saves money but guarantees a cold start on the next request. For latency-sensitive services, set the minimum replica count to at least 1.
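How does an autoscaler turn a signal into a replica count? A common approach is a proportional rule: scale the current count by how far the metric is from its target, round up, and clamp to a minimum and maximum. The Kubernetes Horizontal Pod Autoscaler documents a rule of this shape. The sketch below is illustrative only; `desired_replicas` and the example numbers are assumptions, not output from a real cluster.

```python
import math

def desired_replicas(current, metric_now, metric_target,
                     min_replicas=1, max_replicas=10):
    # Proportional rule: if each replica sees 2x the target load, double the replicas.
    raw = math.ceil(current * metric_now / metric_target)
    # Clamp: the floor avoids scale-to-zero cold starts, the ceiling caps cost.
    return max(min_replicas, min(max_replicas, raw))

# Target: 50 requests per second per replica (an assumed number).
print(desired_replicas(current=2, metric_now=90, metric_target=50))   # 4
print(desired_replicas(current=4, metric_now=10, metric_target=50))   # 1
print(desired_replicas(current=1, metric_now=0, metric_target=50))    # 1
```

The last call shows the minimum replica count at work: with zero traffic the raw answer is 0, but the floor of 1 keeps a warm replica ready for the next request.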

5. Model Versioning, Canary & A/B Deploys

Never overwrite a live model. Give every model a version (v1, v2, …) so you can serve a specific one, compare them, and roll back instantly if v2 misbehaves.
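To make "roll back instantly" concrete, here is a toy in-memory registry. It is a sketch under stated assumptions: `ModelRegistry` and its methods are hypothetical names, the models are plain functions, and a real system would keep versions in a model store or let the serving framework manage them.

```python
class ModelRegistry:
    """Toy version registry: versions are never overwritten, only promoted."""

    def __init__(self):
        self._models = {}
        self._live = None
        self._previous = None

    def register(self, version, model):
        if version in self._models:
            raise ValueError(f"{version} already registered; versions are immutable")
        self._models[version] = model

    def promote(self, version):
        if version not in self._models:
            raise KeyError(version)
        # Remember what was live so rollback is a single swap.
        self._previous, self._live = self._live, version

    def rollback(self):
        if self._previous is None:
            raise RuntimeError("no previous version to roll back to")
        self._live, self._previous = self._previous, None

    def predict(self, x):
        return self._models[self._live](x)

registry = ModelRegistry()
registry.register("v1", lambda text: "v1 says positive")
registry.register("v2", lambda text: "v2 says negative")
registry.promote("v1")
registry.promote("v2")
print(registry.predict("great movie"))   # v2 says negative
registry.rollback()
print(registry.predict("great movie"))   # v1 says positive
```

Because v1 was never overwritten, rolling back is just pointing the live label at it again; nothing has to be retrained or re-uploaded.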

Canary deploy — route a small slice (e.g. 5%) of traffic to v2. If error rate and latency stay healthy, ramp to 100%; if not, roll back. This is a safety mechanism.
A/B deploy — split traffic between v1 and v2 on purpose and measure which scores better on a business metric. This is a measurement mechanism.
Shadow mode — send v2 a copy of real traffic but don't return its answers, so you can compare offline with zero user risk.

Trial the new recipe on a few tables (canary), or serve two recipes to see which sells better (A/B) — either way, the old recipe is one switch away.
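A canary split can be sketched in a few lines. One common approach is to hash a stable key such as a user ID into 100 buckets and send the lowest few buckets to the new version, so the same user always lands on the same model. The names below (`bucket`, `pick_version`, the `user-N` IDs) are assumptions for illustration; in practice this logic usually lives in a load balancer, service mesh, or the serving framework.

```python
import hashlib

CANARY_PERCENT = 5   # share of traffic routed to v2

def bucket(user_id):
    # Hash the ID into a stable bucket from 0 to 99.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def pick_version(user_id, canary_percent=CANARY_PERCENT):
    # Buckets below the threshold get the canary; everyone else stays on v1.
    return "v2" if bucket(user_id) < canary_percent else "v1"

users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_version(u) == "v2" for u in users) / len(users)
print(f"v2 share: {share:.1%}")   # close to 5%, not exactly 5%
```

Ramping up is a matter of raising `canary_percent` step by step, and rolling back is setting it to 0. The same split with a 50/50 threshold and a business metric attached would be an A/B test.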

Common Errors (And How to Fix Them)

These five mistakes are common in first serving deployments:

❌ No batching — wasting the GPU

One request per forward pass leaves most of the GPU's parallel capacity idle and caps your throughput.

✅ Fix: enable dynamic batching (set max batch size + max wait), or use a framework that does it for you.

❌ Cold starts on every request

Loading the model inside the handler rereads the weights from disk on every call (often hundreds of megabytes), which can push latency into the seconds.

✅ Fix: load once at startup, run a warmup inference, and keep a minimum replica count.

❌ No versioning — can't roll back

Overwriting the live model means a bad deploy has no undo and no way to compare.

✅ Fix: tag every model with a version and deploy via canary so rollback is one switch.

❌ Blocking I/O on the request path

A blocking DB or network call inside an async handler stalls the event loop, and with it every other request on that worker.

✅ Fix: use non-blocking (async) clients or hand blocking calls to a thread pool, move slow work off the hot path, and set request timeouts.
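Here is a small sketch of that fix using only `asyncio`. The blocking call runs on a worker thread via `asyncio.to_thread` (Python 3.9+), and `asyncio.wait_for` puts a timeout on it. `slow_lookup` and `handler` are hypothetical stand-ins, and the 0.2 s sleep is an assumed delay, not a measurement.

```python
import asyncio
import time

def slow_lookup(user_id):
    # Hypothetical blocking call, standing in for a synchronous DB driver.
    time.sleep(0.2)
    return {"user": user_id}

async def handler(user_id, timeout_s=1.0):
    try:
        # Run the blocking call on a worker thread so the event loop stays free.
        return await asyncio.wait_for(
            asyncio.to_thread(slow_lookup, user_id), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # The caller gets a fast answer; the worker thread still finishes on its own.
        return {"error": "timeout"}

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(handler(i) for i in range(5)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results)
print(f"5 requests took {elapsed:.2f}s")   # roughly 0.2s, not 5 x 0.2s
```

If `slow_lookup` were called directly inside `handler`, the five requests would run one after another and each would hold up everything else on the loop.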

❌ Unbounded queues — silent meltdown

An infinite request queue hides overload: latency climbs forever and memory blows up instead of failing fast.

✅ Fix: cap the queue length and reject extra requests with HTTP 429 so callers can back off.
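A capped queue takes very little code. The sketch below uses the standard library's `queue.Queue` with a `maxsize`; `admit` is a hypothetical helper, and returning 202 for "queued" is just one reasonable choice for the illustration.

```python
import queue

MAX_QUEUE = 3   # assumed cap; size it from your latency budget

pending = queue.Queue(maxsize=MAX_QUEUE)

def admit(request_id):
    try:
        pending.put_nowait(request_id)   # never blocks
        return 202                       # accepted and queued
    except queue.Full:
        return 429                       # fail fast so the caller can back off

statuses = [admit(f"r{i}") for i in range(5)]
print(statuses)   # [202, 202, 202, 429, 429]
```

The fourth and fifth callers learn immediately that the service is overloaded, instead of waiting in a line that keeps growing.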

📋 Quick Reference

Concept	What it is	Why it matters
REST vs gRPC	JSON over HTTP vs binary RPC	REST to debug, gRPC for speed
Batching	Group requests into one pass	More throughput per GPU
Latency (p50/p99)	Median vs slow-tail wait	p99 is what users feel
Throughput	Requests served per second	Capacity of the service
Autoscaling	Add/remove replicas on load	Match cost to demand
Cold start	Slow first request after boot	Warmup + min replicas fix it
Versioning	Tag every model v1, v2…	Compare and roll back
Canary / A/B	Slow rollout vs split test	Safe change vs measured change

❓ Frequently Asked Questions

Q: What is model serving?

A: Model serving is wrapping a trained model behind a network endpoint (usually REST or gRPC) so other programs can send inputs and get predictions back over the network, instead of calling the model from inside a notebook.

Q: REST or gRPC for inference?

A: Use REST/JSON when you want something easy to debug and call from any client, including browsers. Use gRPC when you need lower latency and smaller payloads for binary data like tensors, and your callers are other services. Many teams expose REST publicly and gRPC internally.

Q: Why does request batching make serving faster?

A: A GPU runs one batch almost as fast as one item, so grouping many small requests into a single forward pass dramatically raises throughput (requests per second). The trade-off is that a request may wait a few milliseconds for the batch to fill, so latency per request rises slightly.

Q: What is a cold start and how do I avoid it?

A: A cold start is the slow first request after a server boots or scales up, because the model still has to load into memory and the GPU has to warm up. Avoid it by loading the model once at startup and running a dummy 'warmup' inference before accepting traffic, and by keeping a minimum number of replicas always running.

Q: What is the difference between a canary and an A/B deploy?

A: A canary sends a small slice of traffic (say 5%) to a new model version to check it is healthy before rolling it out to everyone, so it is a safety mechanism. An A/B test deliberately splits traffic between versions to measure which performs better on a metric, so it is a measurement mechanism.

Q: Should I build serving myself or use a framework?

A: Start with FastAPI when you have a single model and want full control. Move to a dedicated framework (TorchServe, TensorFlow Serving, Triton, or BentoML) once you need built-in batching, model versioning, multi-model hosting, or GPU scheduling without writing that plumbing yourself.

🎯 Mini-Challenge: SLA Checker

Now write it yourself with only a comment outline. Build a tiny SLA checker that flags when your p99 latency breaches a 200ms budget. The starter has just the steps — no filled-in logic.

Mini-Challenge: SLA Checker

Compute p99 and decide OK vs BREACH against a 200ms budget

Python

# 🎯 MINI-CHALLENGE: a tiny SLA checker for your serving endpoint
# 1. Make a list of latencies (ms): [30, 28, 31, 29, 32, 30, 220, 29, 31, 30]
# 2. Write a percentile(times, p) helper (sort, then index at (n-1)*p rounded)
# 3. Compute p99
# 4. Set a budget: SLA_MS = 200
# 5. Print "BREACH" if p99 > SLA_MS, otherwise print "OK", plus the p99 value
#
# ✅ Expected output:
#   p99: 220 ms -> BREACH

# your code here

🎉

Lesson 42 complete — you can serve a model in production!

You can expose a model over REST or gRPC, pick a serving framework, batch requests for throughput, read p50/p99 latency, autoscale without cold starts, and roll out new versions safely with canary and A/B deploys.

🚀 Up next: Model Monitoring — watch your live model for drift, bias, and degradation before your users notice.
