
    Lesson 42 • Advanced

    Serving ML Models

    Turn a trained model into a reliable production service — REST and gRPC inference endpoints, request batching, autoscaling, model versioning, and safe rollouts with canary and A/B deploys.

    What You'll Learn in This Lesson

    • Expose a model as a REST or gRPC inference endpoint
    • Pick a serving framework: TF Serving, TorchServe, Triton, or BentoML
    • Use request batching to raise GPU throughput
    • Read latency vs throughput with p50/p99 percentiles
    • Autoscale replicas and avoid cold starts
    • Ship safely with model versioning, canary, and A/B deploys

    🍽️ Real-World Analogy: A Restaurant Kitchen

    A trained model in a notebook is like a chef who only cooks at home. Serving it is opening a restaurant: the same cooking skill, but now you must take orders, cook many at once, and never keep a table waiting.

    • The order ticket is the request — REST/JSON for walk-ins, gRPC for the kitchen-to-kitchen hotline.
    • The chef cooking 8 steaks on one grill is batching — almost the same effort as cooking one.
    • How long a diner waits is latency; meals served per hour is throughput.
    • Hiring more cooks at the dinner rush is autoscaling; a cook arriving cold is a cold start.
    • Trialing a new recipe on a few tables first is a canary deploy; the printed recipe version is model versioning.

    The whole lesson is about running that kitchen well: fast tickets, full grills, enough cooks, and a safe way to change the menu.

    1. Inference Endpoints: REST and gRPC

    Serving means putting your model behind a network address so other programs send an input and get a prediction back. The address is an inference endpoint.

    • REST/JSON — human-readable, works from a browser or curl, easy to debug. Best default for public APIs.
    • gRPC — binary protocol, lower latency and smaller payloads for tensors. Best for fast service-to-service calls inside your cluster.

    A common setup is REST on the edge for callers, gRPC between internal services. Whichever you pick, two endpoints are non-negotiable: a /predict for inference and a /health the load balancer can poll.
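
To make the "smaller payloads for tensors" point concrete, here is a rough sketch that encodes the same vector as JSON text and as packed 32-bit floats. The 1,000-element vector and the "inputs" field name are made up for illustration, and the packed bytes only approximate what a binary protocol such as gRPC sends (real messages add a little framing overhead).

```python
import json
import struct

# Hypothetical input: a 1,000-element feature vector.
vector = [i / 7 for i in range(1000)]

# REST/JSON: every number travels as decimal text inside a JSON document.
json_payload = json.dumps({"inputs": vector}).encode("utf-8")

# Binary encoding: each value is packed as a 4-byte little-endian float32.
binary_payload = struct.pack(f"<{len(vector)}f", *vector)

print("JSON bytes:  ", len(json_payload))
print("binary bytes:", len(binary_payload))
```

The binary form is exactly 4 bytes per value, while the JSON form spends many more bytes spelling out each number as text. The price is that the binary form is no longer readable with curl or a browser.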

    Worked example — a minimal REST endpoint with FastAPI:

    # A minimal REST inference endpoint with FastAPI.
    # Run with:  uvicorn app:app --host 0.0.0.0 --port 8000
    
    from fastapi import FastAPI
    from pydantic import BaseModel
    
    app = FastAPI()
    
    # Load the model ONCE at startup, not inside the handler.
    # 'model' lives for the life of the process and is reused by every request.
    model = None
    
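    # Note: recent FastAPI releases deprecate on_event in favour of a lifespan handler;
    # on_event still works and keeps this example short.
    @app.on_event("startup")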
    def load_model():
        global model
        # In real code: model = torch.load("model.pt"); model.eval()
        model = lambda text: {"label": "positive", "score": 0.97}
        # Warmup: run one dummy inference so the first real request is fast.
        model("warmup")
    
    # Pydantic validates the request body and rejects bad input early.
    class PredictRequest(BaseModel):
        text: str
    
    class PredictResponse(BaseModel):
        label: str
        confidence: float
    
    @app.post("/predict", response_model=PredictResponse)
    def predict(req: PredictRequest):
        out = model(req.text)                 # reuse the loaded model
        return PredictResponse(label=out["label"], confidence=out["score"])
    
    @app.get("/health")                       # used by the load balancer
    def health():
        return {"status": "ok", "model_loaded": model is not None}
    
    # Expected output (POST /predict with {"text": "great movie"}):
    #   {"label": "positive", "confidence": 0.97}
    # Expected output (GET /health):
    #   {"status": "ok", "model_loaded": true}

    Notice the model loads once at startup and is reused by every request. The /health route is how the kitchen tells the maitre d' (load balancer) it is ready for orders.

    2. Serving Frameworks: Don't Build the Plumbing Yourself

    FastAPI is great for one model, but a dedicated serving framework gives you batching, model versioning, multi-model hosting, and GPU scheduling for free. Pick by your stack:

    Framework          | Built for     | Reach for it when…
    TensorFlow Serving | TensorFlow    | You have SavedModel files and want versioned TF serving
    TorchServe         | PyTorch       | You have PyTorch models and want built-in dynamic batching
    Triton             | Any framework | Multi-model, multi-GPU, mixed TF/PyTorch/ONNX on one server
    BentoML            | Any framework | You want to package model + code + deps as one deployable unit

    3. Request Batching, Latency vs Throughput

    A GPU runs a batch of 8 inputs almost as fast as a single one. Request batching groups incoming requests into one forward pass, so throughput (requests per second) shoots up. The cost: a request waits a few milliseconds for the batch to fill, so its latency rises slightly. Frameworks expose two dials — max batch size and max wait — to balance the two.
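
The trade-off can be put into rough numbers. This back-of-the-envelope sketch assumes a forward pass costs a flat 8 ms regardless of batch size and that a batched request waits at most 5 ms for the batch to fill. Both figures are illustrative assumptions, and the model ignores any time spent queued behind earlier batches.

```python
PASS_MS = 8        # assumed cost of one forward pass, independent of batch size
MAX_WAIT_MS = 5    # assumed longest time a request waits for the batch to fill

def throughput_rps(batch_size, pass_ms=PASS_MS):
    """Requests per second for one GPU running passes back to back."""
    return batch_size * 1000 / pass_ms

def worst_case_latency_ms(batch_size, pass_ms=PASS_MS, max_wait_ms=MAX_WAIT_MS):
    """Unbatched requests skip the wait; batched ones may wait up to max_wait_ms."""
    wait = 0 if batch_size == 1 else max_wait_ms
    return wait + pass_ms

for size in (1, 4, 8):
    print(f"batch={size}  {throughput_rps(size):.0f} req/s  "
          f"worst-case latency {worst_case_latency_ms(size)} ms")

# Output:
#   batch=1  125 req/s  worst-case latency 8 ms
#   batch=4  500 req/s  worst-case latency 13 ms
#   batch=8  1000 req/s  worst-case latency 13 ms
```

Under these assumptions, a batch of 8 buys eight times the throughput for 5 ms of extra worst-case latency. Real GPUs do slow down as batches grow, so measure before picking the two dials.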

    You measure latency in percentiles, not averages. p50 is the median request; p99 is the slow tail — 1 in 100 requests is at least this slow. Users feel the tail, so p99 is the number that matters for an SLA.
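
A quick illustration of why the average misleads: in the made-up timings below, nine requests finish in about 13 ms and one takes 300 ms. The sketch uses the same nearest-rank percentile rule as the rest of this lesson, and the latency values are hypothetical.

```python
from statistics import mean

def percentile(times, p):
    # Nearest-rank percentile: sort, then index at (n - 1) * p, rounded.
    ordered = sorted(times)
    k = int(round((len(ordered) - 1) * p))
    return ordered[k]

# Hypothetical latencies in milliseconds: nine fast requests, one slow one.
latencies_ms = [12, 14, 13, 15, 12, 13, 14, 300, 13, 12]

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 0.50)
p99 = percentile(latencies_ms, 0.99)

print("mean:", avg, "ms")
print("p50: ", p50, "ms")
print("p99: ", p99, "ms")

# Output:
#   mean: 41.8 ms
#   p50:  13 ms
#   p99:  300 ms
```

No request actually took 41.8 ms: the mean describes neither the typical caller (13 ms) nor the unlucky one (300 ms), which is why the two percentiles are reported instead.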

    Worked example — the batching loop a framework runs for you:

    # Why serving frameworks batch requests: one GPU pass handles many inputs.
    # This sketches the loop TorchServe / Triton / vLLM run for you.
    
    import time
    
    MAX_BATCH = 8        # never let a batch grow unbounded
    MAX_WAIT_MS = 5      # don't make request #1 wait too long for a full batch
    
    def run_batch(texts):
        # One forward pass over the whole batch — ~same cost as a single item.
        return ["positive"] * len(texts)
    
    def serve(queue):
        results = []
        start = time.time()
        batch = []
        for item in queue:                         # incoming requests
            batch.append(item)
            waited_ms = (time.time() - start) * 1000
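            # Flush when the batch is full OR we've waited long enough.
            # (This sketch only checks the timer when a request arrives; a real
            # server also flushes on a timeout while the queue is idle.)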
            if len(batch) >= MAX_BATCH or waited_ms >= MAX_WAIT_MS:
                results += run_batch(batch)         # 1 GPU pass for the whole batch
                batch = []
                start = time.time()
        if batch:
            results += run_batch(batch)             # flush the leftovers
        return results
    
    print(serve(["a", "b", "c"]))
    
    # Expected output:
    #   ['positive', 'positive', 'positive']

    The batch flushes when it is full or the wait timer fires — that timer is what keeps p99 from exploding under light traffic.

    ▶️ Worked Example: Latency Percentiles (run it)

    Run this to see how one slow request pulls p99 far above p50, even when most requests are fast. Read the comments, then press run.

    Worked Example: Compute p50 and p99

    See how the slow tail (p99) hides behind a healthy median (p50)

    # Worked example: measure serving latency from a list of timings.
    # Latency is usually reported as percentiles, not an average, because a
    # few slow requests (the "tail") matter most to real users.
    
    def percentile(times, p):
        # p is a fraction, e.g. 0.50 for p50 (median), 0.99 for p99.
        ordered = sorted(times)                  # sort smallest -> largest
        k = int(round((len(ordered) - 1) * p))   # index for that percentile
        return ordered[k]
    
    # Latency of 10 requests, in milliseconds.
    ...

    🎯 Your Turn #1: Batch the Requests

    Fill in the two blanks marked ___. Group 7 requests into batches of 4 and count how many GPU passes that takes. Check your output against the # ✅ Expected output comment.

    Your Turn #1: Request Batching

    Group requests into batches and count the GPU passes

    # 🎯 YOUR TURN #1 — simulate request batching
    # A GPU pass costs 8ms no matter how many items are in the batch (up to 8).
    # Group the incoming requests into batches, then count the GPU passes.
    
    PASS_MS = 8          # cost of one GPU forward pass
    MAX_BATCH = ___      # 👉 set the max batch size to 4
    
    requests = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
    
    batches = 0
    i = 0
    while i < len(requests):
        batch = requests[i:i + MAX_BATCH]   # take up to MAX_BATCH requests
        batches += ___           
    ...

    🎯 Your Turn #2: Measure the Tail

    Fill in the two blanks so percentile() sorts the timings and returns the right value. A single 150ms request should send p99 sky-high while p50 stays calm.

    Your Turn #2: p50 and p99 Latency

    Finish the percentile helper and read the latency tail

    # 🎯 YOUR TURN #2 — compute p50 and p99 latency
    # Sort the timings and pick the value at each percentile index.
    
    def percentile(times, p):
        ordered = ___                            # 👉 sort the list smallest -> largest
        k = int(round((len(ordered) - 1) * p))
        return ordered[k]
    
    latencies_ms = [20, 22, 19, 21, 23, 20, 150, 21, 22, 20]
    
    p50 = percentile(latencies_ms, ___)          # 👉 use 0.50 for the median
    p99 = percentile(latencies_ms, 0.99)
    
    print("p50:", p50, "ms")
    print("p99:", p9
    ...

    4. Autoscaling and Cold Starts

    Autoscaling adds or removes server replicas as traffic changes — more cooks at the dinner rush, fewer at 3am. You scale on a signal like GPU utilisation, queue depth, or requests per second.
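
One common way to turn a signal into a replica count is a proportional rule: scale the current count by how far the metric is from its target, then clamp to a floor and a ceiling. The sketch below assumes queue depth per replica as the signal; the target of 10 and the 2-to-20 replica bounds are illustrative. The Kubernetes Horizontal Pod Autoscaler uses a rule of the same shape, but this is not its implementation.

```python
import math

def desired_replicas(current, metric_value, target_value,
                     min_replicas=2, max_replicas=20):
    """Proportional scaling: grow or shrink by the ratio metric / target."""
    raw = math.ceil(current * metric_value / target_value)
    # Clamp so the service never scales to zero or grows without limit.
    return max(min_replicas, min(max_replicas, raw))

# Dinner rush: 4 replicas, each with 30 queued requests, target is 10 each.
print(desired_replicas(current=4, metric_value=30, target_value=10))   # 12

# 3am: queues nearly empty, but the floor keeps 2 replicas warm.
print(desired_replicas(current=4, metric_value=2, target_value=10))    # 2

# Traffic spike beyond capacity: the ceiling caps the bill at 20 replicas.
print(desired_replicas(current=10, metric_value=50, target_value=10))  # 20
```

The floor (`min_replicas`) matters as much as the formula: it keeps some replicas running so that quiet periods do not leave the next request waiting on a fresh boot.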

    The catch is the cold start: a freshly added replica is slow on its first request because the model still has to load into memory and the GPU has to warm up. Two fixes: keep a minimum number of replicas always running, and run a warmup inference before the replica accepts traffic (you saw the warmup call in Section 1).
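
The second fix can be sketched as a readiness gate: the replica reports "not ready" until the model has loaded and a warmup inference has run, so the load balancer sends it no traffic before then. The `Replica` class, the 50 ms simulated load time, and the 503/200 status codes below are illustrative assumptions, not any framework's API.

```python
import time

class Replica:
    """A toy replica that only reports ready after load + warmup."""

    def __init__(self):
        self.model = None
        self.ready = False

    def _load_model(self):
        time.sleep(0.05)  # stand-in for reading weights from disk
        return lambda text: {"label": "positive", "score": 0.97}

    def start(self):
        self.model = self._load_model()
        self.model("warmup")   # pay the slow first inference here, not on a user
        self.ready = True      # only now may the load balancer route traffic

    def health(self):
        # A load balancer polling this skips the replica while it returns 503.
        return (200, "ready") if self.ready else (503, "loading")

    def predict(self, text):
        if not self.ready:
            raise RuntimeError("replica is not ready yet")
        return self.model(text)

replica = Replica()
print(replica.health())               # (503, 'loading')
replica.start()
print(replica.health())               # (200, 'ready')
print(replica.predict("great movie"))
```

The key ordering is that `ready` flips to true only after the warmup call returns, so the first real request never pays the cold-start cost.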

    5. Model Versioning, Canary & A/B Deploys

    Never overwrite a live model. Give every model a version (v1, v2, …) so you can serve a specific one, compare them, and roll back instantly if v2 misbehaves.
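
A minimal sketch of that rule, assuming an in-memory registry: versions are write-once, "live" is just a pointer, and rollback moves the pointer back. `ModelRegistry` and its method names are hypothetical. A real setup would keep versions in a model store or in the serving framework's own version directories.

```python
class ModelRegistry:
    """Toy registry: versions are immutable, 'live' is just a pointer."""

    def __init__(self):
        self._models = {}
        self._history = []  # versions that have been live, oldest first

    def register(self, version, model):
        if version in self._models:
            raise ValueError(f"{version} already exists; never overwrite a version")
        self._models[version] = model

    def promote(self, version):
        if version not in self._models:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self):
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()

    @property
    def live(self):
        return self._history[-1]

    def predict(self, x, version=None):
        # Serve the live version by default, or a specific one on request.
        return self._models[version or self.live](x)

registry = ModelRegistry()
registry.register("v1", lambda x: "v1:" + x)
registry.register("v2", lambda x: "v2:" + x)
registry.promote("v1")
registry.promote("v2")
print(registry.predict("hi"))                # v2:hi
registry.rollback()                          # v2 misbehaves, so flip back
print(registry.predict("hi"))                # v1:hi
print(registry.predict("hi", version="v2"))  # v2:hi (still there to compare)
```

Because rollback only moves a pointer and deletes nothing, it is instant, and the misbehaving version stays available for debugging.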

    • Canary deploy — route a small slice (e.g. 5%) of traffic to v2. If error rate and latency stay healthy, ramp to 100%; if not, roll back. This is a safety mechanism.
    • A/B deploy — split traffic between v1 and v2 on purpose and measure which scores better on a business metric. This is a measurement mechanism.
    • Shadow mode — send v2 a copy of real traffic but don't return its answers, so you can compare offline with zero user risk.

    Trial the new recipe on a few tables (canary), or serve two recipes to see which sells better (A/B) — either way, the old recipe is one switch away.
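
A canary split can be sketched with a stable hash: each request or user ID maps to a bucket from 0 to 99, and buckets below the canary percentage go to v2. Hashing keeps the same caller on the same version across requests. The ID format, the 5% slice, and the 1-point error tolerance are illustrative assumptions.

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic sent to the new version

def bucket(request_id):
    """Map an ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def route(request_id, canary_percent=CANARY_PERCENT):
    """Send a fixed slice of callers to v2, everyone else to v1."""
    return "v2" if bucket(request_id) < canary_percent else "v1"

def canary_healthy(v1_error_rate, v2_error_rate, tolerance=0.01):
    """Ramp up only if v2's error rate is within tolerance of v1's."""
    return v2_error_rate <= v1_error_rate + tolerance

counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[route(f"user-{i}")] += 1
print(counts)  # roughly 95% v1, 5% v2

print(canary_healthy(0.020, 0.025))  # True: within tolerance, keep ramping
print(canary_healthy(0.020, 0.080))  # False: roll back
```

The same `route` function covers an A/B deploy by setting `canary_percent=50`; what changes is what you compare afterwards (a business metric rather than error rate and latency).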

    Common Errors (And How to Fix Them)

    These five mistakes sink most first serving deployments:

    ❌ No batching — wasting the GPU

    One request per forward pass can leave the GPU mostly idle and caps your throughput.

    ✅ Fix: enable dynamic batching (set max batch size + max wait), or use a framework that does it for you.

    ❌ Cold starts on every request

    Loading the model inside the handler re-reads the weights (often 500MB or more) from disk on every call, so each request can take seconds instead of milliseconds.

    ✅ Fix: load once at startup, run a warmup inference, and keep a minimum replica count.

    ❌ No versioning — can't roll back

    Overwriting the live model means a bad deploy has no undo and no way to compare.

    ✅ Fix: tag every model with a version and deploy via canary so rollback is one switch.

    ❌ Blocking I/O on the request path

    A synchronous DB or network call inside the handler stalls the whole worker under load.

    ✅ Fix: use async handlers, move slow work off the hot path, and set request timeouts.
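
A sketch of that fix using only the standard library (Python 3.9+): the blocking call runs in a worker thread via `asyncio.to_thread`, and `asyncio.wait_for` bounds how long the handler waits. `slow_lookup` is a hypothetical stand-in for a synchronous database call, and the 504 status is an illustrative choice.

```python
import asyncio
import time

def slow_lookup(user_id):
    """Hypothetical blocking call, e.g. a synchronous DB driver."""
    time.sleep(0.05)
    return {"user_id": user_id, "tier": "free"}

async def handle(user_id, timeout_s=1.0):
    try:
        # Run the blocking call in a thread so the event loop stays free,
        # and give up if it takes longer than timeout_s.
        features = await asyncio.wait_for(
            asyncio.to_thread(slow_lookup, user_id), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return {"status": 504, "error": "feature lookup timed out"}
    return {"status": 200, "features": features}

async def main():
    # Five requests overlap instead of queueing behind one another.
    return await asyncio.gather(*(handle(i) for i in range(5)))

results = asyncio.run(main())
print([r["status"] for r in results])  # [200, 200, 200, 200, 200]
```

One caveat: the timeout stops the handler from waiting, but the worker thread keeps running until the blocking call returns. Where an async client library exists, using it directly is the cleaner option.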

    ❌ Unbounded queues — silent meltdown

    An infinite request queue hides overload: latency climbs forever and memory blows up instead of failing fast.

    ✅ Fix: cap the queue length and reject extra requests with HTTP 429 so callers can back off.
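
A minimal sketch of a capped queue with the standard library: `queue.Queue(maxsize=...)` refuses new items once full, and the handler turns that refusal into a 429. The limit of 3 and the 202 "accepted" status are illustrative; size the cap from your latency budget.

```python
import queue

MAX_QUEUE = 3  # illustrative cap on waiting requests
pending = queue.Queue(maxsize=MAX_QUEUE)

def submit(request):
    """Accept the request if there is room, otherwise fail fast."""
    try:
        pending.put_nowait(request)   # raises queue.Full instead of blocking
    except queue.Full:
        return 429                    # Too Many Requests: caller should back off
    return 202                        # accepted for processing

# Five requests arrive before any worker drains the queue.
statuses = [submit(f"r{i}") for i in range(5)]
print(statuses)  # [202, 202, 202, 429, 429]
```

Rejecting the fourth and fifth requests immediately is the point: callers learn about overload in milliseconds and can retry later, instead of waiting in a queue that only grows.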

    📋 Quick Reference

    Concept           | What it is                   | Why it matters
    REST vs gRPC      | JSON over HTTP vs binary RPC | REST to debug, gRPC for speed
    Batching          | Group requests into one pass | 5–10× throughput
    Latency (p50/p99) | Median vs slow-tail wait     | p99 is what users feel
    Throughput        | Requests served per second   | Capacity of the service
    Autoscaling       | Add/remove replicas on load  | Match cost to demand
    Cold start        | Slow first request after boot | Warmup + min replicas fix it
    Versioning        | Tag every model v1, v2…      | Compare and roll back
    Canary / A/B      | Slow rollout vs split test   | Safe change vs measured change

    ❓ Frequently Asked Questions

    Q: What is model serving?

    A: Model serving is wrapping a trained model behind a network endpoint (usually REST or gRPC) so other programs can send inputs and get predictions back over the network, instead of calling the model from inside a notebook.

    Q: REST or gRPC for inference?

    A: Use REST/JSON when you want something easy to debug and call from any client, including browsers. Use gRPC when you need lower latency and smaller payloads for binary data like tensors, and your callers are other services. Many teams expose REST publicly and gRPC internally.

    Q: Why does request batching make serving faster?

    A: A GPU runs one batch almost as fast as one item, so grouping many small requests into a single forward pass dramatically raises throughput (requests per second). The trade-off is that a request may wait a few milliseconds for the batch to fill, so latency per request rises slightly.

    Q: What is a cold start and how do I avoid it?

    A: A cold start is the slow first request after a server boots or scales up, because the model still has to load into memory and the GPU has to warm up. Avoid it by loading the model once at startup and running a dummy 'warmup' inference before accepting traffic, and by keeping a minimum number of replicas always running.

    Q: What is the difference between a canary and an A/B deploy?

    A: A canary sends a small slice of traffic (say 5%) to a new model version to check it is healthy before rolling it out to everyone, so it is a safety mechanism. An A/B test deliberately splits traffic between versions to measure which performs better on a metric, so it is a measurement mechanism.

    Q: Should I build serving myself or use a framework?

    A: Start with FastAPI when you have a single model and want full control. Move to a dedicated framework (TorchServe, TensorFlow Serving, Triton, or BentoML) once you need built-in batching, model versioning, multi-model hosting, or GPU scheduling without writing that plumbing yourself.

    🎯 Mini-Challenge: SLA Checker

    Now write it yourself with only a comment outline. Build a tiny SLA checker that flags when your p99 latency breaches a 200ms budget. The starter has just the steps — no filled-in logic.

    Mini-Challenge: SLA Checker

    Compute p99 and decide OK vs BREACH against a 200ms budget

    # 🎯 MINI-CHALLENGE: a tiny SLA checker for your serving endpoint
    # 1. Make a list of latencies (ms): [30, 28, 31, 29, 32, 30, 220, 29, 31, 30]
    # 2. Write a percentile(times, p) helper (sort, then index at (n-1)*p rounded)
    # 3. Compute p99
    # 4. Set a budget: SLA_MS = 200
    # 5. Print "BREACH" if p99 > SLA_MS, otherwise print "OK", plus the p99 value
    #
    # ✅ Expected output:
    #   p99: 220 ms -> BREACH
    
    # your code here
    🎉

    Lesson 42 complete — you can serve a model in production!

    You can expose a model over REST or gRPC, pick a serving framework, batch requests for throughput, read p50/p99 latency, autoscale without cold starts, and roll out new versions safely with canary and A/B deploys.

    🚀 Up next: Model Monitoring — watch your live model for drift, bias, and degradation before your users notice.
