    Lesson 14 • Intermediate

    Model Deployment

    Take a trained model out of the notebook and turn it into a service real users can call — save it, wrap it in an API, validate every input, containerise it, and keep watch once it is live.

    What You'll Learn in This Lesson

    • Save and load a model (pickle, joblib, ONNX, SavedModel)
    • Wrap a model in a REST API with FastAPI or Flask
    • Validate inputs and avoid train/serve skew
    • Containerise the service with Docker so deps never go missing
    • Choose between batch, real-time, and edge inference
    • Hand off to CI/CD and monitor for data drift

    🏭 Real-World Analogy: Shipping a Product from Workshop to Customer

    Training a model is like building one perfect prototype in your workshop. It works on your bench, with your tools, in your hands. Deployment is everything that happens after — getting that product into the hands of thousands of customers, reliably, every day.

    • Box it up (serialise): freeze the finished product into a package that survives the journey.
    • Open a shop counter (REST API): a fixed window where customers hand you an order and get a result back.
    • Check the order (validation): refuse nonsense orders politely instead of breaking.
    • Same recipe as the prototype (preprocessing parity): if the workshop used metric and the shop uses imperial, every product is subtly wrong.
    • Standard shipping crate (Docker): the same sealed crate works on any truck, in any warehouse.
    • Quality control after it ships (monitoring): watch returns and complaints so you catch a drop in quality early.

    An often-quoted industry estimate is that around 80% of ML projects never reach customers. The model is the prototype; this lesson is the shop, the crate, and the quality control.

    1. Save and Load the Model (Serialisation)

    A trained model lives in memory. The moment Python exits, it's gone. Serialising means writing the model to a file so a completely separate program — your API server — can load it back later.

    Critically, you save more than the weights. You also save the preprocessing recipe (feature order, scaling means and stds, encoders) so serving can repeat it exactly. Run the worked example below, then read the comments about which format to use when.

    • joblib: best for scikit-learn; efficient with big NumPy arrays
    • pickle: any Python object, but Python-only and version-sensitive
    • ONNX: cross-language and fast; great for production inference
    • SavedModel / torch.save: native formats for TensorFlow and PyTorch deep nets
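
As a minimal sketch of the idea, the block below saves weights and preprocessing together in one versioned bundle and loads them back. It uses the standard-library pickle so it runs anywhere; for a scikit-learn model, joblib.dump and joblib.load follow the same save-then-load shape. The bundle contents, the version string, and the file name are illustrative assumptions, not values from a real training run.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical training output: weights AND the preprocessing recipe together.
bundle = {
    "version": "1.0.0",
    "feature_order": ["sqft", "beds"],
    "means": {"sqft": 1500.0, "beds": 3.0},
    "stds": {"sqft": 500.0, "beds": 1.0},
    "weights": {"sqft": 90000.0, "beds": 15000.0},
    "bias": 300000.0,
}


def save_bundle(obj, path):
    """Write the whole bundle to disk in one file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_bundle(path):
    """Read the bundle back. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A temporary directory stands in for wherever your server keeps models.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / f"model-{bundle['version']}.pkl"
    save_bundle(bundle, path)
    restored = load_bundle(path)

print(restored == bundle)
print(restored["feature_order"])
```

Because pickle can execute code while loading, treat model files like executables and load only the ones you produced yourself.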

    Try It: Save and Load a Model

    Serialise a model (weights + preprocessing) to a file and load it back

    Python
    # Saving and loading a trained model
    # A trained model lives in memory. Close Python and it is gone.
    # "Serialising" writes the model to a file so you can load it later
    # in a totally separate program (your API server).
    
    # We use plain dicts + json here so it RUNS in the browser sandbox.
    # In production you'd use joblib/pickle (see the comments at the bottom).
    import json
    
    # Pretend this is what training produced: learned weights + bias,
    # PLUS the exact preprocessing it expects (this matters — 
    ...

    2. Wrap It in a REST API (FastAPI / Flask)

    Other programs can't import your Python model directly. A REST API is a shop counter: a client sends a small JSON order over HTTP, your service runs the model, and hands a JSON prediction back.

    FastAPI is the modern Python choice (Flask is the older, simpler alternative). With FastAPI you declare the expected input as a class and it validates every request for you: a missing field or wrong type is rejected automatically before your code runs. Note the two rules in the code below: load the model once at startup, and let Pydantic validate the request body.

    Study the real serving shape (this is read-only — you'll write runnable plain Python just below):

    # serve.py — the real production shape with FastAPI
    # (read this; you run plain Python in the exercises below)
    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel, Field
    
    app = FastAPI()
    
    # Load the model ONCE at startup, not on every request.
    model = joblib.load("model.joblib")
    
    # Pydantic validates the request body for you: wrong type or
    # missing field -> automatic 422 error, your code never runs.
    class HouseRequest(BaseModel):
        sqft: float = Field(gt=0)   # must be greater than 0
        beds: int   = Field(ge=0)
    
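    @app.post("/predict")
    def predict(req: HouseRequest):
        # Features must be in the SAME order as in training. This sketch assumes
        # model.joblib is a scikit-learn Pipeline that includes the scaler, so
        # scaling happens inside predict(). If you saved a bare model, apply the
        # saved means/stds here first.
        x = [req.sqft, req.beds]
        price = float(model.predict([x])[0])   # NumPy float -> plain Python float
        return {"price": round(price, 2), "model_version": "1.3.0"}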
    
    @app.get("/health")        # let your load balancer check the service is alive
    def health():
        return {"status": "ok"}
    
    # Run it locally:
    #   uvicorn serve:app --reload
    # Then POST {"sqft": 1800, "beds": 3} to http://localhost:8000/predict
    
    # Example response to the POST request (the exact price depends on your trained model):
    # {"price": 304000.0, "model_version": "1.3.0"}
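
To make the shop-counter idea concrete, here is a sketch of the client side using only the standard library. It assumes the service above is running at http://localhost:8000 as in the run instructions; build_request and ask_for_price are hypothetical helper names.

```python
import json
import urllib.request

PREDICT_URL = "http://localhost:8000/predict"  # assumes the local uvicorn server


def build_request(sqft, beds, url=PREDICT_URL):
    """Package one order as a JSON POST request (nothing is sent yet)."""
    body = json.dumps({"sqft": sqft, "beds": beds}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_for_price(sqft, beds):
    """Send the request and decode the JSON reply. Needs the server running."""
    with urllib.request.urlopen(build_request(sqft, beds), timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request works offline; only ask_for_price() touches the network.
req = build_request(1800, 3)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A request that fails validation, such as a negative sqft, makes urlopen raise urllib.error.HTTPError with status 422, so a client should be ready to catch it.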

    3. Validate Inputs and Match Training (Preprocessing Parity)

    Two failures sink more deployments than bad models. First, no input validation — a missing or negative field crashes your service. Second, train/serve skew — the model was trained on scaled features, but serving forgot to scale them the same way, so every prediction is quietly wrong.

    The fix for both: validate first, then run the exact same preprocessing you used in training (same feature order, same scaling, same encoders). In the exercise below you'll write a predict() that validates the request and standardises each feature with the saved means and stds before scoring.
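
A small sketch makes the skew visible. The model parameters below are hypothetical (they are not the ones used in the exercise); the point is only that the same house gets a wildly different price when the serving path skips standardisation.

```python
# Hypothetical model whose weights were learned on STANDARDISED features.
MEANS = {"sqft": 1500.0, "beds": 3.0}
STDS = {"sqft": 500.0, "beds": 1.0}
WEIGHTS = {"sqft": 90000.0, "beds": 15000.0}
BIAS = 300000.0
FEATURE_ORDER = ["sqft", "beds"]


def standardise(house):
    """Apply the training-time transform: (x - mean) / std per feature."""
    return {f: (house[f] - MEANS[f]) / STDS[f] for f in FEATURE_ORDER}


def score(features):
    """Linear model: bias plus the weighted sum of the features."""
    return BIAS + sum(WEIGHTS[f] * features[f] for f in FEATURE_ORDER)


house = {"sqft": 2000, "beds": 4}

correct = score(standardise(house))  # same preprocessing as training
skewed = score(house)                # raw values fed to the model by mistake

print(f"with parity: {correct:,.0f}")
print(f"with skew:   {skewed:,.0f}")
```

Neither call raises an error, which is why skew is easy to miss unless you check predictions against known inputs.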

    🎯 Your Turn: Build a predict() Function

    Validate the input, preprocess it like training, and return a label

    Python
    # 🎯 YOUR TURN — finish the predict() function
    # Fill in every blank marked with ___  (hints on the 👉 lines)
    
    # This is the model your training step produced and saved.
    MODEL = {
        "feature_order": ["sqft", "beds"],
        "means": {"sqft": 1500.0, "beds": 3.0},   # standardise: (x - mean) / std
        "stds":  {"sqft": 500.0,  "beds": 1.0},
        "weights": {"sqft": 180.0, "beds": 15000.0},
        "bias": 250000.0,
    }
    
    def predict(request):
        # 1) VALIDATE: every required feature must be present.
        
    ...

    4. Containerise with Docker

    "It works on my machine" is the classic deployment failure — a library version differs in the cloud and the service breaks. A Docker container is a sealed shipping crate that freezes the OS, the Python version, and every dependency, so the service runs identically everywhere.

    You pin your dependencies in a requirements.txt, install them into an image, copy in your code, and run the image anywhere. Copying requirements.txt and installing the dependencies before copying the code lets Docker cache that slow step, so a code-only change rebuilds quickly.

    # Dockerfile — package the service + its exact dependencies
    # A container freezes the OS, Python version, and every library so it
    # runs identically on your laptop and in the cloud ("missing deps" gone).
    
    FROM python:3.11-slim
    
    WORKDIR /app
    
    # Install deps FIRST (this layer is cached unless requirements change).
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Then copy the code and the serialised model.
    COPY serve.py model.joblib ./
    
    EXPOSE 8000
    CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
    
    # Build and run:
    #   docker build -t house-api .
    #   docker run -p 8000:8000 house-api
    #
    # Expected output (terminal):
    # Uvicorn running on http://0.0.0.0:8000

    Batch vs Real-Time vs Edge

    The same model can be served three ways — only the wrapper around it changes. Pick the one that matches how predictions are consumed.

    • Real-time (online): one request, answered in milliseconds behind a REST API. Use for live UIs.
    • Batch: score thousands of rows on a schedule (e.g. a nightly job). Use for reports and pipelines.
    • Edge: run the model on the device (phone, sensor) with no network call. Use for offline or low-latency needs.
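
As a rough sketch of the batch shape, the block below reads rows from CSV text, scores each one, and writes a predictions CSV. The weights, the column names, and the in-memory CSV (standing in for a file a nightly job would read) are assumptions for illustration.

```python
import csv
import io

# Illustrative linear model on raw features.
WEIGHTS = {"sqft": 180.0, "beds": 15000.0}
BIAS = 100000.0


def score_row(row):
    """Score one CSV row. csv gives strings, so convert to float first."""
    return BIAS + sum(float(row[name]) * w for name, w in WEIGHTS.items())


def score_csv(in_file, out_file):
    """Read every input row, score it, and write id + prediction."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=["id", "predicted_price"])
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow({"id": row["id"], "predicted_price": score_row(row)})
        count += 1
    return count


# Stand-in for the file a scheduled job would open.
listings = io.StringIO("id,sqft,beds\na1,1000,2\na2,1500,3\n")
predictions = io.StringIO()

n_scored = score_csv(listings, predictions)
results = list(csv.DictReader(io.StringIO(predictions.getvalue())))
print(n_scored, results)
```

The scoring function is the same one a real-time endpoint would call; only the loop and the file handling around it are batch-specific.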

    In the next exercise you'll take the same scoring logic and run it over a whole batch of houses instead of one request.

    🎯 Your Turn: Score a Batch

    Reuse one scoring function to predict a whole list of inputs at once

    Python
    # 🎯 YOUR TURN — score a whole BATCH of houses at once
    # Real-time serving answers ONE request fast. Batch scoring runs
    # many rows offline (e.g. a nightly job). Same model, different shape.
    
    WEIGHTS = {"sqft": 180.0, "beds": 15000.0}
    BIAS = 100000.0
    
    def score_one(house):
        total = BIAS
        for feature, weight in WEIGHTS.items():
            total += house[feature] * weight
        return total
    
    houses = [
        {"sqft": 1000, "beds": 2},
        {"sqft": 1500, "beds": 3},
        {"sqft": 2000, "beds": 4},
    ]
    
    ...

    5. CI/CD and Monitoring (the Handoff)

    Shipping once is easy; shipping safely over and over is the job. CI/CD (Continuous Integration / Continuous Delivery) is an automated pipeline — a tool like GitHub Actions runs your tests, builds the Docker image, and deploys it whenever you push a new model. No manual steps means fewer human mistakes.
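
The tests a pipeline runs can be as small as a smoke test on the saved model. The sketch below is written as plain assert-based functions that a test runner such as pytest would collect; the bundle, the predict helper, and the golden input/output pair are hypothetical stand-ins for your own artefacts.

```python
# Hypothetical model bundle; in a real pipeline you would load the saved file.
BUNDLE = {
    "version": "1.3.0",
    "feature_order": ["sqft", "beds"],
    "weights": {"sqft": 180.0, "beds": 15000.0},
    "bias": 100000.0,
}

REQUIRED_KEYS = {"version", "feature_order", "weights", "bias"}


def predict(bundle, house):
    """Linear score over the features, in the saved order."""
    total = bundle["bias"]
    for name in bundle["feature_order"]:
        total += bundle["weights"][name] * house[name]
    return total


def test_bundle_is_complete():
    # A deploy should fail fast if a key the server relies on is missing.
    assert REQUIRED_KEYS.issubset(BUNDLE)


def test_golden_prediction():
    # A known input must keep producing the known output after any change.
    assert predict(BUNDLE, {"sqft": 1000, "beds": 2}) == 310000.0


if __name__ == "__main__":
    test_bundle_is_complete()
    test_golden_prediction()
    print("all checks passed")
```

If either check fails, the pipeline stops before the image is built, so a broken model never reaches the deploy step.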

    Once live, monitoring takes over. A model trained on yesterday's data slowly goes stale as the world changes — this is data drift. You can't see it from the code, only from the metrics. Track these and alert when they cross a threshold, then retrain.

    • Prediction error / accuracy over time
    • Latency: how long each request takes
    • Data drift: input distributions shifting away from training
    • Error rate: how many requests get rejected
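
One simple way to put a number on data drift is to compare the mean of recent inputs with the training mean, measured in training standard deviations. This is a deliberately crude sketch: the threshold of 1.0 and the sample values are assumptions, and dedicated monitoring tools use richer statistical tests.

```python
from statistics import mean

# Saved at training time (hypothetical values).
TRAIN_MEAN_SQFT = 1500.0
TRAIN_STD_SQFT = 500.0
DRIFT_THRESHOLD = 1.0  # assumption: alert when the mean moves by one std


def drift_score(recent_values, train_mean, train_std):
    """How far the recent mean sits from the training mean, in stds."""
    return abs(mean(recent_values) - train_mean) / train_std


def has_drifted(recent_values):
    """True when the recent inputs have moved past the alert threshold."""
    score = drift_score(recent_values, TRAIN_MEAN_SQFT, TRAIN_STD_SQFT)
    return score > DRIFT_THRESHOLD


last_week = [1400, 1600, 1500, 1450, 1550]   # looks like the training data
this_week = [2100, 2300, 2500, 2200, 2400]   # much larger homes

print(has_drifted(last_week))
print(has_drifted(this_week))
```

A mean-only check misses changes in spread or shape, so treat it as a first alarm rather than a full drift test.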

    Common tools: GitHub Actions for CI/CD, MLflow and DVC for tracking and versioning models and data, and Prometheus + Grafana / Evidently for monitoring. Start simple: wire these in only once you have real traffic.

    6. Common Errors (And How to Fix Them)

    These five mistakes break far more deployments than poorly-tuned models. Learn to spot each one.

    ❌ Train/serve skew — predictions silently wrong

    You scaled features during training but serve raw values, so the maths no longer lines up:

    # Training scaled inputs, serving forgot to:
    x = [req.sqft, req.beds]          # raw, unscaled  ❌
    price = model.predict([x])

    ✅ Fix: reuse the saved means/stds at serve time:

    x = [(req.sqft - mean_sqft) / std_sqft,
         (req.beds - mean_beds) / std_beds]   # same as training ✓

    ❌ No input validation — KeyError / crash

    A request missing a field raises a KeyError; left unhandled, it crashes a script or surfaces as an opaque 500 error from a web server:

    price = request["sqft"] * w   # KeyError: 'sqft'  ❌

    ✅ Fix: check first, return a clean 400:

    if "sqft" not in request:
        return {"error": "missing field: sqft", "status": 400}

    ❌ Model not versioned — can't roll back

    Overwriting model.joblib with no version means a bad model can't be undone and you can't tell which one made a prediction.

    ✅ Fix: store and return a version string:

    model = {"version": "1.3.0", "weights": [...]}
    return {"price": price, "model_version": model["version"]}

    ❌ Blocking inference — API freezes under load

    Loading the model (or training!) inside the handler makes every request slow:

    @app.post("/predict")
    def predict(req):
        model = joblib.load("model.joblib")   # reloads every call  ❌

    ✅ Fix: load once at startup, reuse the object:

    model = joblib.load("model.joblib")   # at module top, once ✓

    ❌ Missing dependencies — ModuleNotFoundError in production

    It runs locally but the server lacks a library:

    ModuleNotFoundError: No module named 'sklearn'  ❌

    ✅ Fix: pin every dep and bake it into the container:

    # requirements.txt
    scikit-learn==1.5.0
    fastapi==0.115.0
    # then: RUN pip install -r requirements.txt  (in the Dockerfile)

    📋 Quick Reference

    Stage        | Tools                            | Purpose
    Serialise    | joblib, pickle, ONNX, SavedModel | Save / load the trained model
    Serve        | FastAPI, Flask                   | REST API for HTTP predictions
    Validate     | pydantic                         | Reject bad input, match training
    Containerise | Docker, requirements.txt         | Reproducible environment
    Automate     | GitHub Actions, MLflow, DVC      | CI/CD pipelines, model and data versioning
    Monitor      | Prometheus, Grafana, Evidently   | Track health and data drift

    ❓ Frequently Asked Questions

    Q: What does it mean to deploy a machine learning model?

    A: Deploying means taking a trained model out of your notebook and making it available to real users or other software. In practice you serialise the model to a file, wrap it in a REST API so other programs can send inputs and get predictions back, package everything in a container, and then monitor it once it is live.

    Q: What is train/serve skew and how do I avoid it?

    A: Train/serve skew happens when the data is processed differently during training and during serving — for example you scaled or one-hot-encoded features when training but forgot to apply the exact same steps in your API. The predictions silently become wrong. Avoid it by saving the preprocessing parameters (means, stds, encoders, feature order) alongside the model and reusing the identical code path at serve time.

    Q: Should I use pickle, joblib, or ONNX to save my model?

    A: Use joblib for scikit-learn models (it handles large NumPy arrays efficiently). pickle works for any Python object but is Python-only and version-sensitive. Choose ONNX when you need the model to run fast or from another language. Deep-learning frameworks have their own formats (TensorFlow SavedModel, torch.save). Whatever you pick, always store a version number with the file.

    Q: What is the difference between batch and real-time inference?

    A: Real-time (online) inference answers a single request as fast as possible, usually behind a REST API — think a price shown the instant a user clicks. Batch inference scores many rows at once on a schedule, like a nightly job that prices every listing. Edge inference runs the model directly on a device (phone, sensor) with no network call. The same model can serve all three; only the wrapper changes.

    Q: Why do I need to monitor a model after deploying it?

    A: A model is trained on yesterday's data, but the real world keeps changing — this is called data drift. Accuracy can quietly degrade even though no code changed. Monitoring tracks prediction error, latency, and input-distribution shifts so you get alerted and can retrain before users notice. Deployment is the start of the model's life, not the end.

    🎯 Mini Challenge: A Versioned predict() Endpoint

    Time to work without the scaffolding. Build a small prediction function from the brief in the comments: it must validate its input, run the model, and return a versioned result. Check your output against the expected lines.

    Mini Challenge

    Write a validated, versioned predict() from scratch

    Python
    # 🎯 MINI-CHALLENGE: a versioned predict() with input validation
    #
    # 1. Make a MODEL dict with: "version" (a string like "2.0.0"),
    #    "weights" {"hours": 8.0, "score": 1.5}, and "bias" 20.0.
    # 2. Write predict(request) that:
    #      - returns {"error": ..., "status": 400} if "hours" or "score" is missing
    #      - returns {"error": ..., "status": 400} if "hours" is negative
    #      - otherwise computes bias + hours*8 + score*1.5
    #      - returns {"prediction": <number>, "version": <model version>
    ...
    🎉

    Lesson 14 complete — your model is in production!

    You can serialise a model, serve it through a validated REST API, keep training and serving in lockstep to avoid skew, seal it in a Docker container, choose batch vs real-time vs edge, and hand off to CI/CD and monitoring. That's the full journey from workshop to customer.

    🚀 Up next: Unsupervised Learning — find hidden patterns in unlabelled data, no targets required.
