Lesson 14 • Intermediate
Model Deployment
Take a trained model out of the notebook and turn it into a service real users can call — save it, wrap it in an API, validate every input, containerise it, and keep watch once it is live.
What You'll Learn in This Lesson
- ✓ Save and load a model (pickle, joblib, ONNX, SavedModel)
- ✓ Wrap a model in a REST API with FastAPI or Flask
- ✓ Validate inputs and avoid train/serve skew
- ✓ Containerise the service with Docker so deps never go missing
- ✓ Choose between batch, real-time, and edge inference
- ✓ Hand off to CI/CD and monitor for data drift
🏭 Real-World Analogy: Shipping a Product from Workshop to Customer
Training a model is like building one perfect prototype in your workshop. It works on your bench, with your tools, in your hands. Deployment is everything that happens after — getting that product into the hands of thousands of customers, reliably, every day.
- Box it up (serialise): freeze the finished product into a package that survives the journey.
- Open a shop counter (REST API): a fixed window where customers hand you an order and get a result back.
- Check the order (validation): refuse nonsense orders politely instead of breaking.
- Same recipe as the prototype (preprocessing parity): if the workshop used metric and the shop uses imperial, every product is subtly wrong.
- Standard shipping crate (Docker): the same sealed crate works on any truck, in any warehouse.
- Quality control after it ships (monitoring): watch returns and complaints so you catch a drop in quality early.
A commonly quoted industry estimate is that around 80% of ML projects never reach customers. The model is the prototype; this lesson is the shop, the crate, and the quality control.
1. Save and Load the Model (Serialisation)
A trained model lives in memory. The moment Python exits, it's gone. Serialising means writing the model to a file so a completely separate program — your API server — can load it back later.
Critically, you save more than the weights. You also save the preprocessing recipe (feature order, scaling means and stds, encoders) so serving can repeat it exactly. Run the worked example below, then read the comments about which format to use when.
- joblib: best for scikit-learn; efficient with big NumPy arrays
- pickle: any Python object, but Python-only and version-sensitive
- ONNX: cross-language and fast; great for production inference
- SavedModel / torch.save: native formats for TensorFlow and PyTorch deep nets
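As a minimal sketch of the save-and-load round trip, the block below writes a model bundle to disk and reads it back using only the standard library (json and pickle), so it runs without installing anything. The bundle contents, the file names, and the helper names (save_bundle, load_json, load_pickle) are illustrative assumptions, not part of the lesson's code. With a scikit-learn model you would typically call joblib.dump and joblib.load instead.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical bundle: the weights PLUS the preprocessing the model expects.
bundle = {
    "version": "1.0.0",
    "feature_order": ["sqft", "beds"],
    "means": {"sqft": 1500.0, "beds": 3.0},
    "stds": {"sqft": 500.0, "beds": 1.0},
    "weights": {"sqft": 180.0, "beds": 15000.0},
    "bias": 250000.0,
}


def save_bundle(bundle, directory):
    """Write the bundle as JSON and as pickle; return both paths."""
    directory = Path(directory)
    json_path = directory / "model.json"
    pickle_path = directory / "model.pkl"
    json_path.write_text(json.dumps(bundle, indent=2))
    with pickle_path.open("wb") as f:
        pickle.dump(bundle, f)
    return json_path, pickle_path


def load_json(path):
    """Read a JSON bundle back into a plain dict."""
    return json.loads(Path(path).read_text())


def load_pickle(path):
    """Read a pickled bundle. Only unpickle files you created yourself,
    because pickle can execute code while loading."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory (stands in for real storage).
with tempfile.TemporaryDirectory() as tmp:
    json_path, pickle_path = save_bundle(bundle, tmp)
    from_json = load_json(json_path)
    from_pickle = load_pickle(pickle_path)

print("JSON round trip intact:", from_json == bundle)
print("pickle round trip intact:", from_pickle == bundle)
```

JSON is readable by any language but only holds plain data, while pickle can hold arbitrary Python objects at the cost of being Python-only. That trade-off is the same one the format list describes.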
Try It: Save and Load a Model
Serialise a model (weights + preprocessing) to a file and load it back
# Saving and loading a trained model
# A trained model lives in memory. Close Python and it is gone.
# "Serialising" writes the model to a file so you can load it later
# in a totally separate program (your API server).
# We use plain dicts + json here so it RUNS in the browser sandbox.
# In production you'd use joblib/pickle (see the comments at the bottom).
import json
# Pretend this is what training produced: learned weights + bias,
# PLUS the exact preprocessing it expects (this matters —
...

2. Wrap It in a REST API (FastAPI / Flask)
Other programs can't import your Python model directly. A REST API is a shop counter: a client sends a small JSON order over HTTP, your service runs the model, and hands a JSON prediction back.
FastAPI is the modern Python choice (Flask is the older, simpler alternative). With FastAPI you declare the expected input as a class and it validates every request for you — a missing field or wrong type is rejected automatically before your code runs. Note the two rules below.
- Load the model once at startup, not on every request.
- Expose a /health endpoint so your load balancer can check the service is alive.
Study the real serving shape (this is read-only; you'll write runnable plain Python just below):
# serve.py: the real production shape with FastAPI
# (read this; you run plain Python in the exercises below)
import joblib
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

# Load the model ONCE at startup, not on every request.
model = joblib.load("model.joblib")

# Pydantic validates the request body for you: wrong type or
# missing field -> automatic 422 error, your code never runs.
class HouseRequest(BaseModel):
    sqft: float = Field(gt=0)  # must be greater than 0
    beds: int = Field(ge=0)    # must be 0 or more

@app.post("/predict")
def predict(req: HouseRequest):
    # Preprocess EXACTLY as in training (same feature order, same scaling).
    # This sketch assumes model.joblib is a full pipeline that applies
    # its own scaling, so raw values are passed in training order.
    x = [req.sqft, req.beds]
    # Cast to a plain float so the response serialises cleanly to JSON.
    price = float(model.predict([x])[0])
    return {"price": round(price, 2), "model_version": "1.3.0"}

@app.get("/health")  # let your load balancer check the service is alive
def health():
    return {"status": "ok"}

# Run it locally:
#   uvicorn serve:app --reload
# Then POST {"sqft": 1800, "beds": 3} to http://localhost:8000/predict
# Expected output (from the POST request):
#   {"price": 304000.0, "model_version": "1.3.0"}

3. Validate Inputs and Match Training (Preprocessing Parity)
Two failures sink more deployments than bad models. First, no input validation — a missing or negative field crashes your service. Second, train/serve skew — the model was trained on scaled features, but serving forgot to scale them the same way, so every prediction is quietly wrong.
The fix for both: validate first, then run the exact same preprocessing you used in training (same feature order, same scaling, same encoders). In the exercise below you'll write a predict() that validates the request and standardises each feature with the saved means and stds before scoring.
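To make train/serve skew concrete, here is a small worked sketch that scores the same request twice: once standardised as in training and once with the scaling step forgotten. The numbers mirror the exercise model in this lesson, and the helper names (standardise, score) are assumptions chosen for illustration.

```python
# Illustrative training artefacts: the saved preprocessing plus the weights.
MEANS = {"sqft": 1500.0, "beds": 3.0}
STDS = {"sqft": 500.0, "beds": 1.0}
WEIGHTS = {"sqft": 180.0, "beds": 15000.0}  # learned on STANDARDISED features
BIAS = 250000.0


def standardise(request):
    """Apply the training-time scaling: (x - mean) / std for each feature."""
    return {name: (request[name] - MEANS[name]) / STDS[name] for name in MEANS}


def score(features):
    """Linear model: bias plus the weighted sum of the features."""
    return BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


request = {"sqft": 1800, "beds": 3}

correct = score(standardise(request))  # same preprocessing as training
skewed = score(request)                # scaling forgotten at serve time

print(f"with training preprocessing: {correct:,.0f}")
print(f"with raw features (skewed):  {skewed:,.0f}")
```

Both calls succeed without an error, which is what makes skew dangerous: the standardised request scores about 250,108 while the raw one scores 619,000, and nothing in the service reports a problem.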
🎯 Your Turn: Build a predict() Function
Validate the input, preprocess it like training, and return a label
# 🎯 YOUR TURN — finish the predict() function
# Fill in every blank marked with ___ (hints on the 👉 lines)
# This is the model your training step produced and saved.
MODEL = {
    "feature_order": ["sqft", "beds"],
    "means": {"sqft": 1500.0, "beds": 3.0},  # standardise: (x - mean) / std
    "stds": {"sqft": 500.0, "beds": 1.0},
    "weights": {"sqft": 180.0, "beds": 15000.0},
    "bias": 250000.0,
}

def predict(request):
    # 1) VALIDATE: every required feature must be present.
...

4. Containerise with Docker
"It works on my machine" is the classic deployment failure — a library version differs in the cloud and the service breaks. A Docker container is a sealed shipping crate that freezes the OS, the Python version, and every dependency, so the service runs identically everywhere.
You pin your dependencies in a requirements.txt, copy them and your code into an image, and run it anywhere. Copying dependencies before the code lets Docker cache that slow step.
# Dockerfile — package the service + its exact dependencies
# A container freezes the OS, Python version, and every library so it
# runs identically on your laptop and in the cloud ("missing deps" gone).
FROM python:3.11-slim
WORKDIR /app
# Install deps FIRST (this layer is cached unless requirements change).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Then copy the code and the serialised model.
COPY serve.py model.joblib ./
EXPOSE 8000
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
# Build and run:
# docker build -t house-api .
# docker run -p 8000:8000 house-api
#
# Expected output (terminal):
# Uvicorn running on http://0.0.0.0:8000

⚡ Batch vs Real-Time vs Edge
The same model can be served three ways — only the wrapper around it changes. Pick the one that matches how predictions are consumed.
Real-time (online)
One request, answered in milliseconds behind a REST API. Use for live UIs.
Batch
Score thousands of rows on a schedule (e.g. a nightly job). Use for reports and pipelines.
Edge
Run the model on the device (phone, sensor) — no network call. Use for offline or low-latency needs.
In the next exercise you'll take the same scoring logic and run it over a whole batch of houses instead of one request.
🎯 Your Turn: Score a Batch
Reuse one scoring function to predict a whole list of inputs at once
# 🎯 YOUR TURN — score a whole BATCH of houses at once
# Real-time serving answers ONE request fast. Batch scoring runs
# many rows offline (e.g. a nightly job). Same model, different shape.
WEIGHTS = {"sqft": 180.0, "beds": 15000.0}
BIAS = 100000.0
def score_one(house):
    total = BIAS
    for feature, weight in WEIGHTS.items():
        total += house[feature] * weight
    return total

houses = [
    {"sqft": 1000, "beds": 2},
    {"sqft": 1500, "beds": 3},
    {"sqft": 2000, "beds": 4},
]
...

5. CI/CD and Monitoring (the Handoff)
Shipping once is easy; shipping safely over and over is the job. CI/CD (Continuous Integration / Continuous Delivery) is an automated pipeline — a tool like GitHub Actions runs your tests, builds the Docker image, and deploys it whenever you push a new model. No manual steps means fewer human mistakes.
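One way to picture the "runs your tests" step is a small smoke test that the pipeline could execute before it builds the image. The sketch below is written under stated assumptions: the MODEL dict and predict() are hypothetical stand-ins for the service code, and the test names are made up for illustration. A runner such as pytest would collect the test_ functions automatically; here they are also called directly so the file runs on its own.

```python
# Hypothetical stand-in for the service under test.
MODEL = {
    "version": "1.3.0",
    "weights": {"sqft": 180.0, "beds": 15000.0},
    "bias": 100000.0,
}


def predict(request):
    """Validate the request, score it, and return a versioned response."""
    for name in MODEL["weights"]:
        if name not in request:
            return {"error": f"missing field: {name}", "status": 400}
    price = MODEL["bias"] + sum(
        weight * request[name] for name, weight in MODEL["weights"].items()
    )
    return {"price": price, "model_version": MODEL["version"], "status": 200}


def test_known_input_gives_known_price():
    # 100000 + 180 * 1000 + 15000 * 2 = 310000
    result = predict({"sqft": 1000, "beds": 2})
    assert result["status"] == 200
    assert abs(result["price"] - 310000.0) < 1e-6


def test_missing_field_is_rejected():
    assert predict({"sqft": 1000})["status"] == 400


def test_response_carries_model_version():
    result = predict({"sqft": 1000, "beds": 2})
    assert result["model_version"] == MODEL["version"]


if __name__ == "__main__":
    test_known_input_gives_known_price()
    test_missing_field_is_rejected()
    test_response_carries_model_version()
    print("all smoke tests passed")
```

If any assertion fails, the script exits with an error, which is the signal a pipeline would use to stop before deploying.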
Once live, monitoring takes over. A model trained on yesterday's data slowly goes stale as the world changes; this is data drift. You can't see it from the code, only from the metrics. Track prediction error, latency, and shifts in the input distribution, alert when any of them crosses a threshold, and then retrain.
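A very simple drift check, sketched under stated assumptions: compare the mean of each live feature with the mean saved at training time, measured in training standard deviations, and flag anything past a threshold. The function names, the sample data, and the 0.5 threshold are all illustrative choices. Dedicated tools use richer statistical tests than this mean-shift comparison.

```python
from statistics import mean


def drift_score(live_values, train_mean, train_std):
    """How many training standard deviations the live mean has moved."""
    return abs(mean(live_values) - train_mean) / train_std


def check_drift(live_batch, train_stats, threshold=0.5):
    """Return {feature: score} for features whose live mean has moved
    more than `threshold` training standard deviations."""
    drifted = {}
    for name, (train_mean, train_std) in train_stats.items():
        values = [row[name] for row in live_batch]
        score = drift_score(values, train_mean, train_std)
        if score > threshold:
            drifted[name] = round(score, 2)
    return drifted


# Hypothetical statistics saved at training time: (mean, std) per feature.
TRAIN_STATS = {"sqft": (1500.0, 500.0), "beds": (3.0, 1.0)}

# Hypothetical recent requests: houses are now much larger than in training.
live = [
    {"sqft": 2400, "beds": 3},
    {"sqft": 2600, "beds": 4},
    {"sqft": 2500, "beds": 2},
]

print(check_drift(live, TRAIN_STATS))  # → {'sqft': 2.0}
```

Here the live sqft mean is 2500 against a training mean of 1500 and a std of 500, so it has moved two standard deviations, while beds has not moved at all. A non-empty result is the kind of signal that would trigger an alert and a retraining run.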
Common tools: GitHub Actions for CI/CD pipelines, MLflow and DVC for tracking and versioning models and data, and Prometheus + Grafana / Evidently for monitoring. Start simple and wire these in only once you have real traffic.
6. Common Errors (And How to Fix Them)
These five mistakes break far more deployments than poorly-tuned models. Learn to spot each one.
❌ Train/serve skew: predictions silently wrong
You scaled features during training but serve raw values, so the maths no longer lines up:
# Training scaled inputs, serving forgot to:
x = [req.sqft, req.beds]  # raw, unscaled ❌
price = model.predict([x])
✅ Fix: reuse the saved means/stds at serve time:
x = [(req.sqft - mean_sqft) / std_sqft,
     (req.beds - mean_beds) / std_beds]  # same as training ✓

❌ No input validation: KeyError / crash
A request missing a field throws and takes down the worker:
price = request["sqft"] * w # KeyError: 'sqft' ❌
✅ Fix: check first, return a clean 400:
if "sqft" not in request:
    return {"error": "missing field: sqft", "status": 400}

❌ Model not versioned: can't roll back
Overwriting model.joblib with no version means a bad model can't be undone and you can't tell which one made a prediction.
✅ Fix: store and return a version string:
model = {"version": "1.3.0", "weights": [...]}
return {"price": price, "model_version": model["version"]}

❌ Blocking inference: API freezes under load
Loading the model (or training!) inside the handler makes every request slow:
@app.post("/predict")
def predict(req):
    model = joblib.load("model.joblib")  # reloads every call ❌
✅ Fix: load once at startup, reuse the object:
model = joblib.load("model.joblib")  # at module top, once ✓

❌ Missing dependencies: ModuleNotFoundError in production
It runs locally but the server lacks a library:
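ModuleNotFoundError: No module named 'sklearn' ❌
(scikit-learn is the package name you install; sklearn is the module name you import, so that is what the error shows.)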
✅ Fix: pin every dep and bake it into the container:
# requirements.txt
scikit-learn==1.5.0
fastapi==0.115.0
# then: RUN pip install -r requirements.txt (in the Dockerfile)
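Pinning only helps if the running environment actually matches the pins. As a hedged sketch, the check below parses name==version lines and compares them with what is installed, using the standard library's importlib.metadata. The helper names (parse_pins, find_mismatches) and the idea of running this at service startup are assumptions for illustration, and the parser handles only simple == pins.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' lines, ignoring blank lines and comments."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        name, separator, version = line.partition("==")
        if not separator:
            raise ValueError(f"unpinned requirement: {line!r}")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (wanted, found)} for missing or mismatched packages.
    `found` is None when the package is not installed at all."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


REQUIREMENTS = """\
scikit-learn==1.5.0
fastapi==0.115.0
"""

problems = find_mismatches(parse_pins(REQUIREMENTS))
print(problems or "all pinned packages match")
```

The output depends on the environment it runs in: inside an image built from the same requirements.txt it should report a match, while on a laptop with different versions it lists each mismatch.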
📋 Quick Reference
| Stage | Tools | Purpose |
|---|---|---|
| Serialise | joblib, pickle, ONNX, SavedModel | Save / load the trained model |
| Serve | FastAPI, Flask | REST API for HTTP predictions |
| Validate | pydantic | Reject bad input, match training |
| Containerise | Docker, requirements.txt | Reproducible environment |
| Automate | GitHub Actions, MLflow, DVC | CI/CD pipelines, model and data versioning |
| Monitor | Prometheus, Grafana, Evidently | Track health and data drift |
❓ Frequently Asked Questions
Q: What does it mean to deploy a machine learning model?
A: Deploying means taking a trained model out of your notebook and making it available to real users or other software. In practice you serialise the model to a file, wrap it in a REST API so other programs can send inputs and get predictions back, package everything in a container, and then monitor it once it is live.
Q: What is train/serve skew and how do I avoid it?
A: Train/serve skew happens when the data is processed differently during training and during serving — for example you scaled or one-hot-encoded features when training but forgot to apply the exact same steps in your API. The predictions silently become wrong. Avoid it by saving the preprocessing parameters (means, stds, encoders, feature order) alongside the model and reusing the identical code path at serve time.
Q: Should I use pickle, joblib, or ONNX to save my model?
A: Use joblib for scikit-learn models (it handles large NumPy arrays efficiently). pickle works for any Python object but is Python-only and version-sensitive. Choose ONNX when you need the model to run fast or from another language. Deep-learning frameworks have their own formats (TensorFlow SavedModel, torch.save). Whatever you pick, always store a version number with the file.
Q: What is the difference between batch and real-time inference?
A: Real-time (online) inference answers a single request as fast as possible, usually behind a REST API — think a price shown the instant a user clicks. Batch inference scores many rows at once on a schedule, like a nightly job that prices every listing. Edge inference runs the model directly on a device (phone, sensor) with no network call. The same model can serve all three; only the wrapper changes.
Q: Why do I need to monitor a model after deploying it?
A: A model is trained on yesterday's data, but the real world keeps changing — this is called data drift. Accuracy can quietly degrade even though no code changed. Monitoring tracks prediction error, latency, and input-distribution shifts so you get alerted and can retrain before users notice. Deployment is the start of the model's life, not the end.
🎯 Mini Challenge: A Versioned predict() Endpoint
Now try it with less support. Build a small prediction function from the brief in the comments: it must validate its input, run the model, and return a versioned result. Check your output against the expected lines.
Mini Challenge
Write a validated, versioned predict() from scratch
# 🎯 MINI-CHALLENGE: a versioned predict() with input validation
#
# 1. Make a MODEL dict with: "version" (a string like "2.0.0"),
# "weights" {"hours": 8.0, "score": 1.5}, and "bias" 20.0.
# 2. Write predict(request) that:
# - returns {"error": ..., "status": 400} if "hours" or "score" is missing
# - returns {"error": ..., "status": 400} if "hours" is negative
# - otherwise computes bias + hours*8 + score*1.5
# - returns {"prediction": <number>, "version": <model version>
...

Lesson 14 complete: your model is in production!
You can serialise a model, serve it through a validated REST API, keep training and serving in lockstep to avoid skew, seal it in a Docker container, choose batch vs real-time vs edge, and hand off to CI/CD and monitoring. That's the full journey from workshop to customer.
🚀 Up next: Unsupervised Learning — find hidden patterns in unlabelled data, no targets required.