Lesson 43 • Advanced

Monitoring Models in Production 📊

Shipping a model is the start, not the finish. By the end of this lesson you'll detect data drift, track prediction quality over time, and wire up alerts that tell you to retrain — before users feel the damage.

What You'll Learn in This Lesson

✓ Tell data drift apart from concept drift, with real examples
✓ Detect drift by comparing the mean and std of two batches
✓ Track prediction quality with a rolling-accuracy metric
✓ Trigger an alert when accuracy dips below a threshold
✓ Recognise label lag and why input drift warns you sooner
✓ Know when drift should trigger a model retrain

Before you start: You should be comfortable with Model Serving — this lesson assumes you already have a model answering live requests.

🏥 Real-World Analogy: Think of model monitoring like a hospital patient monitor. The patient (your model) can look fine while their vitals quietly slide. The monitor tracks heart rate, blood pressure, and oxygen continuously, draws a baseline, and sounds an alarm the moment a reading crosses a safe threshold — long before the patient collapses. Your job is to put the same continuous vitals and alarms around a model in production.

1. Why Models Degrade

A model that was 95% accurate on launch day can silently slide to 70% within weeks. Nothing in your code changed — the world changed, and your model didn't. There are two failures you must watch for.

Data drift: the inputs change shape. You trained on summer shoppers; now it's winter and the incoming feature values look different. The relationship the model learned may still hold, but it is being asked about inputs it rarely or never saw in training, so its predictions become less reliable.

Concept drift: the relationship between inputs and the right answer changes. A "good salary" meant 50k in 2010 and 80k today, so the same input should now produce a different label. The inputs can look exactly as they did in training while the model keeps giving the answer that used to be right.
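To make the difference concrete, here is a minimal sketch of concept drift using the "good salary" example. The rule-based `predict_good_salary` "model", the salary list, and both thresholds are illustrative assumptions, not a real trained model. The inputs are identical in both runs; only the meaning of the label moves.

```python
# Hypothetical frozen "model": a rule learned back in 2010.
def predict_good_salary(salary_k):
    """Return True if the frozen model calls this salary 'good'."""
    return salary_k >= 50

def accuracy(salaries, true_threshold_k):
    """Share of salaries where the model agrees with the current truth."""
    hits = 0
    for salary in salaries:
        truth = salary >= true_threshold_k   # what 'good' means right now
        if predict_good_salary(salary) == truth:
            hits += 1
    return hits / len(salaries)

# The SAME inputs in both years, so there is no data drift at all.
salaries = [30, 45, 55, 60, 70, 75, 85, 95]

acc_2010 = accuracy(salaries, true_threshold_k=50)
acc_today = accuracy(salaries, true_threshold_k=80)

print("Accuracy with 2010 labels:", acc_2010)     # 1.0
print("Accuracy with today's labels:", acc_today)  # 0.5
```

An input-statistics check would stay silent here because the inputs never moved. Only labelled outcomes reveal this kind of failure.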

The trap: most teams only watch accuracy. But accuracy needs labels (the true answer), and labels often arrive weeks later — a problem called label lag. Input drift, by contrast, is visible the instant a request arrives. That's why you monitor both: drift warns you early, accuracy confirms the damage.

Key insight: Input drift is the smoke alarm; falling accuracy is the fire. Act on the alarm, because by the time you can see the flames the damage is already done.
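Label lag is easier to feel with numbers. The sketch below assumes a hypothetical prediction log and a 30-day delay before the true outcome is known; the `accuracy_visible_on` helper and every value in the log are made up for illustration. It compares the accuracy you can measure today with the accuracy you would see if every label had already arrived.

```python
LABEL_LAG_DAYS = 30   # assumed delay before the true outcome is known

# Hypothetical log: (day_predicted, predicted_label, true_label)
predictions = [
    (1, 1, 1), (5, 0, 0), (12, 1, 0), (20, 1, 1),     # older traffic
    (35, 0, 1), (41, 1, 0), (48, 0, 1), (55, 1, 0),   # recent traffic
]

def accuracy_visible_on(today, log, lag=LABEL_LAG_DAYS):
    """Accuracy over predictions whose labels have arrived by `today`.

    Returns (accuracy, number_of_labelled_predictions).
    """
    matured = [(p, t) for day, p, t in log if day + lag <= today]
    if not matured:
        return None, 0
    correct = sum(1 for p, t in matured if p == t)
    return correct / len(matured), len(matured)

visible_acc, visible_n = accuracy_visible_on(today=60, log=predictions)
true_acc = sum(1 for _, p, t in predictions if p == t) / len(predictions)

print("Accuracy you can measure on day 60:", visible_acc, "from", visible_n, "labels")
print("Accuracy if every label had arrived:", true_acc)
# Accuracy you can measure on day 60: 0.75 from 4 labels
# Accuracy if every label had arrived: 0.375
```

On day 60 the dashboard still shows a healthy 0.75, because every recent mistake is hidden behind labels that have not arrived yet.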

2. Detecting Drift by Comparing Two Batches

The simplest, label-free drift check is this: keep a reference batch (a sample of your training data) and, for each new current batch of live traffic, compare their summary statistics. If the mean (average) or standard deviation (how spread out the values are) moves beyond a tolerance you set, the inputs have drifted.

Read the worked example below line by line — every function is plain Python, no libraries. Then run it and confirm the output matches the # Expected output comment at the bottom.

Worked Example: Detect Data Drift (plain Python)

Compare the mean and std of a reference batch against a live batch to flag drift

Python

# ============================================
# DATA DRIFT vs CONCEPT DRIFT (plain Python)
# ============================================
# Data drift  = the INPUTS change shape over time.
#   (You trained on summer shoppers; now it is winter.)
# Concept drift = the INPUT -> OUTPUT relationship changes.
#   ("good salary" meant 50k in 2010, 80k today.)
#
# This example detects DATA drift the simplest way there is:
# compare the mean and standard deviation of two batches.

def mean(values):
    
...
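As a minimal sketch of the mean/std comparison this section describes, the block below uses plain Python only. The `std` and `detect_drift` helpers, the batch values, and both tolerances are illustrative assumptions rather than the lesson's original listing.

```python
def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)

def std(values):
    """Population standard deviation: how spread out the values are."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def detect_drift(reference, current, mean_tol, std_tol):
    """Compare two batches and flag drift if either statistic moved too far."""
    mean_shift = abs(mean(current) - mean(reference))
    std_shift = abs(std(current) - std(reference))
    return {
        "mean_shift": mean_shift,
        "std_shift": std_shift,
        "drifted": mean_shift > mean_tol or std_shift > std_tol,
    }

# Hypothetical batches of one numeric feature
reference = [50, 51, 49, 52, 48, 50, 51, 49]   # sample of training data
similar   = [50, 52, 49, 51, 50, 48, 51, 50]   # live traffic, same season
shifted   = [60, 62, 58, 64, 61, 59, 63, 60]   # live traffic after a shift

for name, batch in [("similar", similar), ("shifted", shifted)]:
    result = detect_drift(reference, batch, mean_tol=5, std_tol=2)
    print(name, "-> drifted:", result["drifted"])
# similar -> drifted: False
# shifted -> drifted: True
```

The tolerances are the part you have to tune: set them too tight and ordinary day-to-day variation fires the check, set them too loose and real drift slips through.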

🎯 Your Turn: Flag the Drift

Fill in the blanks so the comparison flags the current batch as drifted. Pass the current batch, check how far the mean moved, and pick the right comparison operator.

🎯 Your Turn: Detect Drift

Fill in the ___ blanks, then run and self-check against the expected output

Python

# 🎯 YOUR TURN — detect drift by comparing two batches
# Fill in each ___ , then run it.

def mean(values):
    return sum(values) / len(values)

reference = [100, 102, 98, 101, 99, 100, 101, 99]
current   = [130, 133, 128, 131, 129, 132, 130, 131]

ref_mean = mean(reference)
cur_mean = mean(___)              # 👉 pass the CURRENT batch here

mean_shift = abs(cur_mean - ref_mean)   # 👉 how far the average moved

THRESHOLD = 10
drifted = mean_shift ___ THRESHOLD      # 👉 use a comparison: > or 
...

3. Tracking Prediction Quality Over Time

Once labels do arrive, you can measure how well predictions are landing. But a single day's accuracy is noisy — one weird batch can make it jump or dip for reasons that don't matter. The fix is a rolling window: average the last few days so a genuine downward trend stands out from the daily wobble.

You then set an alert threshold — a line in the sand. When the rolling value crosses below it, you raise an alert (log it, post to Slack, or page on-call). The next example builds exactly that: a rolling accuracy that trips an alarm below 0.85.

Worked Example: Rolling Accuracy + Alert (plain Python)

Smooth daily accuracy with a rolling window and trip an alert below the threshold

Python

# ============================================
# ROLLING ACCURACY + THRESHOLD ALERT (plain Python)
# ============================================
# Accuracy rarely falls off a cliff. It sags slowly as the world
# drifts away from your training data. A ROLLING window smooths out
# the day-to-day noise so a real downward trend is visible.

def rolling_accuracy(daily_accuracy, window=3):
    """Average of the last 'window' days, computed day by day."""
    rolled = []
    for i in range(len(daily_a
...
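Here is a minimal sketch of the rolling-window idea with a threshold alert, in plain Python. The daily accuracy values and the `first_alert_day` helper are illustrative assumptions; the 0.85 threshold is the one used in this section.

```python
def rolling_accuracy(daily_accuracy, window=3):
    """Average of the last `window` days, computed day by day."""
    rolled = []
    for i in range(len(daily_accuracy)):
        start = max(0, i - window + 1)          # early days use a shorter window
        chunk = daily_accuracy[start:i + 1]
        rolled.append(sum(chunk) / len(chunk))
    return rolled

def first_alert_day(rolled, threshold=0.85):
    """1-based day on which the rolling value first drops below the threshold."""
    for day, value in enumerate(rolled, start=1):
        if value < threshold:
            return day
    return None

# Hypothetical daily accuracy, sagging slowly over a week
daily = [0.92, 0.91, 0.93, 0.88, 0.86, 0.83, 0.81, 0.78]
rolled = rolling_accuracy(daily, window=3)

for day, (raw, smooth) in enumerate(zip(daily, rolled), start=1):
    status = "ALERT" if smooth < 0.85 else "ok"
    print(f"Day {day}: raw={raw:.2f} rolling={smooth:.3f} {status}")

print("First alert on day:", first_alert_day(rolled))   # 7
```

The raw value first dips below 0.85 on day 6, but the rolling value waits until day 7. That one-day delay is the price of smoothing: a wider window is calmer but slower to react.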

🎯 Your Turn: Raise the Alert

Complete the rolling-accuracy loop so it alerts once quality drops below 0.85. Divide by the right count and choose the comparison that fires when accuracy is too low.

🎯 Your Turn: Rolling Accuracy Alert

Fill in the ___ blanks, then run and self-check against the expected output

Python

# 🎯 YOUR TURN — raise an alert when rolling accuracy dips
# Fill in each ___ , then run it.

daily = [0.90, 0.89, 0.88, 0.84, 0.82, 0.80]
window = 2
ALERT_THRESHOLD = 0.85

for i in range(len(daily)):
    start = max(0, i - window + 1)
    chunk = daily[start:i + 1]
    rolling = sum(chunk) / len(___)        # 👉 divide by how many days are in 'chunk'

    if rolling ___ ALERT_THRESHOLD:        # 👉 alert when BELOW the threshold
        status = "ALERT"
    else:
        status = "ok"

    pri
...

4. Logging, Alerting, and the Tools Teams Actually Use

The plain-Python examples show you the idea. In real systems you don't hand-roll the maths — you log every prediction and its inputs, push the numbers to a metric store, draw them on a dashboard, and let an alerting rule page you. A production stack usually layers up like this:

┌──────────────────────────────────────┐
│   Alerting   (PagerDuty / Slack)     │  <- pages a human on critical drift
├──────────────────────────────────────┤
│   Dashboard  (Grafana / DataDog)     │  <- humans watch trends here
├──────────────────────────────────────┤
│   Metric store (Prometheus/BigQuery) │  <- PSI, accuracy, latency over time
├──────────────────────────────────────┤
│   Collectors (input/output loggers)  │  <- log every request + prediction
├──────────────────────────────────────┤
│   Model inference service            │  <- your model answering requests
└──────────────────────────────────────┘
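The collector layer in the diagram can be sketched with nothing but the standard library. This is a minimal illustration, assuming one JSON line per request; the `log_prediction` helper, the field names, and the feature values are hypothetical, and a real service would write to a file or a log shipper rather than an in-memory buffer.

```python
import io
import json
import time

def log_prediction(stream, features, prediction, model_version):
    """Append one request/prediction pair as a JSON line and return the record."""
    record = {
        "ts": time.time(),               # when the request was served
        "model_version": model_version,  # lets you compare versions later
        "features": features,            # the inputs, needed for drift checks
        "prediction": prediction,        # the output, joined to labels later
    }
    stream.write(json.dumps(record) + "\n")
    return record

# StringIO stands in for a log file so the sketch stays self-contained
stream = io.StringIO()
log_prediction(stream, {"income": 52, "age": 31}, 1, "v1")
log_prediction(stream, {"income": 61, "age": 45}, 0, "v1")

# Later: read the log back and compute an input statistic for the dashboard
records = [json.loads(line) for line in stream.getvalue().splitlines()]
incomes = [r["features"]["income"] for r in records]
print("Logged requests:", len(records))                 # 2
print("Mean income:", sum(incomes) / len(incomes))      # 56.5
```

Logging the inputs alongside the prediction is the design choice that matters: without the inputs you cannot run a drift check, and without the prediction you cannot score accuracy once the label arrives.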

For the drift maths itself, purpose-built libraries do the heavy lifting:

Evidently / WhyLabs / NannyML — drift reports and statistical tests
MLflow — track model versions and metrics across runs
Great Expectations — validate incoming data (nulls, ranges, types)
Fairlearn / AIF360 — fairness metrics across groups

Here's the same drift check from Section 2, but expressed with Evidently. It's read-only — the tool isn't installed in the editor — so study it as the production version of what you already built by hand.

# ============================================
# THE SAME CHECK WITH EVIDENTLY (a monitoring tool)
# ============================================
# In production you don't hand-roll drift maths. Tools like Evidently,
# WhyLabs, or NannyML run the statistical tests, render a report, and
# wire into alerting for you. Here is the Evidently equivalent.
#
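# (Read-only: 'pip install evidently' is not available in this editor.)
# Note: these imports match Evidently's older Report API; recent releases
# reorganised the package, so check the docs for the version you install.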

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_df = the training data; current_df = today's live traffic
reference_df = pd.DataFrame({"income": [50, 51, 49, 52, 48, 50, 51, 49]})
current_df   = pd.DataFrame({"income": [60, 62, 58, 64, 61, 59, 63, 60]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)

# Pull out the machine-readable result so you can alert on it
result = report.as_dict()
drift = result["metrics"][0]["result"]
print("Columns drifted:", drift["number_of_drifted_columns"])
print("Dataset drift?:", drift["dataset_drift"])

# Expected output:
# Columns drifted: 1
# Dataset drift?: True

🔁 Retraining Triggers — Closing the Loop

Monitoring is only useful if it leads to action. The action is usually a retrain — fit a fresh model on recent data so it re-learns the world as it is now. Don't retrain on a single bad reading; tie it to a tiered, persistent signal:

Signal	Tier	Action
Mean/std shift within tolerance	Info	Log, keep watching
Moderate drift, a few days	Warning	Investigate inputs
Rolling accuracy below target, sustained	Critical	Retrain on fresh data
Pipeline returns nulls / broken feature	Critical	Fix data, then retrain

The persistence rule ("sustained across a window") is what stops you retraining on a one-off spike — and it's the same idea as the rolling window you coded above.
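A tiered, persistent trigger can be sketched in a few lines. This is a simplified reading of the table above, not a production policy: the `tier_for` and `should_retrain` helpers, the thresholds, the three-day persistence rule, and the daily readings are all illustrative assumptions.

```python
def tier_for(mean_shift, rolling_acc, warn_shift=5, acc_target=0.85):
    """Map one day's signals to a tier: info, warning, or critical."""
    if rolling_acc is not None and rolling_acc < acc_target:
        return "critical"
    if mean_shift > warn_shift:
        return "warning"
    return "info"

def should_retrain(tiers, persist_days=3):
    """True once the last `persist_days` readings are all critical."""
    if len(tiers) < persist_days:
        return False
    return all(t == "critical" for t in tiers[-persist_days:])

# Hypothetical daily readings: (input mean shift, rolling accuracy)
readings = [
    (1.0, 0.91), (6.2, 0.90), (7.5, 0.84), (3.0, 0.88),
    (8.1, 0.83), (9.0, 0.82), (9.4, 0.80),
]

tiers = []
retrain_day = None
for day, (shift, acc) in enumerate(readings, start=1):
    tiers.append(tier_for(shift, acc))
    if retrain_day is None and should_retrain(tiers):
        retrain_day = day
    print(f"Day {day}: {tiers[-1]}")

print("Retrain triggered on day:", retrain_day)   # 7
```

Day 3 is critical on its own but recovers on day 4, so nothing happens. The retrain fires only on day 7, after three critical days in a row.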

5. Common Mistakes (And How to Fix Them)

Monitoring fails in predictable ways. Here are the four that bite teams most often:

❌ No monitoring at all

The model ships and nobody watches it. Accuracy quietly rots and the first signal is an angry customer.

✅ Fix: log every prediction with its inputs from day one, even if the only "dashboard" is a daily print of mean/std and rolling accuracy.

❌ Drift goes unnoticed

You watch accuracy only. But accuracy needs labels, and by the time it drops the inputs have been drifting for weeks.

✅ Fix: monitor input statistics too — they're available immediately and warn you before accuracy moves.

❌ Ignoring label lag

You wait for ground-truth labels that take months to arrive (did the loan default? did the user churn?), so your alerts are always late.

✅ Fix: lean on label-free signals (input drift, prediction-distribution shift) as your early warning; treat accuracy as confirmation, not detection.

❌ Alert fatigue

Every tiny wobble fires a page. The team mutes the channel — and then misses the alert that actually mattered.

✅ Fix: tier alerts (info / warning / critical), only page for critical, and require the signal to persist across a rolling window before firing.

🎯 Mini-Challenge: Build a Tiny Watcher

Time to fade the scaffolding. You've detected drift and tracked rolling accuracy separately — now combine them into one daily watcher. The starter below gives you only a comment outline and the data. Write the loop yourself, then check it against the expected output in the comments.

🎯 Mini-Challenge: Monitoring Loop

Comment outline only — write the logic and self-check against the expected output

Python

# 🎯 MINI-CHALLENGE: a tiny monitoring loop
# Combine both signals you learned into one watcher.
#
# 1. You are given a list of (input_mean, accuracy) tuples, one per day.
# 2. For each day, flag DATA DRIFT if input_mean is more than 5 away
#    from the reference mean of 50.
# 3. Also flag LOW ACCURACY if accuracy is below 0.85.
# 4. Print the day number and which alerts (if any) fired.
#
# ✅ Expected (for the data below):
# Day 1: ok
# Day 2: ok
# Day 3: DATA DRIFT
# Day 4: DATA DRIFT, LOW ACC
...

📋 Quick Reference — Model Monitoring

What to watch	How	Tool	Alert when
Data drift	Mean / std vs reference, PSI (Population Stability Index)	Evidently, WhyLabs	shift > tolerance
Concept drift	Rolling accuracy over time	MLflow, Prometheus	below baseline target
Latency	p50 / p95 / p99 timings	Grafana, DataDog	p99 > 200 ms
Feature health	Null %, range checks	Great Expectations	> 5% nulls
Retrain trigger	Sustained critical signal	CI/CD pipeline	persists across window
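The table lists PSI next to mean/std. PSI (Population Stability Index) compares two binned distributions: for each bin it takes the difference between the current share and the reference share and multiplies it by the natural log of their ratio, then sums over the bins. The sketch below is a minimal plain-Python version; the `bin_shares` and `psi` helpers, the bin edges, the batch values, and the small `eps` floor for empty bins are all assumptions made for illustration.

```python
import math

def bin_shares(values, edges, eps=1e-6):
    """Fraction of values in each bin [edges[i], edges[i+1]).

    The last bin also includes its right edge. Values outside the edges
    are ignored. Empty bins are floored at eps so the log stays defined.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    return [max(c / len(values), eps) for c in counts]

def psi(reference, current, edges):
    """Population Stability Index between two batches over shared bins."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical batches and bin edges
reference = [100, 102, 98, 101, 99, 100, 101, 99]
current   = [130, 133, 128, 131, 129, 132, 130, 131]
edges = [90, 100, 110, 120, 130, 140]

print("PSI, reference vs itself:", psi(reference, reference, edges))   # 0.0
print("PSI, reference vs current:", round(psi(reference, current, edges), 2))
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as significant drift. Treat those cut-offs as a starting point: with tiny batches like these and an `eps` floor on empty bins, the score is inflated well beyond anything you would see on real traffic.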

❓ Frequently Asked Questions

Q: What is the difference between data drift and concept drift?

A: Data drift means the inputs change shape — the distribution of features your model sees moves away from the training data (e.g. a new customer demographic). Concept drift means the relationship between inputs and the correct output changes, so the same input should now map to a different prediction (e.g. what counts as a 'good salary' rises over time). Data drift you can spot from inputs alone; concept drift usually shows up as falling accuracy once labels arrive.

Q: How do I detect drift without any labels?

A: Compare the statistics of incoming inputs against the training data — mean, standard deviation, min/max, or a binned distribution. If those summary numbers move beyond a tolerance, the inputs have drifted. This is exactly what the plain-Python example does, and it needs no ground-truth labels, which is why it is your first line of defence in production.

Q: Why use a rolling window for accuracy instead of the raw number?

A: A single day's accuracy is noisy — one unusual batch can make it spike or dip for reasons that don't matter. A rolling average over the last few days smooths that noise so a genuine downward trend stands out. You alert on the rolling value, not the raw value, to avoid firing on random wobble.

Q: What is label lag and why does it matter for monitoring?

A: Label lag is the delay between making a prediction and learning whether it was correct. A loan model may not know if a borrower defaults for months, so accuracy-based alerts arrive late. That is why you also monitor input drift, which is available immediately — it warns you before the (delayed) accuracy metric confirms a problem.

Q: When should drift trigger an automatic retrain?

A: Tie retraining to a tiered threshold rather than a single number. Minor drift is logged and watched; moderate drift opens an investigation; major sustained drift (or rolling accuracy below your service-level target) triggers retraining on fresh data. Always require the signal to persist across a window so you don't retrain on a one-off spike.

Q: How do I avoid alert fatigue?

A: Tier your alerts — info, warning, critical — and only page a human for critical, sustained problems. Use rolling windows and 'must persist for N periods' rules so transient blips stay silent. If every wobble pages the team, people start ignoring alerts, and the one that matters gets missed too.

🎉

Lesson complete — your models now have vitals and alarms!

You can tell data drift from concept drift, detect drift by comparing the mean and std of two batches, smooth prediction quality with a rolling-accuracy metric, trip an alert below a threshold, and decide when a sustained signal should trigger a retrain. That's the full monitoring loop, from raw signal to action.

🚀 Up next: MLOps Fundamentals — automate this whole loop so detection, retraining, and redeployment happen as a pipeline instead of by hand.
