
    Lesson 43 • Advanced

    Monitoring Models in Production 📊

    Shipping a model is the start, not the finish. By the end of this lesson you'll detect data drift, track prediction quality over time, and wire up alerts that tell you to retrain — before users feel the damage.

    What You'll Learn in This Lesson

    • Tell data drift apart from concept drift, with real examples
    • Detect drift by comparing the mean and std of two batches
    • Track prediction quality with a rolling-accuracy metric
    • Trigger an alert when accuracy dips below a threshold
    • Recognise label lag and why input drift warns you sooner
    • Know when drift should trigger a model retrain

    1. Why Models Degrade

    A model that was 95% accurate on launch day can silently slide to 70% within weeks. Nothing in your code changed — the world changed, and your model didn't. There are two failures you must watch for.

    Data drift: the inputs change shape. You trained on summer shoppers; now it's winter and the incoming feature values look different. The relationship the model learned may still hold, but it is being asked about inputs it rarely or never saw in training, so its predictions become less reliable.

    Concept drift: the relationship between inputs and the right answer changes. A "good salary" meant 50k in 2010 and 80k today, so the same input should now produce a different label. The model is now answering the wrong question.

    The trap: most teams only watch accuracy. But accuracy needs labels (the true answer), and labels often arrive weeks later — a problem called label lag. Input drift, by contrast, is visible the instant a request arrives. That's why you monitor both: drift warns you early, accuracy confirms the damage.
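To make the two failure modes concrete, here is a minimal sketch that uses a frozen rule as a stand-in for a trained model. The salary values, the `predict` rule and the helper names are all illustrative assumptions, not part of the lesson's own listings.

```python
# Toy illustration of data drift vs concept drift.
# Every number and name below is hypothetical.

def mean(values):
    return sum(values) / len(values)

def predict(salary_k):
    """The frozen 'model': it learned that 50k or more is a good salary."""
    return salary_k >= 50

def accuracy(inputs, true_cutoff):
    """Share of inputs where the model agrees with the current ground truth."""
    hits = sum(predict(x) == (x >= true_cutoff) for x in inputs)
    return hits / len(inputs)

training_inputs = [30, 40, 45, 55, 60, 70]

# Data drift: the inputs move, the rule for the right answer does not.
drifted_inputs = [60, 70, 75, 85, 90, 100]
print("input mean:", mean(training_inputs), "->", mean(drifted_inputs))
print("accuracy, same concept:", accuracy(drifted_inputs, true_cutoff=50))

# Concept drift: the inputs look the same, but "good" now means 80k or more.
print("accuracy, new concept:", accuracy(training_inputs, true_cutoff=80))

# Output:
# input mean: 50.0 -> 80.0
# accuracy, same concept: 1.0
# accuracy, new concept: 0.5
```

In this toy case the input mean exposes the data drift straight away, while the concept drift leaves the inputs untouched and only shows up in accuracy, which needs labels.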

    2. Detecting Drift by Comparing Two Batches

    The simplest, label-free drift check is this: keep a reference batch (a sample of your training data) and, for each new current batch of live traffic, compare their summary statistics. If the mean (average) or standard deviation (how spread out the values are) move beyond a tolerance you set, the inputs have drifted.

    Read the worked example below line by line — every function is plain Python, no libraries. Then run it and confirm the output matches the # Expected output comment at the bottom.

    Worked Example: Detect Data Drift (plain Python)

    Compare the mean and std of a reference batch against a live batch to flag drift

    Python
    # ============================================
    # DATA DRIFT vs CONCEPT DRIFT (plain Python)
    # ============================================
    # Data drift  = the INPUTS change shape over time.
    #   (You trained on summer shoppers; now it is winter.)
    # Concept drift = the INPUT -> OUTPUT relationship changes.
    #   ("good salary" meant 50k in 2010, 80k today.)
    #
    # This example detects DATA drift the simplest way there is:
    # compare the mean and standard deviation of two batches.
    
    def mean(values):
        
    ...
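The listing above is cut off in this copy. As a stand-in, here is a minimal sketch of the check this section describes; it is not the lesson's original code. The `detect_drift` name and the `std_tol` value are assumptions, and the batches and the mean tolerance of 10 are borrowed from the exercise that follows.

```python
def mean(values):
    return sum(values) / len(values)

def std(values):
    """Population standard deviation (divide by n, not n - 1)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def detect_drift(reference, current, mean_tol, std_tol):
    """Flag drift when either summary statistic moves past its tolerance."""
    mean_shift = abs(mean(current) - mean(reference))
    std_shift = abs(std(current) - std(reference))
    return {
        "mean_shift": mean_shift,
        "std_shift": std_shift,
        "drifted": mean_shift > mean_tol or std_shift > std_tol,
    }

reference = [100, 102, 98, 101, 99, 100, 101, 99]
current = [130, 133, 128, 131, 129, 132, 130, 131]

result = detect_drift(reference, current, mean_tol=10, std_tol=1)
print(f"mean shift: {result['mean_shift']:.2f}")
print(f"std shift: {result['std_shift']:.2f}")
print("drifted:", result["drifted"])

# Output:
# mean shift: 30.50
# std shift: 0.28
# drifted: True
```

Both tolerances are in the feature's own units, so they have to be chosen per feature; a tolerance of 10 that suits one column may be far too loose or too tight for another.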

    🎯 Your Turn: Flag the Drift

    Fill in the three blanks so the comparison flags the current batch as drifted. Pass the current batch, measure how far the mean moved, and pick the right comparison operator.

    🎯 Your Turn: Detect Drift

    Fill in the ___ blanks, then run and self-check against the expected output

    Python
    # 🎯 YOUR TURN — detect drift by comparing two batches
    # Fill in each ___ , then run it.
    
    def mean(values):
        return sum(values) / len(values)
    
    reference = [100, 102, 98, 101, 99, 100, 101, 99]
    current   = [130, 133, 128, 131, 129, 132, 130, 131]
    
    ref_mean = mean(reference)
    cur_mean = mean(___)              # 👉 pass the CURRENT batch here
    
    mean_shift = abs(cur_mean - ref_mean)   # 👉 how far the average moved
    
    THRESHOLD = 10
    drifted = mean_shift ___ THRESHOLD      # 👉 use a comparison: > or <
    ...

    3. Tracking Prediction Quality Over Time

    Once labels do arrive, you can measure how well predictions are landing. But a single day's accuracy is noisy — one weird batch can make it jump or dip for reasons that don't matter. The fix is a rolling window: average the last few days so a genuine downward trend stands out from the daily wobble.

    You then set an alert threshold — a line in the sand. When the rolling value crosses below it, you raise an alert (log it, post to Slack, or page on-call). The next example builds exactly that: a rolling accuracy that trips an alarm below 0.85.

    Worked Example: Rolling Accuracy + Alert (plain Python)

    Smooth daily accuracy with a rolling window and trip an alert below the threshold

    Python
    # ============================================
    # ROLLING ACCURACY + THRESHOLD ALERT (plain Python)
    # ============================================
    # Accuracy rarely falls off a cliff. It sags slowly as the world
    # drifts away from your training data. A ROLLING window smooths out
    # the day-to-day noise so a real downward trend is visible.
    
    def rolling_accuracy(daily_accuracy, window=3):
        """Average of the last 'window' days, computed day by day."""
        rolled = []
        for i in range(len(daily_a
    ...
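The listing above is cut off in this copy. The sketch below shows one way the rolling window and the 0.85 alert could fit together; it is an assumption about the shape of the full example, not a reconstruction of it. The daily values are borrowed from the exercise that follows, and the `statuses` list is added here purely for illustration.

```python
def rolling_accuracy(daily_accuracy, window=3):
    """Average of the last `window` days, computed day by day."""
    rolled = []
    for i in range(len(daily_accuracy)):
        start = max(0, i - window + 1)      # early days use a shorter window
        chunk = daily_accuracy[start:i + 1]
        rolled.append(sum(chunk) / len(chunk))
    return rolled

ALERT_THRESHOLD = 0.85
daily = [0.90, 0.89, 0.88, 0.84, 0.82, 0.80]

rolled = rolling_accuracy(daily)
statuses = []
for day, value in enumerate(rolled, start=1):
    status = "ALERT" if value < ALERT_THRESHOLD else "ok"
    statuses.append(status)
    print(f"Day {day}: rolling={value:.3f} {status}")

# Output:
# Day 1: rolling=0.900 ok
# Day 2: rolling=0.895 ok
# Day 3: rolling=0.890 ok
# Day 4: rolling=0.870 ok
# Day 5: rolling=0.847 ALERT
# Day 6: rolling=0.820 ALERT
```

Note the trade-off: the raw value first dips below 0.85 on day 4, but the 3-day average only crosses the line on day 5. A wider window gives fewer false alarms at the cost of a slower reaction.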

    🎯 Your Turn: Raise the Alert

    Complete the rolling-accuracy loop so it alerts once quality drops below 0.85. Divide by the right count and choose the comparison that fires when accuracy is too low.

    🎯 Your Turn: Rolling Accuracy Alert

    Fill in the ___ blanks, then run and self-check against the expected output

    Python
    # 🎯 YOUR TURN — raise an alert when rolling accuracy dips
    # Fill in each ___ , then run it.
    
    daily = [0.90, 0.89, 0.88, 0.84, 0.82, 0.80]
    window = 2
    ALERT_THRESHOLD = 0.85
    
    for i in range(len(daily)):
        start = max(0, i - window + 1)
        chunk = daily[start:i + 1]
        rolling = sum(chunk) / len(___)        # 👉 divide by how many days are in 'chunk'
    
        if rolling ___ ALERT_THRESHOLD:        # 👉 alert when BELOW the threshold
            status = "ALERT"
        else:
            status = "ok"
    
        pri
    ...

    4. Logging, Alerting, and the Tools Teams Actually Use

    The plain-Python examples show you the idea. In real systems you don't hand-roll the maths — you log every prediction and its inputs, push the numbers to a metric store, draw them on a dashboard, and let an alerting rule page you. A production stack usually layers up like this:

    ┌──────────────────────────────────────┐
    │   Alerting   (PagerDuty / Slack)     │  <- pages a human on critical drift
    ├──────────────────────────────────────┤
    │   Dashboard  (Grafana / DataDog)     │  <- humans watch trends here
    ├──────────────────────────────────────┤
    │   Metric store (Prometheus/BigQuery) │  <- PSI, accuracy, latency over time
    ├──────────────────────────────────────┤
    │   Collectors (input/output loggers)  │  <- log every request + prediction
    ├──────────────────────────────────────┤
    │   Model inference service            │  <- your model answering requests
    └──────────────────────────────────────┘
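The collector layer in the diagram can start out very small. Here is a minimal sketch, assuming a JSON-lines log with one record per prediction; the `log_prediction` helper, the field names and the `"v1"` version string are hypothetical, and an in-memory buffer stands in for a real file or log stream.

```python
import io
import json
from datetime import datetime, timezone

def log_prediction(sink, features, prediction, model_version):
    """Append one JSON line holding the inputs, the output and a timestamp."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    sink.write(json.dumps(record) + "\n")
    return record

sink = io.StringIO()  # stands in for an append-only log file
log_prediction(sink, {"income": 52}, 1, "v1")
log_prediction(sink, {"income": 61}, 0, "v1")

# A later job can read the log back and compute the monitoring statistics.
rows = [json.loads(line) for line in sink.getvalue().splitlines()]
incomes = [row["features"]["income"] for row in rows]
print("logged:", len(rows), "mean income:", sum(incomes) / len(incomes))

# Output:
# logged: 2 mean income: 56.5
```

Keeping the inputs next to each prediction is what makes the later drift checks possible: without the logged features there is nothing to compare against the reference batch.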

    For the drift maths itself, purpose-built libraries do the heavy lifting:

    • Evidently / WhyLabs / NannyML — drift reports and statistical tests
    • MLflow — track model versions and metrics across runs
    • Great Expectations — validate incoming data (nulls, ranges, types)
    • Fairlearn / AIF360 — fairness metrics across groups

    Here's the same drift check from Section 2, but expressed with Evidently. It's read-only — the tool isn't installed in the editor — so study it as the production version of what you already built by hand.

    # ============================================
    # THE SAME CHECK WITH EVIDENTLY (a monitoring tool)
    # ============================================
    # In production you don't hand-roll drift maths. Tools like Evidently,
    # WhyLabs, or NannyML run the statistical tests, render a report, and
    # wire into alerting for you. Here is the Evidently equivalent.
    #
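    # (Read-only: 'pip install evidently' is not available in this editor.)
    # Version note: these import paths follow Evidently's older Report API
    # (0.4.x-era releases). Later releases reorganised the package, so check
    # the documentation for the version you install.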
    
    import pandas as pd
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset
    
    # reference_df = the training data; current_df = today's live traffic
    reference_df = pd.DataFrame({"income": [50, 51, 49, 52, 48, 50, 51, 49]})
    current_df   = pd.DataFrame({"income": [60, 62, 58, 64, 61, 59, 63, 60]})
    
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_df, current_data=current_df)
    
    # Pull out the machine-readable result so you can alert on it.
    # Index 0 assumes the preset's first metric is the dataset-level drift summary.
    result = report.as_dict()
    drift = result["metrics"][0]["result"]
    print("Columns drifted:", drift["number_of_drifted_columns"])
    print("Dataset drift?:", drift["dataset_drift"])
    
    # Expected output:
    # Columns drifted: 1
    # Dataset drift?: True

    🔁 Retraining Triggers — Closing the Loop

    Monitoring is only useful if it leads to action. The action is usually a retrain — fit a fresh model on recent data so it re-learns the world as it is now. Don't retrain on a single bad reading; tie it to a tiered, persistent signal:

    Signal                                   | Tier     | Action
    Mean/std shift within tolerance          | Info     | Log, keep watching
    Moderate drift, a few days               | Warning  | Investigate inputs
    Rolling accuracy below target, sustained | Critical | Retrain on fresh data
    Pipeline returns nulls / broken feature  | Critical | Fix data, then retrain

    The persistence rule ("sustained across a window") is what stops you retraining on a one-off spike — and it's the same idea as the rolling window you coded above.
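A persistence rule can be sketched in a few lines. The version below is one possible reading of the table, not a prescribed policy: the `sustained_breach` and `tier` helpers, the 3-day requirement, and the choice to treat a single low reading as a warning are all assumptions made for illustration.

```python
def sustained_breach(values, threshold, min_days):
    """True only when the last `min_days` readings are all below `threshold`."""
    if len(values) < min_days:
        return False
    return all(v < threshold for v in values[-min_days:])

def tier(rolling_history, threshold=0.85, min_days=3):
    """Map a history of rolling-accuracy readings to an alert tier."""
    if sustained_breach(rolling_history, threshold, min_days):
        return "critical"   # sustained: candidate for a retrain
    if rolling_history and rolling_history[-1] < threshold:
        return "warning"    # latest reading is low, but not yet sustained
    return "info"

print(tier([0.90, 0.84, 0.88]))  # one dip that recovered
print(tier([0.90, 0.88, 0.84]))  # latest reading is low
print(tier([0.84, 0.83, 0.82]))  # low for three readings in a row

# Output:
# info
# warning
# critical
```

Only the `critical` tier would start a retrain here, so a one-off spike never gets further than a warning.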

    5. Common Mistakes (And How to Fix Them)

    Monitoring fails in predictable ways. Here are the four that bite teams most often:

    ❌ No monitoring at all

    The model ships and nobody watches it. Accuracy quietly rots and the first signal is an angry customer.

    ✅ Fix: log every prediction with its inputs from day one, even if the only "dashboard" is a daily print of mean/std and rolling accuracy.

    ❌ Drift goes unnoticed

    You watch accuracy only. But accuracy needs labels, and by the time it drops the inputs have been drifting for weeks.

    ✅ Fix: monitor input statistics too — they're available immediately and warn you before accuracy moves.

    ❌ Ignoring label lag

    You wait for ground-truth labels that take months to arrive (did the loan default? did the user churn?), so your alerts are always late.

    ✅ Fix: lean on label-free signals (input drift, prediction-distribution shift) as your early warning; treat accuracy as confirmation, not detection.

    ❌ Alert fatigue

    Every tiny wobble fires a page. The team mutes the channel — and then misses the alert that actually mattered.

    ✅ Fix: tier alerts (info / warning / critical), only page for critical, and require the signal to persist across a rolling window before firing.
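The label-lag fix above mentions prediction-distribution shift as a label-free signal. For a binary classifier, one simple version is to compare the share of positive predictions in a reference period against the current one. This is a minimal sketch: the prediction lists, the `positive_rate` helper and the 0.2 tolerance are assumptions chosen for illustration.

```python
def positive_rate(predictions):
    """Share of predictions equal to 1 in a batch of 0/1 outputs."""
    return sum(predictions) / len(predictions)

reference_preds = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 3 of 10 positive
current_preds = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]    # 7 of 10 positive

TOLERANCE = 0.2  # hypothetical; tune per model
shift = abs(positive_rate(current_preds) - positive_rate(reference_preds))
drifted = shift > TOLERANCE
print("shift:", round(shift, 2), "drifted:", drifted)

# Output:
# shift: 0.4 drifted: True
```

A jump like this needs no ground truth, so it can fire on the same day as the traffic that caused it. It does not say whether the predictions are wrong, only that the model is behaving differently than it did on the reference data.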

    🎯 Mini-Challenge: Build a Tiny Watcher

    Time to fade the scaffolding. You've detected drift and tracked rolling accuracy separately — now combine them into one daily watcher. The starter below gives you only a comment outline and the data. Write the loop yourself, then check it against the expected output in the comments.

    🎯 Mini-Challenge: Monitoring Loop

    Comment outline only — write the logic and self-check against the expected output

    Python
    # 🎯 MINI-CHALLENGE: a tiny monitoring loop
    # Combine both signals you learned into one watcher.
    #
    # 1. You are given a list of (input_mean, accuracy) tuples, one per day.
    # 2. For each day, flag DATA DRIFT if input_mean is more than 5 away
    #    from the reference mean of 50.
    # 3. Also flag LOW ACCURACY if accuracy is below 0.85.
    # 4. Print the day number and which alerts (if any) fired.
    #
    # ✅ Expected (for the data below):
    # Day 1: ok
    # Day 2: ok
    # Day 3: DATA DRIFT
    # Day 4: DATA DRIFT, LOW ACC
    ...

    📋 Quick Reference — Model Monitoring

    What to watch   | How                          | Tool               | Alert when
    Data drift      | Mean / std vs reference, PSI | Evidently, WhyLabs | shift > tolerance
    Concept drift   | Rolling accuracy over time   | MLflow, Prometheus | below baseline target
    Latency         | p50 / p95 / p99 timings      | Grafana, DataDog   | p99 > 200ms
    Feature health  | Null %, range checks         | Great Expectations | > 5% nulls
    Retrain trigger | Sustained critical signal    | CI/CD pipeline     | persists across window
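The table lists PSI (Population Stability Index) without defining it. PSI compares two binned distributions: for each bin it takes the difference between the current and reference shares and multiplies it by the log of their ratio, then sums over bins. Here is a minimal sketch, assuming fixed bin edges picked by hand; the `bin_shares` and `psi` helpers, the edges and the small `eps` floor for empty bins are all illustrative choices. The income batches are the ones from the Evidently example.

```python
import math

def bin_shares(values, edges, eps=1e-4):
    """Share of values per bin; eps replaces empty bins so log() stays defined."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        # Walk right while v is at or past the next edge; values outside the
        # edges are clamped into the first or last bin.
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [max(c / len(values), eps) for c in counts]

def psi(reference, current, edges):
    """Sum over bins of (current - reference) * ln(current / reference)."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [50, 51, 49, 52, 48, 50, 51, 49]
current = [60, 62, 58, 64, 61, 59, 63, 60]
edges = [45, 50, 55, 60, 65]  # hypothetical bin edges

print("same batch:", psi(reference, reference, edges))
print("shifted batch is above 0.25:", psi(reference, current, edges) > 0.25)

# Output:
# same batch: 0.0
# shifted batch is above 0.25: True
```

A commonly quoted rule of thumb treats a PSI below about 0.1 as stable and above about 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees, and the result depends on how the bins are chosen.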

    ❓ Frequently Asked Questions

    Q: What is the difference between data drift and concept drift?

    A: Data drift means the inputs change shape — the distribution of features your model sees moves away from the training data (e.g. a new customer demographic). Concept drift means the relationship between inputs and the correct output changes, so the same input should now map to a different prediction (e.g. what counts as a 'good salary' rises over time). Data drift you can spot from inputs alone; concept drift usually shows up as falling accuracy once labels arrive.

    Q: How do I detect drift without any labels?

    A: Compare the statistics of incoming inputs against the training data — mean, standard deviation, min/max, or a binned distribution. If those summary numbers move beyond a tolerance, the inputs have drifted. This is exactly what the plain-Python example does, and it needs no ground-truth labels, which is why it is your first line of defence in production.

    Q: Why use a rolling window for accuracy instead of the raw number?

    A: A single day's accuracy is noisy — one unusual batch can make it spike or dip for reasons that don't matter. A rolling average over the last few days smooths that noise so a genuine downward trend stands out. You alert on the rolling value, not the raw value, to avoid firing on random wobble.

    Q: What is label lag and why does it matter for monitoring?

    A: Label lag is the delay between making a prediction and learning whether it was correct. A loan model may not know if a borrower defaults for months, so accuracy-based alerts arrive late. That is why you also monitor input drift, which is available immediately — it warns you before the (delayed) accuracy metric confirms a problem.

    Q: When should drift trigger an automatic retrain?

    A: Tie retraining to a tiered threshold rather than a single number. Minor drift logs and is watched; moderate drift opens an investigation; major sustained drift (or rolling accuracy below your service-level target) triggers retraining on fresh data. Always require the signal to persist across a window so you don't retrain on a one-off spike.

    Q: How do I avoid alert fatigue?

    A: Tier your alerts — info, warning, critical — and only page a human for critical, sustained problems. Use rolling windows and 'must persist for N periods' rules so transient blips stay silent. If every wobble pages the team, people start ignoring alerts, and the one that matters gets missed too.

    🎉

    Lesson complete — your models now have vitals and alarms!

    You can tell data drift from concept drift, detect drift by comparing the mean and std of two batches, smooth prediction quality with a rolling-accuracy metric, trip an alert below a threshold, and decide when a sustained signal should trigger a retrain. That's the full monitoring loop, from raw signal to action.

    🚀 Up next: MLOps Fundamentals — automate this whole loop so detection, retraining, and redeployment happen as a pipeline instead of by hand.
