Lesson 43 • Advanced

    Monitoring Models in Production 📊

    Deploying a model is just the beginning. Learn to detect data drift, outliers, bias, and performance degradation before they cause real damage.

    What You'll Learn in This Lesson

    • How to detect data drift using PSI (Population Stability Index)
    • Monitoring for concept drift with accuracy tracking over time
    • Catching anomalous inputs before they corrupt predictions
    • Fairness monitoring with disparate impact and equal opportunity metrics
    • Building automated alerting pipelines for production ML

    1️⃣ Why Models Degrade

    A model that was 95% accurate at deployment can silently drop to 70% within weeks. The three main causes:

    Type           What Changes                        Example
    Data Drift     Input distribution shifts           New customer demographics
    Concept Drift  Input→output relationship changes   "Good credit" threshold shifts
    Feature Drift  Feature pipeline breaks             API returns null for a column

    💡 Pro Tip: Monitor all three: most teams only check accuracy and miss silent data issues.

    Try It: Data Drift Detection

    Calculate PSI to detect when your input data shifts from the training distribution

    Python
    import numpy as np
    
    # ============================================
    # DATA DRIFT DETECTION
    # ============================================
    # Data drift = input distribution changes over time
    # Model was trained on summer data, now it's winter
    
    np.random.seed(42)
    
    print("=== Population Stability Index (PSI) ===")
    print()
    print("PSI measures how much the input distribution shifted.")
    print("Think of it like checking if your customers changed.")
    print()
    
    def calculate_psi(reference, current, bins=10):
    ...
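The snippet above is truncated, so here is one complete, hedged sketch of a standard PSI implementation: bin edges come from the reference (training) distribution, and a small epsilon guards against empty bins. The sample data and the 0.1/0.25 comparison are illustrative, not the lesson sandbox's exact code.

```python
import numpy as np

def calculate_psi(reference, current, bins=10):
    """Population Stability Index between two samples of one feature."""
    # Bin edges are fixed from the reference (training) distribution
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    # Convert counts to proportions; epsilon avoids log(0) / division by zero
    eps = 1e-6
    ref_pct = ref_counts / ref_counts.sum() + eps
    cur_pct = cur_counts / cur_counts.sum() + eps

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

np.random.seed(42)
train = np.random.normal(50, 10, 10_000)   # training-time distribution
same  = np.random.normal(50, 10, 10_000)   # same distribution: no drift
drift = np.random.normal(60, 10, 10_000)   # mean shifted by one std: drift

print(f"PSI (no drift): {calculate_psi(train, same):.4f}")   # well below 0.1
print(f"PSI (drifted):  {calculate_psi(train, drift):.4f}")  # well above 0.25
```

A common rule of thumb (used in the Quick Reference below): PSI under 0.1 is stable, 0.1 to 0.25 warrants investigation, and above 0.25 suggests retraining.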

    2️⃣ Monitoring Architecture

    A production monitoring system has these layers:

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚         Alerting (PagerDuty/Slack)  β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚    Dashboard (Grafana/DataDog)      β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚  Metric Store (Prometheus/BigQuery) β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚   Collectors (input/output loggers) β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚     Model Inference Service         β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

    Key metrics to track:

    • Latency: p50, p95, p99 response times
    • Throughput: Requests per second
    • Error rate: Failed predictions / total
    • Feature distributions: Mean, std, min, max per feature
    • Prediction distribution: Class balance or output range
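The metrics above can all be computed from a per-request log. A minimal sketch, assuming simulated lognormal latencies and a made-up success flag in place of a real request log:

```python
import numpy as np

np.random.seed(42)

# Hypothetical per-request log: latency in ms plus a success flag
latencies_ms = np.random.lognormal(mean=3.5, sigma=0.4, size=10_000)
succeeded = np.random.rand(10_000) > 0.002   # roughly 0.2% failures

# Latency percentiles, throughput proxy, and error rate
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
error_rate = 1 - succeeded.mean()

print(f"Latency p50/p95/p99: {p50:.1f} / {p95:.1f} / {p99:.1f} ms")
print(f"Error rate: {error_rate:.3%}")

# Per-feature summary stats for the current window
features = {"age": np.random.normal(35, 8, 10_000)}
for name, values in features.items():
    print(f"{name}: mean={values.mean():.1f} std={values.std():.1f} "
          f"min={values.min():.1f} max={values.max():.1f}")
```

In production these numbers would be pushed to a metric store (Prometheus, BigQuery) on a schedule rather than printed.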

    Try It: Anomaly Detection in Production

    Catch outlier inputs using z-score monitoring before they corrupt predictions

    Python
    import numpy as np
    
    # ============================================
    # OUTLIER & ANOMALY DETECTION IN PRODUCTION
    # ============================================
    np.random.seed(42)
    
    print("=== Production Anomaly Detection ===")
    print()
    print("Your model expects certain input ranges.")
    print("Outliers can cause silent failures or wrong predictions.")
    print()
    
    # Simulate normal feature distributions from training
    feature_names = ["age", "income", "credit_score", "loan_amount"]
    train_means = [35, 55000
    ...
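The code above cuts off mid-list, so here is one complete sketch of the z-score check it sets up. The per-feature means and standard deviations are assumed values for illustration (only `[35, 55000` appears in the original), as are the example requests.

```python
import numpy as np

np.random.seed(42)

# Training-time statistics per feature (assumed values for illustration)
feature_names = ["age", "income", "credit_score", "loan_amount"]
train_means = np.array([35.0, 55_000.0, 680.0, 15_000.0])
train_stds  = np.array([10.0, 20_000.0, 80.0, 8_000.0])

def check_input(x, threshold=3.0):
    """Return (feature, z-score) pairs exceeding the z-score threshold."""
    z = np.abs((np.asarray(x, dtype=float) - train_means) / train_stds)
    return [(name, float(score))
            for name, score in zip(feature_names, z)
            if score > threshold]

normal_request  = [40, 60_000, 700, 12_000]
outlier_request = [40, 60_000, 700, 95_000]   # loan_amount 10 std out

print("Normal request flags: ", check_input(normal_request))
print("Outlier request flags:", check_input(outlier_request))
```

Flagged requests can be logged, routed to a fallback model, or rejected, depending on how costly a wrong prediction is.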

    Try It: Bias Monitoring

    Track fairness metrics across demographic groups with disparate impact analysis

    Python
    import numpy as np
    
    # ============================================
    # BIAS MONITORING IN PRODUCTION
    # ============================================
    np.random.seed(42)
    
    print("=== Fairness Metrics Monitoring ===")
    print()
    print("Even a fair model at training can become biased in production")
    print("as user demographics shift over time.")
    print()
    
    # Simulate loan approval predictions across groups
    groups = {
        "Group A (majority)": {"total": 5000, "approved": 3800, "default_rate": 0.05},
        "Grou
    ...
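The bias-monitoring snippet is also truncated. A minimal sketch of the disparate impact calculation it begins: Group A's counts come from the original, while Group B's counts are made up for illustration. Disparate impact is the protected group's approval rate divided by the reference group's, with the common "four-fifths" threshold of 0.8.

```python
# Approval counts per demographic group (Group B figures are assumed)
groups = {
    "Group A (majority)": {"total": 5000, "approved": 3800},
    "Group B (minority)": {"total": 1200, "approved": 700},
}

# Approval rate per group
rates = {name: g["approved"] / g["total"] for name, g in groups.items()}
for name, rate in rates.items():
    print(f"{name}: approval rate {rate:.1%}")

# Disparate impact ratio: protected group rate / reference group rate
di = rates["Group B (minority)"] / rates["Group A (majority)"]
print(f"Disparate impact ratio: {di:.2f}")
if di < 0.8:   # "four-fifths" rule threshold
    print("ALERT: ratio below 0.8 -- investigate for bias")
```

With these illustrative numbers the ratio lands just under 0.8, which is exactly the kind of borderline case that only shows up if the metric is tracked continuously.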

    3️⃣ Common Mistakes

    ⚠️ Only monitoring accuracy: data drift happens before accuracy drops. By the time accuracy falls, the damage is done.
    ⚠️ No baseline comparison: always compare current metrics against a reference window, not just absolute thresholds.
    ⚠️ Alert fatigue: too many alerts and teams ignore them. Tier alerts: info → warning → critical.
    💡 Pro Tip: Store raw predictions and inputs for at least 30 days. When something breaks, you'll need them to debug.
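The alert-tiering advice can be sketched as a tiny helper; the function name and thresholds here are illustrative, not a specific tool's API.

```python
def tier_alert(value, warn, crit):
    """Classify a metric value as info / warning / critical."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "info"

# Example: tiering PSI using the 0.1 / 0.25 rule of thumb
for psi in (0.08, 0.15, 0.31):
    print(f"PSI {psi:.2f} -> {tier_alert(psi, warn=0.1, crit=0.25)}")
```

Routing only "critical" to a pager and "warning" to a dashboard or Slack channel keeps the on-call signal clean.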

    📋 Quick Reference: Model Monitoring

    Metric            Tool                 Alert Threshold
    PSI               Evidently, WhyLabs   > 0.25 → retrain
    Accuracy          MLflow, Prometheus   < baseline - 5%
    Latency p99       Grafana, DataDog     > 200 ms
    Disparate Impact  Fairlearn, AIF360    < 0.8 ratio
    Feature Null %    Great Expectations   > 5% nulls

    🎉 Lesson Complete!

    You can now monitor ML models in production! Next, learn how to automate the entire ML lifecycle with MLOps pipelines.
