    Lesson 38 • Advanced

    Evaluating AI Models

    By the end of this lesson you'll read a confusion matrix, pick the right metric for any task, and tell a great model from one that's only pretending.

    What You'll Learn in This Lesson

    • You'll be able to build a confusion matrix from two lists by hand
    • You'll compute accuracy, precision, recall and F1 from the four counts
    • You'll know when to read ROC-AUC instead of accuracy
    • You'll measure regression error with MAE, MSE, RMSE and R²
    • You'll run k-fold cross-validation and report a mean ± spread
    • You'll diagnose overfitting vs underfitting (high variance vs high bias) from the gap between training and test scores

    📊 Real-World Analogy: A School Report Card

    Imagine grading a student with a single number — say "78% of answers correct." It hides a lot. Did they ace maths but fail science? Did the test even cover the right topics? A real report card breaks the grade into subjects, comments, and a comparison to the class, because one number can mislead.

    Evaluating an AI model is exactly the same. One score — accuracy — can look brilliant while the model is useless. A proper evaluation is a report card: several metrics, each measuring a different subject, plus a check that you graded on questions the model never saw before. This lesson is how you write that report card.

    1. The Confusion Matrix: Where Every Metric Begins

    When a model answers yes/no questions, every prediction lands in one of four buckets. A true positive is a correct "yes"; a true negative a correct "no". A false positive is a false alarm (said yes, was no), and a false negative is a miss (said no, was yes). Lay those four counts in a grid and you have the confusion matrix.

    You don't need any library to build one — just walk the two lists together and tally. Run this:

    Worked Example: Build a Confusion Matrix

    Tally TP, TN, FP and FN from two plain lists

    Python
    # A confusion matrix sorts every prediction into one of four buckets.
    # We'll build it by hand from two lists — no libraries needed.
    
    # 1 = "positive" (e.g. the patient IS sick), 0 = "negative" (healthy)
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # the real answers
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # what the model guessed
    
    # Four counts make up the confusion matrix:
    tp = 0   # True Positive  — predicted 1, really 1  (correct hit)
    tn = 0   # True Negative  — predicted 0, really 0  (correc
    ...
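The listing above is cut off after the first two counters. A minimal sketch of the full tally, assuming the loop simply walks both lists in step with `zip`, might look like this. The data is the lesson's own; the loop body and the printed layout are a reconstruction for illustration, not the original code.

```python
# Sketch: tally the four confusion-matrix counts from two label lists.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # the real answers
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # what the model guessed

tp = tn = fp = fn = 0
for truth, guess in zip(y_true, y_pred):
    if guess == 1 and truth == 1:
        tp += 1      # correct hit
    elif guess == 0 and truth == 0:
        tn += 1      # correct "no"
    elif guess == 1 and truth == 0:
        fp += 1      # false alarm
    else:
        fn += 1      # miss: predicted 0, really 1

# Lay the four counts out as a 2x2 grid (rows = truth, columns = prediction).
print("            pred 1  pred 0")
print(f"really 1    {tp:>6}  {fn:>6}")
print(f"really 0    {fp:>6}  {tn:>6}")
```

On these two lists the tally comes to TP = 4, TN = 4, FP = 1, FN = 1, which are the counts the metrics section starts from.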

    2. Classification Metrics: Accuracy, Precision, Recall, F1

    Those four counts give you four very different views of quality. Accuracy is the fraction of all predictions that were right. Precision asks "when the model says yes, how often is it correct?" Recall asks "of everything it should have caught, how much did it?" And F1 blends precision and recall into one number using the harmonic mean, so a model can't cheat by being great at one and terrible at the other.

    accuracy = (TP + TN) / everything

    precision = TP / (TP + FP)

    recall = TP / (TP + FN)

    F1 = 2 · precision · recall / (precision + recall)
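To see why the harmonic mean stops a model from "cheating", it can help to compare it with a plain average. The numbers below are hypothetical, chosen only to make the contrast obvious: a model that flags almost nothing, so its few flags are all correct but most positives are missed.

```python
# Sketch: F1 (harmonic mean) versus a plain average of precision and recall.
# Hypothetical values, not taken from the lesson's data.
precision = 1.0   # every flag it raised was correct
recall = 0.1      # but it caught only 10% of the real positives

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"arithmetic mean: {arithmetic_mean:.2f}")   # 0.55
print(f"F1 (harmonic)  : {f1:.2f}")                # 0.18
```

The plain average still looks respectable, while F1 stays close to the weaker of the two numbers.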

    Worked Example: Metrics From the Counts

    Derive accuracy, precision, recall and F1 with plain arithmetic

    Python
    # From the four confusion-matrix counts you can derive every metric.
    tp, tn, fp, fn = 4, 4, 1, 1
    
    # Accuracy: of ALL predictions, how many were correct?
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    
    # Precision: of everything FLAGGED positive, how many really were?
    #   "When the model says yes, how often is it right?"
    precision = tp / (tp + fp)
    
    # Recall (sensitivity): of all REAL positives, how many did we catch?
    #   "Of everything we should have caught, how much did we?"
    recall = tp / (tp + fn)
    
    ...
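The listing above stops before F1. One way to round it off is a small helper that returns all four metrics at once. The function name `classification_metrics` and the choice to report an undefined ratio as 0.0 are assumptions made for this sketch.

```python
# Sketch: all four metrics from the confusion counts, with zero-division guards.
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 as a dict.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=4, tn=4, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name:<9}: {value:.2f}")
```

With the lesson's counts (4, 4, 1, 1) every metric comes out at 0.80. The guards matter for degenerate cases such as a model that never predicts positive, where `tp + fp` is zero.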

    🛠️ The Real Tool: scikit-learn

    You hand-rolled the metrics above to understand them — but in real projects you call a library. Here is the exact same data scored by scikit-learn, including the ROC-AUC you can't easily do by hand. The # Expected output comment shows what you'd see if you ran it locally.

    Worked Example: sklearn Metrics

    The same numbers, computed by the library you'll actually use

    Python
    # In real projects you don't hand-roll these — scikit-learn does it for you.
    # This is what the worked example above looks like with the library.
    from sklearn.metrics import (
        accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
    )
    
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
    # Predicted PROBABILITY of class 1 (ROC-AUC needs scores, not labels):
    y_prob = [0.9, 0.1, 0.4, 0.8, 0.2, 0.6, 0.7, 0.3, 0.1, 0.95]
    
    print("accuracy :", accuracy_scor
    ...

    Notice the metrics match your hand calculations exactly. The library just saves you the arithmetic — and adds ROC-AUC, which needs the probability column y_prob.
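ROC-AUC is tedious by hand on real datasets, but for ten rows it can be checked directly. One standard reading of ROC-AUC is the probability that a randomly chosen real positive gets a higher score than a randomly chosen real negative, with ties counting as half. The sketch below counts those pairs using the lesson's `y_true` and `y_prob`; it is meant as a sanity check on the idea, not a replacement for the library call.

```python
# Sketch: ROC-AUC as "how often does a real positive outscore a real negative?"
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.1, 0.4, 0.8, 0.2, 0.6, 0.7, 0.3, 0.1, 0.95]

pos_scores = [p for t, p in zip(y_true, y_prob) if t == 1]
neg_scores = [p for t, p in zip(y_true, y_prob) if t == 0]

wins = 0.0
for pos in pos_scores:
    for neg in neg_scores:
        if pos > neg:
            wins += 1.0      # positive ranked above negative
        elif pos == neg:
            wins += 0.5      # a tie counts as half a win

auc = wins / (len(pos_scores) * len(neg_scores))
print(f"ROC-AUC: {auc:.2f}")   # 0.96
```

Of the 25 positive/negative pairs, only one is ranked the wrong way round (the positive scored 0.4 against the negative scored 0.6), which gives 24 / 25 = 0.96.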

    Try It Yourself: Precision and Recall

    Fill in the two formulas using the confusion counts

    Python
    # 🎯 YOUR TURN — compute precision and recall from confusion counts
    # A spam filter was tested on 100 emails. Here are its four counts:
    
    tp = 35   # spam correctly caught
    fp = 5    # good emails wrongly flagged as spam (false alarms)
    fn = 10   # spam that slipped through (misses)
    tn = 50   # good emails correctly let through
    
    # 1) Precision = TP / (TP + FP)   "when it flags spam, how often is it right?"
    precision = ___          # 👉 replace ___ using tp and fp
    
    # 2) Recall = TP / (TP + FN)      
    ...

    3. Regression Metrics: MAE, MSE, RMSE, R²

    Classification predicts a category; regression predicts a number — a price, a temperature, an age. So instead of counting hits and misses, you measure how far off the predictions are.

    MAE (mean absolute error) is the average size of the error in the data's own units. MSE squares the errors first, so big misses count extra; RMSE takes the square root of MSE to get back into the original units. R² reports the fraction of the data's variation the model explains: 1.0 is perfect, 0 is no better than always guessing the average, and a negative value means the model does worse than that.

    Worked Example: Regression Error

    Compute MAE, MSE, RMSE and R² from two lists

    Python
    # Regression predicts a NUMBER (a price, a temperature), not a class.
    # So we measure how far off we are, on average.
    
    y_true = [3.0, -0.5, 2.0, 7.0, 4.5]   # the real values
    y_pred = [2.5,  0.0, 2.0, 8.0, 4.0]   # the model's predictions
    n = len(y_true)
    
    # Mean Absolute Error: average size of the error (same units as the data).
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    
    # Mean Squared Error: average of the SQUARED errors (punishes big misses).
    mse = sum((t - p) ** 2 for t, p in 
    ...
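The listing above is truncated partway through the MSE line. A self-contained sketch of all four regression metrics on the same two lists could look like the following. The intermediate names (`errors`, `ss_res`, `ss_tot`) are choices made for this sketch, and R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Sketch: MAE, MSE, RMSE and R² from two lists, standard library only.
import math

y_true = [3.0, -0.5, 2.0, 7.0, 4.5]   # the real values
y_pred = [2.5,  0.0, 2.0, 8.0, 4.0]   # the model's predictions
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n    # average absolute error
mse = sum(e ** 2 for e in errors) / n    # average squared error
rmse = math.sqrt(mse)                    # back in the data's own units

# R²: compare the model's squared error with that of "always guess the mean".
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE : {mae:.3f}")
print(f"MSE : {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²  : {r2:.3f}")
```

For this data the values work out to roughly MAE 0.5, MSE 0.35, RMSE 0.59 and R² 0.94.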

    Try It Yourself: Compute RMSE

    Fill in MSE and RMSE from two lists of prices

    Python
    # 🎯 YOUR TURN — compute RMSE from two lists
    # A model predicted house prices (in £1000s). Compare to the real prices.
    
    y_true = [200, 350, 500, 275]   # real prices
    y_pred = [210, 330, 540, 260]   # predicted prices
    n = len(y_true)
    
    # 1) Mean Squared Error: average of the squared differences.
    mse = ___ / n            # 👉 sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    
    # 2) RMSE: the square root of MSE (use ** 0.5).
    rmse = ___               # 👉 take the square root of mse
    
    print(f"MSE : {ms
    ...

    4. Cross-Validation: Don't Trust a Single Split

    Score a model on one random test split and you might just get lucky, or unlucky. k-fold cross-validation fixes that: it chops the data into k roughly equal parts (folds), trains on k−1 of them and tests on the one held out, then rotates so every row gets tested exactly once. You report the average of the k scores.

    The spread of those scores matters too. A small standard deviation means the result is stable; a big one means your model's quality swings with the split — a warning sign.
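The rotation described above can be sketched with plain index lists. The helper name `kfold_indices` is hypothetical, and the sketch assumes rows are taken in their existing order; in practice you would usually shuffle the rows first, or use a library splitter.

```python
# Sketch: generate train/test row indices for k-fold cross-validation.
def kfold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumptions: no shuffling, and when n_rows is not divisible by k
    the first few folds receive one extra row each.
    """
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, test
        start = stop

for fold, (train, test) in enumerate(kfold_indices(n_rows=10, k=5), start=1):
    print(f"fold {fold}: test rows {test}, train on {len(train)} rows")
```

Each row index appears in exactly one test fold, so every row is scored once by a model that never trained on it.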

    Worked Example: k-Fold Averaging

    Average the fold scores and measure how stable they are

    Python
    # Cross-validation gives a more trustworthy score than a single split.
    # k-fold splits the data into k parts: train on k-1, test on the held-out 1,
    # and rotate so every row is tested exactly once. Here we fake the scores
    # you'd get back from each fold to show how the averaging works.
    
    fold_scores = [0.82, 0.79, 0.85, 0.81, 0.78]   # accuracy from each of 5 folds
    k = len(fold_scores)
    
    mean_score = sum(fold_scores) / k
    
    # Standard deviation tells you how STABLE the model is across folds.
    varianc
    ...
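The listing above is cut off at the variance step. A short sketch of the "mean ± spread" report, using the standard library's `statistics` module on the same five fold scores, might be as follows. Whether the original divides by k or by k − 1 is not visible in the truncated code, so the population form used here is an assumption.

```python
# Sketch: report cross-validation results as mean ± standard deviation.
import statistics

fold_scores = [0.82, 0.79, 0.85, 0.81, 0.78]   # accuracy from each of 5 folds

mean_score = statistics.mean(fold_scores)
# Assumption: population standard deviation (divide by k), which matches
# NumPy's default. statistics.stdev() would divide by k - 1 instead.
std_score = statistics.pstdev(fold_scores)

print(f"accuracy: {mean_score:.3f} ± {std_score:.3f}")
```

For these scores the mean is 0.81 and the population standard deviation is about 0.024, so the folds agree to within a few percentage points.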

    ⚖️ Overfitting vs Underfitting (Bias-Variance)

    Comparing training scores with scores on held-out data lets you catch two opposite failures. Underfitting (high bias) is a model too simple to learn the pattern: it scores poorly on both the training data and new data. Overfitting (high variance) is a model that memorised the training data, including its noise: it scores brilliantly on training but badly on anything new.

    Underfitting

    Train low, test low. Too simple — add features or a more flexible model.

    Just Right

    Train high, test high, small gap. The sweet spot you're aiming for.

    Overfitting

    Train high, test low, big gap. Too complex — get more data or regularise.
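The three cases above can be turned into a rough rule of thumb. The function below is only an illustration: the name `diagnose` and both thresholds (`min_score`, `max_gap`) are hypothetical, and sensible cut-offs depend on the task and the metric.

```python
# Sketch: label a train/test score pair as one of the three cases.
def diagnose(train_score, test_score, min_score=0.75, max_gap=0.10):
    """Return 'underfitting', 'overfitting' or 'just right'.

    min_score and max_gap are made-up thresholds for illustration only.
    """
    gap = train_score - test_score
    if train_score < min_score:
        return "underfitting"     # the model has not even learned the training data
    if gap > max_gap:
        return "overfitting"      # train high, test much lower
    return "just right"           # train high, test high, small gap

print(diagnose(0.62, 0.60))   # underfitting
print(diagnose(0.99, 0.71))   # overfitting
print(diagnose(0.88, 0.85))   # just right
```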

    🎯 Mini-Challenge: Score a Classifier

    Time to fly solo. You're given a confusion matrix in the comments. Compute all four classification metrics from scratch — only the outline is provided, no formulas filled in.

    Mini-Challenge

    Build accuracy, precision, recall and F1 from four counts

    Python
    # 🎯 MINI-CHALLENGE: Score a binary classifier
    # A model was tested on 200 cases. Its confusion matrix:
    #   TP = 60,  FP = 20,  FN = 30,  TN = 90
    #
    # 1. Store the four counts in variables tp, fp, fn, tn
    # 2. Compute accuracy  = (tp + tn) / (tp + tn + fp + fn)
    # 3. Compute precision = tp / (tp + fp)
    # 4. Compute recall    = tp / (tp + fn)
    # 5. Compute f1        = 2 * precision * recall / (precision + recall)
    # 6. Print each value rounded to 2 decimals
    #
    # ✅ Expected output:
    # Accuracy:  0.75
    # Pr
    ...

    5. Common Mistakes (And How to Fix Them)

    These four traps catch almost every beginner. Spot them before they fool you.

    ❌ Trusting accuracy on imbalanced data

    If 99% of cases are negative, a model that always says "negative" hits 99% accuracy and catches nothing:

    # 990 healthy, 10 sick; model predicts "healthy" for everyone
    accuracy = 990 / 1000   # 0.99  ← looks amazing
    recall   = 0 / 10       # 0.00  ← caught zero sick patients!

    ✅ Fix: report precision, recall, F1 and ROC-AUC alongside accuracy.

    ❌ Evaluating on the training set

    Scoring on the same data the model learned from rewards memorisation, not understanding:

    model.fit(X_train, y_train)
    score = model.score(X_train, y_train)   # ❌ inflated — it has seen this

    ✅ Fix: always score on a held-out test set the model has never seen.

    score = model.score(X_test, y_test)     # ✅ honest
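To make the point concrete without any library, here is a toy model that does nothing but memorise. The `Memoriser` class and the data are made up for this sketch; its `fit` and `score` methods only mimic the shape of the calls above.

```python
# Sketch: a pure lookup table scores perfectly on data it has seen.
class Memoriser:
    """Toy model (hypothetical): stores every training row in a dict."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        return self

    def predict(self, X):
        # Inputs it has never seen fall back to a fixed guess of 0.
        return [self.table.get(x, 0) for x in X]

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

# Made-up data: inputs are plain integers, labels alternate 0/1.
X_train, y_train = list(range(0, 10)), [i % 2 for i in range(0, 10)]
X_test, y_test = list(range(10, 20)), [i % 2 for i in range(10, 20)]

model = Memoriser().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))   # 1.0
print("test accuracy :", model.score(X_test, y_test))     # 0.5
```

The training score says the model is perfect; the test score shows it has learned nothing that transfers.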

    ❌ Reporting one lucky split, with no cross-validation

    A single train/test split can hand you a flattering (or unfair) number by chance.

    ✅ Fix: use k-fold cross-validation and report the mean ± standard deviation, not one number.

    ❌ Optimising the wrong metric

    Maximising precision on a cancer screen means you'll miss sick patients (low recall) to avoid false alarms — the opposite of what matters here.

    ✅ Fix: choose the metric from the cost of errors. Disease detection → recall. Spam filter → precision. Unsure → F1.
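The trade-off behind this choice shows up as soon as you move the decision threshold. The sketch below reuses the lesson's `y_true` and `y_prob` from the scikit-learn example; the helper name `precision_recall_at` and the three thresholds are assumptions picked for illustration.

```python
# Sketch: how the decision threshold trades precision against recall.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.1, 0.4, 0.8, 0.2, 0.6, 0.7, 0.3, 0.1, 0.95]

def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are flagged positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, g in zip(y_true, y_pred) if t == 1 and g == 1)
    fp = sum(1 for t, g in zip(y_true, y_pred) if t == 0 and g == 1)
    fn = sum(1 for t, g in zip(y_true, y_pred) if t == 1 and g == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.25, 0.5, 0.85):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

A low threshold (0.25) catches every positive at the cost of more false alarms (precision about 0.71, recall 1.00), while a high one (0.85) raises no false alarms but misses most positives (precision 1.00, recall 0.40). Which end you lean towards follows from which error is more expensive.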

    📋 Quick Reference

    Metric     Task                   Formula / Meaning         Use When
    Accuracy   Classification         (TP+TN)/all               Classes are balanced
    Precision  Classification         TP/(TP+FP)                False positives costly
    Recall     Classification         TP/(TP+FN)                False negatives costly
    F1         Classification         Harmonic mean of P & R    Both matter equally
    ROC-AUC    Binary classification  0.5 random, 1.0 perfect   Threshold-independent
    MAE        Regression             avg |y−ŷ|                 Easy-to-explain error
    RMSE       Regression             √(avg squared error)      Big misses count extra
    R²         Regression             variance explained        1.0 perfect, 0 = mean

    ❓ Frequently Asked Questions

    Q: What is a confusion matrix?

    A: A 2x2 table that sorts predictions into four buckets: true positives, true negatives, false positives, and false negatives. Every classification metric — accuracy, precision, recall, F1 — is calculated from these four counts.

    Q: What is the difference between precision and recall?

    A: Precision asks 'when the model flags something positive, how often is it right?' (TP / (TP + FP)). Recall asks 'of all the real positives, how many did we catch?' (TP / (TP + FN)). Optimise precision when false alarms are costly, recall when missing a real case is costly.

    Q: Why is accuracy misleading on imbalanced data?

    A: If 99% of cases are negative, a model that always predicts 'negative' scores 99% accuracy while catching zero positives. Accuracy hides this failure; precision, recall, F1, and ROC-AUC expose it.

    Q: When should I use RMSE versus MAE?

    A: Both measure regression error in the data's own units. MAE is the average absolute error and is easy to explain. RMSE squares the errors first, so it penalises large mistakes more heavily — use it when big misses are especially bad.

    Q: What does k-fold cross-validation do?

    A: It splits your data into k parts, trains on k-1 and tests on the held-out part, then rotates so every row is tested once. Averaging the k scores gives a more reliable estimate than a single train/test split, and the spread tells you how stable the model is.

    Q: What is the difference between overfitting and underfitting?

    A: Underfitting (high bias) means the model is too simple and does poorly on both training and test data. Overfitting (high variance) means it memorised the training data and does well there but poorly on new data. The gap between training and test scores is the giveaway.

    🎉

    Lesson complete — you can now write a model's report card!

    You can build a confusion matrix by hand, derive accuracy, precision, recall and F1, measure regression error with MAE/RMSE/R², cross-validate for a trustworthy score, and read the bias-variance gap to spot overfitting. Most importantly, you know that the metric you choose shapes the model you build.

    🚀 Up next: Model Compression — make a good model smaller and faster without losing the quality you just learned to measure.
