Lesson 38 • Advanced
Evaluating AI Models
Choose and calculate the right metrics — F1, ROC-AUC, BLEU, ROUGE, perplexity, and WER for classification, NLP, and generation tasks.
✅ What You'll Learn
- Why accuracy is misleading on imbalanced datasets
- Precision vs recall tradeoff and F1 score
- BLEU and ROUGE for text generation evaluation
- Word Error Rate for speech recognition
📊 Measuring What Matters
🎯 Real-World Analogy: Evaluating an AI model with just accuracy is like judging a goalkeeper by "percentage of match time without conceding." Since shots are rare, a goalkeeper who never makes a single save still scores 99% — but that number is useless. You need metrics that capture what actually matters: how many real threats were stopped (recall), how many claimed saves were genuine (precision), and a balance of both (F1). Different tasks need different "scoring systems."
The right metric depends on your task and the cost of errors. Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive). The metric you optimise for shapes your model's behaviour.
Try It: Classification Metrics
See why 99% accuracy can be completely useless
import numpy as np
# Classification Metrics: Beyond Accuracy
# F1, ROC-AUC, Precision-Recall for imbalanced datasets
np.random.seed(42)
def compute_metrics(y_true, y_pred, y_prob=None):
    # y_prob is reserved for threshold-free metrics like ROC-AUC
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return accuracy, precision, recall, f1

# Imbalanced data: 1% positives. A model that always predicts 0
# reaches 99% accuracy while catching zero positive cases.
y_true = np.zeros(1000, dtype=int); y_true[:10] = 1
y_pred = np.zeros(1000, dtype=int)
acc, prec, rec, f1 = compute_metrics(y_true, y_pred)
print(f"Accuracy: {acc:.1%}  Precision: {prec:.2f}  Recall: {rec:.2f}  F1: {f1:.2f}")
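The `y_prob` argument points at threshold-independent evaluation, which the snippet cuts off before reaching. One way to compute ROC-AUC from scratch is the Mann-Whitney rank formulation — a sketch of that approach (not necessarily the lesson's original code), ignoring tied scores for simplicity:

```python
import numpy as np

def roc_auc(y_true, y_prob):
    """ROC-AUC as the probability that a random positive example
    is ranked above a random negative one (Mann-Whitney U)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    order = np.argsort(y_prob)
    ranks = np.empty(len(y_prob))
    ranks[order] = np.arange(1, len(y_prob) + 1)  # 1-based ranks; no tie handling
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    # Sum of positive ranks, minus the minimum possible sum, normalised
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_prob))  # 0.75
```

An AUC of 0.5 means the scores rank positives no better than chance; 1.0 means perfect separation at some threshold.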
Try It: NLP Metrics
Calculate BLEU for translation and WER for speech recognition
import numpy as np
# NLP & Generation Metrics: BLEU, ROUGE, Perplexity, WER
# Evaluating text generation and translation quality
def compute_bleu_ngram(reference, candidate, n):
    """Compute clipped n-gram precision for BLEU"""
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()
    if len(cand_words) < n:
        return 0.0
    # Count n-grams in the reference
    ref_ngrams = {}
    for i in range(len(ref_words) - n + 1):
        gram = tuple(ref_words[i:i+n])
        ref_ngrams[gram] = ref_ngrams.get(gram, 0) + 1
    # Count candidate n-grams that also occur in the reference,
    # clipping so each reference n-gram is matched at most once
    matches = 0
    total = len(cand_words) - n + 1
    for i in range(total):
        gram = tuple(cand_words[i:i+n])
        if ref_ngrams.get(gram, 0) > 0:
            ref_ngrams[gram] -= 1
            matches += 1
    return matches / total
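The lesson also promises Word Error Rate for speech recognition. A minimal sketch using the standard definition — word-level Levenshtein (edit) distance divided by reference length:

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with a word-level Levenshtein distance DP table."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i, j] = edit distance between ref[:i] and hyp[:j]
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution or match
    return d[len(ref), len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.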
⚠️ Common Mistake: Using BLEU score alone for text generation. BLEU only measures n-gram overlap — a paraphrase with different words gets a low score even if the meaning is perfect. Complement BLEU with human evaluation and semantic similarity metrics like BERTScore.
💡 Pro Tip: Always evaluate on a held-out test set that your model has never seen. For small datasets, use stratified k-fold cross-validation. For LLMs, evaluate on established benchmarks (MMLU, HumanEval, MT-Bench) rather than cherry-picked examples.
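The stratified split from the tip can be sketched from scratch — in practice you would reach for `sklearn.model_selection.StratifiedKFold`; this numpy-only version just illustrates the idea of preserving class proportions in every fold:

```python
import numpy as np

def stratified_kfold_indices(y, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs where each test fold keeps
    roughly the overall class proportions (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(n_splits)]
    # Shuffle each class separately, then deal its indices across folds
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])
        for k, chunk in enumerate(np.array_split(idx, n_splits)):
            folds[k].extend(chunk.tolist())
    for k in range(n_splits):
        test_idx = np.array(folds[k])
        train_idx = np.array([i for j in range(n_splits) if j != k
                              for i in folds[j]])
        yield train_idx, test_idx

y = np.array([0] * 90 + [1] * 10)  # 10% positives overall
for train_idx, test_idx in stratified_kfold_indices(y, n_splits=5):
    print(f"test positives: {y[test_idx].mean():.0%}")  # 10% in every fold
```

With a plain random split on data this imbalanced, some folds could easily contain zero positives, making recall undefined for that fold.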
📋 Quick Reference
| Metric | Task | Optimise When |
|---|---|---|
| Precision | Classification | False positives are costly |
| Recall | Classification | False negatives are costly |
| F1 | Classification | Both matter equally |
| ROC-AUC | Binary classification | Threshold-independent eval |
| BLEU | Translation | N-gram match to reference |
| Perplexity | Language model | Lower = better predictions |
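The table's note that lower perplexity means better predictions follows directly from its definition: the exponential of the average negative log-likelihood the model assigns to the true next tokens. A tiny worked sketch:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the
    probabilities a model assigned to the actual next tokens."""
    token_probs = np.asarray(token_probs)
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model assigning 0.5 to every true token is as uncertain as a
# fair coin; 0.1 is like guessing uniformly among 10 options.
print(perplexity([0.5, 0.5, 0.5]))  # ~2.0
print(perplexity([0.1, 0.1, 0.1]))  # ~10.0
```

Intuitively, a perplexity of k means the model is, on average, as confused as if it were choosing uniformly among k tokens at each step.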
🎉 Lesson Complete!
You now know how to properly evaluate any AI model! Next, learn how to make models smaller and faster with compression techniques.