Lesson 38 • Advanced
Evaluating AI Models
Choose and calculate the right metrics — F1, ROC-AUC, BLEU, ROUGE, perplexity, and WER for classification, NLP, and generation tasks.
✅ What You'll Learn
- Why accuracy is misleading on imbalanced datasets
- Precision vs recall tradeoff and F1 score
- BLEU and ROUGE for text generation evaluation
- Word Error Rate for speech recognition
📊 Measuring What Matters
🎯 Real-World Analogy: Evaluating an AI model with just accuracy is like judging a goalkeeper by "percentage of match time without conceding." Since shots are rare, a goalkeeper who never makes a single save still scores 99% — but that number is useless. You need metrics that capture what actually matters: how many real threats were stopped (recall), how many claimed saves were genuine (precision), and a balance of both (F1). Different tasks need different "scoring systems."
The right metric depends on your task and the cost of errors. Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive). The metric you optimise for shapes your model's behaviour.
Try It: Classification Metrics
See why 99% accuracy can be completely useless
import numpy as np
# Classification Metrics: Beyond Accuracy
# F1, ROC-AUC, Precision-Recall for imbalanced datasets
np.random.seed(42)
def compute_metrics(y_true, y_pred, y_prob=None):
    # y_prob is reserved for threshold-free metrics like ROC-AUC
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return accuracy, precision, recall, f1

# Imbalanced data: 1% positives. A model that always predicts 0
# reaches 99% accuracy while catching zero positive cases.
y_true = np.zeros(1000, dtype=int); y_true[:10] = 1
y_pred = np.zeros(1000, dtype=int)
acc, prec, rec, f1 = compute_metrics(y_true, y_pred)
print(f"Accuracy: {acc:.1%}  Precision: {prec:.2f}  Recall: {rec:.2f}  F1: {f1:.2f}")
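The `y_prob` argument points at threshold-independent evaluation, which the snippet cuts off before reaching. One way to compute ROC-AUC from scratch is the Mann-Whitney rank formulation — a sketch of that approach (not necessarily the lesson's original code), ignoring tied scores for simplicity:

```python
import numpy as np

def roc_auc(y_true, y_prob):
    """ROC-AUC as the probability that a random positive example
    is ranked above a random negative one (Mann-Whitney U)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    order = np.argsort(y_prob)
    ranks = np.empty(len(y_prob))
    ranks[order] = np.arange(1, len(y_prob) + 1)  # 1-based ranks; no tie handling
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    # Sum of positive ranks, minus the minimum possible sum, normalised
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_prob))  # 0.75
```

An AUC of 0.5 means the scores rank positives no better than chance; 1.0 means perfect separation at some threshold.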
Try It: NLP Metrics
Calculate BLEU for translation and WER for speech recognition
import numpy as np
# NLP & Generation Metrics: BLEU, ROUGE, Perplexity, WER
# Evaluating text generation and translation quality
def compute_bleu_ngram(reference, candidate, n):
    """Compute clipped n-gram precision for BLEU"""
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()
    if len(cand_words) < n:
        return 0.0
    # Count n-grams in the reference
    ref_ngrams = {}
    for i in range(len(ref_words) - n + 1):
        gram = tuple(ref_words[i:i+n])
        ref_ngrams[gram] = ref_ngrams.get(gram, 0) + 1
    # Count candidate n-grams that also occur in the reference,
    # clipping so each reference n-gram is matched at most once
    matches = 0
    total = len(cand_words) - n + 1
    for i in range(total):
        gram = tuple(cand_words[i:i+n])
        if ref_ngrams.get(gram, 0) > 0:
            ref_ngrams[gram] -= 1
            matches += 1
    return matches / total
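The lesson also promises Word Error Rate for speech recognition. A minimal sketch using the standard definition — word-level Levenshtein (edit) distance divided by reference length:

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with a word-level Levenshtein distance DP table."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i, j] = edit distance between ref[:i] and hyp[:j]
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution or match
    return d[len(ref), len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.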
⚠️ Common Mistake: Using BLEU score alone for text generation. BLEU only measures n-gram overlap — a paraphrase with different words gets a low score even if the meaning is perfect. Complement BLEU with human evaluation and semantic similarity metrics like BERTScore.
💡 Pro Tip: Always evaluate on a held-out test set that your model has never seen. For small datasets, use stratified k-fold cross-validation. For LLMs, evaluate on established benchmarks (MMLU, HumanEval, MT-Bench) rather than cherry-picked examples.
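The stratified split from the tip can be sketched from scratch — in practice you would reach for `sklearn.model_selection.StratifiedKFold`; this numpy-only version just illustrates the idea of preserving class proportions in every fold:

```python
import numpy as np

def stratified_kfold_indices(y, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs where each test fold keeps
    roughly the overall class proportions (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(n_splits)]
    # Shuffle each class separately, then deal its indices across folds
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])
        for k, chunk in enumerate(np.array_split(idx, n_splits)):
            folds[k].extend(chunk.tolist())
    for k in range(n_splits):
        test_idx = np.array(folds[k])
        train_idx = np.array([i for j in range(n_splits) if j != k
                              for i in folds[j]])
        yield train_idx, test_idx

y = np.array([0] * 90 + [1] * 10)  # 10% positives overall
for train_idx, test_idx in stratified_kfold_indices(y, n_splits=5):
    print(f"test positives: {y[test_idx].mean():.0%}")  # 10% in every fold
```

With a plain random split on data this imbalanced, some folds could easily contain zero positives, making recall undefined for that fold.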
📋 Quick Reference
| Metric | Task | Optimise When |
|---|---|---|
| Precision | Classification | False positives are costly |
| Recall | Classification | False negatives are costly |
| F1 | Classification | Both matter equally |
| ROC-AUC | Binary classification | Threshold-independent eval |
| BLEU | Translation | N-gram match to reference |
| Perplexity | Language model | Lower = better predictions |
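The table's note that lower perplexity means better predictions follows directly from its definition: the exponential of the average negative log-likelihood the model assigns to the true next tokens. A tiny worked sketch:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the
    probabilities a model assigned to the actual next tokens."""
    token_probs = np.asarray(token_probs)
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model assigning 0.5 to every true token is as uncertain as a
# fair coin; 0.1 is like guessing uniformly among 10 options.
print(perplexity([0.5, 0.5, 0.5]))  # ~2.0
print(perplexity([0.1, 0.1, 0.1]))  # ~10.0
```

Intuitively, a perplexity of k means the model is, on average, as confused as if it were choosing uniformly among k tokens at each step.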
🎉 Lesson Complete!
You now know how to properly evaluate any AI model! Next, learn how to make models smaller and faster with compression techniques.