Lesson 38 • Advanced

    Evaluating AI Models

    Choose and calculate the right metrics — F1, ROC-AUC, BLEU, ROUGE, perplexity, and WER for classification, NLP, and generation tasks.

    ✅ What You'll Learn

    • Why accuracy is misleading on imbalanced datasets
    • Precision vs recall tradeoff and F1 score
    • BLEU and ROUGE for text generation evaluation
    • Word Error Rate for speech recognition

    📊 Measuring What Matters

    🎯 Real-World Analogy: Evaluating an AI model with accuracy alone is like judging a goalkeeper by the percentage of match time in which they don't concede. Behind a dominant defence, a keeper who never touches the ball scores 99%, yet that tells you nothing about their skill. You need metrics that capture what actually matters: the share of shots stopped (recall), the share of attempted saves that succeed (precision), and a balance of both (F1). Different tasks need different scoring systems.

    The right metric depends on your task and the cost of errors. Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive). The metric you optimise for shapes your model's behaviour.

    Try It: Classification Metrics

    See why 99% accuracy can be completely useless

    Python
    import numpy as np
    
    # Classification Metrics: Beyond Accuracy
    # F1, ROC-AUC, Precision-Recall for imbalanced datasets
    
    np.random.seed(42)
    
    def compute_metrics(y_true, y_pred, y_prob=None):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    
    # Imbalanced data: 1% positives. A model that always predicts
    # "negative" scores 99% accuracy but catches zero positives.
    y_true = np.zeros(1000, dtype=int)
    y_true[:10] = 1
    y_pred = np.zeros(1000, dtype=int)
    
    m = compute_metrics(y_true, y_pred)
    print(f"Accuracy:  {m['accuracy']:.1%}")   # 99.0%
    print(f"Precision: {m['precision']:.1%}")  # 0.0%
    print(f"Recall:    {m['recall']:.1%}")     # 0.0%
    print(f"F1:        {m['f1']:.1%}")         # 0.0%
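    The snippet's header also mentions ROC-AUC. It can be computed from scratch via the rank-sum (Mann-Whitney U) formulation: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch, assuming no tied probability scores (production code such as sklearn's roc_auc_score handles ties via average ranks):

```python
import numpy as np

def roc_auc(y_true, y_prob):
    # Rank every example by predicted probability (1 = lowest score).
    order = np.argsort(y_prob)
    ranks = np.empty(len(y_prob))
    ranks[order] = np.arange(1, len(y_prob) + 1)

    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos

    # Mann-Whitney U statistic, normalised to [0, 1].
    rank_sum = ranks[np.asarray(y_true) == 1].sum()
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc(y_true, y_prob))  # 0.75
```

    An AUC of 0.5 means the scores rank positives no better than chance; 1.0 means every positive outranks every negative.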

    Try It: NLP Metrics

    Calculate BLEU for translation and WER for speech recognition

    Python
    import numpy as np
    
    # NLP & Generation Metrics: BLEU, ROUGE, Perplexity, WER
    # Evaluating text generation and translation quality
    
    def compute_bleu_ngram(reference, candidate, n):
        """Compute n-gram precision for BLEU"""
        ref_words = reference.lower().split()
        cand_words = candidate.lower().split()
        
        if len(cand_words) < n:
            return 0.0
        
        ref_ngrams = {}
        for i in range(len(ref_words) - n + 1):
            gram = tuple(ref_words[i:i+n])
            ref_ngrams[gram] = ref_ngrams.get(gram, 0) + 1
        
        # Clipped precision: a candidate n-gram can only match as many
        # times as it appears in the reference.
        cand_counts = {}
        total = len(cand_words) - n + 1
        for i in range(total):
            gram = tuple(cand_words[i:i+n])
            cand_counts[gram] = cand_counts.get(gram, 0) + 1
        
        matches = sum(min(count, ref_ngrams.get(gram, 0))
                      for gram, count in cand_counts.items())
        return matches / total
    
    reference = "the cat is on the mat"
    candidate = "the cat sat on the mat"
    print(f"1-gram precision: {compute_bleu_ngram(reference, candidate, 1):.2f}")  # 0.83
    print(f"2-gram precision: {compute_bleu_ngram(reference, candidate, 2):.2f}")  # 0.60
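    Word Error Rate, the speech-recognition metric from the learning goals, is word-level Levenshtein distance divided by reference length. A minimal sketch (function name and example sentences are illustrative):

```python
def word_error_rate(reference, hypothesis):
    # WER = (substitutions + insertions + deletions) / reference word count,
    # computed with a classic edit-distance dynamic-programming table.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)

    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six: WER ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

    Note that WER can exceed 1.0 when the hypothesis contains many insertions.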

    ⚠️ Common Mistake: Using BLEU score alone for text generation. BLEU only measures n-gram overlap — a paraphrase with different words gets a low score even if the meaning is perfect. Complement BLEU with human evaluation and semantic similarity metrics like BERTScore.

    💡 Pro Tip: Always evaluate on a held-out test set that your model has never seen. For small datasets, use stratified k-fold cross-validation. For LLMs, evaluate on established benchmarks (MMLU, HumanEval, MT-Bench) rather than cherry-picked examples.
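    The stratified k-fold idea above is simple to sketch: shuffle each class's indices separately, then deal them round-robin across folds so every fold preserves the class ratio. A minimal illustration (in practice use sklearn's StratifiedKFold; the function name and seed here are assumptions):

```python
import numpy as np

def stratified_kfold_indices(y, k, seed=0):
    # Deal each class's shuffled indices round-robin into k folds,
    # so every fold keeps roughly the overall class proportions.
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    for i in range(k):
        test = np.array(folds[i])
        train = np.concatenate([np.array(folds[j]) for j in range(k) if j != i])
        yield train, test

# 10 negatives, 5 positives: each of 5 folds tests on 2 negatives + 1 positive.
y = np.array([0] * 10 + [1] * 5)
for train, test in stratified_kfold_indices(y, k=5):
    print(f"test fold: {np.sum(y[test] == 0)} neg, {np.sum(y[test] == 1)} pos")
```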

    📋 Quick Reference

    Metric     | Task                  | Optimise When
    -----------|-----------------------|----------------------------
    Precision  | Classification        | False positives are costly
    Recall     | Classification        | False negatives are costly
    F1         | Classification        | Both matter equally
    ROC-AUC    | Binary classification | Threshold-independent eval
    BLEU       | Translation           | N-gram match to reference
    Perplexity | Language model        | Lower = better predictions
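    The perplexity row deserves one worked line of code: perplexity is the exponential of the average negative log-likelihood per token, so a model that is uniformly uncertain over V choices has perplexity exactly V. A minimal sketch:

```python
import math

def perplexity(token_probs):
    # Perplexity = exp(mean negative log-probability per token).
    # Intuitively: the effective number of options the model is choosing among.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 tokens -> 4.0
print(perplexity([0.9, 0.8, 0.95]))          # confident model -> about 1.14
```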

    🎉 Lesson Complete!

    You now know how to properly evaluate any AI model! Next, learn how to make models smaller and faster with compression techniques.
