Lesson 34 • Advanced

    Speech Recognition & Audio ML

    Convert speech to text — learn mel spectrograms, CTC decoding, and how Whisper achieves near-human transcription accuracy.

    ✅ What You'll Learn

    • Mel spectrograms: converting audio to 2D representations
    • CTC decoding: aligning audio frames to text
    • Modern ASR models: Whisper, wav2vec 2.0, Conformer
    • Word Error Rate (WER) and evaluation

    🎙️ Teaching Machines to Listen

    🎯 Real-World Analogy: Speech recognition is like translating sheet music. First, you convert the raw sound waves into a visual representation (mel spectrogram = the sheet music). Then a "reader" (neural network) interprets the notes into words. The challenge is that people speak at different speeds, with accents, and with background noise — like reading messy, smudged sheet music.

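    Modern speech recognition typically has two stages: (1) convert raw audio into a mel spectrogram (a 2D time-frequency representation whose frequency axis approximates human pitch perception), then (2) feed it to a neural network that outputs text, either frame by frame with CTC decoding or token by token with a sequence-to-sequence model such as Whisper. Some models, such as wav2vec 2.0, skip the spectrogram and learn features directly from the raw waveform.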

    Try It: Mel Spectrograms

    Convert audio into the 2D representation that speech models process

    Python
    import numpy as np
    
    # Mel Spectrogram: Converting Audio to "Images" for Neural Networks
    # The key preprocessing step for many modern speech models (e.g. Whisper, Conformer)
    
    np.random.seed(42)
    
    def hz_to_mel(hz):
        return 2595 * np.log10(1 + hz / 700)
    
    def mel_to_hz(mel):
        return 700 * (10 ** (mel / 2595) - 1)
    
    def create_mel_filterbank(n_mels, n_fft, sample_rate):
        """Create triangular mel-scale filter bank"""
        mel_min = hz_to_mel(0)
        mel_max = hz_to_mel(sample_rate / 2)
        mel_points = np.linspace(mel_min
    ...
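
The listing above is cut off, so the following is a separate, self-contained sketch of the same idea rather than a reconstruction of it. It frames the signal, applies a Hann window and FFT, and projects the power spectrum onto triangular mel filters. The function names `mel_filterbank` and `log_mel_spectrogram`, and the defaults (`n_fft=400`, `hop=160`, `n_mels=40`, i.e. a 25 ms window and 10 ms hop at 16 kHz), are assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (illustrative helper)."""
    # n_mels + 2 edge points: each filter needs a left edge, a peak, and a right edge
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)  # shape (n_fft // 2 + 1,)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # The triangle is the lower of the two slopes, clipped at zero
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(signal, sample_rate, n_fft=400, hop=160, n_mels=40):
    """Return an array of shape (n_frames, n_mels) with log mel energies."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Index matrix that slices the signal into overlapping frames
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)  # small constant avoids log(0)

# One second of a 440 Hz tone standing in for real audio
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

log_mel = log_mel_spectrogram(tone, sample_rate)
print(log_mel.shape)  # (98, 40)
```

Each row is one 25 ms frame and each column one mel band, so a pure tone shows up as a single bright horizontal stripe when the array is plotted with time on the x-axis.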

    Try It: CTC Decoding

    Decode neural network outputs into text by collapsing repeated characters

    Python
    import numpy as np
    
    # CTC: Connectionist Temporal Classification
    # Align audio frames to text without explicit alignment
    
    np.random.seed(42)
    
    def softmax(x):
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    
    def ctc_greedy_decode(log_probs, vocab):
        """Simple greedy CTC decoding"""
        # Get most likely token at each timestep
        best_path = np.argmax(log_probs, axis=1)
        
        # Collapse repeated tokens and remove blanks
        decoded = []
       
    ...
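
Since the listing above is truncated, here is an independent minimal sketch of greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. The vocabulary, the choice of index 0 as the blank token, and the helper `one_hot_log_probs` are assumptions made for this example.

```python
import numpy as np

def ctc_greedy_decode(log_probs, vocab, blank=0):
    """Greedy CTC decode: argmax per frame, merge repeats, then drop blanks."""
    best_path = np.argmax(log_probs, axis=1).tolist()
    decoded = []
    prev = -1  # sentinel that matches no token index
    for idx in best_path:
        # Emit a token only when it differs from the previous frame and is not blank
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)

def one_hot_log_probs(frame_ids, vocab_size):
    """Hypothetical helper: fake network output that strongly favors frame_ids."""
    log_probs = np.full((len(frame_ids), vocab_size), -10.0)
    log_probs[np.arange(len(frame_ids)), frame_ids] = -0.01
    return log_probs

vocab = ["_", "c", "a", "t", "o"]  # assumed vocab; "_" at index 0 is the blank

# Frames: c c _ a a a _ _ t t
print(ctc_greedy_decode(one_hot_log_probs([1, 1, 0, 2, 2, 2, 0, 0, 3, 3], 5), vocab))  # cat

# A blank between two identical tokens keeps both: t o _ o
print(ctc_greedy_decode(one_hot_log_probs([3, 4, 0, 4], 5), vocab))  # too

# Without the blank the repeats merge: t o o
print(ctc_greedy_decode(one_hot_log_probs([3, 4, 4], 5), vocab))  # to
```

The last two calls show why the blank token exists: it is the only way for CTC to emit a doubled letter.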

    ⚠️ Common Mistake: Feeding 44.1 kHz audio to a speech model trained on 16 kHz audio. Whisper and wav2vec 2.0, like most speech models, expect 16 kHz mono input. Always resample your audio to match the model's expected sample rate and channel count, or you'll get garbled transcriptions.
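
One way to do the resampling is polyphase filtering with SciPy's `scipy.signal.resample_poly`, which low-pass filters while changing the rate. The sketch below is a minimal example under stated assumptions: the function name `resample_to` is made up for illustration, and multi-channel audio is assumed to be laid out as `(samples, channels)`.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_to(audio, orig_sr, target_sr=16000):
    """Downmix to mono and resample to target_sr (illustrative helper)."""
    if audio.ndim == 2:
        # Assumed layout: (samples, channels). Average the channels to get mono.
        audio = audio.mean(axis=1)
    if orig_sr == target_sr:
        return audio
    # Reduce the rate ratio to lowest terms, e.g. 16000/44100 -> 160/441
    g = gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# One second of a 440 Hz stereo tone at 44.1 kHz standing in for a recording
orig_sr = 44100
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([tone, tone], axis=1)

mono_16k = resample_to(stereo, orig_sr)
print(mono_16k.shape)  # (16000,)
```

Audio loaders such as librosa or torchaudio offer their own resampling options; whichever is used, the point is that the sample rate reaching the model matches what it was trained on.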

    💡 Pro Tip: For production speech-to-text, Whisper large-v3 is a strong choice when accuracy matters most, and Whisper tiny when speed matters most. For lower latency, consider faster-whisper (CTranslate2 backend), which its authors report as up to 4× faster than the reference implementation at the same accuracy. Add VAD (Voice Activity Detection) to skip silence and reduce costs.
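
To make the VAD idea concrete, here is a deliberately crude energy-threshold sketch: it marks a frame as speech when its RMS level exceeds a fixed threshold. Production VADs are usually trained models and are far more robust to noise; the function name `energy_vad`, the 30 ms frame size, and the -40 dB threshold are all assumptions chosen for illustration.

```python
import numpy as np

def energy_vad(audio, sample_rate, frame_ms=30, threshold_db=-40.0):
    """Return one boolean per frame: True where RMS level exceeds threshold_db."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    # Drop the trailing partial frame and reshape into (n_frames, frame_len)
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(rms + 1e-10)  # level relative to full scale (1.0)
    return level_db > threshold_db

# Synthetic clip: 1 s of faint noise, 1 s of a loud tone (standing in for speech), 1 s of noise
sample_rate = 16000
rng = np.random.default_rng(0)
t = np.arange(sample_rate) / sample_rate
quiet_before = 0.0005 * rng.standard_normal(sample_rate)
speech_like = 0.3 * np.sin(2 * np.pi * 220 * t)
quiet_after = 0.0005 * rng.standard_normal(sample_rate)
audio = np.concatenate([quiet_before, speech_like, quiet_after])

is_speech = energy_vad(audio, sample_rate)
print(f"frames kept: {is_speech.sum()} of {len(is_speech)}")
```

Only the frames flagged `True` would be sent to the recognizer, which is where the compute savings come from on recordings with long pauses.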

    📋 Quick Reference

    Concept      What                           Why
    Mel Scale    Perceptual frequency scale     Matches human hearing
    STFT         Time-frequency decomposition   See frequencies over time
    CTC          Alignment-free decoding        No need for forced alignment
    WER          Word Error Rate                Primary ASR metric
    VAD          Voice Activity Detection       Skip silence, save compute
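
WER is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. The sketch below computes it with a standard Levenshtein dynamic program. The function name `word_error_rate` and the whitespace-only tokenization are simplifying assumptions; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"  # one substitution (sat -> sit), one deletion (the)
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 0.333
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.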

    🎉 Lesson Complete!

    You now understand how machines convert speech to text! Next, dive into advanced NLP with transformers for text understanding and generation.
