
    Lesson 34 • Advanced

    Speech Recognition & Audio ML

    By the end of this lesson you'll understand the whole pipeline that turns spoken audio into text — from raw waveforms to features to neural models — and you'll be able to decode model output and score it with Word Error Rate in plain Python.

    What You'll Learn in This Lesson

    • How audio becomes numbers: waveforms, samples, and sample rate
    • What a spectrogram and MFCC features are, and why models love them
    • How acoustic and language models split the work of recognition
    • The intuition behind CTC loss and how to decode its output
    • End-to-end models in practice: wav2vec 2.0 and Whisper
    • How to measure accuracy with Word Error Rate (WER)

    🎙️ Real-World Analogy: Turning Sound Waves Into Text

    Think of speech recognition like a musician transcribing a song by ear. First the air pressure wiggles — that's the raw sound wave. A microphone measures that wiggle thousands of times a second, giving a long list of numbers (the waveform).

    The musician doesn't read raw pressure values — they "see" the music as pitches over time. The computer does the same by drawing a spectrogram: a picture of which frequencies are loud at each moment, like sheet music for sound. Then a trained "reader" (the neural network) looks at that picture and writes down the words.

    The hard part is the same one a human transcriber faces: people speak at different speeds, with accents, mumbling, and a noisy café in the background. Good ASR is mostly about reading that messy, smudged sheet music correctly.

    1. Audio Is Just a List of Numbers

    Sound is a wave of changing air pressure. To get it into a computer, a microphone samples that wave — it measures the pressure at fixed instants and writes down a number each time. Do that 16,000 times a second and you have a 16kHz recording: a list of 16,000 numbers per second.

    That number (16,000) is the sample rate. Higher sample rates capture higher frequencies but make bigger files: a recording can only represent frequencies up to half its sample rate (the Nyquist limit). Speech lives mostly below 8kHz, so almost every speech model expects 16kHz mono audio, a single channel sampled 16,000 times a second.
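
To make those numbers concrete, here is a minimal sketch of the arithmetic behind a sample rate. The 3-second clip length and the 16-bit sample size are illustrative assumptions, not values from the lesson.

```python
# Assumed values for illustration: a 3-second mono clip stored as 16-bit PCM.
sample_rate = 16_000        # samples per second (16 kHz)
duration_s = 3.0            # clip length in seconds (hypothetical)
bytes_per_sample = 2        # 16-bit PCM (assumption)

num_samples = int(sample_rate * duration_s)

# A sampled signal can only represent frequencies up to half the sample rate,
# which is why 16 kHz is enough for speech content below 8 kHz.
highest_hz = sample_rate / 2

size_bytes = num_samples * bytes_per_sample

print("Samples:", num_samples)                 # 48000
print("Highest frequency (Hz):", highest_hz)   # 8000.0
print("Raw size (bytes):", size_bytes)         # 96000
```

Doubling the sample rate doubles both the highest representable frequency and the raw size, which is the trade-off described above.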

    Before any neural network, you can already pull useful signals straight from the raw samples. Energy (the sum of squared samples) tells you how loud a slice is. Zero-crossings (how often the wave flips between positive and negative) hint at pitch and noisiness. Run the example below to compute both on a toy 12-sample "waveform".

    Try It: Energy and Zero-Crossings of a Toy Signal

    Measure loudness and pitch cues directly from raw audio samples — plain Python, no libraries

    Python
    # Audio is just a list of numbers — samples taken many times per second.
    # A 16kHz recording means 16,000 numbers (samples) for every second of sound.
    # Loudness shows up as bigger swings; pitch shows up as faster swings.
    #
    # Two cheap features tell you a lot about a frame of audio:
    #   - ENERGY:        how loud it is (sum of squared samples)
    #   - ZERO-CROSSINGS: how often the wave flips sign (a rough pitch / noisiness cue)
    
    # A tiny "waveform" — 12 samples ranging from -1.0 to +1.0
    signal = [0
    ...
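
As a self-contained sketch of the two features described above, the snippet below computes energy and zero-crossings in plain Python. The 12 sample values are made up for illustration and are not the lesson's own signal; the function names are likewise hypothetical.

```python
# A made-up 12-sample signal (hypothetical values, not the lesson's data).
signal = [0.0, 0.4, 0.7, 0.4, -0.1, -0.6, -0.8, -0.3, 0.2, 0.6, 0.3, -0.2]

def energy(samples):
    """Sum of squared samples: bigger swings mean a louder frame."""
    return sum(s * s for s in samples)

def zero_crossings(samples):
    """Count sign changes between neighbouring samples (zero counts as positive)."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev < 0) != (cur < 0):
            count += 1
    return count

print("Energy:", round(energy(signal), 3))         # 2.44
print("Zero-crossings:", zero_crossings(signal))   # 3
```

In practice both features are computed per short frame rather than over a whole clip, so they can be tracked as they change over time.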

    2. From Waveform to Features: Spectrogram and MFCC

    Raw samples are hard for a network to read directly — there are too many, and they don't show pitch clearly. So you transform them into features: a compact picture of the sound.

    A spectrogram chops the audio into short overlapping frames (say 25ms every 10ms) and, for each frame, measures how much energy sits at each frequency. Stack those frames side by side and you get a 2D image: time across the bottom, frequency up the side, brightness = loudness.
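
The framing-plus-frequency idea can be sketched without any libraries. The example below is a toy under stated assumptions: a 1 kHz sample rate, a pure 100 Hz tone, a slow naive DFT instead of an FFT, and no window function (real pipelines usually apply one, such as a Hann window). All names are hypothetical.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, advancing by hop."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        mags.append(abs(total))
    return mags

# Toy setup (assumed values): 1 kHz sample rate, 100 Hz tone, 0.2 s long.
sample_rate = 1000
tone_hz = 100
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(200)]

# 50-sample frames (50 ms) every 20 samples (20 ms)
frames = frame_signal(samples, frame_len=50, hop=20)
spectrogram = [magnitude_spectrum(f) for f in frames]   # time x frequency

bin_width_hz = sample_rate / 50   # each DFT bin spans 20 Hz
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])

print("Frames:", len(frames))                               # 8
print("Bins per frame:", len(spectrogram[0]))               # 26
print("Loudest frequency (Hz):", peak_bin * bin_width_hz)   # 100.0
```

Each row of `spectrogram` is one time step and each column is one frequency bin, which is the 2D picture described above.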

    A mel spectrogram warps the frequency axis to match human hearing — we hear low pitches in fine detail and high pitches coarsely, so the mel scale packs more resolution into the low and mid range where speech lives. MFCCs (Mel-Frequency Cepstral Coefficients) go one step further: they compress each mel frame into ~13 numbers that summarise the shape of the sound. MFCCs powered classic ASR for decades; modern deep models often feed on the mel spectrogram directly.
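
To see the warping in numbers, the sketch below uses one widely used Hz-to-mel formula (audio libraries differ in the exact variant they apply, so treat the constants as an assumption). It spaces 8 hypothetical bands evenly on the mel scale and prints their centre frequencies in Hz.

```python
import math

def hz_to_mel(hz):
    """Convert Hz to mels using one common formula (other variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centres(num_bands, low_hz, high_hz):
    """Centre frequencies (Hz) of bands spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_bands + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_bands)]

centres = mel_band_centres(num_bands=8, low_hz=0.0, high_hz=8000.0)
gaps = [b - a for a, b in zip(centres, centres[1:])]

print("Centres (Hz):", [round(c) for c in centres])
print("Gaps (Hz):   ", [round(g) for g in gaps])   # gaps widen as frequency rises
```

The bands sit close together at low frequencies and spread out at high ones, which is what "more resolution where speech lives" means in practice.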

    The pipeline, in one line:

    Waveform → frame it → FFT per frame → mel scale → log → (optional) DCT → MFCC

    The result is a 2D array, e.g. (80 mel bands × N time frames). Speech models treat this like a single-channel image.
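
The optional last step of the pipeline, turning one log-mel frame into MFCCs, is conventionally a discrete cosine transform (DCT). The sketch below uses an unnormalised DCT-II and a made-up 20-band frame; real libraries add normalisation and usually start from more bands, so treat it as an illustration of the idea only.

```python
import math

def dct_ii(values, num_coeffs):
    """Unnormalised DCT-II: project values onto cosines of rising frequency."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]

# Hypothetical log-mel energies for ONE frame (20 bands, made-up numbers).
log_mel_frame = [2.0, 2.5, 3.1, 3.6, 3.8, 3.7, 3.3, 2.9, 2.6, 2.4,
                 2.3, 2.1, 1.9, 1.8, 1.6, 1.5, 1.3, 1.2, 1.1, 1.0]

mfcc = dct_ii(log_mel_frame, num_coeffs=13)

print(len(log_mel_frame), "log-mel values ->", len(mfcc), "coefficients")
print("First three:", [round(c, 2) for c in mfcc[:3]])
```

Coefficient 0 is just the sum of the frame (overall level); the higher coefficients describe progressively finer detail in the spectral shape, which is why keeping only the first ~13 gives a compact summary.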

    3. Two Jobs: The Acoustic Model and the Language Model

    Classic recognition splits the work in two. The acoustic model answers "what sounds are in this audio?" — it maps spectrogram frames to phonemes or characters. It hears sounds, not meaning, so on its own it happily produces "wreck a nice beach".

    The language model answers "which word sequence is actually likely?" — it knows that "recognise speech" is far more probable than "wreck a nice beach". It rescues the acoustic model from sound-alike mistakes (homophones) and rare words.

    Modern end-to-end models often fold both jobs into one network, but the language-model idea still shows up — for example, Whisper's text decoder has implicitly learned which word sequences are plausible.
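
One simple way to picture how the two models cooperate is to add their scores. The sketch below rescores two candidate transcripts by combining an acoustic log-probability with a weighted language-model log-probability. Every number and the `rescore` helper are hypothetical, chosen only to show the effect.

```python
import math

# Hypothetical acoustic log-probabilities for two sound-alike transcripts.
acoustic_logp = {
    "wreck a nice beach": math.log(0.55),
    "recognise speech": math.log(0.45),
}

# Hypothetical language-model log-probabilities for the same word sequences.
lm_logp = {
    "wreck a nice beach": math.log(0.001),
    "recognise speech": math.log(0.02),
}

def rescore(hypotheses, lm_weight=1.0):
    """Pick the transcript with the best acoustic + weighted LM score."""
    return max(hypotheses,
               key=lambda text: acoustic_logp[text] + lm_weight * lm_logp[text])

print("Acoustic only:      ", rescore(acoustic_logp, lm_weight=0.0))
# Acoustic only:       wreck a nice beach
print("With language model:", rescore(acoustic_logp, lm_weight=1.0))
# With language model: recognise speech
```

The acoustic model slightly prefers the wrong transcript, and the language model's much stronger preference for the plausible phrase overturns it. The weight controls how much the language model is trusted.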

    4. CTC Loss: Aligning Audio to Text Without a Ruler

    Here's the alignment problem: a one-second clip might be 100 frames, but the word "hi" is 2 letters. Which frames belong to "h" and which to "i"? Hand-labelling that for millions of clips is impossible.

    CTC (Connectionist Temporal Classification) solves it cleverly. The model emits one symbol per frame and is allowed to (a) repeat a symbol and (b) emit a special blank meaning "no new letter here". To read the final text you collapse runs of the same symbol, then delete the blanks. So h-e-l-l-o becomes hello. During training, CTC sums the probability of every frame-alignment that collapses to the correct text — so the model never needs a human to say where each letter starts.

    The clever bit: double letters survive because the model puts a blank between them. In h-e-l-l-o the two l's are kept; remove that blank and they would collapse to a single l, so you could never spell "hello". Run the decoder below to see both rules in action.
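
The "sum over every alignment" idea can be checked by brute force on a tiny case. The sketch below enumerates all 3-frame paths over a blank and two letters and adds up the probability of those that collapse to "hi". The per-frame probabilities are made up, and real training uses an efficient dynamic-programming recursion rather than enumeration.

```python
from itertools import product

BLANK = "-"

def ctc_collapse(raw, blank=BLANK):
    """Collapse runs of the same symbol, then drop blanks (done in one pass)."""
    out = []
    prev = None
    for ch in raw:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Hypothetical per-frame probabilities for 3 frames; each row sums to 1.
frame_probs = [
    {"-": 0.1, "h": 0.8, "i": 0.1},
    {"-": 0.3, "h": 0.3, "i": 0.4},
    {"-": 0.2, "h": 0.1, "i": 0.7},
]

def ctc_probability(target, frame_probs):
    """Sum the probability of every frame-level path that collapses to target."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for probs, sym in zip(frame_probs, path):
                p *= probs[sym]
            total += p
    return total

# Paths that collapse to "hi": hhi, hii, h-i, -hi, hi-
print(round(ctc_probability("hi", frame_probs), 4))   # 0.645
```

The CTC loss is the negative log of this summed probability, so training pushes up every valid alignment at once without anyone labelling where each letter starts.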

    Try It: Collapse CTC Output Into Text

    Apply the two CTC rules — collapse repeats, drop blanks — to decode model output like 'h-e-l-l-o'

    Python
    # CTC (Connectionist Temporal Classification) lets a model output one symbol
    # PER audio frame without knowing exactly where each letter starts.
    # The model emits repeats and a special BLANK ("-") to mark "no new letter here".
    #
    # To get the final text you apply two rules, in order:
    #   1) Collapse RUNS of the same symbol into one   (l l  -> l)
    #   2) Then DELETE every blank "-"
    #
    # So "h-e-l-l-o" decodes to "hello".
    
    def ctc_collapse(raw, blank="-"):
        # Step 1: collapse consecutive duplicate
    ...
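
In practice the string being collapsed comes from the model's per-frame probabilities. A minimal sketch of the simplest strategy, greedy (best-path) decoding, follows: take the most likely symbol in each frame, then apply the two rules. The probabilities and the `greedy_decode` name are hypothetical.

```python
def ctc_collapse(raw, blank="-"):
    """Rule 1: collapse runs of the same symbol. Rule 2: delete blanks."""
    collapsed = []
    prev = None
    for ch in raw:
        if ch != prev:
            collapsed.append(ch)
        prev = ch
    return "".join(ch for ch in collapsed if ch != blank)

def greedy_decode(frame_probs, blank="-"):
    """Best-path decoding: most likely symbol per frame, then collapse."""
    best_path = [max(probs, key=probs.get) for probs in frame_probs]
    return ctc_collapse(best_path, blank)

# Hypothetical per-frame outputs for 6 frames (made-up numbers).
frame_probs = [
    {"-": 0.1, "n": 0.8, "o": 0.1},
    {"-": 0.2, "n": 0.7, "o": 0.1},
    {"-": 0.6, "n": 0.2, "o": 0.2},
    {"-": 0.1, "n": 0.1, "o": 0.8},
    {"-": 0.1, "n": 0.1, "o": 0.8},
    {"-": 0.7, "n": 0.1, "o": 0.2},
]

print(greedy_decode(frame_probs))      # no
print(ctc_collapse("hh-e-ll-l-oo"))    # hello
print(ctc_collapse("hh-e-lll-oo"))     # helo (no blank between the l's)
```

Greedy decoding is fast but looks at each frame in isolation; beam search keeps several candidate paths alive and can be combined with a language model.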

    🎯 Your Turn: Finish the Signal Features

    Fill in the blanks to compute energy and zero-crossings

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    
    signal = [0.2, 0.5, -0.4, -0.7, 0.3, 0.6, -0.2]
    
    # 1) Add up the SQUARE of every sample to get the energy
    energy = 0.0
    for s in signal:
        energy += ___          # 👉 multiply the sample by itself (s * s)
    print("Energy:", round(energy, 3))
    
    # 2) Count how many times the wave changes sign (a zero-crossing)
    crossings = 0
    for i in range(1, len(signal)):
        a, b = signal[i - 1], signal[i]
        if (a < 0 and b >= 0) or (a >= 0 and ___):   # 👉 b <
    ...

    5. End-to-End Models: wav2vec 2.0 and Whisper

    Today's best systems are end-to-end: a single neural network goes straight from audio (or mel spectrogram) to text, learning the features and the alignment itself.

    • wav2vec 2.0 pre-trains on huge amounts of unlabelled audio by masking chunks and predicting them — like fill-in-the-blank for sound. You then fine-tune with a small labelled set and a CTC head. Great when you have lots of audio but little transcribed data.
    • Whisper (OpenAI) is a sequence-to-sequence (encoder–decoder) model trained on ~680,000 hours of multilingual audio. It is remarkably robust to noise and accents, transcribes many languages and can translate them into English, and is the easiest place to start.

    You don't implement these from scratch. Here's the full Whisper "hello world" — read it as a worked example (you can't run model downloads in the sandbox, so the expected output is shown in comments):

    # pip install -U openai-whisper
    import whisper
    
    model = whisper.load_model("base")          # tiny / base / small / medium / large-v3
    result = model.transcribe("audio.mp3")
    print(result["text"])
    
    # Expected output (for a clip in which the speaker says this sentence):
    #  Hello, and welcome to the speech recognition lesson.

    The Hugging Face Transformers version, which works for Whisper and wav2vec 2.0 alike:

    # pip install transformers torch
    from transformers import pipeline
    
    # 16kHz mono audio in, text out — pick any ASR model from the Hub
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    out = asr("audio.mp3")
    print(out["text"])
    
    # Expected output:
    #  Hello, and welcome to the speech recognition lesson.

    6. Measuring Accuracy: Word Error Rate (WER)

    How good is a transcription? The industry metric is Word Error Rate. You line up the model's words against the truth and count the cheapest set of edits to fix them — substitutions, insertions, and deletions — then divide by the number of words in the reference:

    WER = (S + I + D) / (words in reference)

    Lower is better. 0.0 is perfect; a WER of 0.25 means roughly one error for every four reference words. Because insertions count as errors, WER can exceed 1.0 when the model outputs many extra words. That "cheapest set of edits" is exactly the classic edit distance (Levenshtein), but counted on whole words instead of characters. The example below computes it from scratch.

    Try It: Compute Word Error Rate from Scratch

    Edit distance on word lists gives you the standard ASR accuracy metric — plain Python

    Python
    # WORD ERROR RATE (WER) is THE score for speech recognition.
    # It counts how many word edits turn the model's guess into the truth:
    #   Substitutions (S) + Insertions (I) + Deletions (D)
    #   WER = (S + I + D) / number_of_words_in_truth
    # Lower is better. 0.0 = perfect. Human transcribers are often cited at around 0.04-0.05 on conversational speech.
    
    def word_edit_distance(ref, hyp):
        # Classic Levenshtein edit distance, but on WORDS not characters.
        n, m = len(ref), len(hyp)
        # dp[i][j] = edits to turn ref[:i] into hyp[:j
    ...
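
As a self-contained sketch of the word-level edit distance described above, the snippet below fills in the standard dynamic-programming table and divides by the reference length. It is one straightforward way to write it, not necessarily the lesson's exact listing, and the example sentences are made up.

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = fewest edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i            # delete all i reference words
    for j in range(m + 1):
        dp[0][j] = j            # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    return dp[n][m]

def wer(reference, hypothesis):
    """WER = word edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)

print(wer("turn the volume up", "turn the volume up"))      # 0.0
print(wer("turn the volume up", "turn a volume up"))        # 0.25
print(wer("turn the volume up", "turn volume up please"))   # 0.5
print(wer("stop", "please stop now"))                       # 2.0
```

The last line shows why WER is not a percentage of "wrong words": two insertions against a one-word reference give a score of 2.0.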

    🎯 Your Turn: Decode and Score

    Fill in the blanks to collapse CTC output and compute a simple WER

    Python
    # 🎯 YOUR TURN — fill in the blanks marked with ___
    
    def ctc_collapse(raw, blank="-"):
        collapsed = []
        prev = None
        for ch in raw:
            if ch != ___:        # 👉 only keep a symbol that DIFFERS from the previous one (prev)
                collapsed.append(ch)
            prev = ch
        return "".join(ch for ch in collapsed if ch != blank)
    
    # Decode the model's noisy per-frame output into clean text
    print(ctc_collapse("g-o-o-d"))     # should print: good
    
    # Now score a transcription with a si
    ...

    🔊 The Other Direction: Text-to-Speech (TTS)

    ASR turns audio into text; text-to-speech runs the pipeline in reverse: text in, audio out. A model like Tacotron 2 or FastSpeech 2 first generates a mel spectrogram from the words, then a vocoder (such as HiFi-GAN) turns that spectrogram back into a waveform you can play. Some newer models, such as VITS, fold both steps into a single network.

    Notice the symmetry: the mel spectrogram sits in the middle of both tasks. It's the common "sheet music" representation that bridges sound and meaning in either direction.

    🎯 Mini-Challenge: Voice-Command Checker

    Put it together with no scaffolding. Decode a raw CTC string into a command, then score it against the expected command with WER. The starter only gives you an outline — write the logic yourself.

    Mini-Challenge

    Decode CTC output and compute its WER against the expected command

    Python
    # 🎯 MINI-CHALLENGE: Voice-command checker
    #
    # 1. Define ctc_collapse(raw, blank="-") that collapses repeats then drops blanks
    #    (reuse the two rules from the worked example).
    # 2. Decode this raw model output:  "v-o-l-u-m-e u-p"
    #    (a single space token separates the two words; every symbol is already split
    #     from its neighbours by a blank or the space, so a plain collapse+drop works)
    # 3. Compare it to the expected command "volume up" word-by-word and print
    #    a WER (wrong words / total r
    ...

    Common Errors (And How to Fix Them)

    These four trip up almost everyone building their first speech system:

    ❌ Garbled output from noisy audio

    Background noise, music, or two people talking smear the spectrogram and the model guesses wrong.

    ✅ Fix: denoise or apply Voice Activity Detection (VAD) to keep only speech, and prefer a model trained on noisy data (Whisper handles noise well). Recording closer to the mic helps more than any code.
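
Production VAD is usually a trained model, but the core idea can be sketched with the energy feature from earlier: flag a frame as speech when its energy clears a threshold. The signal, frame length, and threshold below are all made-up values for illustration.

```python
def frame_energies(samples, frame_len):
    """Energy (sum of squares) of consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def simple_vad(samples, frame_len, threshold):
    """One True/False per frame: True where energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(samples, frame_len)]

# Made-up signal: quiet noise, a loud "speech" burst, then quiet noise again.
quiet = [0.01, -0.02, 0.01, -0.01]
loud = [0.5, -0.6, 0.7, -0.4]
samples = quiet * 2 + loud * 3 + quiet * 2

flags = simple_vad(samples, frame_len=4, threshold=0.1)
print(flags)
# [False, False, True, True, True, False, False]

# Keep only the samples from frames flagged as speech
speech = [s for i, keep in enumerate(flags) if keep
          for s in samples[i * 4:(i + 1) * 4]]
print(len(samples), "->", len(speech), "samples kept")   # 28 -> 12
```

A fixed threshold breaks down when the background is loud or changes over time, which is the reason real systems rely on trained detectors.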

    ❌ Sample-rate mismatch

    Feeding 44.1kHz audio to a model trained on 16kHz (or vice versa) without resampling makes the clip look slowed down or sped up to the model: every frequency shifts and the output is nonsense.

    # Resample to the rate the model expects (16kHz mono):
    import librosa
    audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

    ✅ Fix: always resample to the model's expected rate and convert to mono before inference.
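
To see how large the shift is, the sketch below works out what a tone looks like when samples recorded at one rate are read as if they were taken at another. The 440 Hz tone, the 3-second length, and the helper names are illustrative assumptions.

```python
def apparent_frequency(true_hz, recorded_rate, assumed_rate):
    """Frequency a tone seems to have when samples recorded at one rate
    are interpreted as if they were taken at another rate."""
    return true_hz * assumed_rate / recorded_rate

def apparent_duration(true_seconds, recorded_rate, assumed_rate):
    """Duration the clip seems to have under the same mismatch."""
    return true_seconds * recorded_rate / assumed_rate

# A 440 Hz tone, 3 seconds long, recorded at 44.1 kHz but read as 16 kHz.
print(round(apparent_frequency(440, 44_100, 16_000), 1))   # 159.6
print(round(apparent_duration(3.0, 44_100, 16_000), 2))    # 8.27
```

The clip appears almost three times longer and every pitch drops by the same factor, so the features no longer resemble anything the model saw in training.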

    ❌ Accents and out-of-vocabulary words

    A model trained mostly on US English mangles strong regional accents, names, brand names, and jargon it never saw.

    ✅ Fix: choose a multilingual / multi-accent model (Whisper), fine-tune on samples from your domain, or bias decoding toward a custom vocabulary of expected terms.
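
True vocabulary biasing happens inside the decoder and depends on the toolkit. A much simpler stand-in, shown here as a sketch, is to post-correct the finished transcript by snapping near-miss words onto a list of expected terms with the standard library's `difflib`. The vocabulary, the cutoff, and the `snap_to_vocab` helper are hypothetical.

```python
import difflib

# Hypothetical list of domain terms the recogniser tends to misspell.
custom_vocab = ["wav2vec", "whisper", "spectrogram", "librosa"]

def snap_to_vocab(transcript, vocab, cutoff=0.8):
    """Replace each word with its closest vocabulary term if similar enough."""
    fixed = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)
    return " ".join(fixed)

print(snap_to_vocab("load the audio with libroza", custom_vocab))
# load the audio with librosa
```

A high cutoff keeps ordinary words untouched; lowering it catches more misspellings but risks rewriting words that were already correct.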

    ❌ No language model — sound-alike mistakes

    A pure acoustic model with no sense of likely word sequences writes "to too two" interchangeably and invents non-words.

    ✅ Fix: add or enable a language model (an n-gram LM with CTC beam search, or use a seq2seq model like Whisper whose decoder already knows likely word sequences).

    📋 Quick Reference

    Term               | What it is                       | Why it matters
    Sample rate        | Samples captured per second      | Speech models expect 16kHz mono
    Waveform           | The raw list of samples          | The starting point for everything
    Spectrogram        | Frequency vs time picture        | "Sheet music" the network reads
    Mel / MFCC         | Human-hearing-scaled features    | Compact, perceptually meaningful input
    Acoustic model     | Audio → sounds/letters           | Hears, doesn't reason about words
    Language model     | Likely word sequences            | Fixes sound-alike mistakes
    CTC                | Alignment-free training/decoding | No hand-made frame-to-letter labels
    Whisper / wav2vec2 | End-to-end ASR models            | State-of-the-art, easy to use
    WER                | (S + I + D) / reference words    | The standard accuracy score

    ❓ Frequently Asked Questions

    Q: What is automatic speech recognition (ASR)?

    A: ASR is the technology that turns spoken audio into written text. Modern systems read the audio as a waveform, turn it into features like a spectrogram, then use a neural network to predict the words.

    Q: What is a mel spectrogram and why do speech models use it?

    A: A spectrogram shows which frequencies are present over time — like sheet music for sound. The mel version spaces the frequencies the way human ears hear them, packing more detail into the low and mid range where speech lives. It turns 1D audio into a 2D 'image' a neural network can read.

    Q: What does CTC loss actually do?

    A: CTC lets a model output one symbol per audio frame without knowing exactly where each letter starts and ends. It allows repeats and a special blank symbol, then collapses repeated symbols and removes blanks to recover the final text — so it can be trained on audio paired with text, with no hand-made alignment.

    Q: What is Word Error Rate (WER)?

    A: WER is the standard accuracy score for speech recognition. It is (substitutions + insertions + deletions) divided by the number of words in the reference transcript. Lower is better: 0.0 is perfect, and good systems on clean audio score under 0.05.

    Q: Should I use Whisper or wav2vec 2.0?

    A: Whisper is the easiest to start with: it is multilingual, robust to noise, and works in a couple of lines. wav2vec 2.0 shines when you fine-tune on your own domain with limited labelled data. For most projects, start with Whisper and only switch if you need custom fine-tuning.

    Q: Why does my model transcribe noisy or accented audio so badly?

    A: Most errors come from a mismatch between training and real data: background noise, the wrong sample rate (use 16kHz mono), accents or vocabulary the model never saw, or missing a language model to fix unlikely word sequences. Resampling, denoising, and choosing a model trained on diverse speech usually help most.

    🎉

    Lesson complete — you can read the whole ASR pipeline!

    You now know how audio becomes numbers, how spectrograms and MFCCs turn it into features a network can read, how acoustic and language models divide the work, why CTC lets models learn alignment for free, and how to use Whisper and wav2vec 2.0. You can even decode CTC output and score a transcription with WER by hand.

    🚀 Up next: Advanced NLP — transformers for understanding and generating text, the same architecture that powers Whisper's decoder.
