Lesson 34 • Advanced
Speech Recognition & Audio ML
Convert speech to text — learn mel spectrograms, CTC decoding, and how Whisper achieves near-human transcription accuracy.
✅ What You'll Learn
• Mel spectrograms: converting audio to 2D representations
• CTC decoding: aligning audio frames to text
• Modern ASR models: Whisper, wav2vec 2.0, Conformer
• Word Error Rate (WER) and evaluation
🎙️ Teaching Machines to Listen
🎯 Real-World Analogy: Speech recognition is like translating sheet music. First, you convert the raw sound waves into a visual representation (mel spectrogram = the sheet music). Then a "reader" (neural network) interprets the notes into words. The challenge is that people speak at different speeds, with accents, and with background noise — like reading messy, smudged sheet music.
Modern speech recognition has two key preprocessing steps: (1) convert raw audio into a mel spectrogram (a 2D representation that mimics human hearing), then (2) feed it to a neural network that outputs text, either character-by-character (CTC) or as a sequence-to-sequence model (Whisper).
Try It: Mel Spectrograms
Convert audio into the 2D representation that speech models process
import numpy as np
# Mel Spectrogram: Converting Audio to "Images" for Neural Networks
# The key preprocessing step for all modern speech models
np.random.seed(42)
def hz_to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def create_mel_filterbank(n_mels, n_fft, sample_rate):
    """Create triangular mel-scale filter bank"""
    mel_min = hz_to_mel(0)
    mel_max = hz_to_mel(sample_rate / 2)
    # n_mels + 2 points: each filter spans three consecutive mel points
    mel_points = np.linspace(mel_min, mel_max, n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    # Map each filter edge frequency to its FFT bin index
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    filterbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):   # rising edge of the triangle
            filterbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling edge of the triangle
            filterbank[m - 1, k] = (right - k) / max(right - center, 1)
    return filterbank
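Putting the pieces together, here is a self-contained sketch (small helpers repeated so it runs on its own) that turns a one-second synthetic tone into a log-mel spectrogram. The frame and FFT parameters (512-point FFT, 160-sample hop, 40 mel bands) are illustrative choices, not any particular model's defaults:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

sr, n_fft, hop, n_mels = 16000, 512, 160, 40
t = np.arange(sr) / sr                      # 1 second of audio
audio = 0.5 * np.sin(2 * np.pi * 440 * t)   # 440 Hz tone

# Frame the signal and take the power spectrum of each window (the STFT)
n_frames = 1 + (len(audio) - n_fft) // hop
window = np.hanning(n_fft)
power = np.stack([
    np.abs(np.fft.rfft(window * audio[i * hop : i * hop + n_fft])) ** 2
    for i in range(n_frames)
], axis=1)                                   # shape: (n_fft//2 + 1, n_frames)

# Triangular mel filter bank (same construction as in the lesson code)
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
fb = np.zeros((n_mels, n_fft // 2 + 1))
for m in range(1, n_mels + 1):
    l, c, r = bins[m - 1], bins[m], bins[m + 1]
    for k in range(l, c):
        fb[m - 1, k] = (k - l) / max(c - l, 1)
    for k in range(c, r):
        fb[m - 1, k] = (r - k) / max(r - c, 1)

# Apply the filter bank and take the log — this 2D array is what ASR models see
log_mel = np.log(fb @ power + 1e-10)         # shape: (n_mels, n_frames)
print(log_mel.shape)
```

The result is a (mel bands × time frames) "image" of the audio, with the log compressing dynamic range the way loudness perception does.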
Try It: CTC Decoding
Decode neural network outputs into text by collapsing repeated characters
import numpy as np
# CTC: Connectionist Temporal Classification
# Align audio frames to text without explicit alignment
np.random.seed(42)
def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ctc_greedy_decode(log_probs, vocab, blank=0):
    """Simple greedy CTC decoding"""
    # Get most likely token at each timestep
    best_path = np.argmax(log_probs, axis=1)
    # Collapse repeated tokens and remove blanks (blank is index 0 here)
    decoded = []
    prev = None
    for token in best_path:
        if token != prev and token != blank:
            decoded.append(vocab[token])
        prev = token
    return "".join(decoded)
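To see the collapse rule in action, here is a self-contained toy decoder that works directly on the per-frame argmax path (the vocabulary and frame sequence are made up for illustration):

```python
# Toy vocabulary with the CTC blank at index 0
vocab = ["-", "c", "a", "t"]  # "-" is the blank symbol

def ctc_greedy_decode(best_path, vocab, blank=0):
    # Collapse repeated tokens, then drop blanks
    out, prev = [], None
    for tok in best_path:
        if tok != prev and tok != blank:
            out.append(vocab[tok])
        prev = tok
    return "".join(out)

# Frame-level argmax over 8 timesteps: c c - a a - t t
best_path = [1, 1, 0, 2, 2, 0, 3, 3]
print(ctc_greedy_decode(best_path, vocab))  # → "cat"
```

The blank is what lets CTC represent genuinely repeated letters: "ll" in "hello" needs a blank between the two l-frames, otherwise they would collapse into one.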
⚠️ Common Mistake: Using a 44.1kHz sample rate for speech models trained on 16kHz. Most speech models (Whisper, wav2vec2) expect 16kHz mono audio. Always resample your audio to match the model's expected input, or you'll get garbled transcriptions.
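To illustrate the fix, a rough resampling sketch in plain NumPy — linear interpolation only, so treat it as a demonstration; a real pipeline should use a polyphase resampler (e.g. `scipy.signal.resample_poly` or torchaudio), which also low-pass filters to prevent aliasing:

```python
import numpy as np

sr_in, sr_out = 44100, 16000
t_in = np.arange(sr_in) / sr_in                # 1 second at 44.1 kHz
audio = 0.5 * np.sin(2 * np.pi * 440 * t_in)

# Evaluate the 44.1 kHz signal at the 16 kHz sample instants.
# Crude but shows the idea; proper resamplers filter out frequencies
# above 8 kHz first so they don't alias into the output.
n_out = int(len(audio) * sr_out / sr_in)       # 16000 samples
t_out = np.arange(n_out) / sr_out
resampled = np.interp(t_out, t_in, audio)
print(len(resampled))  # 16000 — one second at the rate the model expects
```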
💡 Pro Tip: For production speech-to-text, use Whisper large-v3 for best accuracy or Whisper tiny for speed. For real-time streaming, use faster-whisper (CTranslate2 backend) — it's 4× faster with the same accuracy. Add VAD (Voice Activity Detection) to skip silence and reduce costs.
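As an illustration of the VAD idea, here is a toy energy-threshold detector in NumPy — a stand-in for real VADs such as Silero VAD or WebRTC VAD, not something to ship:

```python
import numpy as np

def energy_vad(audio, frame_len=400, threshold=0.01):
    """Mark a frame as speech if its mean energy exceeds a threshold."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)       # 1 s standing in for speech
audio = np.concatenate([np.zeros(sr), tone])   # 1 s silence, then the tone
voiced = energy_vad(audio)                     # 80 frames of 25 ms each
print(voiced.sum(), "of", len(voiced), "frames flagged as speech")
```

Only the flagged frames need to be sent to the (expensive) ASR model; real VADs use learned features rather than raw energy, so they survive background noise.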
📋 Quick Reference
| Concept | What | Why |
|---|---|---|
| Mel Scale | Perceptual frequency scale | Matches human hearing |
| STFT | Time-frequency decomposition | See frequencies over time |
| CTC | Alignment-free decoding | No need for forced alignment |
| WER | Word Error Rate | Primary ASR metric |
| VAD | Voice Activity Detection | Skip silence, save compute |
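The WER row above can be made concrete: WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # Standard edit-distance DP table over words instead of characters
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why it is reported as a rate rather than an accuracy.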
🎉 Lesson Complete!
You now understand how machines convert speech to text! Next, dive into advanced NLP with transformers for text understanding and generation.