Lesson 9 • Intermediate
Natural Language Processing
Turn raw text into numbers a model can learn from — clean it, count it, weight it, and read its sentiment.
What You'll Learn in This Lesson
- ✓ You'll be able to preprocess text: lowercase, tokenize, and strip punctuation
- ✓ You'll remove stopwords and reduce words with stemming vs lemmatization
- ✓ You'll build Bag of Words and TF-IDF vectors by hand and with scikit-learn
- ✓ You'll generate n-grams to keep some word order
- ✓ You'll explain word embeddings (word2vec / GloVe) in plain terms
- ✓ You'll outline a full sentiment-analysis pipeline end to end
pip install scikit-learn
🌍 Real-World Analogy: Teaching a Computer to Read
Imagine teaching a child who has never seen language to read. You don't hand them a novel. You start by showing that text breaks into words (tokenization). You teach that "Cat" and "cat" are the same word (lowercasing) and that commas aren't words (removing punctuation). You explain that tiny words like "the" and "is" appear everywhere and rarely change the meaning (stopwords).
Then you teach which words matter in a given sentence (TF-IDF), that some words travel in pairs like "ice cream" (n-grams), and finally that words have meanings — "happy" sits near "joyful" and far from "miserable" (embeddings). A computer learns to read the exact same way, but with maths standing in for intuition. That is all NLP is: a ladder from raw characters up to meaning.
1. Text Preprocessing: Cleaning Before You Count
Raw text is messy: mixed case, punctuation, and filler words. Preprocessing is the cleanup that happens before any model sees your text. You lowercase it, tokenize it (split it into a list of words), drop punctuation, remove stopwords (common words like "the" and "is" that carry little meaning), and reduce word forms with stemming or lemmatization.
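Stemming crudely chops endings, so a naive suffix-stripper turns running into runn and can produce non-words. (Rule-based stemmers such as Porter's repair some of these cases, but still emit non-words like studi for studies.) Lemmatization is smarter: it uses a dictionary to return a real base word (running becomes run). Read the worked example, then run it to watch each step transform the text.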
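As a rough sketch, the whole cleanup chain fits in one small function. The `STOPWORDS` set and the `crude_stem` helper here are made up for illustration and are far smaller and cruder than what a real NLP library ships.

```python
# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to"}

def crude_stem(word):
    """Naively chop a common suffix; deliberately crude, may yield non-words."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least 3 characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    lowered = text.lower()                                   # 1) lowercase
    cleaned = "".join(ch for ch in lowered
                      if ch.isalpha() or ch == " ")          # 2) drop punctuation
    tokens = cleaned.split()                                 # 3) tokenize on spaces
    kept = [t for t in tokens if t not in STOPWORDS]         # 4) remove stopwords
    return [crude_stem(t) for t in kept]                     # 5) stem

print(preprocess("The CATS are Running, and the cats RAN fast!!!"))
# ['cat', 'runn', 'cat', 'ran', 'fast']
```

Notice that the crude stemmer produces the non-word "runn" and cannot link "ran" to "run". That gap is what lemmatization is meant to close.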
Worked Example: Preprocess Text Step by Step
Lowercase, strip punctuation, tokenize, remove stopwords, and stem
# === Text Preprocessing: cleaning raw text before analysis ===
# Goal: turn messy human text into a clean list of meaningful words.
raw = "The CATS are Running, and the cats RAN fast!!!"
# Step 1) Lowercase — so "Cats" and "cats" count as the same word
lowered = raw.lower()
print("lowercased:", lowered)
# lowercased: the cats are running, and the cats ran fast!!!
# Step 2) Remove punctuation — keep letters/spaces only
cleaned = ""
for ch in lowered:
    if ch.isalpha() or ch == " ":  # keep
...
2. Bag of Words and TF-IDF: Turning Text Into Numbers
Models do maths, not words, so you must convert text into vectors (lists of numbers). Bag of Words (BoW) is the simplest: build a vocabulary of every unique word, then represent each document by how many times each vocab word appears. Order is thrown away — hence "bag".
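A minimal by-hand sketch of that idea, assuming a plain whitespace tokenizer (so the one-letter word "i" is kept, unlike scikit-learn's default tokenizer):

```python
docs = ["i love nlp", "i love machine learning", "nlp is fun"]

# Tokenize each document on spaces
tokenized = [d.split() for d in docs]

# Vocabulary = every unique word, sorted so column order is stable
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)
# ['fun', 'i', 'is', 'learning', 'love', 'machine', 'nlp']
for row in vectors:
    print(row)
# [0, 1, 0, 0, 1, 0, 1]
# [0, 1, 0, 1, 1, 1, 0]
# [1, 0, 1, 0, 0, 0, 1]
```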
The problem: common words like "the" dominate the counts without adding meaning. TF-IDF (Term Frequency × Inverse Document Frequency) fixes this. It multiplies each word's frequency in a document by how rare the word is across all documents, so distinctive words score higher and ubiquitous words score lower.
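To make the multiplication concrete, here is a small sketch using the plain textbook formula, TF × log(N / df). This is an assumption chosen for readability: scikit-learn uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math

docs = ["i love nlp", "i love machine learning", "nlp is fun"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(word, tokens):
    # Share of the document taken up by this word
    return tokens.count(word) / len(tokens)

def idf(word):
    # df = number of documents containing the word; rarer word -> larger idf
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_docs / df)

def tfidf(word, tokens):
    return tf(word, tokens) * idf(word)

second = tokenized[1]  # "i love machine learning"
scores = {word: round(tfidf(word, second), 3) for word in second}
print(scores)
# {'i': 0.101, 'love': 0.101, 'machine': 0.275, 'learning': 0.275}
```

"machine" and "learning" appear in only one document, so they outscore "i" and "love", which appear in two.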
In real projects you don't hand-roll the counting — scikit-learn's CountVectorizer and TfidfVectorizer do it in two lines. Read this example and compare it to the # Expected output comments:
# === Bag of Words & TF-IDF the professional way (scikit-learn) ===
# In real projects you don't hand-roll counting — sklearn does it fast.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = [
"i love nlp",
"i love machine learning",
"nlp is fun",
]
# --- Bag of Words: counts how many times each vocab word appears ---
bow = CountVectorizer()
X = bow.fit_transform(docs) # fit = learn vocab, transform = count
print(bow.get_feature_names_out())
# Expected output: ['fun' 'is' 'learning' 'love' 'machine' 'nlp']
# Note: 'i' is missing because the default tokenizer skips one-character tokens.
print(X.toarray())
# Expected output:
# [[0 0 0 1 0 1]
# [0 0 1 1 1 0]
# [1 1 0 0 0 1]]
# Each row is a document; each column a vocab word; order is ignored.
# --- TF-IDF: down-weights words common to EVERY doc, up-weights rare ones ---
tfidf = TfidfVectorizer()
T = tfidf.fit_transform(docs)
print(T.toarray().round(2))
# Expected output (approx):
# [[0.   0.   0.   0.71 0.   0.71]
#  [0.   0.   0.62 0.47 0.62 0.  ]
#  [0.62 0.62 0.   0.   0.   0.47]]
# 'love'/'nlp' span 2 docs -> smaller weight; 'machine'/'fun' are rarer -> larger.
# (In the first row both words are equally common, so after normalization they tie.)
fit_transform does two jobs: fit learns the vocabulary, and transform turns each document into a row of numbers. You fit once, then transform anything.
🎯 Your Turn: Tokenize and Word-Count
Fill in the two blanks to split a sentence into words and count each one. This is Bag of Words for a single document, built by hand.
Try It Yourself: Word Counter
Tokenize a sentence and count how often each word appears
# 🎯 YOUR TURN — tokenize a sentence and count each word
# Fill in the blanks marked ___ then run it.
sentence = "the cat sat on the mat the cat purred"
# 1) Lowercase and tokenize into a list of words
tokens = sentence.lower().___() # 👉 method that splits on spaces -> split
# 2) Count each word into a dictionary {word: count}
counts = {}
for word in tokens:
    counts[word] = counts.get(word, ___) + 1 # 👉 starting count for a new word
# 3) Print every word and its count
for word, n i
...
3. N-grams: Keeping a Little Word Order
Bag of Words throws away order, so "dog bites man" and "man bites dog" look identical to it. An n-gram is a sequence of n adjacent words: a bigram is a pair, a trigram is three. Including bigrams like ("natural", "language") lets your model capture phrases a single-word bag would miss — at the cost of a much larger vocabulary.
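A quick sketch of that claim, using a small hypothetical `ngrams` helper and `collections.Counter` as the "bag":

```python
from collections import Counter

def ngrams(words, n):
    # Slide a window of size n across the token list
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

# Single-word bags are identical: word order is lost
same_bag = Counter(a) == Counter(b)

# Bigram bags differ: ('dog', 'bites') is not ('man', 'bites')
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_bag)      # True
print(same_bigrams)  # False
```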
Worked Example: Generate N-grams
Build unigrams, bigrams, and trigrams from a token list
# === N-grams: capture a little word ORDER that Bag of Words throws away ===
# A bigram is a pair of adjacent words; a trigram is three in a row.
tokens = "i really love natural language processing".split()
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
print("unigrams:", ngrams(tokens, 1))
# unigrams: [('i',), ('really',), ('love',), ('natural',), ('language',), ('processing',)]
print("bigrams: ", ngrams(tokens, 2))
# bigrams: [('i', 'really'), ('
...
🎯 Your Turn: Compute Term Frequency
Term frequency (the "TF" in TF-IDF) is just a word's count divided by the total number of words. Fill in the blank to finish the calculation.
Try It Yourself: Term Frequency
Divide each word count by the document length to get TF
# 🎯 YOUR TURN — compute term frequency (TF)
# TF = (times a word appears) / (total words in the document)
doc = "data data data science is fun"
tokens = doc.lower().split()
total = len(tokens) # total number of words
# Count each word
counts = {}
for word in tokens:
    counts[word] = counts.get(word, 0) + 1
# Build {word: term_frequency}, rounded to 2 decimals
tf = {}
for word, n in counts.items():
    tf[word] = round(n / ___, 2) # 👉 divide the count by the total word coun
...
4. Word Embeddings: Numbers That Capture Meaning
BoW and TF-IDF have a blind spot: to them, "happy" and "joyful" are as unrelated as "happy" and "table" — just different columns. Word embeddings fix this by representing each word as a dense vector of numbers (often 100–300 of them) learned from how words are actually used together.
Words used in similar contexts end up close together in this vector space. The famous result is that you can do arithmetic with meaning: king − man + woman ≈ queen.
- word2vec: learns embeddings by predicting a word from its nearby context words in a sliding window (CBOW), or the context words from the word (skip-gram). Fast and famous.
- GloVe: learns from global co-occurrence counts (how often word pairs appear together across the whole corpus).
Both produce a lookup table mapping each word to its vector. Modern transformers (BERT, GPT) take this further with contextual embeddings — the vector for "bank" differs in "river bank" versus "bank account" — but the intuition you've just learned is the same.
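"Close together" is usually measured with cosine similarity. The sketch below uses three hand-written 3-number vectors that are entirely made up for illustration. Real embeddings are learned from a corpus and have far more dimensions.

```python
import math

# Toy "embeddings" invented for this example, not learned from data
toy = {
    "happy":  [0.90, 0.80, 0.10],
    "joyful": [0.85, 0.75, 0.20],
    "table":  [0.10, 0.00, 0.90],
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print("happy vs joyful:", round(cosine(toy["happy"], toy["joyful"]), 2))
print("happy vs table: ", round(cosine(toy["happy"], toy["table"]), 2))
```

With these toy numbers, "happy" scores much closer to "joyful" than to "table". A Bag of Words vector cannot show that relationship.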
🌍 The Sentiment-Analysis Pipeline, End to End
Everything in this lesson clicks together into one of NLP's most common tasks: deciding whether a review is positive or negative. Here is the full pipeline, stage by stage:
- Collect labelled data — reviews tagged positive / negative (e.g. star ratings).
- Preprocess — lowercase, tokenize, strip punctuation, remove stopwords (Section 1).
- Split first — divide into train and test sets before vectorizing.
- Vectorize: fit_transform a TF-IDF vectorizer on the training text only, then transform the test text with it (Section 2).
- Train a classifier: feed the vectors to logistic regression or a small neural net.
- Evaluate — check accuracy on the held-out test set, then predict sentiment on new reviews.
Production systems swap steps 4–5 for a fine-tuned transformer like BERT, which handles context, negation, and sarcasm far better — but the shape of the pipeline is exactly this.
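The stages above could be wired together roughly like this with scikit-learn. The eight reviews and their labels are invented for the example, and a dataset this small is far too tiny for the accuracy figure to mean anything. Treat it as a sketch of the shape only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up toy reviews: 1 = positive, 0 = negative (illustrative only)
texts = [
    "i love this movie", "great acting and a great story",
    "what a happy ending", "really enjoyable film",
    "i hate this movie", "bad acting and a boring story",
    "what a sad waste of time", "really terrible film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split FIRST, so the vectorizer never sees the test text
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Vectorize: fit on train only, then reuse the fitted vocabulary on test
vectorizer = TfidfVectorizer()   # lowercases and tokenizes by default
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train and evaluate a simple classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Predict on a new, unseen review
new_review = vectorizer.transform(["a great happy film"])
print("prediction:", clf.predict(new_review))
```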
Common Errors (And How to Fix Them)
These four mistakes trip up almost everyone learning NLP. Spotting them early saves hours.
❌ Not cleaning the text first
Feeding raw text straight into a vectorizer means "Hello," and "hello" become different tokens, bloating the vocabulary with near-duplicates.
✅ Fix: always preprocess (lowercase, strip punctuation, drop stopwords) before counting.
❌ Ignoring case and punctuation
Treating "AI", "ai", and "ai!" as three separate words splits one concept across three columns and weakens every signal.
✅ Fix: call .lower() and remove punctuation so word forms collapse together.
❌ Vocabulary explosion
Adding bigrams and trigrams to a large corpus can create hundreds of thousands of columns — slow to train and prone to overfitting.
✅ Fix: cap the vocabulary with CountVectorizer(max_features=5000) or min_df to drop rare terms.
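A small sketch of both caps in action. The three documents are made up, and on a corpus this small the effect is only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Unigrams + bigrams with no cap: every distinct term becomes a column
full = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# Keep only the 5 most frequent terms across the corpus
capped = CountVectorizer(ngram_range=(1, 2), max_features=5).fit(docs)

# Keep only terms that appear in at least 2 documents
frequent = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)

print("no cap:        ", len(full.vocabulary_))
print("max_features=5:", len(capped.vocabulary_))
print("min_df=2:      ", len(frequent.vocabulary_))
```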
❌ Data leakage when fitting the vectorizer
Calling fit_transform on the whole dataset before splitting lets the vectorizer learn the test vocabulary — your accuracy looks great, then collapses in production.
# ❌ leaks test data into the vectorizer
X = tfidf.fit_transform(all_texts)
X_train, X_test = split(X)
✅ Fix: split first, fit only on train, transform test:
train, test = split(all_texts)
X_train = tfidf.fit_transform(train)  # fit on train only
X_test = tfidf.transform(test)        # reuse the fitted vocab
📋 Quick Reference
| Technique | What It Does | Tool / Example |
|---|---|---|
| Tokenization | Split text into words | text.lower().split() |
| Stopword removal | Drop common low-meaning words | [w for w in t if w not in stop] |
| Stemming / Lemmatizing | Reduce words to a base form | running → run |
| Bag of Words | Count word frequencies | CountVectorizer() |
| TF-IDF | Weight words by importance | TfidfVectorizer() |
| N-grams | Keep adjacent-word order | ngram_range=(1, 2) |
| Word embeddings | Dense meaning vectors | word2vec, GloVe |
| Contextual models | Meaning from full context | BERT, GPT |
❓ Frequently Asked Questions
Q: What is tokenization in NLP?
A: Tokenization splits raw text into smaller units called tokens, usually words. For example, after lowercasing, 'I love NLP' becomes ['i', 'love', 'nlp']. Tokens are the basic pieces every later step (counting, vectorizing, embedding) operates on.
Q: What is the difference between Bag of Words and TF-IDF?
A: Bag of Words counts how often each vocabulary word appears in a document and ignores order. TF-IDF starts from those counts but down-weights words that appear in almost every document (like 'the') and up-weights rarer, more distinctive words, so important terms stand out.
Q: What is the difference between stemming and lemmatization?
A: Both reduce a word to a base form so related forms can be grouped. Stemming crudely chops endings (a naive suffix-stripper gives 'running' -> 'runn') and can produce non-words. Lemmatization uses a dictionary and grammar to return a real base word ('running' -> 'run', and with part-of-speech information even the irregular 'ran' -> 'run'), so it is more accurate but slower.
Q: What are word embeddings like word2vec and GloVe?
A: Word embeddings represent each word as a dense vector of numbers learned from how words are used together. Words with similar meanings end up close together in that space, which lets a model capture meaning rather than just word counts. word2vec learns from local context windows; GloVe learns from global co-occurrence counts.
Q: What is data leakage when fitting a vectorizer?
A: Data leakage happens when you call fit (or fit_transform) on your full dataset before splitting into train and test. The vectorizer then 'sees' the test vocabulary and statistics, giving you over-optimistic results. Always fit only on training data, then transform the test set with that already-fitted vectorizer.
🎯 Mini-Challenge: Tiny Sentiment Scorer
Time to fly with less support. The starter below gives you only a comment outline — no filled-in logic. Use everything from this lesson (tokenize, then count matches against two word sets) to print a verdict.
Mini-Challenge: Sentiment Scorer
Score a sentence as Positive, Negative, or Neutral from word lists
# 🎯 MINI-CHALLENGE: tiny sentiment scorer
# Brief: score a sentence as Positive / Negative / Neutral using word lists.
#
# 1. positive = {"love", "great", "happy"} ; negative = {"hate", "bad", "sad"}
# 2. Lowercase the sentence and split it into tokens
# 3. score = (number of positive words) - (number of negative words)
# 4. Print "Positive" if score > 0, "Negative" if score < 0, else "Neutral"
#
# ✅ Expected for "i love this great movie": Positive (score 2)
# ✅ Expected for "this is bad and i
...
Lesson 9 complete: you can teach a computer to read!
You can preprocess raw text, build Bag of Words and TF-IDF vectors by hand and with scikit-learn, generate n-grams, explain word embeddings, and lay out a full sentiment-analysis pipeline — including how to avoid data leakage when fitting a vectorizer.
🚀 Up next: Computer Vision — teaching computers to see and understand images.