Lesson 9 • Intermediate
Natural Language Processing
Teach computers to understand text — tokenisation, Bag of Words, TF-IDF, and sentiment analysis.
✅ What You'll Learn
- Tokenisation: word, character, and subword (BPE)
- Bag of Words and TF-IDF text representations
- Building a sentiment analyser from scratch
- How modern NLP (BERT, GPT) processes text
💬 What Is NLP?
🎯 Real-World Analogy: Imagine you're visiting Japan and don't speak Japanese. First, you break the language into recognisable pieces (tokenisation). Then you learn which words appear often together (patterns). Finally, you understand meaning from context (comprehension). NLP does exactly this — but with maths instead of intuition.
The fundamental challenge: computers see text as a sequence of characters. They don't know that "bank" means something different in "river bank" vs "bank account". NLP bridges this gap by converting text to numbers that capture meaning.
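This character-level view is easy to see in Python, where `ord` exposes the Unicode code point behind each character:

```python
# Computers store text as numbers — here, Unicode code points.
text = "bank"
codes = [ord(ch) for ch in text]
print(codes)  # [98, 97, 110, 107]

# The numbers are identical whether "bank" means a river bank or a
# financial institution — recovering meaning from context is NLP's job.
round_trip = "".join(chr(c) for c in codes)
print(round_trip)  # bank
```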
Try It: Tokenisation
Break text into tokens and build a vocabulary
# Tokenization: Breaking text into pieces the model can understand
text = "The quick brown fox jumps over the lazy dog. AI is amazing!"
# Method 1: Word tokenization (split by spaces)
word_tokens = text.split()
print("=== Word Tokenization ===")
print(f"Text: {text}")
print(f"Tokens: {word_tokens}")
print(f"Count: {len(word_tokens)}")
print()
# Method 2: Character tokenization
char_tokens = list(text)
print("=== Character Tokenization ===")
print(f"First 20 chars: {char_tokens[:20]}")
print(f"Count: {len(char_tokens)}")
print()
# Build a vocabulary: one integer id per unique word token
vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
print("=== Vocabulary ===")
print(f"Size: {len(vocab)} unique tokens")
print(f"Ids: {vocab}")
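Word and character tokenisation sit at two extremes; subword methods like BPE (listed above) sit in between. Here is a deliberately tiny sketch of the BPE training loop — the corpus and the number of merges are made up for illustration, whereas real tokenizers train on huge corpora with tuned vocabulary sizes:

```python
from collections import Counter

# Toy corpus, invented for this sketch. Each word becomes a list of
# characters plus an end-of-word marker.
corpus = ["low", "lower", "lowest", "newer", "wider"]
words = [list(w) + ["</w>"] for w in corpus]

def most_frequent_pair(words):
    # Count every adjacent token pair across all words
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of the pair with a single merged token
    a, b = pair
    out = []
    for w in words:
        merged, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(w[i])
                i += 1
        out.append(merged)
    return out

for step in range(4):  # four merges is enough to see the idea
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(f"Merge {step + 1}: {pair} -> {words}")
```

After a few merges, frequent fragments like "low" and "er" become single tokens, which is why BPE handles rare words ("lowest") without an enormous vocabulary.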
Try It: Bag of Words & TF-IDF
Convert text to numerical vectors based on word frequency
import numpy as np
# Bag of Words: Simple but effective text representation
# Count word frequencies, ignore order
documents = [
"I love machine learning and AI",
"AI and deep learning are amazing",
"Machine learning is a subset of AI",
"I love programming in Python",
]
# Step 1: Build vocabulary from all documents
vocab = sorted(set(word.lower() for doc in documents for word in doc.split()))
print("=== Vocabulary ===")
print(f" {vocab}")
print(f" Size: {len(vocab)} unique words")
print()
# Step 2: Bag of Words — one count vector per document
bow = np.array([[doc.lower().split().count(w) for w in vocab] for doc in documents])
print("=== Bag of Words ===")
for doc, vec in zip(documents, bow):
    print(f"{vec} <- {doc}")
print()
# Step 3: TF-IDF — term frequency times inverse document frequency,
# which down-weights words that appear in almost every document
tf = bow / bow.sum(axis=1, keepdims=True)
idf = np.log(len(documents) / (bow > 0).sum(axis=0))
tfidf = tf * idf
print("=== TF-IDF (rounded) ===")
print(np.round(tfidf, 2))
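Once documents are TF-IDF vectors, you can compare them numerically. A common choice is cosine similarity; this self-contained sketch (same technique, a made-up three-document corpus) shows which documents are closest:

```python
import numpy as np

# Made-up mini-corpus for illustration; any list of strings works.
docs = [
    "I love machine learning and AI",
    "AI and deep learning are amazing",
    "I love programming in Python",
]
vocab = sorted(set(w.lower() for d in docs for w in d.split()))

# Bag of Words counts, then TF-IDF weighting
counts = np.array([[d.lower().split().count(w) for w in vocab] for d in docs])
tf = counts / counts.sum(axis=1, keepdims=True)
idf = np.log(len(docs) / (counts > 0).sum(axis=0))
tfidf = tf * idf

def cosine(a, b):
    # Cosine of the angle between two vectors; ignores document length
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"doc0 vs doc1: {cosine(tfidf[0], tfidf[1]):.3f}")
print(f"doc0 vs doc2: {cosine(tfidf[0], tfidf[2]):.3f}")
```

The first two documents share "AI", "and", and "learning", so they score higher than the pair sharing only "I love" — this is exactly how TF-IDF search engines rank results.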
Try It: Sentiment Analysis
Build a rule-based sentiment analyser with negation handling
import numpy as np
# Sentiment Analysis: Is text positive, negative, or neutral?
# Simple rule-based approach (real systems use ML)
positive_words = {
"love", "great", "amazing", "excellent", "wonderful", "fantastic",
"good", "happy", "best", "awesome", "beautiful", "enjoy", "perfect"
}
negative_words = {
"hate", "terrible", "awful", "bad", "worst", "horrible", "poor",
"boring", "ugly", "disappointing", "waste", "annoying", "slow"
}
intensifiers = {"very", "really", "extremely"}
negations = {"not", "no", "never"}
def sentiment(text):
    score = 0.0
    words = text.lower().strip(".!?").split()
    for i, w in enumerate(words):
        value = 1.0 if w in positive_words else -1.0 if w in negative_words else 0.0
        if value and i > 0 and words[i - 1] in intensifiers:
            value *= 2  # "really good" counts double
        if value and i > 0 and words[i - 1] in negations:
            value = -value  # "not good" flips polarity
        score += value
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
print(sentiment("I really love this movie"))  # positive
print(sentiment("This is not good"))          # negative
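As the comment above notes, real systems learn from data rather than hand-written word lists. One classic learned approach is Naive Bayes; this is a minimal sketch over a tiny invented training set, not a production classifier:

```python
import math
from collections import Counter

# Toy labelled reviews — invented here purely for illustration.
train = [
    ("i love this movie it is great", "pos"),
    ("what an amazing wonderful film", "pos"),
    ("the best experience ever", "pos"),
    ("i hate this boring movie", "neg"),
    ("terrible awful waste of time", "neg"),
    ("the worst film ever", "neg"),
]

# "Training" is just counting word frequencies per class
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = set(counts["pos"]) | set(counts["neg"])

def predict(text):
    # Naive Bayes with add-one smoothing; classes have equal priors here
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in text.split()
        )
    return max(scores, key=scores.get)

print(predict("i love this great film"))  # pos
print(predict("boring terrible movie"))   # neg
```

Unlike the rule-based version, nothing here was hand-labelled as "positive vocabulary" — the word-class associations come entirely from the training data, which is why ML approaches scale to new domains.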
📋 Quick Reference
| Technique | What It Does | Used In |
|---|---|---|
| Tokenisation | Split text into pieces | Every NLP system |
| Bag of Words | Count word frequencies | Simple classifiers |
| TF-IDF | Weight words by importance | Search engines |
| Word Embeddings | Dense vector representations | Word2Vec, GloVe |
| Transformers | Contextual understanding | BERT, GPT, LLMs |
💡 Pro Tip: For production NLP, use Hugging Face Transformers library. With 3 lines of code you can load a pre-trained BERT model that took millions of dollars to train — and fine-tune it on your specific task for free.
🎉 Lesson Complete!
You can now process and analyse text with NLP! Next, learn Computer Vision — teaching computers to see and understand images.