Lesson 35 • Advanced

Advanced NLP: BERT, T5 & LLaMA

By the end you'll explain how transformers read context, tell BERT (understanding) apart from GPT (generation), fine-tune a pretrained model, and run NER, question answering, and summarization with a single line of Hugging Face code.

What You'll Learn in This Lesson

✓ Why transformers give every word a context-aware embedding
✓ How BERT (encoder, masked LM) differs from GPT (decoder)
✓ What transfer learning and fine-tuning mean for NLP tasks
✓ How to tag entities with Named Entity Recognition (NER)
✓ How question answering extracts an answer span from text
✓ How to summarize text, and how to pick the right model per task

Before you start: You should be comfortable with the Transformers & Attention lesson. This one builds on that to show how real NLP tasks are solved.

🌍 Real-World Analogy: Understanding Meaning, Not Just Words

Imagine two readers handed the sentence "I went to the bank." A dictionary-only reader sees the word "bank" and shrugs — it could be money or a river. A thoughtful reader looks at the words around it ("I fished off the …") and instantly knows which one you mean.

That thoughtful reader is a transformer. It doesn't store one fixed meaning per word — it builds a contextual embedding, a fresh vector for each word shaped by its neighbours. That is the whole reason modern NLP understands meaning, not just spelling. BERT is the detective reading the full report in both directions to understand; GPT is the storyteller writing one word at a time; T5 is the translator who reads everything, then writes a fresh version.

1. Contextual Embeddings: Words With a Meaning That Moves

An embedding is a list of numbers (a vector) that represents a piece of text so a computer can do maths with meaning. The breakthrough of transformers is that the embedding is contextual: the same word gets a different vector depending on the sentence.

You compare two vectors with cosine similarity, a score that runs from -1 (opposite directions) through 0 (unrelated) to 1 (same direction). Vectors with only non-negative components, like the two "bank" vectors below, always score between 0 and 1. This single tool powers semantic search: find the document whose vector points the same way as your question, even if it shares no exact words. Run the worked example below and read the comments; every result is stated inline.

Worked Example: Contextual Embeddings & Cosine Similarity

See how the same word 'bank' gets two different vectors — and measure semantic closeness

Python

# Why "context" is the heart of modern NLP
#
# Older NLP gave every word ONE fixed vector (word2vec, GloVe).
# So "bank" had the same meaning in "river bank" and "savings bank".
# Transformers (BERT, GPT) produce a CONTEXTUAL embedding: the vector
# for a word changes depending on the surrounding words.

# We fake "contextual" vectors here so you can see the idea without a GPU.
bank_money = [0.9, 0.1, 0.2]   # "bank" near: money, account, loan
bank_river = [0.1, 0.9, 0.8]   # "bank" near: river,
...
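To make the score range concrete, here is a minimal sketch in plain Python. The three pairs of vectors are hand-picked for illustration and do not come from a real model; they only show what "same direction", "unrelated", and "opposite" look like numerically.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-D vectors, chosen by hand for illustration only
same = cosine([1.0, 2.0], [2.0, 4.0])            # same direction, different length
perpendicular = cosine([1.0, 0.0], [0.0, 3.0])   # nothing in common
opposite = cosine([1.0, 2.0], [-1.0, -2.0])      # pointing the other way

print(round(same, 6), round(perpendicular, 6), round(opposite, 6))
# Expected output: 1.0 0.0 -1.0
```

Notice that the length of a vector does not matter, only its direction. That is why a short query and a long document can still score as close matches.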

2. BERT vs GPT: Encoder vs Decoder

BERT is an encoder. It is pretrained with masked language modeling (MLM): random words are hidden behind a [MASK] token and BERT must guess them using context from both sides. Because it reads in both directions, it is brilliant at understanding — classification, NER, similarity, and question answering.
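The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing: the function name `mask_tokens` is made up for this example, whitespace splitting stands in for a real tokenizer, and the 15% rate is the figure commonly cited from the BERT paper (the real recipe also sometimes swaps in a random word or leaves the word unchanged).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens, returning (masked_tokens, labels).

    labels[i] holds the original token where a mask was placed, else None.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the cat sat on the mat because it was warm".split()
masked, labels = mask_tokens(tokens)
print(masked)   # the sentence with some words replaced by [MASK]
print(labels)   # the hidden words the model would be trained to guess
```

During pretraining the model sees `masked` as input and is scored only on the positions where `labels` is not `None`. Words on both sides of each gap are available as clues.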

GPT is a decoder. It is pretrained to predict the next token using only the words to its left. That left-to-right view makes it a natural generator — chat, writing, code. T5 combines both: an encoder reads the input, a decoder writes the output, which is the perfect shape for summarization and translation.
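One way to picture "both sides" versus "left only" is a visibility grid: row i marks which tokens token i is allowed to look at. The sketch below is a hand-rolled illustration (the helper `visibility` is hypothetical, not a library call) and ignores everything else a real attention layer does.

```python
def visibility(n_tokens, causal):
    """Return an n x n grid: grid[i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

tokens = ["I", "love", "NLP"]
encoder_view = visibility(len(tokens), causal=False)  # BERT-style: sees everything
decoder_view = visibility(len(tokens), causal=True)   # GPT-style: sees the past only

for name, grid in [("encoder", encoder_view), ("decoder", decoder_view)]:
    print(name)
    for tok, row in zip(tokens, grid):
        print(f"  {tok:<5}", row)
```

The encoder grid is all ones. The decoder grid is a lower triangle: "I" sees only itself, "love" sees "I" and itself, and only the last token sees the whole sentence.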

Rule of thumb: if the task is "label or find something in this text," reach for an encoder (BERT). If the task is "write new text," reach for a decoder (GPT). If it is "rewrite this text into that text," reach for encoder-decoder (T5/BART).

3. Transfer Learning & Fine-Tuning

You almost never train an NLP model from zero. Instead you use transfer learning: take a model already pretrained on billions of words, then fine-tune it on your much smaller, task-specific dataset. The model already "knows" language; you only teach it your specific job.

The recipe for fine-tuning BERT on a task like NER or sentiment is short:

# Fine-tuning recipe (conceptual)
# 1. Load a pretrained model      -> bert-base-uncased
# 2. Add a small task head        -> e.g. a linear classification layer
# 3. Train a few epochs (3-5)     -> on your labelled data only
# 4. Use a tiny learning rate     -> 2e-5 with warmup, so you don't wipe
#                                     out what the model already learned

With a few thousand labelled examples you can often reach accuracy that training from scratch would need vastly more data to match. That is the power, and the economy, of transfer learning.
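Step 4 of the recipe mentions warmup, which is easier to see as numbers. The sketch below assumes a linear warmup over the first 10% of training followed by a linear decay to zero; both the 10% figure and the helper name `lr_at_step` are assumptions for illustration. The Transformers library ships a similar schedule as `get_linear_schedule_with_warmup`, but this version is plain Python so you can read every step.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
# Expected output:
# 0 0.00e+00
# 50 1.00e-05
# 100 2.00e-05
# 550 1.00e-05
# 1000 0.00e+00
```

Starting near zero keeps the first noisy updates from disturbing the pretrained weights, and the decay lets training settle at the end.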

4. NER, Question Answering & Summarization in One Line

Three of the most common NLP jobs each have a ready-made pretrained model:

Named Entity Recognition (NER) — tag each word as a person, organisation, or location.
Question Answering (QA) — given a context and a question, return the exact answer span.
Summarization — condense a long passage into a short one.
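For NER, a model usually labels one token at a time, and a second step merges neighbouring tokens into whole entities. The sketch below shows that merging step with the common B-/I-/O tagging convention (B = begins an entity, I = inside one, O = outside). The tags are written by hand here, not produced by a model, and `group_entities` is a hypothetical helper for illustration.

```python
def group_entities(tokens, tags):
    """Merge token-level BIO tags into (entity_text, entity_type) spans."""
    spans, current_words, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and current_type == tag[2:]
        if continues:
            current_words.append(token)
            continue
        if current_words:                       # close the span in progress
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if starts_new or tag.startswith("I-"):  # a stray I- tag starts a span too
            current_words, current_type = [token], tag[2:]
    if current_words:                           # close a span that ends the sentence
        spans.append((" ".join(current_words), current_type))
    return spans

tokens = ["Tim", "Cook", "is", "the", "CEO", "of", "Apple", "in", "California", "."]
tags   = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O", "B-LOC", "O"]
print(group_entities(tokens, tags))
# Expected output:
# [('Tim Cook', 'PER'), ('Apple', 'ORG'), ('California', 'LOC')]
```

Grouping is why "Tim Cook" comes back as one person rather than two separate words.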

Hugging Face's pipeline() downloads and wires up the right model for you. Study the worked example below — each call is followed by its # Expected output.

Worked Example: Hugging Face pipeline() — NER, QA, Summarization

One line per task, with the expected output shown in comments

Python

# In real projects you do NOT build transformers from scratch.
# Hugging Face's pipeline() loads a pretrained model in ONE line.
# (This needs 'pip install transformers' + a download, so study the
#  Expected output rather than running it in the browser sandbox.)

from transformers import pipeline

# 1) Named Entity Recognition — tag people, orgs, places
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
print(ner("Tim Cook is the CEO of Apple in California."))
# Expected output:
# [{'entity_group': '
...
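It also helps to see how an extractive question-answering model settles on its answer span. Such a model scores every token twice: once as a possible start of the answer and once as a possible end. The sketch below picks the best valid pair. The scores are invented for illustration (a real model would produce them), and `best_span` with its `max_answer_len` parameter is a hypothetical helper, not part of the pipeline API.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Pick the (start, end) token indices with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):          # the end can never come before the start
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Question (for context): "Where is the Eiffel Tower?"
context = "The Eiffel Tower is located in Paris France".split()
# Hypothetical per-token scores; a real QA model would produce these.
start_scores = [0.1, 0.2, 0.1, 0.0, 0.1, 0.3, 4.0, 1.0]
end_scores   = [0.0, 0.1, 0.3, 0.0, 0.1, 0.2, 2.5, 3.5]

start, end = best_span(start_scores, end_scores)
print(" ".join(context[start:end + 1]))
# Expected output: Paris France
```

The answer is always a slice copied out of the context, which is why this style of QA cannot invent facts that are not in the passage.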

🎯 Your Turn 1: Build a Rule-Based NER Tagger

Real NER uses BERT, but the idea is simple enough to do by hand: look each word up in a dictionary of known entities and label it. Fill in the one blank, then run it and check your output against the comment.

🎯 Your Turn: Rule-Based NER

Fill in the default label so unknown words are tagged 'O'

Python

# 🎯 YOUR TURN — a tiny Named Entity Recognition tagger
# Real NER uses BERT, but the IDEA is: look each word up and label it.
# Fill in the blanks marked with ___

# A dictionary mapping known words to their entity TYPE
known = {
    "Apple": "ORG",
    "Google": "ORG",
    "Paris": "LOC",
    "London": "LOC",
    "Alice": "PER",
    "Bob": "PER",
}

sentence = "Alice flew from London to Paris to visit Google"

tags = []
for word in sentence.split():
    # 👉 if the word is a key in 'known', us
...

🎯 Your Turn 2: Semantic Search With Cosine Similarity

This is how a search box "understands" a question: turn the query and each document into vectors, then rank documents by cosine similarity. Complete the cosine formula and confirm the password document scores higher than the weather one.

🎯 Your Turn: Semantic Search

Finish the cosine_similarity formula and rank two documents

Python

# 🎯 YOUR TURN — semantic search with cosine similarity
# Transformers turn a whole sentence into ONE vector. Sentences with
# similar MEANING point in a similar direction, even with different words.
# Fill in the blanks marked with ___

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    return sum(x * x for x in v) ** 0.5

def cosine_similarity(a, b):
    # 👉 cosine = dot product divided by (magnitude(a) * magnitude(b))
    return dot(a, b) / (___ * magnitude(b))
...

🎯 Mini-Challenge: Keyword Sentiment Tagger

Time to fade the scaffolding. Only a comment outline is provided — you write the logic. Build a tiny lexicon-based sentiment tagger that decides whether a review is positive, negative, or mixed.

🎯 Mini-Challenge: Sentiment From a Lexicon

Count positive vs negative words and print the overall verdict

Python

# 🎯 MINI-CHALLENGE: keyword sentiment tagger
# 1. Make a dict 'lexicon' mapping words to "POS" or "NEG"
#    e.g. {"great": "POS", "love": "POS", "terrible": "NEG", "hate": "NEG"}
# 2. Split this review into words:
#       review = "I love this great phone but hate the terrible battery"
# 3. Count how many POS words and how many NEG words appear
# 4. Print "Overall: POSITIVE" if pos > neg, "Overall: NEGATIVE" if neg > pos,
#    otherwise "Overall: MIXED"
#
# ✅ Expected output (with the review a
...

5. Common Errors (And How to Fix Them)

❌ Ignoring context

Treating a word as one fixed meaning (old word2vec thinking). "Apple" the company and "apple" the fruit are different — a model that ignores context will confuse them.

✅ Fix: Use a transformer model whose embeddings are contextual; never average away the sentence.

❌ Wrong model for the task

Using GPT for plain classification or NER. It can work, but a large generative model is typically much slower and more expensive per prediction than a fine-tuned BERT, and it is often less accurate on the label.

✅ Fix: Encoder (BERT) for understanding, decoder (GPT) for generation, encoder-decoder (T5) for rewriting. Use the simplest model that solves the task.

❌ Exceeding the token limit

BERT caps at 512 tokens. Feed it a long document and, depending on your tokenizer settings, the input is either truncated (the model never sees the end, so your answer or summary can be wrong) or rejected with an error.

✅ Fix: Chunk long text into overlapping windows, or use a long-context model designed for it.
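Overlapping windows can be sketched like this. The example uses plain list items as stand-in tokens, and the 64-token overlap is an arbitrary choice for illustration. With a real BERT tokenizer you would also leave room inside each window for special tokens (and for the question, in QA).

```python
def chunk_tokens(tokens, window=512, overlap=64):
    """Split tokens into windows of at most `window` items that overlap by `overlap`."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

# Stand-in "tokens"; a real tokenizer would produce subword pieces.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
# Expected output: [512, 512, 304]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one window, so an answer sitting on the seam is not cut in half.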

❌ Poor fine-tuning data quality

Fine-tuning on noisy, mislabelled, or imbalanced data. The model faithfully learns your mistakes — garbage in, garbage out — and scores look fine on a flawed test set.

✅ Fix: Clean and balance labels, hold out a trustworthy validation set, and prefer fewer correct examples over many noisy ones.
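A quick way to act on this is to count your labels and hold out a validation set that keeps the same label mix. The sketch below is a minimal version in plain Python, using toy data and a 20% holdout that are assumptions for illustration; `stratified_split` is a hypothetical helper, and libraries such as scikit-learn offer more complete tools for the same job.

```python
import random
from collections import Counter, defaultdict

def stratified_split(examples, holdout_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, validation = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * holdout_fraction))
        validation.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, validation

# Toy, deliberately imbalanced data for illustration
examples = [(f"good review {i}", "POS") for i in range(40)] + \
           [(f"bad review {i}", "NEG") for i in range(10)]

print(Counter(label for _, label in examples))    # spot the imbalance first
train, validation = stratified_split(examples)
print(Counter(label for _, label in validation))  # same 4:1 mix in the holdout
```

If the first count is badly lopsided, accuracy alone can look healthy while the model simply predicts the majority label, so check per-label results on the held-out set too.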

📋 Quick Reference: BERT vs GPT

Feature	BERT (Encoder)	GPT (Decoder)
Attention	Bidirectional (both sides)	Left-to-right only
Pretraining	Masked LM (fill the blank)	Next-token prediction
Best at	Understanding	Generation
Typical tasks	Classification, NER, QA, similarity	Chat, writing, code, reasoning
Sees	All tokens at once	Past tokens only
Example	bert-base-uncased	GPT-4, LLaMA, Mistral

Need to rewrite text (summarize/translate)? Use an encoder-decoder like T5 or BART — it has both.

❓ Frequently Asked Questions

Q: What is a contextual embedding in NLP?

A: It is a vector for a word or sentence whose values depend on the surrounding text. Unlike older fixed embeddings (word2vec, GloVe) where 'bank' always had one vector, a transformer gives 'bank' a different vector in 'river bank' versus 'savings bank', which is why modern NLP understands meaning, not just spelling.

Q: What is the difference between BERT and GPT?

A: BERT is an encoder trained with masked language modeling: it reads the whole sentence in both directions, so it is best for understanding tasks like classification, NER, and question answering. GPT is a decoder trained to predict the next token left-to-right, so it is best for generating text like chat and writing.

Q: What is transfer learning and fine-tuning for NLP?

A: Transfer learning means starting from a model already pretrained on huge amounts of text, then fine-tuning it on your smaller task-specific dataset. You add a small task head and train for a few epochs, which reaches high accuracy with far less data than training from scratch.

Q: Which model should I use for summarization or translation?

A: Use an encoder-decoder model such as T5 or BART. They read the entire input with an encoder and then generate new text with a decoder, which is exactly the shape of summarization and translation tasks.

Q: Do I need to build transformers from scratch?

A: No. For most tasks you call Hugging Face's pipeline() for instant NER, question answering, or summarization, and use the Trainer API (often with LoRA) to fine-tune. Building a transformer by hand is a learning exercise, not how production NLP is shipped.

🎉

Lesson 35 complete — you understand the NLP model landscape!

You can explain contextual embeddings, distinguish BERT's encoder from GPT's decoder, describe transfer learning and fine-tuning, and run NER, question answering, and summarization with Hugging Face's pipeline(). You also built a rule-based NER tagger and a cosine-similarity search by hand.

🚀 Up next: RAG Systems — combine these LLMs with a knowledge base so they answer from your documents, not just their training data.
