
    Lesson 35 • Advanced

    Advanced NLP: BERT, T5 & LLaMA

    By the end you'll explain how transformers read context, tell BERT (understanding) apart from GPT (generation), fine-tune a pretrained model, and run NER, question answering, and summarization with a single line of Hugging Face code.

    What You'll Learn in This Lesson

    • Why transformers give every word a context-aware embedding
    • How BERT (encoder, masked LM) differs from GPT (decoder)
    • What transfer learning and fine-tuning mean for NLP tasks
    • How to tag entities with Named Entity Recognition (NER)
    • How question answering extracts an answer span from text
    • How to summarize text — and pick the right model per task

    🌍 Real-World Analogy: Understanding Meaning, Not Just Words

    Imagine two readers handed the sentence "I went to the bank." A dictionary-only reader sees the word "bank" and shrugs: it could mean money or a river. A thoughtful reader looks at the surrounding words (for example, "I fished off the …") and instantly knows which one you mean.

    That thoughtful reader is a transformer. It doesn't store one fixed meaning per word — it builds a contextual embedding, a fresh vector for each word shaped by its neighbours. That is the whole reason modern NLP understands meaning, not just spelling. BERT is the detective reading the full report in both directions to understand; GPT is the storyteller writing one word at a time; T5 is the translator who reads everything, then writes a fresh version.

    1. Contextual Embeddings: Words With a Meaning That Moves

    An embedding is a list of numbers (a vector) that represents a piece of text so a computer can do maths with meaning. The breakthrough of transformers is that the embedding is contextual: the same word gets a different vector depending on the sentence.

    You compare two vectors with cosine similarity, a score that runs from -1 (opposite directions) through 0 (unrelated) to 1 (same direction). For the non-negative toy vectors in this lesson it stays between 0 and 1. This single tool powers semantic search: find the document whose vector points the same way as your question, even if it shares no exact words. Run the worked example below and read the comments; every result is stated inline.

    Worked Example: Contextual Embeddings & Cosine Similarity

    See how the same word 'bank' gets two different vectors — and measure semantic closeness

    Python
    # Why "context" is the heart of modern NLP
    #
    # Older NLP gave every word ONE fixed vector (word2vec, GloVe).
    # So "bank" had the same meaning in "river bank" and "savings bank".
    # Transformers (BERT, GPT) produce a CONTEXTUAL embedding: the vector
    # for a word changes depending on the surrounding words.
    
    # We fake "contextual" vectors here so you can see the idea without a GPU.
    bank_money = [0.9, 0.1, 0.2]   # "bank" near: money, account, loan
    bank_river = [0.1, 0.9, 0.8]   # "bank" near: river,
    ...
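Cosine similarity itself is only a few lines of arithmetic. The sketch below is a minimal, self-contained version that reuses the two toy "bank" vectors from the worked example. The `cosine_similarity` helper and the `query_loan` vector are assumptions made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors from the worked example above.
bank_money = [0.9, 0.1, 0.2]
bank_river = [0.1, 0.9, 0.8]

# Hypothetical vector standing in for a finance-related query (an assumption).
query_loan = [0.8, 0.2, 0.1]

print(round(cosine_similarity(bank_money, bank_river), 2))  # 0.3
print(round(cosine_similarity(query_loan, bank_money), 2))  # 0.99
print(round(cosine_similarity(query_loan, bank_river), 2))  # 0.34
```

The finance-flavoured query lands much closer to the "money" sense of "bank" than to the "river" sense, which is the ranking signal semantic search relies on.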

    2. BERT vs GPT: Encoder vs Decoder

    BERT is an encoder. It is pretrained with masked language modeling (MLM): a random subset of tokens (about 15% in the original BERT setup) is hidden, mostly behind a [MASK] token, and BERT must predict them using context from both sides. Because it reads in both directions, it is well suited to understanding tasks: classification, NER, similarity, and question answering.
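To make the masking step concrete, here is a simplified sketch of how MLM training pairs could be prepared. It is an illustration under stated assumptions, not BERT's actual preprocessing: the `mask_tokens` helper is hypothetical, it splits on whitespace instead of using a subword tokenizer, and it always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps the token or swaps in a random one.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden.

    Returns (masked, labels): labels[i] holds the original token where
    masked[i] is [MASK], and None where the token was left visible.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model must predict this token
        else:
            masked.append(token)
            labels.append(None)    # no loss is computed at this position
    return masked, labels

tokens = "the river bank was muddy after the heavy rain".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pretraining the model sees `masked` as input and is scored only on the positions where `labels` is not `None`.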

    GPT is a decoder. It is pretrained to predict the next token using only the words to its left. That left-to-right view makes it a natural generator — chat, writing, code. T5 combines both: an encoder reads the input, a decoder writes the output, which is the perfect shape for summarization and translation.
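The difference between "both directions" and "left only" comes down to which positions each token is allowed to attend to. The sketch below draws the two attention masks as plain nested lists; the function names are hypothetical, and real models apply such masks to attention scores inside every layer.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["I", "fished", "off", "the", "bank"]
styles = [
    ("BERT-style (bidirectional)", bidirectional_mask(len(tokens))),
    ("GPT-style (causal)", causal_mask(len(tokens))),
]
for name, mask in styles:
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {token:>6} sees {visible}")
```

With the causal mask, "fished" sees only "I" and itself, while with the bidirectional mask it sees the whole sentence, including "bank".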

    3. Transfer Learning & Fine-Tuning

    You almost never train an NLP model from zero. Instead you use transfer learning: take a model already pretrained on billions of words, then fine-tune it on your much smaller, task-specific dataset. The model already "knows" language; you only teach it your specific job.

    The recipe for fine-tuning BERT on a task like NER or sentiment is short:

    # Fine-tuning recipe (conceptual)
    # 1. Load a pretrained model      -> bert-base-uncased
    # 2. Add a small task head        -> e.g. a linear classification layer
    # 3. Train a few epochs (3-5)     -> on your labelled data only
    # 4. Use a tiny learning rate     -> 2e-5 with warmup, so you don't wipe
    #                                     out what the model already learned

    With a few thousand labelled examples you can reach accuracy that would have needed millions of examples to train from scratch. That is the power — and the economy — of transfer learning.
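Step 4 of the recipe, a tiny learning rate with warmup, is easier to picture as a function of the training step. The sketch below is one common shape, linear warmup followed by linear decay, similar in spirit to the `get_linear_schedule_with_warmup` helper in the transformers library. The `lr_at_step` function and the step counts are assumptions chosen for illustration.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100):
    """Learning rate at a given step: ramp up linearly, then decay linearly to 0.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        # Warmup: climb from 0 to peak_lr so early updates stay gentle.
        return peak_lr * step / warmup_steps
    # Decay: fall from peak_lr back to 0 by the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total))
```

The rate peaks at 2e-5 when warmup ends (step 100 here) and shrinks afterwards, which keeps the pretrained weights from being overwritten by large early updates.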

    4. NER, Question Answering & Summarization in One Line

    Three of the most common NLP jobs each have a ready-made pretrained model:

    • Named Entity Recognition (NER) — tag each word as a person, organisation, or location.
    • Question Answering (QA) — given a context and a question, return the exact answer span.
    • Summarization — condense a long passage into a short one.

    Hugging Face's pipeline() downloads and wires up the right model for you. Study the worked example below — each call is followed by its # Expected output.

    Worked Example: Hugging Face pipeline() — NER, QA, Summarization

    One line per task, with the expected output shown in comments

    Python
    # In real projects you do NOT build transformers from scratch.
    # Hugging Face's pipeline() loads a pretrained model in ONE line.
    # (This needs 'pip install transformers' + a download, so study the
    #  Expected output rather than running it in the browser sandbox.)
    
    from transformers import pipeline
    
    # 1) Named Entity Recognition — tag people, orgs, places
    ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
    print(ner("Tim Cook is the CEO of Apple in California."))
    # Expected output:
    # [{'entity_group': '
    ...
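It helps to see what "return the exact answer span" means mechanically. Extractive QA models typically score every token as a possible start and as a possible end of the answer, then pick the best-scoring valid pair. The sketch below shows only that selection step in plain Python; the `best_span` helper and all the scores are made up for illustration, not output from a real model.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end) indices maximizing start score + end score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Context tokens for the question "Who is the CEO of Apple?"
tokens = ["Tim", "Cook", "is", "the", "CEO", "of", "Apple"]

# Hypothetical per-token scores (assumptions, not real model logits).
start_scores = [4.0, 1.0, 0.1, 0.1, 0.5, 0.1, 1.5]
end_scores = [1.0, 4.5, 0.1, 0.1, 0.8, 0.1, 2.0]

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # Tim Cook
```

Because the answer is always a slice of the context, this style of QA cannot produce words that do not appear in the passage.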

    🎯 Your Turn 1: Build a Rule-Based NER Tagger

    Real NER uses BERT, but the idea is simple enough to do by hand: look each word up in a dictionary of known entities and label it. Fill in the one blank, then run it and check your output against the comment.

    🎯 Your Turn: Rule-Based NER

    Fill in the default label so unknown words are tagged 'O'

    Python
    # 🎯 YOUR TURN — a tiny Named Entity Recognition tagger
    # Real NER uses BERT, but the IDEA is: look each word up and label it.
    # Fill in the blanks marked with ___
    
    # A dictionary mapping known words to their entity TYPE
    known = {
        "Apple": "ORG",
        "Google": "ORG",
        "Paris": "LOC",
        "London": "LOC",
        "Alice": "PER",
        "Bob": "PER",
    }
    
    sentence = "Alice flew from London to Paris to visit Google"
    
    tags = []
    for word in sentence.split():
        # 👉 if the word is a key in 'known', us
    ...

    🎯 Your Turn 2: Semantic Search With Cosine Similarity

    This is how a search box "understands" a question: turn the query and each document into vectors, then rank documents by cosine similarity. Complete the cosine formula and confirm the password document scores higher than the weather one.

    🎯 Mini-Challenge: Keyword Sentiment Tagger

    Time to fade the scaffolding. Only a comment outline is provided — you write the logic. Build a tiny lexicon-based sentiment tagger that decides whether a review is positive, negative, or mixed.

    🎯 Mini-Challenge: Sentiment From a Lexicon

    Count positive vs negative words and print the overall verdict

    Python
    # 🎯 MINI-CHALLENGE: keyword sentiment tagger
    # 1. Make a dict 'lexicon' mapping words to "POS" or "NEG"
    #    e.g. {"great": "POS", "love": "POS", "terrible": "NEG", "hate": "NEG"}
    # 2. Split this review into words:
    #       review = "I love this great phone but hate the terrible battery"
    # 3. Count how many POS words and how many NEG words appear
    # 4. Print "Overall: POSITIVE" if pos > neg, "Overall: NEGATIVE" if neg > pos,
    #    otherwise "Overall: MIXED"
    #
    # ✅ Expected output (with the review a
    ...

    5. Common Errors (And How to Fix Them)

    ❌ Ignoring context

    Treating a word as one fixed meaning (old word2vec thinking). "Apple" the company and "apple" the fruit are different — a model that ignores context will confuse them.

    ✅ Fix: Use a transformer model whose embeddings are contextual; never average away the sentence.

    ❌ Wrong model for the task

    Using GPT for plain classification or NER. It works, but it is typically much slower and more expensive per prediction than a fine-tuned BERT, and it is often less accurate on a fixed set of labels.

    ✅ Fix: Encoder (BERT) for understanding, decoder (GPT) for generation, encoder-decoder (T5) for rewriting. Use the simplest model that solves the task.

    ❌ Exceeding the token limit

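    BERT accepts at most 512 tokens. Feed a long document straight in and, depending on your tokenizer and pipeline settings, it is either truncated without warning or rejected with an error. When it is truncated, the model never sees the end, so your answer or summary can be wrong.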

    ✅ Fix: Chunk long text into overlapping windows, or use a long-context model designed for it.
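Overlapping windows can be sketched in a few lines. The example below works on a plain list of tokens; the `chunk_tokens` helper and its default sizes are assumptions for illustration. In a real BERT setup you would also leave room for special tokens such as [CLS] and [SEP], so the usable window is slightly smaller than 512.

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split tokens into windows of up to max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Small numbers so the overlap is easy to see.
for chunk in chunk_tokens(list(range(10)), max_len=4, overlap=2):
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

The overlap means an entity or answer that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.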

    ❌ Poor fine-tuning data quality

    Fine-tuning on noisy, mislabelled, or imbalanced data. The model faithfully learns your mistakes — garbage in, garbage out — and scores look fine on a flawed test set.

    ✅ Fix: Clean and balance labels, hold out a trustworthy validation set, and prefer fewer correct examples over many noisy ones.
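Two cheap checks catch many data problems before any training starts: count the labels, and hold out a validation set that preserves the label mix. The sketch below does both in plain Python. The `stratified_split` helper and the tiny dataset are hypothetical, and a label with a single example would end up only in the validation set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(examples, val_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps roughly the same share.

    Every label contributes at least one example to the validation set.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # seeded so the split is repeatable
    train, val = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val

# Hypothetical, deliberately imbalanced dataset: 8 POS reviews, 2 NEG reviews.
examples = [(f"good review {i}", "POS") for i in range(8)]
examples += [(f"bad review {i}", "NEG") for i in range(2)]

print(Counter(label for _, label in examples))  # spot the imbalance first

train, val = stratified_split(examples)
print(Counter(label for _, label in train))
print(Counter(label for _, label in val))
```

If the first count is badly skewed, collecting or relabelling more minority-class examples usually helps more than tuning the model.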

    📋 Quick Reference: BERT vs GPT

    Feature       | BERT (Encoder)                      | GPT (Decoder)
    --------------|-------------------------------------|-------------------------------
    Attention     | Bidirectional (both sides)          | Left-to-right only
    Pretraining   | Masked LM (fill the blank)          | Next-token prediction
    Best at       | Understanding                       | Generation
    Typical tasks | Classification, NER, QA, similarity | Chat, writing, code, reasoning
    Sees          | All tokens at once                  | Past tokens only
    Example       | bert-base-uncased                   | GPT-4, LLaMA, Mistral

    Need to rewrite text (summarize/translate)? Use an encoder-decoder like T5 or BART — it has both.

    ❓ Frequently Asked Questions

    Q: What is a contextual embedding in NLP?

    A: It is a vector for a word or sentence whose values depend on the surrounding text. Unlike older fixed embeddings (word2vec, GloVe) where 'bank' always had one vector, a transformer gives 'bank' a different vector in 'river bank' versus 'savings bank', which is why modern NLP understands meaning, not just spelling.

    Q: What is the difference between BERT and GPT?

    A: BERT is an encoder trained with masked language modeling: it reads the whole sentence in both directions, so it is best for understanding tasks like classification, NER, and question answering. GPT is a decoder trained to predict the next token left-to-right, so it is best for generating text like chat and writing.

    Q: What is transfer learning and fine-tuning for NLP?

    A: Transfer learning means starting from a model already pretrained on huge amounts of text, then fine-tuning it on your smaller task-specific dataset. You add a small task head and train for a few epochs, which reaches high accuracy with far less data than training from scratch.

    Q: Which model should I use for summarization or translation?

    A: Use an encoder-decoder model such as T5 or BART. They read the entire input with an encoder and then generate new text with a decoder, which is exactly the shape of summarization and translation tasks.

    Q: Do I need to build transformers from scratch?

    A: No. For most tasks you call Hugging Face's pipeline() for instant NER, question answering, or summarization, and use the Trainer API (often with LoRA) to fine-tune. Building a transformer by hand is a learning exercise, not how production NLP is shipped.

    🎉

    Lesson 35 complete — you understand the NLP model landscape!

    You can explain contextual embeddings, distinguish BERT's encoder from GPT's decoder, describe transfer learning and fine-tuning, and run NER, question answering, and summarization with Hugging Face's pipeline(). You also built a rule-based NER tagger and a cosine-similarity search by hand.

    🚀 Up next: RAG Systems — combine these LLMs with a knowledge base so they answer from your documents, not just their training data.
