Lesson 27 • Advanced
Fine-Tuning LLMs: LoRA, QLoRA & PEFT
Fine-tune billion-parameter language models on your own data using parameter-efficient techniques that run on a single GPU.
✅ What You'll Learn
- LoRA: low-rank weight decomposition for efficient fine-tuning
- QLoRA: 4-bit quantization + LoRA for extreme memory savings
- Parameter reduction: train ~0.1% of weights, keep ~99% of quality
- Practical fine-tuning recipes for different model sizes
🔧 The Fine-Tuning Challenge
🎯 Real-World Analogy: Full fine-tuning a 70B model is like renovating an entire skyscraper just to change the lobby. LoRA is like installing a small, elegant reception desk in the existing lobby — same effect for visitors, fraction of the cost. QLoRA goes further: it compresses the entire skyscraper blueprint into a filing cabinet, then adds the reception desk.
Full fine-tuning of LLaMA-70B requires 280 GB of GPU memory just to hold the weights in float32 (70B parameters × 4 bytes), which means four 80 GB A100s before you even count gradients and optimizer states. LoRA cuts trainable parameters by roughly 1000×, and QLoRA shrinks the frozen weights by a further 8× on top of that.
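The arithmetic behind those numbers can be sanity-checked in a few lines (weights only; gradients and optimizer states add more on top):

```python
# Back-of-the-envelope memory math for a 70B-parameter model (weights only)
params = 70e9

fp32_gb = params * 4 / 1e9    # float32: 4 bytes per weight -> 280 GB
int4_gb = params * 0.5 / 1e9  # 4-bit: half a byte per weight -> 35 GB

# LoRA at rank 8 on a single 4096x4096 projection (illustrative sizes)
d, r = 4096, 8
trainable_fraction = (2 * d * r) / (d * d)

print(f"float32 weights: {fp32_gb:.0f} GB")
print(f"4-bit weights:   {int4_gb:.0f} GB")
print(f"LoRA trainable fraction: {trainable_fraction:.2%}")
```

The 8× weight shrink is exactly the float32-to-4-bit ratio (32 bits / 4 bits); the trainable fraction is per adapted matrix, so the whole-model ratio depends on how many layers get adapters.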
Try It: LoRA Adapters
See how low-rank decomposition reduces trainable parameters by 1000×
import numpy as np

# LoRA: Low-Rank Adaptation of Large Language Models
# Fine-tune LLMs by training tiny adapter matrices instead of all weights
np.random.seed(42)

def simulate_lora(original_dim, rank):
    """
    LoRA decomposes the weight update into two small matrices:
    delta_W = A @ B where A is (d, r) and B is (r, d)
    """
    d = original_dim
    r = rank
    # Original weight matrix (frozen during fine-tuning)
    W_original = np.random.randn(d, d) * 0.1
    # LoRA adapters: A starts small and random, B starts at zero,
    # so delta_W = A @ B is zero when fine-tuning begins
    A = np.random.randn(d, r) * 0.01
    B = np.zeros((r, d))
    delta_W = A @ B  # effective weights are W_original + delta_W
    full_params = d * d
    lora_params = d * r + r * d
    return full_params, lora_params

full_params, lora_params = simulate_lora(4096, 8)
print(f"Full fine-tuning: {full_params:,} trainable params")
print(f"LoRA (rank 8):    {lora_params:,} trainable params")
print(f"Reduction:        {full_params / lora_params:.0f}x")
Try It: QLoRA Quantization
Compress model weights to 4-bit — fit a 70B model on one GPU
import numpy as np

# QLoRA: Quantized LoRA — Fine-tune 70B models on a single GPU!
# Combines 4-bit quantization with LoRA adapters
np.random.seed(42)

def quantize_4bit(weights):
    """Simulate linear 4-bit quantization (real QLoRA uses NormalFloat4)"""
    # Map float32 values to 16 discrete levels (4 bits = 2^4)
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 15  # 16 levels: 0-15
    quantized = np.round((weights - w_min) / scale).astype(np.int8)
    return quantized, scale, w_min

def dequantize_4bit(quantized, scale, w_min):
    """Recover approximate float32 weights from the 4-bit codes"""
    return quantized.astype(np.float32) * scale + w_min

W = np.random.randn(512, 512).astype(np.float32)
q, scale, w_min = quantize_4bit(W)
W_restored = dequantize_4bit(q, scale, w_min)
print(f"Memory: {W.nbytes / 1e6:.2f} MB float32 -> {W.size * 0.5 / 1e6:.2f} MB 4-bit")
print(f"Mean absolute error: {np.abs(W - W_restored).mean():.4f}")
⚠️ Common Mistake: Setting LoRA rank too high. A rank of 8-16 works for most tasks; higher ranks (64+) waste memory without improving quality. Also apply LoRA to both the Q and V projection matrices; skipping one reduces quality significantly.
💡 Pro Tip: Use the Hugging Face peft library for LoRA and bitsandbytes for 4-bit quantization. A complete QLoRA fine-tuning script is ~30 lines with these libraries. Start with unsloth for 2× faster training speed.
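A minimal configuration sketch of that recipe (the model id is a placeholder; this assumes recent versions of transformers, peft, and bitsandbytes, plus a CUDA GPU, so treat it as a starting point rather than a runnable script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=8,                                    # rank 8-16 is enough for most tasks
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # adapt both Q and V projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # typically well under 1% trainable
```

From here, the model drops into a standard Trainer loop; only the adapter weights receive gradients.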
📋 Quick Reference
| Method | Memory Savings | Quality | Speed |
|---|---|---|---|
| Full Fine-Tuning | None | Best | Slow |
| LoRA (r=8) | ~10× less | ~98% of full | Fast |
| QLoRA (4-bit) | ~40× less | ~97% of full | Moderate |
| Prompt Tuning | ~100× less | ~90% of full | Fastest |
| Prefix Tuning | ~50× less | ~93% of full | Fast |
🎉 Lesson Complete!
You can now efficiently fine-tune LLMs on custom data. Next, explore Reinforcement Learning — training agents to make decisions through rewards!