Lesson 27 • Advanced
Fine-Tuning LLMs: LoRA, QLoRA & PEFT
Fine-tune billion-parameter language models on your own data using parameter-efficient techniques that run on a single GPU.
✅ What You'll Learn
- LoRA: low-rank weight decomposition for efficient fine-tuning
- QLoRA: 4-bit quantization + LoRA for extreme memory savings
- Parameter reduction: train ~0.1% of weights, keep ~99% of quality
- Practical fine-tuning recipes for different model sizes
🔧 The Fine-Tuning Challenge
🎯 Real-World Analogy: Full fine-tuning a 70B model is like renovating an entire skyscraper just to change the lobby. LoRA is like installing a small, elegant reception desk in the existing lobby — same effect for visitors, fraction of the cost. QLoRA goes further: it compresses the entire skyscraper blueprint into a filing cabinet, then adds the reception desk.
Full fine-tuning of LLaMA-70B requires 280 GB of GPU memory just to hold the weights in float32 (70B parameters × 4 bytes), which means four 80 GB A100s before you even count gradients and optimizer states. LoRA cuts trainable parameters by roughly 1000×, and QLoRA shrinks the frozen weights by a further 8× on top of that.
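The arithmetic behind those numbers can be sanity-checked in a few lines (weights only; gradients and optimizer states add more on top):

```python
# Back-of-the-envelope memory math for a 70B-parameter model (weights only)
params = 70e9

fp32_gb = params * 4 / 1e9    # float32: 4 bytes per weight -> 280 GB
int4_gb = params * 0.5 / 1e9  # 4-bit: half a byte per weight -> 35 GB

# LoRA at rank 8 on a single 4096x4096 projection (illustrative sizes)
d, r = 4096, 8
trainable_fraction = (2 * d * r) / (d * d)

print(f"float32 weights: {fp32_gb:.0f} GB")
print(f"4-bit weights:   {int4_gb:.0f} GB")
print(f"LoRA trainable fraction: {trainable_fraction:.2%}")
```

The 8× weight shrink is exactly the float32-to-4-bit ratio (32 bits / 4 bits); the trainable fraction is per adapted matrix, so the whole-model ratio depends on how many layers get adapters.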
Try It: LoRA Adapters
See how low-rank decomposition reduces trainable parameters by 1000×
import numpy as np

# LoRA: Low-Rank Adaptation of Large Language Models
# Fine-tune LLMs by training tiny adapter matrices instead of all weights
np.random.seed(42)

def simulate_lora(original_dim, rank):
    """
    LoRA decomposes the weight update into two small matrices:
    delta_W = A @ B where A is (d, r) and B is (r, d)
    """
    d = original_dim
    r = rank
    # Original weight matrix (frozen during fine-tuning)
    W_original = np.random.randn(d, d) * 0.1
    # LoRA adapters: A starts small and random, B starts at zero,
    # so delta_W = A @ B is zero when fine-tuning begins
    A = np.random.randn(d, r) * 0.01
    B = np.zeros((r, d))
    delta_W = A @ B  # effective weights are W_original + delta_W
    full_params = d * d
    lora_params = d * r + r * d
    return full_params, lora_params

full_params, lora_params = simulate_lora(4096, 8)
print(f"Full fine-tuning: {full_params:,} trainable params")
print(f"LoRA (rank 8):    {lora_params:,} trainable params")
print(f"Reduction:        {full_params / lora_params:.0f}x")
Try It: QLoRA Quantization
Compress model weights to 4-bit — fit a 70B model on one GPU
import numpy as np

# QLoRA: Quantized LoRA — Fine-tune 70B models on a single GPU!
# Combines 4-bit quantization with LoRA adapters
np.random.seed(42)

def quantize_4bit(weights):
    """Simulate linear 4-bit quantization (real QLoRA uses NormalFloat4)"""
    # Map float32 values to 16 discrete levels (4 bits = 2^4)
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 15  # 16 levels: 0-15
    quantized = np.round((weights - w_min) / scale).astype(np.int8)
    return quantized, scale, w_min

def dequantize_4bit(quantized, scale, w_min):
    """Recover approximate float32 weights from the 4-bit codes"""
    return quantized.astype(np.float32) * scale + w_min

W = np.random.randn(512, 512).astype(np.float32)
q, scale, w_min = quantize_4bit(W)
W_restored = dequantize_4bit(q, scale, w_min)
print(f"Memory: {W.nbytes / 1e6:.2f} MB float32 -> {W.size * 0.5 / 1e6:.2f} MB 4-bit")
print(f"Mean absolute error: {np.abs(W - W_restored).mean():.4f}")
⚠️ Common Mistake: Setting LoRA rank too high. A rank of 8-16 works for most tasks; higher ranks (64+) waste memory without improving quality. Also apply LoRA to both the Q and V projection matrices; skipping one reduces quality significantly.
💡 Pro Tip: Use the Hugging Face peft library for LoRA and bitsandbytes for 4-bit quantization. A complete QLoRA fine-tuning script is ~30 lines with these libraries. Start with unsloth for 2× faster training speed.
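A minimal configuration sketch of that recipe (the model id is a placeholder; this assumes recent versions of transformers, peft, and bitsandbytes, plus a CUDA GPU, so treat it as a starting point rather than a runnable script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=8,                                    # rank 8-16 is enough for most tasks
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # adapt both Q and V projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # typically well under 1% trainable
```

From here, the model drops into a standard Trainer loop; only the adapter weights receive gradients.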
📋 Quick Reference
| Method | Memory Savings | Quality | Speed |
|---|---|---|---|
| Full Fine-Tuning | None | Best | Slow |
| LoRA (r=8) | ~10× less | ~98% of full | Fast |
| QLoRA (4-bit) | ~40× less | ~97% of full | Moderate |
| Prompt Tuning | ~100× less | ~90% of full | Fastest |
| Prefix Tuning | ~50× less | ~93% of full | Fast |
🎉 Lesson Complete!
You can now efficiently fine-tune LLMs on custom data. Next, explore Reinforcement Learning — training agents to make decisions through rewards!