Lesson 27 • Advanced

    Fine-Tuning LLMs: LoRA, QLoRA & PEFT

    Fine-tune billion-parameter language models on your own data using parameter-efficient techniques that run on a single GPU.

    ✅ What You'll Learn

    • LoRA: low-rank weight decomposition for efficient fine-tuning
    • QLoRA: 4-bit quantization + LoRA for extreme memory savings
    • Parameter reduction: train ~0.1% of weights, keep ~99% of quality
    • Practical fine-tuning recipes for different model sizes

    🔧 The Fine-Tuning Challenge

    🎯 Real-World Analogy: Fully fine-tuning a 70B model is like renovating an entire skyscraper just to change the lobby. LoRA is like installing a small, elegant reception desk in the existing lobby — same effect for visitors, fraction of the cost. QLoRA goes further: it compresses the entire skyscraper blueprint into a filing cabinet, then adds the reception desk.

    Full fine-tuning of LLaMA-70B requires about 280 GB of GPU memory just to hold the weights in float32 (70B parameters × 4 bytes), roughly four 80 GB A100 GPUs before you store a single gradient or optimizer state. LoRA cuts trainable parameters by roughly 1000×, and QLoRA shrinks weight memory a further 8× by storing the frozen base model in 4-bit.
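    The memory figures above are simple arithmetic: bytes per parameter times parameter count. A quick sketch (weights only, ignoring activations, gradients, and optimizer state, which add much more):

```python
# Rough GPU memory needed just to store model weights
# at different numeric precisions.
PARAMS_70B = 70e9

def weight_memory_gb(n_params, bits_per_param):
    """Memory in GB to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(PARAMS_70B, bits):6.0f} GB")
```

    At 4 bits per weight, the 280 GB float32 footprint drops to 35 GB, which is why a quantized 70B model fits on a single 80 GB GPU.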

    Try It: LoRA Adapters

    See how low-rank decomposition cuts trainable parameters by two to three orders of magnitude

    Python
    import numpy as np
    
    # LoRA: Low-Rank Adaptation of Large Language Models
    # Fine-tune LLMs by training tiny adapter matrices instead of all weights
    
    np.random.seed(42)
    
    def simulate_lora(original_dim, rank):
        """
        LoRA decomposes weight update into two small matrices:
        delta_W = A @ B  where A is (d, r) and B is (r, d)
        """
        d = original_dim
        r = rank
        
        # Original weight matrix (frozen during fine-tuning)
        W_original = np.random.randn(d, d) * 0.1
        
        # LoRA adapters (the only trainable parameters)
        A = np.random.randn(d, r) * 0.01   # (d, r), small random init
        B = np.zeros((r, d))               # (r, d), zero init so delta_W starts at 0
        delta_W = A @ B                    # (d, d) low-rank update added to W_original
        
        full_params = d * d
        lora_params = d * r + r * d
        return full_params, lora_params
    
    full, lora = simulate_lora(original_dim=4096, rank=8)
    print(f"Full fine-tuning: {full:,} trainable params")
    print(f"LoRA (r=8):       {lora:,} trainable params")
    print(f"Reduction:        {full / lora:.0f}x fewer")

    Try It: QLoRA Quantization

    Compress model weights to 4-bit — fit a 70B model on one GPU

    Python
    import numpy as np
    
    # QLoRA: Quantized LoRA — Fine-tune 70B models on a single GPU!
    # Combines 4-bit quantization with LoRA adapters
    
    np.random.seed(42)
    
    def quantize_4bit(weights):
        """Simulate 4-bit quantization (NormalFloat4)"""
        # Map float32 values to 16 discrete levels (4 bits = 2^4)
        w_min, w_max = weights.min(), weights.max()
        scale = (w_max - w_min) / 15  # 16 levels: 0-15
        quantized = np.round((weights - w_min) / scale).astype(np.int8)
        return quantized, scale, w_min
    
    def dequantize_4bit(quantized, scale, w_min):
        """Recover approximate float32 weights from the 4-bit codes"""
        return quantized.astype(np.float32) * scale + w_min
    
    weights = np.random.randn(512, 512).astype(np.float32)
    quantized, scale, w_min = quantize_4bit(weights)
    recovered = dequantize_4bit(quantized, scale, w_min)
    
    print(f"Original:  {weights.nbytes / 1e6:.2f} MB (float32)")
    print(f"Quantized: {weights.size * 0.5 / 1e6:.2f} MB (4-bit)")
    print(f"Mean reconstruction error: {np.abs(weights - recovered).mean():.4f}")

    ⚠️ Common Mistake: Setting LoRA rank too high. Rank 8-16 works for most tasks. Higher ranks (64+) waste memory without improving quality. Also, apply LoRA to both Q and V projection matrices — skipping one reduces quality significantly.
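    To see why high ranks waste memory, count the adapter parameters as a function of rank. A sketch assuming LLaMA-7B-like dimensions (hidden size 4096, 32 layers; both numbers are illustrative):

```python
d = 4096           # hidden size (illustrative, LLaMA-7B-like)
n_layers = 32      # transformer layers (illustrative)
total_params = 7e9

# Each adapter adds two matrices: A (d x r) and B (r x d).
# LoRA on both Q and V projections = 2 adapters per layer.
for r in (8, 16, 64):
    lora_params = n_layers * 2 * (2 * d * r)
    print(f"r={r:2d}: {lora_params / 1e6:5.1f}M trainable "
          f"({lora_params / total_params:.3%} of model)")
```

    Adapter size grows linearly with rank, so r=64 trains 8× more parameters than r=8 while the task quality typically plateaus well before that.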

    💡 Pro Tip: Use the Hugging Face peft library for LoRA and bitsandbytes for 4-bit quantization. A complete QLoRA fine-tuning script is ~30 lines with these libraries. Start with unsloth for 2× faster training speed.
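    A hedged sketch of what that recipe looks like with peft, transformers, and bitsandbytes — the model id is just an example, and this needs a GPU plus the libraries installed, so treat it as a configuration outline rather than a tested script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "meta-llama/Llama-2-7b-hf" is an example model id, not a requirement
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the Q and V projections, rank 8
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```

    From here the model drops into a standard transformers Trainer loop; only the adapter weights receive gradients.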

    📋 Quick Reference

    Method             Memory Savings   Quality         Speed
    Full Fine-Tuning   None             Best            Slow
    LoRA (r=8)         ~10× less        ~98% of full    Fast
    QLoRA (4-bit)      ~40× less        ~97% of full    Moderate
    Prompt Tuning      ~100× less       ~90% of full    Fastest
    Prefix Tuning      ~50× less        ~93% of full    Fast

    🎉 Lesson Complete!

    You can now efficiently fine-tune LLMs on custom data. Next, explore Reinforcement Learning — training agents to make decisions through rewards!
