Learning Notes

Basic key words

What fine-tuning is: Post-training customization. You take a pre-trained model and adapt it to your specific task, data, or style.


Two Main Categories (by parameter usage)

1. Full Fine-Tuning

  • What: Update every weight in the model.
  • Pros: Maximum flexibility; model can fully adapt to new domain.
  • Cons: Needs massive VRAM (often multi-GPU) and large datasets; high risk of catastrophic forgetting (the model loses its original knowledge).
  • Use when:
    • You have 10k+ high-quality, task-specific examples.
    • You control a multi-GPU cluster.
    • Task is very different from pre-training (e.g., medical diagnosis on a general LLM).
  • Example: Fine-tuning BERT on a legal corpus for contract clause extraction.
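
A rough back-of-the-envelope estimate shows why full fine-tuning is so VRAM-hungry. The sketch below assumes mixed-precision training with the Adam optimizer and a commonly used accounting of about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus Adam's two moment buffers). It ignores activations, and the function name is made up for illustration.

```python
def full_ft_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights + gradients + optimizer state, in GB (1e9 bytes).

    Assumption: ~16 bytes per parameter (mixed precision + Adam); activations excluded.
    """
    return num_params * bytes_per_param / 1e9


estimate = full_ft_memory_gb(7e9)  # a 7B-parameter model
print(f"7B model, full fine-tuning: ~{estimate:.0f} GB before activations")
```

Under these assumptions a 7B model already exceeds any single consumer GPU, which is why the multi-GPU requirement shows up in the list above.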

2. Parameter-Efficient Fine-Tuning (PEFT)

  • What: Freeze the base model. Train only small, added components (adapters). 99%+ of weights stay frozen.
  • Pros: Runs on single GPU (even Colab), faster training, less forgetting, easy to swap adapters for different tasks.
  • Cons: Slightly lower ceiling than full FT for extremely complex domain shifts.
  • Use when:
    • You have <1k examples.
    • You're on a single GPU or limited budget.
    • You want to iterate fast or maintain multiple task-specific versions.
  • Subtypes: LoRA, QLoRA, Prefix Tuning, Adapters. LoRA/QLoRA are the current standard.
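
To make the "99%+ of weights stay frozen" claim concrete, here is a small arithmetic sketch. It assumes a LoRA-style low-rank adapter attached to a single 4096×4096 weight matrix with rank 8; the function name and the chosen sizes are illustrative, not taken from any specific model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one low-rank adapter pair (d_in x r and r x d_out)."""
    return rank * (d_in + d_out)


d = 4096
base = d * d                                # frozen weights in the original matrix
added = lora_param_count(d, d, rank=8)      # trainable adapter weights
fraction = added / base

print(f"frozen: {base:,}  trainable: {added:,}  fraction: {fraction:.4%}")
```

For this matrix the adapter is well under 1% of the frozen weights, which is where the single-GPU claim comes from.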

Common Fine-Tuning Methods (Task/Goal-Based)

🔹 SFT (Supervised Fine-Tuning)

  • How: Train on input-output pairs (e.g., prompt → ideal response). Model learns to mimic the pattern.
  • Works with: Full FT, LoRA, QLoRA.
  • Example:
    • Input: "Explain quantum entanglement to a 10-year-old"
    • Output: "Imagine two magic dice that always show the same number, no matter how far apart they are…"
  • Use case: Teaching your NPC AI to respond in a specific character voice or game lore style.
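
A minimal sketch of how one SFT training example is often laid out, assuming a toy whitespace tokenizer (hypothetical, a real project would use the model's own tokenizer). The prompt tokens get the label -100, the conventional "ignore this position" value in PyTorch-style cross-entropy, so only the response tokens contribute to the loss. The response text is shortened from the example above.

```python
IGNORE_INDEX = -100  # conventional label value meaning "do not compute loss here"


def build_vocab(texts):
    """Toy vocabulary: one id per whitespace-separated token (assumption for illustration)."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def make_sft_example(prompt, response, vocab):
    """Concatenate prompt + response; mask the prompt positions in the labels."""
    prompt_ids = [vocab[t] for t in prompt.split()]
    response_ids = [vocab[t] for t in response.split()]
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


prompt = "Explain quantum entanglement to a 10-year-old"
response = "Imagine two magic dice that always show the same number"
vocab = build_vocab([prompt, response])
example = make_sft_example(prompt, response, vocab)
print(example["labels"])
```

Masking the prompt is a design choice, not a requirement: some setups train on the full sequence instead.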

🔹 DPO / ORPO (Preference Optimization)

  • How: Show the model two responses to the same prompt, one preferred and one not. The model learns to favor the style/quality you want. No separate reward model is needed (unlike classic RLHF).
  • DPO: Optimizes directly on preference pairs, using a frozen copy of the starting model as a reference.
  • ORPO: Combines SFT + preference optimization in one stage with no reference model; more efficient.
  • Example:
    • Prompt: "How do I defeat the boss?"
    • Response A (good): "Try dodging left, then strike when his shield drops."
    • Response B (bad): "Just keep attacking."
    • Model learns to generate helpful, contextual hints.
  • Use case: Aligning NPC dialogue to be helpful, immersive, or on-brand without manual rule-writing.
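
The DPO objective for a single preference pair can be sketched in a few lines. This assumes you already have summed log-probabilities of each response under the model being trained (policy) and under a frozen reference copy; the numbers below are made up, and `beta` is a tunable strength parameter (0.1 is just a common starting value).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Policy has shifted probability toward the chosen response (illustrative values).
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)

print(loss_start, loss_better)
```

The loss starts at log 2 when policy and reference agree, and falls as the policy raises the preferred response relative to the rejected one.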

🔹 Distillation

  • How: Use a large, powerful model (teacher) to generate high-quality input-output pairs. Train a smaller model (student) on that data.
  • Goal: Compress capability into a faster, cheaper model.
  • Example: Use Llama-3-70B to generate 10k quest-dialogue pairs, then fine-tune a 1B model to run locally in your game engine.
  • Use case: Deploying smart NPCs on edge devices or low-end hardware.
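
The data-generation half of distillation can be sketched as a simple loop. Everything here is a placeholder: `fake_teacher` stands in for a call to a large teacher model, and the prompt/completion record layout is one common convention rather than a required format.

```python
import json


def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for querying a large teacher model."""
    return f"[teacher answer to: {prompt}]"


def build_distillation_set(prompts, teacher):
    """Ask the teacher for each prompt and keep non-empty answers as training pairs."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt).strip()
        if completion:  # drop empty generations
            records.append({"prompt": prompt, "completion": completion})
    return records


prompts = ["Where is the blacksmith?", "What does the old map show?"]
dataset = build_distillation_set(prompts, fake_teacher)

# One JSON object per line (JSONL), ready to feed into an SFT run for the student.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

In practice the filtering step usually does more work (deduplication, length limits, quality checks), since the student can only be as good as the data it sees.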

🔹 Reinforcement Learning Fine-Tuning (RLHF / RLAIF)

  • How: Model generates output → gets a reward score (from human or AI) → updates policy to maximize reward.
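  • Variants: GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization) score a group of sampled responses against each other instead of training a separate value model.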
  • Example: NPC gives a hint → player succeeds → positive reward. Player gets frustrated → negative reward. Model learns adaptive hinting.
  • Use case: NPCs that learn from player behavior over time (your emergent AI goal).
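
The group-relative idea behind GRPO can be sketched as follows: sample several responses to the same prompt, score each one, and turn the scores into advantages by normalizing within the group. This is a simplified illustration; the reward values are invented, and using the population standard deviation with a small `eps` is an assumption made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled hints for one prompt: 1.0 = player succeeded, 0.0 = player got stuck.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and are reinforced; those below it are pushed down, with no separate value network involved.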

LoRA & QLoRA: Deep Dive (The PEFT Workhorses)

🔸 LoRA (Low-Rank Adaptation)

  • Core idea: Instead of updating huge weight matrices (e.g., 4096×4096), inject small trainable matrices (rank r, e.g., 8 or 64) alongside them.
  • Math simplified:
    • Original: W (frozen, size d×k)
    • LoRA adds: W + ΔW, where ΔW = A × B (A is d×r, B is r×k; both are small and trainable, with r much smaller than d and k)
  • Why it works: Most adaptation lives in a low-dimensional subspace. You don't need to tweak every parameter.
  • VRAM impact: Roughly 70% less than full FT, depending on model and settings. A 7B model typically fits on a 24GB GPU.
  • When to use: Default choice for most fine-tuning tasks. Fast, stable, reversible.
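
A minimal pure-Python sketch of the LoRA forward pass, following the notes' convention ΔW = A × B. The tiny dimensions, the random initialization, and the helper names are assumptions for illustration; the α/r scaling factor and starting one factor at zero (so training begins from the unmodified base model) follow common LoRA practice.

```python
import random


def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


random.seed(0)
d_in, d_out, rank, alpha = 6, 4, 2, 4

# Frozen base weights (d_in x d_out).
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
# Trainable low-rank factors: A is d_in x r, B is r x d_out and starts at zero.
A = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
B = [[0.0] * d_out for _ in range(rank)]


def lora_forward(x, W, A, B, alpha, rank):
    """y = x W + (alpha / rank) * x A B, without ever forming the full delta matrix."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    s = alpha / rank
    return [[b + s * u for b, u in zip(brow, urow)] for brow, urow in zip(base, update)]


x = [[1.0] * d_in]                       # one input row
y0 = lora_forward(x, W, A, B, alpha, rank)
print(y0 == matmul(x, W))                # B is zero, so the adapter changes nothing yet
```

Only A and B would receive gradients during training; W is never touched, which is what makes the adapter small and reversible.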

🔸 QLoRA (Quantized LoRA)

  • Core idea: LoRA + 4-bit quantization of the base model.
  • How:
    1. Quantize base model weights to 4-bit (NF4 format) → shrinks model size ~4x versus 16-bit weights.
    2. Apply LoRA adapters on top.
    3. Use paged optimizers to handle memory spikes.
  • VRAM impact: A 7B model trains in roughly 12GB of VRAM. A 70B model can fit on a single 48GB GPU.
  • Trade-off: Tiny accuracy drop (<1%) for massive efficiency gain.
  • When to use:
    • You're on Colab, a laptop, or a single consumer GPU.
    • You want to experiment with large models (e.g., Llama-3-70B) without cloud costs.
    • Rapid prototyping for your NPC AI project.
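
To give a feel for what 4-bit quantization does to weights, here is a deliberately simplified sketch. It uses plain uniform absmax quantization over one block of values, which is NOT the NF4 format: NF4 uses non-uniform levels suited to normally distributed weights, and those levels are not reproduced here. The weight values and function names are made up.

```python
def quantize_block(values, levels=7):
    """Map floats to integer codes in [-levels, levels] using the block's absmax as scale.

    With levels=7 the codes fit in 4 bits. This is uniform quantization, not NF4.
    """
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / levels
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max round-trip error: {max_error:.4f}")
```

The base model is stored as small integer codes plus one scale per block, and the rounding error stays within half a quantization step. The LoRA adapters on top are kept in higher precision and are the only part that trains.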

Practical tip: Start with QLoRA + SFT.


Quick Decision Guide

Your Situation                           | Recommended Method
Single GPU, <1k examples                 | QLoRA + SFT
Want NPC to learn from player feedback   | QLoRA + DPO/ORPO
Deploying on low-end hardware            | Distillation → QLoRA
Massive domain shift + big compute       | Full FT + SFT
Iterating on multiple NPC personalities  | LoRA adapters (swap per character)

One Thing to Watch

PEFT methods (LoRA/QLoRA) are additive. They don't rewrite the base model; they layer behavior on top. That's a feature: you can keep the model's general knowledge and just inject your game-specific logic. But if your task fundamentally conflicts with pre-training (e.g., reversing logic), you may still need full FT.
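
The additive property is what makes per-character adapters cheap to swap. The toy sketch below keeps one frozen base matrix and a dictionary of low-rank adapter pairs; the 2×2 weights, character names, and helper functions are all hypothetical.

```python
def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merged(W, A, B):
    """Return W + A x B as a new matrix, leaving the base W untouched."""
    delta = matmul(A, B)
    return [[w + d for w, d in zip(wrow, drow)] for wrow, drow in zip(W, delta)]


# Frozen base weights shared by every character.
W = [[1.0, 0.0], [0.0, 1.0]]

# One small (A, B) pair per NPC personality.
adapters = {
    "blacksmith": ([[1.0], [0.0]], [[0.5, 0.0]]),
    "wizard": ([[0.0], [1.0]], [[0.0, -0.5]]),
}


def weights_for(character):
    """Effective weights for one character: base plus that character's adapter."""
    A, B = adapters[character]
    return merged(W, A, B)


print(weights_for("blacksmith"))
print(weights_for("wizard"))
print(W)  # base model is unchanged after both lookups
```

Because the base is never modified, switching characters is just a matter of picking a different adapter pair, and dropping the adapter restores the original model exactly.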