Interactive Paper Explainer

LoRA: Low-Rank Adaptation
of Large Language Models

A visual, research-accurate walkthrough of LoRA with low-rank math, optimization intuition, deployment tradeoffs, and benchmark-backed results from the original paper.

Start Learning Read Original Paper
14 sections · 10,000× param reduction · 0 added latency · 2021 LoRA paper
Learning Path: 🧠 Remember 💡 Understand 🔧 Apply 🔬 Analyze ⚖️ Evaluate 🚀 Create


🎯
Law 1 — Prediction Before Explanation

Before We Start — What Do You Think?

🔮 Your Prediction Challenge

GPT-3 has 175 billion parameters. Fine-tuning it means updating all 175B parameters — which costs $5M+ in compute and 1.2 TB of GPU RAM. If you had to update only a tiny fraction of those parameters to teach the model a new skill, how would you decide which ones to update?

Think about it like this: when you learn a new task at work, you don't rewire your entire brain — you build a small set of new habits on top of existing skills. LoRA does exactly this.

⚡ One-Line Summary (Law 3 — Compression)

LoRA = "Instead of rewriting the whole novel, just write a small side-story that gets added to every chapter — the main story stays unchanged."

175B: GPT-3 parameters total
4.7M: LoRA trainable params
10,000×: parameter reduction
3×: GPU memory saved (per the paper's abstract)
0: extra inference latency
r=1: minimum effective rank
😤
Chapter 01 — The Problem

Why Full Fine-Tuning Is Broken at Scale

🏗️ Analogy — The Building vs. The Renovation

Full Fine-Tuning = Demolish and rebuild the entire 175-floor skyscraper every time you want to add a new office. Insanely expensive, takes forever, and you need a separate building for each client.

LoRA = Install a small modular office pod (the BA matrix pair) into the existing building. Cheap, fast, and you can swap pods between clients while the building stays the same.

The Three Concrete Problems

Why Existing Solutions Don't Work Either

❌ Adapter Layers

  • Add bottleneck layers sequentially
  • Must run extra compute at inference
  • Up to 30% latency increase at batch size 1
  • Can't be merged with base weights
  • Worse in latency-sensitive deployments

❌ Prefix / Prompt Tuning

  • Uses part of sequence length for "soft tokens"
  • Reduces usable context for the actual task
  • Hard to optimize — non-monotonic performance
  • Collapses badly in low-data settings
  • Special tokens shift input distribution
💡
The Key Insight That Made LoRA Possible

Research (Aghajanyan et al., 2020; Li et al., 2018) showed that fine-tuned language models have a very low "intrinsic dimensionality": the useful changes to the model live in a tiny subspace of the full parameter space. LoRA exploits this directly.
📐
Chapter 02 — Math Foundation

Low-Rank Matrices — The Core Math Concept

🎨 Analogy — The Color Mixing Secret

Imagine a 1000×1000 painting with a million pixel colors. If the painting is a gradient sunset, you don't actually need to store all one million numbers — you can describe it as "column 1 = pure red, column 1000 = pure orange, blend between them." That's a rank-1 description of a million-number thing.

A low-rank matrix is one where the information can be compressed into the product of two much smaller matrices. LoRA exploits the fact that weight updates during fine-tuning are naturally low-rank.

What Does "Rank" Mean?

The rank of a matrix is the number of truly independent pieces of information it contains. A d×d weight matrix (e.g., 12288×12288 in GPT-3) has rank up to 12,288. But LoRA shows the update ΔW has effective rank of just 1–8 in practice!
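The "gradient sunset" claim from the analogy can be checked directly. This is an illustrative NumPy sketch — the painting, sizes, and variable names are invented for the demo, not taken from the paper's code:

```python
import numpy as np

# A "gradient sunset" painting as a rank-1 matrix: every column is a
# scaled copy of one pattern, so a million numbers compress to 2,000.
n = 1000
col = np.linspace(1.0, 0.0, n)     # one d-dimensional pattern
row = np.linspace(0.1, 1.0, n)     # per-column blend weights
painting = np.outer(col, row)      # full n x n matrix

print(painting.size)                      # 1000000 numbers stored naively
print(np.linalg.matrix_rank(painting))    # rank 1: two vectors suffice
```

Storing `col` and `row` (2n numbers) reconstructs the full matrix exactly — the same economy LoRA gets from the pair B and A.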

Figure: ΔW decomposed into B (d×r) and A (r×k). Since r ≪ d,k, the parameter count drops from d×k to r×(d+k)

The Parameter Savings — Concrete Numbers

// GPT-3 attention weight matrix example:
Full matrix ΔW: d × k = 12288 × 12288 = 150,994,944 numbers

// LoRA with rank r = 4:
B matrix: d × r = 12288 × 4 = 49,152 numbers
A matrix: r × k = 4 × 12288 = 49,152 numbers
Total LoRA:                   98,304 numbers

Savings: 150,994,944 / 98,304 ≈ 1,536× reduction per layer!
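The arithmetic above can be reproduced in a few lines (values match the worked example; the variable names are mine):

```python
# One GPT-3 attention matrix: dense update vs LoRA factorization at r = 4.
d = k = 12288
r = 4

full_params = d * k              # dense delta-W: 150,994,944
lora_params = d * r + r * k      # B (d x r) + A (r x k): 98,304

print(full_params, lora_params, full_params // lora_params)  # ratio: 1536
```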
🧠
Why Does This Work? The Intrinsic Rank Hypothesis

When a model fine-tunes on a new task, it doesn't need to change everything — it just needs to "emphasize" a small set of directions in weight space that are relevant to the task. These directions form a low-dimensional subspace. In practice, r = 1 to 8 captures enough of this subspace to match full fine-tuning.
⚙️
Chapter 03 — The Mechanism

How LoRA Actually Works — Step by Step

[Figure: input x feeds two parallel paths — the frozen pre-trained W₀ ∈ ℝᵈˣᵏ (🔒 no gradients, ∂L/∂W₀ = 0) and the trainable pair A ∈ ℝʳˣᵏ (init N(0,σ²)) followed by B ∈ ℝᵈˣʳ (init zeros, so ΔW = 0 at start). The LoRA path is scaled by α/r and the outputs are summed coordinate-wise: h = W₀x + (α/r)·BAx.]
Figure: LoRA architecture — W₀ is frozen, only A and B receive gradient updates
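A minimal NumPy sketch of the diagram above, with toy sizes and invented names — real implementations live in libraries such as HuggingFace PEFT:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 4            # toy sizes; GPT-3 uses d = k = 12288

W0 = rng.normal(size=(d, k))             # pre-trained weight: frozen, no gradients
A = rng.normal(scale=0.01, size=(r, k))  # trainable, random Gaussian init
B = np.zeros((d, r))                     # trainable, zero init, so delta-W = 0

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x  -- both paths run in parallel
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
# At initialization B = 0, so the adapted model matches the base model exactly.
assert np.allclose(lora_forward(x), W0 @ x)
```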

The 5-Step Training Process

🚀
The Brilliant Zero-Latency Trick

Since W₀ and BA are both d×k matrices, you can simply add them: W_merged = W₀ + (α/r)·BA. This merged matrix is used during inference — it's just one matrix multiplication, exactly like the original model. No adapter layers, no extra paths, zero overhead. This is what makes LoRA fundamentally different from adapters.
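The merge identity is easy to verify numerically. A NumPy sketch with made-up "trained" values (names and sizes are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 4
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))        # pretend training has made B non-zero

# Merge once at deployment: same d x k shape as the original weight.
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.normal(size=k)
two_path = W0 @ x + (alpha / r) * (B @ (A @ x))  # training-time forward
one_path = W_merged @ x                          # deployed forward
assert np.allclose(two_path, one_path)           # identical output, one matmul

# Task switching: subtract the old BA, then add another task's B'A'.
W_restored = W_merged - (alpha / r) * (B @ A)
assert np.allclose(W_restored, W0)
```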
🔢
Chapter 04 — Formulas

All Key Formulas — Complete Symbol Breakdown

Formula 1: Full Fine-Tuning Objective

// Maximize conditional language-model log-likelihood:
max_Φ  Σ_{(x,y)∈Z}  Σ_{t=1..|y|}  log P_Φ(yₜ | x, y₁:ₜ₋₁)

// Φ = ALL model parameters (175B in GPT-3)
// Updates: ΔΦ has the SAME SIZE as Φ₀ — hugely expensive
Symbol | Meaning | Problem
Φ | All model parameters | 175B params — too many
ΔΦ | Full parameter update | same number of entries as Φ₀!
Z | Task training dataset | —
P_Φ(yₜ given x, y₁:ₜ₋₁) | Probability of next token given context | —

Formula 2: LoRA Objective (Parameter-Efficient)

// Same objective, but ΔΦ is now encoded by a small Θ:
max_Θ  Σ_{(x,y)∈Z}  Σ_{t=1..|y|}  log P_{Φ₀+ΔΦ(Θ)}(yₜ | x, y₁:ₜ₋₁)

// |Θ| << |Φ₀|  (as small as 0.01% of the original!)
// Θ = {A₁, B₁, A₂, B₂, ...} for all LoRA modules
💡
Key insight: The task-specific increment ΔΦ(Θ) is encoded by a much smaller set Θ. For GPT-3: |Θ| can be as small as 0.01% of |Φ₀|. We optimize over Θ, not Φ.

Formula 3: The LoRA Forward Pass

// Standard forward pass:
h = W₀x                              // original path (frozen W₀)

// LoRA-modified forward pass:
h = W₀x + (α/r) · B·A·x              // LoRA path added

// Equivalently:
h = (W₀ + ΔW)x   where   ΔW = (α/r)·BA

// Dimensions:
x:  input vector ∈ ℝᵏ
W₀: pre-trained weight matrix ∈ ℝᵈˣᵏ (frozen)
A:  LoRA A matrix ∈ ℝʳˣᵏ (trainable)
B:  LoRA B matrix ∈ ℝᵈˣʳ (trainable)
r:  rank (hyperparameter), r << min(d, k)
α:  scaling constant (set to r, not tuned)
Symbol | Meaning | Initialized As
W₀ | Pre-trained weight matrix | Pre-trained values, frozen
A ∈ ℝʳˣᵏ | LoRA down-projection (compresses) | Random Gaussian N(0,σ²)
B ∈ ℝᵈˣʳ | LoRA up-projection (expands) | Zero matrix (so ΔW=0 at start)
r | Rank — the bottleneck dimension | Hyperparameter (1–64 typical)
α/r | Scaling factor for stability | Set α=r, ratio=1 initially
h | Output hidden state | —

Formula 4: Parameter Count Comparison

// Full fine-tuning (for one weight matrix):
|ΔΦ| = d × k                    // e.g. 12288 × 12288 ≈ 150M

// LoRA (for one weight matrix):
|Θ| = d×r + r×k = r×(d+k)       // e.g. 4×(12288+12288) ≈ 98K

// For L̂ weight matrices with LoRA applied (d = k = d_model):
|Θ| = 2 × L̂_LoRA × d_model × r

// GPT-3 example: Wq and Wv in all 96 layers → L̂ = 2×96 = 192 matrices
// r=1: |Θ| = 2 × 192 × 12288 × 1 ≈ 4.7M   // vs 175B total!
// r=8: |Θ| = 2 × 192 × 12288 × 8 ≈ 37.7M
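As a sanity check on the counts, assuming d = k = d_model and counting both B and A for every adapted matrix — the totals line up with the GPT-3 rows reported in Table 4 (4.7M and 37.7M):

```python
# |theta| = 2 * L_hat * d_model * r, with L_hat = number of adapted matrices.
n_layers, d_model = 96, 12288
L_hat = 2 * n_layers                  # Wq and Wv in every layer: 192 matrices

for r in (1, 8):
    theta = 2 * L_hat * d_model * r
    print(r, theta)   # r=1 -> 4718592 (~4.7M), r=8 -> 37748736 (~37.7M)
```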

Formula 5: Subspace Similarity (for analysis)

// Measures how similar the learned adaptation subspaces are:
φ(A_{r=8}, A_{r=64}, i, j) = ||U^{i⊤}_{A_{r=8}} · U^{j}_{A_{r=64}}||²_F / min(i, j)  ∈ [0, 1]

// φ = 1: complete overlap (same subspace learned)
// φ = 0: completely orthogonal (different subspaces)
// U^i_A: top-i right singular vectors of adaptation matrix A

// Key finding: r=8 and r=64 share their top direction (φ > 0.5)
// → confirms the intrinsic rank is very small (≈ 1)
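A small NumPy sketch of this similarity measure. The matrices here are random stand-ins with invented shapes — the paper computes φ on the actual learned adaptation matrices:

```python
import numpy as np

def subspace_similarity(A1, A2, i, j):
    """phi = ||V1_i^T V2_j||_F^2 / min(i, j), in [0, 1], where V_i holds the
    top-i right singular vectors (both live in the shared k-dim input space)."""
    V1 = np.linalg.svd(A1)[2].T       # columns = right singular vectors of A1
    V2 = np.linalg.svd(A2)[2].T
    overlap = V1[:, :i].T @ V2[:, :j]
    return np.linalg.norm(overlap, "fro") ** 2 / min(i, j)

rng = np.random.default_rng(0)
k = 64
A8 = rng.normal(size=(8, k))                       # stand-in for A at r=8
A64 = np.vstack([A8, rng.normal(size=(56, k))])    # stand-in for A at r=64

print(subspace_similarity(A8, A8, 4, 4))   # ~1.0: a subspace fully overlaps itself
print(subspace_similarity(A8, A64, 1, 1))  # somewhere in [0, 1]
```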
⚠️
Why This Formula Matters

This is how the paper proves that low rank works: by showing that r=8 and r=64 learn essentially the same top direction, it demonstrates the adaptation information truly lives in a roughly 1-dimensional subspace. This is not just a trick — it reflects something fundamental about fine-tuning.
🎯
Chapter 05 — Where to Apply LoRA

Which Transformer Weights to Adapt?

🏢 Analogy — Which Floors to Renovate?

A Transformer has many weight matrices (Wq, Wk, Wv, Wo in attention + MLP). LoRA can be applied to any of them. Given a fixed renovation budget, which floors give you the most value? The paper answers this empirically.

[Figure: Transformer layer weight matrices where LoRA can be applied. Self-attention: Wq ✅ best to adapt · Wk ○ less gain alone · Wv ✅ best to adapt · Wo ○ helpful combined. MLP (FFN W₁, W₂): 🔒 frozen in this paper (future work). Paper recommendation: adapt Wq + Wv together for the best quality/cost tradeoff.]

Ablation Results: Which Weights? (Table 5 from paper)

Adapted Weights | Rank r | WikiSQL Acc. | MultiNLI Acc. | Verdict
Wq only | 8 | 70.4% | 91.0% | Suboptimal
Wk only | 8 | 70.0% | 90.8% | Suboptimal
Wv only | 8 | 73.0% | 91.0% | Better
Wq + Wv | 4 | 73.7% | 91.3% | ✅ Best tradeoff
Wq + Wk + Wv + Wo | 2 | 73.7% | 91.7% | Good but more params
⚠️
Key Finding: Spread budget across more weight types, not deeper into one

Adapting Wq+Wv with r=4 beats adapting just Wq with r=8 (same parameter count). The message: it's better to adapt multiple weight types with lower rank than to go deep on one type.

What is the Optimal Rank r? (Table 6 from paper)

r | WikiSQL (Wq+Wv) | MultiNLI (Wq+Wv) | Finding
1 | 73.4% | 91.3% | Already competitive!
2 | 73.3% | 91.4% | —
4 | 73.7% | 91.3% | —
8 | 73.8% | 91.6% | Best overall
64 | 73.5% | 91.4% | Diminishing returns
🌟
Surprising Result: r=1 already works well!

A rank of just 1 (meaning A is a single row vector and B is a single column vector) achieves 73.4% on WikiSQL, very close to the best. This powerfully validates the intrinsic rank hypothesis: the meaningful change in weights lives in essentially one direction for these tasks.
⚖️
Chapter 06 — Comparison

LoRA vs Adapters vs Prefix Tuning

Feature | Full Fine-Tune | Adapter Layers | Prefix Tuning | LoRA
Trainable params | 100% (175B) | <1% | <1% | <1%
Inference latency | None extra | +20–30% at batch=1 | Sequence length lost | None extra (merged!)
Task switching | Full model reload | Swap adapters | Swap prefix | Swap A,B matrices
Storage per task | 350 GB | ~10 MB | ~10 MB | 35 MB (r=4)
Max sequence length | Full | Full | Reduced | Full
Low-data regime | OK | OK | Very poor | Best
Can merge into model | N/A | No | No | Yes!
Performance vs FT | Baseline | Usually worse | Usually worse | On-par or better

Inference Latency Proof — From the Paper

🔴
Adapter Latency is Real (Table 1)

At batch size 1 and sequence length 128, adapters add +20.7% latency (AdapterL) to +30.3% (AdapterH). In online services with real users, this directly translates to slower response times. LoRA has exactly 0% extra latency when weights are merged — making it production-safe.
📊
Chapter 07 — Experiments

Results — What Did LoRA Actually Achieve?

RoBERTa on GLUE (Table 2)

Model | Method | Trainable Params | Avg GLUE
RoBERTa-base | Full Fine-Tune | 125.0M | 86.4
RoBERTa-base | Adapter | 0.9M | 85.4
RoBERTa-base | LoRA | 0.3M | 87.2
RoBERTa-large | Full Fine-Tune | 355.0M | 88.9
RoBERTa-large | LoRA | 0.8M | 89.0
DeBERTa-XXL | Full Fine-Tune | 1500.0M | 91.1
DeBERTa-XXL | LoRA | 4.7M | 91.3
🏆
LoRA Beats Full Fine-Tuning!

On RoBERTa-base, LoRA with only 0.3M parameters outperforms full fine-tuning (125M params) by 0.8 GLUE points. On DeBERTa-XXL, LoRA (4.7M params) beats full fine-tuning (1500M params). Less is more.

GPT-3 175B (Table 4) — The Real Stress Test

Method | Trainable Params | WikiSQL | MNLI | SAMSum R1
Full Fine-Tune | 175,255M | 73.8 | 89.5 | 52.0
BitFit | 14.2M | 71.3 | 91.0 | 51.3
Prefix-Embed | 3.2M | 63.1 | 88.6 | 48.3
Prefix-Layer | 20.2M | 70.1 | 89.5 | 50.8
AdapterH | 40.1M | 73.2 | 91.5 | 53.2
LoRA | 4.7M | 73.4 | 91.7 | 53.8
LoRA | 37.7M | 74.0 | 91.6 | 53.4

LoRA with just 4.7M params beats AdapterH with 40.1M params on WikiSQL and SAMSum. With 37.7M params, LoRA outperforms full fine-tuning on WikiSQL (74.0 vs 73.8).

[Chart: GPT-3 WikiSQL accuracy vs. trainable parameters — Full FT 73.8 (175B) · Prefix-Embed 63.1 (3.2M) · AdapterH 73.2 (40.1M) · LoRA 73.4 (4.7M) ✅ · LoRA 74.0 (37.7M) 🏆]
LoRA (4.7M params) matches AdapterH (40.1M params) while using 8.5× fewer parameters

Low-Data Regime (Table 16)

One of the most striking findings: LoRA works well even with very few examples.

Method | MNLI-100 (100 examples) | MNLI-1k | MNLI-392K (full)
Full Fine-Tune | 60.2% | 85.8% | 89.5%
Prefix-Embed | 37.6% (near random!) | 75.2% | 88.6%
Prefix-Layer | 48.3% | 82.5% | 89.6%
LoRA | 63.8% | 85.6% | 91.7%
🔴
Prefix Tuning Catastrophically Fails at 100 Examples!

Prefix-Embed (37.6%) barely beats random chance (33.3%) with only 100 examples: that little data is not enough to optimize the soft prefix tokens reliably. LoRA (63.8%) not only beats prefix methods — it beats full fine-tuning (60.2%) in this extreme low-data setting.
🔬
Chapter 08 — Analysis

Understanding the Low-Rank Updates — What LoRA Learns

🧪
The paper asks and answers 3 deep questions: (1) Which weights to adapt? (2) How small can r be? (3) What is the relationship between ΔW and W?

Question 3: How Does ΔW Relate to W? (Table 7)

The paper projects W onto the subspace of ΔW and measures the Frobenius norm. Here's what they found for r=4:

Projection direction | ‖UᵀWVᵀ‖_F | Interpretation
Top directions of ΔW | 0.32 | W has very little in ΔW's direction
Top directions of W (itself) | 21.67 | Most of W lives in its own top directions
Random directions | 0.02 | Baseline noise
‖W‖_F | 61.95 | Full weight magnitude
‖ΔW‖_F (r=4) | 6.91 | ΔW magnitude
🔬 Three Profound Conclusions
  • ΔW amplifies W's under-emphasized features. ΔW doesn't point in the same direction as W's top singular vectors — it finds the directions W "knows" but downplayed in pre-training.
  • The amplification factor is huge. 6.91 / 0.32 ≈ 21.5× — LoRA takes a barely-used direction in W and amplifies it 21× for the downstream task.
  • LoRA amplifies task-specific knowledge stored in W. The pre-training encoded the knowledge; LoRA just turns up the volume on the task-relevant subset.
🎵 Analogy — The EQ Knob

Think of pre-trained W as a complete audio recording with all frequencies. For a specific task (say, classical music classification), you need to boost specific frequency ranges (say, violin harmonics) that are present but quiet in the recording. LoRA is the equalizer — it doesn't add new frequencies, it amplifies the ones already there by a factor of 21×.

⚠️
Chapter 09 — Failure Modes (Law 2)

When LoRA Fails — Limitations You Must Know

🌍
Chapter 10 — The Big Picture

Why LoRA Changed Everything — The Legacy

⚡ The 3-Line Summary (Law 3 — Compression)

1. Fine-tuning giant models is prohibitively expensive — storage, memory, compute.
2. Weight updates during fine-tuning are inherently low-rank — they live in a tiny subspace.
3. LoRA: freeze W₀, train two small matrices A,B, add BA to W₀ — match fine-tuning at 0.01% cost.

LoRA's Impact on Modern AI (2021 → Today)

🚀
LoRA is the most widely used PEFT method in production AI today

Stable Diffusion fine-tuning, custom ChatGPT models, LLaMA adaptations, Mistral fine-tunes — virtually all consumer fine-tuning of LLMs uses LoRA or its variants (QLoRA, AdaLoRA, LoRA+). This paper democratized LLM fine-tuning from a $5M project to something a student can run on a consumer GPU.

The LoRA Family — What Came After

Method | Key Innovation | Year
LoRA (this paper) | Low-rank weight update decomposition | 2021
QLoRA | 4-bit quantization + LoRA → fine-tune 65B on 1 GPU | 2023
AdaLoRA | Adaptive rank allocation (different r per layer) | 2023
LoftQ | Quantization-aware LoRA initialization | 2023
LoRA+ | Different learning rates for A and B | 2024
DoRA | Weight decomposition into magnitude + direction | 2024

The Fundamental Principle (First-Principles Thinking)

🧠 First Principles

Ask: What does fine-tuning actually need to accomplish?

It needs to shift the model's behavior from "general" to "task-specific." Research shows this behavioral shift lives in a low-dimensional subspace of the full parameter space. So instead of moving all 175B parameters slightly, move a few directions dramatically. LoRA operationalizes this insight with elegant, deployable engineering.

This is why LoRA works: it's not a hack — it's mathematically grounded in how language model adaptation actually works.

🎤
Chapter 11 — Interview Prep

Top Interview Questions & Model Answers

1
Explain LoRA in simple terms. What problem does it solve?
LoRA (Low-Rank Adaptation) solves the prohibitive cost of fine-tuning large language models. Instead of updating all model parameters, LoRA freezes the pre-trained weights W₀ and adds two small trainable matrices A and B, where ΔW = BA. Since rank r ≪ min(d,k), the parameter count drops from d×k to r×(d+k) — reducing trainable parameters by up to 10,000× for GPT-3. At deployment, BA is merged into W₀, so there's zero extra inference cost.
2
Why is A initialized randomly and B initialized to zero?
The initialization ensures ΔW = BA = 0 at the start of training. This means the model begins fine-tuning with exactly the same behavior as the pre-trained model — a warm, stable starting point. If both A and B were random, ΔW would be a random non-zero matrix, destabilizing training from the first step. A random, B=0 achieves the mathematically necessary zero initialization while breaking symmetry in A for non-trivial gradient flow.
3
What is the scaling factor α/r and why is it needed?
The update ΔW is scaled by α/r before adding to W₀. When using Adam optimizer, tuning α is equivalent to tuning the learning rate with appropriate initialization scaling. The authors set α equal to the first r tried (e.g., α=r means ratio=1) and don't tune it further. This is important because without scaling, changing r would affect the effective learning rate for the update, requiring re-tuning. With α/r, different rank choices have comparable gradient scales.
4
How does LoRA achieve zero inference latency?
After training, the LoRA matrices can be merged: W_merged = W₀ + (α/r)·BA. Since W₀ and BA have the same shape (d×k), their sum is a standard d×k matrix. During inference, only W_merged is needed — no separate A or B computation, no adapter paths, no extra memory. The computation is identical to a regular forward pass through W₀. To switch tasks, subtract the old BA and add the new B'A', which is fast and cheap.
5
Why does adapting Wq and Wv together outperform adapting just Wq with higher rank?
The paper shows that given a fixed parameter budget, spreading LoRA across multiple weight types (Wq + Wv, each with r=4) outperforms deeper adaptation of one type (Wq with r=8). This is because different attention weight matrices capture different aspects of language understanding — Wq captures query patterns while Wv captures value content. Both are important for downstream tasks, and adapting both with lower rank covers more of the needed adaptation directions than going deeper in one.
6
What does the paper's subspace analysis reveal about intrinsic rank?
The paper computes singular value decomposition of adaptation matrices learned with r=8 and r=64 separately. Using the Grassmann distance metric, they show that the top singular direction of Ar=8 and Ar=64 have normalized subspace similarity > 0.5 — meaning they essentially learn the same primary direction. This confirms that the adaptation lives in a 1-dimensional subspace, explaining why r=1 works competitively and why increasing r beyond 8 yields diminishing returns: the extra dimensions capture noise, not signal.
7
Compare LoRA to adapter layers. Which would you use and when?
LoRA is strictly better for latency-sensitive production deployments because it introduces zero inference overhead (weights are merged). Adapters introduce 20-30% latency at batch_size=1. However, adapters can be more flexible for multi-task serving if you can't merge (you can swap adapters without recomputing merged weights). Choose LoRA when: (a) inference latency matters, (b) you can merge weights at deployment, (c) you want fewer parameters. Choose adapters when: (a) you need dynamic task-switching without weight recomputation, or (b) merging weights is operationally difficult.
📝
Chapter 12 — Practice

Exercises — From Easy to Hard

🟢 Easy — Remember

Easy 1
In LoRA, what are the shapes of matrices A and B if the pre-trained weight matrix W₀ has shape d×k and the rank is r?
A has shape r×k and B has shape d×r. When multiplied, B·A gives a d×k matrix — the same shape as W₀ — so it can be directly added. The parameter count is r×k + d×r = r×(d+k), which is much smaller than d×k when r ≪ min(d,k).
Easy 2
Calculate the parameter savings: a weight matrix W₀ has d=4096, k=4096. Full update vs LoRA with r=8. How many parameters each?
Full ΔW: d×k = 4096×4096 = 16,777,216 parameters. LoRA: r×(d+k) = 8×(4096+4096) = 8×8192 = 65,536 parameters. Ratio: 16,777,216 / 65,536 = 256× reduction. For GPT-3's 12288×12288 matrices, this becomes 1,536× per matrix.

🟡 Medium — Apply & Analyze

Medium 1
Why does LoRA set B=0 at initialization but initialize A with random Gaussian values? What would happen if both A and B were initialized to zero?
We need BA=0 at initialization so the model starts identical to the pretrained model. With B=0, BA=0 regardless of A. A is initialized randomly to break symmetry: the gradient for A has the form ∂L/∂A = Bᵀ·(upstream gradient)·xᵀ, which vanishes while B=0. If both A and B started at zero, the gradient for B (which involves A) would vanish too, so neither matrix would ever learn. With A random and B=0, B receives non-zero gradients from the first step, and once B moves away from zero, A starts learning as well.
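The symmetry-breaking argument can be checked with explicit gradients of the LoRA path on a toy squared-error loss. All names, sizes, and the loss itself are invented for this demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2
x = rng.normal(size=k)
t = rng.normal(size=d)                 # toy regression target
A = rng.normal(size=(r, k))            # random Gaussian init
B = np.zeros((d, r))                   # zero init

# L = ||B A x - t||^2; closed-form gradients of this loss:
resid = B @ A @ x - t
grad_B = 2 * np.outer(resid, A @ x)    # dL/dB = 2 * resid * (Ax)^T
grad_A = 2 * np.outer(B.T @ resid, x)  # dL/dA = 2 * B^T resid * x^T

print(np.allclose(grad_A, 0))   # True: A sees no gradient while B = 0
print(np.allclose(grad_B, 0))   # False: B trains immediately because A != 0
```

If A were also zero, `A @ x` would vanish and grad_B would be zero too — exactly the stuck state the asymmetric initialization avoids.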
Medium 2
You're deploying a customer service bot using GPT-3 + LoRA for 1000 different companies. Estimate the storage savings compared to full fine-tuning (assume 350 GB per full model, 35 MB per LoRA checkpoint).
Full fine-tuning: 1000 × 350 GB = 350,000 GB = 350 TB. LoRA: 350 GB (shared base) + 1000 × 35 MB = 350 GB + 35 GB ≈ 385 GB. Savings: 350,000 / 385 ≈ 909× less storage. This is the difference between a massive data center investment and fitting everything on a single high-end server — a fundamental economic shift in how LLM services can be operated.
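The same estimate as a few lines of arithmetic (350 GB and 35 MB are the figures assumed in the exercise):

```python
# Serving 1000 tasks: per-task full checkpoints vs one base + LoRA deltas.
n_tasks = 1000
full_model_gb = 350.0        # one full GPT-3 checkpoint
lora_ckpt_gb = 0.035         # 35 MB LoRA checkpoint (r = 4)

full_total = n_tasks * full_model_gb                   # 350,000 GB
lora_total = full_model_gb + n_tasks * lora_ckpt_gb    # ~385 GB
print(full_total, lora_total, round(full_total / lora_total))  # ~909x savings
```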

🔴 Hard — Evaluate & Create

Hard 1
The paper claims ΔW "amplifies directions not emphasized in W." Use the data from Table 7 (r=4: ||U^T Wq V^T||_F = 0.32, ||Wq||_F = 61.95, ||ΔWq||_F = 6.91) to quantify the amplification factor and explain what it means conceptually.
Amplification factor = ||ΔW||_F / ||U^T W V^T||_F = 6.91 / 0.32 ≈ 21.5×. Interpretation: The directions in weight space that ΔW modifies are barely present in W₀ (contributing only 0.32 to W's total 61.95 Frobenius norm — about 0.5%). Yet ΔW, with magnitude 6.91, amplifies these directions by 21.5×. This means LoRA is selectively boosting "task-specific" features that pre-training encoded but didn't emphasize, rather than changing what the model fundamentally knows. LoRA doesn't teach new information — it re-weights existing, underemphasized knowledge for the downstream task.
Hard 2
Design a variant of LoRA that assigns different ranks to different layers (shallow vs deep) based on the complexity of the adaptation needed. How would you decide rank allocation, and what are the tradeoffs?
This is essentially AdaLoRA (2023). Approach: (1) Start with a budget B_total = total trainable parameters allowed. (2) Initialize all layers with a high rank (r_max). (3) During training, estimate the importance of each singular value direction using the sensitivity of the loss (|∂L/∂σ_i| × σ_i). (4) Prune the least important singular directions across all layers, redistributing budget to important layers. Key insight: early layers (syntax, morphology) typically need lower rank than later layers (task semantics). Tradeoffs: more complex to implement and tune, extra compute during training for importance estimation, but can achieve better performance per parameter than uniform-rank LoRA. Reference: Zhang et al., AdaLoRA, 2023.
🔗
Chapter 13 — Resources

Further Reading & Study Resources

🗺️
Suggested Study Path

(1) Read this guide → (2) Read the original paper abstract + Section 4 → (3) Run the HuggingFace PEFT LoRA tutorial on a small model → (4) Read QLoRA to see how LoRA enables 4-bit training → (5) Try fine-tuning LLaMA-7B with LoRA on Colab → (6) Read AdaLoRA for the advanced adaptive version.

Based on: arXiv:2106.09685 · Hu et al. · Microsoft Corporation · ICLR 2022

Designed for deep understanding, not memorization. Always read the original paper for complete details.