Break 00: Set LoRA `rank = 0` (degenerate update space)

We set rank = 0 in LoRALinear. The factorization B A with B ∈ ℝ^(out × 0) and A ∈ ℝ^(0 × in) gives an empty product that is mathematically the zero matrix. The optimizer has no gradient to adjust. The loss stays exactly where it was. The prediction is trivial; the lesson is why the update space must have dimension at least 1.

This /break exercise targets the rank ≥ 1 constraint in LoRA. The bug is one number; the failure mode is zero learning, which is louder than any garbled output.

Anchors: theory/02-parameter-count.md, theory/05-lora-on-mini-gpt-exact-count.md, .claude/commands/break.md.

Hypothesis

The learner predicts: "Setting rank = 0 makes B shape (out, 0) and A shape (0, in). The product B @ A is a (out, in) zero matrix. ΔW = (α/r) · B @ A is undefined (r = 0 divides by zero) — but if the code happens to short-circuit on empty matmul, ΔW = 0 and the model output is exactly the frozen base. Loss does not move. Accuracy does not improve. The optimizer has zero parameters to update."

The break

In src/minimodel/peft/lora.py:

 class LoRALinear(Module):
-    def __init__(self, base: Linear, rank: int = 8, alpha: float = 16.0):
+    def __init__(self, base: Linear, rank: int = 0, alpha: float = 16.0):  # /break: degenerate
         super().__init__()
         self.base = base
         self.base.weight.requires_grad = False
         if self.base.bias is not None:
             self.base.bias.requires_grad = False
         self.rank = rank
         self.alpha = alpha
         out_f, in_f = base.weight.shape
-        self.A = Parameter(torch.empty(rank, in_f).normal_(std=1 / rank ** 0.5))
+        # /break: rank = 0 forces both A and B to empty tensors.
+        self.A = Parameter(torch.empty(rank, in_f).normal_())
         self.B = Parameter(torch.zeros(out_f, rank))

One number: rank: 8 → 0. Note that the 1 / rank ** 0.5 initialization scale raises ZeroDivisionError at rank = 0 (0 ** 0.5 is 0.0, and Python float division by zero raises), so with only the default changed, the first symptom Borja would see is a crash on construction. The diff above also drops that scale, which softens the crash and makes the bug observable through training instead.
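The construction-time crash can be reproduced without PyTorch at all. The sketch below isolates the scale expression from the original initializer in a helper named `init_std`, which is a hypothetical name used only for this illustration:

```python
def init_std(rank: int) -> float:
    """Std used by the original initializer, 1 / sqrt(rank). Hypothetical helper."""
    return 1 / rank ** 0.5


print(init_std(8))  # the scale used at the default rank

try:
    init_std(0)
    crashed = False
except ZeroDivisionError as err:
    # 0 ** 0.5 evaluates to 0.0, and dividing a Python number by 0.0 raises.
    crashed = True
    print(f"rank = 0 -> ZeroDivisionError: {err}")
```

Because the argument to `normal_(std=...)` is evaluated before the call, the exception fires before any tensor is filled.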

Predict, then run

For Mini-GPT at rank = 0:

A has shape (0, in) — empty in dim 0; legal but degenerate.
B has shape (out, 0) — empty in dim 1; legal but degenerate.
B @ A has shape (out, in) and value 0 (empty matmul). NumPy / PyTorch both return a zero matrix for this case.
ΔW = (α/r) · B @ A: computing α / 0 in plain Python raises ZeroDivisionError. If the code instead divides the tensor, as in α * (B @ A) / r, every entry becomes 0 / 0, which is nan, and no exception is raised. If the code wraps with if rank == 0: skip, the model behaves like the frozen base.
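The shape and scaling cases above can be checked directly. This is a minimal sketch with toy sizes (4 × 3, not Mini-GPT's real dimensions), assuming a PyTorch install:

```python
import torch

out_f, in_f, rank, alpha = 4, 3, 0, 16.0  # toy sizes, not Mini-GPT's

A = torch.empty(rank, in_f).normal_()  # shape (0, in_f)
B = torch.zeros(out_f, rank)           # shape (out_f, 0)

delta = B @ A                          # empty sum over the rank axis
print(delta.shape, delta.abs().max())

# Tensor-side scaling: each entry is 0.0 / 0, which is nan; nothing raises.
naive = alpha * delta / rank
print(naive.isnan().all())

# Python-side scaling: alpha / rank raises before any tensor math happens.
try:
    scale = alpha / rank
    python_raised = False
except ZeroDivisionError:
    python_raised = True
print(python_raised)
```

Which of the two scaling behaviours you hit depends only on where the division by r is written, which is why the same bug can surface as either a crash or a silent NaN.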

Predictions

Training loss: doesn't move (or jumps to NaN if scaling is naïvely applied). The adapter contributes 0 trainable elements.
Eval accuracy on §A13 irregular verbs: equal to the frozen base model's accuracy (probably ~60-70%, before fine-tuning).
sum(p.numel() for p in LoRALinear.parameters() if p.requires_grad): 0. The optimizer step is a no-op.
torch.optim.AdamW: raises ValueError("optimizer got an empty parameter list") only if the list it receives is actually empty, for example when the training script filters out zero-element tensors. A and B still exist as zero-element Parameters, so a list containing them is accepted and every step is a no-op. Either way the signal is clean: the optimizer has nothing to optimize.

Write your predictions in learners/borja/phase-28/notes/breaks.md before running.

Observe

Run the LoRA fine-tune with the broken config:

just exp 28-lora --rank 0

Diagnostics:

Constructor: should print LoRALinear(rank=0, trainable=0). If it doesn't, the numel() count is misreporting.
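Optimizer construction: raises ValueError("optimizer got an empty parameter list") from PyTorch only if the parameter list passed in is empty (for example, after filtering out zero-element tensors). If A and B are passed in as zero-element Parameters, construction succeeds silently.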
If you trick the optimizer with a dummy parameter and let training proceed: loss curve is flat (within float noise). Accuracy on the irregular-verbs eval set matches the frozen base.
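Which outcome you get at optimizer construction depends on what list actually reaches the optimizer. A minimal sketch, assuming the adapter tensors are plain `torch.nn.Parameter` objects with toy shapes; the `numel() > 0` filter is a hypothetical stand-in for whatever the training script does:

```python
import torch
from torch.nn import Parameter

A = Parameter(torch.empty(0, 3).normal_())  # rank-0 adapter, toy shapes
B = Parameter(torch.zeros(4, 0))

trainable = sum(p.numel() for p in (A, B) if p.requires_grad)
print(trainable)  # zero elements, but still two Parameter objects

# A list holding zero-element Parameters is not an empty list.
opt = torch.optim.AdamW([A, B], lr=1e-3)
print(len(opt.param_groups[0]["params"]))

# A script that filters by element count ends up with nothing to pass on.
filtered = [p for p in (A, B) if p.numel() > 0]

try:
    torch.optim.AdamW(filtered, lr=1e-3)
    empty_list_raised = False
except ValueError as err:
    empty_list_raised = True
    print(err)
```

If the run reaches training without an error, the flat loss curve is the symptom to look for instead.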

Symptom Borja will see

Either a ValueError on optimizer construction (if the script ends up passing an empty parameter list), or
A completely flat loss curve, or
An accuracy number identical to the no-fine-tune baseline.
Whichever variant the code produces, the bug is observable within the first epoch — no need to train to convergence.

Hidden cause (one sentence)

The LoRA decomposition ΔW = BA with B ∈ ℝ^(out × 0) and A ∈ ℝ^(0 × in) represents the empty update space — the only element of rank-0 matrices is the zero matrix — so no fine-tuning signal can flow through.
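To see "no signal can flow through" concretely, the sketch below adds a rank-0 adapter path to a frozen `torch.nn.Linear` and runs one backward pass. The forward expression and the squared-output loss are assumptions made for illustration, and the α / r factor is deliberately left out so the example isolates the empty product:

```python
import torch
from torch import nn

torch.manual_seed(0)
base = nn.Linear(3, 4)
for p in base.parameters():
    p.requires_grad = False  # frozen base, as in LoRALinear

A = nn.Parameter(torch.empty(0, 3).normal_())  # (0, in)
B = nn.Parameter(torch.zeros(4, 0))            # (out, 0)

x = torch.randn(5, 3)
y_base = base(x)
y_lora = base(x) + x @ (B @ A).T  # unscaled adapter path (assumed form)

loss = y_lora.pow(2).mean()       # arbitrary loss, just to get gradients
loss.backward()

print(torch.equal(y_lora, y_base))  # adapter adds exactly zero
print(A.grad.shape, B.grad.shape)   # gradients exist but hold no elements
```

The gradients are well-defined tensors with zero elements, so there is simply nothing for an optimizer step to change.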

Hint cascade

Print sum(p.numel() for p in self.parameters() if p.requires_grad) for your LoRALinear. Is it zero? Should it be?
What does B @ A produce when B.shape == (out, 0) and A.shape == (0, in)? Print it. Print (B @ A).abs().max().
Re-read theory/05-lora-on-mini-gpt-exact-count.md §"ΔW = BA, written out". What is the smallest meaningful rank?

Fix diff

 class LoRALinear(Module):
-    def __init__(self, base: Linear, rank: int = 0, alpha: float = 16.0):
+    def __init__(self, base: Linear, rank: int = 8, alpha: float = 16.0):
+        if rank < 1:
+            raise ValueError(f"LoRA rank must be ≥ 1; got rank={rank}")
         super().__init__()
         ...
-        self.A = Parameter(torch.empty(rank, in_f).normal_())
+        self.A = Parameter(torch.empty(rank, in_f).normal_(std=1 / rank ** 0.5))

Restore rank to the default of 8, and add a guard against rank < 1 so the failure becomes loud at construction (which is the right place to fail — Phase 9's "raise loud" principle).
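Assembled into a standalone class, the guarded constructor might look like the sketch below. It assumes `Module`, `Linear`, and `Parameter` behave like their `torch.nn` counterparts, and the `forward` method is an assumption, since the diff does not show it:

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Sketch of the guarded adapter; forward() is assumed, not from the diff."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        if rank < 1:
            raise ValueError(f"LoRA rank must be >= 1; got rank={rank}")
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        self.rank = rank
        self.alpha = alpha
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.empty(rank, in_f).normal_(std=1 / rank ** 0.5))
        self.B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed form: frozen base output plus the (alpha / r)-scaled low-rank path.
        return self.base(x) + (self.alpha / self.rank) * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(16, 32), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8 * 16 + 32 * 8 = 384

try:
    LoRALinear(nn.Linear(16, 32), rank=0)
    guard_fired = False
except ValueError as err:
    guard_fired = True
    print(err)
```

Placing the check before any tensor is allocated means the error message names the real cause, rather than the ZeroDivisionError from the initializer scale.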

Why this teaches the concept

LoRA's whole pitch is "you only need a low rank to capture useful updates". This break makes the obvious follow-up question concrete: what is the lower bound on r? The answer is r = 1: a rank-1 outer product fills a full out × in matrix from only out + in trainable entries, all constrained to a single direction of update. r = 0 is the degenerate case, and the failure mode (zero learning) is exactly what you'd predict from linear algebra. The lesson generalizes to QLoRA, AdaLoRA, and any low-rank decomposition: the rank parameter bounds the expressivity, and the lower bound is 1, not 0.
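The r = 1 case is easy to inspect numerically. A small sketch with toy sizes chosen for illustration; the 1e-5 relative threshold used to count singular values is an arbitrary choice for float32:

```python
import torch

torch.manual_seed(0)
out_f, in_f = 6, 4          # toy sizes, chosen for illustration

b = torch.randn(out_f, 1)   # B at rank 1: a single column
a = torch.randn(1, in_f)    # A at rank 1: a single row
delta = b @ a               # full (out_f, in_f) matrix

# Numerical rank: count singular values that are not negligible.
s = torch.linalg.svdvals(delta)
numerical_rank = int((s > 1e-5 * s[0]).sum())

print(delta.shape, numerical_rank)
print(b.numel() + a.numel(), "trainable entries vs", delta.numel(), "in a dense update")
```

Every column of `delta` is a scalar multiple of `b`, which is what "one direction of update" means here.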

The §A13 grammar-tutor task is also a nice anchor: 8 irregular verbs × 5 tenses × 3 persons = 120 cell entries to potentially correct. Rank-1 can encode one direction of correction (e.g., "third-person past simple of eat adds an -e ending"), but you need ranks 2-8 to encode all of them. That's the rank-vs-accuracy curve you're about to measure in lab/02-lora-finetune.md.

Reference

Hu et al., LoRA (arXiv:2106.09685), §4.1 — discusses rank vs accuracy on RoBERTa and reports r = 8 as a sweet spot.
Eckart-Young-Mirsky theorem (matrix approximation theory): the best rank-r approximation of a matrix M, in the Frobenius or spectral norm, is the truncated SVD of M at rank r. r = 0 ⟹ the zero matrix.
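The theorem can be illustrated on a random matrix. The sketch below uses a hypothetical helper `truncate` and compares the Frobenius error of each truncated SVD with the discarded singular values; it is a numerical illustration under those assumptions, not a proof:

```python
import torch

torch.manual_seed(0)
M = torch.randn(6, 4, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)


def truncate(r: int) -> torch.Tensor:
    """Truncated SVD of M at rank r (hypothetical helper for this sketch)."""
    # Scale the first r left singular vectors, then recombine.
    return (U[:, :r] * S[:r]) @ Vh[:r, :]


errors, tails = [], []
for r in range(5):
    err = torch.linalg.norm(M - truncate(r))  # Frobenius norm of the residual
    tail = S[r:].pow(2).sum().sqrt()          # sqrt of the discarded sigma_i ** 2
    errors.append(float(err))
    tails.append(float(tail))
    print(r, errors[-1], tails[-1])
```

At r = 0 the approximation is the zero matrix and the error is the full norm of M, which is the linear-algebra counterpart of "zero learning".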

Next: restore rank to 8 and run lab/02-lora-finetune.md. Plot the rank-vs-accuracy curve to confirm the r = 8 saturation point.
