Phase 28 — Quizzes

Readable mirror of data/quizzes/phase-28-lora-qlora.yaml. Answers are hidden behind <details> blocks.

Source of truth: data/quizzes/phase-28-lora-qlora.yaml.

q-28-01 — LoRA's exact parameter count on Mini-GPT

For Mini-GPT (12 Linears: 8 attention (64,64) + 2 FFN (256,64) + 2 FFN (64,256)), how many trainable LoRA parameters does rank r add? Pick the closed-form formula.

1. 128 r
2. 640 r
3. 1024 r
4. 2304 r

Answer

**Choice 4 (2304 r).** Sum: `8 × 128 r` (attention) + `2 × 320 r` (FFN W_1) + `2 × 320 r` (FFN W_2) = `1024 r + 640 r + 640 r = 2304 r`. At `r=8`, that's 18 432 params.
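
The sum above can be checked mechanically. The following is a minimal sketch, assuming each LoRA adapter on a Linear of shape (out, in) adds a `B` of shape (out, r) and an `A` of shape (r, in); the names `MINI_GPT_LINEARS` and `lora_param_count` are hypothetical and used only for illustration.

```python
# Shapes taken from the question: 8 attention, 2 FFN W_1, 2 FFN W_2.
MINI_GPT_LINEARS = [(64, 64)] * 8 + [(256, 64)] * 2 + [(64, 256)] * 2


def lora_param_count(shapes, r):
    """Trainable LoRA params: r * (out + in) per adapted Linear."""
    return sum(r * (out_f + in_f) for out_f, in_f in shapes)


print(lora_param_count(MINI_GPT_LINEARS, 1))  # 2304 (the coefficient of r)
print(lora_param_count(MINI_GPT_LINEARS, 8))  # 18432
```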

q-28-02 — Why initialize B = 0

LoRA initializes B = 0 and A ~ N(0, 1/r). Why not initialize both to small random values?

1. Random init makes the optimizer unstable.
2. Random init makes BA non-zero at step 0, displacing the model from the pretrained equilibrium before any gradient signal arrives.
3. Random init violates the rank-r constraint.
4. Random init is incompatible with AdamW.

Answer

**Choice 2.** With `B=0`, `BA=0` at step 0, so the LoRA-equipped model behaves identically to the base. Initializing both matrices randomly makes `BA` non-zero at step 0, an arbitrary perturbation of the pretrained weights that no gradient has justified.
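
A small dependency-free sketch can make the difference concrete. It assumes `1/r` in `N(0, 1/r)` is the variance, uses toy sizes rather than Mini-GPT's, and picks an arbitrary standard deviation of 0.01 for the "small random" `B`; the helpers `matmul` and `max_abs` are hypothetical.

```python
import random


def matmul(X, Y):
    """Plain nested-list matrix product, to avoid external dependencies."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def max_abs(M):
    """Largest absolute entry of a nested-list matrix."""
    return max(abs(v) for row in M for v in row)


random.seed(0)
h, r = 4, 2  # toy sizes, chosen only for illustration

# A ~ N(0, 1/r), reading 1/r as the variance (an assumption).
A = [[random.gauss(0.0, (1.0 / r) ** 0.5) for _ in range(h)] for _ in range(r)]
B_zero = [[0.0] * r for _ in range(h)]                               # LoRA's choice
B_rand = [[random.gauss(0.0, 0.01) for _ in range(r)] for _ in range(h)]  # the alternative

delta_zero = matmul(B_zero, A)  # update added to W at step 0 with B = 0
delta_rand = matmul(B_rand, A)  # update added to W at step 0 with random B

print(max_abs(delta_zero))        # 0.0
print(max_abs(delta_rand) > 0.0)  # True
```

With `B = 0` the adapted weight `W + BA` equals `W` exactly before the first optimizer step; with a random `B` it does not.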

q-28-03 — What does `rank = 0` give you?

You set LoRA rank = 0 on Mini-GPT. Which of the following accurately describe what happens to training? Select all that apply.

1. The trainable parameter count of the LoRA module is zero.
2. The optimizer can still update the frozen base weights.
3. The product B @ A equals the zero matrix because empty matmul returns zero.
4. PyTorch raises an error on optimizer construction (empty param list).

Answer

**Choices 1, 3, 4.** `B` and `A` are empty; `B @ A` is the zero matrix; no trainable LoRA params exist; AdamW errors on the empty param list. The base weights remain frozen by design.
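
The "empty matmul returns zero" claim follows from the empty-sum convention. The sketch below illustrates it in plain Python rather than PyTorch, so it does not exercise the optimizer behavior; `lora_delta` is a hypothetical helper and the 3×3 size is arbitrary.

```python
def lora_delta(B, A, out_f, in_f, r):
    """Entry (i, j) of B @ A is the sum over k < r of B[i][k] * A[k][j]."""
    return [
        [sum(B[i][k] * A[k][j] for k in range(r)) for j in range(in_f)]
        for i in range(out_f)
    ]


out_f, in_f, r = 3, 3, 0
B = [[] for _ in range(out_f)]  # shape (out_f, 0): rows exist, columns do not
A = []                          # shape (0, in_f): no rows at all

delta = lora_delta(B, A, out_f, in_f, r)
n_trainable = sum(len(row) for row in B) + sum(len(row) for row in A)

print(delta)        # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(n_trainable)  # 0
```

Every entry is a sum over zero terms, so the update has the full (out, in) shape but contains only zeros, and there is nothing for the optimizer to train.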

q-28-04 — Why bigger models benefit more from LoRA (free)

The LoRA-to-full-FT trainable ratio for a square Linear of dim h at rank r is 2r/h. For r=8 and h ∈ {64, 768, 4096}, what does this ratio reveal about LoRA's relative efficiency?

Answer

The ratio is **0.25 (Mini-GPT), 0.021 (GPT-2 small), 0.004 (LLaMA-7B)**. LoRA's relative efficiency grows as `h` grows because `r` stays fixed — **bigger** models benefit more. Mini-GPT is small enough that the ratio is unflattering, which is why we use it for pedagogy, not as a real LoRA application.
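
The three ratios can be reproduced with a few lines. This is a minimal sketch for a single square Linear, where LoRA trains `2 * r * h` parameters and full fine-tuning trains `h * h`; `lora_ratio` is a hypothetical helper name.

```python
def lora_ratio(h, r):
    """LoRA trainable params (2*r*h) over full fine-tuning params (h*h)."""
    return (2 * r * h) / (h * h)


for h in (64, 768, 4096):
    print(h, round(lora_ratio(h, 8), 4))
# 64 0.25
# 768 0.0208
# 4096 0.0039
```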

q-28-05 — QLoRA = LoRA + ? (free)

QLoRA combines LoRA with one additional technique that allows fine-tuning a 65B model on a single 48 GB GPU. What is that technique, in one phrase?

Answer

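**4-bit NF4 quantization** of the frozen base weights. The base is stored in 4-bit NormalFloat (NF4), a 4-bit floating-point-style format rather than integer INT4, and QLoRA adds double quantization and paged optimizers on top. Only the LoRA adapter is kept in higher precision and trained.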
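
A back-of-the-envelope calculation shows why the 4-bit base matters for the 48 GB budget. This sketch counts weight storage only, in decimal gigabytes, and ignores quantization constants, adapter weights, activations, and optimizer state; `weight_gb` is a hypothetical helper.

```python
def weight_gb(n_params, bits_per_param):
    """Storage for the weights alone, in decimal GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9  # the 65B model from the question

print(weight_gb(n_params, 16))  # 130.0
print(weight_gb(n_params, 4))   # 32.5
```

At 16 bits the weights alone exceed 48 GB, while at 4 bits they leave room for the adapter and the rest of the training state.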
