Phase 28 — Quizzes
Readable mirror of `data/quizzes/phase-28-lora-qlora.yaml`, which is the source of truth. Answers are behind `<details>` blocks.
q-28-01 — LoRA's exact parameter count on Mini-GPT
For Mini-GPT (12 Linears: 8 attention (64,64) + 2 FFN (256,64) + 2 FFN (64,256)), how many trainable LoRA parameters does rank r add? Pick the closed-form formula.
- 128 r
- 640 r
- 1024 r
- 2304 r
Answer
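As a quick check of the count, here is a minimal Python sketch. It assumes each adapter on a Linear of shape (d_out, d_in) adds `r × (d_in + d_out)` trainable parameters (`B` is d_out × r, `A` is r × d_in). The names `MINI_GPT_LINEARS` and `lora_param_count` are illustrative and not taken from the course code.

```python
# Hypothetical helper; the shapes are the ones listed in the question.
MINI_GPT_LINEARS = [(64, 64)] * 8 + [(256, 64)] * 2 + [(64, 256)] * 2

def lora_param_count(shapes, r):
    """Trainable LoRA params, assuming B is (d_out, r) and A is (r, d_in)."""
    return sum(r * (d_out + d_in) for d_out, d_in in shapes)

per_rank = lora_param_count(MINI_GPT_LINEARS, 1)  # coefficient of r
at_r8 = lora_param_count(MINI_GPT_LINEARS, 8)     # total at rank 8
print(per_rank, at_r8)
```

Because the count is linear in `r`, evaluating the helper at `r=1` gives the coefficient of the closed-form formula directly.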
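**Choice 4 (2304 r).** A LoRA adapter on a Linear of shape (d_out, d_in) adds `r × (d_in + d_out)` parameters, so each (64,64) layer adds `128 r` and each FFN layer adds `320 r`. Sum: `8 × 128 r` (attention) + `2 × 320 r` (FFN W_1) + `2 × 320 r` (FFN W_2) = `1024 r + 640 r + 640 r = 2304 r`. At `r=8`, that's 18 432 params.

q-28-02 — Why initialize B = 0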
LoRA initializes B = 0 and A ~ N(0, 1/r). Why not initialize both to small random values?
- Random init makes the optimizer unstable.
- Random init makes BA non-zero at step 0, displacing the model from the pretrained equilibrium before any gradient signal arrives.
- Random init violates the rank-r constraint.
- Random init is incompatible with AdamW.
Answer
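The effect of the two initializations can be sketched in plain Python without any tensor library. This is an illustration under stated assumptions, not the course implementation: `N(0, 1/r)` is read as variance `1/r`, the dimensions are arbitrary toy values, and `matmul`, `B_zero`, and `B_rand` are hypothetical names.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    k, n = len(Y), len(Y[0])
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(n)]
            for i in range(len(X))]

random.seed(0)
d_out, d_in, r = 4, 6, 2       # toy sizes, chosen only for illustration
std = (1.0 / r) ** 0.5         # assumption: N(0, 1/r) means variance 1/r

A = [[random.gauss(0.0, std) for _ in range(d_in)] for _ in range(r)]
B_zero = [[0.0] * r for _ in range(d_out)]                    # LoRA-style init
B_rand = [[random.gauss(0.0, 0.01) for _ in range(r)]         # "small random" init
          for _ in range(d_out)]

delta_zero = matmul(B_zero, A)  # update added to the frozen weight at step 0
delta_rand = matmul(B_rand, A)

zero_init_is_noop = all(v == 0.0 for row in delta_zero for v in row)
rand_init_is_noop = all(v == 0.0 for row in delta_rand for v in row)
print(zero_init_is_noop, rand_init_is_noop)
```

With `B = 0` the update `BA` is exactly zero whatever `A` holds, while a random `B` perturbs the frozen weight before the first gradient step.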
**Choice 2.** With `B=0`, `BA=0` at step 0, so the LoRA-equipped model behaves identically to the base. Initializing both randomly gives a non-zero starting displacement that no gradient signal has justified.

q-28-03 — What does rank = 0 give you?
You set LoRA rank = 0 on Mini-GPT. Which of the following accurately describe what happens to training?
- The trainable parameter count of the LoRA module is zero.
- The optimizer can still update the frozen base weights.
- The product `B @ A` equals the zero matrix because an empty matmul returns zero.
- PyTorch raises an error on optimizer construction (empty param list).
Answer
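The zero-matrix claim rests on the convention that a sum over zero terms is 0. The plain-Python sketch below illustrates only that convention; it does not call PyTorch and does not reproduce the optimizer error. The toy sizes and the helper name `matmul_rank` are hypothetical.

```python
def matmul_rank(B, A, r, d_in):
    """Product of B (d_out x r) and A (r x d_in), summing over the rank axis."""
    return [[sum(B[i][t] * A[t][j] for t in range(r)) for j in range(d_in)]
            for i in range(len(B))]

d_out, d_in, r = 3, 5, 0         # toy sizes; rank 0 is the case in question
B = [[] for _ in range(d_out)]   # d_out rows, zero columns
A = []                           # zero rows

delta = matmul_rank(B, A, r, d_in)   # every entry is an empty sum
trainable = d_out * r + r * d_in     # number of LoRA scalars at this rank
print(delta, trainable)
```

The result still has the full (d_out, d_in) shape, so it can be added to the frozen weight without changing it.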
**Choices 1, 3, 4.** `B` and `A` are empty; `B @ A` is the zero matrix; no trainable LoRA params exist; AdamW errors on the empty param list. The base weights remain frozen by design.

q-28-04 — Why bigger models benefit more from LoRA (free)
The LoRA-to-full-FT trainable ratio for a square Linear of dim h at rank r is 2r/h. For r=8 and h ∈ {64, 768, 4096}, what does this ratio reveal about LoRA's relative efficiency?
Answer
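The ratio can be tabulated with a few lines of Python. This is a sketch: `lora_ratio` is an illustrative name, and it counts `2rh` LoRA parameters against `h²` full fine-tuning parameters for a single square Linear, as in the question.

```python
def lora_ratio(h, r):
    """LoRA params (2*r*h) over full fine-tuning params (h*h) for an h x h Linear."""
    return (2 * r * h) / (h * h)

ratios = {h: lora_ratio(h, 8) for h in (64, 768, 4096)}
for h, ratio in ratios.items():
    print(f"h={h:>4}: ratio={ratio:.4f}")
```

Holding `r` fixed while `h` grows makes the ratio shrink like `1/h`.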
The ratio is **0.25 (Mini-GPT), 0.021 (GPT-2 small), 0.004 (LLaMA-7B)**. LoRA's relative efficiency grows as `h` grows because `r` stays fixed, so **bigger** models benefit more. Mini-GPT is small enough that the ratio is unflattering, which is why we use it for pedagogy, not as a real LoRA application.

q-28-05 — QLoRA = LoRA + ? (free)
QLoRA combines LoRA with one additional technique that allows fine-tuning a 65B model on a single 48 GB GPU. What is that technique, in one phrase?