03 — Paper-Read Drill: 20 Minutes per Paper
A 20-minute speed-reading protocol for papers, applied to Attention (Vaswani 2017), CLIP (Radford 2021), DPO (Rafailov 2023), and Chinchilla (Hoffmann 2022). For each one: what to read, what to skip, and the 3 things you have to remember.
The 20-minute protocol
A "paper round" in an AI-lab interview gives you 20-30 minutes with a paper (sometimes pre-shared, sometimes cold) and then 20-30 minutes of discussion. You will not finish the paper. That is fine. The protocol:
- Minutes 0-2 — Title, abstract, conclusion. Get the claim. Write it in one sentence in the margin.
- Minutes 2-5 — Figures and tables. This is where the result lives. Read every caption. Mark "the headline figure" and "the surprising table".
- Minutes 5-12 — Method. The core section. Read for exactly what is novel. Skip background. Re-derive the central equation on paper.
- Minutes 12-16 — Experiments. Read only the setup that grounds the headline number. Skip the laundry list of benchmarks.
- Minutes 16-18 — Ablations. This is where weak papers fall apart. Look for what is missing.
- Minutes 18-20 — Limitations / related work. Note what the authors themselves admit is weak. This is the easiest place to pose a question.
You walk into the discussion with: the one-sentence claim, the headline figure, the central equation, the most important ablation, and one weakness.
What interviewers grade
- Did you identify the claim correctly? (Many candidates over-read.)
- Can you explain the central method without jargon?
- Did you notice what is not in the paper? (Missing ablation, biased baseline, no scaling study.)
- Do you have an opinion? "I think the method works but the evaluation overstates it because of X."
Paper 1 — Vaswani et al. 2017, Attention Is All You Need
One-sentence claim
A pure attention-based encoder-decoder architecture matches or exceeds RNN-based machine translation while being far more parallelizable to train.
What to read (in priority order)
- §1, §2 (intro + background) — skim, 2 min.
- §3 (model architecture) — read carefully, 8 min. Specifically §3.2 (attention) and §3.3 (FFN).
- §4 (why self-attention) — read; this is the parallelization + path-length argument.
- Table 2 (BLEU + training cost on WMT) — the headline result.
- Table 3 (ablations on number of heads, d_k, dropout) — the most cited ablation.
What to skip
- §5 (training details, optimizer, regularization) — only skim. You already know Adam.
- §6.1 (machine translation results in detail) — Table 2 covers it.
- §6.3 (English constituency parsing) — peripheral.
Central equation
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
You should be able to derive sqrt(d_k) from variance scaling (see whiteboard Q1).
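The equation and the variance argument above can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: the function names, shapes, and the random unit-variance inputs are assumptions made for illustration. It shows that raw dot products of `d_k`-dimensional unit-variance vectors have variance near `d_k`, and that dividing by `sqrt(d_k)` brings it back to about 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

# Variance check, assuming zero-mean, unit-variance, independent entries.
rng = np.random.default_rng(0)
d_k = 64
Q = rng.standard_normal((1000, d_k))
K = rng.standard_normal((1000, d_k))
raw_var = (Q @ K.T).var()                      # close to d_k
scaled_var = (Q @ K.T / np.sqrt(d_k)).var()    # close to 1
```

Without the scaling, the logits grow with `d_k` and the softmax saturates, which is the point to make out loud in the interview.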
The 3 things to remember
- Self-attention has constant-path-length between any two positions, unlike RNNs (linear) or CNNs (log). This is the dominant theoretical argument.
- Multi-head attention is `h` parallel attention ops on slices of the embedding, then concatenated. Not "h independent layers".
- The cost is `O(n^2 d)`, which is the entire reason Phase 22 (KV cache), Phase 27 (FlashAttention), and the long-context arms race exist.
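The "parallel ops on slices, then concatenated" point can be made concrete with a short NumPy sketch. The weight matrices, shapes, and function names here are illustrative assumptions (random matrices stand in for learned projections), and masking, dropout, and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (n, d_model); every W_*: (d_model, d_model); h must divide d_model.
    n, d_model = X.shape
    d_head = d_model // h

    def split(M):
        # View the last axis as h contiguous slices of width d_head: (h, n, d_head).
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n): the O(n^2) term
    heads = softmax(scores, axis=-1) @ V                  # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # heads side by side again
    return concat @ W_o
```

The `(h, n, n)` score tensor is where the quadratic cost in sequence length shows up; all heads share one set of projection matrices and differ only in which slice of the projected vectors they read.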
Likely follow-up questions
- "Walk me through multi-head attention." (Whiteboard Q1.)
- "Why pre-LN replaced post-LN later." (Q2, Q12.)
- "What is missing from this paper?" → "There is no scaling study; no analysis of attention head specialization (came later — Voita 2019, Clark 2019); no investigation of positional encoding alternatives (came later — RoPE, ALiBi)."
Pit-of-failure
Saying "this paper invented transformers" without being able to derive the scaled dot-product. Signals: never implemented.
Paper 2 — Radford et al. 2021, Learning Transferable Visual Models From Natural Language Supervision (CLIP)
One-sentence claim
Contrastive pretraining of paired image–text encoders on 400M web pairs produces a vision model with zero-shot performance competitive with supervised ImageNet, generalizing to dozens of classification tasks without per-task fine-tuning.
What to read
- §1 (intro) — the framing matters: "natural language supervision" vs "fixed label set".
- §2.2, §2.3 — dataset (WIT, 400M pairs) and training objective.
- §2.4 — the architecture choice (ResNet vs ViT image encoder; text encoder is a transformer).
- §3.1, §3.2 — zero-shot transfer methodology and prompt engineering.
- Figure 4 (zero-shot vs supervised ImageNet linear probe) — the headline.
- Figure 5 (zero-shot on 27 datasets) — generalization.
- §3.3 — robustness / distribution shift.
What to skip
- §4 (comparison to human performance) — only skim.
- §5, §6 (data overlap analysis and limitations) on first pass; revisit if asked.
- The 30-page appendix unless asked specifically about a dataset.
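Central equation
The InfoNCE contrastive loss over a batch of N image-text pairs, written with the cosine similarity $s_{ij}$ between image $i$ and text $j$:
$$
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right], \qquad s_{ij} = \cos\big(I(x_i), T(t_j)\big)
$$
Two encoders (I, T), one shared temperature tau, batch-size-N softmax over text candidates for each image (and symmetrically for image candidates for each text).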
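The symmetric softmax described above can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the paper's code: the function names are made up for illustration, the embeddings are plain arrays rather than encoder outputs, and `tau` is held fixed at 0.07 (the value the paper initializes it to) even though CLIP learns the temperature during training.

```python
import numpy as np

def log_softmax(x, axis):
    # Stable log-softmax: shift by the max, then subtract the log-sum-exp.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, tau=0.07):
    # image_emb, text_emb: (N, d); row i of each is a matched pair.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / tau       # (N, N); entry (i, j) = image i vs text j
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # softmax over texts per image
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # softmax over images per text
    return 0.5 * (loss_i2t + loss_t2i)
```

The correct pair always sits on the diagonal of the `(N, N)` logit matrix, so the loss is two cross-entropies with the identity as the target, one per direction.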
The 3 things to remember
- 400M image-text pairs scraped from the web — scale was the key, not the architecture.
- Zero-shot via prompt templating: for ImageNet, use prompts like `"a photo of a {class}"` and pick the class whose text embedding is closest to the image embedding.
- Zero-shot ImageNet accuracy of the best CLIP model (ViT-L/14 at 336px) matches the original supervised ResNet-50 (76.2% top-1) without using any of the 1.28M labeled training images. This is the evidence that natural-language supervision is competitive with classification supervision.
Likely follow-ups
- "How would you adapt CLIP for retrieval?" → use image and text embeddings as a shared space; nearest-neighbor search.
- "Why does CLIP fail on counting / fine-grained tasks?" → contrastive objective rewards coarse matching; counting requires symbolic structure that doesn't pop out of contrastive.
- "How does CLIP compare to ALIGN / SigLIP / EVA-CLIP?" → ALIGN (Google) is contemporaneous; SigLIP (2023) replaces the softmax with sigmoid for better scaling; EVA-CLIP scales the vision encoder.
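Zero-shot classification by prompt templating and the retrieval answer above are the same operation: nearest-neighbor search in the shared embedding space. A minimal sketch follows, assuming embeddings are already available as NumPy arrays. The `toy_text` lookup is a hypothetical stand-in for a trained text encoder, and the function names are illustrative, not part of any CLIP library.

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_names, encode_text, template="a photo of a {}"):
    # Build one prompt per class, embed it, and pick the closest class.
    prompts = [template.format(name) for name in class_names]
    text_embs = normalize(np.stack([encode_text(p) for p in prompts]))
    sims = text_embs @ normalize(image_emb)
    return class_names[int(np.argmax(sims))]

def retrieve_images(text_emb, image_embs, k=2):
    # Rank a gallery of image embeddings against one text query, best first.
    sims = normalize(image_embs) @ normalize(text_emb)
    return np.argsort(-sims)[:k]

# Toy stand-in for a trained text encoder (hypothetical 3-d embeddings).
toy_text = {
    "a photo of a cat": np.array([1.0, 0.1, 0.0]),
    "a photo of a dog": np.array([0.1, 1.0, 0.0]),
    "a photo of a car": np.array([0.0, 0.0, 1.0]),
}
encode_text = toy_text.__getitem__

label = zero_shot_classify(np.array([0.9, 0.2, 0.1]), ["cat", "dog", "car"], encode_text)
gallery = np.array([[0.0, 0.1, 1.0], [1.0, 0.0, 0.1], [0.8, 0.3, 0.0]])
ranked = retrieve_images(encode_text("a photo of a cat"), gallery, k=2)
```

At real scale the `argsort` would be replaced by an approximate nearest-neighbor index, but the scoring rule stays the same.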
Pit-of-failure
Saying "CLIP is contrastive" without being able to write InfoNCE. Signals: read the abstract.
Paper 3 — Rafailov et al. 2023, Direct Preference Optimization (DPO)
One-sentence claim
The optimal policy of KL-regularized RLHF has a closed form in terms of a reward function; this allows the reward to be expressed in terms of the policy itself, eliminating the need to train an explicit reward model and an RL loop — preference optimization reduces to a simple classification loss on the LM directly.
What to read
- §4 (Direct Preference Optimization) — the derivation and the gradient interpretation (pairs are weighted by how wrongly the implicit reward orders them). Read every line.
- §5 (theoretical analysis) — the argument that expressing the reward through the policy loses no generality.
- §6 (experiments) — IMDB sentiment, TL;DR summarization, dialogue.
- Figure 2 (sentiment) — performance vs RLHF baseline.
- §7 (discussion, including limitations and future work) — note the honesty about preference noise sensitivity.
What to skip on first pass
- §2 (related work) and §3 (preliminaries) if you already know PPO RLHF.
- Most of §5 (theoretical analysis fine print).
- The dialogue qualitative samples.
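Central derivation (you must be able to do this on a whiteboard)
Starting point — the KL-constrained RLHF objective:
$$
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big] - \beta \, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
$$
The optimum has a closed form (Lagrangian + variational calculus):
$$
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\big( r(x, y) / \beta \big), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\big( r(x, y) / \beta \big)
$$
Rearranging for r:
$$
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
$$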
The Bradley-Terry preference likelihood P(y_w > y_l | x) = sigmoid(r(x, y_w) - r(x, y_l)) — substitute and the log Z(x) cancels:
L_DPO = - log sigmoid( beta * log(pi(y_w | x) / pi_ref(y_w | x)) - beta * log(pi(y_l | x) / pi_ref(y_l | x)) )
This is the DPO loss. No reward model, no PPO.
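The loss above is short enough to write out directly. The sketch below assumes the four inputs are already-computed sequence log-probabilities (summed over response tokens) for the chosen and rejected responses under the policy and the frozen reference. The function name, argument names, and the default `beta=0.1` are illustrative choices, not something this file specifies.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Each argument has shape (batch,): log pi(y | x) summed over the response tokens.
    # Implicit reward margin: beta * [log-ratio of chosen - log-ratio of rejected].
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(z) == log(1 + exp(-z)) == logaddexp(0, -z), which is numerically stable.
    return np.logaddexp(0.0, -margin).mean()

# Toy log-probabilities for two preference pairs (made-up numbers).
ref_w = np.array([-12.0, -8.0])
ref_l = np.array([-11.0, -9.0])
at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)               # policy == reference
improved = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)  # policy favors y_w more
```

When the policy equals the reference, the margin is zero and the loss is `log 2`; it falls only when the policy raises the chosen response relative to the rejected one, measured against the reference.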
The 3 things to remember
- The `log Z(x)` cancellation is the magic — it depends only on the prompt, not the response, so it vanishes from the pairwise comparison.
- DPO requires a reference policy `pi_ref` (typically the SFT model) to evaluate. The reference is frozen.
- Beta controls the strength of the KL constraint to `pi_ref`. Smaller `beta` → more drift from SFT → more reward-hacking risk.
Likely follow-ups
- "Derive the DPO loss from scratch." (Above.)
- "When does DPO underperform PPO?" → on noisy preferences (reward model averages noise; DPO does not); on tasks where the reward is easily verified (math, code) PPO with a verifier reward beats DPO.
- "What is IPO / KTO / ORPO?" → IPO fixes DPO's tendency to push `pi/pi_ref` to infinity; KTO uses one-sided (good/bad) labels; ORPO combines SFT and preference loss without reference. See X3.
Pit-of-failure
Knowing "DPO is direct" without being able to derive the loss. Signals: read the abstract. Will not pass an Anthropic interview.
Paper 4 — Hoffmann et al. 2022, Training Compute-Optimal Large Language Models (Chinchilla)
One-sentence claim
For a fixed compute budget, the large models of the time were dramatically undertrained: parameters and training tokens should scale in equal proportion (N ∝ sqrt(C) and D ∝ sqrt(C), so N ∝ D), not parameters disproportionately as in earlier work (Kaplan 2020).
What to read
- §1 (intro) — the framing: "given compute C, what (N, D) maximizes performance?"
- §3 (approaches) — three independent methodologies, all converge.
  - Approach 1: fix a set of model sizes N and vary the number of training tokens D, then find the optimal D per N.
  - Approach 2: fix a set of FLOP budgets and vary the model size within each one (the IsoFLOP profiles), then find the loss-minimizing N per budget.
  - Approach 3: fit a parametric loss function `L(N, D) = E + A/N^alpha + B/D^beta`.
- Figure 3 (the iso-flop curves) — the headline figure of the paper.
- §4 (Chinchilla) — the 70B model trained on 1.4T tokens. Beats Gopher (280B, 300B tokens) at ¼ the inference cost.
- Table 4 (downstream benchmarks).
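The parametric form from Approach 3 can be evaluated directly for the two models compared in §4. This is a sketch, not the paper's fitting code: the function name is made up, the defaults are the rounded fitted constants quoted in this file, and the outputs are the fit's predictions rather than measured losses.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    # N: parameter count, D: training tokens. Returns the predicted loss.
    return E + A / N**alpha + B / D**beta

gopher = chinchilla_loss(280e9, 300e9)       # 280B parameters, 300B tokens
chinchilla = chinchilla_loss(70e9, 1.4e12)   # 70B parameters, 1.4T tokens

# With these constants the fit predicts roughly 1.99 for Gopher and 1.94 for Chinchilla.
print(f"Gopher: {gopher:.3f}  Chinchilla: {chinchilla:.3f}")
```

For Gopher the data term `B / D**beta` is about five times larger than the parameter term `A / N**alpha`, which is the numerical version of "undertrained".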
What to skip
- The long benchmark tables in §4.
- Most of the appendix.
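Central equation
$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$
with fitted constants E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, alpha ≈ 0.34, beta ≈ 0.28, where N is the parameter count and D is the number of training tokens.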
The conclusion (numerical): D ≈ 20 × N is the compute-optimal token count per parameter.
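The 20-tokens-per-parameter rule turns into a budget calculator once a FLOP model is fixed. The sketch below assumes the common approximation C ≈ 6 · N · D for training FLOPs, which this file does not state. The function name and the `tokens_per_param` argument are illustrative.

```python
import math

def compute_optimal(C, tokens_per_param=20.0):
    # Assumption: training FLOPs C ~= 6 * N * D, combined with D = tokens_per_param * N.
    # Substituting gives C = 6 * tokens_per_param * N**2, so N = sqrt(C / (6 * ratio)).
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Sanity check against Chinchilla's own budget (70B parameters, 1.4T tokens).
C = 6 * 70e9 * 1.4e12
N, D = compute_optimal(C)   # recovers roughly 70e9 parameters and 1.4e12 tokens
```

Because both N and D scale as `sqrt(C)` under this rule, quadrupling the compute budget doubles both the model size and the token count.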
The 3 things to remember
- Tokens per parameter ≈ 20 at compute-optimal. (Llama 3 went much further — 15T tokens for 8B and 70B — proving you can keep training past the compute-optimal point if inference cost dominates training cost.)
- Gopher (280B params, 300B tokens) was wildly undertrained. Chinchilla (70B params, 1.4T tokens) outperformed it at 4x lower inference cost.
- The Kaplan 2020 prediction was wrong — they extrapolated from too-small models and over-predicted the importance of parameter count.
Likely follow-ups
- "What is the difference between Kaplan scaling and Chinchilla scaling?" → Kaplan said `N` grows faster than `D` with compute; Chinchilla said equal.
- "Why does Llama 3 train far past compute-optimal?" → Chinchilla optimizes training cost. If you amortize a model over billions of inferences, over-training a small model is cheaper per query than under-training a big one.
- "How would you design a scaling study for a new architecture?" → fit `L(N, D)` across at least 4 N's and 4 D's; check residuals; report `alpha, beta`.
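For the Kaplan-versus-Chinchilla follow-up, it helps to know how the compute exponents fall out of the parametric fit. The sketch below assumes the common approximation C ≈ 6ND for training FLOPs (not stated in this file) and uses only the fitted alpha and beta quoted under the central equation.

```latex
\begin{aligned}
&\min_{N, D} \; \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
  \quad \text{subject to} \quad 6ND = C \\
&\Rightarrow \; N_{\mathrm{opt}} = G \left(\frac{C}{6}\right)^{a}, \qquad
  D_{\mathrm{opt}} = G^{-1} \left(\frac{C}{6}\right)^{b}, \\
&\quad\; a = \frac{\beta}{\alpha + \beta}, \qquad
  b = \frac{\alpha}{\alpha + \beta}, \qquad
  G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}} \\
&\alpha = 0.34,\ \beta = 0.28 \;\Rightarrow\; a \approx 0.45,\ b \approx 0.55
\end{aligned}
```

Since a + b = 1, the two exponents always split the compute budget between them. With the quoted constants the split is close to even and slightly token-heavy, so "equal scaling" is best read as exponents near 0.5 rather than exactly 0.5.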
Pit-of-failure
Quoting "20 tokens per parameter" as a universal law without understanding it is training-compute optimal, not inference-cost optimal. Signals: didn't read past the abstract.
How to drill this file
- Pick one of the four papers. Set a 20-minute timer.
- Follow the protocol. Write the one-sentence claim, the central equation, and one weakness on a card.
- Have a friend (or this file) fire follow-up questions. You have 60 seconds per follow-up.
- Repeat with another paper every 2 days.
After all 4 papers: add one more paper from your work history per week. Maintain a paper-card folder (see ../lab/01-paper-pitch-cards.md).
→ Next: 04-coding-drills.md