03 — Paper-Read Drill: 20 Minutes per Paper¶

Fast paper-reading protocol (20 min) applied to Attention (Vaswani 2017), CLIP (Radford 2021), DPO (Rafailov 2023), and Chinchilla (Hoffmann 2022). For each one: what to read, what to skip, and the 3 things you have to remember.

The 20-minute protocol¶

A "paper round" in an AI-lab interview gives you 20-30 minutes with a paper (sometimes pre-shared, sometimes cold) and then 20-30 minutes of discussion. You will not finish the paper. That is fine. The protocol:

Minutes 0-2 — Title, abstract, conclusion. Get the claim. Write it in one sentence in the margin.
Minutes 2-5 — Figures and tables. This is where the result lives. Read every caption. Mark "the headline figure" and "the surprising table".
Minutes 5-12 — Method. The core section. Read for exactly what is novel. Skip background. Re-derive the central equation on paper.
Minutes 12-16 — Experiments. Read only the setup that grounds the headline number. Skip the laundry list of benchmarks.
Minutes 16-18 — Ablations. This is where weak papers fall apart. Look for what is missing.
Minutes 18-20 — Limitations / related work. Note what the authors themselves admit is weak. This is the easiest place to pose a question.

You walk into the discussion with: the one-sentence claim, the headline figure, the central equation, the most important ablation, and one weakness.

What interviewers grade¶

Did you identify the claim correctly? (Many candidates over-read.)
Can you explain the central method without jargon?
Did you notice what is not in the paper? (Missing ablation, biased baseline, no scaling study.)
Do you have an opinion? "I think the method works but the evaluation overstates it because of X."

Paper 1 — Vaswani et al. 2017, Attention Is All You Need¶

One-sentence claim¶

A pure attention-based encoder-decoder architecture matches or exceeds RNN-based machine translation while being far more parallelizable to train.

What to read (in priority order)¶

§1, §2 (intro + background) — skim, 2 min.
§3 (model architecture) — read carefully, 8 min. Specifically §3.2 (attention) and §3.3 (FFN).
§4 (why self-attention) — read; this is the parallelization + path-length argument.
Table 2 (BLEU + training cost on WMT) — the headline result.
Table 3 (ablations on number of heads, d_k, dropout) — the most cited ablation.

What to skip¶

§5 (training details, optimizer, regularization) — only skim. You already know Adam.
§6.1 (machine translation results in detail) — Table 2 covers it.
§6.3 (English constituency parsing) — peripheral. (§6.2 is the model-variations discussion that accompanies Table 3.)

Central equation¶

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

You should be able to derive sqrt(d_k) from variance scaling (see whiteboard Q1).
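A minimal NumPy sketch of the equation above, plus an empirical check of the variance argument. It assumes query and key entries are independent with zero mean and unit variance, so a dot product over d_k dimensions has variance close to d_k and dividing by sqrt(d_k) brings it back to about 1. The helper names and toy sizes are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=1)        # unscaled dot products: variance close to d_k
scaled = raw / np.sqrt(d_k)      # scaled dot products: variance close to 1
raw_var, scaled_var = raw.var(), scaled.var()

out = attention(q[:5], k[:7], rng.standard_normal((7, 8)))  # shape (5, 8)
```

Without the scaling, logits with variance d_k push the softmax toward a near one-hot output, where gradients are small. That is the point to state out loud on the whiteboard.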

The 3 things to remember¶

Self-attention has constant-path-length between any two positions, unlike RNNs (linear) or CNNs (log). This is the dominant theoretical argument.
Multi-head attention is h parallel attention ops, each on its own learned projection of the embedding (width d_k = d_model / h), then concatenated and projected once more. Not "h independent layers".
Self-attention costs O(n^2 d) per layer, which is the entire reason Phase 22 (KV cache), Phase 27 (FlashAttention), and the long-context arms race exist.
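The multi-head point can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code: it stacks the per-head projection matrices into single (d_model, d_model) matrices and splits the result into h heads, which is equivalent to h separate projections of width d_k. The weight names and toy sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); all weights: (d_model, d_model); h: number of heads."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): one slice of the projected features per head
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n): the quadratic term
    heads = softmax(scores, axis=-1) @ V                    # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads per position
    return concat @ W_o, scores.shape

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
Y, score_shape = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)
```

The (h, n, n) score tensor is where the quadratic cost in sequence length shows up: h * n * n entries, each a d_k-dimensional dot product, for n^2 * d_model multiply-adds in total.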

Likely follow-up questions¶

"Walk me through multi-head attention." (Whiteboard Q1.)
"Why did pre-LN later replace post-LN?" (Q2, Q12.)
"What is missing from this paper?" → "There is no scaling study; no analysis of attention head specialization (came later — Voita 2019, Clark 2019); no investigation of positional encoding alternatives (came later — RoPE, ALiBi)."

Pit-of-failure¶

Saying "this paper invented transformers" without being able to derive the scaled dot-product. Signals: never implemented.

Paper 2 — Radford et al. 2021, Learning Transferable Visual Models From Natural Language Supervision (CLIP)¶

One-sentence claim¶

Contrastive pretraining of paired image–text encoders on 400M web pairs produces a vision model with zero-shot performance competitive with supervised ImageNet, generalizing to dozens of classification tasks without per-task fine-tuning.

What to read¶

§1 (intro) — the framing matters: "natural language supervision" vs "fixed label set".
§2.2, §2.3 — dataset (WIT, 400M pairs) and training objective.
§2.4 — the architecture choice (ResNet vs ViT image encoder; text encoder is a transformer).
§3.1 — zero-shot transfer methodology and prompt engineering; §3.2 (linear-probe representation learning) if time allows.
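Figure 5 (zero-shot CLIP vs a supervised linear probe on ResNet-50 features, 27 datasets including ImageNet) — the headline, and the generalization evidence.
Figure 4 (gains from prompt engineering and ensembling) — supports the zero-shot methodology.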
§3.3 — robustness / distribution shift.

What to skip¶

§4 (comparison to human performance) — only skim.
§5 (data overlap analysis) and §6 (limitations) on first pass; revisit if asked.
The 30-page appendix unless asked specifically about a dataset.

Central equation¶

The InfoNCE contrastive loss over a batch of N image-text pairs:

L = - (1/N) sum_i log( exp(sim(I_i, T_i) / tau) / sum_j exp(sim(I_i, T_j) / tau) )

Two encoders (I, T), one shared temperature tau, batch-size-N softmax over text candidates for each image (and symmetrically for image candidates for each text).
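A minimal NumPy sketch of the symmetric form of this loss. It assumes the two encoders have already produced one embedding per image and per text, and it fixes tau at 0.07 for illustration (CLIP learns the temperature during training). The function names are illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over N matched pairs; row i of each array is one pair."""
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = I @ T.T / tau                        # (N, N) cosine similarities over tau
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 32))
aligned = clip_loss(img, img + 0.01 * rng.standard_normal((8, 32)))   # matched pairs
shuffled = clip_loss(img, rng.standard_normal((8, 32)))               # unrelated pairs
```

The correct pairs sit on the diagonal of the (N, N) logit matrix, so the loss is ordinary cross-entropy with labels 0..N-1 applied once along rows and once along columns.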

The 3 things to remember¶

400M image-text pairs scraped from the web — scale was the key, not the architecture.
Zero-shot via prompt templating: for ImageNet, use prompts like "a photo of a {class}" and pick the class whose text embedding is closest to the image embedding.
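Zero-shot CLIP matches the ImageNet accuracy of the original supervised ResNet-50 (76.2% top-1 for the best ViT-L/14 model) without using any of the 1.28M labeled training images, and a linear probe on CLIP features goes well beyond that. This is the proof that natural-language supervision is competitive with classification supervision.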
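The prompt-templating recipe can be sketched as follows. The embeddings here are random stand-ins: a real system would obtain them from the trained image and text encoders, which this sketch does not include. `build_prompts` and `zero_shot_classify` are hypothetical helper names.

```python
import numpy as np

def build_prompts(class_names, template="a photo of a {}"):
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_emb, text_embs):
    """Index of the prompt embedding with the highest cosine similarity to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

class_names = ["dog", "cat", "car"]
prompts = build_prompts(class_names)   # these strings would be fed to the text encoder

# Stand-in embeddings (assumption): one random vector per prompt, and an
# image vector placed close to the "cat" prompt embedding.
rng = np.random.default_rng(0)
text_embs = rng.standard_normal((3, 16))
image_emb = text_embs[1] + 0.1 * rng.standard_normal(16)

predicted = class_names[zero_shot_classify(image_emb, text_embs)]
```

The classifier is just the text encoder applied to the label set, so changing the task means changing the list of class names, with no retraining.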

Likely follow-ups¶

"How would you adapt CLIP for retrieval?" → use image and text embeddings as a shared space; nearest-neighbor search.
"Why does CLIP fail on counting / fine-grained tasks?" → the contrastive objective rewards coarse image-text matching; counting requires compositional structure that the objective does not directly reward.
"How does CLIP compare to ALIGN / SigLIP / EVA-CLIP?" → ALIGN (Google) is contemporaneous; SigLIP (2023) replaces the softmax with sigmoid for better scaling; EVA-CLIP scales the vision encoder.

Pit-of-failure¶

Saying "CLIP is contrastive" without being able to write InfoNCE. Signals: read the abstract.

Paper 3 — Rafailov et al. 2023, Direct Preference Optimization (DPO)¶

One-sentence claim¶

Because the optimal policy of KL-regularized RLHF has a closed form in the reward, the reward can be rewritten in terms of the policy itself, so preference optimization reduces to a simple classification loss on the LM with no explicit reward model and no RL loop.

What to read¶

§4 (Direct Preference Optimization) — the derivation. Read every line.
§4, gradient paragraph — the gradient interpretation: each example is weighted by how badly the implicit reward misorders the pair.
§6 (experiments) — IMDB sentiment, TL;DR summarization, dialogue.
Figure 2 (sentiment) — performance vs RLHF baseline.
§7 (discussion, including limitations and future work) — note the honesty about preference noise sensitivity.

What to skip on first pass¶

§2 (related work) and §3 (preliminaries) if you already know PPO RLHF.
Most of §5 (theoretical analysis fine print).
The dialogue qualitative samples.

Central derivation (you must be able to do this on a whiteboard)¶

Starting point — the KL-constrained RLHF objective:

max_pi  E_{x, y ~ pi} [r(x, y)] - beta * KL(pi(. | x) || pi_ref(. | x))

The optimum has a closed form (Lagrangian + variational calculus):

pi*(y | x) = (1 / Z(x)) * pi_ref(y | x) * exp( r(x, y) / beta )

Rearranging for r, with pi now denoting the optimal policy pi*:

r(x, y) = beta * log( pi(y | x) / pi_ref(y | x) ) + beta * log Z(x)

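Under the Bradley-Terry model, the preference likelihood is P(y_w > y_l | x) = sigmoid(r(x, y_w) - r(x, y_l)). Substitute the expression for r: both responses share the prompt x, so the beta * log Z(x) terms cancel, leaving the per-pair loss (averaged over the preference dataset in training):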

L_DPO = - log sigmoid( beta * log(pi(y_w | x) / pi_ref(y_w | x)) - beta * log(pi(y_l | x) / pi_ref(y_l | x)) )

This is the DPO loss. No reward model, no PPO.
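A minimal NumPy sketch of this loss, assuming you already have the summed token log-probabilities of each response under the policy and under the frozen reference. The value beta = 0.1 and the toy numbers are illustrative only.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is an array of sequence log-probabilities log pi(y | x):
    policy on chosen, policy on rejected, reference on chosen, reference on rejected.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(m) equals log(1 + exp(-m)); logaddexp computes it stably.
    return np.logaddexp(0.0, -margin).mean()

ref_w = np.array([-12.0, -20.0])   # reference log-probs of the chosen responses
ref_l = np.array([-11.0, -19.0])   # reference log-probs of the rejected responses

at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)                # policy equals reference
improved = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)   # policy favours chosen
```

When the policy equals the reference the margin is zero and the loss is log 2, the same starting point as a binary classifier that has not learned anything yet.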

The 3 things to remember¶

The log Z(x) cancellation is the magic — it depends only on the prompt, not the response, so it vanishes from the pairwise comparison.
DPO requires a reference policy pi_ref (typically the SFT model) to evaluate. The reference is frozen.
Beta controls the strength of the KL constraint to pi_ref. Smaller beta → more drift from SFT → more reward-hacking risk.
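The cancellation can be checked numerically on a toy problem. The sketch below assumes a single prompt with five possible responses, a random reference policy, and an arbitrary reward. It builds the closed-form optimum and confirms that reward differences are recovered from the policy alone. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5
n_responses = 5                                 # toy response space for one prompt x
pi_ref = rng.dirichlet(np.ones(n_responses))    # reference policy pi_ref(y | x)
r = rng.standard_normal(n_responses)            # arbitrary reward r(x, y)

# Closed-form optimum: pi*(y | x) = pi_ref(y | x) * exp(r(x, y) / beta) / Z(x)
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

# Implicit reward computed from the policy alone, omitting beta * log Z(x).
implicit = beta * np.log(pi_star / pi_ref)

win, lose = 0, 1                                # any pair of responses to the same prompt
true_gap = r[win] - r[lose]
implicit_gap = implicit[win] - implicit[lose]   # log Z(x) cancels in the difference
offset = r - implicit                           # the same constant for every response
```

The `offset` array is constant and equal to beta * log Z(x): the implicit reward is wrong by a per-prompt shift, and a pairwise comparison never sees that shift.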

Likely follow-ups¶

"Derive the DPO loss from scratch." (Above.)
"When does DPO underperform PPO?" → on noisy preferences (reward model averages noise; DPO does not); on tasks where the reward is easily verified (math, code) PPO with a verifier reward beats DPO.
"What is IPO / KTO / ORPO?" → IPO fixes DPO's tendency to push pi/pi_ref to infinity; KTO uses one-sided (good/bad) labels; ORPO combines SFT and preference loss without reference. See X3.

Pit-of-failure¶

Knowing "DPO is direct" without being able to derive the loss. Signals: read the abstract. Will not pass an Anthropic interview.

Paper 4 — Hoffmann et al. 2022, Training Compute-Optimal Large Language Models (Chinchilla)¶

One-sentence claim¶

For a fixed compute budget, the large models of the time were dramatically undertrained: parameters and training tokens should scale equally (N ∝ D, both ∝ sqrt(C)), not parameters disproportionately as in earlier work (Kaplan 2020).

What to read¶

§1 (intro) — the framing: "given compute C, what (N, D) maximizes performance?"
§3 (approaches) — three independent methodologies, all converge.
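Approach 1: fix a set of model sizes N, vary the number of training tokens D — find the best (N, D) at each compute level.
Approach 2: fix the FLOP budget, vary N (D then follows from the budget) — find the loss-minimizing N per budget (the IsoFLOP profiles).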
Approach 3: fit a parametric loss function L(N, D) = E + A/N^alpha + B/D^beta.
Figure 3 (the iso-flop curves) — the headline figure of the paper.
§4 (Chinchilla) — the 70B model trained on 1.4T tokens. Beats Gopher (280B, 300B tokens) at ¼ the inference cost.
Table 4 (downstream benchmarks).

What to skip¶

The long benchmark tables in §4.
Most of the appendix.

Central equation¶

L(N, D) = E + A / N^alpha + B / D^beta

with fitted constants E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, alpha ≈ 0.34, beta ≈ 0.28.

The conclusion (numerical): D ≈ 20 × N is the compute-optimal token count per parameter.
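A small sketch that plugs the quoted constants into the parametric loss and applies the 20-tokens-per-parameter rule. It assumes the common approximation C ≈ 6 N D for training FLOPs, which is not stated above. The 20:1 ratio is applied here as a rule of thumb rather than derived by minimizing the fitted function. Function names are illustrative.

```python
import math

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28   # fitted constants quoted above

def loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def train_flops(n_params, n_tokens):
    """Assumed approximation: C is about 6 * N * D."""
    return 6 * n_params * n_tokens

def rule_of_thumb_allocation(flops, tokens_per_param=20):
    """Split a FLOP budget using D = tokens_per_param * N and C = 6 * N * D."""
    n_params = math.sqrt(flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

gopher = (280e9, 300e9)        # (parameters, training tokens)
chinchilla = (70e9, 1.4e12)

gopher_loss = loss(*gopher)
chinchilla_loss = loss(*chinchilla)
n_opt, d_opt = rule_of_thumb_allocation(train_flops(*chinchilla))
```

Under these assumptions the two models sit at a similar training budget (roughly 5e23 to 6e23 FLOPs), the fitted loss is lower for the smaller, longer-trained configuration, and feeding Chinchilla's budget back through the rule returns 70B parameters and 1.4T tokens.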

The 3 things to remember¶

Tokens per parameter ≈ 20 at compute-optimal. (Llama 3 went much further: about 15T tokens for both the 8B and 70B models, roughly 1,900 tokens per parameter at 8B. It shows you can keep training well past the compute-optimal point when inference cost dominates training cost.)
Gopher (280B params, 300B tokens) was wildly undertrained. Chinchilla (70B params, 1.4T tokens) outperformed it at 4x lower inference cost.
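The Kaplan 2020 prediction was off: the Chinchilla authors attribute this to small models and a learning-rate schedule that was not matched to the number of training tokens, which together over-predicted the importance of parameter count.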

Likely follow-ups¶

"What is the difference between Kaplan scaling and Chinchilla scaling?" → Kaplan said N should grow much faster than D with compute (roughly N ∝ C^0.73, D ∝ C^0.27); Chinchilla said they should grow equally (both ∝ C^0.5).
"Why does Llama 3 train far past compute-optimal?" → Chinchilla optimizes training cost. If you amortize a model over billions of inferences, over-training a small model is cheaper per query than under-training a big one.
"How would you design a scaling study for a new architecture?" → fit L(N, D) across at least 4 N's and 4 D's; check residuals; report alpha, beta.

Pit-of-failure¶

Quoting "20 tokens per parameter" as a universal law without understanding that it is training-compute optimal, not inference-cost optimal. Signals: didn't read past the abstract.
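The training-versus-inference distinction can be made concrete with a rough sketch. It assumes about 6 N FLOPs per training token and about 2 N FLOPs per generated token, a hypothetical lifetime volume of 1e13 served tokens, and it extrapolates the parametric fit far beyond the range it was fitted on. Read the outputs as an illustration of the trade-off, not as a prediction.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28   # Chinchilla parametric fit

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def tokens_to_reach(target_loss, n_params):
    """Invert L(N, D) for D; raises if the model can never reach the target."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        raise ValueError("model too small for this target loss")
    return (B / gap) ** (1 / BETA)

def lifetime_flops(n_params, n_tokens, inference_tokens):
    # Assumed costs: ~6*N FLOPs per training token, ~2*N per served token.
    return 6 * n_params * n_tokens + 2 * n_params * inference_tokens

target = loss(70e9, 1.4e12)                  # fitted loss of the 70B / 1.4T configuration
small_n = 15e9
small_d = tokens_to_reach(target, small_n)   # the smaller model needs far more tokens
served = 1e13                                # hypothetical lifetime inference volume

big_total = lifetime_flops(70e9, 1.4e12, served)
small_total = lifetime_flops(small_n, small_d, served)
```

Under these assumptions the 15B model needs several times more training tokens to match the 70B model's fitted loss, yet its lifetime FLOPs are lower once inference volume is large. That is the sense in which "20 tokens per parameter" optimizes training compute only.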

How to drill this file¶

Pick one of the four papers. Set a 20-minute timer.
Follow the protocol. Write the one-sentence claim, the central equation, and one weakness on a card.
Have a friend (or this file) fire follow-up questions. You have 60 seconds per follow-up.
Repeat with another paper every 2 days.

After all 4 papers: add one more paper from your work history per week. Maintain a paper-card folder (see ../lab/01-paper-pitch-cards.md).

→ Next: 04-coding-drills.md