
Lab 01 — 17 Paper Pitch Cards

17 elevator-pitch cards for the papers that any AI lab interview can touch on. Format: title | one-sentence summary | 3 numbers to remember. Memorize the three lines of each card.

Format

Title:        <Author Year — Paper Name>
Summary:      <One sentence: claim + method.>
3 numbers:    1. <number with units and context>
              2. <number with units and context>
              3. <number with units and context>

Three numbers per card is the rule. Three is enough to anchor the paper; more is unmemorable. Pick numbers that are load-bearing — that an interviewer might ask about.

Drill protocol

Write each card on a physical index card. Carry the deck.
Drill: shuffle, draw a card, recite the three lines aloud in ≤ 30 seconds.
Goal: 17 cards in 8 minutes, no errors.
Re-drill weekly. Add new cards as you read new papers.

Card 01 — Vaswani 2017, Attention Is All You Need

Summary. Replaces RNN/CNN encoder-decoder with pure attention; introduces multi-head scaled dot-product attention and positional encoding for sequence transduction (machine translation).

3 numbers. 1. 65M parameters in the base model (213M in the big model, which sets the headline results). 2. 3.5 days on 8 P100 GPUs to train the big model (12 hours for the base model), a fraction of the training cost of the best prior systems. 3. 28.4 BLEU on WMT 2014 EN→DE with the big model, more than 2.0 BLEU above the previous best results, ensembles included.

→ See theory/03-paper-read-drill.md Paper 1.
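The scaled dot-product attention named in the summary can be sketched in a few lines. This is a minimal single-head illustration in plain Python, with no masking, batching, or learned projections; the function names are illustrative and not taken from the paper's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One score per key, scaled by sqrt(d_k) as in the paper.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When all keys are identical, the weights are uniform and the output is the mean of the value vectors, which makes a quick sanity check.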

Card 02 — Devlin 2018, BERT

Summary. Bidirectional transformer encoder pre-trained with masked-language-modeling and next-sentence prediction, fine-tuned for downstream NLU tasks; established the pretrain-then-fine-tune paradigm.

3 numbers. 1. 340M parameters in BERT-Large. 2. 15% token masking rate during MLM pretraining. 3. +7.7 points on GLUE over the prior SOTA at the time of release.

Card 03 — Radford 2019, Language Models are Unsupervised Multitask Learners (GPT-2)

Summary. A 1.5B-parameter decoder-only LM trained on WebText shows surprising zero-shot performance on multiple NLP tasks without task-specific fine-tuning.

3 numbers. 1. 1.5B parameters (the largest variant; OpenAI initially withheld this checkpoint citing misuse risk). 2. 40 GB of WebText (filtered Reddit-linked content) for pretraining. 3. 7 of the 8 language modeling benchmarks tested reached zero-shot SOTA at release.

Card 04 — Brown 2020, Language Models are Few-Shot Learners (GPT-3)

Summary. A 175B-parameter dense LM exhibits in-context learning: it performs many tasks competitively after only a few examples in the prompt, with no weight updates.

3 numbers. 1. 175B parameters (10x more than any previous non-sparse LM). 2. 300B training tokens. 3. ~$4.6M rough training cost (public third-party estimates, not a figure from the paper).

Card 05 — Raffel 2019, T5

Summary. Reframes all NLP tasks as text-to-text; encoder-decoder transformer pretrained on the C4 corpus with a denoising objective, transferred to dozens of downstream tasks via uniform formatting.

3 numbers. 1. 11B parameters in the largest variant. 2. 750 GB of cleaned web text in the C4 corpus. 3. 40 NLP tasks unified under one text-to-text framework.

Card 06 — Radford 2021, CLIP

Summary. Contrastive image-text pretraining on 400M web pairs produces a vision encoder whose zero-shot transfer is competitive with supervised ImageNet-trained baselines across dozens of classification tasks.

3 numbers. 1. 400M image-text pairs (WIT dataset). 2. 76.2% zero-shot ImageNet top-1 with ViT-L/14 (at 336 px), matching the original supervised ResNet-50. 3. 27 datasets for zero-shot generalization eval.

→ See theory/03-paper-read-drill.md Paper 2.
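The contrastive objective on this card can be sketched as a symmetric cross-entropy over the image-text similarity matrix. The version below is a plain-Python illustration that assumes the embeddings are already L2-normalized and that matching pairs share an index; the fixed `temperature=0.07` is an assumption for the sketch (CLIP learns this value during training), and the function names are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against an integer target index."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i, and vice versa."""
    n = len(image_emb)
    # Pairwise similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image-to-text: each row should pick its own column.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick its own row.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Correctly paired embeddings give a loss near zero, while swapping the captions drives it up sharply, which is the signal the encoder trains on.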

Card 07 — Dosovitskiy 2020, An Image is Worth 16x16 Words (ViT)

Summary. Applies a standard transformer directly to image patches (16×16, treated as tokens) and matches or exceeds CNNs on image classification given sufficient pretraining data.

3 numbers. 1. 16×16 patch size on 224×224 images → 196 tokens per image. 2. 300M images in JFT-300M, the pretraining set that unlocks ViT's performance. 3. +0.5 to +2% over ResNet baselines at iso-FLOPs on ImageNet after large-scale pretraining.
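The 196-token figure is simple arithmetic, and it is worth being able to redo it for other resolutions on the spot. A small sketch, with an illustrative function name:

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


tokens_224 = num_patch_tokens(224, 16)  # 14 * 14 = 196
tokens_384 = num_patch_tokens(384, 16)  # 24 * 24 = 576
# ViT prepends one learnable [class] token, so the sequence length is +1.
sequence_length_224 = tokens_224 + 1
```

Because attention cost grows with the square of the token count, going from 224 px to 384 px roughly triples the tokens and increases the attention cost by close to 9x.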

Card 08 — Radford 2022, Whisper

Summary. Encoder-decoder transformer trained on 680k hours of multilingual labeled audio-text from the web; performs ASR and translation across 99 languages with zero-shot robustness.

3 numbers. 1. 680k hours of labeled audio (compared to LibriSpeech's 1k hours). 2. 99 languages covered. 3. 1.55B parameters in the large variant.

Card 09 — Ouyang 2022, InstructGPT

Summary. Fine-tunes GPT-3 with supervised demonstrations plus RLHF (reward model trained on human preference comparisons, PPO with KL penalty) to follow instructions better than the base model.

3 numbers. 1. 1.3B-parameter InstructGPT outputs are preferred by labelers over 175B-parameter GPT-3 outputs, despite about 100x fewer parameters; at equal size (175B vs. 175B), InstructGPT wins about 85% of comparisons. 2. About 40 labelers wrote the demonstration and preference data. 3. 3-stage pipeline: SFT → RM training → PPO RL.
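The two formulas behind stages 2 and 3 are short enough to write down. The sketch below shows the pairwise reward-model loss and the KL-penalized reward for scalar inputs; function names are illustrative, and `beta` is left as a required argument rather than asserting a particular value.

```python
import math


def pairwise_rm_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected), computed as a stable softplus."""
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def kl_penalized_reward(
    rm_score: float, logp_policy: float, logp_sft: float, beta: float
) -> float:
    """Reward the RL stage optimizes: RM score minus a penalty for drifting from SFT."""
    return rm_score - beta * (logp_policy - logp_sft)
```

The reward-model loss is low when the preferred response scores higher than the rejected one, and the penalty term grows when the policy assigns a response much more probability than the SFT model did.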

Card 10 — Hoffmann 2022, Chinchilla

Summary. For a fixed compute budget, parameters and training tokens should scale roughly equally (N ∝ D); previous models (GPT-3, Gopher) were dramatically undertrained relative to compute-optimality.

3 numbers. 1. 70B parameters trained on 1.4T tokens beats 280B Gopher on 300B tokens. 2. ~20 tokens per parameter is compute-optimal. 3. L(N, D) = E + A/N^0.34 + B/D^0.28 fitted parametric loss law.

→ See theory/03-paper-read-drill.md Paper 4.
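The numbers on this card fit together arithmetically, which a short sketch makes concrete. It assumes the common approximation C ≈ 6·N·D for training FLOPs and the ~20 tokens-per-parameter rule of thumb; the constants `E`, `A`, and `B` are the fitted values reported in the paper (only the exponents appear on the card), and the function names are illustrative.

```python
import math


def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens), assuming C = 6 * N * D and D = ratio * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def parametric_loss(
    n_params: float,
    n_tokens: float,
    E: float = 1.69,
    A: float = 406.4,
    B: float = 410.7,
    alpha: float = 0.34,
    beta: float = 0.28,
) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


# Chinchilla's own budget: 6 * 70e9 * 1.4e12 FLOPs recovers 70B params and 1.4T tokens.
n_opt, d_opt = compute_optimal(6.0 * 70e9 * 1.4e12)
chinchilla = parametric_loss(70e9, 1.4e12)
gopher = parametric_loss(280e9, 300e9)
```

Under this fit, the 70B model on 1.4T tokens lands at a lower predicted loss than the 280B model on 300B tokens, which is the card's first number restated as a formula.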

Card 11 — Rafailov 2023, Direct Preference Optimization (DPO)

Summary. The KL-regularized RLHF optimum has a closed form in policy space, so preference optimization reduces to a classification-style loss on the LM directly — no reward model, no PPO loop.

3 numbers. 1. Zero reward models required (vs. RLHF, which requires one). 2. β (typically 0.1-0.5) controls the strength of the KL constraint to the SFT reference. 3. Comparable or better win rates than RLHF on TL;DR summarization and Anthropic-HH at a fraction of the engineering cost.

→ See theory/03-paper-read-drill.md Paper 3.
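The "classification-style loss" in the summary is a logistic loss on the difference of log-probability ratios between the policy and the frozen reference. A scalar sketch for one preference pair follows; the inputs are summed sequence log-probabilities, the function name is illustrative, and `beta=0.1` is just the low end of the range on the card.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Stable form of -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.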

Card 12 — Hu 2021, LoRA

Summary. Freezes the pretrained weights and injects a trainable low-rank update BA (rank r << min(d, k)) alongside selected weight matrices (the attention projections in the paper); fine-tunes large LMs with orders of magnitude fewer trainable parameters at negligible quality loss.

3 numbers. 1. r = 8 or 16 is the typical rank for LLM fine-tuning. 2. Up to 10,000x fewer trainable parameters than full fine-tuning (reported for GPT-3 175B), with about 3x less GPU memory. 3. 0 latency overhead at inference because BA can be merged into W.
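Both the parameter saving and the zero-latency merge follow from the shapes involved. The sketch below uses plain nested lists; the function names are illustrative, and the `scale` argument stands in for the paper's alpha/r scaling factor. For a single d × k matrix the saving is d·k / (r·(d + k)); the overall ratio for a model is larger when only some matrices are adapted.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


def merge_lora(w, b, a, scale: float = 1.0):
    """Fold the low-rank update into the frozen weight: W' = W + scale * B A."""
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# One 4096 x 4096 projection at rank 8: 16,777,216 frozen vs 65,536 trainable.
per_matrix_ratio = (4096 * 4096) // lora_trainable_params(4096, 4096, 8)
```

After merging, inference uses a single matrix of the original shape, which is why the adapter adds no latency.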

Card 13 — Dao 2022, FlashAttention

Summary. Tiles the attention computation to avoid materializing the n × n attention matrix in HBM; exact (not approximate); recomputes on backward to save memory.

3 numbers. 1. 2-4x training speedup on transformers at typical context lengths. 2. O(n) memory instead of O(n²). 3. 192 KB SRAM per SM (A100), the tile-size constraint.
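The reason tiling can be exact is the online softmax: a running maximum and running denominator let each tile be folded in without ever holding the full row of scores. The sketch below shows that idea for a single query in plain Python. It illustrates the numerics only, not the GPU kernel, and the function names and the `tile` parameter are illustrative.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query, materializing all scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def attention_tiled(q, keys, values, tile=2):
    """Same result, processing keys and values one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

The tiled version keeps only one tile of scores plus a fixed-size accumulator, which is the O(n) memory claim on the card in miniature.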

Card 14 — Kwon 2023, Efficient Memory Management for LLM Serving with PagedAttention (vLLM)

Summary. Manages the KV cache like virtual memory with fixed-size pages, sharing pages across requests with common prefixes and reducing fragmentation — enabling much higher batch throughput in production serving.

3 numbers. 1. 2-4x throughput improvement over FasterTransformer and Orca at the same latency. 2. 16-token typical block (page) size. 3. <4% KV cache memory waste vs. 60-80% in prior systems that preallocate contiguous memory.
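The virtual-memory analogy can be made concrete with a toy block table. The class below is a hypothetical illustration, not vLLM's API: it hands out fixed-size blocks on demand and tracks, per sequence, which physical blocks hold its tokens. Prefix sharing and freeing are left out to keep the sketch short.

```python
class PagedKVCache:
    """Toy paged allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def wasted_slots(self, seq_id: str) -> int:
        """Unused slots in the sequence's last block."""
        allocated = len(self.block_tables[seq_id]) * self.block_size
        return allocated - self.lengths[seq_id]
```

Waste per sequence is bounded by `block_size - 1` slots, whereas reserving a contiguous region for the maximum possible length wastes everything the sequence never generates.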

Card 15 — Jiang 2023, Mistral 7B

Summary. A 7B-parameter LM that outperforms Llama 2 13B at iso-evals by using sliding-window attention, grouped-query attention, and careful data quality choices.

3 numbers. 1. 7.3B parameters. 2. 4096-token sliding window for attention (stacked over 32 layers, the theoretical attention span reaches about 131K tokens). 3. 8 KV heads under GQA (vs. 32 query heads).
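Two pieces of arithmetic sit behind numbers 2 and 3: how far information can propagate through stacked sliding windows, and how much KV cache GQA saves. The sketch assumes 32 layers, a head dimension of 128, and fp16 storage (2 bytes per value); these come from the Mistral 7B paper's configuration rather than from the card, and the function names are illustrative.

```python
def theoretical_span(window: int, num_layers: int) -> int:
    """Upper bound on how far back a token can be influenced through stacked windows."""
    return window * num_layers


def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """KV cache size for one token: keys and values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


span = theoretical_span(4096, 32)  # 131,072 tokens
gqa_bytes = kv_bytes_per_token(32, 8, 128)  # 8 KV heads
mha_bytes = kv_bytes_per_token(32, 32, 128)  # if every query head had its own KV
```

With 8 KV heads instead of 32, the cache per token is 4x smaller, and the sliding window additionally caps the cache length at 4096 entries per layer.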

Card 16 — Touvron 2023, Llama 2

Summary. Open-weight LM family (7B / 13B / 70B); trained on 2T tokens; chat variants fine-tuned with SFT + RLHF; established the open-weight production baseline through 2024.

3 numbers. 1. 70B parameters in the largest variant. 2. 2T training tokens (~30 tokens per param — past Chinchilla optimal, optimized for inference). 3. GQA-8 (groups of 8 query heads sharing KV) in the 70B variant.

Card 17 — Dubey 2024, Llama 3

Summary. Open-weight LM family pushed beyond Chinchilla scaling — trained on 15T tokens to optimize inference economics over training-compute optimality; introduces 405B dense variant rivaling closed frontier models.

3 numbers. 1. 15T training tokens (vs. Chinchilla-optimal ~1.4T for 70B parameters). 2. 405B parameters in the flagship variant. 3. 128k tokenizer vocabulary (vs. Llama 2's 32k), reducing tokens-per-character on non-English.

How to drill

Round 1 (week 1). Card by card; read each aloud 3x. Then close the deck and recite from memory.
Round 2 (week 2). Shuffle. Draw, recite, verify. Interim goal: 17 cards in ≤ 10 minutes, working toward the 8-minute target in the drill protocol.
Round 3 (week 3). Add the "expand" drill: for each card, after the 3 lines, add 30 seconds of unscripted depth ("the thing about CLIP is that scale, not the contrastive objective, was the real innovation"). This is the actual interview shape.
Maintenance. Once per week thereafter. Add 1-2 new cards per month from your reading.

Customization

Add company-specific cards. For an Anthropic interview, add Bai 2022 (Constitutional AI), Anthropic's Sleeper Agents paper, and Scaling Monosemanticity.
Add domain-specific cards. For a robotics role, add RT-2, OpenVLA. For multimodal, add Flamingo, BLIP-2, LLaVA.
Drop cards once mastered and add new ones; the working deck should be at the edge of your competence.

This completes Module X5. Return to ../README.md for the module map, or to ROADMAP.md (in the repo root) for the broader curriculum.