Lab 01 — 17 Paper Pitch Cards
17 elevator-pitch cards for the papers that any interview at an AI lab may touch on. Format: title | one-sentence summary | 3 numbers to remember. Memorize the three lines of each card.
Format
Title: <Author Year — Paper Name>
Summary: <One sentence: claim + method.>
3 numbers: 1. <number with units and context>
2. <number with units and context>
3. <number with units and context>
Three numbers per card is the rule. Three is enough to anchor the paper; more is unmemorable. Pick numbers that are load-bearing — that an interviewer might ask about.
Drill protocol
- Write each card on a physical index card. Carry the deck.
- Drill: shuffle, draw a card, recite the three lines aloud in ≤ 30 seconds.
- End goal: 17 cards in 8 minutes (about 28 seconds per card), no errors.
- Re-drill weekly. Add new cards as you read new papers.
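If you also keep the deck in digital form, the card format and the shuffle-and-draw step can be sketched in a few lines of Python. This is only an illustration: `PitchCard`, `drill_order`, and `seconds_per_card` are hypothetical names, and the two sample cards simply reuse lines from this deck.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PitchCard:
    title: str
    summary: str
    numbers: Tuple[str, str, str]  # exactly three, per the deck rule


def drill_order(deck: List[PitchCard], seed: Optional[int] = None) -> List[PitchCard]:
    """Return a shuffled copy of the deck; the original list is left untouched."""
    rng = random.Random(seed)
    order = list(deck)
    rng.shuffle(order)
    return order


def seconds_per_card(n_cards: int, total_minutes: float) -> float:
    """Time available per card when the whole deck has a time goal."""
    return total_minutes * 60.0 / n_cards


deck = [
    PitchCard(
        title="Devlin 2018, BERT",
        summary="Bidirectional transformer encoder pre-trained with MLM and NSP.",
        numbers=("340M parameters in BERT-Large", "15% token masking rate", "+7.7 points on GLUE"),
    ),
    PitchCard(
        title="Hoffmann 2022, Chinchilla",
        summary="For fixed compute, parameters and tokens should scale roughly equally.",
        numbers=("70B parameters on 1.4T tokens", "~20 tokens per parameter", "L(N, D) = E + A/N^0.34 + B/D^0.28"),
    ),
]

for card in drill_order(deck, seed=0):
    print(card.title)

print(round(seconds_per_card(17, 8), 1))  # 28.2
```

The last line is the useful check: 17 cards at a full 30 seconds each would take 8.5 minutes, so an 8-minute run needs roughly 28 seconds per card.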
Card 01 — Vaswani 2017, Attention Is All You Need
Summary. Replaces the recurrent and convolutional layers of encoder-decoder models with pure attention; introduces multi-head scaled dot-product attention and positional encoding for sequence transduction (machine translation).
3 numbers.
1. 65M parameters in the base model; 213M in the big model that produced the headline results.
2. 3.5 days on 8 P100 GPUs to train the big model (about 12 hours for the base model), a fraction of the training cost of prior SOTA systems.
3. 28.4 BLEU on WMT 2014 EN→DE (big model), more than 2.0 BLEU above the previous best, ensembles included.
→ See theory/03-paper-read-drill.md Paper 1.
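For the "expand" part of this card, it helps to be able to write scaled dot-product attention from memory. The following is a minimal single-head sketch in plain Python, without masking, batching, or the multi-head projections; the function names are illustrative, not taken from any library.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# Two identical keys get equal weight, so the output is the mean of the values.
out = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(out)  # [[2.0, 4.0]]
```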
Card 02 — Devlin 2018, BERT
Summary. Bidirectional transformer encoder pre-trained with masked-language-modeling and next-sentence prediction, fine-tuned for downstream NLU tasks; established the pretrain-then-fine-tune paradigm.
3 numbers.
1. 340M parameters in BERT-Large.
2. 15% token masking rate during MLM pretraining.
3. +7.7 points on GLUE over the prior SOTA at the time of release.
Card 03 — Radford 2019, Language Models are Unsupervised Multitask Learners (GPT-2)
Summary. A 1.5B-parameter decoder-only LM trained on WebText shows surprising zero-shot performance on multiple NLP tasks without task-specific fine-tuning.
3 numbers.
1. 1.5B parameters (the largest variant; OpenAI initially withheld this checkpoint citing misuse risk).
2. 40 GB of WebText (pages linked from Reddit posts with at least 3 karma) for pretraining.
3. 7 of 8 tested language modeling datasets set as zero-shot SOTA at release.
Card 04 — Brown 2020, Language Models are Few-Shot Learners (GPT-3)
Summary. A 175B-parameter dense LM exhibits in-context learning: it performs many tasks competitively after only a few examples in the prompt, with no weight updates.
3 numbers.
1. 175B parameters (10x more than any previous non-sparse LM).
2. 300B training tokens.
3. ~$4.6M rough training cost (a third-party public estimate, not a figure from the paper).
Card 05 — Raffel 2019, T5
Summary. Reframes all NLP tasks as text-to-text; encoder-decoder transformer pretrained on the C4 corpus with a denoising objective, transferred to dozens of downstream tasks via uniform formatting.
3 numbers.
1. 11B parameters in the largest variant.
2. 750 GB of cleaned web text in the C4 corpus.
3. 40 NLP tasks unified under one text-to-text framework.
Card 06 — Radford 2021, CLIP
Summary. Contrastive image-text pretraining on 400M web pairs produces a vision encoder with zero-shot transfer competitive with supervised ImageNet across dozens of classification tasks.
3 numbers.
1. 400M image-text pairs (the WIT, WebImageText, dataset).
2. 76.2% zero-shot ImageNet top-1 with ViT-L/14, roughly matching the original supervised ResNet-50.
3. 27 datasets for zero-shot generalization eval.
→ See theory/03-paper-read-drill.md Paper 2.
Card 07 — Dosovitskiy 2020, An Image is Worth 16x16 Words (ViT)
Summary. Applies a standard transformer directly to image patches (16×16, treated as tokens) and matches or exceeds CNNs on image classification given sufficient pretraining data.
3 numbers.
1. 16×16 patch size on 224×224 images → 196 tokens per image.
2. 300M images in JFT-300M, the pretraining set that unlocks ViT's performance.
3. +0.5 to +2% over ResNet baselines at iso-FLOPs on ImageNet after large-scale pretraining.
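The 196-token figure is plain arithmetic and worth being able to reproduce on a whiteboard. A small sketch (function names are illustrative; the 3-channel RGB input is an assumption for the flattened patch length):

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def flattened_patch_length(patch_size: int, channels: int = 3) -> int:
    """Length of one patch after flattening, before the linear projection."""
    return patch_size * patch_size * channels


print(num_patches(224, 16))        # 196 patch tokens (14 x 14)
print(flattened_patch_length(16))  # 768 values per RGB patch
```

ViT also prepends a learnable class token, so the encoder sees 197 positions for a 224×224 input.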
Card 08 — Radford 2022, Whisper
Summary. Encoder-decoder transformer trained on 680k hours of multilingual labeled audio-text from the web; performs ASR and translation across 99 languages with zero-shot robustness.
3 numbers.
1. 680k hours of labeled audio (compared to LibriSpeech's 1k hours).
2. 99 languages covered.
3. 1.55B parameters in the large variant.
Card 09 — Ouyang 2022, InstructGPT
Summary. Fine-tunes GPT-3 with supervised demonstrations + RLHF (reward model trained on human preference comparisons, PPO with KL penalty) to follow instructions better than the base model.
3 numbers.
1. 1.3B-parameter InstructGPT outputs preferred by labelers over 175B-parameter GPT-3 outputs (100x fewer parameters); at equal size, 175B InstructGPT is preferred over 175B GPT-3 about 85% of the time.
2. About 40 contracted labelers wrote the demonstration and preference data.
3. 3-stage pipeline: SFT → RM training → PPO RL.
Card 10 — Hoffmann 2022, Chinchilla
Summary. For a fixed compute budget, parameters and training tokens should scale roughly equally (N ∝ D); previous models (GPT-3, Gopher) were dramatically undertrained relative to compute-optimality.
3 numbers.
1. 70B parameters trained on 1.4T tokens beats 280B Gopher on 300B tokens.
2. ~20 tokens per parameter is compute-optimal.
3. L(N, D) = E + A/N^0.34 + B/D^0.28 fitted parametric loss law.
→ See theory/03-paper-read-drill.md Paper 4.
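The two Chinchilla rules of thumb can be turned into a back-of-the-envelope calculator. The sketch below assumes the common approximation C ≈ 6·N·D for training FLOPs and the ~20 tokens-per-parameter ratio from the card. The default constants E, A, B in `parametric_loss` are the fitted values as commonly quoted from the paper; treat them as an assumption to check against the paper before relying on them.

```python
import math
from typing import Tuple


def compute_optimal(flops: float, tokens_per_param: float = 20.0) -> Tuple[float, float]:
    """Split a FLOP budget into (params N, tokens D) with C = 6*N*D and D = ratio*N."""
    n = math.sqrt(flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n


def parametric_loss(
    n: float,
    d: float,
    e: float = 1.69,      # irreducible loss (assumed value, see lead-in)
    a: float = 406.4,     # assumed value
    b: float = 410.7,     # assumed value
    alpha: float = 0.34,  # exponent from the card
    beta: float = 0.28,   # exponent from the card
) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return e + a / n**alpha + b / d**beta


budget = 6.0 * 70e9 * 1.4e12           # FLOPs for 70B params on 1.4T tokens
n_opt, d_opt = compute_optimal(budget)
print(f"{n_opt:.3g} params, {d_opt:.3g} tokens")

chinchilla = parametric_loss(70e9, 1.4e12)
gopher = parametric_loss(280e9, 300e9)
print(chinchilla < gopher)  # True: the smaller, longer-trained model has lower predicted loss
```

Under these assumptions the 70B/1.4T configuration is exactly the optimum for its own budget, and the fitted law predicts a lower loss for it than for the 280B/300B Gopher configuration.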
Card 11 — Rafailov 2023, Direct Preference Optimization (DPO)
Summary. The KL-regularized RLHF optimum has a closed form in policy space, so preference optimization reduces to a classification-style loss on the LM directly: no reward model, no PPO loop.
3 numbers.
1. Zero reward models required (vs. RLHF, which requires one).
2. β (typically 0.1-0.5) controls the strength of the KL constraint to the SFT reference.
3. Comparable or better win rates than RLHF on TL;DR summarization and Anthropic-HH at a fraction of the engineering cost.
→ See theory/03-paper-read-drill.md Paper 3.
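The "classification-style loss" is short enough to write out. The sketch below computes the DPO loss for a single preference pair from four sequence log-probabilities (policy and frozen reference, chosen and rejected). It is a scalar illustration in plain Python, not a training loop; the function and argument names are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy raises the chosen response and lowers the rejected one: loss drops below log(2).
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

A larger β makes the same log-ratio gap count for more, which is the sense in which β sets how tightly the policy is held to the reference.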
Card 12 — Hu 2021, LoRA
Summary. Freezes the pretrained weights and injects a trainable low-rank update B A (rank r << min(d, k)) into the adapted linear layers (the paper adapts the attention projections); fine-tunes large LMs with orders of magnitude fewer trainable parameters at negligible quality loss.
3 numbers.
1. r = 8 or 16 is the typical rank for LLM fine-tuning.
2. Up to 10,000x fewer trainable parameters than full fine-tuning (the paper's figure for GPT-3 175B), with about 3x lower GPU memory; the ratio depends on the rank and on which matrices are adapted.
3. 0 latency overhead at inference because B A can be merged into W.
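Both the parameter saving and the zero-latency merge can be checked with a few lines of arithmetic. The sketch below uses plain Python lists as matrices; the helper names are illustrative, and the `scale` argument stands in for LoRA's α/r scaling factor (set to 1.0 here for simplicity).

```python
from typing import List

Matrix = List[List[float]]


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """B is d x r and A is r x k, so the adapter adds r * (d + k) parameters."""
    return r * (d + k)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge(w: Matrix, b_mat: Matrix, a_mat: Matrix, scale: float = 1.0) -> Matrix:
    """Fold the low-rank update into the frozen weight: W' = W + scale * B A."""
    delta = matmul(b_mat, a_mat)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# One 4096 x 4096 projection at rank 8: 65,536 adapter params vs 16,777,216 frozen.
print(4096 * 4096 // lora_trainable_params(4096, 4096, 8))  # 256

# Merged and unmerged forward passes agree, so inference needs no extra matmul.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]   # d x r with r = 1
A = [[3.0, 4.0]]     # r x k
x = [[1.0], [1.0]]   # column vector
unmerged = [[wx[0] + bax[0]] for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))]
merged = matmul(merge(W, B, A), x)
print(unmerged == merged)  # True
```

The 256x figure is for a single square matrix; the whole-model ratio is larger when only some matrices get adapters, since every unadapted weight counts toward full fine-tuning but not toward LoRA.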
Card 13 — Dao 2022, FlashAttention
Summary. Tiles the attention computation to avoid materializing the n × n attention matrix in HBM; exact (not approximate); recomputes on backward to save memory.
3 numbers.
1. 2-4x training speedup on transformers at typical context lengths.
2. O(n) extra memory in sequence length instead of O(n²).
3. 192 KB of on-chip SRAM per SM on an A100, which bounds the tile size.
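The reason tiling can be exact is the online-softmax trick: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them whenever a new tile raises the maximum. The sketch below shows this for one query row in plain Python and compares it with the untiled computation. It illustrates the numerical idea only; it says nothing about the GPU kernel, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_row_reference(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Untiled attention for one query: needs all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]


def attention_row_tiled(q: Vector, keys: List[Vector], values: List[Vector],
                        tile: int = 2) -> Vector:
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [_dot(q, k) * scale for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (exp(-inf) is 0.0).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.3, -1.2]
keys = [[1.0, 0.5], [-0.7, 2.0], [0.0, 0.1], [3.0, -1.0], [0.4, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
ref = attention_row_reference(q, keys, values)
tiled = attention_row_tiled(q, keys, values, tile=2)
print(max(abs(a - b) for a, b in zip(ref, tiled)) < 1e-12)
```

Because the rescaling is algebraically exact, the tiled result matches the reference up to floating-point rounding, which is what "exact, not approximate" refers to.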
Card 14 — Kwon 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)
Summary. Manages the KV cache like virtual memory with fixed-size pages, sharing pages across requests with common prefixes and reducing fragmentation, which enables much higher batch throughput in production serving.
3 numbers.
1. 2-4x throughput improvement over FasterTransformer and Orca at the same level of latency.
2. 16-token typical block (page) size.
3. <4% of KV cache memory wasted, vs. 60-80% lost to fragmentation and over-reservation in prior systems.
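The paging idea can be sketched as a block table per request plus a shared free list, with blocks handed out only when a request actually needs another one. The class below is a toy model for intuition only: `PagedKVCache` and its methods are hypothetical names, not vLLM's API, and prefix sharing with copy-on-write is left out.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per block, matching the typical size on the card


class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # request -> physical block ids
        self.lengths: Dict[str, int] = {}             # request -> tokens stored

    def append_token(self, request_id: str) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def wasted_slots(self, request_id: str) -> int:
        """Allocated but unused slots: always less than one block per request."""
        return len(self.block_tables[request_id]) * self.block_size - self.lengths[request_id]


cache = PagedKVCache(num_blocks=64)
for _ in range(100):
    cache.append_token("req-a")

print(len(cache.block_tables["req-a"]))  # 7 blocks for 100 tokens
print(cache.wasted_slots("req-a"))       # 12 unused slots
```

Compare this with reserving a contiguous region for the maximum sequence length up front: a 100-token request in a 2048-slot reservation leaves 1948 slots idle, whereas paging wastes at most one partially filled block.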
Card 15 — Jiang 2023, Mistral 7B
Summary. A 7B-parameter LM that outperforms Llama 2 13B across the benchmarks reported in the paper by using sliding-window attention, grouped-query attention, and careful data quality choices.
3 numbers.
1. 7.3B parameters.
2. 4096-token sliding window for attention (stacked layers extend the theoretical attention span to roughly 131k tokens).
3. 8 KV heads under GQA (vs. 32 query heads).
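Two of these numbers translate directly into serving arithmetic: GQA shrinks the KV cache, and stacked sliding windows widen the reachable context. The sketch below assumes 32 layers and a head dimension of 128 for Mistral 7B (neither value appears on the card, so treat them as assumptions) and fp16 storage at 2 bytes per value; the function names are illustrative.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Keys and values (factor 2) for every layer and KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def theoretical_attention_span(n_layers: int, window: int) -> int:
    """Each layer can pass information back one more window."""
    return n_layers * window


N_LAYERS = 32   # assumed for Mistral 7B
HEAD_DIM = 128  # assumed for Mistral 7B

gqa = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)
mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=32, head_dim=HEAD_DIM)
print(gqa, mha, mha // gqa)  # 131072 524288 4

print(theoretical_attention_span(N_LAYERS, window=4096))  # 131072
```

Under these assumptions, 8 KV heads instead of 32 cuts the cache from 512 KiB to 128 KiB per token, and a 4096-token window bounds how many positions each layer has to keep.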
Card 16 — Touvron 2023, Llama 2
Summary. Open-weight LM family (7B / 13B / 70B); trained on 2T tokens; chat variants fine-tuned with SFT + RLHF; established the open-weight production baseline through 2024.
3 numbers.
1. 70B parameters in the largest variant.
2. 2T training tokens (~30 tokens per parameter for the 70B model, past the Chinchilla-optimal ~20; the smaller models go much further past it, trading training compute for cheaper inference).
3. GQA with 8 KV heads shared across 64 query heads (8 query heads per KV head) in the 70B variant.
Card 17 — Dubey 2024, Llama 3
Summary. Open-weight LM family pushed beyond Chinchilla scaling: trained on 15T tokens to optimize inference economics over training-compute optimality; introduces a 405B dense variant rivaling closed frontier models.
3 numbers.
1. 15T training tokens (vs. Chinchilla-optimal ~1.4T for 70B parameters).
2. 405B parameters in the flagship variant.
3. 128k tokenizer vocabulary (vs. Llama 2's 32k), reducing tokens-per-character on non-English.
How to drill
- Round 1 (week 1). Card by card; read each aloud 3x. Then close the deck and recite from memory.
- Round 2 (week 2). Shuffle. Draw, recite, verify. Goal: 17 cards in ≤ 10 minutes.
- Round 3 (week 3). Add the "expand" drill: for each card, after the 3 lines, add 30 seconds of unscripted depth ("the thing about CLIP is that scale, not the contrastive objective, was the real innovation"). This is the actual interview shape.
- Maintenance. Once per week thereafter. Add 1-2 new cards per month from your reading.
Customization
- Add company-specific cards. For an Anthropic interview, add Bai 2022 (Constitutional AI), Anthropic's Sleeper Agents paper, and Scaling Monosemanticity.
- Add domain-specific cards. For a robotics role, add RT-2, OpenVLA. For multimodal, add Flamingo, BLIP-2, LLaVA.
- Drop cards once mastered and add new ones; the working deck should be at the edge of your competence.
This completes Module X5. Return to ../README.md for the module map, or to ROADMAP.md (in the repo root) for the broader curriculum.