
01 — 25 Whiteboard ML Questions with Depth Trees

25 classic ML whiteboard questions, each with a 3-paragraph answer, a 3-level follow-up tree, and the "trap" (the wrong answer that reveals a lack of depth).

The "depth signal"

Top labs do not grade breadth — every applicant knows the buzzwords. They grade depth: can you go three levels deeper than the surface answer without flinching? Each question below carries a 3-level follow-up tree showing how an interviewer will push.

The pit-of-failure annotation is what your wrong answer reveals about you — usually that you have read about the topic but not implemented it.

Q1 — Explain attention.

Model answer (3 paragraphs).

Attention is a content-based lookup. Given a query vector q, a set of key vectors K, and a set of value vectors V, attention computes a weighted average of V where the weights are similarities between q and each row of K. Concretely: softmax(q K^T / sqrt(d_k)) V. The sqrt(d_k) scaling exists because the dot product of two d_k-dim vectors with unit-variance entries has variance d_k, so without scaling the softmax saturates and gradients vanish.
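The formula and the variance argument can be checked directly. The following is a minimal pure-Python sketch, not a production kernel; the helper names `softmax`, `attention` and `dot_variance` are illustrative choices, and the Gaussian inputs are an assumption made only to get unit-variance entries.

```python
import math
import random

def softmax(row):
    m = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, K, V):
    """Single-query attention: softmax(q K^T / sqrt(d_k)) V."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    return out, weights

def dot_variance(d_k, trials=4000, seed=0):
    """Empirical variance of a . b for unit-variance Gaussian vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(d_k)]
        b = [rng.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(x * y for x, y in zip(a, b)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

# Two identical keys give uniform weights, so the output is the mean of V.
out, weights = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
var_64 = dot_variance(64)  # should land near d_k = 64
```

Dividing the scores by sqrt(d_k) divides this variance by d_k, which brings it back to roughly 1 regardless of head dimension.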

Self-attention is the case where Q, K, V are linear projections of the same input. Multi-head attention splits the embedding dimension into h heads, runs attention in parallel per head, and concatenates. The motivation is that one softmax mixes positions in one way; multiple heads let the layer attend to multiple relations (e.g. syntactic head, semantic head).

The cost is O(n^2 d) in compute and O(n^2) in memory per head (the attention matrix). This is the dominant bottleneck for long context; FlashAttention removes the n^2 memory cost by tiling and recomputing on the backward pass.

Depth tree.

- L1: Why sqrt(d_k)? (variance argument above)
- L2: Why does softmax saturation kill gradients? (the softmax Jacobian diag(p) - p p^T goes to zero as p approaches one-hot)
- L3: What is the exact gradient of softmax through the attention matrix? (for each row p of P with upstream gradient g = dL/dp, the gradient on the scores is dS = p ⊙ (g - (p · g)), i.e. dS = P ⊙ (dP - rowsum(P ⊙ dP)); see drill 01)

Pit-of-failure. Saying "attention is like a database lookup" without being able to write the formula. Signals: you read the blog post, never coded it.

→ Phase 15 of the core curriculum.

Q2 — Why does layer normalization help training?

Model answer.

LayerNorm normalizes each token's activation vector to zero mean, unit variance across the feature dimension, then applies a learned affine gamma * x_hat + beta, where x_hat is the normalized vector. Empirically it stabilizes training in transformers, where batch sizes can be small and sequences are heterogeneous.
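The forward pass described above fits in a few lines. This is a sketch for a single token vector; the `eps` default of 1e-5 and the use of the biased (1/d) variance are assumptions that match common framework defaults rather than anything stated here.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then apply the learned affine."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased (1/d) variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

With gamma = 1 and beta = 0 the output has mean 0 and variance just under 1 (the small gap comes from `eps`).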

The original framing, introduced for BatchNorm (Ioffe & Szegedy 2015) and carried over to LayerNorm (Ba et al. 2016), was "reduce covariate shift across layers". Santurkar et al. 2018 argued that this is not what BatchNorm does, and that the real benefit is a smoother loss landscape (smaller Lipschitz constants of the loss and its gradients). The same argument plausibly carries over to LayerNorm, although that paper studied BatchNorm.

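In transformers, pre-LN (normalize at the input of each sub-block, inside the residual branch) trains more stably than post-LN at large depth because the identity path through the residual stream is never normalized, so gradients reach early layers with well-behaved norms (Xiong et al. 2020). Pre-LN is what GPT-2 and most later decoder-only LLMs use.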

Depth tree.

- L1: Pre-LN vs post-LN: which one? (pre-LN; gradient norms stay bounded at depth)
- L2: Why does RMSNorm work as well as LayerNorm? (empirically the mean-subtraction is not critical; variance scaling is)
- L3: What is the backward pass for LayerNorm? (involves the mean and variance gradients; see drill 08)

Pit-of-failure. "It normalizes activations." Why? When? Forward formula? Backward? — silence.

→ Phase 10, drill 08.

Q3 — What is the bias-variance tradeoff in modern overparameterized models?

Model answer.

The classical bias-variance curve predicts a U-shape: too-simple models underfit (high bias), too-complex models overfit (high variance). Modern overparameterized neural networks violate this — test error often decreases monotonically with parameter count beyond the interpolation threshold. This is the "double descent" phenomenon (Belkin et al. 2019, Nakkiran et al. 2020).

The mechanism: among the many solutions that interpolate the training data, SGD has an implicit bias toward low-norm solutions, and low-norm solutions generalize. So "more parameters" does not mean "more variance" in the classical sense — it means more room for SGD to find a flat minimum.

For practical purposes: in the regime where modern LMs operate (1B+ parameters, trillions of tokens), the right framing is not bias-variance but compute-optimality (Chinchilla). The question of "is my model too big" is answered by "are you compute-optimal given your token budget".

Depth tree.

- L1: What is double descent? (epoch-wise vs model-wise, both observed)
- L2: What does "implicit regularization of SGD" mean precisely? (it finds low-norm / flat solutions among interpolators)
- L3: How does Chinchilla relate compute to parameters? (N ∝ D, both ∝ sqrt(C); see paper-read 04)

Pit-of-failure. Quoting the classical U-curve as if it were universal. Signals: pre-2019 mental model.

Q4 — Explain backpropagation.

Model answer.

Backpropagation is the chain rule applied to a computational graph. Forward pass: evaluate each node, store intermediate activations. Backward pass: for each node from output to input, multiply incoming gradient by the local Jacobian of that node. The "trick" is that you store activations on the forward pass so the backward pass is O(forward cost), not O(forward cost * depth).

For a scalar loss L and a parameter w in some layer: dL/dw = dL/dy * dy/dw, where dL/dy is the gradient flowing into the layer (computed by recursion) and dy/dw is the local Jacobian. The matrix-multiply form: if y = W x, then dL/dW = (dL/dy) x^T and dL/dx = W^T (dL/dy).
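The two matrix identities are easy to verify with a finite-difference check. The sketch below assumes a toy loss L = 0.5 * ||W x||^2 (so that dL/dy = y); the loss and the helper names are chosen only for illustration.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def loss(W, x):
    # Toy scalar loss L = 0.5 * ||W x||^2, so dL/dy = y
    y = matvec(W, x)
    return 0.5 * sum(v * v for v in y)

def linear_backward(W, x, dy):
    dW = [[g * v for v in x] for g in dy]  # outer product (dL/dy) x^T
    dx = [sum(W[i][j] * dy[i] for i in range(len(W)))  # W^T (dL/dy)
          for j in range(len(x))]
    return dW, dx

def numeric_dW(W, x, i, j, h=1e-6):
    """Central finite difference of the loss with respect to W[i][j]."""
    Wp = [row[:] for row in W]
    Wm = [row[:] for row in W]
    Wp[i][j] += h
    Wm[i][j] -= h
    return (loss(Wp, x) - loss(Wm, x)) / (2 * h)

W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
x = [1.0, 2.0, -1.0]
dy = matvec(W, x)  # upstream gradient for the toy loss
dW, dx = linear_backward(W, x, dy)
max_err = max(abs(dW[i][j] - numeric_dW(W, x, i, j))
              for i in range(2) for j in range(3))
```

A check like this is the usual first test when writing a layer's backward pass by hand: if `max_err` is not tiny, the analytic gradient is wrong.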

Crucially, backprop is not automatic differentiation in general; it is reverse-mode AD. Forward-mode AD needs one pass per input dimension to get a full gradient; reverse-mode needs one pass per output dimension. For ML we have a scalar loss (output dim = 1) and millions of parameters, so reverse-mode wins by a factor of n_params.

Depth tree.

- L1: Why reverse-mode and not forward-mode? (scalar loss, billions of params)
- L2: What does activation checkpointing trade off? (memory vs recompute on backward)
- L3: How does gradient checkpointing interact with mixed precision? (stored activations in fp16/bf16, recompute in same dtype; see drill 03)

Pit-of-failure. "It propagates errors backward" without being able to write dL/dW = dL/dy x^T. Signals: no autograd implementation experience.

→ Phase 07 (scalar autograd), Phase 08 (tensor autograd).

Q5 — Why does Adam outperform SGD for transformers?

Model answer.

Adam maintains per-parameter running averages of the gradient (m_t) and squared gradient (v_t), and updates theta -= lr * m_t / (sqrt(v_t) + eps), with m_t and v_t bias-corrected in the standard formulation. The 1/sqrt(v_t) term makes the effective learning rate per-parameter: parameters whose gradients are consistently large take proportionally smaller steps, and parameters whose gradients are small or infrequent take proportionally larger ones.
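A single Adam step on a list of scalars shows the per-parameter scaling. This is a sketch: the hyperparameter defaults and the bias-correction terms follow the standard published formulation, and the function name `adam_step` is illustrative.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of scalars; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g      # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)           # bias correction for zero init
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta = [0.0, 0.0]
grad = [100.0, 0.01]  # gradient magnitudes differ by a factor of 10^4
theta, m, v = adam_step(theta, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
```

Both parameters move by about `lr` on the first step even though their gradients differ by four orders of magnitude; plain SGD with the same learning rate would move them by 0.1 and 0.00001.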

For transformers specifically, the gradient distribution is heavy-tailed (a few attention heads, a few layers dominate). SGD with a single learning rate either overshoots the dominant directions or undershoots the rare ones. Adam's per-parameter scaling handles this automatically.

The cost: 2x parameter memory (the m and v states). For 7B models that is 14B floats, or 56 GB if the states are kept in fp32. AdamW additionally decouples weight decay from the gradient-based update, which is what most modern transformer trainers use (Loshchilov & Hutter 2019).

Depth tree.

- L1: Adam vs AdamW: what's the difference? (AdamW applies weight decay directly to the weights instead of adding an L2 term to the gradient)
- L2: How do beta_1, beta_2 interact with batch size? (larger batches → less noise → can lower beta_2)
- L3: Why do we see 8-bit Adam / Lion / Sophia in 2024+? (memory pressure from Adam states; Lion keeps only a momentum state; Sophia uses an estimate of the Hessian diagonal)

Pit-of-failure. "Adam is adaptive". Adaptive to what? — silence.

Q6 — Walk me through the cross-entropy loss and its gradient.

Model answer.

For a categorical distribution with K classes and a one-hot target y (true class c), the cross-entropy loss is L = -log p_c where p = softmax(z) and z are the logits. Equivalently: L = -log( exp(z_c) / sum_k exp(z_k) ) = -z_c + logsumexp(z).

The gradient with respect to logits is famously clean: dL/dz = p - y. That is, the gradient is the predicted probability minus the one-hot target. Derivation: dL/dz_k = -delta_{ck} + p_k, which is p_k - y_k since y is one-hot.

For numerical stability we never compute exp(z) directly — we use the log-sum-exp trick: subtract max(z) before exponentiating. Most frameworks fuse this into a single cross_entropy(logits, target) op that takes raw logits, not softmax outputs.
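A minimal sketch of the stable loss and its gradient, working from raw logits as described above. The function names are illustrative and the code handles a single example rather than a batch.

```python
import math

def logsumexp(z):
    m = max(z)  # subtract the max before exponentiating
    return m + math.log(sum(math.exp(v - m) for v in z))

def cross_entropy(z, c):
    """L = -z_c + logsumexp(z) for logits z and true class index c."""
    return -z[c] + logsumexp(z)

def cross_entropy_grad(z, c):
    """dL/dz = softmax(z) - one_hot(c)."""
    lse = logsumexp(z)
    p = [math.exp(v - lse) for v in z]
    return [p_k - (1.0 if k == c else 0.0) for k, p_k in enumerate(p)]

# Logits this large would overflow a naive exp(z); the shifted version is fine.
z = [1000.0, 1001.0, 1002.0]
loss = cross_entropy(z, c=2)
grad = cross_entropy_grad(z, c=2)
```

The gradient entries sum to zero, since both p and the one-hot target sum to 1, and only the true-class entry is negative.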

Depth tree.

- L1: Why does the gradient simplify to p - y? (derivation above)
- L2: What is label smoothing and what does it do to the gradient? (replaces one-hot with (1-eps) y + eps/K; gradient becomes p - y_smooth)
- L3: How does this interact with KL-regularized RLHF? (for a per-token penalty KL(p || p_ref) = sum_k p_k log(p_k / p_ref_k), the gradient with respect to the logits is p_j * (log(p_j / p_ref_j) - KL), which vanishes when p = p_ref)

Pit-of-failure. Trying to write softmax then log then nll as separate ops. Signals: no numerical stability instinct.

→ Phase 05, Phase 17.

Q7 — Explain BPE tokenization.

Model answer.

Byte-Pair Encoding starts with the vocabulary of all individual bytes (256 for byte-level BPE, used by GPT-2+). It iteratively finds the most frequent adjacent pair of tokens in the training corpus and merges them into a new token. After N merges the vocabulary has 256 + N entries. At inference, you apply the same merges in the same order to tokenize new text.
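The training loop and the encode step can be sketched in a few lines. This toy version is an illustration under simplifying assumptions, not how production tokenizers are built: it treats the corpus as one flat byte sequence with no pre-tokenization, counts overlapping pairs naively, and stops early when no pair repeats.

```python
from collections import Counter

def apply_merge(tokens, a, b):
    """Replace every adjacent (a, b) pair with the merged token a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    tokens = [bytes([b]) for b in text.encode("utf-8")]  # start from raw bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges, tokens

def encode(text, merges):
    """Apply the learned merges in training order to new text."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:
        tokens = apply_merge(tokens, a, b)
    return tokens

corpus = "low lower lowest low low"
merges, trained_tokens = train_bpe(corpus, num_merges=10)
encoded = encode("lowest low", merges)
```

Because every token is a byte string, joining the encoded tokens always reproduces the input bytes exactly, including text the tokenizer never saw during training.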

The trick: BPE balances vocabulary size against sequence length. Character-level: small vocab, long sequences, slow. Word-level: large vocab, OOV problems on rare words. BPE: subwords. Common words become single tokens (" the"); rare words become multiple subwords ("unhappiness" → "un" + "happiness" or similar).

Byte-level BPE (GPT-2) starts from raw bytes, not Unicode codepoints, so it handles any input without OOV. The cost is that one Unicode character can take 1-4 tokens, because UTF-8 uses 1-4 bytes per character. tiktoken (OpenAI), tokenizers (HuggingFace) and SentencePiece (Google) are three widely used production implementations.

Depth tree.

- L1: Why byte-level and not character-level? (no Unicode OOV; deterministic on any input)
- L2: What is the worst-case tokenization ratio for a non-English language? (can be 3-4x English for CJK; see Mistral/Cohere tokenizer designs)
- L3: How would you train a tokenizer optimized for code? (weight code corpus heavily; preserve whitespace patterns; tab is one token)

Pit-of-failure. Confusing BPE with WordPiece (BERT). Signals: didn't implement either.

→ Phase 11, drill 02.

Q8 — What is the difference between encoder, decoder, and encoder-decoder architectures?

Model answer.

Encoder (BERT): bidirectional self-attention over the input. Output is one vector per input token. Trained with masked language modeling: mask 15% of tokens, predict them. Used for understanding tasks (classification, extraction).

Decoder (GPT): causal self-attention, where each token attends only to previous tokens. Trained with next-token prediction. Used for generation. Llama 3 is decoder-only, and the other frontier models (GPT-4, Claude, Gemini) are generally understood to be as well, although their architectures are not fully published.

Encoder-decoder (T5, original Transformer for translation): encoder processes input bidirectionally, decoder generates output autoregressively while cross-attending to the encoder's output. Used historically for translation, summarization. Largely subsumed by decoder-only in 2023+ because you can prompt a decoder-only model to do any of these tasks.

Depth tree.

- L1: Why has decoder-only won? (scale + prompting subsumes specialized architectures; one model for all tasks)
- L2: When would you still pick encoder-decoder? (very-long-input → short-output ratios where the encoder bottleneck is acceptable, e.g. some summarization, code translation)
- L3: What is "prefix LM" and why is it interesting? (decoder with bidirectional attention over the prefix; arguably the best of both worlds, see UL2 / PaLM-2)

Pit-of-failure. Listing the three without being able to articulate why decoder-only won. Signals: textbook knowledge, no taste.

Q9 — Walk me through the forward and backward pass of dropout.

Model answer.

Forward (training): sample a Bernoulli mask m with probability 1-p of being 1; output y = x * m / (1-p). The 1/(1-p) scaling ("inverted dropout") makes the expectation of y equal to x, so no scaling is needed at inference. Inference: identity (y = x).

Backward: dropout's local gradient is just the mask. dL/dx = (dL/dy) * m / (1-p). No learned parameters, no state across batches.
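Both passes can be written out directly. The sketch below operates on a flat list and takes an explicit `random.Random` instance so the mask is reproducible; those choices and the function names are illustrative.

```python
import random

def dropout_forward(x, p, rng):
    """Inverted dropout at training time: drop with probability p, rescale the rest."""
    keep = 1.0 - p
    mask = [1.0 if rng.random() < keep else 0.0 for _ in x]
    y = [x_i * m_i / keep for x_i, m_i in zip(x, mask)]
    return y, mask

def dropout_backward(dy, mask, p):
    """The local gradient is the same scaled mask used in the forward pass."""
    keep = 1.0 - p
    return [g * m_i / keep for g, m_i in zip(dy, mask)]

rng = random.Random(0)
x = [1.0] * 10000
y, mask = dropout_forward(x, p=0.3, rng=rng)
mean_y = sum(y) / len(y)  # close to 1.0: the expectation of y equals x
dx = dropout_backward([1.0] * len(x), mask, p=0.3)
```

The mask must be saved from the forward pass and reused in backward; resampling it would give a gradient for a different function than the one that was evaluated.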

Dropout reduces co-adaptation of features by forcing the network to be robust to missing units. In transformers, dropout is applied (a) on attention probabilities, (b) on the output of the FFN, (c) on the embeddings. Modern frontier models often remove dropout because at the scale of trillions of tokens you train for roughly one epoch or less, and the implicit regularization of stochastic gradients is enough.

Depth tree.

- L1: Why scale by 1/(1-p) at training and not at inference? (easier to reason about; inference is the hot path)
- L2: Why do large LMs often skip dropout? (undertrained regime; one epoch; implicit SGD reg suffices)
- L3: What is "DropPath" / stochastic depth and how does it differ? (drops entire residual blocks, not units; used in ViT, ConvNeXt)

Pit-of-failure. Forgetting the 1/(1-p) scaling. Signals: never wrote the forward pass from scratch.

Q10 — How does the KV cache work and why is it necessary?

Model answer.

During autoregressive generation, each new token requires attending to all previous tokens. Naively, you would re-run the model over the whole prefix at every step: attention over a length-t prefix costs O(t^2), so n generation steps cost O(n^3) in total. The KV cache stores the K and V matrices from previous steps and computes only the new row, so step t costs O(t) and the total drops to O(n^2).

The cache size per layer per head is 2 * n * d_head * dtype_bytes (one for K, one for V). For a 7B model with 32 layers, 32 heads, d_head=128, fp16, and n=2048: 2 * 2048 * 32 * 32 * 128 * 2 = 1.07 GB. Per request. This is why long-context inference is memory-bound, not compute-bound.
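The sizing arithmetic is worth keeping as a small helper. This sketch assumes the per-request formula with separate K and V tensors for every layer and KV head; the function name is illustrative, and the GQA line simply reuses the same 7B shape with 8 KV heads as a hypothetical comparison.

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len, dtype_bytes):
    """Per-request KV cache size: K and V for every layer, KV head and position."""
    return 2 * num_layers * num_kv_heads * d_head * seq_len * dtype_bytes

# The 7B example above: 32 layers, 32 heads, d_head = 128, fp16, n = 2048.
mha_bytes = kv_cache_bytes(32, 32, 128, 2048, 2)
# Same shape with 8 KV heads (grouped-query attention), for comparison.
gqa_bytes = kv_cache_bytes(32, 8, 128, 2048, 2)

mha_gb = mha_bytes / 1e9
```

Multiplying by the number of concurrent requests gives the cache budget a server needs on top of the model weights.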

Optimizations: PagedAttention (vLLM) treats the KV cache like virtual memory with fixed-size pages, reducing fragmentation. Multi-query attention (MQA) and grouped-query attention (GQA) share K/V across heads (Llama 2 70B uses GQA-8), shrinking the cache by 4-8x. Sliding-window attention (Mistral) caps n at a window size.

Depth tree.

- L1: What is the exact KV cache size formula? (2 * num_layers * num_kv_heads * d_head * seq_len * dtype_bytes per request)
- L2: How does PagedAttention reduce fragmentation? (fixed-size blocks; can share prefix across beams)
- L3: What is the tradeoff of MQA vs GQA vs MHA? (MQA: 1 KV head, smallest cache, slight quality loss; GQA: groups of heads, sweet spot; MHA: full quality, biggest cache)

Pit-of-failure. "It caches keys and values" without the memory math. Signals: never sized a deployment.

→ Phase 22, drill 04.

Q11 — What is positional encoding and why do we need it?

Model answer.

Self-attention is permutation-equivariant: shuffling the input tokens produces a corresponding shuffle of the output, so the model cannot distinguish "dog bites man" from "man bites dog" without explicit position information. Positional encoding injects this information.

Two families. Absolute: add a fixed or learned vector to each token's embedding. The original Transformer used sinusoids of varying frequencies (sin(pos / 10000^(2i/d)) and the matching cosines); BERT/GPT-2 used learned vectors. Relative: bias the attention scores by a function of (i - j). Examples: T5's bucketed bias, ALiBi (linear bias by distance, used in MosaicML's MPT and in BLOOM), RoPE.

RoPE (Rotary Positional Embedding, Su et al. 2021) is the modern default (Llama, Mistral, Qwen). It rotates Q and K by an angle proportional to position before computing Q K^T. The inner product (R_i q)^T (R_j k) depends only on (i - j) — so it is relative, but implemented as a per-token rotation. RoPE extends to long context with NTK-aware scaling.
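The relative-position property can be checked numerically. This sketch rotates adjacent pairs of dimensions (x[0], x[1]), (x[2], x[3]), and so on, with the sinusoidal frequency schedule and base 10000; some implementations pair dimension i with i + d/2 instead, so the pairing used here is an assumption of the example.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for pair i // 2
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Same offset (i - j = 2) at very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))
```

The two scores agree up to floating-point error, and the rotation leaves vector norms unchanged, so RoPE changes only the angle between q and k.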

Depth tree.

- L1: Why are sinusoidal embeddings able to extrapolate? (in theory: the encoding at pos + k is a fixed linear function of the encoding at pos; in practice: poorly)
- L2: How does RoPE actually rotate? (splits embedding into pairs, applies a 2D rotation per pair, frequencies follow the sinusoidal schedule)
- L3: What does YaRN / NTK-scaling do to extend RoPE context? (interpolates the position grid; rescales the base frequency to avoid OOD rotation angles at long context)

Pit-of-failure. Reciting "sinusoidal" without knowing RoPE exists. Signals: stuck in 2017.

→ Phase 16, drill 10.

Q12 — Explain the chain of operations in a transformer block.

Model answer.

Modern (pre-LN) transformer block: x → LN → MHA → +x → LN → FFN → +x. Two residual additions on one residual stream, two normalizations, two sub-blocks. The residual is the key to depth: without it, gradients die exponentially with depth (Phase 10).

MHA: Q, K, V = x W_Q, x W_K, x W_V; split into heads; attn = softmax(Q K^T / sqrt(d_k)) V (per head); concat heads; project with W_O. Cost: O(n^2 d + n d^2).

FFN: typically Linear(d, 4d) → activation → Linear(4d, d). The 4d expansion is conventional. Activation has evolved: ReLU (original) → GELU (BERT, GPT-2) → SwiGLU (Llama, modern). SwiGLU is (Swish(x W_1) * (x W_3)) W_2: a gated unit with 3 matrices instead of 2, so the hidden width is usually cut to about 8d/3 to keep the parameter count equal to a 4d two-matrix FFN.
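A quick parameter count makes the FFN budget concrete. The sketch assumes bias-free linear layers, full multi-head attention with four d × d projections, and d = 4096 as an example width; the function names are illustrative.

```python
def dense_ffn_params(d, hidden):
    return 2 * d * hidden  # W_in (d x hidden) + W_out (hidden x d)

def swiglu_ffn_params(d, hidden):
    return 3 * d * hidden  # W_1 and W_3 (d x hidden each) + W_2 (hidden x d)

def attention_params(d):
    return 4 * d * d       # W_Q, W_K, W_V, W_O

d = 4096
gelu = dense_ffn_params(d, 4 * d)
swiglu_same_width = swiglu_ffn_params(d, 4 * d)   # 1.5x the two-matrix FFN
swiglu_iso_hidden = round(8 * d / 3)              # width that matches the budget
swiglu_iso = swiglu_ffn_params(d, swiglu_iso_hidden)
ffn_fraction = gelu / (gelu + attention_params(d))
```

At the same hidden width a three-matrix FFN costs 1.5x the parameters of a two-matrix one; shrinking the hidden width to about 8d/3 brings the count back to 8 d^2, which is two thirds of the block's weights either way.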

Depth tree.

- L1: What does SwiGLU buy you over GELU? (empirically better perplexity at iso-params; gating helps)
- L2: Where does most of the parameter budget live in a transformer? (in the FFN: roughly 2/3 of each block with a 4d FFN, and about the same with SwiGLU at 8d/3 hidden width)
- L3: Why pre-LN instead of post-LN at scale? (Xiong et al. 2020: pre-LN keeps gradient norms well-behaved at initialization; post-LN needs careful learning-rate warmup and gets harder to train as depth grows)

Pit-of-failure. Forgetting that residuals exist, or putting LN in the wrong place. Signals: never wrote forward().

→ Phase 17.

Q13 — How would you debug a training run where loss is NaN at step 1000?

Model answer.

Step 1: identify what is NaN. Loss? An activation? A gradient? Common culprits: (a) exploding gradients → check grad_norm; (b) division by zero in softmax / layernorm → check epsilon values; (c) fp16 overflow → switch to bf16 or fp32 for the suspect layer; (d) bad data → check the batch that triggered it.

Step 2: bisect. Save a checkpoint at step 999, replay step 1000 deterministically (seed + same batch). If reproducible: instrument. If not: it is data-dependent or rng-dependent.

Step 3: instrument with register_forward_hook (PyTorch) or equivalent. Log per-layer activation max/min/mean. Find the first layer where the activation explodes. From there it is usually clear: an attention layer with no clipping, a softmax with no scaling, a residual that grew unboundedly through the network, or an LR spike from a bad scheduler.

Depth tree.

- L1: Why is grad clipping (e.g. clip_grad_norm_(model.parameters(), 1.0)) standard? (catches transient spikes from rare batches)
- L2: How do you tell fp16 overflow from a real divergence? (switch to bf16 with same lr; if it disappears, was fp16)
- L3: What is the "spike → recover" pattern in LLM pretraining loss curves? (loss spikes are normal at scale; PaLM, OPT logs document them; usually due to bad batches)

Pit-of-failure. "I'd lower the learning rate." (Maybe! But that's after you understand the cause.)

→ Phase 19, behavioral anecdote 2.

Q14 — What is the difference between SFT, RLHF, and DPO?

Model answer.

SFT (supervised fine-tuning): standard next-token prediction loss on high-quality human-written demonstrations. Cheap, simple, but the model can only imitate what it saw — it does not learn from negative examples.

RLHF (Christiano 2017, Ouyang 2022): three stages. (1) SFT. (2) Train a reward model on human pairwise preferences using the Bradley-Terry likelihood: P(a > b) = sigmoid(r(a) - r(b)). (3) RL-fine-tune the SFT model to maximize r(x, y) - beta * KL(pi || pi_SFT), typically using PPO. The KL term prevents reward hacking.

DPO (Rafailov 2023): observation that the RLHF optimum has a closed-form solution pi*(y|x) = pi_ref(y|x) exp(r(x,y)/beta) / Z(x). Rearranging gives r(x,y) = beta log(pi(y|x) / pi_ref(y|x)) + beta log Z(x). Substituting into the Bradley-Terry loss, where the Z(x) terms cancel because both responses share the same prompt, gives a simple classification-style loss on preference pairs directly on the LM, with no reward model and no RL. Cheaper, more stable, and often comparable in quality on standard benchmarks.
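Written out for one preference pair, the resulting loss is -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]). The sketch below takes the four sequence log-probabilities as plain floats; the argument names, the beta default of 0.1 and the example numbers are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probabilities."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)        # implicit reward of y_w
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)  # implicit reward of y_l
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy equals the reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # chosen more likely, rejected less
```

At initialization the policy equals the reference, so the margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one.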

Depth tree.

- L1: Why does the KL-to-reference penalty exist? (prevents the RL policy from drifting into off-distribution regions where the reward model is unreliable)
- L2: Derive the DPO loss from the RLHF optimum. (Lagrangian → closed form → BT substitution; see X3 theory/04)
- L3: When does DPO fail vs RLHF? (DPO is sensitive to preference noise; RLHF's RM averages over many preferences; some papers find DPO underperforms on math / code)

Pit-of-failure. Knowing DPO exists but not being able to derive the loss. Signals: read the abstract, skipped §3.

→ X3 module, drill 06.

Q15 — How does FlashAttention work?

Model answer.

Standard attention materializes the n x n attention matrix in HBM (high-bandwidth memory). For n = 4096 and fp16, that is 32 MB per head — and we read/write it twice (forward + backward). HBM bandwidth is the bottleneck.

FlashAttention (Dao et al. 2022) tiles the computation: load a block of Q and a block of K, V into SRAM, compute partial softmax incrementally, accumulate the output. The full n x n matrix is never materialized in HBM. Forward pass is exact (not approximate).
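The incremental softmax is the core of the tiling. The sketch below processes K and V in blocks for a single query and keeps only a running max, a running denominator and a running unnormalized output. It illustrates the arithmetic only: the real kernel's SRAM tiling, Q blocking and backward pass are not modeled, and the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, K, V):
    """Standard attention for one query: materializes all n scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    s = [dot(q, k) * scale for k in K]
    m = max(s)
    p = [math.exp(v - m) for v in s]
    total = sum(p)
    return [sum(p_i * v[j] for p_i, v in zip(p, V)) / total for j in range(len(V[0]))]

def attention_online(q, K, V, block=2):
    """Same result, visiting K/V one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")        # running max of the scores seen so far
    l = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = [dot(q, k) * scale for k in Kb]
        m_new = max(m, max(s))
        rescale = math.exp(m - m_new)  # 0.0 on the first block
        p = [math.exp(v - m_new) for v in s]
        l = l * rescale + sum(p)
        acc = [a * rescale + sum(p_i * v[j] for p_i, v in zip(p, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.5, 0.5], [0.3, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
ref = attention_reference(q, K, V)
online = attention_online(q, K, V, block=2)
```

Whenever a new block raises the running max, the earlier partial sums are rescaled by exp(m_old - m_new), which is why the blockwise result is exact rather than approximate.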

Backward pass: the activations from the forward pass would normally cost O(n^2) to store. FlashAttention instead stores only the per-row softmax normalization statistics and recomputes the attention matrix block by block on the backward pass. Net result: the paper reports roughly 2-4x faster attention than a standard implementation, with the gap widening at long context.

Depth tree.

- L1: What is the "online softmax" trick? (streaming computation of softmax with running max + running sum, numerically stable)
- L2: Why does FlashAttention-2 improve over v1? (better parallelization across sequence length, fewer non-matmul flops)
- L3: How does FlashAttention-3 (Hopper) use TMA / async? (asynchronous warp-specialization; tensor memory accelerator on H100)

Pit-of-failure. "It's just a faster kernel". Why is it faster? — silence.

→ Phase 24, Phase 27.

Q16 — What is mixed precision training?

Model answer.

Mixed precision (Micikevicius 2017) keeps a master copy of the model weights and the optimizer state in fp32 but does the forward and backward passes in fp16 (or bf16). The forward is faster (tensor cores run fp16 matmuls at a multiple of fp32 throughput) and activations take half the memory.

The catch: fp16 has a narrow dynamic range (6e-5 to 65504). Small gradients underflow to zero. Solution: loss scaling — multiply the loss by a large factor (e.g. 65536) before backward, then divide the gradients by the same factor before the optimizer step. This shifts gradients into representable range.
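The underflow and the fix can be reproduced with the standard library alone, since `struct` supports the IEEE half-precision format through the `"e"` format code. The helper name `to_fp16` and the example gradient value of 1e-8 are illustrative.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest fp16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                     # a small but meaningful gradient
unscaled = to_fp16(grad)        # below the fp16 subnormal range: becomes 0.0

scale = 65536.0                 # loss scale, as in the text
scaled = to_fp16(grad * scale)  # about 6.6e-4, comfortably representable
recovered = scaled / scale      # unscale in higher precision before the optimizer step
```

Without scaling the gradient is lost entirely; with scaling it survives the fp16 round trip with a relative error below fp16's rounding precision.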

bf16 (brain float) has the same exponent range as fp32 (1e-38 to 3e38) but only 7 mantissa bits. No loss scaling needed. Slightly less precise but far less prone to underflow. Modern hardware (A100 and later GPUs, TPUs) supports bf16 natively, which is why bf16 has become the default for new training runs.

Depth tree.

- L1: fp16 vs bf16: when do they differ? (fp16 needs loss scaling; bf16 doesn't; fp16 has higher precision in the representable range; bf16 has wider range)
- L2: How does dynamic loss scaling work? (start high; halve on NaN; double after N non-NaN steps)
- L3: Why are master weights kept in fp32? (weight updates w -= lr * g can be much smaller than w; fp16 would lose them in rounding)

Pit-of-failure. "fp16 is faster". Why does training in fp16 sometimes diverge? — silence.

→ Phase 26.

Q17 — Explain LoRA.

Model answer.

LoRA (Hu et al. 2021) replaces a frozen weight matrix W (shape d × k) with W + B A where A is r × k, B is d × r, and r << min(d, k) (typical r = 8 or 16). Only A and B are trained; W stays frozen. Trainable parameter count drops from d k to r (d + k). For d = k = 4096 and r = 8 that is 65,536 parameters instead of 16.8M, a 256x reduction for that matrix.

The hypothesis (and the paper's empirical result): the update to W during fine-tuning has low intrinsic rank, so a rank-r decomposition is sufficient. A is initialized Gaussian, B is initialized to zero, so at step 0 the LoRA update is zero (B A = 0) and the model behavior is unchanged.

At inference, you can either keep B A separate (forward = W x + B A x) or merge: W' = W + B A, no extra cost. For multi-adapter serving (many fine-tunes of the same base), you keep them separate and swap at runtime.
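The equivalence between the separate and merged forms is easy to confirm on small matrices. The sketch uses plain nested lists and omits the alpha / r scaling factor (equivalent to assuming alpha = r); the function names and the tiny example shapes are illustrative.

```python
def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """Unmerged form: W x + B (A x), two thin matmuls on top of the base layer."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

def merge(W, A, B):
    """Merged form: fold the adapter into the base weight, W' = W + B A."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

def trainable_params(d, k, r):
    return r * (d + k)

W = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, 0.0, 0.0], [0.0, -2.0, 1.0, 3.0]]  # d=3, k=4
A = [[0.1, 0.2, 0.0, -0.3], [0.0, 0.5, 0.4, 0.1]]                        # r=2, k=4
B = [[1.0, 0.0], [0.0, -1.0], [2.0, 0.5]]                                # d=3, r=2
x = [1.0, -1.0, 0.5, 2.0]

separate = lora_forward(W, A, B, x)
merged = matvec(merge(W, A, B), x)
zero_B = lora_forward(W, A, [[0.0, 0.0]] * 3, x)  # B = 0 reproduces the base layer
```

Keeping A and B separate costs two extra thin matmuls per call but lets one copy of W serve many adapters; merging removes the overhead for a single fixed adapter.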

Depth tree.

- L1: Why initialize B = 0? (so the model starts from the pretrained behavior; otherwise random init would destroy the prior)
- L2: How does QLoRA combine 4-bit base weights with LoRA adapters? (quantize W to NF4; keep A, B in bf16; the gradients flow through the dequantized W but only A, B update)
- L3: What is the "alpha" hyperparameter in LoRA? (rescaling factor; effective update is (alpha / r) * B A x; lets you decouple effective LR from rank choice)

Pit-of-failure. "LoRA is parameter-efficient fine-tuning." How efficient? Why does it work? — silence.

→ Phase 28, drill 05.

Q18 — Walk me through top-p (nucleus) sampling.

Model answer.

Given logits z from a language model: compute probabilities p = softmax(z). Sort p in descending order. Find the smallest prefix whose cumulative sum exceeds top_p (e.g. 0.9). Truncate to that prefix, renormalize, sample.
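The steps above can be sketched as follows, starting from probabilities rather than logits. The cutoff here stops as soon as the cumulative sum reaches `top_p` (using >=), which is one common convention; the function names and the example distribution are illustrative.

```python
import random

def top_p_filter(probs, top_p):
    """Return {token_id: renormalized prob} for the nucleus of `probs`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # smallest prefix that reaches top_p
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, top_p, rng):
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, top_p=0.9)  # keeps tokens 0, 1, 2
token = sample_top_p(probs, 0.9, random.Random(0))
```

On a flatter distribution the same `top_p` keeps more tokens, which is the adaptive behavior that distinguishes it from a fixed top-k cutoff.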

Why top-p instead of top-k? Top-k cuts at a fixed count, which is too restrictive for high-entropy contexts (many plausible next tokens) and too loose for low-entropy contexts (one obvious next token). Top-p adapts — it grows the candidate set when the distribution is flat and shrinks it when peaked.

In practice you combine top-p with temperature (z / T before softmax). T < 1 sharpens; T > 1 flattens. Common settings: T = 0.7, top_p = 0.9 for assistant chat; T = 1.0, top_p = 1.0 for diverse creative output; greedy (T → 0 or top_k = 1) for deterministic eval.

Depth tree.

- L1: What is min-p sampling? (filter tokens with p < min_p * max(p); threshold relative to peak; works well at low T)
- L2: How does typical sampling differ? (keeps the tokens whose information content -log p is closest to the entropy of the conditional distribution; Meister 2022)
- L3: How would you implement constrained decoding for JSON schemas? (grammar-constrained sampling: mask logits of tokens that would violate the grammar; outlines / jsonformer / xgrammar)

Pit-of-failure. Confusing top-p with top-k. Signals: never tuned sampling for a real product.

→ Phase 21, drill 07.

Q19 — What is RAG and where does it break?

Model answer.

Retrieval-Augmented Generation: at query time, embed the user's query, retrieve top-k passages from a vector store, concatenate them into the prompt, and let the LLM answer with that context. The LLM does the reasoning; the retriever does the memory.

The standard pipeline: chunk documents (often 500-1000 tokens with overlap), embed with a sentence encoder (BGE, E5, OpenAI ada), index with FAISS / Qdrant / pgvector. Query: embed, ANN search, optional rerank with a cross-encoder, stuff into prompt.

Break modes. (1) Chunking loses cross-chunk context (a sentence at a chunk boundary). (2) Single-query retrieval fails for multi-hop questions ("who succeeded the predecessor of X?"). (3) Hallucinated citations — the model fabricates a passage that wasn't retrieved. (4) Retriever-generator mismatch — the retriever uses semantic similarity, the LLM uses literal text matching for evidence. Mitigations: hierarchical retrieval, query rewriting, hybrid sparse+dense, reranker, citation-with-span constraints.

Depth tree.

- L1: What is hybrid retrieval and why does it help? (BM25 + dense; BM25 catches exact terms / IDs; dense catches semantics)
- L2: How do you evaluate a RAG system end-to-end? (retrieval@k vs generation faithfulness, as separate metrics; RAGAS, TruLens; precision/recall + groundedness)
- L3: How does graph RAG (Microsoft 2024) differ? (builds a knowledge graph from the corpus; retrieves subgraphs; handles multi-hop better)

Pit-of-failure. "RAG works great". When doesn't it? — silence.

→ Phase 29.

Q20 — How does a tokenizer's vocabulary size affect model design?

Model answer.

Larger vocab → more parameters in the embedding and output projection (both are vocab_size × d_model). For Llama 3's 128k vocab and d_model = 4096, that is 524M parameters per matrix — 1B total, comparable to a whole 1B model.

Larger vocab → shorter sequences for the same text → less attention compute per sentence (O(n^2) saved). For multilingual or code-heavy domains, this matters: Llama 2's 32k vocab can fall back to as many as 3 byte tokens per Korean character (a Hangul syllable is 3 bytes in UTF-8); Llama 3's 128k vocab cuts the token count roughly in half.

Tradeoffs. Larger vocab: more rare tokens that may be undertrained (poor embeddings for tokens that appear once); larger embedding memory footprint; larger softmax compute at the output layer (the parameter cost is mitigated by tied input/output embeddings, the compute cost by adaptive or factorized softmax). Smaller vocab: more attention compute per sentence; better-trained embeddings; worse for non-English.

Depth tree.

- L1: What is tied embedding? (input embedding matrix = output projection matrix transposed; halves the parameter cost)
- L2: What is adaptive softmax / hierarchical softmax? (cluster tokens by frequency; rare tokens get cheaper representations; from Grave 2017)
- L3: How would you measure "tokenizer efficiency" for a target domain? (tokens-per-byte on a held-out corpus; for the LM itself, bits-per-byte rather than per-token loss, since bits-per-byte is comparable across tokenizers)

Pit-of-failure. "Bigger vocab is better." Bigger by what factor and at what cost? — silence.

→ Phase 11.

Q21 — Explain gradient checkpointing.

Model answer.

During backprop, you need each layer's input activations to compute its weight gradient. Naively, you store all activations from the forward pass — that scales as O(num_layers × batch × seq × d), which dominates memory at long context.

Gradient checkpointing trades compute for memory: store activations only at a subset of layers ("checkpoints"). On the backward pass, recompute the activations between checkpoints by re-running the forward of that segment. If you checkpoint every sqrt(L) layers, memory drops from O(L) to O(sqrt(L)) and compute increases by ~33%: one extra forward pass, on the usual estimate that backward costs about twice the forward.
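The sqrt(L) rule falls out of a simple count. Under a simplified model where every layer's activation has the same size, the peak is one stored checkpoint per segment plus one segment's worth of activations during its recomputation, roughly L/s + s for segment length s. The sketch below minimizes that by brute force; the cost model and the function name are assumptions of the example.

```python
import math

def peak_stored_activations(num_layers, segment):
    """Layer inputs held at once: one checkpoint per segment, plus one
    segment's worth of activations while that segment is being recomputed."""
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment

L = 64
best_segment = min(range(1, L + 1), key=lambda s: peak_stored_activations(L, s))
best_peak = peak_stored_activations(L, best_segment)
no_checkpointing = L  # store every layer's input
```

For L = 64 the minimum is at a segment length of 8 = sqrt(64), holding 16 activations instead of 64.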

PyTorch exposes this as torch.utils.checkpoint.checkpoint. The function wraps a forward computation and re-runs it on backward. Caveats: (1) RNG state must be preserved if dropout is in the wrapped block; (2) the saved tensors must be detached properly to avoid double-backprop bugs; (3) it does not compose trivially with autocast — must be careful with mixed precision.

Depth tree.

- L1: What is the optimal checkpointing frequency? (every sqrt(L) layers minimizes peak activation memory for the price of one extra forward pass)
- L2: How does selective recomputation in Megatron differ? (recompute only the ops that are cheap in FLOPs but large in memory, such as attention softmax and dropout, and store the rest; better than uniform)
- L3: How does activation offloading to CPU compare? (saves more memory but PCIe transfer can bottleneck; only worth it on very long context)

Pit-of-failure. "It saves memory." How? At what cost? — silence.

→ Drill 03.

Q22 — What is in-context learning and why is it surprising?

Model answer.

In-context learning (ICL): a frozen language model performs a new task by being shown examples in its prompt, no weight updates. GPT-3 (Brown 2020) is the canonical demonstration — give it 5 translation examples, then a new sentence, and it translates.

Why surprising: the model was trained to predict the next token on web text. Nothing in that objective explicitly teaches "if you see English: foo / Spanish: bar / English: baz / Spanish:, output a Spanish translation." Yet at scale it emerges. This suggested that scale alone produces general-purpose reasoning, which kicked off the 2020-2023 scaling race.

Mechanism is still debated. Hypotheses: (a) ICL is gradient descent in activation space, where the model implements an inner optimization (Akyürek 2022, Garg 2022, von Oswald 2022); (b) ICL is meta-learning over the pretraining distribution, where common patterns in training data (e.g. Q/A formats) make few-shot prompts in-distribution; (c) ICL is just sophisticated pattern matching. Most likely all three contribute.

Depth tree.

- L1: What is chain-of-thought prompting? (asking the model to reason step by step before answering; emerged in Wei 2022; works only at scale)
- L2: How does the "induction heads" mechanism (Olsson 2022) relate to ICL? (specific 2-layer attention circuit that does sequence completion; appears at a phase transition during training)
- L3: How would you test whether a model is "really learning" from in-context examples vs pattern matching? (use counterfactual examples that violate the training distribution; see "Counterfactual Reasoning in ICL")
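The sequence-completion behavior attributed to induction heads can be written as a plain lookup rule: find the previous occurrence of the current token and copy whatever followed it. The sketch below shows only that input-output behavior, not the attention circuit itself, and the function name is an assumption.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token; None if no match."""
    current = tokens[-1]
    # Scan earlier positions right to left, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["A", "B", "C", "A"]))  # B
```

Seen this way, "[A][B] ... [A] → [B]" is a very small in-context learner: the mapping comes from the prompt, not from the weights.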

Pit-of-failure. Confusing ICL with fine-tuning. Signals: never read the GPT-3 paper.

Q23 — How does evaluation differ between pretraining, instruction-tuning, and chat fine-tuning?

Model answer.

Pretraining is evaluated with perplexity on held-out web text, plus zero-shot / few-shot academic benchmarks (HellaSwag, MMLU, ARC). These measure language modeling quality and emergent capabilities. Cheap, automatic, deterministic.
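Perplexity is the exponential of the mean negative log-likelihood per held-out token. A minimal sketch, assuming you already have the natural-log probability the model assigned to each token (the function name is illustrative):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    held-out token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
print(round(uniform_ppl, 6))  # 4.0
```

The "effective branching factor" reading follows directly: perplexity 4 means the model is as uncertain as a uniform choice among 4 tokens.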

Instruction-tuning is evaluated on instruction-following benchmarks (MT-Bench, AlpacaEval, IFEval). These use either (a) LLM-as-judge (GPT-4 or Claude grades pairwise) or (b) constraint-satisfaction checks (IFEval: "did the model produce exactly 5 bullet points"). Less reliable than perplexity but closer to user-facing utility.
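A constraint-satisfaction check of the "exactly 5 bullet points" kind is just a deterministic function of the response text. The sketch below is an assumption about how such a check could look, not IFEval's actual implementation.

```python
def count_bullets(response: str) -> int:
    """Count lines that start with a markdown bullet marker."""
    return sum(
        1 for line in response.splitlines()
        if line.lstrip().startswith(("- ", "* "))
    )

def has_exactly_n_bullets(response: str, n: int) -> bool:
    return count_bullets(response) == n

sample = "Here you go:\n- one\n- two\n* three\n- four\n- five"
print(has_exactly_n_bullets(sample, 5))  # True
```

Checks like this need no judge model, which is why they are more reproducible than pairwise LLM grading.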

Chat / preference-tuned models are evaluated with human pairwise preferences on a held-out set (Chatbot Arena, internal red-team evals) and increasingly with capability-specific evals (code: HumanEval, MBPP, LiveCodeBench; math: GSM8K, MATH; reasoning: ARC-AGI, GPQA). For safety / alignment: red-team rates of harmful completions, refusal rates on benign asks.

Depth tree.

- L1: Why is LLM-as-judge biased? (position bias, verbosity bias, self-preference; mitigated by swapping answer positions and averaging, and by length-normalized rubrics)
- L2: What is "Goodhart on evals"? (once the eval becomes a training target, models overfit to it; e.g. early MT-Bench scores inflated by long answers)
- L3: How do you design a held-out eval that survives a year? (private; rotates; covers capability + safety + refusal-calibration; anchored to real user complaints)
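One common mitigation for position bias is to query the judge twice with the answer order swapped and accept a winner only when both verdicts agree. In this sketch, `judge` is a hypothetical callable standing in for an LLM call; the toy judges exist only to make the example runnable.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Query the judge in both orders; accept a winner only if both agree.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first" or "second". Returns "a", "b", or "tie".
    """
    forward = judge(prompt, answer_a, answer_b)
    backward = judge(prompt, answer_b, answer_a)
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with position, so treat it as position bias

# Toy judges standing in for an LLM call.
always_first = lambda prompt, first, second: "first"
prefers_longer = lambda prompt, first, second: (
    "first" if len(first) >= len(second) else "second"
)

position_biased = debiased_preference(always_first, "q", "short", "a longer answer")
length_biased = debiased_preference(prefers_longer, "q", "short", "a longer answer")
print(position_biased, length_biased)  # tie b
```

The swap neutralizes the purely position-biased judge, but the length-biased judge still picks the longer answer, so verbosity bias needs a separate control.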

Pit-of-failure. Quoting an MMLU number without knowing what MMLU measures or how it leaks.

→ Phase 20.

Q24 — What is constitutional AI?

Model answer.

Constitutional AI (Bai et al. 2022, Anthropic) is a method to train a harmless assistant using AI feedback rather than human feedback. Two phases.

SL-CAI: the model generates a response to a red-team prompt; it then critiques its own response against a principle from a written constitution and revises it (e.g. "please rewrite this response to be less harmful"); the model is then fine-tuned (SFT) on the revised responses in place of the originals. The critique-and-revise step can be iterated.
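The critique-and-revise loop can be sketched as three model calls per round. Everything here is an assumption made for illustration: `generate` is a hypothetical callable wrapping a language model, the prompt templates are invented, and principles are cycled round-robin for simplicity.

```python
def critique_and_revise(generate, prompt, principles, rounds=1):
    """Return a (prompt, revised_response) pair for supervised fine-tuning.

    `generate(text)` is a hypothetical callable wrapping a language model.
    """
    response = generate(prompt)
    for i in range(rounds):
        principle = principles[i % len(principles)]  # round-robin, an assumption
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response

def stub_model(text):
    """Canned replies so the sketch runs without a real model."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft is harmful"
    return "draft answer"

sft_pair = critique_and_revise(stub_model, "red-team prompt", ["be less harmful"])
print(sft_pair)  # ('red-team prompt', 'revised answer')
```

Only the final revision is kept as the training target; the critique text is scaffolding.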

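RL-CAI: the model generates two responses to a prompt; another (or the same) model picks the better one according to the constitution. These pairwise preferences train a preference (reward) model, which then drives RL in the same way human-labeled preferences do in RLHF; the same pairs can also train a policy directly with DPO, which needs no separate reward model. For harmlessness labels, the human is replaced by an AI judge anchored to a written set of principles.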
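The data-collection step of this phase reduces to producing (prompt, chosen, rejected) records labeled by a model instead of a person. In this sketch `generate` and `choose` are hypothetical callables standing in for model calls, and the record layout is an assumption.

```python
def label_preference_pair(generate, choose, prompt, principle):
    """Build one AI-labeled preference record.

    `generate(prompt)` samples a response; `choose(prompt, a, b, principle)`
    returns 0 or 1 for the response that better follows the principle.
    Both are hypothetical callables standing in for model calls.
    """
    response_a = generate(prompt)
    response_b = generate(prompt)
    winner = choose(prompt, response_a, response_b, principle)
    responses = (response_a, response_b)
    return {
        "prompt": prompt,
        "chosen": responses[winner],
        "rejected": responses[1 - winner],
    }

samples = iter(["draft A", "draft B"])
record = label_preference_pair(
    generate=lambda prompt: next(samples),
    choose=lambda prompt, a, b, principle: 1,  # stub judge prefers the second
    prompt="some red-team prompt",
    principle="Choose the less harmful response.",
)
print(record["chosen"], record["rejected"])  # draft B draft A
```

Records in this shape can feed either a reward-model trainer or a direct preference objective.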

Why this matters at Anthropic: it scales harmlessness training without requiring a large human red-team for every iteration. The constitution is auditable text. Tradeoff: the AI judge inherits the biases of whatever model was used; the constitution must be carefully written; "harmless" can drift into "evasive".

Depth tree.

- L1: How does RLAIF (Lee 2023) differ from CAI? (RLAIF is the general technique: AI feedback instead of human; CAI is the Anthropic-specific recipe with a written constitution)
- L2: What does the Anthropic public constitution look like? (set of principles drawn from the UN Universal Declaration of Human Rights, Apple's terms of service, other labs' guidelines such as DeepMind's Sparrow rules, and Anthropic's own research; published in Anthropic's 2023 post on Claude's constitution, with the original principles listed in the CAI paper appendix)
- L3: What is the relationship between CAI and "Claude's character"? (the constitution shapes refusal patterns, tone, hedging behavior; iterative refinement of the constitution is part of how Claude evolves)

Pit-of-failure. Confusing CAI with RLHF. Getting this distinction right is critical for an Anthropic interview.

→ X3 theory/05.

Q25 — How do you decide between fine-tuning, RAG, and prompt engineering?

Model answer.

The decision tree. (1) Can prompt engineering hit the bar? Try first — zero cost, instant iteration. Use few-shot, structured output, chain-of-thought. (2) Is the issue knowledge (model doesn't know the facts) or behavior (model knows but answers wrong)? Knowledge → RAG. Behavior → fine-tune. (3) Is the knowledge static (textbook) or dynamic (changes weekly)? Static → can fine-tune; dynamic → RAG (cheaper to update an index than retrain).
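The three questions above can be encoded as a small function. This is a sketch of the decision tree as stated, with illustrative parameter names; real decisions also weigh budget, latency, and data availability.

```python
def choose_approach(prompting_meets_bar: bool, gap: str,
                    knowledge_changes_often: bool) -> str:
    """gap is "knowledge" (missing facts) or "behavior" (knows but answers wrong)."""
    if prompting_meets_bar:
        return "prompt engineering"  # step 1: cheapest option first
    if gap == "behavior":
        return "fine-tune"           # step 2: behavior gap
    if gap == "knowledge":
        # step 3: dynamic knowledge favors an updatable index
        return "RAG" if knowledge_changes_often else "RAG or fine-tune"
    raise ValueError(f"unknown gap: {gap!r}")

print(choose_approach(False, "knowledge", True))   # RAG
print(choose_approach(False, "behavior", False))   # fine-tune
```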

Fine-tuning cost: tens to thousands of dollars (LoRA), thousands to millions (full SFT). Time: hours to days. Update cost on new data: re-train. Inference cost: same as base. Reproducibility: high (weights are versioned).

RAG cost: index build is cheap; per-query is a vector search + larger prompt (more tokens billed). Update cost on new data: re-embed and upsert (minutes). Inference cost: higher prompt tokens. Reproducibility: depends on index version and retrieval determinism.
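The RAG cost profile is easier to see in a toy pipeline. The sketch below substitutes bag-of-words counts for a real embedding model and a Python list for a vector database; all names and documents are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)  # Counter returns 0 for missing tokens
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def retrieve(index, query, k=1):
    """index: list of (text, vector). Return the k most similar texts."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = ["refunds are processed within 5 days", "shipping takes 2 weeks"]
index = [(d, embed(d)) for d in docs]            # index build
new_doc = "returns need a receipt"
index.append((new_doc, embed(new_doc)))          # update = upsert, no retraining

question = "how long do refunds take"
context = retrieve(index, question, k=1)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```

The update path is a single append, and the per-query cost shows up as the extra context tokens prepended to the question.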

Depth tree.

- L1: When does fine-tuning + RAG combine? (fine-tune for tone / format; RAG for facts; common production pattern)
- L2: What is "long-context" as an alternative? (skip RAG, stuff everything in 200k context; works for small corpora but cost scales linearly with tokens; quality may degrade with "lost in the middle")
- L3: How do you A/B test fine-tune vs RAG in production? (shadow-traffic the two pipelines; measure quality (human or LLM-judge) and cost; pick on cost-per-quality-unit; see Phase 38 / CpQU)

Pit-of-failure. "Fine-tuning is always better." When isn't it? — silence.

→ Phase 29, Phase 38.

How to drill this file

1. Cover the model answer. Read the question. Speak for 3 minutes aloud.
2. Compare to the model answer. Note gaps.
3. Have a friend (or a script) randomly fire follow-ups from the depth tree.
4. Re-drill weekly until each question is < 90 seconds for the L1 path, < 3 minutes for L1→L2→L3.
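If no friend is available, the "script" option can be as small as the sketch below. The `DEPTH_TREES` dictionary is a hypothetical excerpt; you would fill it in by hand from the depth trees in this file.

```python
import random

# Hypothetical excerpt; fill in from the depth trees in this file.
DEPTH_TREES = {
    "Q22": [
        "What is chain-of-thought prompting?",
        "How does the induction heads mechanism relate to ICL?",
        "How would you test real learning vs pattern matching?",
    ],
    "Q24": [
        "How does RLAIF differ from CAI?",
        "What does the Anthropic public constitution look like?",
        "What is the relationship between CAI and Claude's character?",
    ],
}

def fire_follow_up(trees, rng=random):
    """Pick a random question and a random depth level (1-based)."""
    question_id = rng.choice(sorted(trees))
    level = rng.randrange(len(trees[question_id]))
    return question_id, level + 1, trees[question_id][level]

qid, level, follow_up = fire_follow_up(DEPTH_TREES, random.Random(0))
print(f"{qid} L{level}: {follow_up}")
```

Passing a seeded `random.Random` makes a session repeatable; omit it for a fresh draw each time.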

→ Next: 02-systems-design-for-llms.md