01 — Batching, padding, masks, and loss reduction¶
This is where we decide what counts as "one example". Three trivial conventions (mean vs. sum, padding mask, causal shift) determine whether the gradient points where you want to go or toward the median of your implementation errors.
This file is the longest in Phase 18 because it covers the four most common training-loop bugs in the same place. Read it twice. The math is trivial; the conventions are not.
The batch as a 2-D tensor¶
A batch is a tensor of shape (B, L) where:
- B = batch size (number of sequences in this step).
- L = sequence length (number of tokens per sequence, after padding to the batch's max).
For the verb-grammar corpus, a single example is a tokenized string like:
"yo trabajo / I work" → tokens: [<bos>, "yo", "trabajo", "/", "I", "work", <eos>]
"ella va a trabajar / she is going to work" → [<bos>, "ella", "va", "a", "trabajar", "/", "she", "is", "going", "to", "work", <eos>]
Lengths vary from 6 to 14 tokens. We build the batch by padding all sequences to the longest one in the batch (12 tokens here) with a special <pad> token:
input_ids: shape (B=4, L=12)
[<bos> yo trabajo / I work <eos> <pad> <pad> <pad> <pad> <pad>]
[<bos> tú trabajas / you work <eos> <pad> <pad> <pad> <pad> <pad>]
[<bos> él trabaja / he work s <eos> <pad> <pad> <pad> <pad>]
[<bos> ella va a trabajar / she is going to work <eos>]
<pad> is a real token in the vocabulary, but it must never contribute to either the loss or the attention. That's what the attention mask + loss mask are for. Confusing the two, or skipping one, is bug #1 in this phase.
Two masks, not one¶
There are two masks, and they have different shapes:
- The attention mask is (B, L, L) (or (B, 1, L, L), broadcast across heads). It tells the attention layer "query position \(i\) may not attend to key position \(j\)" for two reasons:
  - \(j > i\): the future. Causal mask.
  - position \(j\) is <pad> in this batch row. Padding mask.

  A query may attend to a key only when both masks allow it, so the two are AND-ed into a single mask. The attention then sets the disallowed positions to \(-\infty\) in the logits before softmax.
- The loss mask is (B, L) before the causal shift and (B, L-1) after it. It tells the loss reduction "position \((b, l)\) is a real predicted token; include it in the average." It's 1 everywhere except where the target token is <pad>. (Note: the target, not the input. We'll see why in a moment.)
If you accidentally use only one mask:
- Skip the attention pad mask → the model attends to <pad> keys and learns to copy <pad> predictions, contaminating other positions.
- Skip the loss pad mask → loss is averaged including <pad> positions; the optimizer gets gradients that push the model to predict <pad> more confidently, hurting everything else.
Both must exist. Both must be derived from the same source of truth: the per-token boolean is_real = (input_ids != PAD_ID). Two derived views, one fact.
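As a minimal sketch of "two derived views, one fact", the snippet below builds both masks from a single `is_real` array. The pad id of 0 and the toy token ids are assumptions made for illustration. The loss mask is taken over the shifted targets, so it has one column fewer than the input.

```python
import numpy as np

PAD_ID = 0  # assumed pad id for this sketch

# Toy batch: two right-padded rows, L = 4 (token ids are made up).
input_ids = np.array([
    [5, 6, 7, 0],
    [5, 8, 0, 0],
])

# The single source of truth.
is_real = input_ids != PAD_ID                     # (B, L) bool

# View 1: attention mask, True where query i may attend to key j.
L = input_ids.shape[1]
causal = np.tril(np.ones((L, L), dtype=bool))     # (L, L): key j <= query i
key_is_real = is_real[:, None, :]                 # (B, 1, L): broadcast over queries
attn_allowed = causal[None, :, :] & key_is_real   # (B, L, L)

# View 2: loss mask over the shifted targets input_ids[:, 1:].
loss_mask = is_real[:, 1:].astype(np.float32)     # (B, L-1)

print(attn_allowed.shape, loss_mask.shape)
```

Because both views come from `is_real`, they cannot disagree about which positions are padding.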
The causal shift¶
For language modeling, the model predicts token \(l+1\) from tokens \(1..l\). So the input to the model is input_ids[:, :-1] and the target is input_ids[:, 1:]. This is the causal shift.
input to model: [<bos> yo trabajo / I work <eos> <pad> <pad> ...]
↓ ↓ ↓ ↓ ↓ ↓
target (next): [yo trabajo / I work <eos> <pad> <pad> ...]
The loss at position l measures "did the model predict target[l] from input[:l+1]?". Critically: the first target token is yo, not <bos>; the last input position predicts <eos> from the preceding context, which is a meaningful loss term — <eos> placement is a learned signal. The position where the target is <pad> is the first place we mask out: there, the model is being asked "predict <pad> from real tokens", which is nonsense and we don't want gradients flowing through it.
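The alignment described above can be spelled out on token strings. This is only an illustration that uses readable strings instead of token ids; it lists every (input, target) pair and whether the loss mask keeps it.

```python
# One padded row from the example batch, as strings for readability.
tokens = ["<bos>", "yo", "trabajo", "/", "I", "work", "<eos>", "<pad>", "<pad>"]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it must predict at each position

# A pair is kept in the loss only if its target is a real token.
pairs = [(inp, tgt, tgt != "<pad>") for inp, tgt in zip(inputs, targets)]

for inp, tgt, keep in pairs:
    print(f"{inp:>8} -> {tgt:<8} {'kept' if keep else 'masked'}")
```

Six pairs are kept, from `<bos> -> yo` through `work -> <eos>`. The first masked pair is `<eos> -> <pad>`.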
The standard implementation:
shifted_inputs = input_ids[:, :-1] # (B, L-1)
shifted_targets = input_ids[:, 1:] # (B, L-1)
loss_mask = (shifted_targets != PAD_ID).astype(np.float32) # (B, L-1)
logits = model(shifted_inputs) # (B, L-1, V)
per_token_loss = cross_entropy(logits, shifted_targets) # (B, L-1)
loss = (per_token_loss * loss_mask).sum() / loss_mask.sum() # scalar
The denominator loss_mask.sum() is the number of real predicted tokens in the batch. This is the token-level mean — the only reduction that makes the LR meaningful across batches with different sequence-length distributions. We derive why next.
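The snippet above leaves `model` and `cross_entropy` undefined. Here is a self-contained sketch of the same reduction with a plain NumPy cross-entropy. The helper names `cross_entropy` and `token_mean_loss`, the pad id, and the toy shapes are assumptions for illustration, not the lab's actual code.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Per-token negative log-likelihood: (B, T, V) logits, (B, T) targets -> (B, T)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]

def token_mean_loss(logits, targets, pad_id):
    """Masked token-level mean: sum over real targets / number of real targets."""
    loss_mask = (targets != pad_id).astype(np.float32)
    per_token = cross_entropy(logits, targets)
    return (per_token * loss_mask).sum() / loss_mask.sum()

PAD_ID = 0  # assumed pad id
V = 5       # toy vocabulary size
targets = np.array([[1, 2, 3], [4, PAD_ID, PAD_ID]])

# Uniform logits: every real token costs exactly log(V).
logits = np.zeros((2, 3, V))
loss = token_mean_loss(logits, targets, PAD_ID)

# Changing the logits at padded target positions must not change the loss.
noisy = logits.copy()
noisy[1, 1:, :] = np.random.default_rng(0).normal(size=(2, V))
loss_noisy = token_mean_loss(noisy, targets, PAD_ID)

print(loss, loss_noisy, np.log(V))
```

Both losses equal `log(5)`, which confirms that the masked positions contribute nothing to the scalar.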
Loss reduction: the trap¶
You can reduce the per-token loss tensor (B, L-1) into a scalar three ways:
| Reduction | Formula | Per-step gradient scale |
|---|---|---|
| Token-level mean | \(\frac{1}{\sum_{b,l} M_{b,l}} \sum_{b,l} M_{b,l} \cdot \ell_{b,l}\) | \(O(1)\) — invariant to \(B, L\) |
| Sequence-level mean (per-sequence sum, then batch mean) | \(\frac{1}{B} \sum_b \sum_l M_{b,l} \ell_{b,l}\) | \(O(L)\) — grows with sequence length |
| Sum | \(\sum_{b,l} M_{b,l} \ell_{b,l}\) | \(O(B \cdot L)\) — grows with both |
If you use "sum" with lr=1e-3, then doubling the batch size roughly doubles the gradient, and with it the effective step size, because gradients are summed but the update is grad × lr. You'd need to re-tune lr for every batch size. If you use "sequence mean", you re-tune for every change in average sequence length. Token-level mean is the only reduction whose lr doesn't depend on batching choices.
Borja's convention for Phase 18 going forward: token-level mean. Pin it. Test for it. Don't change it without writing the change into the manifest.json.
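The scaling column of the table can be checked numerically. The sketch below gives every real token the same loss of 2.0 (an arbitrary value) and compares the three reductions on a base batch, on a batch with twice as many sequences, and on a batch with sequences twice as long. The normalizers do not depend on the parameters, so the gradient scales by the same factor as the loss value. The helper names are hypothetical.

```python
import numpy as np

def mask_from_lengths(lengths, width):
    """(B, width) float mask: 1 for the first lengths[b] positions of row b."""
    return (np.arange(width)[None, :] < np.asarray(lengths)[:, None]).astype(np.float32)

def reductions(per_token_loss, mask):
    """The three reductions from the table, applied to a masked (B, T) loss tensor."""
    masked_sum = (per_token_loss * mask).sum()
    B = per_token_loss.shape[0]
    return {
        "token_mean": float(masked_sum / mask.sum()),
        "sequence_mean": float(masked_sum / B),
        "sum": float(masked_sum),
    }

def run(lengths, width, token_loss=2.0):
    mask = mask_from_lengths(lengths, width)
    return reductions(np.full(mask.shape, token_loss), mask)

base = run([3, 1], width=3)            # 2 sequences, 4 real tokens
double_B = run([3, 1, 3, 1], width=3)  # twice as many sequences
double_L = run([6, 2], width=6)        # same B, sequences twice as long

for name, result in [("base", base), ("double_B", double_B), ("double_L", double_L)]:
    print(name, result)
```

Only `token_mean` stays at 2.0 in all three cases. `sequence_mean` doubles with sequence length, and `sum` doubles with either change.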
Derivation: why token-mean is the right choice¶
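The model produces a probability distribution per token. The log-likelihood of the corpus is:
\[\log p_\theta(\text{corpus}) = \sum_{b,l} M_{b,l} \log p_\theta(x_{b,l+1} \mid x_{b,1..l}) = -\sum_{b,l} M_{b,l} \, \ell_{b,l}\]
Here \(x_{b,l}\) is token \(l\) of sequence \(b\), the sum runs over every sequence in the corpus, and \(M_{b,l}\) and \(\ell_{b,l}\) are the loss mask and per-token loss from the table above.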
Maximum-likelihood training maximizes this sum. The gradient of the sum is the sum of gradients. But SGD/Adam takes a step proportional to the mean over the batch; that's what lr measures: the size of the update per "unit of work". If the unit of work is "one token's likelihood," then token-mean is what you want. Sequence-mean treats every sentence as one unit regardless of length. In the per-sequence-sum form from the table, that makes the gradient scale with average sequence length; in the variant that first averages within each sentence, it also gives each token of a short sentence more weight than each token of a long one. Sum treats every batch as one unit, which makes lr depend on B.
This is not just bookkeeping. A model trained with a per-sentence-averaged loss systematically over-emphasizes short sentences. For verb conjugations, short sentences are the regular present-tense ones (I work); long sentences are the periphrastic future forms (he is going to work). Such a loss biases the model toward the easy cases, which is exactly the opposite of what we want: the hard irregular long-tail cases should get more attention, not less.
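To put numbers on the over-weighting of short sentences, assume the sequence-level variant that averages within each sequence before averaging over the batch. That is an assumption: the table's formula uses a per-sequence sum, which weights all tokens in a batch equally. The sketch computes the weight a single real token receives in the scalar loss. The function name and the lengths are hypothetical.

```python
import numpy as np

def token_weights(lengths, scheme):
    """Weight of one real token in the scalar loss, for each sequence in the batch."""
    lengths = np.asarray(lengths, dtype=np.float64)
    if scheme == "token_mean":
        # Every real token in the batch shares the same weight.
        return np.full_like(lengths, 1.0 / lengths.sum())
    if scheme == "per_sequence_mean":
        # Each sequence gets weight 1/B, split evenly among its own tokens.
        return 1.0 / (len(lengths) * lengths)
    raise ValueError(f"unknown scheme: {scheme}")

lengths = [3, 6]  # hypothetical: a short sentence and one twice as long

w_token = token_weights(lengths, "token_mean")
w_seq = token_weights(lengths, "per_sequence_mean")

print("token_mean:       ", w_token)
print("per_sequence_mean:", w_seq)
print("short/long ratio: ", w_seq[0] / w_seq[1])
```

Under token-mean, every token weighs 1/9. Under the per-sequence average, a token of the short sentence weighs 1/6 and a token of the long one 1/12, a factor of two in favour of the short sentence.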
Sequence packing (optional, deferred)¶
A common optimization is to concatenate multiple short sequences into one long packed sequence, with a "sequence ID" mask in the attention that prevents cross-sequence attention. This raises GPU utilization on long-context models. We do not do this in Phase 18. The corpus has 600 forms; padding is cheap. Sequence packing introduces its own off-by-one mask bugs that aren't worth debugging at this scale. Phase 27's modern-attention notes touch on packing for FlashAttention.
Putting it together: the canonical batching step¶
```python
import numpy as np


def make_batch(
    examples: list[list[int]], pad_id: int
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Pad variable-length token-id sequences; return (inp, tgt, loss_mask, attn_pad_mask)."""
    B = len(examples)
    L = max(len(x) for x in examples)
    input_ids = np.full((B, L), pad_id, dtype=np.int32)
    for i, x in enumerate(examples):
        input_ids[i, :len(x)] = x
    # Causal shift produces input/target of length L-1.
    inp = input_ids[:, :-1]  # (B, L-1)
    tgt = input_ids[:, 1:]   # (B, L-1)
    # Loss mask: 1 where target is not pad.
    loss_mask = (tgt != pad_id).astype(np.float32)      # (B, L-1)
    # Attention mask: 1 where input is not pad (used by attention layer).
    attn_pad_mask = (inp != pad_id).astype(np.float32)  # (B, L-1)
    return inp, tgt, loss_mask, attn_pad_mask
```
The attention layer then combines attn_pad_mask with a causal triangular mask internally. Lab 00 implements this.
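How that combination might look inside the attention layer is sketched below. This is an illustrative guess, not Lab 00's code. A (B, T) `attn_pad_mask` is AND-ed with a lower-triangular causal mask, and disallowed logits are set to \(-\infty\) before the softmax. The function name `masked_softmax` and the toy shapes are assumptions.

```python
import numpy as np

def masked_softmax(scores, allowed):
    """Softmax over keys; scores (B, T, T), allowed (B, T, T) bool with True = may attend."""
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                              # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

# One row of length 3 whose last input position is padding.
attn_pad_mask = np.array([[1, 1, 0]], dtype=np.float32)      # (B=1, T=3)
T = attn_pad_mask.shape[1]

causal = np.tril(np.ones((T, T), dtype=bool))                # key j <= query i
allowed = causal[None, :, :] & (attn_pad_mask[:, None, :] > 0)  # (B, T, T)

scores = np.zeros((1, T, T))  # equal logits, so weights are uniform over allowed keys
weights = masked_softmax(scores, allowed)
print(weights[0])
```

The padded key receives zero weight from every query. With right padding and a real first token, every query row keeps at least one allowed key, so the softmax never divides by zero. A row with no allowed key would produce NaN and would need separate handling.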
The held-out split for verb grammar¶
The Phase-12 corpus is the 20 × 5 × 3 = 300 English forms + 300 Spanish forms = 600 total, plus the pairing structure. We need a held-out (val) split.
Three split policies, all valid:
- Hold out 4 verbs entirely. Train on 16 verbs × 5 tenses × 3 persons = 240 forms; val on 4 verbs × 5 tenses × 3 persons = 60 forms. Tests new-verb generalization: can the model conjugate a verb it never saw?
- Hold out 1 tense entirely. Train on 20 verbs × 4 tenses × 3 persons = 240 forms; val on 20 verbs × 1 tense × 3 persons = 60 forms. Tests new-tense generalization: can the model handle a tense pattern it never saw?
- Hold out 1 person entirely. Train on 20 × 5 × 2 = 200; val on 20 × 5 × 1 = 100. Tests new-person generalization.
Phase 18's default is policy 1: hold out 4 verbs — 2 regular (look, like) + 2 irregular (see, eat). This isolates "can the model learn the regular/irregular pattern and apply it to unseen lexemes?" The Phase-14 baseline is recomputed on the same split, so the comparison is apples-to-apples.
Random uniform 80/20 holdout is forbidden. With 600 forms and a model that has ~103k params, uniform random hold-out is trivially memorizable; the model can interpolate (verb, tense, person) cells by averaging the surrounding forms in embedding space without learning the underlying rule. That's not generalization — that's high-dimensional spreadsheet lookup. The "hold out entire dimensions" policy forces actual rule-learning.
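A minimal sketch of the default policy, splitting by verb over the (verb, tense, person) grid. The four held-out verbs come from the text. The other 16 verb names and the tense and person labels are placeholders, since the corpus file itself is not shown here.

```python
from itertools import product

HELD_OUT_VERBS = {"look", "like", "see", "eat"}  # policy 1, as stated above

# Placeholder names for everything not listed in this section.
verbs = sorted(HELD_OUT_VERBS) + [f"verb_{i:02d}" for i in range(16)]
tenses = [f"tense_{i}" for i in range(5)]
persons = [f"person_{i}" for i in range(3)]

# One cell per (verb, tense, person) combination: 20 x 5 x 3 = 300.
cells = list(product(verbs, tenses, persons))

# Split on the verb only, so no held-out lexeme ever appears in training.
train = [c for c in cells if c[0] not in HELD_OUT_VERBS]
val = [c for c in cells if c[0] in HELD_OUT_VERBS]

print(len(cells), len(train), len(val))
```

The split is a pure function of the verb, so it is deterministic and reproduces the 240 / 60 counts given above.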
Drill problems (do before lab 00)¶
- The batch has 4 sequences of length [6, 11, 9, 14]. After padding and causal shift, what's the shape of input_ids[:, :-1]? How many <pad> positions are in loss_mask (i.e., zeros)?
- You forgot to mask <pad> in the loss, and 30% of tokens in the batch are <pad>. The reported loss is 0.7× the true loss (assume the model already predicts <pad> with near-zero loss). Show why.
- You set lr = 1e-3 and train with batch size 32 using token-mean reduction. You double the batch to 64. Do you need to re-tune the LR? Why or why not?
- The Phase-12 corpus has 600 forms. You hold out 4 verbs. The training set has 240 forms. Your val PPL is 9.0; your train PPL is 4.5. The Phase-14 baseline val PPL is 13.0. Is the model overfitting? Is it beating baseline?
If you can answer all four, move on. Otherwise re-read the relevant section.
One-paragraph recap¶
A batch is (B, L) of token IDs, padded to the longest sequence. Two masks are derived from input != PAD: an attention mask (combined with the causal triangle) of shape (B, L, L), and a loss mask of shape (B, L-1) applied after the causal shift. The loss is reduced as a token-level mean to keep lr invariant to batch shape. The held-out split holds out entire verbs, not random forms, to force rule-learning over memorization. Three silent bugs hide here: attention without the pad mask, loss without the pad mask, and sum/sequence-mean reduction silently rescaling the gradient. Read this file twice before lab/00.
What this section does NOT cover¶
- Sequence packing for throughput. Phase 27.
- Curriculum learning / dynamic batching. Out of scope; same fixed config the whole run.
- Augmentation (back-translation, token dropout). Phase 28 if at all.
Next: theory/02-optimizer-and-schedule.md.