01 — Batching, padding, masks, and loss reduction

This is where we decide what counts as "one example". Three trivial conventions (mean vs. sum, the padding mask, the causal shift) determine whether the gradient points where you want to go or toward the median of your implementation errors.


This file is the longest in Phase 18 because it covers the four most common training-loop bugs in the same place. Read it twice. The math is trivial; the conventions are not.

The batch as a 2-D tensor

A batch is a tensor of shape (B, L), where:

  • B = batch size (number of sequences in this step).
  • L = sequence length (number of tokens per sequence, after padding to the batch's max).

For the verb-grammar corpus, a single example is a tokenized string like:

"yo trabajo / I work"          → tokens: [<bos>, "yo", "trabajo", "/", "I", "work", <eos>]
"ella va a trabajar / she is going to work"  → [<bos>, "ella", "va", "a", "trabajar", "/", "she", "is", "going", "to", "work", <eos>]

Lengths vary from 6 to 14 tokens. We pack the batch by padding all sequences to the longest one with a special <pad> token:

input_ids: shape (B=4, L=12)
[<bos> yo   trabajo  /  I   work <eos> <pad> <pad> <pad> <pad> <pad>]
[<bos> tú   trabajas /  you work <eos> <pad> <pad> <pad> <pad> <pad>]
[<bos> él   trabaja  /  he  work s <eos> <pad> <pad> <pad> <pad>    ]
[<bos> ella va  a   trabajar / she is going to work <eos>           ]

<pad> is a real token in the vocabulary, but it must never contribute to either the loss or the attention. That's what the attention mask + loss mask are for. Confusing the two, or skipping one, is bug #1 in this phase.

Two masks, not one

There are two masks, and they have different shapes:

  1. The attention mask is (B, L, L) (or (B, 1, L, L), broadcast across heads). It tells the attention layer "query position \(i\) may not attend to key position \(j\)", for either of two reasons:
     • \(j > i\): the future. Causal mask.
     • position \(j\) is <pad> in this batch row. Padding mask.
     The two are combined into a single mask: a key is visible only if it is both non-future and non-pad. The attention layer then sets the blocked positions to \(-\infty\) in the pre-softmax scores.

  2. The loss mask is (B, L). It tells the loss reduction "position \((b, l)\) is a real predicted token; include it in the average." It's 1 everywhere except where the target token is <pad>. (Note: the target, not the input. We'll see why in a moment.)

If you accidentally use only one mask:

  • Skip the attention pad mask → the model attends to <pad> keys and learns to copy <pad> predictions, contaminating other positions.
  • Skip the loss pad mask → loss is averaged including <pad> positions; the optimizer gets gradients that push the model to predict <pad> more confidently, hurting everything else.

Both must exist. Both must be derived from the same source of truth: the per-token boolean is_real = (input_ids != PAD_ID). Two derived views, one fact.
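As a minimal sketch of "two derived views, one fact", the snippet below builds both masks from a single is_real array. The toy token ids and PAD_ID = 0 are assumptions for illustration, and the loss mask is taken on the shifted targets, as explained in the next section.

```python
import numpy as np

PAD_ID = 0  # assumed pad id for this sketch

# Toy batch: two right-padded rows (ids are arbitrary).
input_ids = np.array([
    [1, 5, 6, 2, 0, 0],
    [1, 7, 8, 9, 4, 2],
])
B, L = input_ids.shape

# Single source of truth.
is_real = input_ids != PAD_ID                          # (B, L) bool

# Attention view: query i may attend key j iff j <= i and key j is real.
causal = np.tril(np.ones((L, L), dtype=bool))          # (L, L)
attn_mask = causal[None, :, :] & is_real[:, None, :]   # (B, L, L)

# Loss view: real *target* tokens, i.e. is_real shifted by one position.
loss_mask = is_real[:, 1:].astype(np.float32)          # (B, L-1)
```

Here True in attn_mask means "allowed"; a layer that expects a "blocked" mask would use its negation.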

The causal shift

For language modeling, the model predicts token \(l+1\) from tokens \(1..l\). So the input to the model is input_ids[:, :-1] and the target is input_ids[:, 1:]. This is the causal shift.

input  to model:  [<bos> yo  trabajo /  I  work  <eos> <pad> <pad> ...]
                          ↓   ↓       ↓  ↓   ↓     ↓
target (next):    [yo   trabajo /     I  work <eos> <pad>  <pad> ...]

The loss at position l measures "did the model predict target[l] from input[:l+1]?". Critically: the first target token is yo, not <bos>; and the last real input token before <eos> (here, work) predicts <eos> from the preceding context, which is a meaningful loss term, because <eos> placement is a learned signal. The first position where the target is <pad> (the input position holding <eos>) is the first place we mask out: there, the model is being asked "predict <pad> from real tokens", which is nonsense and we don't want gradients flowing through it.

The standard implementation:

import numpy as np

# `model` and `cross_entropy` stand in for the network and for a per-token
# cross-entropy that returns one loss value per position (no reduction).
shifted_inputs  = input_ids[:, :-1]          # (B, L-1)
shifted_targets = input_ids[:, 1:]           # (B, L-1)
loss_mask       = (shifted_targets != PAD_ID).astype(np.float32)  # (B, L-1)

logits = model(shifted_inputs)               # (B, L-1, V)
per_token_loss = cross_entropy(logits, shifted_targets)  # (B, L-1)
loss = (per_token_loss * loss_mask).sum() / loss_mask.sum()  # scalar

The denominator loss_mask.sum() is the number of real predicted tokens in the batch. This is the token-level mean — the only reduction that makes the LR meaningful across batches with different sequence-length distributions. We derive why next.
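To make the token-level mean concrete, here is a self-contained sketch under stated assumptions: random logits stand in for model(...), the cross_entropy helper is one plausible NumPy implementation (not necessarily the one Lab 00 uses), and PAD_ID = 0 and V = 10 are arbitrary. It also checks the property we care about: logits at masked positions cannot change the loss.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Per-token negative log-likelihood. logits: (B, T, V), targets: (B, T)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]

def masked_token_mean(per_token_loss, loss_mask):
    """Sum of losses over real target tokens divided by their count."""
    return (per_token_loss * loss_mask).sum() / loss_mask.sum()

PAD_ID, V = 0, 10   # assumed values for this sketch
input_ids = np.array([[1, 5, 6, 2, 0, 0],
                      [1, 7, 8, 9, 4, 2]])
inputs, targets = input_ids[:, :-1], input_ids[:, 1:]
loss_mask = (targets != PAD_ID).astype(np.float32)              # (B, L-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(*inputs.shape, V))                    # stand-in for model(inputs)
loss = masked_token_mean(cross_entropy(logits, targets), loss_mask)

# Perturb the logits only where the target is <pad>; the loss must not move.
logits2 = logits.copy()
logits2[0, 3:] += 100.0 * rng.normal(size=(2, V))
loss2 = masked_token_mean(cross_entropy(logits2, targets), loss_mask)
```

A quick sanity check for any implementation of this kind: all-zero logits give a uniform distribution, so the loss should equal log(V).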

Loss reduction: the trap

You can reduce the per-token loss tensor (B, L-1) into a scalar three ways:

| Reduction | Formula | Per-step gradient scale |
|---|---|---|
| Token-level mean | \(\frac{1}{\sum_{b,l} M_{b,l}} \sum_{b,l} M_{b,l} \cdot \ell_{b,l}\) | \(O(1)\), invariant to \(B, L\) |
| Sequence-level mean (per-sequence sum, then batch mean) | \(\frac{1}{B} \sum_b \sum_l M_{b,l} \ell_{b,l}\) | \(O(L)\), grows with sequence length |
| Sum | \(\sum_{b,l} M_{b,l} \ell_{b,l}\) | \(O(B \cdot L)\), grows with both |

If you use "sum" with lr=1e-3, then doubling the batch size roughly doubles the gradient, and with it the size of the update (for plain SGD the update is grad × lr), so you would have to halve lr to take the same step. You'd need to re-tune lr for every batch size. If you use "sequence mean", you re-tune for every change in average sequence length. Of the three, token-level mean is the only reduction whose lr doesn't depend on batching choices.
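The scaling can be checked numerically. In the sketch below every token is real and has loss 1.0, so the only thing that varies is the reduction; because each reduction is linear in the per-token losses, the gradient scales the same way as the scalar. The helper names are made up for this illustration.

```python
import numpy as np

def reductions(per_token_loss, mask):
    """Return (token_mean, sequence_mean, total_sum) for a (B, T) loss tensor."""
    masked = per_token_loss * mask
    token_mean = masked.sum() / mask.sum()
    sequence_mean = masked.sum(axis=1).mean()   # per-sequence sum, then batch mean
    total_sum = masked.sum()
    return float(token_mean), float(sequence_mean), float(total_sum)

def toy_batch(B, T):
    """Every token is real and has loss 1.0."""
    return np.ones((B, T)), np.ones((B, T))

small = reductions(*toy_batch(B=4, T=8))    # (1.0, 8.0, 32.0)
wide  = reductions(*toy_batch(B=8, T=8))    # (1.0, 8.0, 64.0)
long_ = reductions(*toy_batch(B=4, T=16))   # (1.0, 16.0, 64.0)
```

Only the first component stays at 1.0 across all three batch shapes; the second tracks T and the third tracks B × T.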

Borja's convention for Phase 18 going forward: token-level mean. Pin it. Test for it. Don't change it without writing the change into the manifest.json.

Derivation: why token-mean is the right choice

The model produces a probability distribution per token. The log-likelihood of the corpus is:

\[\log P(\text{corpus}) = \sum_{\text{real tokens}} \log P(t \mid \text{context}(t))\]

Maximum-likelihood training maximizes this sum. The gradient of the sum is the sum of gradients. But the optimizer's step is lr times the gradient of whatever scalar you hand it, so the normalization of that scalar decides what lr measures: the size of the update per "unit of work". If the unit of work is "one token's likelihood," then token-mean is what you want. Sequence-mean makes the unit one sentence: with the per-sequence sum from the table, the step grows with sentence length; if you instead average within each sentence before averaging over the batch, every sentence counts equally regardless of length, which over-weights the tokens of short sentences relative to long ones. Sum treats every batch as one unit, which makes lr depend on B.

This is not just bookkeeping. A model trained with a per-sequence mean (average each sequence's token losses, then average over the batch) systematically over-emphasizes short sentences: a token in a sequence with \(n\) real tokens gets weight \(1/(B \cdot n)\). For verb conjugations, short sentences are the regular present-tense ones (I work); long sentences are the periphrastic future forms (he is going to work). Such a loss biases the model toward the easy cases, which is exactly the opposite of what we want: the hard irregular long-tail cases should get more attention, not less.
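To put numbers on the over-emphasis, the sketch below computes the weight each token's loss receives in the final scalar for a toy batch with one short and one long sequence. It compares the token-level mean with a per-sequence mean (average within each sequence, then across the batch); the lengths 3 and 9 are arbitrary.

```python
import numpy as np

# Two sequences: 3 real target tokens and 9 real target tokens.
mask = np.zeros((2, 9))
mask[0, :3] = 1.0
mask[1, :9] = 1.0
B = mask.shape[0]

# Weight of each token's loss in the final scalar (d loss / d per_token_loss).
token_mean_w = mask / mask.sum()                               # 1/12 for every real token
per_seq_mean_w = mask / mask.sum(axis=1, keepdims=True) / B    # 1/6 (short) vs 1/18 (long)

# How much more a short-sentence token counts than a long-sentence token.
short_vs_long = per_seq_mean_w[0, 0] / per_seq_mean_w[1, 0]    # 3.0
```

Under the token-level mean the ratio is 1; under the per-sequence mean it equals the ratio of the two lengths.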

Sequence packing (optional, deferred)

A common optimization is to concatenate multiple short sequences into one long packed sequence, with a "sequence ID" mask in the attention that prevents cross-sequence attention. This raises GPU utilization on long-context models. We do not do this in Phase 18. The corpus has 600 forms; padding is cheap. Sequence packing introduces its own off-by-one mask bugs that aren't worth debugging at this scale. Phase 27's modern-attention notes touch on packing for FlashAttention.

Putting it together: the canonical batching step

import numpy as np

def make_batch(
    examples: list[list[int]], pad_id: int
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Pad variable-length token-id sequences; return (inp, tgt, loss_mask, attn_pad_mask)."""
    B = len(examples)
    L = max(len(x) for x in examples)
    # Start from an all-pad tensor and copy each sequence into its row.
    input_ids = np.full((B, L), pad_id, dtype=np.int32)
    for i, x in enumerate(examples):
        input_ids[i, :len(x)] = x
    # Causal shift produces input/target of length L-1.
    inp = input_ids[:, :-1]              # (B, L-1)
    tgt = input_ids[:, 1:]               # (B, L-1)
    # Loss mask: 1 where target is not pad.
    loss_mask = (tgt != pad_id).astype(np.float32)   # (B, L-1)
    # Attention mask: 1 where input is not pad (used by attention layer).
    attn_pad_mask = (inp != pad_id).astype(np.float32)  # (B, L-1)
    return inp, tgt, loss_mask, attn_pad_mask

The attention layer then combines attn_pad_mask with a causal triangular mask internally. Lab 00 implements this.
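What "combines internally" might look like is sketched below. This is not Lab 00's code: combine_masks and masked_softmax are hypothetical helper names, and the sketch assumes the first position of every row is a real token (the <bos>), so no query row ends up with every key blocked.

```python
import numpy as np

def combine_masks(attn_pad_mask):
    """attn_pad_mask: (B, T), 1 for real tokens. Returns a (B, 1, T, T) bool 'allowed' mask."""
    T = attn_pad_mask.shape[1]
    causal = np.tril(np.ones((T, T), dtype=bool))               # key j <= query i
    keys_real = attn_pad_mask.astype(bool)[:, None, None, :]    # (B, 1, 1, T)
    return causal[None, None, :, :] & keys_real                 # (B, 1, T, T)

def masked_softmax(scores, allowed):
    """Softmax over the last axis with disallowed keys forced to weight 0."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)        # max is finite: key 0 is allowed
    weights = np.exp(scores)                                    # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

attn_pad_mask = np.array([[1, 1, 1, 0, 0]], dtype=np.float32)   # one row, 3 real tokens
allowed = combine_masks(attn_pad_mask)                          # (1, 1, 5, 5)
scores = np.random.default_rng(0).normal(size=(1, 2, 5, 5))     # (B, heads, T, T)
weights = masked_softmax(scores, allowed)                       # mask broadcasts over heads
```

The size-1 head axis is what lets one mask serve every head without copying it.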

The held-out split for verb grammar

The Phase-12 corpus is 20 verbs × 5 tenses × 3 persons = 300 English forms + 300 Spanish forms = 600 total, plus the pairing structure, so there are 300 paired examples. We need a held-out (val) split.

Three split policies, all valid (counts are in paired examples):

  1. Hold out 4 verbs entirely. Train on 16 verbs × 5 tenses × 3 persons = 240 pairs; val on 4 verbs × 5 tenses × 3 persons = 60 pairs. Tests new-verb generalization: can the model conjugate a verb it never saw?
  2. Hold out 1 tense entirely. Train on 20 verbs × 4 tenses × 3 persons = 240 pairs; val on 20 verbs × 1 tense × 3 persons = 60 pairs. Tests new-tense generalization: can the model handle a tense pattern it never saw?
  3. Hold out 1 person entirely. Train on 20 verbs × 5 tenses × 2 persons = 200 pairs; val on 20 verbs × 5 tenses × 1 person = 100 pairs. Tests new-person generalization: can the model handle a grammatical person it never saw?

Phase 18's default is policy 1: hold out 4 verbs — 2 regular (look, like) + 2 irregular (see, eat). This isolates "can the model learn the regular/irregular pattern and apply it to unseen lexemes?" The Phase-14 baseline is recomputed on the same split, so the comparison is apples-to-apples.
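A minimal sketch of policy 1, assuming each example is keyed by a (verb, tense, person) cell. Only the four held-out verbs are named in the text; the other verb names and the tense and person labels are hypothetical placeholders, not the Phase-12 inventory.

```python
from itertools import product

# The four held-out verbs come from the text; everything else is a placeholder.
HELD_OUT = {"look", "like", "see", "eat"}
verbs = sorted(HELD_OUT) + [f"verb_{i:02d}" for i in range(16)]   # 20 verbs
tenses = [f"tense_{i}" for i in range(5)]
persons = [f"person_{i}" for i in range(3)]

# One cell per paired example.
cells = list(product(verbs, tenses, persons))

# Split on the verb only, so no held-out lexeme ever appears in training.
train = [c for c in cells if c[0] not in HELD_OUT]
val = [c for c in cells if c[0] in HELD_OUT]
```

Splitting on the key rather than on a random index is what guarantees that the two sets share no verb.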

Random uniform 80/20 holdout is forbidden. With 600 forms and a model that has ~103k params, uniform random hold-out is trivially memorizable; the model can interpolate (verb, tense, person) cells by averaging the surrounding forms in embedding space without learning the underlying rule. That's not generalization — that's high-dimensional spreadsheet lookup. The "hold out entire dimensions" policy forces actual rule-learning.

Drill problems (do before lab 00)

  1. The batch has 4 sequences of length [6, 11, 9, 14]. After padding and causal shift, what's the shape of input_ids[:, :-1]? How many <pad> positions are in loss_mask (i.e., zeros)?
  2. You forgot to mask <pad> in the loss, and 30% of tokens in the batch are <pad>. Assuming the model already predicts <pad> with near-zero loss at those positions, the reported loss is 0.7× the true loss. Show why.
  3. You set lr = 1e-3 and train with batch size 32 using token-mean reduction. You double the batch to 64. Do you need to re-tune the LR? Why or why not?
  4. The Phase-12 corpus has 600 forms. You hold out 4 verbs. The training set has 240 paired examples. Your val PPL is 9.0; your train PPL is 4.5. The Phase-14 baseline val PPL is 13.0. Is the model overfitting? Is it beating baseline?

If you can answer all four, move on. Otherwise re-read the relevant section.

One-paragraph recap

A batch is (B, L) of token IDs, padded to the longest sequence. Two masks come from the same fact, token != PAD: an attention mask (the input pad mask combined with the causal triangle) of shape (B, L, L) over the positions the model sees, and a loss mask of shape (B, L-1) computed on the shifted targets. The loss is reduced as a token-level mean to keep lr invariant to batch shape. The held-out split holds out entire verbs, not random forms, to force rule-learning over memorization. Four bugs hide here: attention without the pad mask, loss without the pad mask, an off-by-one in the causal shift, and sum/sequence-mean reduction silently rescaling the gradient. Read this file twice before lab/00.

What this section does NOT cover

  • Sequence packing for throughput. Phase 27.
  • Curriculum learning / dynamic batching. Out of scope; same fixed config the whole run.
  • Augmentation (back-translation, token dropout). Phase 28 if at all.

Next: theory/02-optimizer-and-schedule.md.