Break 00: Misaligned shard sizes across DP workers; all-reduce stalls
The NCCL/gloo all-reduce assumes that every worker contributes exactly the same number of bytes. If one worker has a shard 1 element smaller than the others, the operation either hangs indefinitely or returns meaningless results. This /break introduces that mismatch and shows what the stall looks like.
What you'll do
Make the data-parallel split assign one fewer sample to worker 0 than to the others. The forward pass still works (gradients are well-defined), but the all-reduce step expects every worker to contribute a buffer of identical size and either hangs or returns silently wrong values.
Step 1: Locate the shard splitter
src/minidp/dataloader.py # the per-worker shard slicing (Phase 35 lab 01)
src/minidp/allreduce.py # the gloo all-reduce wrapper
Step 2: Introduce the bug
In src/minidp/dataloader.py, the sharding currently uses np.array_split, which keeps shards balanced. Replace it with an unbalanced slice:
# OLD — balanced shards, total batch = N * B per step
shards = np.array_split(np.arange(global_batch), world_size)
# NEW (the broken version)
# Worker 0 gets one fewer sample than the others.
shards = np.array_split(np.arange(global_batch), world_size)
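if rank == 0:
    # Trim worker 0's own shard; slicing the list itself (shards[:-1])
    # would drop the last worker's shard instead of one sample.
    shards[0] = shards[0][:-1]  # one fewer sample on worker 0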
The training loop still produces gradients. The shapes diverge by 1 only in the batch dimension, which gets averaged away inside the loss, so the gradient tensors are still well-defined and have the same shape as the parameters; the bug doesn't trip a shape check.
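To see the imbalance in isolation, here is a minimal sketch that reproduces the split outside the training loop. The helper name `shard_sizes` and the values 32 and 4 are illustrative assumptions, not part of the lab code, and the broken branch assumes the bug trims one sample from worker 0's own shard.

```python
import numpy as np

def shard_sizes(global_batch, world_size, rank, broken=False):
    """Return the per-worker shard lengths as seen from `rank` (hypothetical helper)."""
    shards = np.array_split(np.arange(global_batch), world_size)
    if broken and rank == 0:
        # Assumed form of the bug: drop the last sample of worker 0's shard.
        shards[0] = shards[0][:-1]
    return [len(s) for s in shards]

balanced = shard_sizes(global_batch=32, world_size=4, rank=0)
broken = shard_sizes(global_batch=32, world_size=4, rank=0, broken=True)
print(balanced)  # [8, 8, 8, 8]
print(broken)    # [7, 8, 8, 8]
```

Only the batch dimension changes here, which is why a check on parameter or gradient shapes would not catch it.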
But the gradient norm on worker 0 is wrong (it's averaged over B-1 samples instead of B), and depending on the gloo build, the all-reduce either:
- hangs (gloo with strict size assertions),
- silently completes with a wrong sum (older gloo paths),
- or trips a RuntimeError: size mismatch from the wrapper if the buffer is dynamically resized.
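The "silently wrong" case can be illustrated with plain arithmetic. The sketch below uses made-up per-sample gradients for a single scalar parameter and two workers; `allreduce_mean` is a stand-in for a mean all-reduce, not the lab's gloo wrapper.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-sample gradients for one scalar parameter, 2 workers, B = 4.
per_sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
worker0_full, worker1 = per_sample[:4], per_sample[4:]
worker0_short = worker0_full[:-1]  # rank 0 lost its last sample

def allreduce_mean(local_means):
    """Unweighted average of per-worker means (stand-in for a mean all-reduce)."""
    return mean(local_means)

healthy = allreduce_mean([mean(worker0_full), mean(worker1)])
buggy = allreduce_mean([mean(worker0_short), mean(worker1)])
# What a sample-weighted average over the 7 surviving samples would give.
weighted = mean(worker0_short + worker1)

print(healthy)   # 4.5
print(buggy)     # 4.25
print(weighted)
```

The buggy result matches neither the healthy average nor the correctly weighted one, because the unweighted mean treats a worker with B-1 samples as if it held B.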
Step 3: Record the break
learners/borja/phase-35/notes/breaks.md:
- bug-id: 35-01
  concept: collective ops require matching contributions
  symptom: training step at iter 1 either hangs (no progress, no logs after
    "starting all-reduce") OR loss diverges within 10 steps with a
    wrong gradient magnitude on worker 0.
  hidden_cause: dataloader.py drops one sample on rank 0; effective batch
    differs across workers; the gradient average is wrong.
  hint_1: "Print len(batch) on each worker at iter 1. Are they equal?"
  hint_2: "Compare loss on worker 0 vs worker 1 for the same global step."
  hint_3: "grep 'array_split' in dataloader.py. What's the slice doing?"
  fix_diff: "remove the post-split `if rank == 0` block that trims worker 0's shard."
Step 4: Verify it's observable
Run `just dp-train` (or whatever the lab 01 entrypoint is). Expected output with the bug:
[rank=0] step 0 loss=2.40
[rank=1] step 0 loss=2.39
[rank=0] step 1 starting all-reduce ...
[rank=1] step 1 starting all-reduce ...
[STALL: 60 seconds, no further output]
or, on the variant that doesn't hang, the run continues and the loss diverges within 10 steps.
Either is observable. The test tests/phase35/test_dp_consistency.py::test_ranks_agree_at_step_zero goes red.
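A quick way to confirm the mismatch by hand is to collect the batch length from each rank and look for the odd one out. The helper below is a hypothetical diagnostic, not the contents of the test file; it assumes you have already gathered one length per rank, for example from log output.

```python
from collections import Counter

def find_odd_ranks(batch_lens):
    """Return {rank: length} for ranks whose batch length differs from the most common one."""
    expected, _ = Counter(batch_lens).most_common(1)[0]
    return {rank: n for rank, n in enumerate(batch_lens) if n != expected}

print(find_odd_ranks([8, 8, 8, 8]))  # {}
print(find_odd_ranks([7, 8, 8, 8]))  # {0: 7}
```

An empty result means every rank contributes the same number of samples; any entry points at the rank to inspect.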
Step 5: The teaching moment
Collective operations are synchronous and size-strict by design. The performance argument (Phase 35 theory 03/05) depends on every worker contributing the same number of bytes; an asymmetric contribution either deadlocks the protocol or produces a meaningless sum.
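The size-strict contract can be sketched with a toy, single-process sum all-reduce. This is an illustration of the idea only: `allreduce_sum` is a made-up function, and real gloo or NCCL collectives run across processes and may hang rather than raise.

```python
def allreduce_sum(buffers):
    """Simulated sum all-reduce: every rank must contribute the same length."""
    lengths = {len(b) for b in buffers}
    if len(lengths) != 1:
        raise RuntimeError(f"size mismatch: buffer lengths {sorted(lengths)}")
    total = [sum(vals) for vals in zip(*buffers)]
    # Every rank receives its own copy of the same reduced buffer.
    return [list(total) for _ in buffers]

ok = allreduce_sum([[1.0, 2.0], [3.0, 4.0]])
print(ok)  # [[4.0, 6.0], [4.0, 6.0]]

try:
    allreduce_sum([[1.0], [3.0, 4.0]])
except RuntimeError as exc:
    print(exc)  # size mismatch: buffer lengths [1, 2]
```

Without the explicit length check, `zip` would quietly truncate to the shortest buffer, which is a small-scale analogue of the "meaningless sum" outcome.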
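This is one of the most common bugs in real distributed training. A dataloader that hands each worker B = ceil(global_batch / world_size) samples and gives the last worker whatever is left leaves that worker with B - (world_size - r) items, where r = global_batch % world_size and r > 0. For example, global_batch = 10 and world_size = 4 yields shards of 3, 3, 3 and 1. Unless you pad or drop consistently, gradients diverge. The standard fix is DistributedSampler(drop_last=True) or padding.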
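The two remedies can be sketched in plain Python. The function below is a hypothetical helper loosely modeled on the drop-or-pad behaviour of a distributed sampler; it is not PyTorch's implementation, and it assumes n_samples >= world_size.

```python
def equal_shards(n_samples, world_size, drop_last):
    """Split range(n_samples) into world_size shards of identical length."""
    indices = list(range(n_samples))
    if drop_last:
        # Drop the tail so the total divides evenly.
        per_rank = n_samples // world_size
        indices = indices[: per_rank * world_size]
    else:
        # Pad by repeating the first samples until the total divides evenly.
        per_rank = -(-n_samples // world_size)  # ceiling division
        pad = per_rank * world_size - n_samples
        indices += indices[:pad]
    # Strided assignment: rank r takes indices r, r + world_size, ...
    return [indices[r::world_size] for r in range(world_size)]

dropped = equal_shards(10, 4, drop_last=True)
padded = equal_shards(10, 4, drop_last=False)
print([len(s) for s in dropped])  # [2, 2, 2, 2]
print([len(s) for s in padded])   # [3, 3, 3, 3]
```

Dropping loses up to world_size - 1 samples per epoch, while padding shows a few samples twice; either way every rank contributes the same count, which is what the collective needs.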
The lesson: the dataloader is part of the distributed contract, not a local concern.
Hard rules respected
- Single bug; single conditional.
- Reversible in 2 lines.
- Observable (hang or divergence; either is a failing test).
- No security impact.
- Tests not modified.
Next: when green, read ../theory/05-ring-allreduce-derivation-and-strategy-choice.md.