Break 00 — Misaligned shard sizes across DP workers; all-reduce stalls

The NCCL/gloo all-reduce assumes that every worker contributes exactly the same number of bytes. If one worker's shard is one element smaller than the others, the operation either hangs indefinitely or returns meaningless results. This break introduces that mismatch and shows what the stall looks like.


What you'll do

Make the data-parallel split assign one fewer sample to worker 0 than to the others. The forward and backward passes still run (gradients are well-defined), but the all-reduce step expects every worker to contribute an identically sized buffer, and it either hangs or returns silently wrong values.

Step 1 — Locate the shard splitter

src/minidp/dataloader.py          # the per-worker shard slicing (Phase 35 lab 01)
src/minidp/allreduce.py           # the gloo all-reduce wrapper

Step 2 — Introduce the bug

In src/minidp/dataloader.py, the sharding currently uses np.array_split, which keeps shards balanced. Replace it with an unbalanced slice:

# OLD — balanced shards, total batch = N * B per step
shards = np.array_split(np.arange(global_batch), world_size)

# NEW (the broken version)
# Worker 0 gets one fewer sample than the others.
shards = np.array_split(np.arange(global_batch), world_size)
if rank == 0:
    # Trim the last sample from worker 0's own shard. Note that
    # shards[:-1] would drop the last worker's whole shard instead.
    shards[0] = shards[0][:-1]
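
To make the intended effect concrete, here is a minimal sketch of the per-rank shard sizes with and without the injected bug. It is stdlib-only: balanced_split is a hypothetical stand-in that follows the same size rule as np.array_split (the first n % k shards get one extra item), and shard_for_rank and inject_bug are illustrative names, not part of minidp.

```python
def balanced_split(n_items, world_size):
    """Stand-in for np.array_split(np.arange(n_items), world_size)."""
    base, extra = divmod(n_items, world_size)
    shards, start = [], 0
    for r in range(world_size):
        size = base + (1 if r < extra else 0)  # first `extra` shards get +1
        shards.append(list(range(start, start + size)))
        start += size
    return shards


def shard_for_rank(rank, n_items, world_size, inject_bug=False):
    """Return the sample indices one rank would train on."""
    shard = balanced_split(n_items, world_size)[rank]
    if inject_bug and rank == 0:
        shard = shard[:-1]  # worker 0 loses its last sample
    return shard


global_batch, world_size = 8, 2
balanced_sizes = [len(shard_for_rank(r, global_batch, world_size))
                  for r in range(world_size)]
broken_sizes = [len(shard_for_rank(r, global_batch, world_size, inject_bug=True))
                for r in range(world_size)]
print("balanced:", balanced_sizes)  # balanced: [4, 4]
print("broken:  ", broken_sizes)    # broken:   [3, 4]
```

Only rank 0 changes; every other rank sees exactly the shard it had before, which is why the bug is invisible from any single worker's logs.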

The training loop still produces gradients. The shapes differ by 1 only in the batch dimension, which is averaged away inside the loss, so the gradient tensors are still well-defined and match the parameter shapes; the bug doesn't trip a shape check.

But worker 0's gradient is averaged over B-1 samples instead of B, so the cross-worker average weights its samples incorrectly, and depending on the gloo build, the all-reduce either:

  • hangs (gloo with strict size assertions),
  • silently completes with a wrong sum (older gloo paths),
  • or trips a RuntimeError: size mismatch from the wrapper if the buffer is dynamically resized.
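
The "averaged over B-1 instead of B" effect can be checked with plain arithmetic. The sketch below uses made-up scalar per-sample gradients and a hypothetical allreduce_mean helper that averages the per-worker means, which is an assumption about how the wrapper combines gradients, not a description of the minidp code.

```python
def mean(xs):
    return sum(xs) / len(xs)


def allreduce_mean(per_worker_values):
    """Average one value per worker, as a mean-style all-reduce would."""
    return sum(per_worker_values) / len(per_worker_values)


# Made-up per-sample gradients for a single scalar parameter.
per_sample_grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

balanced = [per_sample_grads[:4], per_sample_grads[4:]]  # B = 4 on both workers
broken = [per_sample_grads[:3], per_sample_grads[4:]]    # worker 0 has B - 1 = 3

balanced_result = allreduce_mean([mean(s) for s in balanced])
broken_result = allreduce_mean([mean(s) for s in broken])
mean_of_used = mean(broken[0] + broken[1])  # true mean over the 7 samples seen

print(balanced_result)  # 4.5
print(broken_result)    # 4.25
print(mean_of_used)     # about 4.571, so the broken result is not even this
```

With balanced shards the mean of per-worker means equals the global mean. With the bug, each of worker 0's three samples carries weight 1/6 while each of worker 1's four samples carries weight 1/8, so the combined gradient matches neither the full-batch mean nor the mean over the samples actually used.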

Step 3 — Record the break

learners/borja/phase-35/notes/breaks.md:

- bug-id: 35-01
  concept: collective ops require matching contributions
  symptom: training step at iter 1 either hangs (no progress, no logs after
           "starting all-reduce") OR loss diverges within 10 steps with a
           wrong gradient magnitude on worker 0.
  hidden_cause: dataloader.py drops one sample on rank 0; effective batch
                differs across workers; the gradient average is wrong.
  hint_1: "Print len(batch) on each worker at iter 1. Are they equal?"
  hint_2: "Compare loss on worker 0 vs worker 1 for the same global step."
  hint_3: "grep 'array_split' in dataloader.py. What's the slice doing?"
  fix_diff: remove the `if rank == 0:` post-split tweak (the conditional and its body).

Step 4 — Verify it's observable

Run just dp-train (or whatever the lab 01 entrypoint is). Expected with bug:

[rank=0] step 0 loss=2.40
[rank=1] step 0 loss=2.39
[rank=0] step 1 starting all-reduce ...
[rank=1] step 1 starting all-reduce ...
[STALL: 60 seconds, no further output]

or, on the variant that doesn't hang:

[rank=0] step 10 loss=2.38
[rank=1] step 10 loss=1.92   <-- divergent!

Either is observable. The test tests/phase35/test_dp_consistency.py::test_ranks_agree_at_step_zero goes red.

Step 5 — The teaching moment

Collective operations are synchronous and size-strict by design. The performance argument (Phase 35 theory 03/05) depends on every worker contributing the same number of bytes; an asymmetric contribution either deadlocks the protocol or produces a meaningless sum.
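
As a toy illustration of "size-strict", the sketch below contrasts a checked element-wise sum with an unchecked one. It runs in a single process over plain lists and makes no claim about how gloo or NCCL behave internally; toy_allreduce_sum and unchecked_sum are hypothetical names.

```python
def toy_allreduce_sum(buffers):
    """Element-wise sum across ranks, refusing mismatched buffer sizes."""
    sizes = {len(b) for b in buffers}
    if len(sizes) != 1:
        raise ValueError(f"size mismatch across ranks: {sorted(sizes)}")
    return [sum(vals) for vals in zip(*buffers)]


def unchecked_sum(buffers):
    """Same sum without the check: zip silently truncates to the shortest."""
    return [sum(vals) for vals in zip(*buffers)]


matched = [[1, 2, 3], [4, 5, 6]]
mismatched = [[1, 2, 3], [4, 5]]  # rank 1 contributes one element fewer

print(toy_allreduce_sum(matched))  # [5, 7, 9]

try:
    toy_allreduce_sum(mismatched)
except ValueError as err:
    print("strict path:", err)

print("lenient path:", unchecked_sum(mismatched))  # [5, 7], last element lost
```

The strict path fails loudly; the lenient path returns a result of the wrong length without any error. A real multi-process collective has a third option that this single-process toy cannot show: peers waiting forever for bytes that never arrive.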

This is one of the most common bugs in real distributed training. A dataloader that hands the leftover samples to the last worker gives that worker fewer than B items whenever global_batch % world_size != 0, and unless you pad or drop consistently, gradients diverge. The standard fix is DistributedSampler(drop_last=True) or padding.
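
The two remedies can be sketched in a few lines. The function below is a simplified, stdlib-only illustration loosely modeled on the drop-or-pad idea behind DistributedSampler; shard_indices is a hypothetical helper, it does no shuffling, and it assumes n_items >= world_size.

```python
def shard_indices(n_items, world_size, rank, drop_last):
    """Return equally sized index lists for every rank.

    drop_last=True  -> discard the tail so n_items divides evenly.
    drop_last=False -> pad by wrapping around to the first indices.
    """
    if drop_last:
        per_rank = n_items // world_size
        total = per_rank * world_size
        indices = list(range(total))
    else:
        per_rank = -(-n_items // world_size)  # ceiling division
        total = per_rank * world_size
        indices = list(range(n_items))
        indices += indices[: total - n_items]  # repeat leading samples
    return indices[rank:total:world_size]      # strided slice per rank


n_items, world_size = 10, 4
dropped = [shard_indices(n_items, world_size, r, drop_last=True)
           for r in range(world_size)]
padded = [shard_indices(n_items, world_size, r, drop_last=False)
          for r in range(world_size)]
print(dropped)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(padded)   # [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
```

Either way, every rank ends up with the same number of samples. Dropping loses the last two samples for that pass, while padding counts samples 0 and 1 twice; which trade-off is acceptable depends on the dataset size and whether the pass is training or evaluation.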

The lesson: the dataloader is part of the distributed contract, not a local concern.

Hard rules respected

  • Single bug; single conditional.
  • Reversible in 2 lines.
  • Observable (hang or divergence; either is a failing test).
  • No security impact.
  • Tests not modified.

Next: when green, read ../theory/05-ring-allreduce-derivation-and-strategy-choice.md.