English · Español

Break — the kernel that overruns shared memory; what the failure looks like¶

🇪🇸 Reservamos un buffer en shared memory un poco más pequeño de lo que realmente usamos. En CUDA, este bug puede pasar inadvertido durante meses si el resto de la memoria compartida del bloque queda intacta. Lo causamos a propósito en un kernel pequeño y vemos los síntomas que delatan el problema.

Symptom Borja will see¶

Two CUDA kernels for a simple block-tiled matmul on a \(32 \times 32\) matrix using shared-memory tiles of size 16:

Run A (control): allocates __shared__ float tile[16][16] and indexes within bounds.
Run B (broken): allocates __shared__ float tile[16][15] (off-by-one in the second dimension) but indexes tile[r][c] for c ∈ [0, 16).

Both compile without warnings. Both run without crashing (no CUDA_ERROR_ILLEGAL_ADDRESS at runtime, because shared memory accesses are not bounds-checked by the hardware).

Output:

Run A: matches the CPU reference matmul exactly (within fp32 rounding).
Run B: differs from the reference by varying amounts, with a specific pattern — every 16^th element of every row is wrong, and the wrong values look like they came from neighboring rows.

For a \(32 \times 32\) output, Run B has ~30/1024 elements (~3%) that are wildly wrong, ~50/1024 (~5%) that are subtly wrong (off by 1-5%), and the rest are correct.

If you run a unit test with a tolerance of 1e-3 on max element-wise difference, it fails — but the failure message gives only the first bad element, not the pattern. A learner who fixes the first bad element by adding a guard might convince themselves the test now passes, while many other elements are still corrupted.

The break, mechanically¶

// Run A (control)
__global__ void tile_matmul_correct(...) {
    __shared__ float tile_a[16][16];
    __shared__ float tile_b[16][16];
    // ... load, sync, multiply, accumulate, store ...
    tile_a[threadIdx.y][threadIdx.x] = A[...];   // y in [0,16), x in [0,16)
    __syncthreads();
    // ...
}

// Run B (break)
__global__ void tile_matmul_broken(...) {
    __shared__ float tile_a[16][15];   // <-- one column too small
    __shared__ float tile_b[16][16];
    // SAME indexing as above
    tile_a[threadIdx.y][threadIdx.x] = A[...];   // when threadIdx.x = 15, writes tile_a[y][15] — OOB!
    __syncthreads();
    // ...
}

When threadIdx.x = 15, tile_a[threadIdx.y][15] writes one column past the allocated buffer. In CUDA's shared-memory layout, this overwrites the next allocation in the same block — which is the first element of tile_b[0][0]. The 16 threads writing the last column of tile_a each corrupt one element of tile_b.

The pattern: every 16^th element of tile_b is corrupted (because every threadIdx.y row's threadIdx.x=15 hits the same overflow region, but for different rows of tile_b).

Why this is the paradigmatic GPU bug¶

In CUDA, shared memory is allocated per-block from a fixed pool (96 KiB on Ampere, 64 KiB on older architectures). Multiple __shared__ declarations within a kernel are concatenated in the shared-memory address space. An out-of-bounds write in one buffer silently corrupts the next buffer in shared memory.

There is no runtime check. The hardware does not enforce bounds on shared-memory accesses. The compiler does check static bounds for compile-time-constant indices, but anything indexed by threadIdx is dynamic — bounds are not checked.

Contrast with global memory: out-of-bounds writes to global memory do get caught at runtime (you get CUDA_ERROR_ILLEGAL_ADDRESS or, with cuda-memcheck, a precise diagnosis). Shared memory has no such guard. This is the trade-off for shared-memory's nanosecond-scale latency.

The same issue exists in Triton. tl.zeros((16, 15)) instead of tl.zeros((16, 16)) produces the same pattern. Triton's vector abstractions hide the per-thread indexing but not the buffer-size mistake.

Diagnostic ladder Borja should walk¶

First check: the unit test fails. Look at the error: "max diff at element [3, 7] is 8.2". One element. The kernel produced the wrong number.
Second check: compare all elements with the reference. Pattern: 3% are wildly wrong, 5% are subtly wrong. The "wildly wrong" cluster.
Third check: the wildly-wrong elements share a structure — they are at column indices 0, 15, 16, 31 of the output. Or rows 0, 1, 16, 17. The pattern hints at "every 16^th".
Fourth check: cuda-memcheck (or compute-sanitizer on newer toolchains) reports the OOB write. This is the smoking gun. Output: "Invalid shared write of size 4 at ...".
Diagnosis: the shared-memory buffer is one column too small.

Reproducer¶

# Compile both
just phase-24-build-cuda

# Run with the broken version; observe failure
./phase24_matmul_broken 32 > /tmp/output_broken.txt
diff /tmp/output_broken.txt /tmp/output_reference.txt | head -20

# Run with compute-sanitizer
compute-sanitizer ./phase24_matmul_broken 32
# Look for "Invalid __shared__ write"

Or in Triton:

# Triton version of the same bug — replace BLOCK_SIZE_K = 16 with BLOCK_SIZE_K = 15 in the kernel
just phase-24-triton-matmul broken

Hint cascade¶

(Mild) "The unit test reports a single bad element. Plot the full element-wise diff. What's the pattern?"
(Medium) "Run compute-sanitizer (or cuda-memcheck) on the kernel. What does it report?"
(Direct) "The shared-memory allocation size is one less than the loop bound. Match them up."

Fix¶

Restore __shared__ float tile_a[16][16]. Or, defensively, use BLOCK_SIZE-named constants and the buffer dimensions: __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE] where BLOCK_SIZE = 16 is constexpr.

Better: write a size assertion into the kernel via a static_assert when possible, or a runtime check when not. For Triton, declare BLOCK_M, BLOCK_K, BLOCK_N as tl.constexpr parameters and use them consistently.

What makes this break educational¶

This bug demonstrates the trade-off that defines GPU programming: the hardware skips the bounds checks that CPU OS / runtime would do, because checking bounds on every shared-memory access would cost half the throughput. The cost: you can corrupt your own program silently.

The defense is tooling: compute-sanitizer catches it at runtime; nvcc -G (debug mode) helps; Triton's compile-time bounds checks help (when indices are compile-time constants); your test suite must include full-output equivalence checks, not single-element spot checks.

This is the GPU analog of the C strcpy bug — silent buffer overflow with delayed observable failure. Phase 24 introduces both the bug class and the tools (compute-sanitizer, exhaustive equivalence tests, compile-time bounds) that defend against it.

CPU-only fallback¶

If Borja doesn't have access to a CUDA GPU (i5-8250U has no NVIDIA hardware), this break can be simulated in CPU code: write a flat float tile_a[16*15] C array, index tile_a[r*16 + c] for c ∈ [0, 16). The C array has no bounds check either; the OOB write corrupts whatever happens to be next in stack memory. The pattern is harder to reproduce reliably on CPU because stack layout varies, but the concept is identical.

The Phase 23 lab 01-device-query.md and Phase 24 00-hello-cuda.md already gate GPU-requiring steps behind a CUDA check. Adapt this break into the gated path.

What this break is NOT¶

Not a correctness bug in the matmul algorithm.
Not a numerical-precision bug.
Not a memory-allocation bug at the host level (no malloc failure).

It is a silent shared-memory overflow — the most insidious class of GPU bug, because the hardware refuses to help you find it. The defense is tooling discipline, not algorithmic skill.

Cross-refs¶

theory/01-cuda-programming-model.md — the execution / memory model that makes this possible.
theory/05-triton-vs-cuda-build-before-abstract.md — Triton inherits the same hazard.
Phase 25 theory/01-dispatcher-and-aten.md — how PyTorch's kernels avoid this by using compile-time constants and template parameters.