Lab 01 — Naive Fused-Softmax Kernel (Correct, Slow)

Goal: write the naive CUDA C version of fused softmax over the ~600-form logit row of the grammar MiniGPT. Confirm correctness against NumPy. Place its dot on the GPU roofline. Don't tune; the next lab does that.

Estimated time: 2–4 hours.

Prereq: lab/00-hello-cuda.md complete. src/minikernel/ directory exists with BLUEPRINT.md reviewed.


What you produce

A directory experiments/24-naive-kernel/ and supporting source in src/minikernel/:

  • src/minikernel/softmax_naive.cu — the kernel.
  • src/minikernel/softmax_naive.py — Python launcher (loads kernel via cupy.RawKernel).
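  • src/minikernel/dispatch.py (initial draft): a dispatcher that picks the CUDA kernel or the NumPy fallback. CPU fallback: e = np.exp(x - x.max(axis=-1, keepdims=True)), then e / e.sum(axis=-1, keepdims=True).
  • tests/test_softmax_naive.py: numerical-equivalence test (1e-5 abs vs. a NumPy reference softmax; NumPy itself has no np.softmax function).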
  • experiments/24-naive-kernel/bench.py — single-config timing.
  • experiments/24-naive-kernel/manifest.json.
  • experiments/24-naive-kernel/README.md.

The operator

Row-wise softmax with numerical-stability trick (subtract max), on inputs of shape (B, V) with \(V = 600\) (grammar MiniGPT vocab from §A13). For Phase 24, \(B\) sweeps \(\{1, 8, 64, 512, 4096\}\).
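For orientation, a minimal NumPy sketch of the operator described above. The function name softmax_reference is an illustrative choice, not something the lab prescribes; the body is the usual subtract-max formulation applied along the last axis.

```python
import numpy as np

def softmax_reference(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the last axis with the subtract-max trick."""
    m = x.max(axis=-1, keepdims=True)   # per-row max, shape (B, 1)
    e = np.exp(x - m)                   # every entry is at most 1
    s = e.sum(axis=-1, keepdims=True)   # per-row sum, at least 1 for finite rows
    return e / s

# Logits this large would overflow a plain exp(x) in fp32; here they stay finite.
x = np.array([[1000.0, 1001.0, 1002.0]], dtype=np.float32)
p = softmax_reference(x)
```

Each row of p sums to 1 up to rounding, and shifting a row by a constant leaves the result unchanged, which is why subtracting the max is safe.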

TODOs

Block A — write the naive kernel

  • Per theory/02 §"Version 1: Naive": one thread per (row, column-stride) pair. Each thread re-reads the row to compute max and sum. Wasteful by design.
  • Use fp32 inputs, fp32 accumulator, fp32 outputs. No mixed precision yet (lab 02 explores it).
  • Handle \(V\) not a power of 2: guard with if (col < V).
  • Launch with grid = (B,), block = (V_padded,) where V_padded = next_pow_2(V) = 1024.
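The launch geometry and per-thread work above can be emulated on the CPU before writing any CUDA. The sketch below is plain Python, not the kernel itself: the helper names are hypothetical, the arithmetic is float64 rather than fp32, and the two loops stand in for blockIdx.x and threadIdx.x.

```python
import math

def next_pow_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def launch_config(B: int, V: int):
    """Grid and block shapes for the naive kernel: one block per row."""
    return (B,), (next_pow_2(V),)

def softmax_naive_emulated(x, V):
    """Emulate the naive kernel; x is a list of B rows, each a list of V floats."""
    grid, block = launch_config(len(x), V)
    out = [[0.0] * V for _ in x]
    for row in range(grid[0]):          # plays the role of blockIdx.x
        for col in range(block[0]):     # plays the role of threadIdx.x
            if col >= V:                # guard: padded threads do nothing
                continue
            # Pass 1: this thread re-reads the whole row to find the max.
            m = max(x[row][j] for j in range(V))
            # Pass 2: it re-reads the row again to accumulate the sum.
            s = 0.0
            for j in range(V):
                s += math.exp(x[row][j] - m)
            # Pass 3: it writes only its own normalized element.
            out[row][col] = math.exp(x[row][col] - m) / s
    return out

grid, block = launch_config(64, 600)    # ((64,), (1024,))
```

With V = 600 and a 1024-thread block, 424 threads per block fail the guard and idle, and every active thread repeats the same max and sum. That redundancy is the waste the later lab removes.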

Block B — correctness test

  • tests/test_softmax_naive.py: generate random logits with np.random.default_rng(42), shape (64, 600). Compare the CUDA-kernel output to the NumPy reference e / e.sum(axis=-1, keepdims=True), where e = np.exp(x - x.max(axis=-1, keepdims=True)), at atol=1e-5.
  • Mark the CUDA comparison with skipif when no CUDA is detected. In that case the test exercises the NumPy fallback via dispatch.py instead and asserts that it agrees with the same NumPy expression (a sanity check that the fallback exists).
  • Run on CPU (Borja's laptop) — dispatch returns NumPy path; test runs.
  • Run on cloud GPU — dispatch returns CUDA path; test runs.
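A possible shape for the comparison logic, kept free of pytest so it runs anywhere. The names make_logits and check_softmax are hypothetical, the standard-normal distribution is an assumption (the lab only says "random logits"), and softmax_fn is assumed to return a NumPy array (a CuPy result would need copying back to the host first).

```python
import numpy as np

def softmax_reference(x):
    """NumPy reference: subtract the row max, exponentiate, normalize."""
    m = x.max(axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

def make_logits(seed=42, shape=(64, 600)):
    """Deterministic fp32 test input; standard normal is an assumed choice."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape, dtype=np.float32)

def check_softmax(softmax_fn, atol=1e-5):
    """Raise AssertionError if softmax_fn disagrees with the reference."""
    x = make_logits()
    got = np.asarray(softmax_fn(x))
    want = softmax_reference(x)
    np.testing.assert_allclose(got, want, rtol=0, atol=atol)
    return True
```

In the real test file, the function passed to check_softmax would be whatever dispatch.py exposes, so the same check covers both the CUDA path and the NumPy path.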

Block C — bench it

  • bench.py: sweep \(B \in \{1, 8, 64, 512, 4096\}\). For each, time 100 launches (after 3 warm-ups). Record median.
  • Compute achieved HBM bandwidth: bytes moved per row × \(B\) / median time per launch (equivalently, bytes moved per row × \(B\) × launches / total time).
  • Compute fraction of peak HBM (from your phase-23 device query).
  • Expected: 1–5% of peak. Bad — but the baseline. Lab 02 climbs from here.
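The timing and bandwidth arithmetic above might look like the following sketch. All names are hypothetical. Two assumptions are baked in: fn blocks until the kernel has finished (on a GPU that means synchronizing the device inside fn), and bytes moved counts one read plus one write per element, the algorithmic minimum rather than what the naive kernel actually fetches.

```python
import statistics
import time

def time_median_us(fn, launches=100, warmups=3):
    """Median wall-clock time of fn() in microseconds, after warm-up calls."""
    for _ in range(warmups):
        fn()
    samples = []
    for _ in range(launches):
        t0 = time.perf_counter()
        fn()                            # assumed to block until the work is done
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)

def achieved_bandwidth_gbs(B, V, median_us, bytes_per_elem=4):
    """GB/s assuming each element is read once and written once (assumption)."""
    bytes_moved = 2 * B * V * bytes_per_elem
    return bytes_moved / (median_us * 1e-6) / 1e9

def fraction_of_peak(achieved_gbs, peak_gbs):
    """Achieved bandwidth as a fraction of the device's peak HBM bandwidth."""
    return achieved_gbs / peak_gbs
```

Counting only the minimum traffic makes the reported fraction a measure of how far the kernel is from the ideal. If you count the redundant re-reads instead, say so in the README so the numbers stay comparable across labs.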

Block D — place on roofline

  • Compute \(I = \text{FLOPs} / \text{bytes}\) for this kernel at \(B = 64, V = 600\), fp32. Expected \(\approx 1\) FLOP/byte (memory-bound).
  • Plot one dot on a roofline scaffold (carry forward to lab 03).
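A rough way to get the dot's coordinates, under stated assumptions: five FLOPs per element (compare, subtract, exp counted as one, add, divide) and one fp32 read plus one fp32 write per element. Both counts are illustrative conventions, and the function names are hypothetical.

```python
def arithmetic_intensity(B, V, flops_per_elem=5, bytes_per_elem=4):
    """FLOPs per byte; the per-element FLOP count is an assumed convention."""
    flops = flops_per_elem * B * V
    bytes_moved = 2 * B * V * bytes_per_elem   # one read + one write per element
    return flops / bytes_moved

def roofline_attainable_gflops(intensity, peak_gflops, peak_bw_gbs):
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(peak_gflops, intensity * peak_bw_gbs)

I = arithmetic_intensity(64, 600)   # 0.625 under the assumptions above
```

With these counts the intensity is 0.625 FLOP/byte, the same order as the figure quoted above. Counting exp as several FLOPs moves it closer to 1, and \(B\) cancels out, so the dot's x-coordinate is the same for every batch size.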

Block E — manifest

{
  "experiment": "24-naive-kernel",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "gpu": {"model": null, "compute_capability": null},
  "kernel": {"name": "softmax_naive", "dtype": "fp32", "V": 600},
  "results": {
    "B_sweep": [1, 8, 64, 512, 4096],
    "median_us_per_launch": [null, null, null, null, null],
    "achieved_bandwidth_gbs": [null, null, null, null, null],
    "fraction_of_hbm_peak": [null, null, null, null, null],
    "correctness": "passed | failed"
  }
}
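One way to fill the template from bench results is sketched below. The helper names are hypothetical; the keys and fixed values mirror the template above, and the length check only guards against a sweep with missing entries.

```python
import json

B_SWEEP = [1, 8, 64, 512, 4096]

def build_manifest(date_str, median_us, bandwidth_gbs, fraction, correctness,
                   gpu_model=None, compute_capability=None):
    """Assemble the manifest dict; each series needs one entry per batch size."""
    for series in (median_us, bandwidth_gbs, fraction):
        if len(series) != len(B_SWEEP):
            raise ValueError("expected one entry per batch size in B_SWEEP")
    if correctness not in ("passed", "failed"):
        raise ValueError("correctness must be 'passed' or 'failed'")
    return {
        "experiment": "24-naive-kernel",
        "date": date_str,
        "seed": 42,
        "gpu": {"model": gpu_model, "compute_capability": compute_capability},
        "kernel": {"name": "softmax_naive", "dtype": "fp32", "V": 600},
        "results": {
            "B_sweep": list(B_SWEEP),
            "median_us_per_launch": list(median_us),
            "achieved_bandwidth_gbs": list(bandwidth_gbs),
            "fraction_of_hbm_peak": list(fraction),
            "correctness": correctness,
        },
    }

def write_manifest(manifest, path):
    """Write the manifest as indented JSON with a trailing newline."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
        f.write("\n")
```

Leaving gpu_model and compute_capability as None produces JSON null, matching the template when the run happened on the CPU fallback.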

Constraints

  • NumPy fallback works on CPU. Borja's machine has no CUDA; the dispatcher must route to the NumPy path with no error.
  • Don't tune. No SMEM, no parallel reduction, no online softmax. Just the naive 3-pass kernel. The point is to have a correct slow baseline.
  • Test on fp32 only. fp16 lives in lab 02.
  • Use seed 42 for all random inputs.
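A minimal dispatcher sketch that satisfies the first constraint. It assumes CuPy is the CUDA binding (as the launcher description suggests) and treats any failure to import it or to find a device as "no CUDA". The cuda_impl parameter is a hypothetical hook for the launcher from softmax_naive.py; it is not part of any library.

```python
import numpy as np

# Detect CUDA once at import time and keep the result in a module-level flag.
try:
    import cupy as cp
    HAS_CUDA = cp.cuda.runtime.getDeviceCount() > 0
except Exception:          # CuPy missing, no driver, or no visible device
    cp = None
    HAS_CUDA = False

def softmax_numpy(x):
    """CPU fallback: row-wise softmax with the subtract-max trick."""
    m = x.max(axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

def softmax(x, cuda_impl=None):
    """Route to cuda_impl when CUDA is usable, otherwise to the NumPy path."""
    x = np.asarray(x, dtype=np.float32)
    if HAS_CUDA and cuda_impl is not None:
        return cuda_impl(x)
    return softmax_numpy(x)
```

Catching a broad Exception at import is deliberate here: on a laptop without a driver the failure can surface as an import error or as a runtime error from the device query, and either way the dispatcher should fall through to NumPy without raising.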

Stop conditions

Done when:

  1. CUDA kernel runs on cloud GPU, output matches the NumPy reference softmax to 1e-5.
  2. NumPy fallback works on Borja's laptop, output matches the NumPy reference softmax to 1e-7.
  3. Bench sweep across \(B\) recorded; one dot plotted on a stub roofline.
  4. manifest.json committed.
  5. tests/test_softmax_naive.py green in both environments (CUDA + CPU).

Pitfalls

  • Sum overflows or underflows. Without the - max trick, \(\exp(x)\) overflows for \(x > 88\) in fp32. With the trick, \(\exp(x - m) \leq 1\). Use the trick from the start; the lab specifies it.
  • NaN in the output. Almost always: row of all -inf (the row max is then -inf, so x - max evaluates to (-inf) - (-inf) = NaN) or all-zero gradients. For randomly initialized inputs this shouldn't happen, but a 0/0 in the normalize pass is the giveaway. Guard with s = max(s, 1e-30).
  • Dispatcher silently picks CUDA when no CUDA is installed. Detect at import time, set a module-level flag, route accordingly.
  • fp32 mantissa loss in the sum. Order of summation matters; this lab's naive kernel might match NumPy only at 1e-5, not at 1e-7. Document if so.
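The summation-order effect in the last pitfall can be observed on the CPU. This sketch compares a sequential fp32 accumulation (what a single thread looping over the row does) with NumPy's own fp32 sum and a float64 reference. The standard-normal input is an assumption, and the exact discrepancies depend on the data.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(600, dtype=np.float32)
e = np.exp(x - x.max())                 # fp32 terms, each at most 1

# Sequential fp32 accumulation, rounding to fp32 after every add.
s_seq = np.float32(0.0)
for v in e:
    s_seq = np.float32(s_seq + v)

s_np = e.sum()                          # NumPy's fp32 sum (typically pairwise)
s_ref = e.astype(np.float64).sum()      # higher-precision reference

rel_seq = abs(float(s_seq) - s_ref) / s_ref
rel_np = abs(float(s_np) - s_ref) / s_ref
print(f"sequential fp32: {rel_seq:.2e}   numpy fp32: {rel_np:.2e}")
```

Both relative errors stay small for a 600-term row, but they need not be equal, which is why a bitwise or 1e-7 match between the kernel and NumPy is not something to rely on.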

When to consult solutions/

After all stop conditions are met. The reference shows the canonical naive kernel and dispatcher.


Next lab: lab/02-tuned-kernel.md.