English · Español
Lab 01 — Naive Fused-Softmax Kernel (Correct, Slow)¶
Goal: write the naive CUDA C version of fused softmax over the grammar MiniGPT's ~600-form logit row. Confirm correctness against NumPy. Place the dot on the GPU roofline. Don't tune — the next lab does that.
Estimated time: 2–4 hours.
Prereq:
lab/00-hello-cuda.mdcomplete.src/minikernel/directory exists withBLUEPRINT.mdreviewed.
What you produce¶
A directory experiments/24-naive-kernel/ and supporting source in src/minikernel/:
src/minikernel/softmax_naive.cu— the kernel.src/minikernel/softmax_naive.py— Python launcher (loads kernel viacupy.RawKernel).src/minikernel/dispatch.py— initial draft: a dispatcher that picks CUDA-kernel vs NumPy-fallback. CPU fallback =np.exp(x - x.max()) / s.tests/test_softmax_naive.py— numerical-equivalence test (1e-5 abs vsnp.softmax).experiments/24-naive-kernel/bench.py— single-config timing.experiments/24-naive-kernel/manifest.json.experiments/24-naive-kernel/README.md.
The operator¶
Row-wise softmax with numerical-stability trick (subtract max), on inputs of shape (B, V) with \(V = 600\) (grammar MiniGPT vocab from §A13). For Phase 24, \(B\) sweeps \(\{1, 8, 64, 512, 4096\}\).
TODOs¶
Block A — write the naive kernel¶
- Per
theory/02§"Version 1: Naive": one thread per (row, column-stride) pair. Each thread re-reads the row to compute max and sum. Wasteful by design. - Use fp32 inputs, fp32 accumulator, fp32 outputs. No mixed precision yet (lab 02 explores it).
- Handle \(V\) not a power of 2: guard with
if (col < V). - Launch with
grid = (B,),block = (V_padded,)whereV_padded = next_pow_2(V) = 1024.
Block B — correctness test¶
-
tests/test_softmax_naive.py: generate random logits withnp.random.default_rng(42), shape(64, 600). Compare CUDA-kernel output tonp.exp(x - x.max(axis=-1, keepdims=True)) / np.exp(x - x.max(axis=-1, keepdims=True)).sum(axis=-1, keepdims=True)atatol=1e-5. - Skipif when no CUDA detected; in that case, the test exercises the NumPy fallback via
dispatch.pyand asserts agreement with itself (sanity check that the fallback exists). - Run on CPU (Borja's laptop) — dispatch returns NumPy path; test runs.
- Run on cloud GPU — dispatch returns CUDA path; test runs.
Block C — bench it¶
-
bench.py: sweep \(B \in \{1, 8, 64, 512, 4096\}\). For each, time 100 launches (after 3 warm-ups). Record median. - Compute achieved HBM bandwidth: bytes-moved per row × \(B\) × launches / time.
- Compute fraction of peak HBM (from your phase-23 device query).
- Expected: 1–5% of peak. Bad — but the baseline. Lab 02 climbs from here.
Block D — place on roofline¶
- Compute \(I = \text{FLOPs} / \text{bytes}\) for this kernel at \(B = 64, V = 600\), fp32. Expected \(\approx 1\) FLOP/byte (memory-bound).
- Plot one dot on a roofline scaffold (carry forward to lab 03).
Block E — manifest¶
{
"experiment": "24-naive-kernel",
"date": "YYYY-MM-DD",
"seed": 42,
"gpu": {"model": null, "compute_capability": null},
"kernel": {"name": "softmax_naive", "dtype": "fp32", "V": 600},
"results": {
"B_sweep": [1, 8, 64, 512, 4096],
"median_us_per_launch": [null, null, null, null, null],
"achieved_bandwidth_gbs": [null, null, null, null, null],
"fraction_of_hbm_peak": [null, null, null, null, null],
"correctness": "passed | failed"
}
}
Constraints¶
- NumPy fallback works on CPU. Borja's machine has no CUDA; the dispatcher must route to the NumPy path with no error.
- Don't tune. No SMEM, no parallel reduction, no online softmax. Just the naive 3-pass kernel. The point is to have a correct slow baseline.
- Test on fp32 only. fp16 lives in lab 02.
- Use seed 42 for all random inputs.
Stop conditions¶
Done when:
- CUDA kernel runs on cloud GPU, output matches
np.softmaxto 1e-5. - NumPy fallback works on Borja's laptop, output matches
np.softmaxto 1e-7. - Bench sweep across \(B\) recorded; one dot plotted on a stub roofline.
manifest.jsoncommitted.tests/test_softmax_naive.pygreen in both environments (CUDA + CPU).
Pitfalls¶
- Sum overflows or underflows. Without the
- maxtrick, \(\exp(x)\) overflows for \(x > 88\) in fp32. With the trick, \(\exp(x - m) \leq 1\). Use the trick from the start; the lab specifies it. - NaN in the output. Almost always: row of all
-infor all-zero gradients. For randomly initialized inputs this shouldn't happen, but a 0/0 in the normalize pass is the giveaway. Guard withs = max(s, 1e-30). - Dispatcher silently picks CUDA when no CUDA installed. Detect at import time, set a module-level flag, route accordingly.
- fp32 mantissa loss in the sum. Order-of-summation matters; lab 01's naive kernel might not match NumPy at 1e-7 (only 1e-5). Document if so.
When to consult solutions/¶
After all stop conditions met. The reference shows the canonical naive kernel and dispatcher.
Next lab: lab/02-tuned-kernel.md.