Phase 23 — GPU Architecture Fundamentals

Requires: 01 — Hardware & Computing Substrate · 22 — KV Cache: From Math to Memory
Teaches: gpu-architecture · sm · warps · occupancy · hbm · coalescing
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; the cloud-platform choice is the single load-bearing open question (see PHASE_23_PLAN.md §7.a).

This is the phase where we cross the border: from the laptop without CUDA to a GPU rented in the cloud. We do not write kernels yet; that is Phase 24. Here we build the mental model (SM, warp, the HBM/L2/SMEM/register hierarchy, coalescing, occupancy) and we measure, just as in Phase 1 we measured the roofline of the i5-8250U.

Goal

A mechanical understanding of the GPU Borja will rent, enough that "this kernel is HBM-bound" or "this branch divergence cost us a 32× factor" is a statement Borja can prove with measurements on his actual cloud-rented hardware, not a phrase from an NVIDIA marketing slide.
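To make the "32× factor" concrete before any measurement, here is a minimal sketch of a simplified SIMT cost model. It assumes every branch path costs the same and that the paths taken inside one warp run one after another; real hardware is more nuanced. The function name `divergence_factor` is hypothetical and exists only for this illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def divergence_factor(branch_ids):
    """Serialization factor for one warp under a simplified SIMT model.

    branch_ids holds one label per thread naming the code path it takes.
    Assumption: all paths cost the same and distinct paths are serialized,
    so the slowdown equals the number of distinct paths in the warp.
    """
    if len(branch_ids) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} entries, got {len(branch_ids)}")
    return len(set(branch_ids))


uniform = [0] * WARP_SIZE                         # every thread takes the same path
two_way = [tid % 2 for tid in range(WARP_SIZE)]   # even/odd split
worst = list(range(WARP_SIZE))                    # every thread takes its own path

print(divergence_factor(uniform), divergence_factor(two_way), divergence_factor(worst))
# prints: 1 2 32
```

The worst case, one path per thread, is where the factor of 32 comes from. Whether a given kernel actually reaches it is something to measure, not assume.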

Phase 23 is the first phase to require a cloud GPU. Phases 1 through 22 ran on Borja's i5-8250U. Phase 23 onward assumes an Ampere-class-or-newer GPU rented from a cloud provider (decision in PHASE_23_PLAN.md §7.a).

Read order

theory/00-motivation.md — why a separate phase for GPU mental model; what CPU intuitions transfer and which break.
theory/01-gpu-vs-cpu-mental-model.md — execution model differences: SIMT vs out-of-order, warps vs cores, occupancy vs context switching.
theory/02-gpu-memory-hierarchy.md — HBM → L2 → SMEM → registers; bandwidths and latencies; coalescing rules.
theory/03-warps-and-occupancy.md — the 32-thread warp; divergence cost; occupancy as a register/SMEM/threads tradeoff.
theory/04-gpu-roofline.md — Phase-1's roofline re-derived for GPU; multi-dtype ceilings; Tensor Core nuances; where the Phase-22 decode operator lands on this new plot.
lab/00-provision-cloud-gpu.md — first cloud-GPU session, end-to-end. Picks the platform (per §7.a) and runs nvidia-smi. One-shot.
lab/01-device-query.md — programmatic device inspection; populate device_query.json with every relevant field.
lab/02-bandwidth-test.md — H2D, D2H, D2D cudaMemcpy bandwidth measurements; compare to theoretical peaks.
lab/03-gpu-roofline.md — plot the roofline; overlay Phase-22 decode attention; identify regime.
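The roofline that the last theory chapter and the last lab build toward reduces to one `min`. The sketch below shows that arithmetic only. The ceilings and the operator intensity are hypothetical placeholders, not measurements of any real GPU and not the Phase-22 result; in the lab they would come from `peak_flops.json`, `bandwidth_test.json`, and the Phase-22 analysis.

```python
def attainable_flops(intensity, peak_flops, peak_bw):
    """Roofline ceiling: min(compute peak, arithmetic intensity * bandwidth peak).

    intensity is in FLOP/byte, peak_flops in FLOP/s, peak_bw in bytes/s.
    """
    return min(peak_flops, intensity * peak_bw)


def ridge_point(peak_flops, peak_bw):
    """Arithmetic intensity (FLOP/byte) where the two ceilings meet."""
    return peak_flops / peak_bw


def regime(intensity, peak_flops, peak_bw):
    """Classify an operator by which ceiling limits it."""
    if intensity < ridge_point(peak_flops, peak_bw):
        return "memory-bound"
    return "compute-bound"


# Hypothetical placeholders, chosen only to make the arithmetic easy to follow.
PEAK_FLOPS = 100e12    # FLOP/s for one dtype
PEAK_BW = 1e12         # bytes/s of HBM bandwidth
OP_INTENSITY = 1.0     # FLOP/byte of some operator

print(ridge_point(PEAK_FLOPS, PEAK_BW))
print(attainable_flops(OP_INTENSITY, PEAK_FLOPS, PEAK_BW))
print(regime(OP_INTENSITY, PEAK_FLOPS, PEAK_BW))
```

Because each dtype has its own compute peak, a multi-dtype roofline is this same function evaluated with several values of `peak_flops` against a single `peak_bw`.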

solutions/ is empty during pre-write — populated at phase open after the cloud platform is chosen (since the device-query output and bandwidth numbers depend on the specific GPU rented).

Definition of Done

See PHASE_23_PLAN.md §6. Briefly:

Successful end-to-end cloud-GPU session: provision → benchmark → terminate, all logged.
device_query.json, bandwidth_test.json, peak_flops.json committed.
GPU roofline plot committed with the Phase-22 decode-attention operator placed on it.
Borja can recite from memory: memory hierarchy, peak fp16/bf16 TFLOPS, warp size, coalescing rule, occupancy bottlenecks.
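The "occupancy bottlenecks" in the last item can be rehearsed with a small calculator. This is a minimal sketch of a simplified model: it ignores register and shared-memory allocation granularity, and the per-SM limits in `SMLimits` are illustrative defaults to be replaced with the values recorded in `device_query.json`. The names `SMLimits` and `occupancy` are hypothetical.

```python
from dataclasses import dataclass
from math import ceil

WARP_SIZE = 32


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits. Illustrative defaults; replace with queried values."""
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536          # 32-bit registers per SM
    smem_bytes: int = 164 * 1024    # shared memory per SM


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Return (occupancy fraction, name of the limiting resource).

    Simplified: each resource caps the number of resident blocks per SM,
    and the tightest cap wins. Allocation granularity is ignored.
    """
    warps_per_block = ceil(threads_per_block / WARP_SIZE)
    threads_alloc = warps_per_block * WARP_SIZE  # threads are allocated per warp
    block_limits = {
        "threads": sm.max_threads // threads_alloc,
        "blocks": sm.max_blocks,
        "registers": sm.registers // (regs_per_thread * threads_alloc),
        "smem": sm.smem_bytes // smem_per_block if smem_per_block > 0 else sm.max_blocks,
    }
    limiter = min(block_limits, key=block_limits.get)
    resident_warps = block_limits[limiter] * warps_per_block
    return resident_warps / (sm.max_threads // WARP_SIZE), limiter


for cfg in [(256, 32, 0), (256, 64, 0), (256, 32, 48 * 1024)]:
    print(cfg, occupancy(*cfg))
```

With these placeholder limits, doubling registers per thread from 32 to 64 halves the occupancy, and a 48 KiB shared-memory request per block makes shared memory the limiter instead.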

What this phase intentionally does NOT cover

Writing kernels. Phase 24. Phase 23 uses pre-built kernels (cuBLAS GEMM, cudaMemcpy) and measures them. The discipline mirrors Phase 1: measure the machine before writing code that runs on it.
PyTorch. First imported in Phase 24, deliberately. Phase 23 uses cupy (NumPy-on-GPU) for the one porting task that needs GPU arrays. See PHASE_23_PLAN.md §7.g for the rationale.
Multi-GPU. Phase 35 (distributed training).
Custom CUDA toolchains. Phase 24 introduces nvcc. Phase 23 uses what comes pre-installed in the cloud image.
GPU profiling tools (nsight, ncu). Touched in Phase 24; Phase 23 uses nvidia-smi + Python timers only.
TPU / non-NVIDIA accelerators. Out of scope for the entire curriculum.
Buying / configuring a local GPU. Borja's monthly budget is for cloud rental; no local GPU purchase is planned. If that changes, this phase needs revision.

Phase 23's scope is: provision a cloud GPU, measure its peaks, place the Phase-22 operator on the roofline. Nothing more.
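"Measure its peaks" comes down to comparing a timed copy against a theoretical ceiling. The sketch below shows only that arithmetic, assuming a double-data-rate memory interface (two transfers per clock). The clock and bus-width values are illustrative placeholders, not the specification of the rented GPU, and the timing is a made-up example rather than a measurement. The function names are hypothetical.

```python
def theoretical_bandwidth_gbs(mem_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s from memory clock and bus width.

    Assumption: transfers_per_clock=2 models a double-data-rate interface.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


def measured_bandwidth_gbs(bytes_moved, seconds):
    """Achieved bandwidth in GB/s for a timed transfer.

    For a device-to-device copy, decide up front whether bytes_moved counts
    the read and the write, or only the buffer size, and stay consistent.
    """
    return bytes_moved / seconds / 1e9


def efficiency(measured_gbs, theoretical_gbs):
    """Fraction of the theoretical peak actually achieved."""
    return measured_gbs / theoretical_gbs


# Illustrative placeholders; real values come from the device query and timers.
peak = theoretical_bandwidth_gbs(mem_clock_mhz=1215, bus_width_bits=5120)
achieved = measured_bandwidth_gbs(bytes_moved=2 * 1024**3, seconds=0.002)
print(f"peak {peak:.1f} GB/s, achieved {achieved:.1f} GB/s, "
      f"efficiency {efficiency(achieved, peak):.0%}")
```

On a real GPU the `seconds` value must come from a timer read only after the device has been synchronized, since copies can be launched asynchronously and return before the data has moved.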

Next phase preview: docs/phase-24-cuda-triton/ — first time writing GPU kernels; first time importing torch.