English · Español
Phase 23 — GPU Architecture Fundamentals¶
Requires: 01 — Hardware & Computing Substrate · 22 — KV Cache: From Math to Memory Teaches:
gpu-architecture·sm·warps·occupancy·hbm·coalescingJump to any chapter from the phase reference index.
Chapter map¶
Pre-written per A12. Theory and lab problem statements are stable drafts; the cloud-platform choice is the single load-bearing open question (see
PHASE_23_PLAN.md§7.a).🇪🇸 Esta es la fase donde se cruza la frontera: del portátil sin CUDA a una GPU rentada en la nube. No escribimos kernels todavía — eso es Fase 24. Aquí construimos el modelo mental (SM, warp, jerarquía HBM/L2/SMEM/registros, coalescing, ocupación) y medimos, igual que en Fase 1 medimos el roofline del i5-8250U.
Goal¶
A mechanical understanding of the GPU Borja will rent — enough that "this kernel is HBM-bound" or "this branch divergence cost us a 32× factor" is a statement Borja can prove with measurements on his actual cloud-rented hardware, not a phrase from a NVIDIA marketing slide.
Phase 23 is the first phase to require cloud GPU. Phase 1 through 22 ran on Borja's i5-8250U. Phase 23 onward assumes an Ampere-class-or-newer GPU rented from a cloud provider (decision in PHASE_23_PLAN.md §7.a).
Read order¶
theory/00-motivation.md— why a separate phase for GPU mental model; what CPU intuitions transfer and which break.theory/01-gpu-vs-cpu-mental-model.md— execution model differences: SIMT vs out-of-order, warps vs cores, occupancy vs context switching.theory/02-gpu-memory-hierarchy.md— HBM → L2 → SMEM → registers; bandwidths and latencies; coalescing rules.theory/03-warps-and-occupancy.md— the 32-thread warp; divergence cost; occupancy as a register/SMEM/threads tradeoff.theory/04-gpu-roofline.md— Phase-1's roofline re-derived for GPU; multi-dtype ceilings; Tensor Core nuances; where the Phase-22 decode operator lands on this new plot.lab/00-provision-cloud-gpu.md— first cloud-GPU session, end-to-end. Picks the platform (per §7.a) and runsnvidia-smi. One-shot.lab/01-device-query.md— programmatic device inspection; populatedevice_query.jsonwith every relevant field.lab/02-bandwidth-test.md— H2D, D2H, D2DcudaMemcpybandwidth measurements; compare to theoretical peaks.lab/03-gpu-roofline.md— plot the roofline; overlay Phase-22 decode attention; identify regime.
solutions/ is empty during pre-write — populated at phase open after the cloud platform is chosen (since the device-query output and bandwidth numbers depend on the specific GPU rented).
Definition of Done¶
See PHASE_23_PLAN.md §6. Briefly:
- Successful end-to-end cloud-GPU session: provision → benchmark → terminate, all logged.
device_query.json,bandwidth_test.json,peak_flops.jsoncommitted.- GPU roofline plot committed with the Phase-22 decode-attention operator placed on it.
- Borja can recite from memory: memory hierarchy, peak fp16/bf16 TFLOPS, warp size, coalescing rule, occupancy bottlenecks.
What this phase intentionally does NOT cover¶
- Writing kernels. Phase 24. Phase 23 uses pre-built kernels (cuBLAS GEMM,
cudaMemcpy) and measures them. The discipline mirrors Phase 1: measure the machine before writing code that runs on it. - PyTorch. First imported in Phase 24, deliberately. Phase 23 uses
cupy(NumPy-on-GPU) for the one porting task that needs GPU arrays. SeePHASE_23_PLAN.md§7.g for the rationale. - Multi-GPU. Phase 35 (distributed training).
- Custom CUDA toolchains. Phase 24 introduces
nvcc. Phase 23 uses what comes pre-installed in the cloud image. - GPU profiling tools (nsight, ncu). Touched in Phase 24; Phase 23 uses
nvidia-smi+ Python timers only. - TPU / non-NVIDIA accelerators. Out of scope for the entire curriculum.
- Buying / configuring a local GPU. Borja's monthly budget is for cloud rental; no local GPU purchase is planned. If that changes, this phase needs revision.
Phase 23's scope is: provision a cloud GPU, measure its peaks, place the Phase-22 operator on the roofline. Nothing more.
Next phase preview: docs/phase-24-cuda-triton/ — first time writing GPU kernels; first time importing torch.
Further reading¶
Optional — enrichment, not required to pass the phase.
- 📘 CUDA C++ Programming Guide — NVIDIA · 2024. the ground truth on the memory hierarchy you measure.
- 📕 Programming Massively Parallel Processors — Kirk & Hwu · 2022. the GPU-programming textbook.