English · Español

03 — Warps and Occupancy¶

🇪🇸 Ocupación = (warps activos en un SM) / (warps máximos por SM). No es lo mismo que "muchos hilos en vuelo"; es "suficientes warps residentes para que el scheduler siempre tenga uno listo cuando otro espera memoria". Se limita por registros/hilo, SMEM/bloque y hilos/bloque — tres presupuestos que compiten.

This page makes occupancy a precise notion, derives the three resources that bound it, and explains the relationship to throughput.

What occupancy measures¶

An SM has a hardware cap on resident warps — typically 48 or 64 warps per SM (1536 or 2048 threads). "Resident" = registers reserved, state allocated, ready to run when the scheduler picks it.

Occupancy = (resident warps) / (max warps per SM).

100% occupancy means the SM has the maximum allowed warp pool. 25% means only a quarter of that pool is loaded.

Why care: the warp scheduler picks one warp per cycle to issue an instruction. When a warp stalls (memory load not yet returned, dependency on a slow instruction), the scheduler picks another warp. With enough warps, every cycle has a ready candidate; the SM never idles. With too few, the SM idles during stalls.

Memory-bound kernels need high occupancy to hide HBM latency (~500 cycles). Compute-bound kernels can run at lower occupancy because they don't stall as often.

What bounds occupancy¶

Three resources, all per-SM:

1. Register file¶

Each thread holds some number of registers (decided by the compiler from the kernel's logic). Total registers reserved = threads × registers/thread. This must fit in the SM's register file (e.g., 64K 32-bit registers = 256 KiB on A100).

If your kernel uses 64 registers/thread, and the SM has 64K registers, then max resident threads = 64K / 64 = 1024. If the SM cap is 2048 threads, you're at 1024/2048 = 50% occupancy.

To raise occupancy: use fewer registers per thread. Tradeoff: forced register reuse may spill locals to "local memory" (which is actually HBM!) — a stall waiting on a 500-cycle load every time you touch a spilled variable. So lowering register count can hurt a compute-bound kernel.

2. SMEM per block¶

Each block reserves some __shared__ SMEM at launch. Total SMEM reserved = blocks/SM × SMEM/block. Must fit in the SM's SMEM allocation (typically 100–164 KiB).

If your kernel uses 100 KiB SMEM/block and the SM has 100 KiB, you get exactly one block per SM. With 128 threads/block (4 warps), you're at 4/64 = ~6% occupancy. Lethal.

To raise: use less SMEM per block by tiling differently (smaller tiles). Or split work across more blocks.

3. Threads per block¶

The block size you choose (e.g., <<<grid, 256>>> = 256 threads/block) must divide the SM's thread cap. With 2048-thread cap and 256-thread blocks, you get 8 blocks/SM (max). With 1024-thread blocks, you get 2 blocks/SM.

Larger blocks = fewer blocks/SM but more threads/block. Smaller blocks = more blocks/SM. The choice matters because each block reserves its own SMEM and registers — total resource use scales with block count.

Rule of thumb: block size of 128–256 threads (4–8 warps) is usually a good starting point. Powers of two for clean math.

The occupancy equation¶

occupancy = min(
    threads_per_block / (max_threads_per_SM / blocks_per_SM_by_registers),
    threads_per_block / (max_threads_per_SM / blocks_per_SM_by_SMEM),
    threads_per_block / (max_threads_per_SM / blocks_per_SM_by_threads)
)

This is what NVIDIA's CUDA Occupancy Calculator (and nvcc --resource-usage + the --ptxas-options=-v output) computes for you. Phase 24 uses these. Phase 23 just observes the result via cudaOccupancyMaxActiveBlocksPerMultiprocessor().

Occupancy vs throughput: not the same thing¶

A high-occupancy kernel can still be slow:

Memory-bound: even at 100% occupancy, you're capped by HBM bandwidth. Adding more warps doesn't help once the memory pipe is full.
Coalescing-broken: even at 100% occupancy, if every warp is doing 32 uncoalesced loads, your effective bandwidth is 1/32 peak. Throughput drops 32× regardless of occupancy.
Divergence-broken: even at 100% occupancy, a 4-way divergent warp runs at 25% efficiency in the divergent region.

Conversely, a low-occupancy kernel can be fast:

Compute-bound with high ILP: a kernel that does 1000 FLOPs per memory load barely stalls; one warp per SM is enough to saturate the FPUs (in principle — really you want 2–3 for pipeline fill).
Tensor Core-bound matmul: a single warp issuing Tensor Core instructions can do enormous work; occupancy past ~50% gives diminishing returns.

Use occupancy as a diagnostic, not a goal. The goal is throughput. Low occupancy → suspect (might be register pressure or SMEM overuse); investigate. High occupancy + low throughput → suspect coalescing, divergence, or memory boundedness.

How Phase 23 measures occupancy¶

You won't write a kernel in Phase 23. But the experiments/23-device-profile/ benchmarks call into cuBLAS and cudaMemcpy, which have known occupancy/throughput characteristics. You can:

Query the device for max_threads_per_SM, max_blocks_per_SM, max_warps_per_SM, register_file_size, shared_mem_per_SM.
Compute the occupancy budget — i.e., if you wrote a kernel with X registers and Y KiB SMEM and Z threads/block, what occupancy would you get?
Confirm that cuBLAS GEMM at the size you measured hits the published peak fp16/bf16 number (>80% is realistic). If yes, NVIDIA's tuning of cuBLAS achieves high effective throughput; the GPU's potential is real.

The actual experiment is in lab/02-bandwidth-test.md and lab/03-gpu-roofline.md.

A worked occupancy example¶

You write a kernel for fused softmax with:

128 threads/block.
32 registers/thread (compiler reports this).
16 KiB SMEM/block (you use it to hold a row of logits for the row-wise softmax).

On an A100 (max 2048 threads/SM, 64K regs/SM, 164 KiB SMEM/SM, 32 blocks/SM):

Threads bound: max 2048 threads / 128 = 16 blocks/SM. Below the 32-block cap.
Register bound: max 64K regs / (128 × 32) = 64K / 4096 = 16 blocks/SM. Same as threads bound.
SMEM bound: 164 KiB / 16 KiB = 10 blocks/SM. This is the binding constraint.

Max blocks/SM = min(16, 16, 10) = 10. Max threads = 10 × 128 = 1280. Occupancy = 1280 / 2048 = 62.5%.

To raise occupancy: reduce SMEM/block to ~10 KiB (smaller softmax tile, but with more global reloads); or reduce register count.

To check if it's worth raising: profile (Phase 24 uses ncu).

Tensor Cores: a parallel compute pipeline¶

A modern GPU has two kinds of FPUs per SM: CUDA cores (one fp32 op per cycle per core) and Tensor Cores (one matrix-multiply-accumulate operation per cycle per Tensor Core — much higher peak throughput).

For Tensor Core operations (matmul of small tiles in fp16/bf16/fp8/int8), the throughput is 4–16× higher than CUDA cores at the same dtype. This is why "fp16 TFLOPS" numbers on a spec sheet are an order of magnitude higher than "fp32 TFLOPS" — the fp16 number assumes Tensor Cores; the fp32 number assumes CUDA cores.

Phase 23 does not write Tensor Core kernels. But the measured peak in peak_flops.py (via cuBLAS GEMM at fp16) hits Tensor Cores by default — that's how cuBLAS gets to ~80% of vendor peak. Phase 24's labs write a manual Tensor Core kernel.

Why occupancy is the last hurdle for Phase 23 understanding¶

Occupancy is the abstraction that ties together all the resource accounting on the GPU. Once you can:

Compute it from a kernel sketch (using the equation above),
Diagnose what bounds it (which of three),
Relate it to throughput (high occupancy is necessary but not sufficient),

...you have the vocabulary to read any GPU performance paper. Phase 24 will be tuning kernels, and every tuning knob (register count, block size, tile size, SMEM split) lands somewhere in this occupancy / throughput / coalescing space.

Drill problems¶

On A100 (64K regs/SM, 164 KiB SMEM/SM, 2048 thread cap, 32-block cap): a kernel uses 256 threads/block, 48 regs/thread, 32 KiB SMEM/block. Compute the binding constraint and the resulting occupancy.
On H100 (similar caps): the same kernel. Same occupancy? Different? Why?
A kernel is "memory-bound at 25% occupancy and 80% bandwidth utilization." Should you raise occupancy? Show the reasoning.
A kernel is "compute-bound at 100% occupancy, hitting 70% of fp16 Tensor-Core peak." How would you try to gain the missing 30%? (Hint: not by raising occupancy.)

What you should now be able to do¶

State the three resource bounds on occupancy.
Compute occupancy for a kernel given its register/SMEM/thread footprint.
Explain why occupancy is necessary but not sufficient for high throughput.
Read a profile output and identify which resource bounds occupancy.

What this page does NOT cover¶

Tuning a real kernel. Phase 24's lab is where occupancy-tuning becomes hands-on.
Asynchronous warp scheduling (warp specialization). Cutting-edge Hopper feature; Phase 27/36.
Tensor Core programming. Phase 24's softmax-over-vocab kernel doesn't use Tensor Cores; GEMM kernels do. Tensor Cores are a Phase 27 topic.

Next: theory/04-gpu-roofline.md — re-derive the roofline plot for GPU and place the Phase-22 operators on it.