English · Español

01 — GPU vs CPU: The Execution Model¶

🇪🇸 SIMT (Single Instruction, Multiple Threads) significa que la GPU corre 32 hilos en paso de bloqueo dentro de un warp. Si el código tiene un if con ramas distintas en lanes diferentes, la GPU ejecuta ambas ramas y enmascara las lanes inactivas. No es "ramas baratas como en CPU OoO". Es "ramas que te cuestan los lanes que descartan".

This page explains the SIMT execution model and three of its CPU-violating consequences. By the end you should be able to predict, given a kernel sketch, which lines hurt performance and why.

CPU: out-of-order execution¶

A modern CPU core is a speculative execution engine with a large bookkeeping structure (reorder buffer, register file with renaming, branch predictor, memory disambiguation, etc.). One thread issues instructions; the CPU finds opportunities to overlap, predict, and reorder them. A branch is cheap if predicted correctly and only mildly expensive (10–20 cycles) on mispredict. Memory loads start early and the rest of the pipeline continues. Threads on the CPU are heavyweight: each has its own register file, kernel-managed stack, scheduler entry.

The dominant axis of parallelism per CPU core is instruction-level parallelism (ILP): get one thread's stream to overlap as much as possible. To scale beyond one core, you spawn another heavyweight thread.

GPU: SIMT¶

A GPU SM contains 4–8 warp schedulers, each managing a pool of warps (32 threads each). Each cycle a scheduler picks a warp that's ready (its operands are in registers, no stall) and issues one instruction across all 32 threads in lockstep. Each thread has its own program counter (formally — actually they share a PC except across branches), its own register-file allocation (carved out of the SM's register file), and a thread-id determining which data it processes.

The dominant axis of parallelism per SM is thread-level parallelism (TLP) at warp granularity: keep many warps available so when one stalls (waiting on memory), another runs on the next cycle. No reorder buffer, no branch predictor in the CPU sense, no speculative execution.

Throughput-per-SM = warp-instructions-per-cycle × FLOPs-per-warp-instruction.

To max this out, the SM needs enough resident warps that every cycle has at least one ready to issue. This is the entire game.

Consequence 1: Branch divergence¶

CPU: if (x > 0) a; else b; — the predictor guesses, the wrong path gets squashed if wrong. Cost (worst case): 10 cycles for one thread.

GPU: same code in a warp. If half the threads have x > 0 and half don't, the warp executes a for the first half (16 lanes), masking out the others (their result is discarded), then executes b for the second half. Both branches run. Throughput drops 2× for that warp during the branched region.

If you nest branches three deep with full divergence at each level, you drop to ⅛ throughput. Real example: a kernel that dispatches "this token attends, that one is padding" via runtime branch will be ~2× slower than the same kernel processing only real tokens.

Fix: structure your kernel so warps are coherent — all 32 threads in a warp take the same branch. If divergence is inherent (e.g., different sequences in a batch have different lengths), use thread-block-level dispatch: assign each block to one sequence so the warps within a block are coherent.

Consequence 2: Coalesced memory access¶

CPU: scattered reads from DRAM are slow because each cache-line fill brings 64 bytes you may not use, but the memory subsystem still handles them efficiently — the prefetcher learns access patterns, the cache caches everything brought in.

GPU: a warp doing 32 memory loads in one instruction asks the memory subsystem for 32 addresses. If those addresses are adjacent (thread i reads addr + i * 4), the GPU consolidates them into one 128-byte transaction. If they're scattered (thread i reads addr + perm[i] * 4 for some permutation perm), the GPU issues 32 separate transactions. Effective bandwidth drops to ≈1/32.

Fix: structure data so adjacent threads access adjacent memory. The "structure-of-arrays" (SoA) layout is preferred over "array-of-structures" (AoS) precisely because SoA gives this coalescing for free. Phase-22's KV cache layout (B, H, S, d_h) was chosen partly with coalescing in mind — adjacent threads working on adjacent d_h get coalesced reads.

The standard memory-coalescing example you'll see again in Phase 24:

// Coalesced — good.
__global__ void copy_coalesced(float* dst, float* src, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) dst[tid] = src[tid];          // thread i reads src[i], adjacent
}

// Uncoalesced — bad.
__global__ void copy_strided(float* dst, float* src, int N, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) dst[tid] = src[tid * stride]; // thread i reads src[i*stride]
}

At stride=32, the second kernel runs 32× slower on the memory leg of the work — every load is its own transaction.

Consequence 3: Latency hiding via warp switching¶

CPU: latency hiding is the CPU's job, via ILP + cache + speculative execution. The programmer doesn't think about it (until they profile).

GPU: latency hiding is also automatic, but via warp switching. When warp A issues a memory load that misses L2 and must go to HBM (~400 cycles), the warp scheduler switches to warp B (already in residency, registers ready) and issues its next instruction. By the time warps B, C, D, E, F, G, H, I have each issued an instruction, warp A's load is back, and the scheduler can issue A's next instruction.

This mechanism works only if enough warps are resident. Occupancy = (active warps / max warps per SM). Below ~50% occupancy on a memory-bound kernel, the SM stalls during memory access — there aren't enough other warps to switch to. Above ~50% occupancy, latency is fully hidden (in principle).

Compute-bound kernels are less occupancy-sensitive: even at 25% occupancy, you can saturate the FPUs if the kernel does enough math per memory access.

What none of this means¶

A few misreads of the above that recur:

"GPU branches are expensive, so I'll write branchless code." Only matters within a warp. Cross-warp branches are free (warps run independently). And many "branchless" tricks (masking, predication) are exactly what the GPU does internally for divergence — you're not avoiding the cost, you're just spelling it out.
"GPU loves dense access, so I'll redesign my data structure." Yes — but the redesign sometimes adds compute. Run the roofline analysis: if your kernel is compute-bound, uncoalesced access may not be the bottleneck.
"More threads = more speed." Only up to occupancy saturation. After that, more threads doesn't help (the SM is already keeping warps swapped). Below saturation, more resident threads = more speed, but more threads in flight doesn't necessarily help.
"SIMT == SIMD." Close but not identical. SIMD is one instruction, fixed-width lanes; SIMT is one instruction, lanes with their own thread-id and per-thread state (e.g., registers). The practical difference: SIMT can do divergent branches (slowly); SIMD cannot. SIMT can do scatter/gather (slowly); SIMD often can't at all.

The CUDA model in one diagram¶

Grid (host launches it)
└── Blocks (each block runs on ONE SM, can't migrate)
    └── Warps (32 threads, lockstep execution within a warp)
        └── Threads (own register-file allocation; thread-id determines data)

SM (Streaming Multiprocessor)
├── Warp schedulers (4–8)
├── CUDA cores (FP32 ALUs, e.g., 128 per SM on A100)
├── Tensor Cores (matrix-multiply units, e.g., 4 per SM on A100)
├── Register file (e.g., 64K 32-bit registers per SM)
├── SMEM (e.g., 96–164 KiB per SM, programmer-managed)
└── L1 cache (sometimes unified with SMEM)

A grid launches; CUDA assigns blocks to SMs; each SM picks warps from its assigned blocks; warp schedulers issue. A block never migrates between SMs (so SMEM and __syncthreads() work). Threads in different blocks cannot synchronize (except by exiting the kernel and starting another).

This entire model is explicit in the CUDA programming model. You declare grid + block shape at launch (<<<grid, block>>> syntax). The CPU has nothing analogous — you don't say "this loop iterates as a grid of size 1024×32". You will, on the GPU.

What you should now be able to do¶

Read a CUDA kernel and identify potential warp divergence.
Identify the data layout that maximizes coalescing for a given access pattern.
Decide whether a given kernel is occupancy-limited and which resource (registers, SMEM, threads) limits it.
Explain why "running more threads is always faster on a GPU" is false.
State the difference between SIMT and SIMD without grasping at vocabulary.

What this page does NOT cover¶

Kernel syntax. No CUDA C++, no Triton DSL. That's Phase 24.
Specific GPU architectures. Volta vs Turing vs Ampere vs Hopper differ in details (Tensor Core generations, async copy, TMA) — covered as needed in Phase 27/36. This page builds the cross-architecture mental model.
Multi-GPU. Phase 35.

Next: theory/02-gpu-memory-hierarchy.md — HBM, L2, SMEM, registers, with bandwidths and the coalescing rule formalized.