English · Español
01 — The CUDA Programming Model¶
🇪🇸 CUDA = un lenguaje C/C++ extendido con cualificadores (
__global__,__device__,__shared__) y unas pocas variables intrínsecas (blockIdx,threadIdx,blockDim,gridDim). El kernel se lanza con<<<grid, block>>>y cada hilo identifica su trabajo con un índice global que tú calculas a partir de los intrínsecos. Es minimalista; lo difícil no es la sintaxis sino mapear el problema a la jerarquía grid → bloque → warp → hilo.
This page gives the CUDA programming model the same compact treatment Phase-23 gave the execution model. After this you can read any CUDA C kernel and identify: what each thread does, what each block does, what memory each thing touches.
The five qualifiers¶
__global__ void kernel(...) { ... } // host calls; runs on device
__device__ int helper(int x) { ... } // device-only, callable from kernel
__host__ int main(...) { ... } // host-only (default if unqualified)
__shared__ float tile[BLOCK_SIZE]; // SMEM, per-block
__constant__ float lookup[256]; // constant memory, read-only from device
That's most of it. CUDA C is C + these qualifiers + a few launch syntax additions + some intrinsics.
The launch¶
dim3 grid(num_blocks_x, num_blocks_y, num_blocks_z); // up to 3D
dim3 block(threads_x, threads_y, threads_z); // up to 3D, product ≤ 1024
kernel<<<grid, block, shared_mem_bytes, stream>>>(args...);
The <<<...>>> is CUDA C's only non-standard syntax. It tells nvcc to emit a kernel launch.
grid: how many blocks to launch. Each block runs on one SM, can't migrate.block: how many threads per block. Threads in a block can sync (__syncthreads()), share SMEM, and are bundled into warps of 32.shared_mem_bytes: dynamic SMEM size if the kernel usesextern __shared__. Optional.stream: which CUDA stream this kernel runs on. Multiple streams enable kernel concurrency. Defaults to the default stream (synchronizes with everything).
Total threads launched = grid × block (product of all dimensions). For a 1024×1024 matmul output, you might launch <<<dim3(64, 64), dim3(16, 16)>>> — 4096 blocks × 256 threads = 1M threads, one per output element (or per group thereof for tiling).
The intrinsics¶
Inside a kernel, every thread has access to:
threadIdx.x, threadIdx.y, threadIdx.z // its position in the block
blockIdx.x, blockIdx.y, blockIdx.z // its block's position in the grid
blockDim.x, blockDim.y, blockDim.z // block shape
gridDim.x, gridDim.y, gridDim.z // grid shape
Plus:
warpSize // always 32 today
laneId = threadIdx.x % 32 // position within warp (some helpers)
warpId = threadIdx.x / 32 // warp index within block
The canonical "what's my global ID?" pattern for a 1D launch:
For 2D (matrix work):
You'll write tid or row, col at the top of nearly every kernel.
The minimal vector-add kernel¶
__global__ void vec_add(const float* a, const float* b, float* c, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) {
c[tid] = a[tid] + b[tid];
}
}
That's a complete kernel. To launch over N = 1M elements with 256-thread blocks:
int N = 1 << 20;
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size; // ceil(N / block_size)
vec_add<<<grid_size, block_size>>>(d_a, d_b, d_c, N);
The pattern (N + block_size - 1) / block_size is the "round-up division" idiom — it ensures enough blocks to cover N even when N isn't a multiple of block_size, and the if (tid < N) guard inside the kernel handles the surplus threads.
Synchronization primitives¶
__syncthreads(); // barrier: all threads in this block must reach this point
__syncwarp(); // barrier within a warp (cheap, lockstep)
__threadfence(); // memory fence within device
__threadfence_block(); // memory fence within block
cudaDeviceSynchronize(); // host-side: wait for all kernels to finish
cudaStreamSynchronize(stream); // host-side: wait for a stream
The host-side syncs you've already met in lab 00 — they're what makes timing meaningful. The device-side __syncthreads() is the crucial one for SMEM correctness: load a tile, sync, then read the tile. Without the sync, some threads might read before others have finished loading. Subtle bugs live here.
Three things to remember:
__syncthreads()must be called by all threads in the block. Calling it inside a divergent branch is undefined behavior (deadlock or wrong result).__syncthreads()is not across blocks. Blocks are independent. Cross-block sync requires exiting the kernel and starting another.__syncwarp()is cheaper than__syncthreads()but only syncs within a warp. Useful for warp-level reductions.
Memory spaces, as cuda qualifiers¶
| Qualifier | Where | Access | Scope |
|---|---|---|---|
float* x = cudaMalloc(...); |
Global (HBM) | r/w | Device-wide |
__shared__ float t[N]; |
SMEM | r/w | Per-block |
int local = ...; |
Register (or local mem if spilled) | r/w | Per-thread |
__constant__ float c[N]; |
Constant cache | r-only from device | Device-wide |
texture<...> |
Texture cache | special access | Device-wide |
SMEM and registers were covered in Phase-23 theory/02. Constant memory is an under-used niche: very fast for uniformly read values (all threads read the same address). Texture memory is mostly graphics legacy; modern AI kernels rarely use it.
A canonical pattern: SMEM-tiled reduction¶
This is the pattern you'll write a dozen times:
__global__ void block_sum(const float* x, float* out, int N) {
__shared__ float tile[BLOCK];
int tid = threadIdx.x;
int gid = blockIdx.x * blockDim.x + tid;
// Load coalesced into SMEM.
tile[tid] = (gid < N) ? x[gid] : 0.0f;
__syncthreads();
// Reduce in SMEM (tree reduction).
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) tile[tid] += tile[tid + s];
__syncthreads();
}
// Thread 0 writes the block's sum.
if (tid == 0) out[blockIdx.x] = tile[0];
}
Read this until it's obvious. Every line maps to a concept from Phase 23:
tile[tid] = x[gid]: coalesced global load (adjacent threads → adjacent addresses).__syncthreads()after the load: required before any thread reads tile values written by another.- The tree reduction: log₂(BLOCK) sync rounds, each halving the active threads.
- The if-guard
if (tid < s): half the threads sit idle each round. Optimal? No, but simple. Real reductions use warp-level intrinsics (__shfl_down_sync) to skip the last few rounds.
A real reduction also handles N > blocks × block_size by having each block reduce multiple input chunks, but the principle is the same.
Kernel launch overhead¶
A kernel launch from the host costs ~5–10 microseconds. Sounds tiny — but for a 50-microsecond kernel, that's 10–20% overhead. For a 5-microsecond kernel, the overhead dominates.
Implications:
- Launching many tiny kernels is bad. Fuse them.
torch.compile/ Inductor in Phase 25 will fuse for you.- Hand-written kernels often combine multiple operators (e.g., GEMM + bias + ReLU) into one launch.
- CUDA Graphs (Phase 33) batch kernel launches into one host-side submission, dropping per-launch overhead.
For Phase 24's kernel, this is mostly a non-issue (the kernel is the work, not the launch). But it's worth knowing.
Asynchrony in one paragraph¶
Kernel launches return immediately on the host — they're added to a stream and executed asynchronously. The host continues; the device runs in parallel. cudaDeviceSynchronize() blocks the host until the device catches up.
cudaMemcpyAsync is async; cudaMemcpy is sync. Streams interleave: you can H2D-transfer one batch while computing the previous batch's kernel, overlapping memory and compute. Phase 33 will use this; Phase 24 doesn't.
What this gives you for the labs¶
By the end of this page you can read the labs without lookups:
__global__,<<<grid, block>>>,threadIdx,blockIdx,__shared__,__syncthreads()are all primitive vocabulary.- The "round-up grid size + guard with
if (tid < N)" pattern is standard. - The "load tile to SMEM → sync → reduce → sync → write back" pattern is standard.
That's enough to read the naive softmax kernel in lab/01 and the tuned one in lab/02.
What this page does NOT cover¶
- CUDA C++ templates,
__launch_bounds__advanced use, CUTLASS. Phase 24 stays in C-flavored CUDA; templates are not necessary for the chosen kernel. - Cooperative groups (
cg::thread_block_tile). A modern alternative to raw__syncthreads; mentioned in Phase 27 if needed. - PTX inline assembly. You won't need it for Phase 24.
- CPU/Triton/PyTorch syntax. Theory 02, 03, 04 cover those.
Next: theory/02-from-naive-to-tiled.md — the canonical optimization path for a single kernel.