English · Español

01 — The CUDA Programming Model¶

🇪🇸 CUDA = un lenguaje C/C++ extendido con cualificadores (__global__, __device__, __shared__) y unas pocas variables intrínsecas (blockIdx, threadIdx, blockDim, gridDim). El kernel se lanza con <<<grid, block>>> y cada hilo identifica su trabajo con un índice global que tú calculas a partir de los intrínsecos. Es minimalista; lo difícil no es la sintaxis sino mapear el problema a la jerarquía grid → bloque → warp → hilo.

This page gives the CUDA programming model the same compact treatment Phase-23 gave the execution model. After this you can read any CUDA C kernel and identify: what each thread does, what each block does, what memory each thing touches.

The five qualifiers¶

__global__ void kernel(...)   { ... }   // host calls; runs on device
__device__ int helper(int x)  { ... }   // device-only, callable from kernel
__host__   int main(...)      { ... }   // host-only (default if unqualified)

__shared__ float tile[BLOCK_SIZE];      // SMEM, per-block
__constant__ float lookup[256];         // constant memory, read-only from device

That's most of it. CUDA C is C + these qualifiers + a few launch syntax additions + some intrinsics.

The launch¶

dim3 grid(num_blocks_x, num_blocks_y, num_blocks_z);   // up to 3D
dim3 block(threads_x, threads_y, threads_z);            // up to 3D, product ≤ 1024
kernel<<<grid, block, shared_mem_bytes, stream>>>(args...);

The <<<...>>> is CUDA C's only non-standard syntax. It tells nvcc to emit a kernel launch.

grid: how many blocks to launch. Each block runs on one SM, can't migrate.
block: how many threads per block. Threads in a block can sync (__syncthreads()), share SMEM, and are bundled into warps of 32.
shared_mem_bytes: dynamic SMEM size if the kernel uses extern __shared__. Optional.
stream: which CUDA stream this kernel runs on. Multiple streams enable kernel concurrency. Defaults to the default stream (synchronizes with everything).

Total threads launched = grid × block (product of all dimensions). For a 1024×1024 matmul output, you might launch <<<dim3(64, 64), dim3(16, 16)>>> — 4096 blocks × 256 threads = 1M threads, one per output element (or per group thereof for tiling).

The intrinsics¶

Inside a kernel, every thread has access to:

threadIdx.x, threadIdx.y, threadIdx.z    // its position in the block
blockIdx.x,  blockIdx.y,  blockIdx.z     // its block's position in the grid
blockDim.x,  blockDim.y,  blockDim.z     // block shape
gridDim.x,   gridDim.y,   gridDim.z      // grid shape

Plus:

warpSize                                  // always 32 today
laneId  =  threadIdx.x % 32              // position within warp (some helpers)
warpId  =  threadIdx.x / 32              // warp index within block

The canonical "what's my global ID?" pattern for a 1D launch:

int tid = blockIdx.x * blockDim.x + threadIdx.x;

For 2D (matrix work):

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

You'll write tid or row, col at the top of nearly every kernel.

The minimal vector-add kernel¶

__global__ void vec_add(const float* a, const float* b, float* c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

That's a complete kernel. To launch over N = 1M elements with 256-thread blocks:

int N = 1 << 20;
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size;   // ceil(N / block_size)
vec_add<<<grid_size, block_size>>>(d_a, d_b, d_c, N);

The pattern (N + block_size - 1) / block_size is the "round-up division" idiom — it ensures enough blocks to cover N even when N isn't a multiple of block_size, and the if (tid < N) guard inside the kernel handles the surplus threads.

Synchronization primitives¶

__syncthreads();              // barrier: all threads in this block must reach this point
__syncwarp();                 // barrier within a warp (cheap, lockstep)
__threadfence();              // memory fence within device
__threadfence_block();        // memory fence within block

cudaDeviceSynchronize();      // host-side: wait for all kernels to finish
cudaStreamSynchronize(stream); // host-side: wait for a stream

The host-side syncs you've already met in lab 00 — they're what makes timing meaningful. The device-side __syncthreads() is the crucial one for SMEM correctness: load a tile, sync, then read the tile. Without the sync, some threads might read before others have finished loading. Subtle bugs live here.

Three things to remember:

__syncthreads() must be called by all threads in the block. Calling it inside a divergent branch is undefined behavior (deadlock or wrong result).
__syncthreads() is not across blocks. Blocks are independent. Cross-block sync requires exiting the kernel and starting another.
__syncwarp() is cheaper than __syncthreads() but only syncs within a warp. Useful for warp-level reductions.

Memory spaces, as cuda qualifiers¶

Qualifier	Where	Access	Scope
`float* x = cudaMalloc(...);`	Global (HBM)	r/w	Device-wide
`__shared__ float t[N];`	SMEM	r/w	Per-block
`int local = ...;`	Register (or local mem if spilled)	r/w	Per-thread
`__constant__ float c[N];`	Constant cache	r-only from device	Device-wide
`texture<...>`	Texture cache	special access	Device-wide

SMEM and registers were covered in Phase-23 theory/02. Constant memory is an under-used niche: very fast for uniformly read values (all threads read the same address). Texture memory is mostly graphics legacy; modern AI kernels rarely use it.

A canonical pattern: SMEM-tiled reduction¶

This is the pattern you'll write a dozen times:

__global__ void block_sum(const float* x, float* out, int N) {
    __shared__ float tile[BLOCK];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load coalesced into SMEM.
    tile[tid] = (gid < N) ? x[gid] : 0.0f;
    __syncthreads();

    // Reduce in SMEM (tree reduction).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's sum.
    if (tid == 0) out[blockIdx.x] = tile[0];
}

Read this until it's obvious. Every line maps to a concept from Phase 23:

tile[tid] = x[gid]: coalesced global load (adjacent threads → adjacent addresses).
__syncthreads() after the load: required before any thread reads tile values written by another.
The tree reduction: log₂(BLOCK) sync rounds, each halving the active threads.
The if-guard if (tid < s): half the threads sit idle each round. Optimal? No, but simple. Real reductions use warp-level intrinsics (__shfl_down_sync) to skip the last few rounds.

A real reduction also handles N > blocks × block_size by having each block reduce multiple input chunks, but the principle is the same.

Kernel launch overhead¶

A kernel launch from the host costs ~5–10 microseconds. Sounds tiny — but for a 50-microsecond kernel, that's 10–20% overhead. For a 5-microsecond kernel, the overhead dominates.

Implications:

Launching many tiny kernels is bad. Fuse them.
torch.compile / Inductor in Phase 25 will fuse for you.
Hand-written kernels often combine multiple operators (e.g., GEMM + bias + ReLU) into one launch.
CUDA Graphs (Phase 33) batch kernel launches into one host-side submission, dropping per-launch overhead.

For Phase 24's kernel, this is mostly a non-issue (the kernel is the work, not the launch). But it's worth knowing.

Asynchrony in one paragraph¶

Kernel launches return immediately on the host — they're added to a stream and executed asynchronously. The host continues; the device runs in parallel. cudaDeviceSynchronize() blocks the host until the device catches up.

cudaMemcpyAsync is async; cudaMemcpy is sync. Streams interleave: you can H2D-transfer one batch while computing the previous batch's kernel, overlapping memory and compute. Phase 33 will use this; Phase 24 doesn't.

What this gives you for the labs¶

By the end of this page you can read the labs without lookups:

__global__, <<<grid, block>>>, threadIdx, blockIdx, __shared__, __syncthreads() are all primitive vocabulary.
The "round-up grid size + guard with if (tid < N)" pattern is standard.
The "load tile to SMEM → sync → reduce → sync → write back" pattern is standard.

That's enough to read the naive softmax kernel in lab/01 and the tuned one in lab/02.

What this page does NOT cover¶

CUDA C++ templates, __launch_bounds__ advanced use, CUTLASS. Phase 24 stays in C-flavored CUDA; templates are not necessary for the chosen kernel.
Cooperative groups (cg::thread_block_tile). A modern alternative to raw __syncthreads; mentioned in Phase 27 if needed.
PTX inline assembly. You won't need it for Phase 24.
CPU/Triton/PyTorch syntax. Theory 02, 03, 04 cover those.

Next: theory/02-from-naive-to-tiled.md — the canonical optimization path for a single kernel.