Phase 24 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-24-cuda-triton.yaml.


q-24-01 — When CUDA wins over Triton

Prompt (EN): Which of the following is a case where raw CUDA is preferable to Triton?

  • A. Implementing a vector-add with stride 1.
  • B. A simple block matmul.
  • C. Warp specialization (different warps in a block playing different roles).
  • D. Quick prototyping of a new attention variant.

Correct: C. Warp specialization requires controlling which warps in a block do what, but Triton programs are written at the block (tile) level and do not expose per-thread or per-warp identity. The other three options are exactly where Triton's tile abstraction wins.
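
To make "different warps playing different roles" concrete, here is a minimal host-side model in plain Python (not CUDA). It assumes a warp size of 32 and a one-dimensional block; `warp_role` and `producer_warps` are hypothetical names introduced only for illustration.

```python
WARP_SIZE = 32  # assumed warp width (the value on current NVIDIA GPUs)

def warp_role(thread_idx: int, producer_warps: int = 1) -> str:
    """Role of the warp that owns a linear thread index within one block."""
    warp_id = thread_idx // WARP_SIZE
    # The first `producer_warps` warps load data; the rest compute.
    return "producer" if warp_id < producer_warps else "consumer"

block_dim = 128  # four warps per block
roles = [warp_role(t) for t in range(block_dim)]
print(roles.count("producer"), "producer threads,",
      roles.count("consumer"), "consumer threads")
```

In a CUDA kernel the same decision would be a branch on `threadIdx.x / warpSize`. A Triton program has no equivalent handle, since its unit of work is the whole tile.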


q-24-02 — Shared memory overflow

Prompt (EN): A CUDA kernel writes tile[threadIdx.y][15] but tile is declared __shared__ float tile[16][15]. What happens?

  • A. The compiler rejects the kernel.
  • B. Runtime raises CUDA_ERROR_ILLEGAL_ADDRESS on the OOB write.
  • C. The write silently corrupts the next shared-memory allocation in the kernel.
  • D. The kernel silently produces correct output (the GPU pads).

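Correct: C. CUDA does not bounds-check array indices at runtime. With row-major layout, tile[threadIdx.y][15] lands on tile[threadIdx.y + 1][0] for rows 0 to 14, and for row 15 it lands one float past the end of tile, corrupting whatever follows in the block's shared-memory layout.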
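
The index arithmetic behind this answer can be checked on the host. The following is a minimal sketch in plain Python (not CUDA), assuming the row-major layout that C and CUDA use for multidimensional arrays; `flat_offset` is a hypothetical helper introduced for illustration.

```python
ROWS, COLS = 16, 15  # matches __shared__ float tile[16][15]

def flat_offset(row: int, col: int) -> int:
    """Element offset of tile[row][col] in a row-major [ROWS][COLS] array."""
    return row * COLS + col

total = ROWS * COLS  # number of floats that tile actually owns

# Show where a few tile[y][15] writes really land.
for y in (0, 14, 15):
    off = flat_offset(y, 15)
    if off < total:
        print(f"tile[{y}][15] -> offset {off}, "
              f"aliases tile[{off // COLS}][{off % COLS}]")
    else:
        print(f"tile[{y}][15] -> offset {off}, "
              f"{off - total + 1} float(s) past the end of tile")
```

Under this model, rows 0 to 14 overwrite the first element of the following row, and only row 15 writes outside `tile`. Formally the access is undefined behavior in every case, so the sketch describes what typically happens rather than a guarantee.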


q-24-03 — Diagnosing shared-mem bugs

Prompt (EN): In one or two sentences, name the tool you would use to catch a silent shared-memory OOB write in a CUDA kernel, and what the tool reports.

Free response. Expected mentions: compute-sanitizer (or its deprecated predecessor, cuda-memcheck); reports "Invalid __shared__ write of size N bytes at ...".


q-24-04 — Build-before-abstract layering

Prompt (EN): Select every statement that correctly characterizes the layering of GPU programming abstractions per CLAUDE.md §0.1.

  • A. NumPy / hand-written kernels come before CUDA.
  • B. CUDA comes before Triton.
  • C. Triton comes before PyTorch's torch.nn.functional.
  • D. PyTorch's high-level modules come before NumPy.

Correct: A, B, C. Each layer is meaningful only once the layer below it has been practiced; D inverts that order.


q-24-05 — Triton block-size auto-tuning

Prompt (EN): A Triton kernel declares BLOCK_SIZE: tl.constexpr and is auto-tuned over [64, 128, 256]. What does this mean for the compiled kernel?

  • A. The runtime picks BLOCK_SIZE based on input size on every call.
  • B. Triton emits three specialized kernels; one is selected based on auto-tuning at first call.
  • C. BLOCK_SIZE is a runtime parameter passed in by the host.
  • D. The compiler picks one value at compile time and never reconsiders.

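Correct: B. tl.constexpr parameters are compile-time constants baked into the compiled kernel. One specialization is compiled for each value in the auto-tune set, the candidates are benchmarked on the first invocation (for a given auto-tune key), and the fastest is cached and reused on later calls.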
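
The selection step can be modeled without a GPU. The following is a toy sketch in plain Python, not Triton's implementation: `ToyAutotuner`, `make_sum_kernel`, and `fake_cost` are hypothetical names, the key is assumed to be the input length, and a deterministic cost function stands in for the wall-clock benchmark a real autotuner would run.

```python
from typing import Callable, Dict, List

Kernel = Callable[[List[int]], int]

def make_sum_kernel(block_size: int) -> Kernel:
    """Build one specialization with block_size baked in, like a constexpr."""
    def kernel(data: List[int]) -> int:
        total = 0
        for start in range(0, len(data), block_size):
            total += sum(data[start:start + block_size])
        return total
    return kernel

def fake_cost(block_size: int, n: int) -> int:
    """Made-up cost model: per-block launch cost plus a per-block-size penalty."""
    blocks = -(-n // block_size)  # ceiling division
    return blocks * 200 + block_size

class ToyAutotuner:
    def __init__(self, make_kernel, candidates, cost):
        # One specialized kernel per candidate value.
        self.kernels: Dict[int, Kernel] = {bs: make_kernel(bs) for bs in candidates}
        self.cost = cost
        self.best: Dict[int, int] = {}  # key -> chosen block size
        self.benchmarks_run = 0

    def __call__(self, data: List[int]) -> int:
        key = len(data)  # assumed tuning key
        if key not in self.best:
            # First call for this key: score every candidate, cache the winner.
            scores = {}
            for bs in self.kernels:
                scores[bs] = self.cost(bs, key)
                self.benchmarks_run += 1
            self.best[key] = min(scores, key=scores.get)
        return self.kernels[self.best[key]](data)

tuner = ToyAutotuner(make_sum_kernel, [64, 128, 256], fake_cost)
print(tuner(list(range(1000))), tuner.best, tuner.benchmarks_run)
```

Calling `tuner` again with an input of the same length reuses the cached choice and runs no further benchmarks, which is the behavior option B describes and option A gets wrong.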