Skip to content

English · Español

Lab 00 — Hello, CUDA: Toolchain Check

Goal: verify the cloud-GPU toolchain works end-to-end. Write a trivial CUDA kernel (vector-add), compile, launch, validate against NumPy. The point is to surface tooling issues before the harder labs depend on them.

Estimated time: 1–2 hours (most of which is environment).

Prereq: Phase 23 complete. Cloud GPU instance available (rented per phase-23/lab/00-provision-cloud-gpu.md ritual). cupy or cuda-python installed.


What you produce

A directory experiments/24-hello-cuda/ containing:

  • vec_add.cu — the kernel source.
  • run.py — launch + correctness check.
  • manifest.json — see template below.
  • README.md — 1 paragraph: which build path (cupy.RawKernel or cuda-python), driver version, any pitfalls hit.

The kernel

__global__ void vec_add(const float* a, const float* b, float* c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

TODOs

Block A — pick the build path

  • Decide: cupy.RawKernel (simplest; Python-to-PTX JIT) or cuda-python with nvrtc (more flexible; closer to production). Default: cupy.RawKernel.
  • Install the chosen path via uv pip install. Pin version in pyproject.toml.
  • Verify nvidia-smi shows your GPU. Verify nvcc --version or the JIT path works.

Block B — launch the kernel

  • run.py: allocate two host arrays of size \(N = 2^{20}\) with np.random.default_rng(42).
  • Copy to device (cupy.asarray or equivalent).
  • Launch with <<<grid, 256>>> where grid = (N + 255) // 256.
  • Copy result back to host. Compare to a + b from NumPy. Assert exact equality (float32 addition is associative-enough for one op).

Block C — time it (lightly)

  • Time the kernel with CUDA events (NOT Python time.perf_counter — the launch is async; you'd measure submission, not execution).
  • Print bandwidth achieved: \(3 \cdot N \cdot 4\) bytes / time. Compare to peak HBM (nvidia-smi -q | grep "Memory Bus" or the cloud GPU's published spec).
  • Expected: vector-add is bandwidth-bound; should hit 60–80% of peak HBM. If it's <20%, something's wrong (likely cudaMemcpy is in the timing window).

Block D — manifest

{
  "experiment": "24-hello-cuda",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "gpu": {"model": null, "compute_capability": null, "driver": null},
  "build_path": "cupy.RawKernel",
  "versions": {"python": "3.11.x", "cupy": null, "cuda_runtime": null},
  "results": {
    "N": 1048576,
    "kernel_time_us": null,
    "achieved_bandwidth_gbs": null,
    "peak_bandwidth_gbs": null,
    "fraction_of_peak": null,
    "correctness": "passed | failed"
  }
}

Constraints

  • Don't time cudaMemcpy in the kernel time. Use CUDA events bracketing only the kernel launch.
  • Don't proceed if correctness fails. Vector-add is the simplest possible kernel; if it's wrong, the toolchain or the launch syntax is broken.
  • Don't tune. This is a toolchain check, not a perf benchmark. ≥20% of peak is plenty.

Stop conditions

Done when:

  1. run.py runs end-to-end on the cloud GPU.
  2. Correctness assert passes.
  3. Achieved bandwidth is ≥20% of peak (sanity check).
  4. manifest.json committed.
  5. Cloud instance billing recorded in your phase-23 cost log.

Pitfalls

  • cupy.RawKernel first-call overhead. The JIT compile happens on first call (~1–2 s). Time the second call. Or use cp.cuda.compile_with_cache(...).
  • cupy version mismatch with installed CUDA runtime. cupy-cuda12x is not the same wheel as cupy-cuda11x. Match to driver.
  • Off-by-one in grid. N // 256 misses the tail when N % 256 != 0. Use (N + 255) // 256 and the if (tid < N) guard.
  • Race on the device pointer. Returning the result before cudaDeviceSynchronize or stream sync gives stale data. cupy syncs on .get(); verify.

When to consult solutions/

After all stop conditions met. The reference walks through the cupy.RawKernel setup and a known-good vector-add timing.


Next lab: lab/01-naive-kernel.md.