Lab 00 — Hello, CUDA: Toolchain Check

Goal: verify the cloud-GPU toolchain works end-to-end. Write a trivial CUDA kernel (vector-add), compile, launch, validate against NumPy. The point is to surface tooling issues before the harder labs depend on them.

Estimated time: 1–2 hours (most of which is environment).

Prereq: Phase 23 complete. Cloud GPU instance available (rented per phase-23/lab/00-provision-cloud-gpu.md ritual). cupy or cuda-python installed.

What you produce

A directory experiments/24-hello-cuda/ containing:

vec_add.cu — the kernel source.
run.py — launch + correctness check.
manifest.json — see template below.
README.md — 1 paragraph: which build path (cupy.RawKernel or cuda-python), driver version, any pitfalls hit.

The kernel

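// extern "C" prevents C++ name mangling, so the runtime compiler can look the kernel up as "vec_add".
extern "C" __global__ void vec_add(const float* a, const float* b, float* c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid < N) c[tid] = a[tid] + b[tid];            // guard the tail of the last block
}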

TODOs

Block A — pick the build path

Decide: cupy.RawKernel (simplest; JIT-compiles the CUDA C source through NVRTC on first launch) or cuda-python with nvrtc (more flexible; closer to production). Default: cupy.RawKernel.
Install the chosen path via uv pip install. Pin version in pyproject.toml.
Verify nvidia-smi shows your GPU. Verify nvcc --version or the JIT path works.
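The checks above could be scripted roughly as follows. This is a minimal sketch, not part of the lab's reference solution: the helper names query_gpus and has_module are hypothetical, and it assumes nvidia-smi accepts the --query-gpu=name,driver_version and --format=csv,noheader options.

```python
import importlib.util
import shutil
import subprocess


def query_gpus():
    """Return a list of 'name, driver_version' strings, or None if nvidia-smi is unusable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not on PATH
    try:
        proc = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None  # the tool exists but could not talk to the driver
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]


def has_module(name):
    """True if the named top-level Python package is importable in this environment."""
    return importlib.util.find_spec(name) is not None


def report():
    """Print a short toolchain summary for the README."""
    print("GPUs:", query_gpus())
    print("cupy installed:", has_module("cupy"))
    print("nvcc on PATH:", shutil.which("nvcc") is not None)
```

Calling report() on the instance prints one line per check. A missing nvcc is not necessarily a problem on the cupy.RawKernel path, since the JIT route is what the lab asks you to verify there.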

Block B — launch the kernel

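run.py: allocate two float32 host arrays of size \(N = 2^{20}\) with np.random.default_rng(42), for example rng.random(N, dtype=np.float32). The kernel takes float*, so NumPy's default float64 would be misread on the device.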
Copy to device (cupy.asarray or equivalent).
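Launch with grid blocks of 256 threads, where grid = (N + 255) // 256. In CUDA C++ this is vec_add<<<grid, 256>>>(...); with cupy.RawKernel the equivalent is kernel((grid,), (256,), (a, b, c, np.int32(N))).
Copy result back to host. Compare to a + b from NumPy. Assert exact equality: a single float32 addition is correctly rounded under IEEE 754 on both CPU and GPU, so the two results should match bit for bit.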
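The Block B steps could be sketched as below, assuming the default cupy.RawKernel path. The kernel source is inlined here only to keep the example self-contained; the lab's run.py would more likely read it from vec_add.cu. The helper names grid_size, make_inputs, and main are illustrative, not taken from the reference solution.

```python
import numpy as np

BLOCK = 256  # threads per block, as specified in the lab

# extern "C" keeps the symbol name unmangled so RawKernel can find "vec_add".
KERNEL_SRC = r'''
extern "C" __global__
void vec_add(const float* a, const float* b, float* c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}
'''


def grid_size(n, block=BLOCK):
    """Ceil-divide so the last, partially filled block is still launched."""
    return (n + block - 1) // block


def make_inputs(n, seed=42):
    """Two float32 host arrays; float32 must match the kernel's float* parameters."""
    rng = np.random.default_rng(seed)
    a = rng.random(n, dtype=np.float32)
    b = rng.random(n, dtype=np.float32)
    return a, b


def main(n=1 << 20):
    import cupy as cp  # imported lazily so the helpers above work without a GPU

    a, b = make_inputs(n)
    d_a, d_b = cp.asarray(a), cp.asarray(b)  # host-to-device copies
    d_c = cp.empty_like(d_a)

    vec_add = cp.RawKernel(KERNEL_SRC, "vec_add")
    # Launch signature: (grid dims), (block dims), (kernel arguments).
    # N is cast to a 32-bit int to match `int N` in the kernel.
    vec_add((grid_size(n),), (BLOCK,), (d_a, d_b, d_c, np.int32(n)))

    c = d_c.get()  # device-to-host copy, ordered after the kernel on the same stream
    np.testing.assert_array_equal(c, a + b)  # exact equality, as the lab requires
    print("correctness: passed")
```

Calling main() at the bottom of run.py on the GPU instance runs the whole check. The cupy import sits inside main() so the pure helpers can be exercised on a machine without a GPU.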

Block C — time it (lightly)

Time the kernel with CUDA events (NOT Python time.perf_counter — the launch is async; you'd measure submission, not execution).
Print bandwidth achieved: \(3 \cdot N \cdot 4\) bytes / time (two float32 arrays read and one written, 4 bytes per element). Compare to peak HBM (nvidia-smi -q | grep "Memory Bus" or the cloud GPU's published spec).
Expected: vector-add is bandwidth-bound; should hit 60–80% of peak HBM. If it's <20%, something's wrong (likely cudaMemcpy is in the timing window).
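One way to implement the timing and the bandwidth arithmetic is sketched below. It assumes cupy's cp.cuda.Event and cp.cuda.get_elapsed_time, which reports milliseconds. The names time_kernel_ms and achieved_bandwidth_gbs are hypothetical, and launch stands for any zero-argument callable that enqueues the kernel with no copies inside it.

```python
def achieved_bandwidth_gbs(n, time_ms, bytes_per_elem=4, arrays=3):
    """GB/s moved by vector-add: `arrays` arrays of n elements (2 reads + 1 write)."""
    total_bytes = arrays * n * bytes_per_elem
    return total_bytes / (time_ms * 1e-3) / 1e9


def time_kernel_ms(launch, warmup=1):
    """Time a single call of `launch()` with CUDA events; returns milliseconds."""
    import cupy as cp  # imported lazily so the arithmetic helper works without a GPU

    for _ in range(warmup):
        launch()  # absorbs the first-call JIT compile
    start, stop = cp.cuda.Event(), cp.cuda.Event()
    start.record()      # enqueued on the current stream, just before the kernel
    launch()
    stop.record()       # enqueued just after the kernel
    stop.synchronize()  # wait until the kernel and the stop event have completed
    return cp.cuda.get_elapsed_time(start, stop)
```

As a sanity check on the arithmetic: for \(N = 2^{20}\) the kernel moves 12,582,912 bytes, so a 20 µs (0.02 ms) kernel corresponds to about 629 GB/s. Multiply the returned milliseconds by 1000 before storing the value in kernel_time_us.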

Block D — manifest

{
  "experiment": "24-hello-cuda",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "gpu": {"model": null, "compute_capability": null, "driver": null},
  "build_path": "cupy.RawKernel",
  "versions": {"python": "3.11.x", "cupy": null, "cuda_runtime": null},
  "results": {
    "N": 1048576,
    "kernel_time_us": null,
    "achieved_bandwidth_gbs": null,
    "peak_bandwidth_gbs": null,
    "fraction_of_peak": null,
    "correctness": "passed | failed"
  }
}
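Filling the template by hand works, but a small helper keeps the derived field fraction_of_peak consistent with the measured numbers. The following is a sketch under the assumption that the other values were collected earlier in run.py; build_manifest and write_manifest are hypothetical names, and the null fields from the template stay None until you supply them.

```python
import json
import platform
from datetime import date


def build_manifest(n, kernel_time_us, achieved_gbs, peak_gbs, passed,
                   gpu=None, versions=None):
    """Return a dict with the same shape as the manifest template."""
    return {
        "experiment": "24-hello-cuda",
        "date": date.today().isoformat(),
        "seed": 42,
        "gpu": gpu or {"model": None, "compute_capability": None, "driver": None},
        "build_path": "cupy.RawKernel",
        "versions": versions or {"python": platform.python_version(),
                                 "cupy": None, "cuda_runtime": None},
        "results": {
            "N": n,
            "kernel_time_us": kernel_time_us,
            "achieved_bandwidth_gbs": achieved_gbs,
            "peak_bandwidth_gbs": peak_gbs,
            "fraction_of_peak": achieved_gbs / peak_gbs,  # derived, never typed by hand
            "correctness": "passed" if passed else "failed",
        },
    }


def write_manifest(manifest, path="manifest.json"):
    """Serialize the manifest with stable, readable formatting."""
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)
        fh.write("\n")
```

Pass your own measurements to build_manifest, then check that results.fraction_of_peak is at least 0.20 before committing the file.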

Constraints

Don't time cudaMemcpy in the kernel time. Use CUDA events bracketing only the kernel launch.
Don't proceed if correctness fails. Vector-add is the simplest possible kernel; if it's wrong, the toolchain or the launch syntax is broken.
Don't tune. This is a toolchain check, not a perf benchmark. ≥20% of peak is plenty.

Stop conditions

Done when:

run.py runs end-to-end on the cloud GPU.
Correctness assert passes.
Achieved bandwidth is ≥20% of peak (sanity check).
manifest.json committed.
Cloud instance billing recorded in your phase-23 cost log.

Pitfalls

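cupy.RawKernel first-call overhead. The JIT compile happens on first call (~1–2 s). Time the second call, or trigger compilation up front with kernel.compile(). (cp.cuda.compile_with_cache(...) appears in older guides but is deprecated in recent CuPy releases; check your version.)
cupy version mismatch with installed CUDA runtime. cupy-cuda12x is not the same wheel as cupy-cuda11x. Install the wheel whose CUDA major version your driver supports (the "CUDA Version" field in the nvidia-smi header).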
Off-by-one in grid. N // 256 misses the tail when N % 256 != 0. Use (N + 255) // 256 and the if (tid < N) guard.
Race on the device pointer. Returning the result before cudaDeviceSynchronize or stream sync gives stale data. cupy syncs on .get(); verify.
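The off-by-one pitfall is easy to miss in this lab because \(N = 2^{20}\) is an exact multiple of 256, so both grid formulas give the same answer. A small arithmetic sketch (launch_coverage is a hypothetical helper) shows what happens for a size that is not a multiple of the block size:

```python
def launch_coverage(n, block=256):
    """Compare floor-division and ceil-division grids for n elements."""
    floor_blocks = n // block
    ceil_blocks = (n + block - 1) // block
    return {
        "floor_blocks": floor_blocks,
        "ceil_blocks": ceil_blocks,
        # Elements that no thread ever touches when the grid is n // block.
        "missed_with_floor": n - floor_blocks * block,
        # Surplus threads in the last block; the `if (tid < N)` guard keeps them idle.
        "idle_with_ceil": ceil_blocks * block - n,
    }
```

For n = 1000, the floor grid launches 3 blocks and leaves 232 elements unwritten, while the ceil grid launches 4 blocks with 24 guarded idle threads. Rerunning the correctness check with a non-multiple size such as 1000 is a cheap way to confirm that both the grid formula and the guard are in place.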

When to consult `solutions/`

After all stop conditions are met. The reference walks through the cupy.RawKernel setup and a known-good vector-add timing.

Next lab: lab/01-naive-kernel.md.