English · Español
Lab 00 — Hello, CUDA: Toolchain Check¶
Goal: verify the cloud-GPU toolchain works end-to-end. Write a trivial CUDA kernel (vector-add), compile, launch, validate against NumPy. The point is to surface tooling issues before the harder labs depend on them.
Estimated time: 1–2 hours (most of which is environment).
Prereq: Phase 23 complete. Cloud GPU instance available (rented per
phase-23/lab/00-provision-cloud-gpu.mdritual).cupyorcuda-pythoninstalled.
What you produce¶
A directory experiments/24-hello-cuda/ containing:
vec_add.cu— the kernel source.run.py— launch + correctness check.manifest.json— see template below.README.md— 1 paragraph: which build path (cupy.RawKernelorcuda-python), driver version, any pitfalls hit.
The kernel¶
__global__ void vec_add(const float* a, const float* b, float* c, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) c[tid] = a[tid] + b[tid];
}
TODOs¶
Block A — pick the build path¶
- Decide:
cupy.RawKernel(simplest; Python-to-PTX JIT) orcuda-pythonwithnvrtc(more flexible; closer to production). Default:cupy.RawKernel. - Install the chosen path via
uv pip install. Pin version inpyproject.toml. - Verify
nvidia-smishows your GPU. Verifynvcc --versionor the JIT path works.
Block B — launch the kernel¶
-
run.py: allocate two host arrays of size \(N = 2^{20}\) withnp.random.default_rng(42). - Copy to device (
cupy.asarrayor equivalent). - Launch with
<<<grid, 256>>>wheregrid = (N + 255) // 256. - Copy result back to host. Compare to
a + bfrom NumPy. Assert exact equality (float32 addition is associative-enough for one op).
Block C — time it (lightly)¶
- Time the kernel with CUDA events (NOT Python
time.perf_counter— the launch is async; you'd measure submission, not execution). - Print bandwidth achieved: \(3 \cdot N \cdot 4\) bytes / time. Compare to peak HBM (
nvidia-smi -q | grep "Memory Bus"or the cloud GPU's published spec). - Expected: vector-add is bandwidth-bound; should hit 60–80% of peak HBM. If it's <20%, something's wrong (likely
cudaMemcpyis in the timing window).
Block D — manifest¶
{
"experiment": "24-hello-cuda",
"date": "YYYY-MM-DD",
"seed": 42,
"gpu": {"model": null, "compute_capability": null, "driver": null},
"build_path": "cupy.RawKernel",
"versions": {"python": "3.11.x", "cupy": null, "cuda_runtime": null},
"results": {
"N": 1048576,
"kernel_time_us": null,
"achieved_bandwidth_gbs": null,
"peak_bandwidth_gbs": null,
"fraction_of_peak": null,
"correctness": "passed | failed"
}
}
Constraints¶
- Don't time
cudaMemcpyin the kernel time. Use CUDA events bracketing only the kernel launch. - Don't proceed if correctness fails. Vector-add is the simplest possible kernel; if it's wrong, the toolchain or the launch syntax is broken.
- Don't tune. This is a toolchain check, not a perf benchmark. ≥20% of peak is plenty.
Stop conditions¶
Done when:
run.pyruns end-to-end on the cloud GPU.- Correctness assert passes.
- Achieved bandwidth is ≥20% of peak (sanity check).
manifest.jsoncommitted.- Cloud instance billing recorded in your phase-23 cost log.
Pitfalls¶
cupy.RawKernelfirst-call overhead. The JIT compile happens on first call (~1–2 s). Time the second call. Or usecp.cuda.compile_with_cache(...).cupyversion mismatch with installed CUDA runtime.cupy-cuda12xis not the same wheel ascupy-cuda11x. Match to driver.- Off-by-one in
grid.N // 256misses the tail whenN % 256 != 0. Use(N + 255) // 256and theif (tid < N)guard. - Race on the device pointer. Returning the result before
cudaDeviceSynchronizeor stream sync gives stale data.cupysyncs on.get(); verify.
When to consult solutions/¶
After all stop conditions met. The reference walks through the cupy.RawKernel setup and a known-good vector-add timing.
Next lab: lab/01-naive-kernel.md.