
Lab 01 - Device Query: Know Your GPU

Goal: programmatically dump every relevant spec of the rented GPU into JSON, and verify each one against the manufacturer's datasheet.

Estimated time: 60–90 minutes.

Prereq: lab/00-provision-cloud-gpu.md complete. A running cloud-GPU instance with cupy working.

What you produce

A directory experiments/23-device-profile/ containing:

device_query.py — your inspection script.
device_query.json — populated output, every field non-null.
comparison.md — table of your-GPU's numbers vs A100/H100/4090 reference values, with notes on what makes your GPU different.
manifest.json.

The fields to populate

device_query.json schema:

{
  "device_count": null,
  "device_index": 0,
  "name": null,
  "compute_capability": {"major": null, "minor": null},
  "total_memory_gib": null,
  "memory_bus_width_bits": null,
  "memory_clock_rate_mhz": null,
  "theoretical_hbm_bandwidth_gbs": null,

  "multiprocessor_count": null,
  "cuda_cores_per_sm": null,
  "total_cuda_cores": null,
  "tensor_cores_per_sm": null,
  "total_tensor_cores": null,

  "max_threads_per_sm": null,
  "max_threads_per_block": null,
  "max_warps_per_sm": null,
  "warp_size": null,

  "max_blocks_per_sm": null,
  "max_grid_dimensions": [null, null, null],
  "max_block_dimensions": [null, null, null],

  "register_file_size_per_sm_kib": null,
  "max_registers_per_thread": null,
  "max_shared_memory_per_block_kib": null,
  "shared_memory_per_sm_kib": null,
  "l2_cache_size_mib": null,

  "clock_rate_mhz": null,
  "memory_clock_rate_effective_mhz": null,

  "peak_fp32_tflops_cuda_cores": null,
  "peak_fp16_tflops_tensor_cores": null,
  "peak_bf16_tflops_tensor_cores": null,
  "peak_fp8_tflops_tensor_cores_if_supported": null,

  "pci_bus_id": null,
  "pcie_generation": null,
  "pcie_lane_width": null
}

TODOs

Block A - write `device_query.py`

Use cupy.cuda.runtime.getDeviceProperties(0) for the bulk; supplement with cupy.cuda.runtime.deviceGetAttribute for fields not in deviceProperties. (For Phase 23, do not use torch even though torch.cuda.get_device_properties is also an option — see Plan §7.g.)
Compute derived fields:
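theoretical_hbm_bandwidth_gbs = 2 * memory_clock_rate_mhz * memory_bus_width_bits / 8 / 1000 (the factor of 2 is the double data rate; most GPU memory is DDR, but check yours and document the factor you used).
total_cuda_cores = multiprocessor_count × cuda_cores_per_sm (cuda_cores_per_sm depends on the arch, and within Ampere on the chip: Turing = 64 fp32; Ampere GA100 (A100, compute capability 8.0) = 64 fp32; Ampere GA10x (RTX 30 series, compute capability 8.6) = 128 fp32; Hopper = 128 fp32; look up your compute capability).
peak_fp32_tflops_cuda_cores = clock_rate_GHz × total_cuda_cores × 2 / 1000 (the 2 counts an FMA as mul + add; the division by 1000 converts GFLOPS to TFLOPS).
For Tensor Core peak: tricky, varies by arch and by chip. Look up the "operations per clock per Tensor Core" for your compute capability (A100 / Ampere GA100 = 256 fp16-FMA/clock/tc; H100 / Hopper = 512 fp16-FMA/clock/tc; consumer chips are lower, so check the whitepaper for yours). Then peak_fp16_tflops_tensor_cores = clock_rate_GHz × total_tensor_cores × ops_per_clock × 2 / 1000 (the 2 is FMA; the division by 1000 converts GFLOPS to TFLOPS).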
Save to device_query.json with every field populated.
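A minimal sketch of the derived-field arithmetic above, written as plain Python functions so you can check the math before wiring in cupy. The function names and the FP32_CORES_PER_SM lookup are illustrative and not part of the lab's reference solution. The worked numbers are the published A100 40 GB specs (108 SMs, 1410 MHz boost clock, 1215 MHz memory clock, 5120-bit bus, 4 Tensor Cores per SM); swap in your own queried values.

```python
# Illustrative lookup: FP32 CUDA cores per SM by (major, minor) compute capability.
# Extend it from the whitepaper for your GPU if your capability is missing.
FP32_CORES_PER_SM = {
    (7, 5): 64,   # Turing
    (8, 0): 64,   # Ampere GA100 (A100)
    (8, 6): 128,  # Ampere GA10x (RTX 30 series)
    (8, 9): 128,  # Ada Lovelace
    (9, 0): 128,  # Hopper (H100)
}


def hbm_bandwidth_gbs(memory_clock_mhz, bus_width_bits, ddr_factor=2):
    """MHz * bits / 8 gives MB/s; dividing by 1000 gives GB/s."""
    return ddr_factor * memory_clock_mhz * bus_width_bits / 8 / 1000


def peak_fp32_tflops(clock_mhz, sm_count, cores_per_sm):
    """GHz * cores * 2 (FMA) gives GFLOPS; dividing by 1000 gives TFLOPS."""
    clock_ghz = clock_mhz / 1000
    return clock_ghz * sm_count * cores_per_sm * 2 / 1000


def peak_tensor_tflops(clock_mhz, total_tensor_cores, fma_per_clock_per_tc):
    """Same shape as the FP32 formula, with per-Tensor-Core FMA throughput."""
    clock_ghz = clock_mhz / 1000
    return clock_ghz * total_tensor_cores * fma_per_clock_per_tc * 2 / 1000


# Worked example with A100 40 GB numbers.
sm_count = 108
cores_per_sm = FP32_CORES_PER_SM[(8, 0)]
bandwidth = hbm_bandwidth_gbs(1215, 5120)              # about 1555 GB/s
fp32_peak = peak_fp32_tflops(1410, sm_count, cores_per_sm)  # about 19.5 TFLOPS
fp16_peak = peak_tensor_tflops(1410, sm_count * 4, 256)     # about 312 TFLOPS
print(bandwidth, fp32_peak, fp16_peak)
```

Keeping the formulas free of any cupy import means the same functions can be unit-tested on a laptop and then fed the queried values on the rented GPU.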

Block B - verify against datasheet

Open NVIDIA's whitepaper / datasheet for your GPU. Find the official "Peak FP32 TFLOPS", "Peak FP16 Tensor Core TFLOPS", "HBM Bandwidth", "L2 cache size".
Make a table in comparison.md:
Field | Your GPU (queried/computed) | Datasheet | Match? (Y/N) | Why if not
Any field where your computed peak differs from the datasheet by >10% needs an explanation. Common reasons: (a) the clock you queried is not the boost clock the datasheet assumes; (b) different definition of "core"; (c) dense vs sparse TFLOPS reporting.
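The 10% check can be sketched as a small helper that computes the relative difference and emits one table row. This is an illustration, not the reference solution: the function names are made up, using the 10% threshold as the Y/N criterion is an assumption, the leading field-name column is added for readability, and the 15.1 TFLOPS value is a hypothetical stand-in for a peak computed from a too-low clock.

```python
def compare_to_datasheet(field, computed, datasheet, tolerance=0.10):
    """Compare a computed value to the datasheet value by relative difference."""
    rel_diff = abs(computed - datasheet) / datasheet
    return {
        "field": field,
        "computed": computed,
        "datasheet": datasheet,
        "rel_diff": rel_diff,
        # Assumption: "Match?" is Y when the values agree within the tolerance.
        "match": "Y" if rel_diff <= tolerance else "N",
    }


def to_table_row(result, why=""):
    """Format one comparison as a pipe-separated row for comparison.md."""
    return (
        f"{result['field']} | {result['computed']:g} | "
        f"{result['datasheet']:g} | {result['match']} | {why}"
    )


# Computed peak close to the datasheet value: within tolerance.
ok = compare_to_datasheet("peak_fp32_tflops_cuda_cores", 19.49, 19.5)

# Hypothetical low value, e.g. computed from a clock below boost: needs a reason.
bad = compare_to_datasheet("peak_fp32_tflops_cuda_cores", 15.1, 19.5)

print(to_table_row(ok))
print(to_table_row(bad, why="queried clock below datasheet boost clock"))
```

Counting the rows where match is "N" also gives you datasheet_mismatch_count for the manifest.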

Block C - explain the architecture in `comparison.md`

Two paragraphs, in your own words:

Which architecture is your GPU? (Ampere, Ada Lovelace, Hopper, etc.) What was new in that arch vs the prior gen? (E.g., Ampere introduced TF32 Tensor Cores; Hopper introduced fp8.)
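What makes your specific SKU different from the data-center flagship? E.g., RTX 3090 = consumer Ampere (GA102) with a different SM design from A100 (GA100): 128 fp32 cores per SM instead of 64, but fewer SMs (82 vs 108), a faster boost clock, and 24 GB of GDDR6X instead of HBM2/HBM2e. Note the practical implication: bandwidth-bound kernels on a 3090 run at roughly 45 to 60% the speed of A100 because of the bandwidth gap (936 GB/s vs 1,555 GB/s on the 40 GB A100 and 2,039 GB/s on the 80 GB SXM A100).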

Block D - manifest

{
  "experiment": "23-device-profile",
  "date": "YYYY-MM-DD",
  "gpu_name": null,
  "compute_capability": null,
  "versions": {"python": "3.11.x", "cupy": "X.Y.Z"},
  "queried_fields_total": null,
  "non_null_fields": null,
  "datasheet_mismatch_count": null
}
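One way to fill queried_fields_total and non_null_fields is to walk the nested JSON and count leaves. The sketch below assumes each scalar leaf counts as one field (so compute_capability contributes 2 and max_grid_dimensions contributes 3); if you count top-level keys instead, adjust accordingly. The helper name count_leaves is illustrative.

```python
def count_leaves(node):
    """Return (total, non_null) over the scalar leaves of nested dicts/lists."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        # Scalar leaf: 0 and "" still count as populated, only None is null.
        return 1, int(node is not None)

    total = 0
    non_null = 0
    for child in children:
        child_total, child_non_null = count_leaves(child)
        total += child_total
        non_null += child_non_null
    return total, non_null


# Small stand-in for device_query.json; in the lab you would use
# json.load(open("device_query.json")) instead.
sample = {
    "device_index": 0,
    "name": "Example GPU",
    "compute_capability": {"major": 8, "minor": None},
    "max_grid_dimensions": [2147483647, None, 65535],
}
total, non_null = count_leaves(sample)
print(total, non_null)
```

The same check doubles as the first stop condition: the script is done when total equals non_null.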

Constraints

No PyTorch yet — cupy and cuda-python only. See Plan §7.g.
Don't hardcode "32 warp size" — query it. It's been 32 forever, but if someone runs this on AMD with 64-wide wavefronts, the hardcoded value is wrong.
Computed fields must be derivable from queried fields. E.g., don't put peak_fp16_tflops in if you got it from the datasheet — compute from clock × tensor cores. The datasheet is the verification, not the source.

Stop conditions

Done when:

device_query.json has every field non-null.
comparison.md table has all "Match?" cells filled, with explanations for mismatches.
comparison.md architecture paragraphs written.
manifest.json complete.

Pitfalls

getDeviceProperties returns kHz, not MHz, for clock fields. Always check the unit by comparing to the datasheet.
"CUDA cores per SM" varies by arch, and within Ampere by chip (GA100 = 64, GA10x = 128). Don't assume 128. Look up your compute capability.
"Tensor Core operations per clock" is the trickiest number. It's poorly documented; the cleanest source is NVIDIA's per-arch whitepaper. The Triton docs also list these.
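Boost vs base clock. The CUDA docs describe the clock attribute as the peak clock frequency, so clock_rate_mhz from the API is usually at or near the boost clock, which is also what the datasheet's "peak TFLOPS" assumes. Confirm it against the datasheet anyway: a value near the base clock understates every computed peak, and sustained clocks under real workloads can sit below boost.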
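The unit pitfalls above can be handled in one place by normalizing the raw properties dict before writing JSON. A minimal sketch, assuming the dict keys mirror the cudaDeviceProp field names (clockRate, memoryClockRate, and so on); print the dict once to confirm on your cupy version. The names normalize_props and query_device are illustrative, the fake_props values are a hand-written stand-in rather than real query output, and only a subset of the schema is covered.

```python
def normalize_props(props):
    """Map a getDeviceProperties-style dict to schema fields, converting units."""
    name = props["name"]
    if isinstance(name, bytes):  # the name may come back as bytes
        name = name.decode()
    warp_size = props["warpSize"]  # queried, never hardcoded
    return {
        "name": name,
        "compute_capability": {"major": props["major"], "minor": props["minor"]},
        "total_memory_gib": props["totalGlobalMem"] / 2**30,       # bytes -> GiB
        "memory_bus_width_bits": props["memoryBusWidth"],
        "memory_clock_rate_mhz": props["memoryClockRate"] / 1000,  # kHz -> MHz
        "clock_rate_mhz": props["clockRate"] / 1000,               # kHz -> MHz
        "multiprocessor_count": props["multiProcessorCount"],
        "warp_size": warp_size,
        "max_threads_per_sm": props["maxThreadsPerMultiProcessor"],
        "max_warps_per_sm": props["maxThreadsPerMultiProcessor"] // warp_size,
        "l2_cache_size_mib": props["l2CacheSize"] / 2**20,         # bytes -> MiB
    }


def query_device(index=0):
    """Query the real GPU; cupy is imported lazily so the rest runs without it."""
    import cupy
    return normalize_props(cupy.cuda.runtime.getDeviceProperties(index))


# Hand-written stand-in for the raw dict, to show the conversions.
fake_props = {
    "name": b"Example GPU",
    "major": 8,
    "minor": 0,
    "totalGlobalMem": 40 * 2**30,
    "memoryBusWidth": 5120,
    "memoryClockRate": 1215000,
    "clockRate": 1410000,
    "multiProcessorCount": 108,
    "warpSize": 32,
    "maxThreadsPerMultiProcessor": 2048,
    "l2CacheSize": 40 * 2**20,
}
normalized = normalize_props(fake_props)
print(normalized["clock_rate_mhz"], normalized["memory_clock_rate_mhz"])
```

On the rented instance, call query_device(0) and compare clock_rate_mhz against the datasheet before computing any peaks from it.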

When to consult `solutions/`

After all stop conditions met. The reference at solutions/01-device-query-ref.md (written at phase open) shows the script + a worked comparison table for the reference GPU (TBD by §7.a).

Next lab: lab/02-bandwidth-test.md.