Lab 01 - Device Query: Know Your GPU
Goal: programmatically dump every relevant spec of the rented GPU into JSON, and verify each one against the manufacturer's datasheet.
Estimated time: 60–90 minutes.
Prereq: lab/00-provision-cloud-gpu.md complete. A running cloud-GPU instance with cupy working.
What you produce
A directory experiments/23-device-profile/ containing:
- device_query.py: your inspection script.
- device_query.json: populated output, every field non-null.
- comparison.md: table of your GPU's numbers vs A100/H100/4090 reference values, with notes on what makes your GPU different.
- manifest.json.
The fields to populate
device_query.json schema:
{
"device_count": null,
"device_index": 0,
"name": null,
"compute_capability": {"major": null, "minor": null},
"total_memory_gib": null,
"memory_bus_width_bits": null,
"memory_clock_rate_mhz": null,
"theoretical_hbm_bandwidth_gbs": null,
"multiprocessor_count": null,
"cuda_cores_per_sm": null,
"total_cuda_cores": null,
"tensor_cores_per_sm": null,
"total_tensor_cores": null,
"max_threads_per_sm": null,
"max_threads_per_block": null,
"max_warps_per_sm": null,
"warp_size": null,
"max_blocks_per_sm": null,
"max_grid_dimensions": [null, null, null],
"max_block_dimensions": [null, null, null],
"register_file_size_per_sm_kib": null,
"max_registers_per_thread": null,
"max_shared_memory_per_block_kib": null,
"shared_memory_per_sm_kib": null,
"l2_cache_size_mib": null,
"clock_rate_mhz": null,
"memory_clock_rate_effective_mhz": null,
"peak_fp32_tflops_cuda_cores": null,
"peak_fp16_tflops_tensor_cores": null,
"peak_bf16_tflops_tensor_cores": null,
"peak_fp8_tflops_tensor_cores_if_supported": null,
"pci_bus_id": null,
"pcie_generation": null,
"pcie_lane_width": null
}
TODOs
Block A: write device_query.py
- Use cupy.cuda.runtime.getDeviceProperties(0) for the bulk; supplement with cupy.cuda.runtime.deviceGetAttribute for fields the properties dict does not carry. (For Phase 23, do not use torch even though torch.cuda.get_device_properties is also an option; see Plan §7.g.)
- Compute derived fields:
  - theoretical_hbm_bandwidth_gbs = 2 * memory_clock_rate_mhz * memory_bus_width_bits / 8 / 1000 (the factor of 2 is DDR; check whether your GPU's memory is DDR. Most are, but document what you find).
  - total_cuda_cores = sm_count × cuda_cores_per_sm (cuda_cores_per_sm depends on the arch and, for Ampere, on the chip: A100 / GA100 = 64 fp32, consumer Ampere / GA10x = 128 fp32, Hopper = 128 fp32, Turing = 64 fp32; look up your compute capability).
  - peak_fp32_tflops_cuda_cores = clock_rate_GHz × total_cuda_cores × 2 / 1000 (the 2 is FMA = mul + add; the division by 1000 converts GFLOPS to TFLOPS).
  - For Tensor Core peak: tricky, varies by arch. Look up the "operations per clock per Tensor Core" for your compute capability (A100 = 256 fp16-FMA/clock/tc; H100 = 512 fp16-FMA/clock/tc). Then peak_fp16_tflops = clock_GHz × total_tensor_cores × ops_per_clock × 2 / 1000 (the 2 is again FMA).
- Save to device_query.json with every field populated.
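As a sanity check on the derived-field formulas above, here is a minimal sketch that evaluates them with published A100 (40 GB) figures as stand-in inputs. The function names are illustrative, not part of the lab; in device_query.py the inputs would come from the queried fields, not from constants.

```python
def hbm_bandwidth_gbs(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical memory bandwidth in GB/s (transfers_per_clock=2 means DDR)."""
    return transfers_per_clock * memory_clock_mhz * bus_width_bits / 8 / 1000


def peak_fp32_tflops(clock_mhz, total_cuda_cores):
    """Peak FP32 TFLOPS: one FMA (2 FLOPs) per core per clock."""
    clock_ghz = clock_mhz / 1000
    return clock_ghz * total_cuda_cores * 2 / 1000  # GFLOPS -> TFLOPS


def peak_tensor_tflops(clock_mhz, total_tensor_cores, fma_per_clock_per_tc):
    """Peak dense Tensor Core TFLOPS; each FMA counts as 2 FLOPs."""
    clock_ghz = clock_mhz / 1000
    return clock_ghz * total_tensor_cores * fma_per_clock_per_tc * 2 / 1000


# Assumed inputs (A100 40 GB): 108 SMs, 64 FP32 cores and 4 Tensor Cores per SM,
# 1410 MHz boost clock, 1215 MHz memory clock, 5120-bit memory bus.
sm_count = 108
total_cuda_cores = sm_count * 64
total_tensor_cores = sm_count * 4

derived = {
    "theoretical_hbm_bandwidth_gbs": round(hbm_bandwidth_gbs(1215, 5120), 1),
    "total_cuda_cores": total_cuda_cores,
    "total_tensor_cores": total_tensor_cores,
    "peak_fp32_tflops_cuda_cores": round(peak_fp32_tflops(1410, total_cuda_cores), 2),
    "peak_fp16_tflops_tensor_cores": round(
        peak_tensor_tflops(1410, total_tensor_cores, 256), 1
    ),
}
print(derived)
```

With these inputs the results land at about 1555 GB/s, 19.5 FP32 TFLOPS and 312 dense FP16 TFLOPS, in line with the published A100 numbers. If your own GPU is off by a factor of 1000 or 2, suspect a unit or DDR mistake first.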
Block B: verify against datasheet
- Open NVIDIA's whitepaper / datasheet for your GPU. Find the official "Peak FP32 TFLOPS", "Peak FP16 Tensor Core TFLOPS", "HBM Bandwidth", "L2 cache size".
- Make a table in comparison.md with these columns: Your-GPU measured/queried | Datasheet | Match? (Y/N) | Why-if-not
- Any field where your computed peak differs from the datasheet by >10% needs an explanation. Common reasons: (a) the clock you're querying is base, not boost; (b) a different definition of "core"; (c) dense vs sparse TFLOPS reporting.
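The >10% rule can be applied mechanically. The sketch below builds table rows and flags mismatches. The compare helper and the leading "Field" column are assumptions added for illustration, and the numbers are illustrative A100 values (the 624 figure is the datasheet's with-sparsity FP16 number, chosen to trigger a mismatch).

```python
def compare(field, computed, datasheet, tolerance=0.10):
    """Return one table row; 'N' when the relative gap exceeds the tolerance."""
    rel_diff = abs(computed - datasheet) / datasheet
    match = "Y" if rel_diff <= tolerance else "N"
    why = "" if match == "Y" else "TODO: explain"
    return f"| {field} | {computed:g} | {datasheet:g} | {match} | {why} |"


rows = [
    compare("Peak FP32 TFLOPS", 19.49, 19.5),
    # Computed dense peak vs a datasheet number quoted with sparsity.
    compare("Peak FP16 Tensor Core TFLOPS", 311.9, 624.0),
]
header = "| Field | Your-GPU measured/queried | Datasheet | Match? (Y/N) | Why-if-not |"
separator = "|---|---|---|---|---|"
print("\n".join([header, separator, *rows]))
```

The second row comes out as a mismatch because the two numbers differ by a factor of two, which is the dense-vs-sparse case from the list of common reasons. The "TODO: explain" cell is a placeholder to replace by hand.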
Block C: explain the architecture in comparison.md
Two paragraphs, in your own words:
- Which architecture is your GPU? (Ampere, Ada Lovelace, Hopper, etc.) What was new in that arch vs the prior gen? (E.g., Ampere introduced TF32 Tensor Cores; Hopper introduced fp8.)
- What makes your specific SKU different from the data-center flagship? E.g., RTX 3090 = consumer Ampere with fewer SMs than A100 (82 vs 108) and a different SM layout, a faster boost clock, and GDDR6X instead of HBM2. Note the practical implication: bandwidth-bound kernels on a 3090 run at roughly 45 to 60% of A100 speed because of the bandwidth gap (936 GB/s vs 1555 GB/s on the 40 GB A100 and 2039 GB/s on the 80 GB SXM part).
Block D: manifest
{
"experiment": "23-device-profile",
"date": "YYYY-MM-DD",
"gpu_name": null,
"compute_capability": null,
"versions": {"python": "3.11.x", "cupy": "X.Y.Z"},
"queried_fields_total": null,
"non_null_fields": null,
"datasheet_mismatch_count": null
}
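One way to fill the count fields is to derive them from the loaded device_query.json instead of counting by hand. The sketch below rests on two assumptions the lab does not spell out: every leaf value counts as one field (so compute_capability counts twice and each dimension list three times), and compute_capability is written as a "major.minor" string. The build_manifest and leaves helpers are hypothetical names.

```python
import json
import platform
from datetime import date


def leaves(value):
    """Yield every leaf value of a nested dict/list structure."""
    if isinstance(value, dict):
        for child in value.values():
            yield from leaves(child)
    elif isinstance(value, list):
        for child in value:
            yield from leaves(child)
    else:
        yield value


def build_manifest(device_query, cupy_version, datasheet_mismatch_count):
    """Assemble manifest.json contents from the device_query dict."""
    all_leaves = list(leaves(device_query))
    cc = device_query["compute_capability"]
    return {
        "experiment": "23-device-profile",
        "date": date.today().isoformat(),
        "gpu_name": device_query["name"],
        "compute_capability": f"{cc['major']}.{cc['minor']}",  # assumed format
        "versions": {"python": platform.python_version(), "cupy": cupy_version},
        "queried_fields_total": len(all_leaves),
        "non_null_fields": sum(leaf is not None for leaf in all_leaves),
        "datasheet_mismatch_count": datasheet_mismatch_count,
    }


# Tiny stand-in for the loaded device_query.json (hypothetical values).
sample = {
    "name": "Example GPU",
    "compute_capability": {"major": 8, "minor": 0},
    "max_grid_dimensions": [2147483647, 65535, 65535],
    "l2_cache_size_mib": None,
}
manifest = build_manifest(sample, cupy_version="X.Y.Z", datasheet_mismatch_count=0)
print(json.dumps(manifest, indent=2))
```

In the real script, pass cupy.__version__ as cupy_version and the number of "N" rows from comparison.md as datasheet_mismatch_count. When non_null_fields is lower than queried_fields_total, the stop conditions are not yet met.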
Constraints
- No PyTorch yet: cupy and cuda-python only. See Plan §7.g.
- Don't hardcode "32 warp size"; query it. It's been 32 forever, but if someone runs this on AMD with 64-wide wavefronts, the hardcoded value is wrong.
- Computed fields must be derivable from queried fields. E.g., don't put peak_fp16_tflops in if you got it from the datasheet; compute it from clock × tensor cores. The datasheet is the verification, not the source.
Stop conditions
Done when:
- device_query.json has every field non-null.
- comparison.md table has all "Match?" cells filled, with explanations for mismatches.
- comparison.md architecture paragraphs written.
- manifest.json complete.
Pitfalls
- getDeviceProperties returns kHz, not MHz, for clock fields. Always check the unit by comparing to the datasheet.
- "Cuda cores per SM" varies by arch. Don't assume 128. Look up your compute capability.
- "Tensor Core operations per clock" is the trickiest number. It's poorly documented; the cleanest source is NVIDIA's per-arch whitepaper. The Triton docs also list these.
- Boost vs base clock. The datasheet's "peak TFLOPS" assumes boost, and real workloads run near boost. Don't assume which one the API reports: compare clock_rate_mhz with the datasheet's base and boost figures for your GPU, and record which it matches.
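The unit pitfall is easiest to avoid by converting in exactly one place. The sketch below maps a raw properties dict to a few schema fields. The key names (clockRate, memoryClockRate, and so on) are assumed to mirror the CUDA cudaDeviceProp struct, which is what cupy's getDeviceProperties is expected to return; print the dict's keys on your instance to confirm, since newer CUDA versions may drop some of them. normalize_units and query_raw are hypothetical helper names, and the sample values are illustrative A100-like numbers.

```python
def normalize_units(props):
    """Map raw device properties (kHz, bytes) to the schema's MHz/GiB/MiB/KiB."""
    name = props["name"]
    return {
        "name": name.decode() if isinstance(name, bytes) else name,
        "clock_rate_mhz": props["clockRate"] / 1000,               # kHz -> MHz
        "memory_clock_rate_mhz": props["memoryClockRate"] / 1000,  # kHz -> MHz
        "memory_bus_width_bits": props["memoryBusWidth"],
        "total_memory_gib": props["totalGlobalMem"] / 2**30,       # bytes -> GiB
        "l2_cache_size_mib": props["l2CacheSize"] / 2**20,         # bytes -> MiB
        "max_shared_memory_per_block_kib": props["sharedMemPerBlock"] / 2**10,
        "shared_memory_per_sm_kib": props["sharedMemPerMultiprocessor"] / 2**10,
        "warp_size": props["warpSize"],                            # queried, not hardcoded
    }


def query_raw(device_index=0):
    """Fetch raw properties. Needs cupy and a CUDA GPU, so it is not called here."""
    import cupy
    return cupy.cuda.runtime.getDeviceProperties(device_index)


# Stand-in dict shaped like the assumed getDeviceProperties output.
fake_props = {
    "name": b"NVIDIA A100-SXM4-40GB",
    "clockRate": 1410000,
    "memoryClockRate": 1215000,
    "memoryBusWidth": 5120,
    "totalGlobalMem": 40 * 2**30,
    "l2CacheSize": 40 * 2**20,
    "sharedMemPerBlock": 48 * 2**10,
    "sharedMemPerMultiprocessor": 164 * 2**10,
    "warpSize": 32,
}
normalized = normalize_units(fake_props)
print(normalized)
```

On the GPU instance, replace fake_props with query_raw(0). Keeping the conversion in a pure function means it can be checked without a GPU, and a clock value near 1,400,000 in the output is an immediate sign that a kHz field slipped through unconverted.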
When to consult solutions/
After all stop conditions met. The reference at solutions/01-device-query-ref.md (written at phase open) shows the script + a worked comparison table for the reference GPU (TBD by §7.a).
Next lab: lab/02-bandwidth-test.md.