Lab 01 — Device Query: Know Your GPU

Goal: programmatically dump every relevant spec of the rented GPU into JSON, and verify each one against the manufacturer's datasheet.

Estimated time: 60–90 minutes.

Prereq: lab/00-provision-cloud-gpu.md complete. A running cloud-GPU instance with cupy working.


What you produce

A directory experiments/23-device-profile/ containing:

  • device_query.py — your inspection script.
  • device_query.json — populated output, every field non-null.
  • comparison.md — table of your-GPU's numbers vs A100/H100/4090 reference values, with notes on what makes your GPU different.
  • manifest.json.

The fields to populate

device_query.json schema:

{
  "device_count": null,
  "device_index": 0,
  "name": null,
  "compute_capability": {"major": null, "minor": null},
  "total_memory_gib": null,
  "memory_bus_width_bits": null,
  "memory_clock_rate_mhz": null,
  "theoretical_hbm_bandwidth_gbs": null,

  "multiprocessor_count": null,
  "cuda_cores_per_sm": null,
  "total_cuda_cores": null,
  "tensor_cores_per_sm": null,
  "total_tensor_cores": null,

  "max_threads_per_sm": null,
  "max_threads_per_block": null,
  "max_warps_per_sm": null,
  "warp_size": null,

  "max_blocks_per_sm": null,
  "max_grid_dimensions": [null, null, null],
  "max_block_dimensions": [null, null, null],

  "register_file_size_per_sm_kib": null,
  "max_registers_per_thread": null,
  "max_shared_memory_per_block_kib": null,
  "shared_memory_per_sm_kib": null,
  "l2_cache_size_mib": null,

  "clock_rate_mhz": null,
  "memory_clock_rate_effective_mhz": null,

  "peak_fp32_tflops_cuda_cores": null,
  "peak_fp16_tflops_tensor_cores": null,
  "peak_bf16_tflops_tensor_cores": null,
  "peak_fp8_tflops_tensor_cores_if_supported": null,

  "pci_bus_id": null,
  "pcie_generation": null,
  "pcie_lane_width": null
}
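
One way to fill the directly queried fields of this schema is sketched below. It assumes the dict returned by cupy.cuda.runtime.getDeviceProperties uses the cudaDeviceProp key names shown (clockRate, memoryClockRate, regsPerMultiprocessor, and so on); check them against your CuPy version. The function names queried_fields and main are hypothetical, and the derived and PCIe fields are left out on purpose.

```python
import json


def queried_fields(props, device_count, device_index=0):
    """Map a getDeviceProperties-style dict onto the queried schema fields.

    Assumption: `props` uses cudaDeviceProp key names and reports clocks in kHz.
    """
    name = props["name"]
    if isinstance(name, bytes):  # CuPy returns the device name as bytes
        name = name.decode()
    warp = props["warpSize"]  # queried, never hardcoded
    return {
        "device_count": device_count,
        "device_index": device_index,
        "name": name,
        "compute_capability": {"major": props["major"], "minor": props["minor"]},
        "total_memory_gib": props["totalGlobalMem"] / 2**30,
        "memory_bus_width_bits": props["memoryBusWidth"],
        "memory_clock_rate_mhz": props["memoryClockRate"] / 1000,  # kHz -> MHz
        "multiprocessor_count": props["multiProcessorCount"],
        "max_threads_per_sm": props["maxThreadsPerMultiProcessor"],
        "max_threads_per_block": props["maxThreadsPerBlock"],
        "max_warps_per_sm": props["maxThreadsPerMultiProcessor"] // warp,
        "warp_size": warp,
        "max_grid_dimensions": list(props["maxGridSize"]),
        "max_block_dimensions": list(props["maxThreadsDim"]),
        # Registers are 32-bit, so 4 bytes each.
        "register_file_size_per_sm_kib": props["regsPerMultiprocessor"] * 4 / 1024,
        "max_shared_memory_per_block_kib": props["sharedMemPerBlock"] / 1024,
        "shared_memory_per_sm_kib": props["sharedMemPerMultiprocessor"] / 1024,
        "l2_cache_size_mib": props["l2CacheSize"] / 2**20,
        "clock_rate_mhz": props["clockRate"] / 1000,  # kHz -> MHz
    }


def main():
    """Query device 0 and print the partial profile (needs a working GPU)."""
    import cupy  # imported here so the mapping above can be tested without a GPU

    rt = cupy.cuda.runtime
    props = rt.getDeviceProperties(0)
    print(json.dumps(queried_fields(props, rt.getDeviceCount()), indent=2))
```

Call main() on the cloud instance. Keeping the mapping separate from the CuPy calls means it can be exercised with a hand-written dict before the GPU is available.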

TODOs

Block A — write device_query.py

  • Use cupy.cuda.runtime.getDeviceProperties(0) for the bulk; supplement with cupy.cuda.runtime.deviceGetAttribute for fields not in deviceProperties. (For Phase 23, do not use torch even though torch.cuda.get_device_properties is also an option — see Plan §7.g.)
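  • Compute derived fields:
      • theoretical_hbm_bandwidth_gbs = 2 × memory_clock_rate_mhz × memory_bus_width_bits / 8 / 1000. The factor of 2 is the double data rate; most GPU memory transfers on both clock edges, but check your GPU's memory type and document the multiplier you used. Dividing by 8 converts bits to bytes; dividing by 1000 converts MB/s to GB/s.
      • total_cuda_cores = multiprocessor_count × cuda_cores_per_sm. cuda_cores_per_sm depends on the architecture and, for Ampere, on the chip: Turing (7.5) = 64 fp32, A100 / GA100 (8.0) = 64 fp32, consumer Ampere GA10x (8.6) = 128 fp32, Hopper (9.0) = 128 fp32. Look up your compute capability.
      • peak_fp32_tflops_cuda_cores = clock_rate_GHz × total_cuda_cores × 2 / 1000 (the 2 is FMA = mul + add; the / 1000 converts GFLOPS to TFLOPS).
      • Tensor Core peak is trickier and varies by arch. First, total_tensor_cores = multiprocessor_count × tensor_cores_per_sm. Then look up the "operations per clock per Tensor Core" for your compute capability (A100 = 256 fp16-FMA/clock/tc; Hopper = 512 fp16-FMA/clock/tc; consumer parts differ). Finally, peak_fp16_tflops_tensor_cores = clock_GHz × total_tensor_cores × ops_per_clock × 2 / 1000 (the 2 is FMA; this gives the dense figure, not the sparse one).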
  • Save to device_query.json with every field populated.
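
The derived-field formulas above can be sketched as three small functions. The function names are hypothetical, and the A100 inputs in the worked example (108 SMs, 64 fp32 cores and 4 Tensor Cores per SM, 1410 MHz core clock, 1215 MHz memory clock, 5120-bit bus) are illustrative values taken from public A100 specifications, not output from your GPU. In your script these inputs must come from the queried fields.

```python
def hbm_bandwidth_gbs(memory_clock_mhz, bus_width_bits, data_rate=2):
    """Theoretical memory bandwidth in GB/s; data_rate=2 assumes DDR signalling."""
    return data_rate * memory_clock_mhz * bus_width_bits / 8 / 1000


def peak_fp32_tflops(clock_mhz, total_cuda_cores):
    """Peak fp32 throughput in TFLOPS; one FMA per core per clock = 2 FLOPs."""
    clock_ghz = clock_mhz / 1000
    return clock_ghz * total_cuda_cores * 2 / 1000  # GFLOPS -> TFLOPS


def peak_tensor_tflops(clock_mhz, total_tensor_cores, fma_per_clock_per_tc):
    """Dense Tensor Core peak in TFLOPS for a given FMA-per-clock rate."""
    clock_ghz = clock_mhz / 1000
    return clock_ghz * total_tensor_cores * fma_per_clock_per_tc * 2 / 1000


# Worked example with illustrative A100 inputs (assumed, see lead-in).
sm_count = 108
total_cuda_cores = sm_count * 64    # 64 fp32 cores per SM on GA100
total_tensor_cores = sm_count * 4   # 4 Tensor Cores per SM on GA100

print(round(hbm_bandwidth_gbs(1215, 5120), 1))                  # GB/s
print(round(peak_fp32_tflops(1410, total_cuda_cores), 2))       # TFLOPS
print(round(peak_tensor_tflops(1410, total_tensor_cores, 256), 1))  # TFLOPS
```

With these inputs the results land close to the A100 datasheet figures (about 1555 GB/s, 19.5 fp32 TFLOPS, and 312 dense fp16 Tensor Core TFLOPS), which is the kind of agreement Block B asks you to check.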

Block B — verify against datasheet

  • Open NVIDIA's whitepaper / datasheet for your GPU. Find the official "Peak FP32 TFLOPS", "Peak FP16 Tensor Core TFLOPS", "HBM Bandwidth", "L2 cache size".
  • Make a table in comparison.md with these columns:
      • Your-GPU measured/queried | Datasheet | Match? (Y/N) | Why-if-not
  • Any field where your computed peak differs from the datasheet by more than 10% needs an explanation. Common reasons: (a) the clock you queried is not the boost clock the datasheet assumes; (b) a different definition of "core"; (c) dense vs sparse TFLOPS reporting.
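
The 10% rule can be checked mechanically. The sketch below is one possible helper, not part of the lab's reference solution: the names relative_diff and comparison_row are hypothetical, the leading field-name column is an assumption added so each row is self-describing, and treating "within 10%" as a match is a choice you may want to tighten.

```python
def relative_diff(computed, datasheet):
    """Absolute difference as a fraction of the datasheet value."""
    return abs(computed - datasheet) / abs(datasheet)


def comparison_row(field, computed, datasheet, why="", threshold=0.10):
    """Render one markdown table row; mismatches must carry an explanation."""
    match = relative_diff(computed, datasheet) <= threshold
    if not match and not why:
        why = "TODO: explain"  # placeholder so unexplained mismatches stand out
    flag = "Y" if match else "N"
    return f"| {field} | {computed:.1f} | {datasheet:.1f} | {flag} | {why} |"


header = "| Field | Your-GPU measured/queried | Datasheet | Match? (Y/N) | Why-if-not |"
divider = "|---|---|---|---|---|"
rows = [
    # Illustrative numbers only; substitute your computed and datasheet values.
    comparison_row("peak_fp32_tflops_cuda_cores", 19.5, 19.5),
    comparison_row("peak_fp16_tflops_tensor_cores", 156.0, 312.0,
                   why="used half the documented FMA rate"),
]
print("\n".join([header, divider, *rows]))
```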

Block C — explain the architecture in comparison.md

Two paragraphs, in your own words:

  1. Which architecture is your GPU? (Ampere, Ada Lovelace, Hopper, etc.) What was new in that arch vs the prior gen? (E.g., Ampere introduced TF32 Tensor Cores; Hopper introduced fp8.)
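  2. What makes your specific SKU different from the data-center flagship? E.g., the RTX 3090 is consumer Ampere (GA102, compute capability 8.6): it has fewer SMs than the A100 (82 vs 108) and a different SM design (128 fp32 cores per SM vs 64), a higher boost clock, and 24 GB of GDDR6X instead of HBM2. Note the practical implication: bandwidth-bound kernels on a 3090 run at roughly 45–60% of A100 speed, because its peak memory bandwidth is about 936 GB/s versus 1,555 GB/s (40 GB A100) or 2,039 GB/s (80 GB A100).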

Block D — manifest

{
  "experiment": "23-device-profile",
  "date": "YYYY-MM-DD",
  "gpu_name": null,
  "compute_capability": null,
  "versions": {"python": "3.11.x", "cupy": "X.Y.Z"},
  "queried_fields_total": null,
  "non_null_fields": null,
  "datasheet_mismatch_count": null
}
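
The two count fields in the manifest can be derived from device_query.json rather than tallied by hand. The sketch below assumes one particular counting convention, which the lab does not specify: every leaf value counts as one field, so compute_capability contributes two and each dimension list contributes three. The function names are hypothetical.

```python
import json


def leaf_values(obj):
    """Yield every scalar leaf in a nested structure of dicts and lists."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from leaf_values(value)
    elif isinstance(obj, list):
        for value in obj:
            yield from leaf_values(value)
    else:
        yield obj


def field_counts(profile):
    """Return the manifest's count fields for a loaded device profile."""
    leaves = list(leaf_values(profile))
    return {
        "queried_fields_total": len(leaves),
        "non_null_fields": sum(value is not None for value in leaves),
    }


# Small inline example; in the lab, load device_query.json with json.load.
example = json.loads(
    '{"name": "gpu", "compute_capability": {"major": 8, "minor": null},'
    ' "max_grid_dimensions": [1, 2, null]}'
)
print(field_counts(example))
```

When the two counts are equal, the first stop condition (every field non-null) holds.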

Constraints

  • No PyTorch yet: cupy and cuda-python only. See Plan §7.g.
  • Don't hardcode "32 warp size" — query it. It's been 32 forever, but if someone runs this on AMD with 64-wide wavefronts, the hardcoded value is wrong.
  • Computed fields must be derivable from queried fields. E.g., don't put peak_fp16_tflops in if you got it from the datasheet — compute from clock × tensor cores. The datasheet is the verification, not the source.

Stop conditions

Done when:

  1. device_query.json has every field non-null.
  2. comparison.md table has all "Match?" cells filled, with explanations for mismatches.
  3. comparison.md architecture paragraphs written.
  4. manifest.json complete.

Pitfalls

  • getDeviceProperties returns kHz, not MHz, for clock fields. Always check the unit by comparing to the datasheet.
  • "Cuda cores per SM" varies by arch. Don't assume 128. Look up your compute capability.
  • "Tensor Core operations per clock" is the trickiest number. It's poorly documented; the cleanest source is NVIDIA's per-arch whitepaper. The Triton docs also list these.
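  • Boost vs base clock. Do not assume which one clock_rate_mhz is. The API field is often the maximum (boost) clock rather than the base clock, and sustained clocks under load can be lower than either. The datasheet's "peak TFLOPS" assumes boost, so compare your queried value against both datasheet clocks and record which one it matches.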

When to consult solutions/

Only after all stop conditions are met. The reference at solutions/01-device-query-ref.md (written at phase open) shows the script and a worked comparison table for the reference GPU (TBD by §7.a).


Next lab: lab/02-bandwidth-test.md.