Lab 02 - Bandwidth Test: H2D, D2H, D2D
Goal: measure the real bandwidth of host-to-device, device-to-host, and device-to-device memory transfers; compare against PCIe and HBM theoretical peaks.
Estimated time: 60–90 minutes.
Prereq: lab/01-device-query.md complete. The device_query.json numbers are the comparison baseline.
What you produce
experiments/23-device-profile/ (extends what lab 01 made):
- bandwidth_test.py
- bandwidth_test.json: measured throughputs at multiple sizes.
- bandwidth_test.png: plot, size on x (log), throughput on y, three curves (H2D, D2H, D2D).
- interpretation.md: what the plot tells you.
The kernels
Three transfers:
- H2D (host-to-device): cudaMemcpyAsync from a pinned (cudaMallocHost) host buffer to a device buffer.
- D2H (device-to-host): reverse direction, same buffers.
- D2D (device-to-device): cudaMemcpyAsync from one device buffer to another, same device.
Sizes: at least 12 points, log-spaced from 1 KiB to 1 GiB. Same scale as Phase-1 lab 01 (memcpy-bandwidth).
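One way to get the size sweep is to step through powers of two, which yields 21 points between 1 KiB and 1 GiB and comfortably clears the 12-point minimum. This is a minimal sketch; the helper name `log_spaced_sizes` is made up for illustration and is not part of the lab's reference code.

```python
def log_spaced_sizes(lo=1 << 10, hi=1 << 30):
    """Return power-of-two transfer sizes in bytes, from lo to hi inclusive."""
    sizes = []
    size = lo
    while size <= hi:
        sizes.append(size)
        size *= 2  # one point per doubling gives an even spacing on a log axis
    return sizes


sizes = log_spaced_sizes()
print(len(sizes), sizes[0], sizes[-1])  # 21 1024 1073741824
```

If 21 points make the run too long, taking every other entry (`sizes[::2]`) still leaves 11 doublings apart plus the endpoints, so check the count before trimming.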
TODOs
Block A - write bandwidth_test.py
- Allocate pinned host buffers (cupy.cuda.alloc_pinned_memory or via cudaHostAlloc). Pinned memory is required for max H2D / D2H bandwidth; pageable host memory transfers roughly half as fast.
- Allocate two device buffers (1 GiB each).
- For each transfer direction and each size:
    - Pick iters such that total transfer time ≥ 100 ms.
    - Warm up: 3 iterations before timing.
    - Synchronize with cupy.cuda.runtime.deviceSynchronize() before and after the timing block.
    - Use CUDA events (cupy.cuda.Event, recorded on a cupy.cuda.Stream) for finer-grained timing if you can; time.perf_counter_ns after sync also works.
    - Compute: throughput = bytes_per_iter × iters / elapsed_s / 1e9 [GB/s].
- Record in bandwidth_test.json:
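The per-size timing steps above (warm up, synchronize, time, compute GB/s) can be sketched as a small harness. This is an illustration under stated assumptions, not the lab's reference implementation: `measure_throughput`, `copy_fn`, and `sync_fn` are hypothetical names, and the demo uses a host-side `bytearray` copy as a stand-in so the sketch runs without a GPU.

```python
import time


def measure_throughput(copy_fn, sync_fn, nbytes, min_seconds=0.1, warmup=3):
    """Time repeated calls of copy_fn (each moving nbytes) and return (GB/s, iters).

    copy_fn and sync_fn are hypothetical hooks: on a GPU, copy_fn would issue
    one async memcpy and sync_fn would block until the device is idle.
    """
    for _ in range(warmup):          # warm-up iterations are not timed
        copy_fn()
    sync_fn()

    iters = 1
    while True:
        sync_fn()                    # nothing in flight when the clock starts
        start = time.perf_counter_ns()
        for _ in range(iters):
            copy_fn()
        sync_fn()                    # all copies finished when the clock stops
        elapsed_s = (time.perf_counter_ns() - start) / 1e9
        if elapsed_s >= min_seconds:
            break
        iters *= 2                   # too short to trust: double and retry
    return nbytes * iters / elapsed_s / 1e9, iters


# Host-only stand-in for a transfer, used here just to exercise the harness.
src = bytearray(1 << 20)
dst = bytearray(1 << 20)


def host_copy():
    dst[:] = src


gbs, iters = measure_throughput(host_copy, lambda: None, len(src))
print(f"{gbs:.2f} GB/s over {iters} iterations")
```

In the real script, `sync_fn` would be `cupy.cuda.runtime.deviceSynchronize` and `copy_fn` would wrap the memcpy for the direction under test. Doubling `iters` until the run reaches 100 ms keeps each measurement short, which also helps with the thermal-throttling concern.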
Block B - plot
- matplotlib. x = size (log scale, KiB), y = GB/s (linear). Three curves with distinct colors and a clear legend.
- Annotate the theoretical ceilings (horizontal dashed lines):
    - H2D / D2H ceiling: PCIe peak per direction. From device_query.json: per-lane rate in GT/s × pcie_lane_width / 8 GB/s, where the per-lane rate follows from pcie_generation (8, 16, and 32 GT/s for PCIe 3.0, 4.0, and 5.0). Very roughly: PCIe 4.0 x16 = 32 GB/s, PCIe 5.0 x16 = 64 GB/s.
    - D2D ceiling: HBM bandwidth from device_query.json (theoretical_hbm_bandwidth_gbs).
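The quoted PCIe figures (32 GB/s for 4.0 x16, 64 GB/s for 5.0 x16) come from the per-lane signaling rate, not from the generation number itself. A minimal sketch of that arithmetic follows; the function name `pcie_peak_gbs` and the lookup table are illustrative assumptions, and the result ignores encoding and protocol overhead, so real links land somewhat lower.

```python
# Per-lane signaling rate in GT/s for each PCIe generation.
PCIE_GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_peak_gbs(pcie_generation, pcie_lane_width):
    """Rough one-direction PCIe peak in GB/s (1 transfer = 1 bit per lane)."""
    return PCIE_GT_PER_LANE[pcie_generation] * pcie_lane_width / 8


for gen in (4, 5):
    print(f"PCIe {gen}.0 x16: {pcie_peak_gbs(gen, 16):.0f} GB/s")
# PCIe 4.0 x16: 32 GB/s
# PCIe 5.0 x16: 64 GB/s
```

The returned value is what the dashed H2D / D2H ceiling line would sit at, for example via matplotlib's `axhline`.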
Block C - interpret in interpretation.md
Three paragraphs:
- The H2D/D2H asymmetry. The two are often within 10% of each other, since both directions use the same PCIe link. If they differ widely, something is asymmetric (e.g., a degraded PCIe link, or a unidirectional optimization in the driver).
- The D2D plateau. At what size does D2D saturate? It should plateau at 50–80% of theoretical_hbm_bandwidth_gbs for sizes > 1 MiB. (Small sizes can't saturate because they don't generate enough memory transactions.) Keep in mind that a D2D copy reads and writes every byte, so its memory traffic is twice the copied bytes; state which of the two you compare against the HBM peak. If the measured plateau is <30% of theoretical, something is wrong: check that you're synchronizing correctly (and, for the PCIe curves, that the host buffers are actually pinned).
- D2D vs H2D ratio. D2D should be 30–60× faster than H2D at large sizes (HBM bandwidth vs PCIe). State the ratio you measured; compare to the datasheet ratio.
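The three quantities that interpretation.md asks for are simple ratios of the peak throughputs. The sketch below shows one way to compute them; `interpret` and its dictionary keys are hypothetical names, and the input numbers are placeholders chosen only to exercise the arithmetic, not measurements from any GPU.

```python
def interpret(h2d, d2h, d2d, pcie_peak, hbm_peak):
    """Derive the Block C quantities from peak throughputs, all in GB/s."""
    return {
        # How far apart the two PCIe directions are, relative to the faster one.
        "h2d_d2h_asymmetry_pct": abs(h2d - d2h) / max(h2d, d2h) * 100,
        # Where the D2D plateau sits relative to the HBM datasheet figure.
        "d2d_fraction_of_hbm_peak": d2d / hbm_peak,
        # Measured ratio versus the ratio implied by the theoretical peaks.
        "d2d_over_h2d_measured": d2d / h2d,
        "d2d_over_h2d_datasheet": hbm_peak / pcie_peak,
    }


# Placeholder inputs for illustration only; substitute your own peaks.
example = interpret(h2d=25.0, d2h=24.0, d2d=1200.0, pcie_peak=32.0, hbm_peak=2000.0)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Reporting the measured and datasheet ratios side by side makes it easy to see whether a gap comes from the PCIe side or the HBM side.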
Block D - manifest update
Add to the existing manifest.json from lab 01 (or create a new section):
{
  ...,
  "bandwidth_measured": {
    "h2d_peak_GBs": null,
    "d2h_peak_GBs": null,
    "d2d_peak_GBs": null,
    "h2d_pinned_ratio_vs_pageable": null,
    "d2d_fraction_of_hbm_peak": null
  }
}
(Optional: measure pageable-vs-pinned H2D ratio as a side experiment. Pageable should be ~½ of pinned.)
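Filling in the `bandwidth_measured` section amounts to taking the maximum of each curve and two ratios. The sketch below assumes each curve is a list of `(size_bytes, gbs)` pairs and reads `h2d_pinned_ratio_vs_pageable` as pinned peak divided by pageable peak; both are assumptions, as are the helper names and the placeholder numbers.

```python
import json


def peak(curve):
    """Highest throughput in GB/s across (size_bytes, gbs) points."""
    return max(gbs for _, gbs in curve)


def bandwidth_section(h2d, d2h, d2d, hbm_peak_gbs, h2d_pageable=None):
    """Build the bandwidth_measured entry; the pageable curve is optional."""
    h2d_peak = peak(h2d)
    d2d_peak = peak(d2d)
    ratio = h2d_peak / peak(h2d_pageable) if h2d_pageable else None
    return {
        "h2d_peak_GBs": h2d_peak,
        "d2h_peak_GBs": peak(d2h),
        "d2d_peak_GBs": d2d_peak,
        "h2d_pinned_ratio_vs_pageable": ratio,  # stays null if not measured
        "d2d_fraction_of_hbm_peak": d2d_peak / hbm_peak_gbs,
    }


# Placeholder curves for illustration only, not real measurements.
h2d = [(1 << 10, 0.1), (1 << 30, 25.0)]
d2h = [(1 << 10, 0.1), (1 << 30, 24.0)]
d2d = [(1 << 10, 0.2), (1 << 30, 1200.0)]

manifest = {"device": "placeholder"}  # stands in for the lab 01 manifest content
manifest["bandwidth_measured"] = bandwidth_section(h2d, d2h, d2d, hbm_peak_gbs=2000.0)
print(json.dumps(manifest, indent=2))
```

In the real script the dictionary would be loaded from manifest.json with `json.load` and written back with `json.dump`, leaving the lab 01 keys untouched.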
Constraints
- Pinned memory only for H2D / D2H peak measurements. Pageable transfers hit ~half the bandwidth; that's a separate experiment.
- Synchronize correctly. Without sync, you're timing how long it takes to enqueue the asynchronous copies, not the transfers themselves.
- One direction at a time. Running H2D and D2H concurrently (in different streams) can roughly double aggregate throughput because PCIe is full duplex; that's a separate experiment. Phase 23 measures one at a time.
- Single GPU. Multi-GPU peer-to-peer transfers (NVLink) are Phase 35.
Stop conditions
Done when:
- bandwidth_test.json has all three curves at 12+ points each.
- bandwidth_test.png clearly shows the three curves and the ceiling lines.
- D2D peak is within 30% of theoretical HBM bandwidth.
- H2D / D2H peaks are within 30% of theoretical PCIe bandwidth.
- interpretation.md answers all three Block C questions with measurements.
- Manifest updated.
Pitfalls
- Small-size noise. At 1 KiB, the fixed per-call overhead (driver and launch latency) is >50% of the timing. Don't expect the H2D curve to rise smoothly from 1 KiB; the curve will be flat-ish until ~64 KiB and then rise.
- D2H slower than H2D. Sometimes happens due to PCIe upstream/downstream asymmetry on certain platforms. Note it; don't try to "fix" it.
- GPU thermal throttling. Long benchmark runs can throttle. Keep individual measurements short (under 5 seconds each).
- cudaMemcpy not actually async. The copy can behave synchronously if the host buffer is pageable (non-pinned), even when issued through cudaMemcpyAsync; a truly asynchronous transfer requires pinned host memory.
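The small-size pitfall can be made concrete with a simple model in which every transfer pays a fixed overhead plus the time to move its bytes at the link rate. The sketch below is only an illustration: the 10 µs overhead and 25 GB/s plateau are assumed placeholder values, not properties of any particular GPU or driver, and `effective_gbs` is a hypothetical helper.

```python
def effective_gbs(nbytes, overhead_s, link_gbs):
    """Throughput in GB/s when each transfer costs overhead_s + nbytes / link rate."""
    transfer_s = nbytes / (link_gbs * 1e9)
    return nbytes / (overhead_s + transfer_s) / 1e9


OVERHEAD_S = 10e-6  # assumed fixed per-call cost (placeholder)
LINK_GBS = 25.0     # assumed large-transfer plateau (placeholder)

for kib in (1, 64, 1024, 65536):
    nbytes = kib * 1024
    print(f"{kib:>6} KiB: {effective_gbs(nbytes, OVERHEAD_S, LINK_GBS):6.2f} GB/s")
```

Under this model the throughput stays near zero on a linear y-axis while the overhead dominates and only approaches the plateau once the byte-moving time dwarfs it, which matches the shape the pitfall describes. Fitting the two parameters to your own curve is a quick sanity check on the measurements.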
When to consult solutions/
After the stop conditions are met. The reference at solutions/02-bandwidth-test-ref.md (written at phase open) compares your results against the chosen GPU's expected numbers.
Next lab: lab/03-gpu-roofline.md.