Lab 02 — Bandwidth Test: H2D, D2H, D2D

Goal: measure the real bandwidth of host-to-device, device-to-host, and device-to-device memory transfers; compare against PCIe and HBM theoretical peaks.

Estimated time: 60–90 minutes.

Prereq: lab/01-device-query.md complete. The device_query.json numbers are the comparison baseline.

What you produce

experiments/23-device-profile/ (extends what lab 01 made):

bandwidth_test.py.
bandwidth_test.json — measured throughputs at multiple sizes.
bandwidth_test.png — plot, size on x (log), throughput on y, three curves (H2D, D2H, D2D).
interpretation.md — what the plot tells you.

The kernels

Three transfers:

H2D (host-to-device) — cudaMemcpyAsync from a pinned (cudaMallocHost) host buffer to a device buffer.
D2H (device-to-host) — reverse direction, same buffers.
D2D (device-to-device) — cudaMemcpyAsync from one device buffer to another, same device.

Sizes: at least 12 points, log-spaced from 1 KiB to 1 GiB. Same scale as Phase-1 lab 01 (memcpy-bandwidth).
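
One way to build the size grid, as a minimal sketch: doubling from 1 KiB to 1 GiB yields 21 log-spaced points, comfortably above the 12-point minimum. The helper name `log_spaced_sizes` is illustrative and not part of the lab skeleton.

```python
def log_spaced_sizes(min_bytes=1024, max_bytes=1 << 30, factor=2):
    """Return transfer sizes in bytes, growing by `factor` from min_bytes to max_bytes."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


sizes = log_spaced_sizes()
print(len(sizes), sizes[0], sizes[-1])  # 21 1024 1073741824
```

A factor of 4 would give only 11 points over the same range, one short of the requirement, so doubling is the simplest choice.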

TODOs

Block A — write `bandwidth_test.py`

Allocate pinned host buffers (cupy.cuda.alloc_pinned_memory or via cudaHostAlloc). Pinned memory is required for max H2D / D2H bandwidth; pageable host memory transfers ~half as fast.
Allocate two device buffers (1 GiB each).
For each transfer direction and each size:
Pick iters such that total transfer time ≥ 100 ms.
Warm up: 3 iterations before timing.
Synchronize with cupy.cuda.runtime.deviceSynchronize() before and after the timing block.
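Use cupy.cuda.Event pairs recorded on the stream (read the interval with cupy.cuda.get_elapsed_time, which returns milliseconds) for finer-grained timing if you can; time.perf_counter_ns after sync also works.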
Compute: throughput = bytes_per_iter × iters / elapsed_s / 1e9 [GB/s].
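
The steps above could be sketched as follows. `measure` is a plain-Python timing loop that takes the copy and the sync as callables, so it can be exercised without a GPU. `make_cupy_copier` is a hypothetical helper showing one way to wire in CuPy. It assumes CuPy and a CUDA device are available and has not been run as part of this handout, so check each call against your installed CuPy version. Both function names are illustrative.

```python
import time


def measure(copy_once, sync, nbytes, min_time_s=0.1, warmup=3):
    """Time repeated transfers and return one record for bandwidth_test.json.

    copy_once enqueues a single transfer of nbytes; sync blocks until all
    enqueued work has finished. Both are supplied by the caller.
    """
    for _ in range(warmup):
        copy_once()
    iters = 1
    while True:
        sync()  # nothing may be in flight when the clock starts
        start_ns = time.perf_counter_ns()
        for _ in range(iters):
            copy_once()
        sync()  # every copy must be complete before the clock stops
        elapsed_s = (time.perf_counter_ns() - start_ns) / 1e9
        if elapsed_s >= min_time_s:
            break
        iters *= 2  # run too short to trust: double the batch and retry
    return {
        "size_KiB": nbytes // 1024,
        "throughput_GBs": nbytes * iters / elapsed_s / 1e9,
        "iters": iters,
        "elapsed_s": elapsed_s,
    }


def make_cupy_copier(max_bytes=1 << 30):
    """Return (bind, sync). Assumes CuPy and a CUDA device are present."""
    import cupy
    import numpy

    rt = cupy.cuda.runtime
    pinned = cupy.cuda.alloc_pinned_memory(max_bytes)
    host = numpy.frombuffer(pinned, dtype=numpy.uint8, count=max_bytes)
    dev_a = cupy.empty(max_bytes, dtype=cupy.uint8)
    dev_b = cupy.empty(max_bytes, dtype=cupy.uint8)
    stream = cupy.cuda.Stream(non_blocking=True)
    endpoints = {  # direction -> (destination pointer, source pointer, copy kind)
        "h2d": (dev_a.data.ptr, host.ctypes.data, rt.memcpyHostToDevice),
        "d2h": (host.ctypes.data, dev_a.data.ptr, rt.memcpyDeviceToHost),
        "d2d": (dev_b.data.ptr, dev_a.data.ptr, rt.memcpyDeviceToDevice),
    }
    keepalive = (pinned, host, dev_a, dev_b)

    def bind(direction, nbytes):
        dst, src, kind = endpoints[direction]
        # _keep ties the buffers' lifetime to the returned callable
        return lambda _keep=keepalive: rt.memcpyAsync(dst, src, nbytes, kind, stream.ptr)

    return bind, rt.deviceSynchronize
```

On a GPU machine, `bind, sync = make_cupy_copier()` followed by `measure(bind("h2d", size), sync, size)` for each direction and size would fill the three result lists. The doubling loop keeps the longest batch near 2× the 100 ms target, which also stays well under the 5-second limit mentioned under Pitfalls.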

Record in bandwidth_test.json:

{
  "h2d": [{"size_KiB": 1, "throughput_GBs": ..., "iters": ..., "elapsed_s": ...}, ...],
  "d2h": [...],
  "d2d": [...]
}
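
A small sketch of writing that file with a basic shape check, so a missing field is caught before plotting. The helper name `write_results` is illustrative, and the demo record holds placeholder values rather than measurements.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = {"size_KiB", "throughput_GBs", "iters", "elapsed_s"}


def write_results(path, results):
    """Validate the three record lists, then write them as JSON."""
    for direction in ("h2d", "d2h", "d2d"):
        for record in results[direction]:
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"{direction} record is missing {sorted(missing)}")
    with open(path, "w") as f:
        json.dump(results, f, indent=2)


# Placeholder record, only to show the expected shape.
placeholder = {"size_KiB": 1, "throughput_GBs": 0.0, "iters": 1, "elapsed_s": 0.1}
demo = {"h2d": [placeholder], "d2h": [placeholder], "d2d": [placeholder]}
demo_path = os.path.join(tempfile.mkdtemp(), "bandwidth_test.json")
write_results(demo_path, demo)
```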

Block B — plot

matplotlib. x = size (log scale, KiB), y = GB/s (linear). Three curves with distinct colors and clear legend.
Annotate the theoretical ceilings (horizontal dashed lines):
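H2D / D2H ceiling: PCIe peak, per direction. From device_query.json: pcie_lane_width × 2^(pcie_generation − 3) GB/s, since a lane carries about 1 GB/s at Gen 3 and the rate doubles each generation (very roughly: PCIe 3.0 x16 = 16 GB/s, PCIe 4.0 x16 = 32 GB/s, PCIe 5.0 x16 = 64 GB/s).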
D2D ceiling: HBM bandwidth from device_query.json (theoretical_hbm_bandwidth_gbs).
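
A sketch of how the two ceilings could be computed and drawn. The per-lane figure used here (about 0.25 GB/s for Gen 1, doubling each generation) reproduces the 32 GB/s and 64 GB/s reference points quoted above and ignores encoding and protocol overhead. The function names are illustrative, and `plot_bandwidth` assumes matplotlib is installed.

```python
def pcie_peak_gbs(generation, lanes):
    """Approximate one-direction PCIe peak in GB/s (overheads ignored)."""
    per_lane_gbs = 0.25 * 2 ** (generation - 1)  # Gen 1 ~0.25, Gen 3 ~1, Gen 4 ~2
    return per_lane_gbs * lanes


def plot_bandwidth(results, pcie_gbs, hbm_gbs, out_path="bandwidth_test.png"):
    """Plot H2D, D2H and D2D throughput against size with dashed ceiling lines."""
    import matplotlib
    matplotlib.use("Agg")  # render straight to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 5))
    colors = {"h2d": "tab:blue", "d2h": "tab:orange", "d2d": "tab:green"}
    for name, color in colors.items():
        xs = [r["size_KiB"] for r in results[name]]
        ys = [r["throughput_GBs"] for r in results[name]]
        ax.plot(xs, ys, marker="o", color=color, label=name.upper())
    ax.axhline(pcie_gbs, linestyle="--", color="gray",
               label=f"PCIe peak ({pcie_gbs:g} GB/s)")
    ax.axhline(hbm_gbs, linestyle="--", color="black",
               label=f"HBM peak ({hbm_gbs:g} GB/s)")
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Transfer size (KiB)")
    ax.set_ylabel("Throughput (GB/s)")
    ax.legend()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


print(pcie_peak_gbs(4, 16), pcie_peak_gbs(5, 16))  # 32.0 64.0
```

In the lab, `generation` and `lanes` would come from the pcie_generation and pcie_lane_width fields of device_query.json, and `hbm_gbs` from theoretical_hbm_bandwidth_gbs.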

Block C — interpret in `interpretation.md`

Three paragraphs:

The H2D/D2H asymmetry. The two are usually within 10% of each other, because both directions use the same PCIe link. If they differ by much more, something is asymmetric (e.g., a link that trained at reduced width or speed, or a driver optimization that favors one direction).
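The D2D plateau. At what size does D2D saturate? It should plateau at 50–80% of theoretical_hbm_bandwidth_gbs for sizes > 1 MiB. (Small sizes can't saturate because they don't generate enough memory transactions to hide the fixed per-copy latency.) If the measured plateau is <30% of theoretical, something is wrong: check that you're synchronizing correctly, that both buffers live on the same device, and that the byte count in the throughput formula matches the size actually copied. Keep in mind that a D2D copy reads and writes every byte, so HBM traffic is twice the bytes copied.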
D2D vs H2D ratio. D2D should be 30–60× faster than H2D at large sizes (HBM bandwidth vs PCIe). State the ratio you measured; compare to the datasheet ratio.
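
The three numbers these paragraphs ask for can be pulled out of the results with a few lines. This is a sketch under two assumptions: records are sorted by increasing size, and "saturated" means reaching 90% of the D2D peak, a threshold chosen here for illustration. The sample data is made up purely to exercise the function.

```python
def summarize(results, hbm_peak_gbs, saturation_fraction=0.9):
    """Return the asymmetry, D2D plateau onset, HBM fraction and D2D/H2D ratio."""
    peaks = {name: max(r["throughput_GBs"] for r in records)
             for name, records in results.items()}
    threshold = saturation_fraction * peaks["d2d"]
    # First (smallest) size whose D2D throughput reaches the threshold.
    saturation_kib = next(r["size_KiB"] for r in results["d2d"]
                          if r["throughput_GBs"] >= threshold)
    return {
        "h2d_vs_d2h_diff_pct": 100 * abs(peaks["h2d"] - peaks["d2h"])
                               / max(peaks["h2d"], peaks["d2h"]),
        "d2d_saturation_KiB": saturation_kib,
        "d2d_fraction_of_hbm_peak": peaks["d2d"] / hbm_peak_gbs,
        "d2d_over_h2d": peaks["d2d"] / peaks["h2d"],
    }


# Made-up numbers, not measurements.
sample = {
    "h2d": [{"size_KiB": 1, "throughput_GBs": 0.5},
            {"size_KiB": 1048576, "throughput_GBs": 24.0}],
    "d2h": [{"size_KiB": 1, "throughput_GBs": 0.5},
            {"size_KiB": 1048576, "throughput_GBs": 25.0}],
    "d2d": [{"size_KiB": 1, "throughput_GBs": 2.0},
            {"size_KiB": 1024, "throughput_GBs": 600.0},
            {"size_KiB": 16384, "throughput_GBs": 950.0},
            {"size_KiB": 1048576, "throughput_GBs": 1000.0}],
}
summary = summarize(sample, hbm_peak_gbs=1600.0)
```

The datasheet ratio to compare against is simply theoretical_hbm_bandwidth_gbs divided by the PCIe peak.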

Block D — manifest update

Add to the existing manifest.json from lab 01 (or create a new section):

{
  ...,
  "bandwidth_measured": {
    "h2d_peak_GBs": null,
    "d2h_peak_GBs": null,
    "d2d_peak_GBs": null,
    "h2d_pinned_ratio_vs_pageable": null,
    "d2d_fraction_of_hbm_peak": null
  }
}

(Optional: measure pageable-vs-pinned H2D ratio as a side experiment. Pageable should be ~½ of pinned.)
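
A minimal sketch of the manifest update, assuming manifest.json is a single JSON object. It merges the new section and leaves every other key untouched. The helper name `update_manifest` and the demo contents are hypothetical.

```python
import json
import os
import tempfile


def update_manifest(path, measured):
    """Merge `measured` into manifest["bandwidth_measured"], keeping other keys."""
    manifest = {}
    if os.path.exists(path):
        with open(path) as f:
            manifest = json.load(f)
    manifest.setdefault("bandwidth_measured", {}).update(measured)
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


# Demo in a temporary directory with a stand-in for the lab 01 manifest.
demo_path = os.path.join(tempfile.mkdtemp(), "manifest.json")
with open(demo_path, "w") as f:
    json.dump({"lab_01": "existing content stays as it is"}, f)

updated = update_manifest(demo_path, {
    "h2d_peak_GBs": None,  # fill in from bandwidth_test.json
    "d2h_peak_GBs": None,
    "d2d_peak_GBs": None,
    "h2d_pinned_ratio_vs_pageable": None,  # stays null unless the optional run is done
    "d2d_fraction_of_hbm_peak": None,
})
```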

Constraints

Pinned memory only for H2D / D2H peak measurements. Pageable transfers hit ~half the bandwidth; that's a separate experiment.
Synchronize correctly. Without sync, you're timing the host-side cost of enqueueing the copy, not the transfer itself.
Single-direction at a time. Running H2D and D2H concurrently (in different streams) can nearly double aggregate throughput, because PCIe is full duplex; that's a separate experiment. Phase 23 measures one at a time.
Single GPU. Multi-GPU peer-to-peer transfers (NVLink) are Phase 35.

Stop conditions

Done when:

bandwidth_test.json has all three curves at 12+ points each.
bandwidth_test.png clearly shows the three curves and the ceiling lines.
D2D peak is within 30% of theoretical HBM bandwidth.
H2D / D2H peaks are within 30% of theoretical PCIe bandwidth.
interpretation.md answers all three Block C questions with measurements.
Manifest updated.
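
The numeric stop conditions can be checked mechanically. The sketch below reads "within 30%" as "measured peak is at least 70% of the ceiling", which is an interpretation rather than something the lab spells out. The function name and the synthetic data are illustrative.

```python
def check_stop_conditions(results, pcie_peak_gbs, hbm_peak_gbs,
                          min_points=12, tolerance=0.30):
    """Return a dict of named pass/fail checks for the numeric stop conditions."""
    checks = {}
    for name in ("h2d", "d2h", "d2d"):
        records = results.get(name, [])
        peak = max((r["throughput_GBs"] for r in records), default=0.0)
        ceiling = hbm_peak_gbs if name == "d2d" else pcie_peak_gbs
        checks[f"{name}_has_{min_points}_points"] = len(records) >= min_points
        checks[f"{name}_peak_within_tolerance"] = peak >= (1 - tolerance) * ceiling
    return checks


def _curve(peak, points=12):
    """Synthetic rising curve that tops out at `peak` (not a measurement)."""
    return [{"size_KiB": 2 ** i, "throughput_GBs": peak * (i + 1) / points}
            for i in range(points)]


synthetic = {"h2d": _curve(25.0), "d2h": _curve(24.0), "d2d": _curve(1000.0)}
checks = check_stop_conditions(synthetic, pcie_peak_gbs=32.0, hbm_peak_gbs=1600.0)
for label, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}  {label}")
```

With these synthetic inputs the D2D peak check fails (1000 is below 70% of 1600), which shows how a shortfall would surface.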

Pitfalls

Small-size noise. At 1 KiB, the fixed per-copy overhead (API call plus DMA setup latency) accounts for most of the timing. Don't expect the H2D curve to rise smoothly from 1 KiB; the curve will be flat-ish until ~64 KiB and then rise.
D2H slower than H2D. Sometimes happens due to PCIe upstream/downstream asymmetry on certain platforms. Note it; don't try to "fix" it.
GPU thermal throttling. Long benchmark runs can throttle. Keep individual measurements short (under 5 seconds each).
cudaMemcpyAsync not actually async. With a pageable (non-pinned) host buffer, the call can block the host while the driver stages the data. Truly asynchronous H2D / D2H copies need a pinned host buffer.

When to consult `solutions/`

After stop conditions met. Reference at solutions/02-bandwidth-test-ref.md (written at phase open) compares against the chosen-GPU's expected numbers.

Next lab: lab/03-gpu-roofline.md.