Lab 02 — Bandwidth Test: H2D, D2H, D2D

Goal: measure the real bandwidth of host-to-device, device-to-host, and device-to-device memory transfers; compare against PCIe and HBM theoretical peaks.

Estimated time: 60–90 minutes.

Prereq: lab/01-device-query.md complete. The device_query.json numbers are the comparison baseline.


What you produce

experiments/23-device-profile/ (extends what lab 01 made):

  • bandwidth_test.py.
  • bandwidth_test.json — measured throughputs at multiple sizes.
  • bandwidth_test.png — plot, size on x (log), throughput on y, three curves (H2D, D2H, D2D).
  • interpretation.md — what the plot tells you.

The kernels

Three transfers:

  1. H2D (host-to-device): cudaMemcpyAsync from a pinned (cudaMallocHost) host buffer to a device buffer.
  2. D2H (device-to-host): the reverse direction, same buffers.
  3. D2D (device-to-device): cudaMemcpyAsync from one device buffer to another on the same device.

Sizes: at least 12 points, log-spaced from 1 KiB to 1 GiB. Same scale as Phase-1 lab 01 (memcpy-bandwidth).
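
One way to build the size grid, as a minimal sketch: powers of two from 1 KiB to 1 GiB give 21 log-spaced points, which satisfies the 12-point minimum. The helper name transfer_sizes is hypothetical, not something the lab requires.

```python
def transfer_sizes(min_exp=10, max_exp=30):
    """Return power-of-two transfer sizes in bytes, from 2**min_exp to 2**max_exp inclusive."""
    return [1 << e for e in range(min_exp, max_exp + 1)]


sizes = transfer_sizes()          # 1 KiB (2**10) through 1 GiB (2**30)
sizes_kib = [s // 1024 for s in sizes]
```

Powers of two keep every size a whole number of KiB, which matches the size_KiB field used in bandwidth_test.json.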

TODOs

Block A — write bandwidth_test.py

  • Allocate pinned host buffers (cupy.cuda.alloc_pinned_memory or via cudaHostAlloc). Pinned memory is required for max H2D / D2H bandwidth; pageable host memory transfers ~half as fast.
  • Allocate two device buffers (1 GiB each).
  • For each transfer direction and each size:
    • Pick iters such that total transfer time ≥ 100 ms.
    • Warm up: 3 iterations before timing.
    • Synchronize with cupy.cuda.runtime.deviceSynchronize() before and after the timing block.
    • Use CUDA events (cupy.cuda.Event recorded on the stream, read back with cupy.cuda.get_elapsed_time) for finer-grained timing if you can; time.perf_counter_ns after sync also works.
    • Compute: throughput = bytes_per_iter × iters / elapsed_s / 1e9 [GB/s].
  • Record in bandwidth_test.json:
    {
      "h2d": [{"size_KiB": 1, "throughput_GBs": ..., "iters": ..., "elapsed_s": ...}, ...],
      "d2h": [...],
      "d2d": [...]
    }
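
The timing steps above can be sketched as a small backend-agnostic harness. This is an illustration under stated assumptions, not the reference solution: measure, copy_fn, and sync_fn are hypothetical names, and the example drives the harness with a CPU copy so it runs without a GPU. In the lab, copy_fn would wrap the actual transfer and sync_fn would be cupy.cuda.runtime.deviceSynchronize.

```python
import time


def measure(copy_fn, sync_fn, nbytes, min_time_s=0.1, warmup=3):
    """Time repeated calls to copy_fn and return one bandwidth_test.json record.

    copy_fn and sync_fn are placeholders (assumptions of this sketch).
    """
    for _ in range(warmup):          # warm-up iterations, not timed
        copy_fn()
    iters = 1
    while True:
        sync_fn()                    # drain pending work before starting the clock
        t0 = time.perf_counter_ns()
        for _ in range(iters):
            copy_fn()
        sync_fn()                    # wait for the last copy before stopping the clock
        elapsed_s = (time.perf_counter_ns() - t0) / 1e9
        if elapsed_s >= min_time_s:
            break
        iters *= 2                   # run was too short to trust: double and retry
    return {
        "size_KiB": nbytes // 1024,
        "throughput_GBs": nbytes * iters / elapsed_s / 1e9,
        "iters": iters,
        "elapsed_s": elapsed_s,
    }


# CPU-only stand-in so the sketch runs anywhere; it does not measure a GPU.
src = bytearray(1 << 20)
dst = bytearray(1 << 20)


def cpu_copy():
    dst[:] = src


record = measure(cpu_copy, lambda: None, nbytes=len(src), min_time_s=0.01)
```

Doubling iters until the timed block is long enough avoids guessing a per-size iteration count up front, at the cost of a few discarded short runs.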
    

Block B — plot

  • matplotlib. x = size (log scale, KiB), y = GB/s (linear). Three curves with distinct colors and clear legend.
  • Annotate the theoretical ceilings (horizontal dashed lines):
    • H2D / D2H ceiling: PCIe peak per direction. From device_query.json: per-lane rate for pcie_generation (8, 16, 32 GT/s for Gen 3, 4, 5) × pcie_lane_width / 8 GB/s (very roughly: PCIe 4.0 x16 ≈ 32 GB/s, PCIe 5.0 x16 ≈ 64 GB/s).
    • D2D ceiling: HBM bandwidth from device_query.json (theoretical_hbm_bandwidth_gbs).
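
A sketch of the PCIe ceiling calculation, using the per-lane signaling rates and line-encoding efficiencies of PCIe generations 1 through 5. pcie_peak_gbs and PCIE_LANE_RATE are hypothetical names; the two arguments correspond to the pcie_generation and pcie_lane_width fields of device_query.json.

```python
# Per-lane signaling rate in GT/s and line-encoding efficiency, by PCIe generation.
# Gen 1-2 use 8b/10b encoding; Gen 3-5 use 128b/130b.
PCIE_LANE_RATE = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_peak_gbs(generation, lane_width):
    """Theoretical one-direction PCIe bandwidth in GB/s (10**9 bytes per second)."""
    gt_per_s, efficiency = PCIE_LANE_RATE[generation]
    return gt_per_s * efficiency * lane_width / 8   # divide by 8: bits to bytes


gen4_x16 = pcie_peak_gbs(4, 16)
gen5_x16 = pcie_peak_gbs(5, 16)
```

This is the link-level ceiling only. Packet header overhead takes a further share, so a measured peak somewhat below the dashed line is expected.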

Block C — interpret in interpretation.md

Three paragraphs:

  1. The H2D/D2H asymmetry. They're often within 10% of each other if both directions hit the same PCIe link. If they're very different, something is asymmetric (e.g., spotty PCIe quality, or a unidirectional optimization in the driver).
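  2. The D2D plateau. At what size does D2D saturate? It should plateau at 50–80% of theoretical_hbm_bandwidth_gbs for sizes > 1 MiB. (Small sizes can't saturate because they don't generate enough memory transactions.) A D2D copy both reads and writes every byte, so state whether you count bytes copied or total memory traffic (2×) when you report this fraction. If the measured plateau is <30% of theoretical, something is wrong: check that you're synchronizing correctly and that both buffers are on the same device. (Pinned host memory is not involved in D2D.)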
  3. D2D vs H2D ratio. D2D should be 30–60× faster than H2D at large sizes (HBM bandwidth vs PCIe). State the ratio you measured; compare to the datasheet ratio.
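
The saturation size and the D2D/H2D ratio can be read off the recorded curves with a few lines of Python. This is a sketch under assumptions: saturation_size_kib and peak_ratio are hypothetical helpers, the 90% threshold for "saturated" is an arbitrary choice, and the numbers below are synthetic placeholders, not measurements.

```python
def saturation_size_kib(curve, fraction=0.9):
    """Smallest size whose throughput reaches `fraction` of the curve's maximum."""
    peak = max(p["throughput_GBs"] for p in curve)
    for p in sorted(curve, key=lambda p: p["size_KiB"]):
        if p["throughput_GBs"] >= fraction * peak:
            return p["size_KiB"]


def peak_ratio(numer_curve, denom_curve):
    """Ratio of the peak throughputs of two curves."""
    numer = max(p["throughput_GBs"] for p in numer_curve)
    denom = max(p["throughput_GBs"] for p in denom_curve)
    return numer / denom


# Synthetic curves for illustration only; replace with bandwidth_test.json data.
d2d = [{"size_KiB": s, "throughput_GBs": t}
       for s, t in [(1, 0.5), (64, 20.0), (1024, 600.0), (16384, 950.0), (1048576, 1000.0)]]
h2d = [{"size_KiB": s, "throughput_GBs": t}
       for s, t in [(1, 0.1), (64, 5.0), (1024, 20.0), (16384, 24.0), (1048576, 25.0)]]

d2d_saturation = saturation_size_kib(d2d)
d2d_over_h2d = peak_ratio(d2d, h2d)
```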

Block D — manifest update

Add to the existing manifest.json from lab 01 (or create a new section):

{
  ...,
  "bandwidth_measured": {
    "h2d_peak_GBs": null,
    "d2h_peak_GBs": null,
    "d2d_peak_GBs": null,
    "h2d_pinned_ratio_vs_pageable": null,
    "d2d_fraction_of_hbm_peak": null
  }
}

(Optional: measure pageable-vs-pinned H2D ratio as a side experiment. Pageable should be ~½ of pinned.)
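
The bandwidth_measured section can be derived from the bandwidth_test.json structure rather than filled in by hand. A minimal sketch, assuming the JSON layout shown in Block A; the function name and its arguments are hypothetical, and the input values are synthetic placeholders.

```python
import json


def bandwidth_measured(results, hbm_peak_gbs, pageable_h2d_peak_gbs=None):
    """Build the manifest's bandwidth_measured section from per-direction curves."""
    def peak(name):
        return max(p["throughput_GBs"] for p in results[name])

    h2d = peak("h2d")
    d2d = peak("d2d")
    return {
        "h2d_peak_GBs": h2d,
        "d2h_peak_GBs": peak("d2h"),
        "d2d_peak_GBs": d2d,
        # Stays None unless the optional pageable experiment was run.
        "h2d_pinned_ratio_vs_pageable": (
            h2d / pageable_h2d_peak_gbs if pageable_h2d_peak_gbs else None
        ),
        "d2d_fraction_of_hbm_peak": d2d / hbm_peak_gbs,
    }


# Synthetic values for illustration only, not measurements.
results = {
    "h2d": [{"size_KiB": 1, "throughput_GBs": 10.0}, {"size_KiB": 1024, "throughput_GBs": 25.0}],
    "d2h": [{"size_KiB": 1, "throughput_GBs": 9.0}, {"size_KiB": 1024, "throughput_GBs": 24.0}],
    "d2d": [{"size_KiB": 1, "throughput_GBs": 500.0}, {"size_KiB": 1024, "throughput_GBs": 1000.0}],
}
section = bandwidth_measured(results, hbm_peak_gbs=2000.0, pageable_h2d_peak_gbs=12.5)
manifest_fragment = json.dumps({"bandwidth_measured": section}, indent=2)
```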

Constraints

  • Pinned memory only for H2D / D2H peak measurements. Pageable transfers hit ~half the bandwidth; that's a separate experiment.
  • Synchronize correctly. Without sync, you're timing only how long it takes to enqueue the asynchronous copy, not the transfer itself.
  • Single-direction at a time. Running H2D and D2H concurrently (in different streams) can nearly double aggregate throughput because PCIe is full duplex; that's a separate experiment. Phase 23 measures one at a time.
  • Single GPU. Multi-GPU peer-to-peer transfers (NVLink) are Phase 35.

Stop conditions

Done when:

  1. bandwidth_test.json has all three curves at 12+ points each.
  2. bandwidth_test.png clearly shows the three curves and the ceiling lines.
  3. D2D peak is within 30% of theoretical HBM bandwidth.
  4. H2D / D2H peaks are within 30% of theoretical PCIe bandwidth.
  5. interpretation.md answers all three Block C questions with measurements.
  6. Manifest updated.
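
Stop conditions 3 and 4 can be checked mechanically. The sketch below reads "within 30%" as a relative difference of at most 0.30 from the theoretical value; that interpretation and the helper name within are assumptions, and the numbers in the example calls are placeholders rather than measurements.

```python
def within(measured, theoretical, tolerance=0.30):
    """True if measured differs from theoretical by at most `tolerance` (relative)."""
    return abs(measured - theoretical) <= tolerance * theoretical


# Placeholder values for illustration only.
h2d_ok = within(measured=25.0, theoretical=31.5)     # 25.0 is about 21% below 31.5
h2d_bad = within(measured=20.0, theoretical=31.5)    # 20.0 is about 37% below 31.5
```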

Pitfalls

  • Small-size noise. At 1 KiB, fixed per-call overhead (API call and DMA setup, not the transfer itself) is >50% of the timing. Don't expect the H2D curve to rise smoothly from 1 KiB; the curve will be flat-ish until ~64 KiB and then rise.
  • D2H slower than H2D. Sometimes happens due to PCIe upstream/downstream asymmetry on certain platforms. Note it; don't try to "fix" it.
  • GPU thermal throttling. Long benchmark runs can throttle. Keep individual measurements short (under 5 seconds each).
  • cudaMemcpyAsync not actually async. With a pageable (non-pinned) host buffer, the copy is staged through a driver buffer and may block the host. Truly asynchronous H2D / D2H transfers require the host buffer to be pinned.

When to consult solutions/

After the stop conditions are met. The reference at solutions/02-bandwidth-test-ref.md (written at phase open) compares against the chosen GPU's expected numbers.


Next lab: lab/03-gpu-roofline.md.