Lab 03 — Attention Performance Profile

Goal: measure where time is spent in scaled dot-product attention on Borja's i5-8250U for sequence lengths 64, 128, 256, 512. Identify the memory-bound softmax and forward-reference Phase 27 (Flash Attention).

Estimated time: 60–90 minutes.

Prereq: labs 00–02 committed; Phase 1 roofline experiment committed (you need β_peak for your machine).


What you produce

A directory experiments/15-attention-perf/ containing:

  • perf.py — measurement script.
  • results.json — timings per operation per sequence length.
  • perf.png — stacked-bar chart: time spent in \(QK^\top\), softmax, \(AV\), projections (in/out).
  • roofline.png — overlay your measured attention performance on the Phase 1 roofline.
  • manifest.json.
  • README.md.

Background

The attention forward pass breaks down into five steps, a mix of matmuls and one elementwise op:

  1. Q = x @ W_Q, K = x @ W_K, V = x @ W_V: three input projections, \(O(T d^2)\) each.
  2. S = Q @ K.T / sqrt(d_k): \(O(T^2 d_k)\).
  3. A = softmax(S + mask): \(O(T^2)\) elementwise.
  4. out_h = A @ V: \(O(T^2 d_v)\).
  5. out = concat @ W_O: \(O(T d^2)\).

For long sequences (\(T \gg d_k\)), operations 2 and 4 dominate because they scale as \(T^2\). For short sequences, 1 and 5 dominate: they scale as \(T d^2\), which is the larger term while \(T\) is small relative to \(d\). The softmax (op 3) is always memory-bound, at well under 1 FLOP per byte moved (Block D estimates about 0.6).

Phase 27's Flash Attention attacks the memory traffic in 2 + 3 + 4 by tiling and not materializing \(A\). We don't fix it; we measure it to set up the future phase.
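To make the scaling argument concrete, the sketch below counts FLOPs per op for the lab's configuration. It assumes d_k = d_v = d_model / n_heads, counts a multiply-add as 2 FLOPs, and borrows the rough "about 5 FLOPs per element" softmax estimate from Block D. The function name attention_flops is illustrative, not part of the lab's required API.

```python
# Rough FLOP counts per attention op, summed over all heads.
# Assumptions: d_k = d_v = d_model // n_heads; one multiply-add = 2 FLOPs.

def attention_flops(T, d_model=64, n_heads=4):
    d_head = d_model // n_heads
    return {
        "qkv_proj": 3 * 2 * T * d_model * d_model,   # three T x d by d x d matmuls
        "scores": n_heads * 2 * T * T * d_head,      # Q @ K.T per head
        "softmax": n_heads * 5 * T * T,              # ~5 FLOPs per element (Block D estimate)
        "av": n_heads * 2 * T * T * d_head,          # A @ V per head
        "out_proj": 2 * T * d_model * d_model,       # concat @ W_O
    }

for T in (64, 128, 256, 512):
    f = attention_flops(T)
    quadratic = f["scores"] + f["softmax"] + f["av"]
    linear = f["qkv_proj"] + f["out_proj"]
    # Ratio > 1 means the T^2 ops carry more FLOPs than the projections.
    print(T, round(quadratic / linear, 2))
```

FLOP share is not time share: the memory-bound softmax will typically take more wall-clock time than its FLOP count suggests, which is what the lab measures.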

TODOs

Block A — instrumented forward

  • In perf.py, write attention_forward_timed(x, mha) that returns a dict {"qkv_proj": t, "scores": t, "softmax": t, "av": t, "out_proj": t, "total": t} with per-op timings in seconds.
  • Use time.perf_counter_ns(). Wrap each op with start/stop; force evaluation (numpy is eager, so this just works).
  • Run a warm-up pass before timing.
  • Each timed measurement should aggregate at least 50 ms of work (run multiple iterations and divide the total by the iteration count).
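One possible shape for the timing helper described in the bullets above is sketched here. The name time_op and its parameters (min_total_s, warmup) are hypothetical; the lab only requires time.perf_counter_ns(), a warm-up pass, and at least 50 ms of aggregated work. The float32 inputs are an assumption chosen to match the fp32 byte counts used later in Block D.

```python
import time
import numpy as np

def time_op(fn, min_total_s=0.05, warmup=3):
    """Return mean seconds per call of fn(), aggregating at least min_total_s of work."""
    for _ in range(warmup):          # warm-up: page faults, BLAS thread start-up
        fn()
    iters = 0
    elapsed_ns = 0
    target_ns = int(min_total_s * 1e9)
    while elapsed_ns < target_ns:
        start = time.perf_counter_ns()
        fn()                         # numpy is eager, so the work finishes inside this call
        elapsed_ns += time.perf_counter_ns() - start
        iters += 1
    return elapsed_ns / iters / 1e9

# Example: time the scores op for one head at T = 128, d_k = 16.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 16), dtype=np.float32)
K = rng.standard_normal((128, 16), dtype=np.float32)
t_scores = time_op(lambda: Q @ K.T / np.sqrt(np.float32(16)))
```

attention_forward_timed could call a helper like this once per op and assemble the dict, at the cost of running each op in isolation rather than inside a single forward pass.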

Block B — sweep sequence length

  • Sequence lengths: T ∈ {64, 128, 256, 512}.
  • Fixed: d_model = 64, n_heads = 4, d_head = 16.
  • For each \(T\), run attention_forward_timed enough iterations to get stable numbers. Record means and (optionally) std.
  • Store all results in results.json.
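A minimal sketch of the aggregation step follows. It assumes each call produces one timing dict keyed by op name, as in Block A. The helper names summarize and build_results, the run_once callable, and the mean_s / std_s keys are all hypothetical; the lab does not prescribe a schema for results.json.

```python
import json
import statistics

def summarize(samples):
    """samples: list of per-op timing dicts from repeated calls. Returns mean/std per op."""
    keys = list(samples[0].keys())
    out = {}
    for k in keys:
        values = [s[k] for s in samples]
        out[k] = {
            "mean_s": statistics.fmean(values),
            "std_s": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return out

def build_results(run_once, T_sweep=(64, 128, 256, 512), repeats=20):
    """run_once(T) is a hypothetical callable returning one timing dict for length T."""
    return {str(T): summarize([run_once(T) for _ in range(repeats)]) for T in T_sweep}

def save_results(results, path="results.json"):
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```

In perf.py, run_once would wrap attention_forward_timed with an input of the right length; JSON object keys must be strings, hence str(T).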

Block C — stacked bar chart

  • Four bars (one per \(T\)). Each bar is stacked: qkv_proj + scores + softmax + av + out_proj.
  • x-axis: \(T\). y-axis: time per forward (ms).
  • Annotate the dominant component per \(T\).
  • Save as perf.png.
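The stacked bars can be laid out as sketched below. matplotlib is an assumption here (the lab names no plotting library), and stack_layout / plot_stacked are hypothetical helper names. The input is a dict mapping each \(T\) to per-op times in milliseconds.

```python
import numpy as np

OPS = ["qkv_proj", "scores", "softmax", "av", "out_proj"]

def stack_layout(means_ms):
    """means_ms: {T: {op: ms}}. Returns (Ts, heights, bottoms); arrays are (n_ops, n_T)."""
    Ts = sorted(means_ms)
    heights = np.array([[means_ms[T][op] for T in Ts] for op in OPS], dtype=float)
    bottoms = np.cumsum(heights, axis=0) - heights   # each segment starts where the previous ends
    return Ts, heights, bottoms

def plot_stacked(means_ms, path="perf.png"):
    import matplotlib                # assumption: matplotlib is installed
    matplotlib.use("Agg")            # render to file without a display
    import matplotlib.pyplot as plt

    Ts, heights, bottoms = stack_layout(means_ms)
    x = np.arange(len(Ts))
    fig, ax = plt.subplots()
    for i, op in enumerate(OPS):
        ax.bar(x, heights[i], bottom=bottoms[i], label=op)
    for j in range(len(Ts)):
        dominant = OPS[int(np.argmax(heights[:, j]))]   # largest component for this T
        ax.annotate(dominant, (x[j], heights[:, j].sum()), ha="center", va="bottom")
    ax.set_xticks(x)
    ax.set_xticklabels([str(T) for T in Ts])
    ax.set_xlabel("T")
    ax.set_ylabel("time per forward (ms)")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)
```

Using evenly spaced bar positions with \(T\) as tick labels keeps the four bars the same width even though the \(T\) values double each step.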

Expected observations:

  • At \(T = 64\): input projections (qkv_proj + out_proj) dominate.
  • At \(T = 512\): the \(T^2\) ops (scores, av, softmax) dominate.
  • Softmax is never the smallest component proportionally — even though it's "just" elementwise, its memory traffic is heavy.

Block D — roofline placement

  • Compute arithmetic intensity for each of the five ops at each \(T\). The byte counts assume 4-byte fp32 elements (numpy defaults to float64, so cast inputs to float32 or double the byte counts):
      • qkv_proj: \(2 T d^2\) FLOPs over \(4 T d + 4 d^2\) bytes → \(I = T d / (2 T + 2 d)\).
      • scores: \(2 T^2 d_k\) FLOPs over \(8 T d_k + 4 T^2\) bytes → \(I = T d_k / (4 d_k + 2 T)\).
      • softmax: \(\sim 5 T^2\) FLOPs over \(8 T^2\) bytes → \(I \approx 0.6\) FLOP/byte.
      • av: same shape as scores.
      • out_proj: same shape as qkv_proj.
  • Compute the measured GFLOPS per op (FLOPs / time).
  • On the Phase 1 roofline plot, overlay each (intensity, GFLOPS) dot. Color by op type.
  • Save as roofline.png.
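The intensity formulas above translate directly into code. This sketch keeps the same fp32 byte-count model as the bullets; the function names are illustrative.

```python
def intensity_proj(T, d):
    """qkv_proj / out_proj: 2*T*d^2 FLOPs over 4*T*d + 4*d^2 bytes (fp32)."""
    return (2 * T * d * d) / (4 * T * d + 4 * d * d)

def intensity_scores(T, d_k):
    """scores / av: 2*T^2*d_k FLOPs over 8*T*d_k + 4*T^2 bytes (fp32)."""
    return (2 * T * T * d_k) / (8 * T * d_k + 4 * T * T)

def intensity_softmax():
    """softmax: ~5*T^2 FLOPs over 8*T^2 bytes, independent of T."""
    return 5 / 8

def gflops(flops, seconds):
    """Measured throughput for one op: FLOPs divided by time, in GFLOP/s."""
    return flops / seconds / 1e9

for T in (64, 128, 256, 512):
    print(T, intensity_proj(T, 64), intensity_scores(T, 16), intensity_softmax())
```

Each roofline dot is then the pair (intensity, gflops(flops, measured_seconds)) for one op at one \(T\).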

Block E — write up

In README.md, answer:

  1. Which op is memory-bound on Borja's machine, for which \(T\)? Softmax always. Scores and av for \(T\) below some threshold (compute the threshold from \(T d_k > (4 d_k + 2T) \cdot I_{\text{crit}}\), which is the Block D scores intensity set above \(I_{\text{crit}}\)).
  2. What's the gap between measured performance and the roofline ceiling for the softmax kernel? Compute the ratio of measured GFLOPS to the memory-bound ceiling at softmax's intensity. Expect it to be under 10%: softmax sits well below the ceiling because the numpy implementation is poorly vectorized for fp32 (compared with a tuned BLAS kernel).
  3. Preview: how would Flash Attention change the roofline picture? Two sentences. (Flash Attention doesn't change the FLOPs of the ops, only the bytes moved — by avoiding materializing \(A\). It moves the \(T^2\) ops up the roofline by reducing their memory traffic.)
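For question 1, the threshold can be worked out from the scores intensity in Block D. The derivation below assumes \(I_{\text{crit}}\) is the roofline ridge point, that is, peak FLOP/s divided by \(\beta_{\text{peak}}\) from Phase 1; plug in your own measured values.

```latex
\begin{aligned}
I_{\text{scores}}(T) = \frac{T d_k}{4 d_k + 2T} > I_{\text{crit}}
&\iff T d_k > (4 d_k + 2T)\, I_{\text{crit}} \\
&\iff T\,(d_k - 2 I_{\text{crit}}) > 4 d_k I_{\text{crit}} \\
&\iff T > \frac{4 d_k I_{\text{crit}}}{d_k - 2 I_{\text{crit}}}
\qquad \text{(valid only if } d_k > 2 I_{\text{crit}}\text{)}
\end{aligned}
```

Since \(I_{\text{scores}}(T) \to d_k / 2\) as \(T \to \infty\), the intensity never exceeds 8 FLOP/byte for \(d_k = 16\). If your \(I_{\text{crit}}\) is 8 or higher, this model predicts scores and av are memory-bound at every \(T\) in the sweep, and no threshold exists.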

Block F — manifest

{
  "experiment": "15-attention-perf",
  "date": "YYYY-MM-DD",
  "seed": 0,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
  "depends_on": "experiments/01-roofline/manifest.json",
  "config": {
    "d_model": 64,
    "n_heads": 4,
    "T_sweep": [64, 128, 256, 512]
  },
  "hardware": {
    "cpu_model": "Intel Core i5-8250U",
    "cpu_governor_at_run": "performance"
  },
  "results_summary": {
    "softmax_GFLOPS_at_T_512": null,
    "softmax_pct_of_total_at_T_512": null,
    "scores_GFLOPS_at_T_512": null,
    "phase_27_motivation_pct_memory": null
  }
}
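Some manifest fields can be filled programmatically rather than by hand. The sketch below assumes a Linux machine that exposes the cpufreq governor under sysfs; the helper names are hypothetical, and the function returns None for the governor when that file is not readable.

```python
import json
import platform
from pathlib import Path

import numpy as np

# Standard Linux cpufreq sysfs location for CPU 0; absent on other platforms.
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")

def read_governor():
    """Return the current governor string, or None if it cannot be read."""
    try:
        return GOVERNOR_PATH.read_text().strip()
    except OSError:
        return None

def environment_fields():
    """Fields for the manifest's "versions" and "hardware" sections."""
    return {
        "versions": {"python": platform.python_version(), "numpy": np.__version__},
        "hardware": {"cpu_governor_at_run": read_governor()},
    }

print(json.dumps(environment_fields(), indent=2))
```

Recording the governor at run time, rather than typing "performance" by hand, catches the case where it was reset between sessions.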

Constraints

  • No PyTorch. Numpy alone.
  • performance governor. Same as lab 01-01-memcpy-bandwidth.
  • Cold/warm runs. Warm only — we want repeatable numbers, not first-iteration page-fault timings.
  • No profile-guided optimization. Don't tune the code. The point is to measure the naive implementation; tuning is Phase 27.

Stop conditions

Done when:

  1. All six files committed.
  2. perf.png shows the expected stack across the four \(T\) values.
  3. roofline.png shows each op as a dot on the Phase 1 roofline.
  4. README.md answers all three Block E questions with reference to specific numbers.

Pitfalls

  • Numpy memcpy / view confusion. When timing an expression like Q @ K.T, the transpose is free (just a view) but the matmul writes a new buffer. Make sure you're not accidentally timing only the view operation.
  • Hyperthreading. i5-8250U has 4 cores / 8 threads. Depending on the BLAS build, numpy may use all 8 by default. For consistency, either set OMP_NUM_THREADS=4 (real cores) before numpy is imported, or document that you're using 8.
  • Cache-warmth between sizes. Larger \(T\) means larger working set, blowing the cache. This is realistic — don't try to "fix" it.
  • time.perf_counter resolution. On Linux this is nanosecond-resolution, but the kernel scheduler can interrupt you. Running each measurement 100+ times and taking the median is more robust than the mean.
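Three of these pitfalls can be checked in a few lines. This is a sketch: the sample timings are made-up illustrative numbers, not measurements, and whether OMP_NUM_THREADS is honored depends on which BLAS your numpy build links against.

```python
import os
# Must run before numpy is imported, otherwise the BLAS thread pool is already sized.
os.environ.setdefault("OMP_NUM_THREADS", "4")

import statistics
import numpy as np

K = np.ones((128, 16), dtype=np.float32)
Kt = K.T
is_view = np.shares_memory(K, Kt)       # transpose reuses K's buffer: no copy, near-zero cost
S = K @ Kt
is_new = not np.shares_memory(S, K)     # the matmul result lives in a fresh buffer

# Illustrative timings (ms) with one scheduler-interrupt outlier; not real data.
samples = [1.0, 1.1, 0.9, 1.0, 9.0]
med = statistics.median(samples)
mean = statistics.fmean(samples)
print(is_view, is_new, med, mean)
```

The single outlier drags the mean well above the typical value while leaving the median untouched, which is why the median is the safer summary for noisy timings.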

When to consult solutions/

After all six files committed. Solution at solutions/03-attention-perf-ref.md.


End of Phase 15 labs. Write PHASE_15_REPORT.md, fill learners/borja/phase-15/reflections.md.

This is the central derivation phase of the curriculum. The reflection here matters — write it slowly. What landed? What's still fuzzy? What was harder than expected?