English · Español
Lab 03 — Attention Performance Profile¶
Goal: measure where time is spent in scaled dot-product attention on Borja's i5-8250U for sequence lengths 64, 128, 256, 512. Identify the memory-bound softmax and forward-reference Phase 27 (Flash Attention).
Estimated time: 60–90 minutes.
Prereq: labs 00–02 committed; Phase 1 roofline experiment committed (you need
β_peakfor your machine).
What you produce¶
A directory experiments/15-attention-perf/ containing:
perf.py— measurement script.results.json— timings per operation per sequence length.perf.png— stacked-bar chart: time spent in \(QK^\top\), softmax, \(AV\), projections (in/out).roofline.png— overlay your measured attention performance on the Phase 1 roofline.manifest.json.README.md.
Background¶
The attention forward has six matmul/elementwise operations:
Q = x @ W_Q,K = x @ W_K,V = x @ W_V— three input projections, \(O(T d^2)\) each.S = Q @ K.T / sqrt(d_k)— \(O(T^2 d_k)\).A = softmax(S + mask)— \(O(T^2)\) elementwise.out_h = A @ V— \(O(T^2 d_v)\).out = concat @ W_O— \(O(T d^2)\).
For long sequences (\(T \gg d_k\)), operations 2 and 4 dominate (they scale as \(T^2\)). For short sequences, 1 and 5 dominate (they scale as \(T d^2\), larger constant). The softmax (op 3) is always memory-bound — 1–2 FLOPs per byte moved.
Phase 27's Flash Attention attacks the memory traffic in 2 + 3 + 4 by tiling and not materializing \(A\). We don't fix it; we measure it to set up the future phase.
TODOs¶
Block A — instrumented forward¶
- In
perf.py, writeattention_forward_timed(x, mha)that returns a dict{"qkv_proj": t, "scores": t, "softmax": t, "av": t, "out_proj": t, "total": t}with per-op timings in seconds. - Use
time.perf_counter_ns(). Wrap each op with start/stop; force evaluation (numpy is eager, so this just works). - Run a warm-up pass before timing.
- Each timed measurement should aggregate at least 50ms of work (run multiple iterations and divide).
Block B — sweep sequence length¶
- Sequence lengths:
T ∈ {64, 128, 256, 512}. - Fixed:
d_model = 64, n_heads = 4, d_head = 16. - For each \(T\), run
attention_forward_timedenough iterations to get stable numbers. Record means and (optionally) std. - Store all results in
results.json.
Block C — stacked bar chart¶
- Four bars (one per \(T\)). Each bar is stacked: qkv_proj + scores + softmax + av + out_proj.
- x-axis: \(T\). y-axis: time per forward (ms).
- Annotate the dominant component per \(T\).
- Save as
perf.png.
Expected observations:
- At \(T = 64\): input projections (qkv_proj + out_proj) dominate.
- At \(T = 512\): the \(T^2\) ops (scores, av, softmax) dominate.
- Softmax is never the smallest component proportionally — even though it's "just" elementwise, its memory traffic is heavy.
Block D — roofline placement¶
- Compute arithmetic intensity for each of the five ops at each \(T\):
qkv_proj: \(2 T d^2\) FLOPs over \(4 T d + 4 d^2\) bytes → \(I = T d / (2 T + 2 d)\).scores: \(2 T^2 d_k\) FLOPs over \(8 T d_k + 4 T^2\) bytes → \(I = T d_k / (4 d_k + 2 T)\).softmax: \(\sim 5 T^2\) FLOPs over \(8 T^2\) bytes → \(I \approx 0.6\) FLOP/byte.av: same shape as scores.out_proj: same shape as qkv_proj.- Compute the measured GFLOPS per op (FLOPs / time).
- On the Phase 1 roofline plot, overlay each (intensity, GFLOPS) dot. Color by op type.
- Save as
roofline.png.
Block E — write up¶
In README.md, answer:
- Which op is memory-bound on Borja's machine, for which \(T\)? Softmax always. Scores and av for $T < $ some threshold (compute the threshold from \(T d_k > 4(d_k + T) \cdot I_{\text{crit}}\)).
- What's the gap between measured performance and the roofline ceiling for the softmax kernel? Compute the ratio. It should be < 10% — softmax sits well below the memory-bound ceiling because it's poorly vectorized for fp32 in numpy (vs a tuned BLAS kernel).
- Preview: how would Flash Attention change the roofline picture? Two sentences. (Flash Attention doesn't change the FLOPs of the ops, only the bytes moved — by avoiding materializing \(A\). It moves the \(T^2\) ops up the roofline by reducing their memory traffic.)
Block F — manifest¶
{
"experiment": "15-attention-perf",
"date": "YYYY-MM-DD",
"seed": 0,
"versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
"depends_on": "experiments/01-roofline/manifest.json",
"config": {
"d_model": 64,
"n_heads": 4,
"T_sweep": [64, 128, 256, 512]
},
"hardware": {
"cpu_model": "Intel Core i5-8250U",
"cpu_governor_at_run": "performance"
},
"results_summary": {
"softmax_GFLOPS_at_T_512": null,
"softmax_pct_of_total_at_T_512": null,
"scores_GFLOPS_at_T_512": null,
"phase_27_motivation_pct_memory": null
}
}
Constraints¶
- No PyTorch. Numpy alone.
performancegovernor. Same as lab 01-01-memcpy-bandwidth.- Cold/warm runs. Warm only — we want repeatable numbers, not first-iteration page-fault timings.
- No profile-guided optimization. Don't tune the code. The point is to measure the naive implementation; tuning is Phase 27.
Stop conditions¶
Done when:
- All six files committed.
perf.pngshows the expected stack across the four \(T\) values.roofline.pngshows each op as a dot on the Phase 1 roofline.README.mdanswers all three Block E questions with reference to specific numbers.
Pitfalls¶
- Numpy memcpy / view confusion. When timing a slice like
Q @ K.T, the transpose is free (just a view) but the matmul writes a new buffer. Make sure you're not accidentally timing only the view operation. - Hyperthreading. i5-8250U has 4 cores / 8 threads. Numpy's BLAS will use all 8 by default. For consistency, either set
OMP_NUM_THREADS=4(real cores) or document that you're using 8. - Cache-warmth between sizes. Larger \(T\) means larger working set, blowing the cache. This is realistic — don't try to "fix" it.
time.perf_counterresolution. On Linux this is nanosecond-resolution, but the kernel scheduler can interrupt you. Running each measurement 100+ times and taking the median is more robust than the mean.
When to consult solutions/¶
After all six files committed. Solution at solutions/03-attention-perf-ref.md.
End of Phase 15 labs. Write PHASE_15_REPORT.md, fill learners/borja/phase-15/reflections.md.
This is the central derivation phase of the curriculum. The reflection here matters — write it slowly. What landed? What's still fuzzy? What was harder than expected?