English · Español

Lab 03 — Attention Performance Profile¶

Goal: measure where time is spent in scaled dot-product attention on Borja's i5-8250U for sequence lengths 64, 128, 256, 512. Identify the memory-bound softmax and forward-reference Phase 27 (Flash Attention).

Estimated time: 60–90 minutes.

Prereq: labs 00–02 committed; Phase 1 roofline experiment committed (you need β_peak for your machine).

What you produce¶

A directory experiments/15-attention-perf/ containing:

perf.py — measurement script.
results.json — timings per operation per sequence length.
perf.png — stacked-bar chart: time spent in $QK^\top$, softmax, $AV$, projections (in/out).
roofline.png — overlay your measured attention performance on the Phase 1 roofline.
manifest.json.
README.md.

Background¶

The attention forward has six matmul/elementwise operations:

Q = x @ W_Q, K = x @ W_K, V = x @ W_V — three input projections, $O(T d^2)$ each.
S = Q @ K.T / sqrt(d_k) — $O(T^2 d_k)$.
A = softmax(S + mask) — $O(T^2)$ elementwise.
out_h = A @ V — $O(T^2 d_v)$.
out = concat @ W_O — $O(T d^2)$.

For long sequences ($T \gg d_k$), operations 2 and 4 dominate (they scale as $T^2$). For short sequences, 1 and 5 dominate (they scale as $T d^2$, larger constant). The softmax (op 3) is always memory-bound — 1–2 FLOPs per byte moved.

Phase 27's Flash Attention attacks the memory traffic in 2 + 3 + 4 by tiling and not materializing $A$. We don't fix it; we measure it to set up the future phase.

TODOs¶

Block A — instrumented forward¶

In perf.py, write attention_forward_timed(x, mha) that returns a dict {"qkv_proj": t, "scores": t, "softmax": t, "av": t, "out_proj": t, "total": t} with per-op timings in seconds.
Use time.perf_counter_ns(). Wrap each op with start/stop; force evaluation (numpy is eager, so this just works).
Run a warm-up pass before timing.
Each timed measurement should aggregate at least 50ms of work (run multiple iterations and divide).

Block B — sweep sequence length¶

Sequence lengths: T ∈ {64, 128, 256, 512}.
Fixed: d_model = 64, n_heads = 4, d_head = 16.
For each $T$, run attention_forward_timed enough iterations to get stable numbers. Record means and (optionally) std.
Store all results in results.json.

Block C — stacked bar chart¶

Four bars (one per $T$). Each bar is stacked: qkv_proj + scores + softmax + av + out_proj.
x-axis: $T$. y-axis: time per forward (ms).
Annotate the dominant component per $T$.
Save as perf.png.

Expected observations:

At $T = 64$: input projections (qkv_proj + out_proj) dominate.
At $T = 512$: the $T^2$ ops (scores, av, softmax) dominate.
Softmax is never the smallest component proportionally — even though it's "just" elementwise, its memory traffic is heavy.

Block D — roofline placement¶

Compute arithmetic intensity for each of the five ops at each $T$:
qkv_proj: $2 T d^2$ FLOPs over $4 T d + 4 d^2$ bytes → $I = T d / (2 T + 2 d)$.
scores: $2 T^2 d_k$ FLOPs over $8 T d_k + 4 T^2$ bytes → $I = T d_k / (4 d_k + 2 T)$.
softmax: $\sim 5 T^2$ FLOPs over $8 T^2$ bytes → $I \approx 0.6$ FLOP/byte.
av: same shape as scores.
out_proj: same shape as qkv_proj.
Compute the measured GFLOPS per op (FLOPs / time).
On the Phase 1 roofline plot, overlay each (intensity, GFLOPS) dot. Color by op type.
Save as roofline.png.

Block E — write up¶

In README.md, answer:

Which op is memory-bound on Borja's machine, for which $T$? Softmax always. Scores and av for $T < $ some threshold (compute the threshold from $T d_k > 4(d_k + T) \cdot I_{\text{crit}}$).
What's the gap between measured performance and the roofline ceiling for the softmax kernel? Compute the ratio. It should be < 10% — softmax sits well below the memory-bound ceiling because it's poorly vectorized for fp32 in numpy (vs a tuned BLAS kernel).
Preview: how would Flash Attention change the roofline picture? Two sentences. (Flash Attention doesn't change the FLOPs of the ops, only the bytes moved — by avoiding materializing $A$. It moves the $T^2$ ops up the roofline by reducing their memory traffic.)

Block F — manifest¶

{
  "experiment": "15-attention-perf",
  "date": "YYYY-MM-DD",
  "seed": 0,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
  "depends_on": "experiments/01-roofline/manifest.json",
  "config": {
    "d_model": 64,
    "n_heads": 4,
    "T_sweep": [64, 128, 256, 512]
  },
  "hardware": {
    "cpu_model": "Intel Core i5-8250U",
    "cpu_governor_at_run": "performance"
  },
  "results_summary": {
    "softmax_GFLOPS_at_T_512": null,
    "softmax_pct_of_total_at_T_512": null,
    "scores_GFLOPS_at_T_512": null,
    "phase_27_motivation_pct_memory": null
  }
}

Constraints¶

No PyTorch. Numpy alone.
performance governor. Same as lab 01-01-memcpy-bandwidth.
Cold/warm runs. Warm only — we want repeatable numbers, not first-iteration page-fault timings.
No profile-guided optimization. Don't tune the code. The point is to measure the naive implementation; tuning is Phase 27.

Stop conditions¶

Done when:

All six files committed.
perf.png shows the expected stack across the four $T$ values.
roofline.png shows each op as a dot on the Phase 1 roofline.
README.md answers all three Block E questions with reference to specific numbers.

Pitfalls¶

Numpy memcpy / view confusion. When timing a slice like Q @ K.T, the transpose is free (just a view) but the matmul writes a new buffer. Make sure you're not accidentally timing only the view operation.
Hyperthreading. i5-8250U has 4 cores / 8 threads. Numpy's BLAS will use all 8 by default. For consistency, either set OMP_NUM_THREADS=4 (real cores) or document that you're using 8.
Cache-warmth between sizes. Larger $T$ means larger working set, blowing the cache. This is realistic — don't try to "fix" it.
time.perf_counter resolution. On Linux this is nanosecond-resolution, but the kernel scheduler can interrupt you. Running each measurement 100+ times and taking the median is more robust than the mean.

When to consult `solutions/`¶

After all six files committed. Solution at solutions/03-attention-perf-ref.md.

End of Phase 15 labs. Write PHASE_15_REPORT.md, fill learners/borja/phase-15/reflections.md.

This is the central derivation phase of the curriculum. The reflection here matters — write it slowly. What landed? What's still fuzzy? What was harder than expected?