Lab 03: Vectorization budget
Goal: measure, on your own machine, the Python-loop-vs-NumPy speedup ratio as a function of array size. Internalize the 100× rule. Touch each of the four profilers at least once.
Estimated time: 60–90 minutes.
Prereqs: labs 00, 01.
What you produce
A directory `experiments/06-vectorization-budget/` containing:
- `bench_sum.py`: Python-loop sum vs `np.sum` across sizes `2^k`, `k ∈ {4..24}` (21 sizes).
- `results.json`: `{size, op, mean_time_ns, std_time_ns, n_repeats}` per measurement.
- `speedup.png`: two plots: (a) absolute time of both ops vs size, log-log; (b) speedup ratio vs size, log-x linear-y.
- `profiler_tour.md`: short notes from running each of the four profilers on the same script.
- `manifest.json`.
- `README.md`: interpretation. Where does the crossover happen on your machine? What's the plateau ratio? Why does the ratio saturate?
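One possible shape for the `results.json` records is sketched below. The five field names come from the list above; the `op` labels `"python"` and `"numpy"`, the helper names, and the choice of a single flat JSON list are assumptions made for illustration.

```python
import json
from pathlib import Path


def make_record(size, op, mean_time_ns, std_time_ns, n_repeats):
    """Build one measurement record with the five fields listed above."""
    return {
        "size": int(size),
        "op": op,  # assumed labels: "python" or "numpy"
        "mean_time_ns": float(mean_time_ns),
        "std_time_ns": float(std_time_ns),
        "n_repeats": int(n_repeats),
    }


def save_results(records, path="results.json"):
    """Write all records as one JSON list (the flat layout is an assumption)."""
    Path(path).write_text(json.dumps(records, indent=2))


def load_results(path="results.json"):
    """Read the records back, e.g. from the plotting step."""
    return json.loads(Path(path).read_text())
```

Casting to `int` and `float` keeps NumPy scalar types out of the records, which `json.dumps` would otherwise reject.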
TODOs
Block A: benchmark
- Generate sizes: `[2**k for k in range(4, 25)]` → 16 .. 16777216 (~16M).
- For each size `N`:
  - Allocate `arr = rng.standard_normal(N, dtype=np.float32)`.
  - Warm up both ops once (don't time).
  - Time the Python loop: `total = 0.0`, then `for x in arr: total += x`. Batch enough calls that one timed repetition lasts ≥100 ms; record the mean and std of the per-call time across repetitions.
  - Time `total = arr.sum()`. Same protocol.
  - For very large sizes (`N > 2^20`), the Python loop is slow; cap repetitions at 1.
- Save to `results.json`.
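The timing protocol above could be sketched roughly as follows. This is one reading of it, not a reference implementation: it assumes "≥100 ms" means batching calls until a single timed repetition lasts that long, and the names `time_op`, `min_batch_ns`, and `n_repeats` are hypothetical. The helper takes any zero-argument callable, so it needs nothing beyond the standard library.

```python
import gc
import statistics
import time


def python_loop_sum(arr):
    """Sum element by element in the interpreter (the slow path under test)."""
    total = 0.0
    for x in arr:
        total += x
    return total


def time_op(fn, min_batch_ns=100_000_000, n_repeats=5):
    """Return (mean_ns, std_ns, n_repeats) for one call of fn."""
    fn()  # warm-up call, not timed
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the measurement
    try:
        # Calibrate: double the calls per batch until one batch is long enough.
        calls = 1
        while True:
            start = time.perf_counter_ns()
            for _ in range(calls):
                fn()
            if time.perf_counter_ns() - start >= min_batch_ns:
                break
            calls *= 2
        # Measure: each repetition times one batch, then divides by its size.
        per_call = []
        for _ in range(n_repeats):
            start = time.perf_counter_ns()
            for _ in range(calls):
                fn()
            per_call.append((time.perf_counter_ns() - start) / calls)
    finally:
        if gc_was_enabled:
            gc.enable()
    # A single repetition has no spread; report 0.0 rather than failing.
    std = statistics.stdev(per_call) if len(per_call) > 1 else 0.0
    return statistics.mean(per_call), std, len(per_call)
```

With an array `arr` in hand, the two measurements would be `time_op(lambda: python_loop_sum(arr))` and `time_op(arr.sum)`, passing `n_repeats=1` for `N > 2^20`.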
Block B: plot
- Plot (a): absolute time per call, log-log. Two lines.
- Plot (b): speedup ratio `t_python / t_numpy`, log-x linear-y. One line.
- Annotate the crossover point (where ratio ≈ 1) with a vertical dashed line.
- Annotate the plateau ratio with a horizontal dashed line.
- Save as `speedup.png`.
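The two annotation values can be derived from the measurements before any plotting. The sketch below assumes records shaped like `{size, op, mean_time_ns, ...}` with `op` labels `"python"` and `"numpy"` (the labels are an assumption), interpolates the crossover in log2(size), and takes the plateau as the mean ratio over the last few sizes. Both choices are illustrative, not prescribed by the lab.

```python
import math


def speedup_curve(results):
    """Turn measurement records into sorted (size, t_python / t_numpy) pairs."""
    by_size = {}
    for rec in results:
        by_size.setdefault(rec["size"], {})[rec["op"]] = rec["mean_time_ns"]
    return [
        (size, t["python"] / t["numpy"])
        for size, t in sorted(by_size.items())
        if "python" in t and "numpy" in t
    ]


def crossover_size(curve):
    """Interpolate in log2(size) for the point where the ratio crosses 1."""
    for (n0, r0), (n1, r1) in zip(curve, curve[1:]):
        if r0 < 1.0 <= r1:
            frac = (1.0 - r0) / (r1 - r0)
            log_n = math.log2(n0) + frac * (math.log2(n1) - math.log2(n0))
            return 2.0 ** log_n
    return None  # the ratio never crosses 1 inside the measured range


def plateau_ratio(curve, tail=3):
    """Mean ratio over the `tail` largest sizes."""
    ratios = [ratio for _, ratio in curve[-tail:]]
    return sum(ratios) / len(ratios)
```

With matplotlib, these values would typically be passed to `ax.axvline(x, linestyle="--")` and `ax.axhline(y, linestyle="--")` on the ratio plot.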
Block C: profiler tour
Pick the largest size where the Python loop still runs in under 30 seconds (probably 2^22 ≈ 4M). Then:
- cProfile: run `python -m cProfile -s cumtime bench_sum.py` (with size hardcoded to your chosen N). Note the top 5 functions by cumtime. Snapshot the relevant output to `profiler_tour.md`.
- line_profiler: add the `@profile` decorator to your loop function; run `kernprof -l -v bench_sum.py`. Note which line dominates. Snapshot.
- memory_profiler: decorate the function that allocates the large array; run `python -m memory_profiler bench_sum.py`. Note the peak memory delta. Snapshot.
- py-spy: run `py-spy record -o profile.svg -- python bench_sum.py`, which launches the script itself. To attach to a script already running in another terminal instead, use `py-spy record -o profile.svg --pid <PID>`. Snapshot the flamegraph filename + a 1-sentence description of what dominates.
You do not need to interpret these deeply; just demonstrate you've touched each. The point is "the tool exists, the command worked, here's the file".
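For the cProfile step, the same top-5 view can also be captured from inside Python, which makes it easy to paste into `profiler_tour.md`. This is a minimal sketch using only the standard library; `profile_snapshot` is a hypothetical helper name, and the plain list stands in for the real array.

```python
import cProfile
import io
import pstats


def python_loop_sum(arr):
    """The loop under test: one interpreter-level addition per element."""
    total = 0.0
    for x in arr:
        total += x
    return total


def profile_snapshot(fn, *args, top=5):
    """Run fn under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(fn, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


# A plain list is used here so the sketch runs without NumPy.
snapshot = profile_snapshot(python_loop_sum, [0.5] * 100_000)
print(snapshot)
```

Because the loop body makes no function calls, cProfile attributes essentially all the time to `python_loop_sum` as a whole; finding the hot line inside it is the job of line_profiler.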
Block D: interpretation
In README.md:
- Where is the crossover (size where Python loop and NumPy take the same time)? The expected results place it between N=16 and N=256. What do you measure?
- What is the plateau ratio at the largest size? Theory predicts ~50–100×. What do you measure?
- Why does the ratio saturate? Hint: at large sizes, both Python and NumPy are bound by something; what?
- What does the cProfile output tell you that the line_profiler doesn't? And vice-versa?
Three sentences each. Don't pad.
Constraints
- CPU governor `performance`. Re-set if your machine has been idle.
- fp32 throughout.
- One thread. Don't run other workloads during the measurement.
- Same RNG seed across sizes.
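Some of these constraints can be checked from the script itself. The sketch below assumes a Linux machine that exposes the cpufreq sysfs file shown; the helper names are hypothetical. The thread-count variables only affect threaded BLAS/OpenMP backends, so they are a precaution here, and they must be set before NumPy is imported.

```python
import os
from pathlib import Path

# Pin common math backends to one thread (set before importing NumPy).
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(var, "1")

# Assumed location of the governor setting on Linux with cpufreq enabled.
GOVERNOR_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"


def read_governor(path=GOVERNOR_PATH):
    """Return the current cpufreq governor, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def governor_warning(governor):
    """Return a warning string if the governor is known and not 'performance'."""
    if governor is not None and governor != "performance":
        return f"CPU governor is {governor!r}, expected 'performance'"
    return None


warning = governor_warning(read_governor())
if warning:
    print(warning)
```

For the seed constraint, one option is to create a fresh `np.random.default_rng(SEED)` with the same `SEED` for every size.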
Expected results
| N | Python sum | NumPy .sum() | Ratio |
|---|---|---|---|
| 16 | ~1 μs | ~1.5 μs | 0.7× |
| 256 | ~18 μs | ~1.6 μs | 11× |
| 4096 | ~280 μs | ~5 μs | 56× |
| 65536 | ~4.5 ms | ~70 μs | 64× |
| 1048576 | ~75 ms | ~1.0 ms | 75× |
| 16777216 | ~1.2 s | ~16 ms | 75× |
The crossover is between N=16 and N=256. The plateau settles around 50–100×. On the i5-8250U, your numbers should fall within ±50% of these.
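The shape of the table can be reproduced with a toy cost model: a per-element cost for the Python loop, and a fixed overhead plus a per-element cost for NumPy. The three constants below are read off the table rows (1.2 s and 16 ms at N = 16777216, about 1.5 μs at N = 16). They are assumptions for illustration, not measurements, and the linear model is a simplification.

```python
PY_NS_PER_ELEM = 71.5    # assumed: ~1.2 s / 16777216 elements
NP_OVERHEAD_NS = 1500.0  # assumed: ~1.5 us fixed cost seen at tiny N
NP_NS_PER_ELEM = 0.95    # assumed: ~16 ms / 16777216 elements


def model_ratio(n):
    """Predicted t_python / t_numpy at size n under the linear model."""
    t_python = PY_NS_PER_ELEM * n
    t_numpy = NP_OVERHEAD_NS + NP_NS_PER_ELEM * n
    return t_python / t_numpy


def model_crossover():
    """Size at which the two modelled times are equal."""
    return NP_OVERHEAD_NS / (PY_NS_PER_ELEM - NP_NS_PER_ELEM)


def model_plateau():
    """Limit of the ratio as n grows: the fixed overhead stops mattering."""
    return PY_NS_PER_ELEM / NP_NS_PER_ELEM


for n in (16, 256, 4096, 65536, 1048576, 16777216):
    print(f"{n:>9}  {model_ratio(n):6.1f}x")
```

In this model the crossover is set by NumPy's fixed overhead and the plateau by the ratio of the two per-element costs. Comparing the model with your measured curve is a quick way to see which sizes depart from it.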
Stop conditions
Done when:
- `results.json` covers all 21 sizes.
- `speedup.png` shows the crossover and plateau clearly.
- Each of the four profilers has at least one artifact (output snippet, SVG file, or text snapshot) referenced from `profiler_tour.md`.
- `README.md` answers all four interpretation questions.
Pitfalls
- `for x in arr` allocates a scalar object per iteration (a NumPy `np.float32`, not a plain Python `float`). That's why it's slow. NumPy's internal sum does not allocate.
- `np.sum` for tiny arrays is slower than Python. ~1.5 μs of dispatch overhead. Don't be surprised at small N.
- GC pauses during the Python loop. For 1.2 s loops, GC might fire mid-measurement. Call `gc.disable()` defensively.
- Background processes. Browser tabs, etc. Close them. Run `top` to verify single-process CPU.
- `py-spy` permissions. May need `sudo` on Fedora. If it complains, see Phase 6 plan §7 open question (d).
- `line_profiler` decorator must be present at runtime. If you decorate with `@profile` outside of `kernprof`, Python doesn't know what `@profile` is. Use a conditional decorator (`from line_profiler import profile` works in recent versions) or only run via `kernprof`.
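One common way to write the conditional decorator is sketched below. It relies on `kernprof -l` injecting a `profile` name into `builtins` before it runs the script. When that name is absent, the fallback is a pass-through, so the same file also runs under plain `python`. This is an illustrative pattern, not line_profiler's own recommended mechanism.

```python
import builtins

# Under `kernprof -l`, builtins.profile is the line profiler's decorator.
# Otherwise the attribute is missing and we substitute a no-op decorator.
profile = getattr(builtins, "profile", lambda fn: fn)


@profile
def python_loop_sum(arr):
    """Loop under test; profiled line by line only when run via kernprof."""
    total = 0.0
    for x in arr:
        total += x
    return total


if __name__ == "__main__":
    print(python_loop_sum([1.0, 2.0, 3.0]))
```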
When to consult solutions/
After your three scripts and four profiler artifacts exist, and README.md is complete. solutions/03-vectorization-budget-ref.md (at phase open) shows the reference plot and profiler interpretations.
End of Phase 6 labs. Next: write PHASE_06_REPORT.md and learners/borja/phase-06/reflections.md.