Lab 03 — Vectorization budget

Goal: measure, on your own machine, the Python-loop-vs-NumPy speedup ratio as a function of array size. Internalize the 100× rule. Touch each of the four profilers at least once.

Estimated time: 60–90 minutes.

Prereqs: labs 00, 01.


What you produce

A directory experiments/06-vectorization-budget/ containing:

  • bench_sum.py — Python-loop sum vs np.sum across sizes 2^k, k ∈ {4..24} (21 sizes).
  • results.json: one record {size, op, mean_time_ns, std_time_ns, n_repeats} per measurement.
  • speedup.png — two plots: (a) absolute time of both ops vs size, log-log; (b) speedup ratio vs size, log-x linear-y.
  • profiler_tour.md — short notes from running each of the four profilers on the same script.
  • manifest.json.
  • README.md — interpretation: where does the crossover happen on your machine? What's the plateau ratio? Why does the ratio saturate?

TODOs

Block A — benchmark

  • Generate sizes: [2**k for k in range(4, 25)] → 16 .. 16777216 (~16M).
  • For each size N:
      • Allocate arr = rng.standard_normal(N, dtype=np.float32).
      • Warm up both ops once (don't time).
      • Time the Python loop: total = 0.0; for x in arr: total += x. Loop enough calls per repetition that one repetition lasts ≥100 ms; record the per-call mean and std.
      • Time total = arr.sum(). Same protocol.
      • For very large sizes (N > 2^20), the Python loop is slow; cap its repetitions at 1.
  • Save to results.json.
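
The Block A protocol could be sketched roughly as follows. This is one possible reading, not a reference solution: the helper names (python_sum, numpy_sum, time_op, run, main), the op labels, the default of 5 repetitions, and recording std as 0.0 when only one repetition is taken are all assumptions made for illustration.

```python
import gc
import json
import statistics
import time

import numpy as np


def python_sum(arr):
    """Sum with an explicit Python-level loop (the slow path being measured)."""
    total = 0.0
    for x in arr:
        total += x
    return total


def numpy_sum(arr):
    """Sum with NumPy's built-in reduction."""
    return arr.sum()


def _batch_ns(fn, arr, number):
    """Wall time in ns for `number` back-to-back calls of fn(arr)."""
    t0 = time.perf_counter_ns()
    for _ in range(number):
        fn(arr)
    return time.perf_counter_ns() - t0


def time_op(fn, arr, min_batch_s=0.1, repeats=5):
    """Return (mean_ns, std_ns, n_repeats) per call of fn(arr).

    Calls are grouped into batches sized so that one batch lasts at
    least min_batch_s; mean and std are taken across batches.
    """
    fn(arr)  # warm-up call, not timed
    number = 1
    while True:  # calibrate: double calls per batch until it is long enough
        elapsed = _batch_ns(fn, arr, number)
        if elapsed >= min_batch_s * 1e9:
            break
        number *= 2
    per_call = [elapsed / number]  # the calibrating batch counts as repetition 1
    for _ in range(repeats - 1):
        per_call.append(_batch_ns(fn, arr, number) / number)
    mean = statistics.fmean(per_call)
    # Assumption: with a single repetition, std is recorded as 0.0.
    std = statistics.stdev(per_call) if len(per_call) > 1 else 0.0
    return mean, std, len(per_call)


def run(sizes, seed=0, min_batch_s=0.1, repeats=5):
    """Benchmark both ops at every size; returns results.json-style records."""
    records = []
    gc.disable()  # keep collector pauses out of the timings
    try:
        for n in sizes:
            rng = np.random.default_rng(seed)  # same seed at every size
            arr = rng.standard_normal(n, dtype=np.float32)
            for op, fn in (("python_loop", python_sum), ("numpy_sum", numpy_sum)):
                # Cap only the slow Python loop at one repetition for large N.
                reps = 1 if (op == "python_loop" and n > 2**20) else repeats
                mean, std, n_rep = time_op(fn, arr, min_batch_s, reps)
                records.append({"size": n, "op": op, "mean_time_ns": mean,
                                "std_time_ns": std, "n_repeats": n_rep})
    finally:
        gc.enable()
    return records


def main():
    """Full sweep over 2^4 .. 2^24; writes results.json in the current directory."""
    results = run([2**k for k in range(4, 25)])
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)
```

Calling main() runs the full sweep. Batching calls between two timer reads keeps timer overhead small relative to a ~1 μs operation, which matters at the smallest sizes.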

Block B — plot

  • Plot (a): absolute time per call, log-log. Two lines.
  • Plot (b): speedup ratio t_python / t_numpy, log-x linear-y. One line.
  • Annotate the crossover point (where ratio ≈ 1) with a vertical dashed line.
  • Annotate the plateau ratio with a horizontal dashed line.
  • Save as speedup.png.
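
A possible sketch of Block B, assuming records shaped like results.json with the hypothetical op labels "python_loop" and "numpy_sum". The crossover is estimated by linear interpolation of the ratio in log2(N), and the plateau as the mean ratio over the few largest sizes; both are rough illustrative choices, not part of the lab spec. matplotlib is imported inside plot() so the numeric helpers work without it.

```python
import math


def speedup_by_size(records):
    """Map size -> t_python / t_numpy from results.json-style records."""
    times = {}
    for r in records:
        times.setdefault(r["size"], {})[r["op"]] = r["mean_time_ns"]
    return {n: t["python_loop"] / t["numpy_sum"]
            for n, t in sorted(times.items())
            if "python_loop" in t and "numpy_sum" in t}


def crossover_size(ratios):
    """Estimate the size where the ratio first rises through 1.

    Interpolates linearly in log2(N) between the two bracketing sizes.
    Returns None if the ratio never crosses 1 in the measured range.
    """
    pts = sorted(ratios.items())
    for (n0, r0), (n1, r1) in zip(pts, pts[1:]):
        if r0 < 1.0 <= r1:
            frac = (1.0 - r0) / (r1 - r0)
            return 2 ** (math.log2(n0) + frac * (math.log2(n1) - math.log2(n0)))
    return None


def plateau_ratio(ratios, last_k=3):
    """Mean ratio over the last_k largest sizes."""
    tail = [r for _, r in sorted(ratios.items())][-last_k:]
    return sum(tail) / len(tail)


def plot(records, out_path="speedup.png"):
    """Draw panels (a) and (b) side by side and save them to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to file; no display needed
    import matplotlib.pyplot as plt

    ratios = speedup_by_size(records)
    fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(11, 4))

    # Panel (a): absolute time per call, log-log, one line per op.
    for op in ("python_loop", "numpy_sum"):
        pts = sorted((r["size"], r["mean_time_ns"]) for r in records if r["op"] == op)
        ax_a.loglog([p[0] for p in pts], [p[1] for p in pts], marker="o", label=op)
    ax_a.set_xlabel("N")
    ax_a.set_ylabel("time per call (ns)")
    ax_a.legend()

    # Panel (b): speedup ratio, log-x linear-y, with both annotations.
    ax_b.semilogx(list(ratios), list(ratios.values()), marker="o")
    cross = crossover_size(ratios)
    if cross is not None:
        ax_b.axvline(cross, linestyle="--")
    ax_b.axhline(plateau_ratio(ratios), linestyle="--")
    ax_b.set_xlabel("N")
    ax_b.set_ylabel("t_python / t_numpy")

    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Load results.json with json.load and pass the list to plot().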

Block C — profiler tour

Pick the largest size where the Python loop still runs in under 30 seconds (probably 2^22 ≈ 4M). Then:

  1. cProfile: run python -m cProfile -s cumtime bench_sum.py (with size hardcoded to your chosen N). Note the top 5 functions by cumtime. Snapshot the relevant output to profiler_tour.md.
  2. line_profiler: add @profile decorator to your loop function; run kernprof -l -v bench_sum.py. Note which line dominates. Snapshot.
  3. memory_profiler: decorate the function that allocates the large array; run python -m memory_profiler bench_sum.py. Note the peak memory delta. Snapshot.
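  4. py-spy: run py-spy record -o profile.svg -- python bench_sum.py. This launches the script itself, so no second terminal is needed; to attach to an already-running process instead, use py-spy record -o profile.svg --pid <PID>. Snapshot the flamegraph filename + a 1-sentence description of what dominates.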

You do not need to interpret these deeply; just demonstrate you've touched each. The point is "the tool exists, the command worked, here's the file".
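
If you would rather capture the cProfile snapshot from inside Python than copy it from the terminal, something like the following might work. It uses only the standard-library cProfile and pstats modules; the helper name profile_top is made up for this sketch.

```python
import cProfile
import io
import pstats


def profile_top(fn, *args, top=5):
    """Run fn(*args) under cProfile; return the top entries by cumtime as text."""
    prof = cProfile.Profile()
    prof.enable()
    try:
        fn(*args)
    finally:
        prof.disable()
    buf = io.StringIO()
    # "cumulative" is the same sort key that `-s cumtime` selects on the CLI.
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()
```

The returned string can be pasted into profiler_tour.md as the cProfile artifact.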

Block D — interpretation

In README.md:

  • Where is the crossover (size where Python loop and NumPy take the same time)? The expected-results table puts it between N=16 and N=256. What do you measure?
  • What is the plateau ratio at the largest size? Theory predicts ~50–100×. What do you measure?
  • Why does the ratio saturate? Hint: at large sizes, both Python and NumPy are bound by something; what?
  • What does the cProfile output tell you that the line_profiler doesn't? And vice-versa?

Three sentences each. Don't pad.

Constraints

  • Set the CPU governor to performance. Re-set it if your machine has been idle.
  • fp32 throughout.
  • One thread. Don't run other workloads during the measurement.
  • Same RNG seed across sizes.
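
A small pre-flight check for the first and third constraints might look like this. It is a Linux-only sketch: the sysfs path is the usual cpufreq location but may be absent on some systems, and the helper names are invented here. The thread-count environment variables are a defensive measure for BLAS/OpenMP backends and only take effect if set before NumPy is imported.

```python
import os
from pathlib import Path


def read_governor(cpu=0):
    """Return the cpufreq governor for one CPU, or None if sysfs does not expose it."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def limit_threads(n=1):
    """Ask common BLAS/OpenMP backends for n threads.

    Must run before NumPy is imported to have any effect.
    """
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)
```

A benchmark script could call limit_threads() at the very top and warn if read_governor() returns anything other than "performance".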

Expected results

N           Python sum   NumPy .sum()   Ratio
16          ~1 μs        ~1.5 μs        0.7×
256         ~18 μs       ~1.6 μs        11×
4096        ~280 μs      ~5 μs          56×
65536       ~4.5 ms      ~70 μs         64×
1048576     ~75 ms       ~1.0 ms        75×
16777216    ~1.2 s       ~16 ms         75×

The crossover is between N=16 and N=256. The plateau settles around 50–100×. Expect your numbers to fall within ±50% of these on the i5-8250U.
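
One way to read the table is per element rather than per call. The sketch below copies the approximate figures from the table (converted to nanoseconds) and divides by N; the names per_element_ns, EXPECTED, and summarize are illustrative only.

```python
def per_element_ns(total_ns, n):
    """Average cost of one element, in nanoseconds."""
    return total_ns / n


# Rows copied from the expected-results table: (N, python_ns, numpy_ns).
EXPECTED = [
    (16, 1e3, 1.5e3),
    (256, 18e3, 1.6e3),
    (4096, 280e3, 5e3),
    (65536, 4.5e6, 70e3),
    (1048576, 75e6, 1.0e6),
    (16777216, 1.2e9, 16e6),
]


def summarize(rows):
    """Return (N, python ns/element, numpy ns/element, ratio) for each row."""
    return [(n, per_element_ns(tp, n), per_element_ns(tn, n), tp / tn)
            for n, tp, tn in rows]
```

Running summarize on your own results.json values shows where the per-element cost of each op stops changing with N, which is a useful starting point for the saturation question in Block D.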

Stop conditions

Done when:

  1. results.json covers all 21 sizes.
  2. speedup.png shows the crossover and plateau clearly.
  3. Each of the four profilers has at least one artifact (output snippet, SVG file, or text snapshot) referenced from profiler_tour.md.
  4. README.md answers all four interpretation questions.

Pitfalls

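  • for x in arr boxes each element into a new NumPy scalar object (np.float32) on every iteration, and total += x then goes through interpreter dispatch. That's why it's slow. NumPy's internal sum runs a compiled loop over the raw buffer and allocates no per-element objects.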
  • np.sum for tiny arrays is slower than Python. ~1.5 μs of dispatch overhead. Don't be surprised at small N.
  • GC pauses during the Python loop. For 1.2 s loops, GC might fire mid-measurement. Call gc.disable() defensively before timing and gc.enable() afterwards.
  • Background processes. Browser tabs, etc. Close them. Run top to verify single-process CPU.
  • py-spy permissions. May need sudo on Fedora. If it complains, see Phase 6 plan §7 open question (d).
  • line_profiler decorator must be present at runtime. If you decorate with @profile outside of kernprof, Python doesn't know what @profile is. Use a conditional decorator (from line_profiler import profile works in recent versions) or only run via kernprof.
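
For the last pitfall, one commonly used pattern is a no-op fallback for @profile. The sketch assumes that kernprof -l makes profile available as a builtin while the script runs; under plain python the fallback simply returns the function unchanged.

```python
import builtins

try:
    # Under `kernprof -l`, `profile` is expected to exist in builtins.
    profile = builtins.profile
except AttributeError:
    def profile(func):
        """No-op stand-in so the script also runs under plain `python`."""
        return func


@profile
def python_sum(values):
    """Explicit loop; under line_profiler the `total += x` line should dominate."""
    total = 0.0
    for x in values:
        total += x
    return total
```

With this in place the same bench_sum.py runs unchanged both with and without kernprof.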

When to consult solutions/

After your three scripts and four profiler artifacts exist, and README.md is complete. solutions/03-vectorization-budget-ref.md (at phase open) shows the reference plot and profiler interpretations.


End of Phase 6 labs. Next: write PHASE_06_REPORT.md and learners/borja/phase-06/reflections.md.