English · Español
03 — Vectorization and Profiling¶
🇪🇸 La regla del orden de magnitud: un bucle Python sobre un array NumPy de 10⁶ elementos es ~100× más lento que la expresión vectorizada equivalente. Si la sientes, no la sufres. Y si quieres saber exactamente dónde se va el tiempo o la memoria, hay cuatro herramientas — cProfile, line_profiler, memory_profiler, py-spy — cada una con su caso de uso.
The cost model of Python+NumPy¶
Two competing forces:
- Python is interpreted. Each Python operation involves: bytecode dispatch, attribute lookups, type dispatch, possibly allocations. A single
a + bbetween two Pythonintobjects takes ~50 nanoseconds. Afor x in arr: total += xover a 10⁶-element NumPy array is ~50 ms (10⁶ × 50 ns). - NumPy operations release the GIL and run in C. A vectorized
arr + 1is implemented as a tight C loop over the underlying buffer. It's bounded by memory bandwidth (Phase 1's roofline applies) and SIMD throughput, not Python overhead. The same 10⁶ adds: ~1 ms.
The 50× ratio above is conservative. For more complex per-element ops (np.exp, np.tanh), the ratio can be 100×–500×.
The rule¶
Any time you iterate at the Python level over an
ndarray, you have a performance bug.
Exceptions:
- The array has <100 elements and you're doing it once. Fine.
- You're doing something that genuinely doesn't vectorize (most rarely true; usually you're just not trying hard enough).
- You're calling out to external C code per element — same situation as below.
Vectorization patterns¶
Pattern 1: replace explicit loops with whole-array ops¶
Pattern 2: use broadcasting to avoid loops over pairs¶
# Bad:
dists = np.empty((N, M))
for i in range(N):
for j in range(M):
dists[i, j] = np.sqrt(((a[i] - b[j]) ** 2).sum())
# Good (broadcasting):
diff = a[:, None, :] - b[None, :, :] # shape (N, M, D)
dists = np.sqrt((diff ** 2).sum(-1)) # shape (N, M)
The "good" version allocates (N, M, D) once — fast in NumPy but possibly memory-heavy. For huge N×M, use cdist or batched approaches.
Pattern 3: einsum for non-obvious contractions¶
# Bad:
out = np.empty((B, N, K))
for b in range(B):
out[b] = a[b] @ w # if a is (B, N, M) and w is (M, K)
# Good:
out = np.einsum('bnm,mk->bnk', a, w) # or simply a @ w (NumPy 1.10+ handles batch dims)
einsum is the most powerful tool for "I have N-D tensors and I want to contract over these axes". Pricier than np.matmul for the common cases; equivalent or better for the unusual ones.
Pattern 4: avoid Python if inside hot loops¶
# Bad:
out = np.empty_like(x)
for i in range(len(x)):
out[i] = x[i] if x[i] > 0 else 0
# Good:
out = np.maximum(x, 0) # or np.where(x > 0, x, 0)
Pattern 5: pre-allocate¶
# Bad (re-allocates):
result = np.array([])
for i in range(N):
result = np.append(result, expensive(i)) # quadratic time!
# Good:
result = np.empty(N)
for i in range(N):
result[i] = expensive(i)
# Best (if expensive vectorizes):
result = expensive(np.arange(N))
np.append and np.concatenate re-allocate every call. For incremental construction, pre-allocate and fill, or accumulate in a list and np.array(...) at the end (no quadratic blowup).
When vectorization doesn't help¶
Three situations where the Python loop is fine:
- The op is per-element complex and NumPy doesn't ship a primitive. E.g., per-row sorting with a tie-breaker rule based on another array. You may need
np.apply_along_axis(which is itself a Python loop, just hidden) or accept the cost. - The data is small enough that Python overhead vanishes. A loop over 50 items takes ~5 μs. Doesn't matter.
- You're calling a library function that already vectorizes internally.
scipy.optimize.minimizecalls your objective function many times — it's not your job to vectorize across optimizer iterations.
The four profilers¶
When you do hit a slow path and need to diagnose, four tools cover the ground:
1. cProfile — function-level, statistical¶
Built into stdlib. Counts every function call, total time, time per call, callees. Overhead: ~3–5× — your code runs slower under cProfile, so absolute timings are skewed; relative ranking is reliable.
Use when: you want to know which function dominates. The output is large; pipe to snakeviz (pip install snakeviz; snakeviz output.prof) for a flamegraph.
import cProfile, pstats
with cProfile.Profile() as pr:
expensive_thing()
pstats.Stats(pr).sort_stats('cumtime').print_stats(20)
2. line_profiler — line-by-line, instrumented¶
Reports time per line within decorated functions. Higher overhead (~10×) but you see exactly which line is slow.
Use when: cProfile said "function X is slow"; you need to know which line.
3. memory_profiler — line-by-line memory¶
@profile
def my_function():
a = np.zeros(100_000_000) # ← reports +763 MB here
...
# Run with:
# python -m memory_profiler my_script.py
Reports memory delta per line. Slow (samples RSS per line). Useful for hunting "where did the GB go".
4. py-spy — sampling, no instrumentation¶
py-spy record -o profile.svg --pid 12345
# or to run from start:
py-spy record -o profile.svg -- python my_script.py
Attaches to a running Python process via ptrace (Linux/macOS) and samples the call stack. No code changes required, no overhead inside your code (~1–2% sampling overhead). Produces a flamegraph SVG.
Use when: profiling a long-running training job, or production code you can't modify. The killer feature.
Permission caveat on Fedora: py-spy needs to ptrace your process. Either run with sudo, or set kernel.yama.ptrace_scope = 0 (less secure; revert after).
When to use which (decision tree)¶
Is the code already running and I can't add @profile?
├─ Yes → py-spy
└─ No
├─ Do I know which function is slow?
│ ├─ No → cProfile (then snakeviz)
│ └─ Yes → line_profiler
└─ Is it a memory issue, not a time issue?
└─ memory_profiler (line-level) or `tracemalloc` (snapshot diff)
Logging vs print¶
Past Phase 6, print is a code smell in any committed code (debug prints in experiments are fine, removed before commit). Reasons:
- Unstructured.
print(f"loss={loss}")produces a string that downstream tools (grep, jq, dashboards) can't parse reliably. - Mixed levels.
printdoesn't distinguish debug noise from "this is a critical error". - Performance.
printflushes to stdout, possibly to a TTY with line buffering, possibly causing context switches you don't want during a training step. - Eager formatting.
print(f"x={expensive_repr(x)}")evaluatesexpensive_repr(x)even if you decide to mute prints.
The Phase 6 substitute is structlog:
from src.utils.logging import get_logger
log = get_logger(__name__)
log.info("training_step", step=step, loss=loss, lr=lr)
# Emits a JSON line: {"event": "training_step", "step": 42, "loss": 0.123, "lr": 0.001, ...}
structlog is configured (in Phase 6 lab 00) to:
- Emit JSON for machine consumption (when stdout is a pipe).
- Emit pretty colored output for humans (when stdout is a TTY).
- Always include a timestamp, log level, and configurable context fields (e.g.,
phase).
The reason this matters: by Phase 18 you'll be training models for tens of minutes per run, and log lines will be the only window into the live state. Grepping JSON is sane; grepping ad-hoc print strings is misery.
A worked vectorization budget¶
Lab 03 (vectorization-budget) measures the Python-loop-vs-NumPy ratio for sum across array sizes 2^k for k ∈ [4..24]. Expected shape of the result:
| Size | Python loop | NumPy .sum() |
Ratio |
|---|---|---|---|
| 16 | 1.0 μs | 1.5 μs | 0.7× |
| 256 | 18 μs | 1.6 μs | 11× |
| 4096 | 280 μs | 5 μs | 56× |
| 65536 | 4.5 ms | 70 μs | 64× |
| 1048576 | 75 ms | 1.0 ms | 75× |
| 16777216 | 1.2 s | 16 ms | 75× |
At small sizes, NumPy's per-call overhead (~1.5 μs to enter the C kernel) loses to a tight Python loop. At ~256 elements they cross over. At 1M elements the ratio plateaus at ~75× — NumPy is now bandwidth-bound (Phase 1 roofline!), and Python's overhead per element is what's left after subtracting the bandwidth bound.
Borja's lab will produce this plot on his own machine; the exact numbers will vary, the shape won't.
Pitfalls¶
- Profiling without warmup. First call to a NumPy function loads shared libraries, allocates pools, and is slow. Always run a throwaway iteration before timing.
%timeitin Jupyter. Easy, accurate, but only works for short ops (it auto-decides loop count). For multi-second ops, fall back totime.perf_counter_ns().- Garbage collection during measurement. A 100 MB allocation might trigger a gen-2 GC. For tight measurements,
gc.disable()around the timed block (and re-enable after). - Threading-aware timing.
time.process_time()excludes sleep;time.perf_counter()includes everything. Pick deliberately. - Hardware noise. CPU frequency scaling, thermal throttling, background processes. Phase 1's lab 00 set the governor to
performance; keep that habit for Phase 6 measurements too. np.sumreduction order. For very large fp32 arrays, the order of summation affects the result (catastrophic cancellation, Phase 2). NumPy uses pairwise summation, which is more accurate than naive; still not Kahan. Phase 2 covered this.
One-paragraph recap¶
Python+NumPy has a two-tier cost model: Python loops over arrays are ~100× slower than vectorized expressions at scale, because NumPy's C kernels release the GIL and saturate memory bandwidth while Python's bytecode dispatch dominates per-element cost. Vectorization is the default; explicit loops are a code smell that needs justification. When you do need to diagnose slowness, four profilers cover the ground (cProfile for function-level, line_profiler for line-level, memory_profiler for memory, py-spy for live no-instrumentation). And structured logging (structlog) replaces print from this phase onward — every later phase relies on parseable log lines.
Next: Phase 6 has no further theory pages. Move to lab/00-environment-and-utilities.md.