Skip to content

English · Español

03 — Vectorization and Profiling

🇪🇸 La regla del orden de magnitud: un bucle Python sobre un array NumPy de 10⁶ elementos es ~100× más lento que la expresión vectorizada equivalente. Si la sientes, no la sufres. Y si quieres saber exactamente dónde se va el tiempo o la memoria, hay cuatro herramientas — cProfile, line_profiler, memory_profiler, py-spy — cada una con su caso de uso.


The cost model of Python+NumPy

Two competing forces:

  1. Python is interpreted. Each Python operation involves: bytecode dispatch, attribute lookups, type dispatch, possibly allocations. A single a + b between two Python int objects takes ~50 nanoseconds. A for x in arr: total += x over a 10⁶-element NumPy array is ~50 ms (10⁶ × 50 ns).
  2. NumPy operations release the GIL and run in C. A vectorized arr + 1 is implemented as a tight C loop over the underlying buffer. It's bounded by memory bandwidth (Phase 1's roofline applies) and SIMD throughput, not Python overhead. The same 10⁶ adds: ~1 ms.

The 50× ratio above is conservative. For more complex per-element ops (np.exp, np.tanh), the ratio can be 100×–500×.

The rule

Any time you iterate at the Python level over an ndarray, you have a performance bug.

Exceptions:

  • The array has <100 elements and you're doing it once. Fine.
  • You're doing something that genuinely doesn't vectorize (most rarely true; usually you're just not trying hard enough).
  • You're calling out to external C code per element — same situation as below.

Vectorization patterns

Pattern 1: replace explicit loops with whole-array ops

# Bad:
total = 0.0
for x in arr:
    total += x

# Good:
total = arr.sum()

Pattern 2: use broadcasting to avoid loops over pairs

# Bad:
dists = np.empty((N, M))
for i in range(N):
    for j in range(M):
        dists[i, j] = np.sqrt(((a[i] - b[j]) ** 2).sum())

# Good (broadcasting):
diff = a[:, None, :] - b[None, :, :]    # shape (N, M, D)
dists = np.sqrt((diff ** 2).sum(-1))    # shape (N, M)

The "good" version allocates (N, M, D) once — fast in NumPy but possibly memory-heavy. For huge N×M, use cdist or batched approaches.

Pattern 3: einsum for non-obvious contractions

# Bad:
out = np.empty((B, N, K))
for b in range(B):
    out[b] = a[b] @ w   # if a is (B, N, M) and w is (M, K)

# Good:
out = np.einsum('bnm,mk->bnk', a, w)   # or simply a @ w (NumPy 1.10+ handles batch dims)

einsum is the most powerful tool for "I have N-D tensors and I want to contract over these axes". Pricier than np.matmul for the common cases; equivalent or better for the unusual ones.

Pattern 4: avoid Python if inside hot loops

# Bad:
out = np.empty_like(x)
for i in range(len(x)):
    out[i] = x[i] if x[i] > 0 else 0

# Good:
out = np.maximum(x, 0)   # or np.where(x > 0, x, 0)

Pattern 5: pre-allocate

# Bad (re-allocates):
result = np.array([])
for i in range(N):
    result = np.append(result, expensive(i))  # quadratic time!

# Good:
result = np.empty(N)
for i in range(N):
    result[i] = expensive(i)

# Best (if expensive vectorizes):
result = expensive(np.arange(N))

np.append and np.concatenate re-allocate every call. For incremental construction, pre-allocate and fill, or accumulate in a list and np.array(...) at the end (no quadratic blowup).

When vectorization doesn't help

Three situations where the Python loop is fine:

  1. The op is per-element complex and NumPy doesn't ship a primitive. E.g., per-row sorting with a tie-breaker rule based on another array. You may need np.apply_along_axis (which is itself a Python loop, just hidden) or accept the cost.
  2. The data is small enough that Python overhead vanishes. A loop over 50 items takes ~5 μs. Doesn't matter.
  3. You're calling a library function that already vectorizes internally. scipy.optimize.minimize calls your objective function many times — it's not your job to vectorize across optimizer iterations.

The four profilers

When you do hit a slow path and need to diagnose, four tools cover the ground:

1. cProfile — function-level, statistical

python -m cProfile -s cumtime my_script.py

Built into stdlib. Counts every function call, total time, time per call, callees. Overhead: ~3–5× — your code runs slower under cProfile, so absolute timings are skewed; relative ranking is reliable.

Use when: you want to know which function dominates. The output is large; pipe to snakeviz (pip install snakeviz; snakeviz output.prof) for a flamegraph.

import cProfile, pstats
with cProfile.Profile() as pr:
    expensive_thing()
pstats.Stats(pr).sort_stats('cumtime').print_stats(20)

2. line_profiler — line-by-line, instrumented

@profile
def my_function():
    ...

# Run with:
#   kernprof -l -v my_script.py

Reports time per line within decorated functions. Higher overhead (~10×) but you see exactly which line is slow.

Use when: cProfile said "function X is slow"; you need to know which line.

3. memory_profiler — line-by-line memory

@profile
def my_function():
    a = np.zeros(100_000_000)   # ← reports +763 MB here
    ...

# Run with:
#   python -m memory_profiler my_script.py

Reports memory delta per line. Slow (samples RSS per line). Useful for hunting "where did the GB go".

4. py-spy — sampling, no instrumentation

py-spy record -o profile.svg --pid 12345
# or to run from start:
py-spy record -o profile.svg -- python my_script.py

Attaches to a running Python process via ptrace (Linux/macOS) and samples the call stack. No code changes required, no overhead inside your code (~1–2% sampling overhead). Produces a flamegraph SVG.

Use when: profiling a long-running training job, or production code you can't modify. The killer feature.

Permission caveat on Fedora: py-spy needs to ptrace your process. Either run with sudo, or set kernel.yama.ptrace_scope = 0 (less secure; revert after).

When to use which (decision tree)

Is the code already running and I can't add @profile?
├─ Yes → py-spy
└─ No
   ├─ Do I know which function is slow?
   │  ├─ No  → cProfile (then snakeviz)
   │  └─ Yes → line_profiler
   └─ Is it a memory issue, not a time issue?
      └─ memory_profiler (line-level) or `tracemalloc` (snapshot diff)

Logging vs print

Past Phase 6, print is a code smell in any committed code (debug prints in experiments are fine, removed before commit). Reasons:

  • Unstructured. print(f"loss={loss}") produces a string that downstream tools (grep, jq, dashboards) can't parse reliably.
  • Mixed levels. print doesn't distinguish debug noise from "this is a critical error".
  • Performance. print flushes to stdout, possibly to a TTY with line buffering, possibly causing context switches you don't want during a training step.
  • Eager formatting. print(f"x={expensive_repr(x)}") evaluates expensive_repr(x) even if you decide to mute prints.

The Phase 6 substitute is structlog:

from src.utils.logging import get_logger
log = get_logger(__name__)

log.info("training_step", step=step, loss=loss, lr=lr)
# Emits a JSON line: {"event": "training_step", "step": 42, "loss": 0.123, "lr": 0.001, ...}

structlog is configured (in Phase 6 lab 00) to:

  • Emit JSON for machine consumption (when stdout is a pipe).
  • Emit pretty colored output for humans (when stdout is a TTY).
  • Always include a timestamp, log level, and configurable context fields (e.g., phase).

The reason this matters: by Phase 18 you'll be training models for tens of minutes per run, and log lines will be the only window into the live state. Grepping JSON is sane; grepping ad-hoc print strings is misery.

A worked vectorization budget

Lab 03 (vectorization-budget) measures the Python-loop-vs-NumPy ratio for sum across array sizes 2^k for k ∈ [4..24]. Expected shape of the result:

Size Python loop NumPy .sum() Ratio
16 1.0 μs 1.5 μs 0.7×
256 18 μs 1.6 μs 11×
4096 280 μs 5 μs 56×
65536 4.5 ms 70 μs 64×
1048576 75 ms 1.0 ms 75×
16777216 1.2 s 16 ms 75×

At small sizes, NumPy's per-call overhead (~1.5 μs to enter the C kernel) loses to a tight Python loop. At ~256 elements they cross over. At 1M elements the ratio plateaus at ~75× — NumPy is now bandwidth-bound (Phase 1 roofline!), and Python's overhead per element is what's left after subtracting the bandwidth bound.

Borja's lab will produce this plot on his own machine; the exact numbers will vary, the shape won't.

Pitfalls

  1. Profiling without warmup. First call to a NumPy function loads shared libraries, allocates pools, and is slow. Always run a throwaway iteration before timing.
  2. %timeit in Jupyter. Easy, accurate, but only works for short ops (it auto-decides loop count). For multi-second ops, fall back to time.perf_counter_ns().
  3. Garbage collection during measurement. A 100 MB allocation might trigger a gen-2 GC. For tight measurements, gc.disable() around the timed block (and re-enable after).
  4. Threading-aware timing. time.process_time() excludes sleep; time.perf_counter() includes everything. Pick deliberately.
  5. Hardware noise. CPU frequency scaling, thermal throttling, background processes. Phase 1's lab 00 set the governor to performance; keep that habit for Phase 6 measurements too.
  6. np.sum reduction order. For very large fp32 arrays, the order of summation affects the result (catastrophic cancellation, Phase 2). NumPy uses pairwise summation, which is more accurate than naive; still not Kahan. Phase 2 covered this.

One-paragraph recap

Python+NumPy has a two-tier cost model: Python loops over arrays are ~100× slower than vectorized expressions at scale, because NumPy's C kernels release the GIL and saturate memory bandwidth while Python's bytecode dispatch dominates per-element cost. Vectorization is the default; explicit loops are a code smell that needs justification. When you do need to diagnose slowness, four profilers cover the ground (cProfile for function-level, line_profiler for line-level, memory_profiler for memory, py-spy for live no-instrumentation). And structured logging (structlog) replaces print from this phase onward — every later phase relies on parseable log lines.


Next: Phase 6 has no further theory pages. Move to lab/00-environment-and-utilities.md.