02 — Strides, Views, and Broadcasting

The internal structure of NumPy's ndarray: a pointer to a flat buffer, a shape, a vector of strides, a dtype, and a set of flags. That structure explains why arr.T is O(1) and np.ascontiguousarray(arr.T) is O(n). Broadcasting is a shape-alignment algorithm that you also need to master.


The ndarray, internally

A NumPy ndarray is five things:

ndarray = (
    data      : pointer to a flat buffer in memory,
    shape     : tuple of ints (d_0, d_1, ..., d_{k-1}),
    strides   : tuple of ints in BYTES (s_0, s_1, ..., s_{k-1}),
    dtype     : element type (fp32, int64, ...),
    flags     : metadata (OWNDATA, C_CONTIGUOUS, F_CONTIGUOUS, WRITEABLE, ...),
)

Element (i_0, i_1, ..., i_{k-1}) lives at byte offset:

\[ \text{offset}(i_0, \ldots, i_{k-1}) = \sum_{j=0}^{k-1} i_j \cdot s_j \]

That's it. That single formula is the entire memory model. Everything below — views, transpose, broadcasting, fancy indexing — is a corollary.

Worked example

import numpy as np
a = np.arange(12, dtype=np.int32).reshape(3, 4)
  • a.shape = (3, 4) — three rows, four columns.
  • a.dtype.itemsize = 4 bytes (int32).
  • a.strides = (16, 4) — moving one row costs 16 bytes (4 elements × 4 bytes), moving one column costs 4 bytes.
  • a.flags.C_CONTIGUOUS = True — row-major, the default.
  • a.flags.OWNDATA = False (a does not own its buffer: reshape returned a view of the 1-D array that np.arange allocated, which is available as a.base).

Element a[2, 1] is at offset 2*16 + 1*4 = 36 bytes into the buffer.
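The offset formula can be checked against the actual bytes. A minimal sketch: byte_offset is a hypothetical helper written for this page (not a NumPy function), and a.tobytes() is used only to obtain a flat copy of the buffer to index into.

```python
import numpy as np

def byte_offset(index, strides):
    """Sum of index[j] * strides[j]: the byte offset of one element."""
    return sum(i * s for i, s in zip(index, strides))

a = np.arange(12, dtype=np.int32).reshape(3, 4)
raw = a.tobytes()                      # flat copy of the 48-byte buffer (C order)

off = byte_offset((2, 1), a.strides)   # 2*16 + 1*4
# Read one int32 starting at that byte offset.
value = np.frombuffer(raw, dtype=np.int32, count=1, offset=off)[0]

print(a.strides, off, value, a[2, 1])  # (16, 4) 36 9 9
```

The value read from the raw bytes matches a[2, 1], which is all that indexing does under the hood.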

Transpose is free

b = a.T

What just happened? b shares the same buffer as a (np.shares_memory(a, b) is True; b.base is the 1-D array that owns the data, the same object as a.base), but with swapped shape and strides:

  • b.shape = (4, 3).
  • b.strides = (4, 16).
  • b.flags.C_CONTIGUOUS = False (rows are no longer contiguous in memory).
  • b.flags.F_CONTIGUOUS = True (columns are now contiguous — Fortran order).
  • b.flags.OWNDATA = False (b is a view; it doesn't own the buffer).

Element b[1, 2] is at offset 1*4 + 2*16 = 36 bytes — same byte as a[2, 1]. Transpose is a relabeling of the axes; the bytes never moved.

Cost: O(1). It's just a struct update.
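A quick way to convince yourself that no bytes moved is to write through the transposed view and read the change back from the original. A small sketch using the same array as above:

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)
b = a.T                               # O(1): new shape and strides, same buffer

print(b.shape, b.strides)             # (4, 3) (4, 16)
print(np.shares_memory(a, b))         # True

b[1, 2] = -1                          # write through the view...
print(a[2, 1])                        # -1  ...and the original sees it
```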

When transpose stops being free

The moment you ask for a contiguous version:

c = np.ascontiguousarray(b)   # or b.copy() or np.asarray(b, order='C')

Now NumPy walks b in its non-contiguous stride order, copying each element to a fresh contiguous buffer. Cost: O(n_elements) in time, O(n_elements * itemsize) in memory.
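The difference between the view and the contiguous copy shows up in the strides and in whether memory is shared. A short sketch, again on the 3×4 int32 array:

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)
b = a.T                               # view, strides (4, 16)
c = np.ascontiguousarray(b)           # fresh C-contiguous buffer

print(c.strides)                      # (12, 4)
print(c.flags.c_contiguous)           # True
print(np.shares_memory(b, c))         # False: the bytes were copied
print(c.nbytes)                       # 48 = 12 elements * 4 bytes

# Already C-contiguous input: no new buffer is allocated.
d = np.ascontiguousarray(a)
print(np.shares_memory(a, d))         # True
```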

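Why this matters for AI code: the BLAS / LAPACK routines behind np.linalg, np.matmul, and np.dot only accept certain memory layouts. When an operand's strides don't fit, NumPy falls back to a temporary contiguous copy or to a slower non-BLAS loop. A plain transpose of a contiguous matrix can often be handed to BLAS with a transpose flag, while a strided slice such as a[::2, ::2] generally cannot, so the exact cost depends on the layout and on your NumPy/BLAS build. When a copy does happen, its cost is hidden but real:

a = np.random.randn(1024, 1024).astype(np.float32)
b = np.random.randn(1024, 1024).astype(np.float32)

# Case 1: both C-contiguous. The matmul kernel runs directly.
np.matmul(a, b)

# Case 2: `a.T` is not C-contiguous. NumPy must pass a transpose flag or copy first.
np.matmul(a.T, b)   # can be measurably slower

Lab 01 makes Borja measure this directly. The numbers quoted for the i5-8250U with 1024×1024 float32 inputs are roughly ~30 ms for the contiguous matmul and ~50 ms for the non-contiguous matmul with copy, which puts the copy itself at ~20 ms. Treat them as indicative and rely on your own measurement.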
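A rough timing harness for this comparison might look like the sketch below. It is not the Lab 01 code: best_ms is a hypothetical helper, n = 512 is chosen only to keep the run short, and the results depend heavily on the machine, the BLAS build, and the NumPy version, so no expected timings are given.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n = 512                                            # assumption: small size for a quick run
a = rng.standard_normal((n, n), dtype=np.float32)
b = rng.standard_normal((n, n), dtype=np.float32)
a_t_copy = np.ascontiguousarray(a.T)               # explicit C-contiguous copy of a.T

def best_ms(fn, repeat=5):
    """Best wall-clock time of fn() over a few runs, in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3

timings = {
    "matmul(a, b)": best_ms(lambda: np.matmul(a, b)),
    "matmul(a.T, b)": best_ms(lambda: np.matmul(a.T, b)),
    "ascontiguousarray(a.T)": best_ms(lambda: np.ascontiguousarray(a.T)),
    "matmul(a_t_copy, b)": best_ms(lambda: np.matmul(a_t_copy, b)),
}
for label, ms in timings.items():
    print(f"{label:<24} {ms:8.3f} ms")
```

Comparing the second and fourth rows shows whether the transposed view costs anything extra on your setup; the third row is the price of making the copy yourself.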

Views vs copies: the full table

| Operation | View or copy? |
|---|---|
| a[1:3], a[::2], a[1:5:2] (basic slicing) | View |
| a[1], a[1, 2] (integer indexing) | View of subarray; scalar if all axes indexed |
| a[[1, 3, 5]] (fancy / array indexing) | Copy |
| a[a > 0] (boolean indexing) | Copy |
| a.T, a.transpose(...), np.swapaxes(a, ...) | View |
| a.reshape(...) | View if possible (compatible strides), else copy |
| a.flatten() | Copy (always) |
| a.ravel() | View if contiguous, copy otherwise |
| np.ascontiguousarray(a) | Copy if not already C-contiguous, no-op if it is |
| a.copy() | Copy (always) |
| a.astype(dtype) | Copy by default, even if the dtype matches; with copy=False, returns the input itself when no cast is needed |
| np.asarray(a, dtype=X) | No-op if a is already an ndarray of dtype X, else copy |

How to check at runtime:

  • arr.flags.OWNDATA — if False, arr is a view into someone else's buffer.
  • arr.base — the original object the view references, or None.
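  • np.shares_memory(a, b) gives the definitive answer, but the exact overlap check can be slow for awkward stride patterns; np.may_share_memory(a, b) is the cheap variant that only compares memory bounds and can report false positives.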
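The rows of the table can be spot-checked with np.shares_memory. A minimal sketch; the .copy() at the top is there only so that a owns its buffer and the base check is easy to read:

```python
import numpy as np

a = np.arange(12).reshape(3, 4).copy()   # .copy() so that a owns its buffer

candidates = {
    "a[1:3]": a[1:3],            # basic slicing
    "a.T": a.T,                  # transpose
    "a.ravel()": a.ravel(),      # contiguous input
    "a.T.ravel()": a.T.ravel(),  # non-contiguous input
    "a.flatten()": a.flatten(),
    "a[[0, 2]]": a[[0, 2]],      # fancy indexing
    "a[a > 5]": a[a > 5],        # boolean indexing
}
shares = {name: np.shares_memory(a, arr) for name, arr in candidates.items()}
for name, flag in shares.items():
    print(f"{name:<12} {'view' if flag else 'copy'}")

print(a.flags.owndata, a[1:3].flags.owndata, a[1:3].base is a)   # True False True
```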

Stride tricks (power and danger)

np.lib.stride_tricks.as_strided lets you construct an ndarray with arbitrary shape and strides over an existing buffer. This is how rolling windows are implemented without copying:

from numpy.lib.stride_tricks import as_strided

a = np.arange(10, dtype=np.int32)
# Rolling windows of length 3:
window = as_strided(a, shape=(8, 3), strides=(4, 4))
# window[i, j] = a[i + j]

window is an (8, 3) array but shares the same buffer as a. No allocation.

The danger: as_strided does no bounds-checking. If you specify a shape × strides extent that runs past the buffer, you read memory outside the buffer (garbage values, or a crash). If you write into a strided view that aliases itself, you scribble over several logical elements at once. Treat the result as read-only.

The safer route: NumPy 1.20 and later provide np.lib.stride_tricks.sliding_window_view, which computes the right strides automatically and returns a read-only view by default. Prefer it over hand-written as_strided calls.
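The two approaches can be compared side by side. In this sketch the manual version derives its strides from a.itemsize rather than hard-coding 4, and passes writeable=False so that accidental writes fail loudly:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

a = np.arange(10, dtype=np.int32)

# Safe wrapper: shape and strides are computed for you.
safe = sliding_window_view(a, window_shape=3)

# Manual equivalent: 10 - 3 + 1 = 8 windows, each step is one element.
manual = as_strided(a, shape=(8, 3),
                    strides=(a.itemsize, a.itemsize), writeable=False)

print(safe.shape, safe.strides)        # (8, 3) (4, 4)
print(np.array_equal(safe, manual))    # True
print(np.shares_memory(a, safe))       # True: still no allocation
print(safe.flags.writeable)            # False
```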

Broadcasting, formalized

Broadcasting is NumPy's algorithm for operating on arrays of different shapes. The rule is short, the consequences are subtle.

The three rules

Given two shapes S_a = (a_0, a_1, ..., a_m) and S_b = (b_0, b_1, ..., b_n):

  1. Align right. Pad the shorter shape with 1s on the left until both have the same number of axes. E.g., (3,) vs (2, 4, 3) → (1, 1, 3) vs (2, 4, 3).
  2. Per-axis compatibility. For each axis, the two dim sizes must be either equal or one of them must be 1. If neither, raise ValueError.
  3. Result shape. For each axis, take the max of the two dims.
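The three rules fit in a few lines of Python. The function below is a hypothetical re-implementation for illustration only; in real code np.broadcast_shapes (NumPy 1.20+) does the same job.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Illustrative re-implementation of the three broadcasting rules."""
    # Rule 1: align right by left-padding the shorter shape with 1s.
    ndim = max(len(shape_a), len(shape_b))
    pa = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    pb = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    out = []
    for da, db in zip(pa, pb):
        # Rule 2: sizes must be equal, or one of them must be 1.
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
        # Rule 3: take the size that is not 1 (the max, for non-empty axes).
        out.append(da if db == 1 else db)
    return tuple(out)

print(broadcast_shape((3,), (2, 4, 3)))        # (2, 4, 3)
print(broadcast_shape((5,), (5, 1)))           # (5, 5)
print(broadcast_shape((7, 1, 6), (1, 9, 6)))   # (7, 9, 6)
print(np.broadcast_shapes((3,), (2, 4, 3)))    # (2, 4, 3)
```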

Worked examples

(3,)         (2, 4, 3)    →    (2, 4, 3)
(N,)         (N, 1)       →    (N, N)   ← THE classic bug
(B, 1, M)    (1, N, M)    →    (B, N, M)
(3, 4)       (3,)         →    ValueError  ← aligned right: (3, 4) vs (1, 3); axis 1 is 4 vs 3, not compatible

The last example is worth pausing on. np.array([1,2,3]) has shape (3,). Aligned right it becomes (1, 3), so it only lines up with an array whose last axis has length 3, such as (4, 3). To apply one value per row of a (3, 4) array, give it shape (3, 1): np.array([1,2,3])[:, None].

The (N,) * (N,1) bug

y_pred = np.zeros((100,))      # shape (100,)
y_true = np.array([...]).reshape(100, 1)  # shape (100, 1)
err = y_pred - y_true          # shape (100, 100) !!

Broadcast align right: (100,) → (1, 100); (100, 1) stays. Both are now 2-D: (1, 100) vs (100, 1). Per axis: dim 0 is 1 vs 100 → broadcast to 100; dim 1 is 100 vs 1 → broadcast to 100. Result: (100, 100).

100×100 = 10,000 entries. Of these, only 100 are the diagonal that "would have been right". The other 9,900 are cross terms. err.mean() averages all 10,000 and returns the wrong number.

This bug is silent (no error), wrong (returns a number that looks plausible), and ubiquitous. The fix is always to reshape predictions and targets to the same shape before subtracting; .reshape(-1, 1) and [:, None] are the standard idioms.
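The bug and its fix can be reproduced on synthetic data. In this sketch the targets are random and every prediction is off by exactly 0.1, so the correct mean squared error is 0.01; checked_error is a hypothetical guard, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(100, 1))      # targets, shape (100, 1)
y_pred = y_true[:, 0] + 0.1             # predictions, shape (100,), each off by 0.1

bad = y_pred - y_true                   # broadcasts to (100, 100)
good = y_pred.reshape(-1, 1) - y_true   # shapes match: (100, 1)

print(bad.shape, good.shape)            # (100, 100) (100, 1)
print(np.mean(good**2))                 # close to 0.01, the true MSE
print(np.mean(bad**2))                  # much larger: polluted by cross terms

def checked_error(pred, target):
    """Hypothetical guard: refuse to subtract arrays whose shapes differ."""
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    return pred - target
```

A shape assertion like the one in checked_error turns the silent failure into a loud one at the cost of a single comparison.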

Why broadcasting exists

It's not a Python convenience; it's a memory optimization. Broadcasting never materializes the expanded array. The expression a + b where a.shape = (N, 1) and b.shape = (1, N) acts as if both were expanded to (N, N) but only allocates the (N, N) output. The "expansion" is done by stride-zero tricks under the hood.

np.broadcast_to(a, (N, N)) returns a read-only view with stride 0 on the broadcast axis: no memory is allocated, but the array reads as if it were (N, N).
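The stride-zero trick is visible directly in the strides of the broadcast view. A small sketch with N = 4 and float64 elements (8 bytes each):

```python
import numpy as np

N = 4
a = np.arange(N, dtype=np.float64).reshape(N, 1)   # shape (4, 1)
wide = np.broadcast_to(a, (N, N))                   # reads like (4, 4)

print(a.strides)                      # (8, 8)
print(wide.strides)                   # (8, 0): stepping along axis 1 moves 0 bytes
print(np.shares_memory(a, wide))      # True: same 32-byte buffer
print(wide.flags.writeable)           # False

# Materializing the expansion is what actually allocates N*N elements.
full = np.ascontiguousarray(wide)
print(full.strides)                   # (32, 8)
```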

Dtype promotion

When you do a + b and the dtypes differ, NumPy promotes to a common dtype:

int32 + int64      → int64
int32 + float32    → float64    (yes, float64. integer-to-float promotes generously)
float32 + float64  → float64
int8 + bool        → int8

NumPy 2.0 changed scalar promotion rules (NEP 50): np.float32(1) + 1.0 is now float32, not float64. The motivation: predictability. Read NEP 50 once during this phase.
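The promotion table can be queried rather than memorized: np.result_type reports the common dtype without doing any arithmetic. The last line of this sketch prints a result that depends on the installed NumPy version (float64 under the legacy rules, float32 under NEP 50), so no expected output is given for it.

```python
import numpy as np

pairs = [
    (np.int32, np.int64),
    (np.int32, np.float32),
    (np.float32, np.float64),
    (np.int8, np.bool_),
]
for left, right in pairs:
    print(left.__name__, "+", right.__name__, "->", np.result_type(left, right))

x32 = np.ones(3, dtype=np.float32)
print((x32 + 1.0).dtype)           # float32: a Python float does not upcast the array
print((x32 + np.ones(3)).dtype)    # float64: np.ones defaults to float64

# Scalar + Python float: the case NEP 50 changed. Output is version dependent.
print(np.__version__, type(np.float32(1) + 1.0).__name__)
```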

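The bug pattern this causes: training in fp32 on purpose, but a stray float64 operand silently upgrades the result to fp64 and your memory doubles. Typical culprits are an array created with NumPy's default dtype (np.zeros(n) is float64) or, under NEP 50, an np.float64 scalar; a plain Python float does not upcast a float32 array. The lab does not specifically reproduce this, but the pitfalls list in PHASE_06_PLAN.md §5 calls it out.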

Defensive coding:

x = x.astype(np.float32, copy=False)   # cast or no-op
y = np.float32(0.5)                    # explicit
result = x + y                          # guaranteed fp32

copy=False is important: astype defaults to copy=True, which always copies, even when the dtype already matches. With copy=False, astype returns the input array itself when no cast is needed.
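The difference between the two settings is easy to observe. A minimal sketch:

```python
import numpy as np

x = np.ones(4, dtype=np.float32)

same = x.astype(np.float32, copy=False)   # dtype already matches: no new array
copied = x.astype(np.float32)             # default copy=True: always a new buffer
cast = np.ones(4).astype(np.float32, copy=False)   # float64 input: a cast is unavoidable

print(same is x)                          # True
print(np.shares_memory(x, copied))        # False
print(cast.dtype)                         # float32

y = np.float32(0.5)
print((same + y).dtype)                   # float32
```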

Strides + broadcasting + dtype, combined

Putting it all together: an ndarray expression's cost depends on shape, strides, dtype, and contiguity in addition to the operation. Two expressions that look the same can have wildly different costs:

# (N, N) fp64 contiguous + (N,) fp64 → (N, N) fp64. Broadcast, no copy. O(N²) work, O(N²) write.
a + b

# (N, N) fp32 transposed view + (N,) fp32 → (N, N) fp32.
# Same O(N²) work and still no copy of a: the ufunc walks a.T by its strides. The memory access
# pattern, and possibly the output's layout, differ from the contiguous case.
a.T + b

The fix: profile (Phase 6 lab 03). The cure: predict.

One-paragraph recap

A NumPy ndarray is (data, shape, strides, dtype, flags). Element offset is Σ i_j · s_j — that single formula explains why transpose is O(1) (just swap shape and strides), why some operations view and others copy (depends on whether a stride-only relabeling can express the result), and why broadcasting is fast (stride-zero magic, no allocation of the expanded shape). The broadcasting rule is short — align right, dims must match or be 1, result is the pairwise max — but its silent failure mode ((N,) * (N,1) → (N,N)) is the #1 AI bug. Dtype promotion has its own NEP-50 surprises. Master this page and a vast class of future bugs simply will not happen to you.


Next: 03-vectorization-and-profiling.md