04 — PyTorch as Substrate (First Encounter)

After 23 phases in pure NumPy, this page introduces PyTorch. Not as a magic box but as plumbing: a Tensor is a Storage plus view metadata, an nn.Module is a parameter registry, and a forward pass is a sequence of kernel dispatches. Phase 25 takes the system apart; Phase 24 presents it as a tool for porting the grammar MiniGPT.

This is the first PyTorch theory page in the curriculum. The framework is introduced late and thin — by Phase 24 Borja already knows what every operator does (Phases 7–17), how it runs on a GPU (Phase 23), and how memory flows through it (Phase 22). PyTorch is then presented as the routing layer that wraps the substrate.

The aim of this page is to keep PyTorch in its place: it is a tool, not a worldview. Phase 25 cracks open the dispatcher and autograd internals.


The two-line mental model

nn.Module  =  parameter registry + forward-method-as-graph
Tensor     =  Storage (bytes) + view metadata (shape, stride, dtype, device, requires_grad)

Everything else in PyTorch decorates these two ideas. If you keep this two-line summary in mind, the rest is obvious.

Tensor as Storage + view

A PyTorch tensor is not a contiguous block of memory by itself. It's a view into a Storage (which is a contiguous block of memory). Multiple tensors can share one storage — that's how view, transpose, narrow, permute work without copying.

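import torch

x = torch.arange(12).reshape(3, 4)
y = x.t()             # transposed view
# x.storage() builds a fresh wrapper object on every call, so compare addresses, not identity
x.untyped_storage().data_ptr() == y.untyped_storage().data_ptr()    # True: they share the underlying bytes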

The view metadata (stride, offset) describes how to step through the storage. A transposed view has swapped strides; the bytes are identical.
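A minimal sketch of that metadata in action, runnable on CPU. It only inspects strides, offsets, and start addresses; the variable names are illustrative, not taken from the lab code.

```python
import torch

x = torch.arange(12).reshape(3, 4)   # contiguous, row-major
y = x.t()                            # transposed view, no copy

# PyTorch counts strides in elements, not bytes.
x_strides = x.stride()               # (4, 1)
y_strides = y.stride()               # (1, 4): the two strides are swapped

# Both views start at the same address because they share one storage.
same_start = x.data_ptr() == y.data_ptr()

# A write through one view is visible through the other.
y[0, 1] = 99                         # y[0, 1] aliases x[1, 0]
aliased = x[1, 0].item() == 99

# narrow() keeps the strides and only shifts the storage offset.
z = x.narrow(0, 1, 2)                # rows 1 and 2
z_offset = z.storage_offset()        # 4 elements into the storage
```

Nothing here allocates a second buffer: every view is a new (shape, stride, offset) triple over the same bytes.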

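This is essentially the NumPy ndarray model; the one bookkeeping difference is that NumPy counts strides in bytes and PyTorch counts them in elements. PyTorch's contribution is adding device, requires_grad, and a few decorations, not a different memory model.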
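To see how close the two models are, the following sketch wraps a NumPy buffer in a tensor without copying. It assumes a CPU-only setup, and the names are illustrative.

```python
import numpy as np
import torch

a = np.arange(12, dtype=np.float32).reshape(3, 4)
t = torch.from_numpy(a)              # zero-copy: t wraps a's buffer

# NumPy reports strides in bytes, PyTorch in elements.
np_strides = a.strides               # (16, 4) for float32
pt_strides = t.stride()              # (4, 1)

# Same buffer: a write through NumPy shows up in the tensor.
a[0, 0] = -1.0
shared = t[0, 0].item() == -1.0

# The metadata PyTorch adds on top of the ndarray model.
extras = (t.device.type, t.requires_grad)
```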

nn.Module as parameter registry

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 256)
        self.fc2 = nn.Linear(256, 600)   # 600 = grammar MiniGPT vocab size
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

What nn.Module actually adds over a plain Python class:

  • .parameters() recursively walks attributes that are Parameter (a Tensor subclass with requires_grad=True by default).
  • .state_dict() returns {name: Tensor} for serialization.
  • .to(device) recursively moves parameters to a device.
  • .train() / .eval() flips a flag that some layers (Dropout, BatchNorm) read.

That's it. There's no hidden magic; it's a slightly-smart container for parameters.
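The registry can be inspected directly. This sketch repeats the TinyMLP definition so it runs on its own, then lists what the container tracks.

```python
import torch
from torch import nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 256)
        self.fc2 = nn.Linear(256, 600)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP()

# Names are built from attribute paths; nn.Linear stores weight as (out, in).
names = [name for name, _ in model.named_parameters()]
fc1_shape = tuple(model.fc1.weight.shape)

# 64*256 + 256 + 256*600 + 600 parameters in total.
n_params = sum(p.numel() for p in model.parameters())

# state_dict uses the same names as keys.
state_keys = list(model.state_dict().keys())

# train()/eval() only flip a boolean that layers may read.
model.eval()
flag_after_eval = model.training
model.train()
flag_after_train = model.training
```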

What model.cuda() actually does

model = TinyMLP()
model.cuda()

The .cuda() call:

  1. Walks all parameters and buffers recursively.
  2. For each, allocates an equivalent tensor on the GPU (via PyTorch's caching allocator, which calls cudaMalloc only when no cached block fits).
  3. Copies the host bytes over via cudaMemcpy.
  4. Replaces the parameter in-place (the nn.Module now holds a GPU tensor).

The model's structure and code don't change — only the tensor's device flag and underlying memory. The next forward() call dispatches each op to the CUDA backend instead of the CPU backend. No new model, no copy of the code.
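The steps above can be approximated in a few lines. This is a sketch of the visible effect only, not PyTorch's actual implementation: move_module is a hypothetical helper, and the code falls back to CPU when no GPU is present so it stays runnable.

```python
import torch
from torch import nn

def move_module(module: nn.Module, device: torch.device) -> nn.Module:
    """Hypothetical helper mimicking the visible effect of module.to(device)."""
    for p in module.parameters():      # recursive by default
        p.data = p.data.to(device)     # new bytes on `device`, same Parameter object
    for b in module.buffers():
        b.data = b.data.to(device)
    return module

# Assumption: fall back to CPU so the sketch runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 600))
param_ids_before = [id(p) for p in model.parameters()]
move_module(model, device)
param_ids_after = [id(p) for p in model.parameters()]
devices = {p.device.type for p in model.parameters()}
```

The Parameter objects keep their identity; only the memory they point at changes, which is why optimizers and references created afterwards keep working.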

The component that routes each op to a backend is the dispatcher, the subject of Phase 25.

Forward pass as dispatch sequence

For one batch through TinyMLP:

y = model(x)
# Equivalent to:
y = model.fc2(torch.relu(model.fc1(x)))
# Each call dispatches based on (op_name, device, dtype):
#   x @ fc1.weight.T + fc1.bias   → cuBLAS GEMM kernel (CUDA, fp32)
#   torch.relu                    → eltwise CUDA kernel
#   h @ fc2.weight.T + fc2.bias   → cuBLAS GEMM kernel

Each line is at least one kernel launch (or, with torch.compile, part of a fused kernel covering several ops). The dispatcher (Phase 25) picks the kernel based on (op, device, dtype, layout).
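One way to convince yourself that the module call is nothing more than this op sequence is to spell it out by hand. The sketch below runs on CPU and uses torch.addmm (bias + mat1 @ mat2) as the explicit GEMM; the layer sizes match TinyMLP.

```python
import torch
from torch import nn

torch.manual_seed(0)
fc1, fc2 = nn.Linear(64, 256), nn.Linear(256, 600)
x = torch.randn(8, 64)

with torch.no_grad():
    # The module path.
    y_module = fc2(torch.relu(fc1(x)))

    # The same computation as three explicit ops.
    h = torch.addmm(fc1.bias, x, fc1.weight.t())         # bias + x @ W1^T
    h = torch.relu(h)
    y_manual = torch.addmm(fc2.bias, h, fc2.weight.t())  # bias + h @ W2^T

same = torch.allclose(y_module, y_manual, atol=1e-5, rtol=1e-5)
```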

For Phase 24's port of Phase-17 MiniGPT: every np.matmul becomes torch.matmul, and every hand-rolled softmax (built from np.exp and np.sum, since NumPy has no softmax function) becomes torch.softmax. The model code structure is unchanged. The dispatcher does the rest.
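As an illustration of the substitution, here is a hand-rolled NumPy softmax next to torch.softmax on the same input. The np_softmax function is an assumed stand-in for whatever Phase 17 actually wrote, not a copy of it.

```python
import numpy as np
import torch

def np_softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax from np.exp and np.sum (assumed Phase-17 style)."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 32, 600)).astype(np.float32)

p_np = np_softmax(logits)
p_pt = torch.softmax(torch.from_numpy(logits), dim=-1).numpy()

max_abs_diff = float(np.abs(p_np - p_pt).max())
```

Note the keyword change: NumPy's axis becomes PyTorch's dim.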

Where PyTorch differs from NumPy

Three things matter for Phase 24:

  1. Devices. Tensors live on a device: cpu, cuda:0, cuda:1. Operations between tensors on different devices raise an error. NumPy has no such concept.
  2. Autograd. Tensors with requires_grad=True track operations into a backward graph. Calling .backward() walks the graph and accumulates gradients into .grad. Phase 25 makes this explicit; Phase 24 stays in inference (.eval() plus torch.no_grad()), so no graph is built. It is torch.no_grad() that stops graph construction; .eval() only flips the layer-mode flag.
  3. dtype promotion rules. PyTorch silently promotes fp16 + fp32 → fp32 (the default floating dtype is configurable via torch.set_default_dtype). NumPy has similar rules but the failure modes differ: for example, an int32 array plus a Python float gives float64 in NumPy and float32 in PyTorch.
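Two of these differences can be checked on CPU without a GPU. The sketch uses torch.result_type to query promotion without running a kernel; the device-mismatch error is left out because it needs a CUDA device to reproduce.

```python
import numpy as np
import torch

a16 = torch.ones(3, dtype=torch.float16)
b32 = torch.ones(3, dtype=torch.float32)
i32 = torch.ones(3, dtype=torch.int32)

# Tensor-tensor promotion widens to the larger float type.
promoted = torch.result_type(a16, b32)

# A Python scalar does not widen a floating tensor.
scalar_kept = torch.result_type(a16, 2.0)

# Integer tensor plus Python float: PyTorch picks its default float dtype.
int_plus_float = torch.result_type(i32, 1.0)

# NumPy's answer for the same expression is float64.
np_dtype = (np.ones(3, dtype=np.int32) + 1.0).dtype

# Autograd tracking is switched off inside torch.no_grad().
w = torch.ones(3, requires_grad=True)
tracked = (w * 2).requires_grad
with torch.no_grad():
    untracked = (w * 2).requires_grad
```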

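For the lab port, Borja runs inference with no gradients in three steps. First fp32 on CPU, which should match NumPy to within fp32 rounding and well inside the 1e-5 tolerance used below; bit-exact agreement is possible but not guaranteed, since the two libraries may accumulate sums in different orders. Then fp32 on CUDA (matches NumPy to ~1e-5 due to non-associativity of fp arithmetic). Then fp16 on CUDA (matches NumPy to ~1e-2; tests use looser tolerance).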
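The non-associativity behind those tolerances is easy to reproduce in plain NumPy. This is a deliberately extreme three-term example, not a measurement of the MiniGPT error.

```python
import numpy as np

big, small = np.float32(1e8), np.float32(1.0)

# Same three numbers, two summation orders.
left = (big + -big) + small      # cancellation first, then add 1
right = big + (-big + small)     # 1 is lost: fp32 spacing near 1e8 is 8

# Machine epsilon sets the scale of per-operation rounding error.
eps32 = np.finfo(np.float32).eps     # 2**-23, about 1.2e-7
eps16 = np.finfo(np.float16).eps     # 2**-10, about 9.8e-4
```

A GPU reduction sums in a different order than a CPU loop, so small disagreements of this kind are expected, and fp16's much larger epsilon explains the looser tolerance there.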

The byte-equivalence sanity check

After porting MiniGPT to PyTorch (src/minimodel/torch_minigpt.py), the validation is:

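import numpy as np
import torch

np_model = load_phase17_numpy_minigpt()
pt_model = load_torch_minigpt_from_same_weights().eval()
x = np.random.randn(2, 32, 64).astype(np.float32)
y_np = np_model(x)
with torch.no_grad():   # without this, .numpy() raises on an output that requires grad
    y_pt = pt_model(torch.from_numpy(x)).numpy()
assert np.allclose(y_np, y_pt, atol=1e-5, rtol=1e-5)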

If this passes at fp32 on CPU, the port is faithful. If it fails, the port has a layer-ordering or weight-mapping bug. Debug the port, not the framework.
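The same check can be rehearsed on a single layer before touching the full model. The sketch below is self-contained: the weights, the 0.05 scale, and the np_linear function are made up for illustration and stand in for the real Phase-17 loaders.

```python
import numpy as np
import torch
from torch import nn

rng = np.random.default_rng(0)
# Assumed small-scale weights in nn.Linear layout: (out_features, in_features).
W = (0.05 * rng.standard_normal((600, 64))).astype(np.float32)
b = (0.05 * rng.standard_normal(600)).astype(np.float32)

def np_linear(x: np.ndarray) -> np.ndarray:
    """NumPy reference: y = x @ W^T + b."""
    return x @ W.T + b

# PyTorch port loaded from the same weights.
pt_linear = nn.Linear(64, 600)
with torch.no_grad():
    pt_linear.weight.copy_(torch.from_numpy(W))
    pt_linear.bias.copy_(torch.from_numpy(b))

x = rng.standard_normal((2, 32, 64)).astype(np.float32)
y_np = np_linear(x)
with torch.no_grad():
    y_pt = pt_linear(torch.from_numpy(x)).numpy()

ok = np.allclose(y_np, y_pt, atol=1e-5, rtol=1e-5)
max_abs_diff = float(np.abs(y_np - y_pt).max())
```

Checking layer by layer like this narrows a failing end-to-end comparison down to the first module whose output diverges.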

What PyTorch is not (yet, in Phase 24)

PyTorch does not — in Phase 24's usage — do any of the following:

  • Train (only inference). Phase 25 explores autograd; Phase 24 doesn't.
  • Compile (no torch.compile). Eager mode.
  • Distribute (no DDP/FSDP). Single-device.
  • Quantize (fp32 / fp16 only). Phase 26.
  • Mix with custom autograd. Phase 25 introduces torch.autograd.Function.

Phase 24's PyTorch usage is the smallest viable: load weights, define modules, run forward, slot in the custom kernel (lab 03). Nothing more.

Where PyTorch hooks into the custom kernel

In lab 03, Borja replaces torch.softmax(x, dim=-1) in the MiniGPT LM-head with a call to the Triton softmax from theory/03:

class GrammarMiniGPT(nn.Module):
    def __init__(self, ...):
        ...
        self.lm_head = nn.Linear(d, 600)
    def forward(self, x):
        logits = self.lm_head(x)    # shape (B, V) with V=600
        return triton_softmax(logits)   # custom kernel replaces F.softmax

The model surrounding the swap is unchanged. PyTorch dispatches nn.Linear to cuBLAS; the custom Triton kernel runs the softmax. One line changed; the rest of the framework is unaffected.
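The slot-in pattern itself can be tried without a GPU. In this sketch standin_custom_softmax is a hypothetical placeholder written in plain torch ops, used only because the real Triton kernel needs CUDA; LMHead and the softmax_fn argument are likewise illustrative and not the lab's class.

```python
import torch
from torch import nn

def reference_softmax(logits: torch.Tensor) -> torch.Tensor:
    return torch.softmax(logits, dim=-1)

def standin_custom_softmax(logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the Triton kernel: same contract, plain torch ops."""
    shifted = logits - logits.max(dim=-1, keepdim=True).values
    e = torch.exp(shifted)
    return e / e.sum(dim=-1, keepdim=True)

class LMHead(nn.Module):
    def __init__(self, d: int, vocab: int, softmax_fn=reference_softmax):
        super().__init__()
        self.lm_head = nn.Linear(d, vocab)
        self.softmax_fn = softmax_fn   # plain attribute, not a registered parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.softmax_fn(self.lm_head(x))

torch.manual_seed(0)
head = LMHead(64, 600)
x = torch.randn(4, 64)
with torch.no_grad():
    p_ref = head(x)
    head.softmax_fn = standin_custom_softmax   # the one-line swap
    p_custom = head(x)

same = torch.allclose(p_ref, p_custom, atol=1e-6)
```

Comparing the swapped output against the reference on the same input is the same style of check as the NumPy-versus-PyTorch validation, applied to a single op.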

This is the contract Phase 24 buys: the ability to slot a custom kernel into a PyTorch model with surgical precision. Phase 25 will formalize how PyTorch dispatches that custom op (via torch.library / custom-op registration).

Drill problems

  1. Why does x.t().contiguous() allocate memory? (Hint: the transposed view has non-trivial strides; making it contiguous reorders bytes.)
  2. What does model.fc1.weight.untyped_storage().data_ptr() give you? (Hint: the literal HBM/CPU address of the storage. The older .storage() accessor still works but is deprecated.)
  3. In y = model(x.cuda()); y.cpu(), how many memory copies occurred? (Hint: two — H2D for x, D2H for y. Plus internal CUDA-CUDA work, which doesn't count as "copies".)
  4. Why does fp16 on CUDA give different results than fp32 on CUDA for the same model? (Hint: dynamic range and non-associativity of accumulation.)
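For drill 1, the answer can be checked empirically on CPU. A small sketch, with illustrative names:

```python
import torch

x = torch.arange(12).reshape(3, 4)
xt = x.t()                                   # view with strides (1, 4)

view_is_contiguous = xt.is_contiguous()      # the view is not row-major

xc = xt.contiguous()                         # reorders elements into a new buffer
copy_made = xc.data_ptr() != x.data_ptr()
xc_strides = xc.stride()                     # row-major strides for shape (4, 3)

# On an already-contiguous tensor, contiguous() returns the tensor itself.
no_copy = x.contiguous().data_ptr() == x.data_ptr()
```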

What you should now be able to do

  1. State the Tensor = Storage + view model from memory.
  2. Predict what .cuda() does at the byte level.
  3. Port a NumPy module to PyTorch by mechanical substitution and verify byte-equivalence at fp32.
  4. Slot a custom kernel into an nn.Module's forward method.
  5. Distinguish what is "PyTorch the framework" vs "the kernel libraries it dispatches to" (vendor libraries such as cuBLAS, cuDNN, and NCCL, plus PyTorch's own ATen kernels).

What this page does NOT cover

  • __torch_dispatch__ and the dispatcher internals. Phase 25.
  • Autograd engine internals (Function, Variable, backward graph traversal). Phase 25.
  • Custom op registration via torch.library. Phase 25 (briefly) and Phase 27 (in depth for paged attention).
  • torch.compile / Inductor. Phase 25.
  • DataLoader, optimizer, training loop. Phase 18 already taught these in NumPy; the PyTorch versions are introduced as needed in Phase 25.

Next: lab/00-hello-cuda.md — the toolchain check.