
01: The Dispatcher and ATen

The PyTorch dispatcher is a table. It takes (op_name, key_set) and returns the kernel to run. The key set encodes device (CPU/CUDA), dtype, layout, requires_grad, autocast, and half a dozen more. ATen is the library of backend kernels that the table points to. This page formalizes the table and its keys, and traces a linear(x, W, b) call through it.

This page explains the PyTorch dispatcher mechanically. After reading it, you should be able to read TORCH_LOGS=dispatcher output and explain every line. The lab instruments a real call and logs the decisions.

The mental model: a giant `dict`

Conceptually:

dispatcher: dict[(op_name, KeySet), Callable] = {
    ("aten::matmul", {"CPU", "Float", "Strided"}):       cpu_matmul_fp32,
    ("aten::matmul", {"CUDA", "Float", "Strided"}):      cuda_matmul_fp32,
    ("aten::matmul", {"CUDA", "Half", "Strided"}):       cuda_matmul_fp16,
    ("aten::matmul", {"CUDA", "Float", "AutogradCUDA"}): autograd_wrapper,
    ...
}

In practice it's not literally a Python dict (it's a C++ data structure for performance), but the logical lookup is identical. Given an op call:

y = torch.matmul(x, W.T)   # both fp32 CUDA tensors with requires_grad=True

The dispatcher computes the key set from the inputs:

{CUDA, Float, Strided, AutogradCUDA}

Then it looks up ("aten::matmul", key_set) and dispatches.
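The lookup above can be sketched as runnable Python. This is a toy model of the mental picture only, not how PyTorch implements it: the kernel functions are stubs that return their own name, and `TABLE` and `dispatch` are hypothetical names introduced for illustration. Key sets are stored as `frozenset` because a plain `set` cannot be used as a dict key.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Key = Tuple[str, FrozenSet[str]]


def cpu_matmul_fp32(a, b):
    # Stub kernel: a real kernel would compute a @ b on the CPU.
    return "cpu_matmul_fp32"


def cuda_matmul_fp32(a, b):
    # Stub kernel: a real kernel would launch a GPU GEMM.
    return "cuda_matmul_fp32"


# Toy table: (op name, key set) -> kernel.
TABLE: Dict[Key, Callable] = {
    ("aten::matmul", frozenset({"CPU", "Float", "Strided"})): cpu_matmul_fp32,
    ("aten::matmul", frozenset({"CUDA", "Float", "Strided"})): cuda_matmul_fp32,
}


def dispatch(op_name, key_set, *args):
    """Look up the kernel for (op_name, key_set) and call it."""
    kernel = TABLE.get((op_name, frozenset(key_set)))
    if kernel is None:
        raise NotImplementedError(f"no kernel for {op_name} with {sorted(key_set)}")
    return kernel(*args)


result = dispatch("aten::matmul", {"CUDA", "Float", "Strided"}, None, None)
print(result)
```

An exact-match lookup like this is the simplest possible version; the priority ordering described later on this page replaces the exact match with "highest-priority key that has a kernel".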

The key set

A dispatch key encodes one of several axes:

Axis	Examples
Backend	`CPU`, `CUDA`, `MPS` (Apple), `XLA`, `Meta` (shape-only)
Layout	`Strided`, `Sparse`, `SparseCSR`, `MkldnnCPU`
Dtype "key"	`Float`, `Half`, `BFloat16`, `Long`, `Bool`
Autograd flag	`AutogradCPU`, `AutogradCUDA`, `AutogradFunctionality`
Autocast	`AutocastCPU`, `AutocastCUDA`
Transforms / routing	`FuncTorchBatched`, `BackendSelect`
...	several more (see `torch._C._dispatch_keys`)

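The full key set for one tensor is the union of all keys that apply: an fp32 CUDA tensor with requires_grad=True has {Backend.CUDA, Layout.Strided, AutogradCUDA, ...}. Dtype is listed as a "key" for the mental model only; in the real dispatcher the backend kernel selects its dtype-specific code path internally.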

When two or more tensors enter an op, the union is taken across tensors. The dispatcher picks the highest-priority key with a registered kernel.
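You can look at the key set of a real tensor from Python. The sketch below uses `torch._C._dispatch_keys`, a private helper whose name and output format may change between PyTorch versions. It uses CPU tensors so it runs without a GPU; on a CUDA tensor you would expect the CUDA counterparts of the same keys.

```python
import torch

x = torch.randn(2, 64, requires_grad=True)  # fp32, CPU, strided
w = torch.randn(600, 64)                    # same, but requires_grad=False

# Private API: returns the DispatchKeySet attached to the tensor.
x_keys = str(torch._C._dispatch_keys(x))
w_keys = str(torch._C._dispatch_keys(w))

print("x:", x_keys)
print("w:", w_keys)
```

Compare the two printed sets against the table above and note which axes show up as real keys and which do not.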

Priority ordering

Keys have a priority order. High-priority keys are checked first. Roughly:

Functorch / vmap keys (so vmap can batch before anything else runs).
Autocast keys (so inputs are cast to the autocast dtype before autograd records the op).
Autograd keys (so the backward graph is built before the actual computation).
Sparse / Mkldnn keys (so sparse/MKL-DNN backends override the dense ones).
Backend keys (CPU, CUDA, MPS).

The walk-down is: highest-priority key with a registered kernel wins. That kernel runs; if it redispatches (a common pattern), the next-highest key takes over.

This is why torch.matmul on requires_grad=True inputs first dispatches to the autograd kernel (which records the backward graph), which then redispatches without the autograd key to the CUDA kernel (which actually computes).
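The pick-then-redispatch pattern can be sketched in a few lines of plain Python. This is a toy with only two keys, and `PRIORITY`, `KERNELS`, `make_wrapper`, and `dispatch` are hypothetical names for illustration. The wrapper kernel logs itself, removes its own key from the set, and dispatches again, which is the behavior described above for the autograd kernel.

```python
# Toy priority order, highest first. The real dispatcher has many more keys.
PRIORITY = ["AutogradCUDA", "CUDA"]


def make_wrapper(key):
    """Build a wrapper kernel that logs, masks its own key, and redispatches."""
    def wrapper(op, keys, log):
        log.append(f"{op} key={key} (wrapper)")
        return dispatch(op, keys - {key}, log)
    return wrapper


def cuda_kernel(op, keys, log):
    # Stand-in for the kernel that does the actual computation.
    log.append(f"{op} key=CUDA (compute)")
    return "result"


KERNELS = {
    "AutogradCUDA": make_wrapper("AutogradCUDA"),
    "CUDA": cuda_kernel,
}


def dispatch(op, keys, log):
    # Pick the highest-priority key that is present and has a kernel.
    for key in PRIORITY:
        if key in keys and key in KERNELS:
            return KERNELS[key](op, keys, log)
    raise NotImplementedError(f"no kernel for {op} with {sorted(keys)}")


log = []
out = dispatch("aten::matmul", {"CUDA", "AutogradCUDA"}, log)
for line in log:
    print(line)
```

Calling `dispatch` with only `{"CUDA"}` skips the wrapper entirely, which mirrors what happens when no autograd key is in the set.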

Tracing a `linear(x, W, b)` call

torch.nn.functional.linear(x, W, b) is internally:

# Approximately:
return torch.addmm(b, x, W.T)
# Where addmm(b, x, W^T) computes b + x @ W^T

At Phase-25's running example shape: x: (2, 64) fp32 CUDA requires_grad, W: (600, 64) fp32 CUDA requires_grad, b: (600,) fp32 CUDA requires_grad.

The dispatch trace (TORCH_LOGS=dispatcher):

[step 1] aten::linear   key=AutogradCUDA → autograd wrapper
[step 2]   aten::linear key=CUDA         → calls aten::addmm internally
[step 3]   aten::addmm  key=AutogradCUDA → autograd wrapper (records AddmmBackward0)
[step 4]     aten::addmm key=CUDA        → calls cuBLAS gemm
[step 5]     cuBLAS launches kernel sm80_xmma_gemm_f32f32_*
[step 6]   y returned (shape (2, 600), grad_fn=<AddmmBackward0>)

Six steps. The first four are dispatcher decisions; the last two are the cuBLAS kernel launch and the return. Lab 00 instruments exactly this sequence with TORCH_LOGS=dispatcher and you annotate each line.
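Two claims in this trace are easy to check from Python without any logging: that linear on a 2-D input with a bias matches addmm, and that the recorded backward node belongs to addmm. The sketch below uses the running-example shapes but on CPU so it runs without a GPU; that device change is an assumption of this example, and the keys in the trace would read CPU instead of CUDA.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 64, requires_grad=True)
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)

y = F.linear(x, W, b)
y_ref = torch.addmm(b, x, W.T)  # b + x @ W^T

# The autograd node recorded for y names the op that autograd actually saw.
grad_fn_name = type(y.grad_fn).__name__
matches = torch.allclose(y, y_ref, atol=1e-5)

print(tuple(y.shape), grad_fn_name, matches)
```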

ATen, briefly

ATen ("A TENsor library") is the C++ kernel library that the dispatcher routes to. Operators are named with the aten:: prefix: aten::matmul, aten::softmax, aten::linear. Each has potentially dozens of registered kernels: one per (backend, dtype, layout) combination, plus autograd wrappers.

You don't write ATen kernels in Phase 25 (that requires a PyTorch source build). But you read which ATen op was dispatched, and via what kernel.

To see the table from Python, torch._C._dispatch_print_registrations_for_dispatch_key("CUDA") prints every op that has a CUDA-registered kernel: roughly 2,500 operators. The framework's apparent complexity is mostly table size.
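To inspect a single ATen op rather than the whole table, you can grab it through `torch.ops`. The sketch below fetches the default overload of aten::addmm, prints its schema, and calls it directly. `_schema` is a private attribute, so treat its exact string form as subject to change; the tiny tensors are made up for the example.

```python
import torch

op = torch.ops.aten.addmm.default  # handle to the aten::addmm operator
schema = str(op._schema)
print(schema)

b = torch.zeros(3)
x = torch.ones(2, 4)
W = torch.ones(3, 4)

# Calling the handle goes through the dispatcher, like torch.addmm does.
y = op(b, x, W.T)
print(y)
```

Each entry of `y` is a sum of four ones plus a zero bias, which makes the result easy to verify by hand.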

How does a new op enter the dispatcher?

Two paths:

In-tree (PyTorch source): add a C++ kernel + a TORCH_LIBRARY_IMPL block. Requires building PyTorch from source. Not your path.
Out-of-tree (Python-side): torch.library.custom_op(...). Adds entries to the dispatcher table from Python, at import time. This is how Phase 25 lab 02 promotes the Phase-24 Triton softmax to a dispatcher op.

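import torch

@torch.library.custom_op("mylib::softmax_triton", mutates_args=())
def softmax_triton(x: torch.Tensor) -> torch.Tensor:
    # Stand-in so the sketch runs without Triton; the real body launches the
    # Phase-24 kernel. Assumes a row-wise softmax over the last dimension.
    return torch.softmax(x, dim=-1)

# Register the fake (meta) implementation: shape and dtype inference only
@softmax_triton.register_fake
def _softmax_triton_fake(x):
    return torch.empty_like(x)

# Save the forward output so backward can reuse it
def _softmax_triton_setup_context(ctx, inputs, output):
    ctx.save_for_backward(output)

# Register backward: dx = y * (g - sum(g * y)) along the softmax dimension
def _softmax_triton_backward(ctx, grad_out):
    (y,) = ctx.saved_tensors
    return y * (grad_out - (grad_out * y).sum(dim=-1, keepdim=True))

softmax_triton.register_autograd(
    _softmax_triton_backward, setup_context=_softmax_triton_setup_context
)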

After registration, torch.ops.mylib.softmax_triton(x) dispatches just like any other ATen op. The dispatcher doesn't care that it's "custom".

Autocast: the implicit-dtype dispatch

torch.autocast(device_type='cuda', dtype=torch.float16) flips an Autocast dispatch key on. Inside the autocast region, ops first dispatch to the autocast kernel, which may cast inputs to fp16 before re-dispatching to the CUDA kernel.

Which ops cast, and to what dtype, is decided by a hard-coded table in aten/src/ATen/autocast_mode.cpp. Roughly: GEMMs run in fp16, while reductions and softmax run in fp32 for numerical stability. None of this appears in user code; the dispatcher does it.

This is the first layer that wraps your raw call. Then autograd. Then backend. The dispatcher walks through each layer in priority order.
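The effect of the autocast layer is visible in the output dtype. The sketch below is a CPU stand-in so it runs without a GPU: it uses device_type="cpu" with bfloat16 instead of the CUDA/fp16 setup discussed above, which is an assumption of this example. The same fp32 inputs produce a lower-precision result inside the region and an fp32 result outside it.

```python
import torch

a = torch.randn(4, 8)  # fp32 inputs
b = torch.randn(8, 3)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    inside = torch.mm(a, b)   # routed through the autocast kernel first

outside = torch.mm(a, b)      # no autocast key active here

print(inside.dtype, outside.dtype)
```

No cast appears in the user code; the dtype change comes entirely from the extra dispatch layer.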

Performance cost of dispatch

Each dispatch decision costs ~1–2 μs of C++ overhead. For a model with 100 ops per forward pass, that is 100–200 μs spent purely on dispatching. If the forward pass takes 10 ms, that is 1–2% overhead. If the kernels themselves need only 100 μs per forward pass (small grammar MiniGPT inference), dispatch adds 100–200% on top of the compute, and dispatch dominates.
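The arithmetic is simple enough to keep as a helper for your own models. The function below is a back-of-the-envelope sketch under the same assumptions as the paragraph above (one dispatch cost per op, a flat 1–2 μs per dispatch); `dispatch_overhead` is a hypothetical name, and lab 03 measures the real numbers.

```python
def dispatch_overhead(n_ops, per_dispatch_us, compute_us):
    """Return (dispatch time in us, dispatch time as a fraction of compute time)."""
    dispatch_us = n_ops * per_dispatch_us
    return dispatch_us, dispatch_us / compute_us


for compute_us in (10_000, 100):          # 10 ms and 100 us of compute
    for per_dispatch_us in (1.0, 2.0):    # low and high end of the estimate
        total, frac = dispatch_overhead(100, per_dispatch_us, compute_us)
        print(f"compute={compute_us} us  dispatch={total:.0f} us  overhead={frac:.0%}")
```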

This is why torch.compile exists: it captures the graph once, then runs the optimized kernel without per-op dispatch. CUDA Graphs (Phase 33) do the same at the kernel-launch level. Phase 25 lab 03 measures the dispatch overhead concretely.

What you should now be able to do

State the key-set of a tensor (backend, dtype, layout, autograd) by inspecting it.
Read a TORCH_LOGS=dispatcher line and explain what's being dispatched.
Find which ATen op a Python frontend call maps to (linear → addmm, etc.).
Predict the dispatch sequence for a forward pass through nn.Linear(64, 600).
Explain why autograd dispatch happens before backend dispatch.

What this page does NOT cover

__torch_dispatch__. A user-mode dispatch override (per-tensor). Powerful but niche; mentioned in theory/02 only for the autograd contrast.
__torch_function__. Even more niche; subclass-level override. Skip.
PyTorch source-side dispatch implementation (Dispatcher.cpp). Out of scope.
MPS (Apple Silicon) dispatch path. Mentioned; Borja's cloud GPU is NVIDIA, so the CUDA path is the focus.

Next: theory/02-autograd-engine.md — the second half: forward graph capture, backward traversal, custom-op autograd registration.