Lab 03 — torch.compile on grammar MiniGPT + distributed survey¶
You compile the grammar MiniGPT with torch.compile, dump the code Inductor generates, identify one fused kernel, and explain it. Then you write the 1-page survey on DDP/FSDP/TP/PP: the four distributed patterns, without implementing them.
Objective¶
Run torch.compile on the grammar MiniGPT's forward pass (or a minimal stand-in: Linear(64, 600) → softmax), dump Inductor's generated kernels, identify one fused kernel, and explain what it does. Then write a 1-page distributed-survey README distinguishing DDP, FSDP, tensor-parallel, and pipeline-parallel.
Setup¶
- torch >= 2.1. CPU compile path works without CUDA.
- Phase 17's PyTorch port (the grammar MiniGPT). If not yet ported, use the minimal stand-in below.
Part A — Compile the model¶
Minimal stand-in (use if Phase 17 PyTorch port isn't available):
import torch
import torch.nn as nn


class TinyHead(nn.Module):
    def __init__(self, d=64, vocab=600):
        super().__init__()
        self.fc = nn.Linear(d, vocab)

    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)


torch.manual_seed(42)
model = TinyHead()
model_c = torch.compile(model, mode="default")

x = torch.randn(2, 64)
y1 = model(x)    # eager
y2 = model_c(x)  # compiled (first call triggers compilation)
print("compile match:", (y1 - y2).abs().max().item())  # expect < 1e-6
Part B — Dump Inductor output¶
Set the env var before importing torch, or set it via os.environ:
import os
os.environ["TORCH_LOGS"] = "output_code"
os.environ["TORCH_COMPILE_DEBUG"] = "1"  # debug traces go to ./torch_compile_debug/; generated code is cached in /tmp/torchinductor_<user>/
import torch
# ... rest of model code
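After running, look in /tmp/torchinductor_<user>/. You should see a directory tree with .py files (the generated Python wrappers) and, on CPU, .cpp files (the generated kernels). On a CPU-only build, expect C++; with CUDA, expect Triton kernels, which are emitted as Python source inside the .py files.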
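To find candidates for Part C quickly, a small helper can rank the generated sources by size. This is a minimal sketch using only the standard library; the function name list_generated and the default suffix list are illustrative choices, not part of PyTorch.

```python
import sys
from pathlib import Path


def list_generated(root, suffixes=(".py", ".cpp", ".cu")):
    """Return (size_bytes, path) pairs for generated sources, largest first."""
    root = Path(root)
    found = [
        (p.stat().st_size, p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix in suffixes
    ]
    return sorted(found, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Usage (hypothetical script name): python list_kernels.py <cache-dir>
    if len(sys.argv) > 1:
        cache_root = Path(sys.argv[1])
        for size, path in list_generated(cache_root)[:10]:
            print(f"{size:>8} B  {path.relative_to(cache_root)}")
```

Pass the cache directory as the only argument. If you have set TORCHINDUCTOR_CACHE_DIR, pass that path instead of the default.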
Part C — Read one generated kernel¶
Pick the largest .py file in /tmp/torchinductor_<user>/. It will look something like:
# Generated by torch._inductor
triton_poi_fused_softmax_0 = ...  # or cpp_fused_softmax_0 for CPU

@triton.jit  # or extern C
def kernel(...):
    # one or more aten ops fused
    ...

def call(args):
    # orchestration
    ...
For our Linear + softmax model, you should see at least one fused softmax kernel: it computes the max, exp, sum, divide in a single pass without materializing the intermediate exp tensor across kernel boundaries.
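As a mental model for what "fused" means here, the pure-Python analogy below contrasts an op-by-op softmax, which stores every intermediate, with a version that never stores exp(x - m). It is not Inductor's output and says nothing about how Inductor schedules the work; the function names are made up for illustration.

```python
import math


def softmax_unfused(row):
    """Op-by-op version: each step materializes a full intermediate list."""
    m = max(row)                            # reduction: max
    shifted = [v - m for v in row]          # pointwise sub -> intermediate buffer
    exps = [math.exp(v) for v in shifted]   # pointwise exp -> intermediate buffer
    total = sum(exps)                       # reduction: sum
    return [e / total for e in exps]        # pointwise div


def softmax_fused(row):
    """Same math with no stored list of exp(x - m)."""
    m = max(row)                                   # reduction: max
    total = sum(math.exp(v - m) for v in row)      # reduction: sum over a generator
    return [math.exp(v - m) / total for v in row]  # recompute exp, write output once


row = [1.0, 2.0, 3.0, 0.5]
print(softmax_unfused(row))
print(softmax_fused(row))
```

The fused version trades a little recomputation for less memory traffic, which is the same trade the annotation in Part D asks you to look for in the real kernel.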
Save the most interesting kernel to experiments/25-compile/kernel.py (or .cpp).
Part D — Annotate the kernel¶
In experiments/25-compile/KERNEL_ANNOTATION.md, walk through the kernel line by line. Identify:
- Which ATen ops were fused. (Likely: max, sub, exp, sum, div.)
- Where the input is read from. The pointer arithmetic / index expression.
- Where the output is written.
- Whether the kernel uses a reduction. Softmax requires reductions for max and sum; how does Inductor express them?
- What's not there. No intermediate buffer for exp(x - m): that's the fusion's win.
This is reading, not writing. You don't need to modify the kernel. The goal is to see that Inductor's output is generated code, not magic.
Part E — Profile compiled vs eager¶
import time


def bench(fn, x, n=1000):
    # warm-up (also absorbs the first-call compile)
    for _ in range(10):
        fn(x)
    t0 = time.perf_counter()
    for _ in range(n):
        fn(x)
    return (time.perf_counter() - t0) / n * 1e6  # μs per call


x = torch.randn(2, 64)
print("eager:    ", bench(model, x), "μs")
print("compiled: ", bench(model_c, x), "μs")
On CPU, the compiled version may be 1.1×-2× faster, or in our tiny case possibly slower (overhead dominates for small models). Document what you see — both outcomes are valid lessons.
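Single timing runs on a tiny model are noisy, so it can help to repeat the measurement and report a median with its spread. The helper below is one possible sketch; bench_median and its parameters are illustrative names, and it works on any callable.

```python
import statistics
import time


def bench_median(fn, x, n=1000, repeats=5, warmup=10):
    """Return (median, min, max) microseconds per call over `repeats` batches."""
    for _ in range(warmup):  # warm-up also absorbs any first-call compilation
        fn(x)
    per_call_us = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(n):
            fn(x)
        per_call_us.append((time.perf_counter() - t0) / n * 1e6)
    return statistics.median(per_call_us), min(per_call_us), max(per_call_us)


# Stand-in workload so the sketch runs without torch; swap in model / model_c.
med, lo, hi = bench_median(lambda v: v * 2, 3.0, n=1000, repeats=5)
print(f"median {med:.3f} μs (min {lo:.3f}, max {hi:.3f})")
```

If the min and max for eager and compiled overlap, report that the difference is within noise. On a CUDA device you would also need torch.cuda.synchronize() before reading the clock, since kernel launches are asynchronous.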
Part F — The distributed survey¶
Write experiments/25-compile/DISTRIBUTED.md (~1 page, ~500 words). Four sections, ~125 words each:
1. DDP — Distributed Data Parallel¶
- Pattern: the model is replicated on every device; each device processes a different shard of the batch; gradients are all-reduced (averaged) across devices during .backward().
- API: nn.parallel.DistributedDataParallel(model).
- When to use: the model fits on one device, and you're scaling throughput.
- Communication: one all-reduce per gradient bucket, per step, overlapped with backward.
- When NOT to use: the model doesn't fit. Use FSDP or TP.
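The data-parallel idea can be checked on paper with a one-weight model. The toy simulation below is a sketch in plain Python, with no processes and no real communication: two "replicas" compute gradients on equal-sized shards, a stand-in all-reduce averages them, and the result equals the full-batch gradient. All names and numbers are made up for illustration.

```python
def local_grad(w, shard):
    """Gradient of mean squared error for y = w * x on one device's shard."""
    return sum(2 * (w * x - t) * x for x, t in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


# A batch of (x, target) pairs, split into equal shards for two replicas.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.5), (4.0, 7.5)]
shards = [batch[:2], batch[2:]]

w = 1.0  # every replica starts from the same weight
grads = [local_grad(w, shard) for shard in shards]  # computed independently
synced = all_reduce_mean(grads)                     # the communication step

lr = 0.01
new_weights = [w - lr * g for g in synced]          # identical update everywhere
print("per-replica grads:", grads)
print("synced grad:", synced[0], "full-batch grad:", local_grad(w, batch))
print("new weights:", new_weights)
```

The equality with the full-batch gradient relies on the shards being the same size; with unequal shards a plain average would weight samples unevenly.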
2. FSDP — Fully Sharded Data Parallel¶
- Pattern: parameters, gradients, and optimizer state are sharded across devices. Each device holds 1/N. During forward, each layer's parameters are all-gathered from peers, then freed after the layer runs. Backward is similar.
- API: torch.distributed.fsdp.FullyShardedDataParallel(model).
- When to use: the model doesn't fit on one device, but a layer does.
- Communication: all-gather per layer in forward, all-gather plus reduce-scatter per layer in backward. Roughly 1.5× the communication volume of DDP.
- When NOT to use: the model fits on one device (DDP is cheaper) or even a layer doesn't fit (need TP).
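The "each device holds 1/N" claim is easiest to appreciate as arithmetic. The sketch below assumes fp32 weights and gradients plus two fp32 Adam moments (16 bytes per parameter) and ignores activations and the temporarily gathered layer. The 7-billion-parameter figure is a hypothetical example, not the grammar MiniGPT.

```python
def per_device_gib(n_params, n_devices, sharded, bytes_per_param=16):
    """Persistent memory per device for params + grads + optimizer state.

    bytes_per_param=16 assumes fp32 weights (4) + fp32 grads (4) + two fp32
    Adam moments (8). Activations and transient buffers are ignored.
    """
    total_bytes = n_params * bytes_per_param
    if sharded:
        # FSDP-style: each device keeps only its 1/N slice.
        total_bytes /= n_devices
    # Otherwise DDP-style: a full replica lives on every device.
    return total_bytes / 2**30


n_params = 7_000_000_000  # hypothetical model size
for n in (1, 8, 64):
    ddp = per_device_gib(n_params, n, sharded=False)
    fsdp = per_device_gib(n_params, n, sharded=True)
    print(f"{n:>2} device(s): DDP {ddp:6.1f} GiB, FSDP {fsdp:6.1f} GiB per device")
```

The DDP column stays flat as devices are added, while the FSDP column shrinks. That difference is the whole reason to pay for the extra communication.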
3. Tensor parallel (TP)¶
- Pattern: within a single layer, the weight matrix is split across devices (row- or column-wise). The matmul is partitioned; partial outputs are concatenated (column split) or summed (row split).
- API: library-level (Megatron-LM, FairScale, torch.distributed.tensor.parallel).
- When to use: a single layer's weights don't fit. Common in 70B+ models for the LM head.
- Communication: one all-reduce per layer (for the row-split form). High volume; requires NVLink-quality interconnect.
- When NOT to use: comm is slow relative to compute. Use FSDP instead.
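The two ways of splitting a weight matrix can be verified with small integer matrices. This is a plain-Python sketch in which "devices" are just list entries; the helper names are illustrative. Column splitting ends in a concatenation, row splitting ends in a sum, which is the step that an all-reduce performs across real devices.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]                               # (batch=2, d_in=4)
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 1, 1, 0]]   # (d_in=4, d_out=4)
full = matmul(x, w)

# Column-parallel: each device owns a column block of W and sees the whole input.
col_parts = [matmul(x, w_i) for w_i in split_columns(w, 2)]
col_result = [sum((part[r] for part in col_parts), []) for r in range(len(x))]

# Row-parallel: each device owns a row block of W and the matching input columns.
row_parts = [matmul(x_i, w_i) for x_i, w_i in zip(split_columns(x, 2), split_rows(w, 2))]
row_result = [[sum(vals) for vals in zip(*rows)] for rows in zip(*row_parts)]

print(full == col_result == row_result)
```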
4. Pipeline parallel (PP)¶
- Pattern: model split along depth. Device 0 holds layers 1-10, device 1 layers 11-20, etc. Activations flow forward and gradients flow backward between stages; the batch is cut into micro-batches to keep stages busy, but idle "bubbles" remain while the pipeline fills and drains.
- API: torch.distributed.pipeline.sync.Pipe (deprecated in recent PyTorch releases in favor of torch.distributed.pipelining) or library wrappers.
- When to use: the model has many sequential layers; comm bandwidth between devices is low.
- Communication: one send/recv per micro-batch per stage. Low volume but high latency.
- When NOT to use: few layers, lots of bandwidth. DDP or TP dominates.
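The bubble is easier to see as a timeline. The sketch below draws a GPipe-style forward schedule, assuming every stage takes one time unit per micro-batch and leaving out the backward pass. It then computes the idle fraction (S - 1) / (S + M - 1) for S stages and M micro-batches. Function names are illustrative.

```python
def forward_schedule(n_stages, n_microbatches):
    """grid[stage][t] is the micro-batch id processed at step t, or None if idle."""
    n_steps = n_stages + n_microbatches - 1
    grid = [[None] * n_steps for _ in range(n_stages)]
    for stage in range(n_stages):
        for mb in range(n_microbatches):
            grid[stage][stage + mb] = mb  # stage s sees micro-batch m at step s + m
    return grid


def bubble_fraction(n_stages, n_microbatches):
    """Share of time slots in which a stage sits idle."""
    return (n_stages - 1) / (n_stages + n_microbatches - 1)


for stage, row in enumerate(forward_schedule(4, 8)):
    cells = " ".join("." if mb is None else str(mb) for mb in row)
    print(f"stage {stage}: {cells}")
print(f"bubble fraction: {bubble_fraction(4, 8):.3f}")
```

Raising the number of micro-batches shrinks the bubble, at the cost of more send/recv calls per step.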
End the survey with a 3-sentence "which would I pick" paragraph for the grammar MiniGPT (answer: DDP, because the model is tiny — but the question is meant to make you reason).
Part G — Write the report¶
experiments/25-compile/REPORT.md:
- The compile-match check (Part A): max-err < 1e-6.
- Pointer to the kernel file (Part C) and the annotation (Part D).
- Profile numbers (Part E) with honest interpretation (small models may not benefit).
- Pointer to DISTRIBUTED.md (Part F).
Deliverable¶
experiments/25-compile/:
- REPORT.md.
- kernel.py or kernel.cpp — the Inductor-generated kernel.
- KERNEL_ANNOTATION.md — your walkthrough.
- DISTRIBUTED.md — the 1-page survey.
- manifest.json.
Acceptance¶
- torch.compile'd model matches eager output within 1e-6.
- One Inductor-generated kernel is saved and annotated.
- The annotation correctly identifies the fused softmax pattern.
- The distributed survey distinguishes the four patterns in 2-sentence form per pattern.
Pitfalls¶
- TORCH_LOGS set after import. The env var must be set before import torch. Easiest: set it in your shell or use os.environ at the top of the script before any torch import.
- /tmp/torchinductor_<user>/ cleared between runs. Inductor caches by graph hash; clearing the dir forces a fresh compile. Useful for debugging.
- First call is slow. torch.compile is JIT: the first call traces and compiles. Benchmark only the warm path.
- Recompiles on shape change. Pass the same shape to every call when benchmarking. mode="reduce-overhead" (CUDA graphs) is especially sensitive to shape changes, so use it carefully.
- CPU compile uses C++. Expect .cpp and .so files, not Triton. The lesson is the same: fused, generated, readable.
- "My MiniGPT isn't ported to PyTorch yet." Use the stand-in. The point is reading Inductor output, which doesn't depend on model size.
Stretch¶
- Compile the full grammar MiniGPT (decoder block + LM head). Identify a fused attention+softmax kernel. Compare to Phase 27's flash-attention preview.
- Run with mode="max-autotune". Compare compile time and runtime to mode="default".
- Use torch._dynamo.export to extract the FX graph as a standalone artifact.
End of Phase 25 labs. Time to write PHASE_25_REPORT.md and prep for Phase 26.
Next: Phase 26 — Quantization.