Lab 03 — torch.compile on grammar MiniGPT + distributed survey

You compile the grammar MiniGPT with torch.compile, dump the code Inductor generates, identify one fused kernel, and explain it. Then you write the 1-page survey on DDP/FSDP/TP/PP: the four distributed patterns, without implementing them.

Objective

Run torch.compile on the grammar MiniGPT's forward pass (or a minimal stand-in: Linear(64, 600) → softmax), dump Inductor's generated kernels, identify one fused kernel, and explain what it does. Then write a 1-page distributed-survey README distinguishing DDP, FSDP, tensor-parallel, and pipeline-parallel.

Setup

  • torch >= 2.1. The CPU compile path works without CUDA, but it needs a working C++ compiler (for example g++ or clang++) on the PATH.
  • Phase 17's PyTorch port (the grammar MiniGPT). If not yet ported, use the minimal stand-in below.

Part A — Compile the model

Minimal stand-in (use if Phase 17 PyTorch port isn't available):

import torch
import torch.nn as nn

class TinyHead(nn.Module):
    def __init__(self, d=64, vocab=600):
        super().__init__()
        self.fc = nn.Linear(d, vocab)
    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)

torch.manual_seed(42)
model = TinyHead()
model_c = torch.compile(model, mode="default")

x = torch.randn(2, 64)
y1 = model(x)
y2 = model_c(x)
print("compile match:", (y1 - y2).abs().max().item())   # < 1e-6

Part B — Dump Inductor output

Set the environment variables before importing torch, either in your shell or via os.environ at the top of the script:

import os
os.environ["TORCH_LOGS"] = "output_code"
os.environ["TORCH_COMPILE_DEBUG"] = "1"   # extra debug dump in ./torch_compile_debug/; generated code is cached in /tmp/torchinductor_<user>/

import torch
# ... rest of model code

After running:

ls /tmp/torchinductor_$(whoami)/

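You should see a directory tree with .py files (the generated Python wrappers) and the generated kernels. On a CPU-only build, the kernels are C++ (.cpp sources compiled to .so). With CUDA, the kernels are Triton, which is Python source, so they also appear in .py files rather than as .cu files.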
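
To find the largest generated file in that tree without eyeballing `ls` output, a small stdlib-only helper can walk the directory. This is a sketch: the function names `inductor_cache_dir` and `largest_file` are hypothetical, and the default path assumes Inductor's usual `torchinductor_<user>` directory under the system temp dir, which the `TORCHINDUCTOR_CACHE_DIR` environment variable can override. Inductor may also sanitize unusual usernames, which this sketch does not replicate.

```python
import getpass
import os
import tempfile
from pathlib import Path


def inductor_cache_dir():
    """Best-guess location of Inductor's cache (assumption: default layout)."""
    override = os.environ.get("TORCHINDUCTOR_CACHE_DIR")
    if override:
        return Path(override)
    return Path(tempfile.gettempdir()) / f"torchinductor_{getpass.getuser()}"


def largest_file(root, suffix=".py"):
    """Return the largest file under root with the given suffix, or None."""
    files = [p for p in Path(root).rglob(f"*{suffix}") if p.is_file()]
    return max(files, key=lambda p: p.stat().st_size, default=None)
```

Calling `largest_file(inductor_cache_dir())` after a compile run returns a `Path` to the biggest wrapper, or `None` if nothing has been generated yet. Pass `suffix=".cpp"` to look at the C++ sources on a CPU-only build.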

Part C — Read one generated kernel

Pick the largest .py file in /tmp/torchinductor_<user>/. It will look something like:

# Generated by torch._inductor
triton_per_fused__softmax_0 = ...  # or cpp_fused__softmax_0 for CPU; exact names vary by version

@triton.jit  # or extern "C" on the C++ path
def kernel(...):
    # one or more aten ops fused
    ...

def call(args):
    # orchestration
    ...

For our Linear + softmax model, you should see at least one fused softmax kernel: it computes the max, exp, sum, and divide inside a single kernel, so the intermediate exp tensor is never written out to memory between separate kernel launches.
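
To make the fusion concrete, here is a plain-Python sketch of the two shapes of computation for one row. It is an illustration of the idea only, not Inductor's actual output: the real kernel may keep the row in registers or reuse the output buffer instead of recomputing `exp`. The function names are hypothetical.

```python
import math


def softmax_unfused(row):
    """One 'kernel' per op: every step materializes a full intermediate list."""
    m = max(row)                              # reduction
    shifted = [v - m for v in row]            # intermediate buffer 1
    exps = [math.exp(v) for v in shifted]     # intermediate buffer 2
    total = sum(exps)                         # reduction
    return [e / total for e in exps]


def softmax_fused(row):
    """One 'kernel': two reductions and a pointwise write, no intermediates."""
    m = -math.inf
    for v in row:                             # reduction 1: running max
        m = max(m, v)
    total = 0.0
    for v in row:                             # reduction 2: sum of exp(v - m)
        total += math.exp(v - m)
    # Pointwise epilogue: only the final output is ever stored.
    return [math.exp(v - m) / total for v in row]


row = [1.0, 2.0, 3.0, 4.0]
unfused = softmax_unfused(row)
fused = softmax_fused(row)
assert all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(unfused, fused))
```

When you annotate the real kernel in Part D, look for the same three phases: a max reduction, a sum reduction, and a final store.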

Save the most interesting kernel to experiments/25-compile/kernel.py (or .cpp).

Part D — Annotate the kernel

In experiments/25-compile/KERNEL_ANNOTATION.md, walk through the kernel line by line. Identify:

  1. Which ATen ops were fused. (Likely: max, sub, exp, sum, div.)
  2. Where the input is read from. The pointer arithmetic / index expression.
  3. Where the output is written.
  4. Whether the kernel uses a reduction. Softmax requires reductions for max and sum; how does Inductor express them?
  5. What's not there. No intermediate buffer for exp(x - m) — that's the fusion's win.

This is reading, not writing. You don't need to modify the kernel. The goal is to see that Inductor's output is generated code, not magic.

Part E — Profile compiled vs eager

import time

def bench(fn, x, n=1000):
    # warm-up
    for _ in range(10): fn(x)
    t0 = time.perf_counter()
    for _ in range(n): fn(x)
    return (time.perf_counter() - t0) / n * 1e6   # μs per call

x = torch.randn(2, 64)
print("eager:    ", bench(model, x), "μs")
print("compiled: ", bench(model_c, x), "μs")

On CPU, the compiled version may be 1.1×-2× faster, or in our tiny case possibly slower (overhead dominates for small models). Document what you see — both outcomes are valid lessons.
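
Because the differences here are small, a single mean can be dominated by noise. One possible refinement, sketched below with the standard library only, repeats the timed loop several times and reports the median together with the spread. The name `bench_median` and its default parameters are hypothetical choices, and the stand-in workload is plain Python so the sketch runs without PyTorch.

```python
import statistics
import time


def bench_median(fn, *args, n=1000, repeats=5, warmup=10):
    """Return (median, min, max) latency of fn(*args) in microseconds."""
    for _ in range(warmup):               # warm path only
        fn(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(n):
            fn(*args)
        samples.append((time.perf_counter() - t0) / n * 1e6)
    return statistics.median(samples), min(samples), max(samples)


# Stand-in workload; in the lab you would pass model / model_c and x instead.
median_us, low_us, high_us = bench_median(sorted, [3, 1, 2], n=200, repeats=3)
print(f"median {median_us:.3f} μs (min {low_us:.3f}, max {high_us:.3f})")
```

If the min-to-max range of the eager and compiled runs overlaps, the honest conclusion is that the measurement cannot separate them at this model size.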

Part F — The distributed survey

Write experiments/25-compile/DISTRIBUTED.md (~1 page, ~500 words). Four sections, ~125 words each:

1. DDP — Distributed Data Parallel

  • Pattern: model replicated on every device; each device processes a different shard of the batch; gradients are all-reduced (averaged) across devices during .backward().
  • API: nn.parallel.DistributedDataParallel(model).
  • When to use: the model fits on one device, and you're scaling throughput.
  • Communication: gradients are grouped into buckets, and each bucket is all-reduced once per step, overlapped with the rest of backward.
  • When NOT to use: the model doesn't fit. Use FSDP or TP.
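
The DDP pattern can be checked on paper with a single-process simulation. The sketch below is not DDP itself: it uses a one-parameter model, equal-sized shards, and a hypothetical `all_reduce_mean` helper standing in for the collective. It shows that averaging per-shard gradients reproduces the full-batch gradient.

```python
import math


def local_grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every rank ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


w = 0.5                                        # replicated on every "device"
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
world_size = 4
shard = len(xs) // world_size                  # equal shards (assumption)

grads = [
    local_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]
synced = all_reduce_mean(grads)                # what each rank sees after sync
full_batch = local_grad(w, xs, ys)             # single-device reference
assert all(math.isclose(g, full_batch) for g in synced)
```

Because every rank applies the same averaged gradient to the same starting weights, the replicas stay identical after the optimizer step.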

2. FSDP — Fully Sharded Data Parallel

  • Pattern: parameters, gradients, and optimizer state are sharded across devices. Each device holds 1/N. During forward, the layer's parameters are all_gathered from peers; freed after the layer. Backward similar.
  • API: torch.distributed.fsdp.FullyShardedDataParallel(model).
  • When to use: the model doesn't fit on one device, but a layer does.
  • Communication: an all-gather per layer in forward, then another all-gather plus a reduce-scatter per layer in backward. Roughly 1.5× the communication volume of DDP.
  • When NOT to use: the model fits on one device (DDP is cheaper) or even a layer doesn't fit (need TP).
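
A back-of-the-envelope estimate shows why sharding helps. The sketch below assumes fp32 everywhere and an Adam-style optimizer with two state values per parameter; both are assumptions, not part of the lab. It ignores activations, the unsharded gradient of the current unit, and prefetching. The function name and the example sizes are hypothetical.

```python
def per_device_bytes(n_params, world_size, largest_unit_params,
                     bytes_per_value=4, optimizer_values_per_param=2):
    """Rough model-state memory per device: (DDP, FSDP peak), in bytes."""
    # Each parameter carries itself, its gradient, and its optimizer state.
    values_per_param = 1 + 1 + optimizer_values_per_param
    ddp = n_params * values_per_param * bytes_per_value
    fsdp_steady = ddp // world_size            # each device holds a 1/N shard
    # During a layer's forward, its full parameters are gathered temporarily.
    fsdp_peak = fsdp_steady + largest_unit_params * bytes_per_value
    return ddp, fsdp_peak


# Hypothetical sizes: 1B parameters, 8 devices, largest wrapped unit 50M params.
ddp_bytes, fsdp_bytes = per_device_bytes(1_000_000_000, 8, 50_000_000)
print(f"DDP:  {ddp_bytes / 1e9:.1f} GB per device")
print(f"FSDP: {fsdp_bytes / 1e9:.1f} GB per device (peak, model state only)")
```

The estimate also shows the limit named in the bullets: the gathered unit is not divided by N, so a layer that does not fit on one device still does not fit under FSDP.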

3. Tensor parallel (TP)

  • Pattern: within a single layer, the weight matrix is split across devices (row- or column-wise). The matmul is partitioned; partial outputs are concatenated (column split) or summed (row split).
  • API: library-level (Megatron-LM, FairScale, torch.distributed.tensor.parallel).
  • When to use: a single layer's weights don't fit. Common in 70B+ models, where the attention and MLP weight matrices (and often the LM head) are split.
  • Communication: one all-reduce per layer (for the row-split form). High; requires NVLink-quality interconnect.
  • When NOT to use: comm is slow relative to compute. Use FSDP instead.
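
The two split directions can be verified with small integer matrices. This sketch is dependency-free and simulates two "devices" in one process; the `matmul` helper and the shapes are illustrative choices, not any library's API.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


X = [[1, 2, 3, 4], [5, 6, 7, 8]]                       # (2, 4) activations
W = [[r * 6 + c for c in range(6)] for r in range(4)]  # (4, 6) weights
full = matmul(X, W)                                    # single-device reference

# Column split: each device owns half of W's columns and sees all of X.
W_col0 = [row[:3] for row in W]
W_col1 = [row[3:] for row in W]
part0, part1 = matmul(X, W_col0), matmul(X, W_col1)
col_out = [a + b for a, b in zip(part0, part1)]        # concatenate (all-gather)

# Row split: each device owns half of W's rows and the matching slice of X.
X0 = [row[:2] for row in X]
X1 = [row[2:] for row in X]
p0, p1 = matmul(X0, W[:2]), matmul(X1, W[2:])
row_out = [[a + b for a, b in zip(r0, r1)]             # sum (all-reduce)
           for r0, r1 in zip(p0, p1)]

assert col_out == full
assert row_out == full
```

The column split needs a gather of output slices, while the row split needs a sum of full-sized partial results, which is where the all-reduce in the communication bullet comes from.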

4. Pipeline parallel (PP)

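  • Pattern: model split along depth. Device 0 holds layers 1-10, device 1 layers 11-20, etc. Activations flow forward and gradients flow backward between stages. Devices sit idle while they wait on their neighbors (the pipeline "bubble"); splitting the batch into micro-batches shrinks that idle time.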
  • API: torch.distributed.pipeline.sync.Pipe (older PyTorch releases), torch.distributed.pipelining (newer releases), or library wrappers.
  • When to use: the model has many sequential layers; comm bandwidth between devices is low.
  • Communication: one send/recv per micro-batch per stage. Low volume but high latency.
  • When NOT to use: few layers, lots of bandwidth — DDP or TP dominates.
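
The cost of the bubble can be estimated with the usual idealized formula for a GPipe-style schedule, (p - 1) / (m + p - 1) for p stages and m micro-batches. This assumes every stage takes the same time per micro-batch and that communication is free; both are simplifications, and the formula is a common textbook estimate rather than something from this lab.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of an idealized GPipe-style pipeline schedule."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# With 4 stages, more micro-batches steadily shrink the idle share.
for m in (1, 4, 12, 32):
    print(f"stages=4 microbatches={m:>2} bubble={bubble_fraction(4, m):.2f}")
```

With a single micro-batch, three of four devices are idle at any moment; the fraction only becomes small once the number of micro-batches is several times the number of stages.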

End the survey with a 3-sentence "which would I pick" paragraph for the grammar MiniGPT (answer: DDP, because the model is tiny — but the question is meant to make you reason).

Part G — Write the report

experiments/25-compile/REPORT.md:

  1. The compile-match check (Part A): max-err < 1e-6.
  2. Pointer to the kernel file (Part C) and the annotation (Part D).
  3. Profile numbers (Part E) with honest interpretation (small models may not benefit).
  4. Pointer to DISTRIBUTED.md (Part F).

Deliverable

experiments/25-compile/:

  • REPORT.md
  • kernel.py or kernel.cpp: the Inductor-generated kernel.
  • KERNEL_ANNOTATION.md: your walkthrough.
  • DISTRIBUTED.md: the 1-page survey.
  • manifest.json

Acceptance

  • torch.compile'd model matches eager output within 1e-6.
  • One Inductor-generated kernel is saved and annotated.
  • The annotation correctly identifies the fused softmax pattern.
  • The distributed survey distinguishes the four patterns in 2-sentence form per pattern.

Pitfalls

  • TORCH_LOGS set after import. The env var must be set before import torch. Easiest: set it in your shell or use os.environ at the top of the script before any torch import.
  • Stale files in /tmp/torchinductor_<user>/. Inductor caches by graph hash, so a rerun may reuse files from an earlier run. Clearing the directory forces a fresh compile, which is useful for debugging.
  • First call is slow. torch.compile is JIT — first call traces and compiles. Benchmark only the warm path.
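  • Recompiles on shape change. Pass the same shape to every call when benchmarking, or compile with dynamic=True if shapes must vary. mode="reduce-overhead" targets per-call launch overhead and does not prevent shape-driven recompiles.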
  • CPU compile uses C++. Expect .cpp and .so files, not Triton. The lesson is the same — fused, generated, readable.
  • "My MiniGPT isn't ported to PyTorch yet." Use the stand-in. The point is reading Inductor output, which doesn't depend on model size.

Stretch

  • Compile the full grammar MiniGPT (decoder block + LM head). Identify a fused attention+softmax kernel. Compare to Phase 27's flash-attention preview.
  • Run with mode="max-autotune". Compare compile time and runtime to mode="default".
  • Use torch._dynamo.export to extract the FX graph as a standalone artifact.

End of Phase 25 labs. Time to write PHASE_25_REPORT.md and prep for Phase 26.

Next: Phase 26 — Quantization.