Lab 02 — Tensor-parallel inference on 2 cloud GPUs (the only spending lab)

Goal: run MiniGPT-grammar with tensor parallelism across 2 cloud GPUs. See the all-reduce in NCCL. Measure speedup (expect slowdown!). Spend ≤ $3.

Estimated time: 3–4 hours wall, ≤ 3 hours of actual GPU runtime.

Prereq: Labs 00 and 01 complete. BudgetGuard working. Vendor account funded with the lab's budget cap (e.g., $3 prepaid on RunPod). Phase 17 MiniGPT-grammar (trained checkpoint) ready to upload.

The only lab in the entire curriculum that spends real money. Read theory 03's bandwidth math and lab 00's checklist before spinning anything up.


What you produce

A working TP inference run at experiments/35-tp-inference/ that:

  1. Uses the BudgetGuard from lab 00 to authorize the spinup.
  2. Spins up a 2× consumer-tier GPU instance (RTX 4090 or similar, single node, ideally with NVLink; note that the RTX 4090 itself has no NVLink connector, so a 2× 4090 box communicates over PCIe).
  3. Uploads the trained MiniGPT-grammar checkpoint and the test prompt set ("he goed to school" etc.).
  4. Runs a 2-GPU TP inference loop using NCCL backend, sharding the embedding table across the 2 workers.
  5. Compares per-token latency vs single-GPU. Records the ratio.
  6. Terminates the instance. Vendor-console screenshot proves termination.
  7. Logs actual $ spent in experiments/35-cloud-budget/spend.jsonl via BudgetGuard.record_actual.

Plus a continued extension to src/minitrain/:

  • src/minitrain/tensor_parallel.py: ColumnParallelLinear, RowParallelLinear, and ShardedEmbedding classes. Educational implementations following Megatron's pattern (theory/02). Not production-grade.

TODOs

Block A — implement the TP primitives (locally first)

Before any cloud spinup, write and test src/minitrain/tensor_parallel.py on a CPU-only machine with a mocked world_size=2, using torch.multiprocessing.spawn. NCCL requires CUDA GPUs, so the local run needs a CPU-capable backend such as gloo:

# src/minitrain/tensor_parallel.py — skeleton (Borja writes the body)

import torch
import torch.distributed as dist
from torch.nn import Module, Parameter

class ColumnParallelLinear(Module):
    """Linear(in_features, out_features // world_size) per worker.
    Output is sharded along out_features; *no* comm at the output.
    Input is replicated (must match) — *no* comm at the input.
    """
    def __init__(self, in_features: int, out_features: int): ...
    def forward(self, x): ...

class RowParallelLinear(Module):
    """Linear(in_features // world_size, out_features) per worker.
    Input is sharded along in_features; output is full but partial-sum.
    Forward includes one all-reduce.
    """
    def __init__(self, in_features: int, out_features: int): ...
    def forward(self, x): ...

class ShardedEmbedding(Module):
    """nn.Embedding sharded by vocab_size // world_size.
    Lookup is local-only for hits in the worker's slice; an all-reduce
    combines partial outputs.
    """
    def __init__(self, vocab_size: int, embedding_dim: int): ...
    def forward(self, ids): ...
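
The masking-plus-all-reduce idea in the ShardedEmbedding docstring can be sketched in plain Python, with no torch dependency. This is an illustration only, not the lab's implementation: `sharded_lookup` and `all_reduce_sum` are hypothetical names, and the list-based sum merely stands in for `dist.all_reduce`.

```python
def sharded_lookup(ids, table_shard, vocab_start):
    """One worker's partial output: its own rows for ids it owns, zeros otherwise."""
    dim = len(table_shard[0])
    vocab_end = vocab_start + len(table_shard)
    out = []
    for token_id in ids:
        if vocab_start <= token_id < vocab_end:
            out.append(list(table_shard[token_id - vocab_start]))
        else:
            out.append([0.0] * dim)  # not in this worker's slice
    return out


def all_reduce_sum(partials):
    """Element-wise sum across workers (stand-in for an all-reduce with SUM)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


# Toy table: vocab_size=6, embedding_dim=3, split across world_size=2.
table = [[float(10 * i + d) for d in range(3)] for i in range(6)]
shards = [table[0:3], table[3:6]]
ids = [4, 0, 5, 2]

partials = [sharded_lookup(ids, shards[r], vocab_start=3 * r) for r in range(2)]
combined = all_reduce_sum(partials)
expected = [table[i] for i in ids]
assert combined == expected
```

Because every id falls in exactly one worker's slice, each output row is one real embedding plus zeros, so the sum reproduces the unsharded lookup exactly.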

Tests in tests/minitrain/test_tensor_parallel.py:

  • test_column_parallel_matches_single_gpu — TP linear output (after concatenating shards) matches non-TP linear within 1e-4.
  • test_row_parallel_matches_single_gpu — same.
  • test_sharded_embedding_lookup — sharded embedding output matches non-sharded.
  • test_mlp_pattern_minimal: ColumnParallel → GELU → RowParallel with one all-reduce matches single-GPU MLP.

These tests run on the local CPU before any cloud spend.
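
The identity these tests check can be previewed without torch or any process group. The sketch below is an assumption-laden toy: it uses plain lists, stores weights as (in × out) matrices so that y = x·W (PyTorch's nn.Linear stores the transpose), and simulates the two ranks in a loop. All helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(v):
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def apply_gelu(m):
    return [[gelu(v) for v in row] for row in m]


def split_cols(m, parts):
    """Column-parallel sharding: each rank keeps a block of output columns."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Row-parallel sharding: each rank keeps a block of input rows."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


WORLD_SIZE = 2
x = [[1.0, -2.0, 0.5, 3.0]]                                   # 1 token, d_model = 4
w1 = [[0.1 * (i + j) for j in range(8)] for i in range(4)]    # 4 x 8
w2 = [[0.05 * (i - j) for j in range(4)] for i in range(8)]   # 8 x 4

# Single-device reference MLP.
reference = matmul(apply_gelu(matmul(x, w1)), w2)

# Tensor-parallel version: each simulated rank uses only its own shards.
w1_shards = split_cols(w1, WORLD_SIZE)
w2_shards = split_rows(w2, WORLD_SIZE)
partials = [
    matmul(apply_gelu(matmul(x, w1_shards[r])), w2_shards[r])
    for r in range(WORLD_SIZE)
]

# The one all-reduce: sum the partial outputs across ranks.
tp_output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

max_err = max(abs(a - b) for ra, rb in zip(reference, tp_output) for a, b in zip(ra, rb))
assert max_err < 1e-4
```

GELU is element-wise, so it can run on each column shard before any communication; that is why the pattern needs only the single sum at the end.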

Block B — prepare the cloud experiment script

experiments/35-tp-inference/run.py:

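# Skeleton (Borja writes the body). The helpers called below (init_distributed,
# build_minigpt_grammar_tp, load_checkpoint, run_single_gpu_baseline,
# run_tp_inference, save_results, cleanup) are still to be written.

import torch.distributed as dist

def main():
    rank, world_size = init_distributed("nccl")
    assert world_size == 2, "This lab is 2-GPU only."

    model = build_minigpt_grammar_tp(rank, world_size)   # TP-sharded version
    load_checkpoint(model, "checkpoint-trained.pt", strict_shard_aware=True)

    prompts = ["he goed to school", "i has eat", "she dont like apples", ...]

    # Baseline on rank 0 only; rank 1 waits at the barrier.
    times_single = run_single_gpu_baseline(prompts) if rank == 0 else None
    dist.barrier()

    times_tp = run_tp_inference(model, prompts)

    if rank == 0:
        save_results(times_single, times_tp, "results.json")

    cleanup()

if __name__ == "__main__":
    main()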

The single-GPU baseline runs only on rank 0 (other rank idles) so the comparison is apples-to-apples.
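
For the comparison to stay apples-to-apples, both runs should be timed by the same harness. One possible shape is sketched below; `time_per_token_ms` is a hypothetical helper, not part of the skeleton. CUDA kernels launch asynchronously, so on a GPU the `sync` argument would need to be something like `torch.cuda.synchronize`, otherwise the clock stops before the work finishes.

```python
import time


def time_per_token_ms(step_fn, n_tokens, sync=lambda: None):
    """Mean wall-clock milliseconds per call of step_fn over n_tokens calls.

    sync runs before the clock starts and before it stops. On a GPU, pass a
    function that waits for queued kernels so they are included in the timing.
    """
    sync()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step_fn()
    sync()
    return (time.perf_counter() - start) * 1000.0 / n_tokens


# CPU-only demonstration with a dummy "decode step".
calls = []
latency_ms = time_per_token_ms(lambda: calls.append(sum(range(1000))), n_tokens=100)
```

Passing the same `step_fn` signature for the baseline and the TP loop keeps the measurement overhead identical in both numbers.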

Block C — pre-flight (mandatory)

Before runpodctl create:

  1. BudgetGuard.remaining reads ≥ $3.50 (cap + buffer).
  2. Vendor console: 80% budget alert set; auto-termination at 4 hours set.
  3. BudgetGuard.authorize("lab02-tp-spinup-rtx4090x2", 3.00) returns without raising.
  4. Prepare experiments/35-tp-inference/instance.json so the instance ID and start timestamp can be captured the moment the instance is created.
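
The budget checks in steps 1 and 3 can be wrapped in a single gate. The sketch below uses a hypothetical `FakeBudgetGuard` so it runs standalone; the real BudgetGuard comes from lab 00, and treating `remaining` as a property is an assumption about its interface. Step 2 is a vendor-console setting and cannot be checked from here.

```python
class BudgetGuardExceeded(Exception):
    """Stand-in for the exception named in this lab."""


class FakeBudgetGuard:
    """Hypothetical stand-in; the real BudgetGuard comes from lab 00."""

    def __init__(self, funded_usd):
        self.funded_usd = funded_usd
        self.authorized = {}

    @property
    def remaining(self):
        return self.funded_usd - sum(self.authorized.values())

    def authorize(self, label, amount_usd):
        if amount_usd > self.remaining:
            raise BudgetGuardExceeded(f"{label}: {amount_usd:.2f} > {self.remaining:.2f}")
        self.authorized[label] = amount_usd


def preflight(guard, label="lab02-tp-spinup-rtx4090x2", cap_usd=3.00, buffer_usd=0.50):
    """Return True only if the budget checks in steps 1 and 3 pass."""
    if guard.remaining < cap_usd + buffer_usd:
        return False
    guard.authorize(label, cap_usd)  # may raise BudgetGuardExceeded; do not catch it
    return True


ok_guard = FakeBudgetGuard(funded_usd=3.50)
ready = preflight(ok_guard)
```

Returning False for an under-funded guard, instead of raising, keeps "not enough buffer" distinct from "authorization refused".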

Block D — run

  1. runpodctl create ... (or vendor equivalent). Wait for instance ready.
  2. scp the checkpoint + script up.
  3. ssh <instance> "torchrun --nproc-per-node=2 experiments/35-tp-inference/run.py".
  4. Pull results.json back.
  5. runpodctl terminate <instance-id>. Verify termination in vendor console; screenshot.
  6. BudgetGuard.record_actual("lab02-tp-spinup-rtx4090x2", <actual $>) using vendor-console-reported actual.
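
If the steps above are driven from a script, a try/finally wrapper is one way to make termination unconditional. This is a sketch under assumptions: `create`, `run`, and `terminate` are hypothetical callables you would implement around your vendor's tooling, and the demonstration uses fakes, so no cloud calls are made.

```python
def run_with_guaranteed_terminate(create, run, terminate):
    """Create an instance, run the job, and terminate even if the job raises."""
    instance_id = create()
    try:
        return run(instance_id)
    finally:
        terminate(instance_id)


# Demonstration with fake callables.
events = []


def fake_create():
    events.append("create")
    return "instance-123"  # made-up ID


def failing_run(instance_id):
    events.append("run")
    raise RuntimeError("simulated job failure")


def fake_terminate(instance_id):
    events.append(f"terminate:{instance_id}")


try:
    run_with_guaranteed_terminate(fake_create, failing_run, fake_terminate)
except RuntimeError:
    pass  # the failure still surfaces, but only after termination ran
```

A wrapper like this does not replace step 5's manual verification: the vendor console and the screenshot remain the proof of termination.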

Block E — analyze

Compute and record in manifest.json:

  • Single-GPU per-token latency (mean over 100 tokens × 10 prompts).
  • 2-GPU TP per-token latency.
  • Speedup = single / TP (expected: <1, i.e., a slowdown).
  • All-reduce wall time per token (from inside the TP loop).
  • Comm/compute ratio.
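
A minimal sketch of how these numbers might be derived from raw timings follows. The function name and the definition of the comm/compute ratio (all-reduce time divided by the remaining TP time) are assumptions; the sample timings are made up to exercise the code and are not measurements.

```python
from statistics import mean


def summarize(single_ms, tp_ms, allreduce_ms):
    """Reduce per-token timings (milliseconds) to the Block E numbers."""
    single = mean(single_ms)
    tp = mean(tp_ms)
    comm = mean(allreduce_ms)
    compute = tp - comm  # assumption: TP time = compute + all-reduce
    return {
        "single_gpu_per_token_ms": single,
        "tp_per_token_ms": tp,
        "speedup": single / tp,  # < 1 means TP is slower
        "allreduce_per_token_ms": comm,
        "comm_compute_ratio": comm / compute,
    }


# Made-up timings purely to exercise the function.
result = summarize(
    single_ms=[1.0, 1.0, 1.0, 1.0],
    tp_ms=[2.0, 2.0, 2.0, 2.0],
    allreduce_ms=[0.5, 0.5, 0.5, 0.5],
)
```

The first three keys deliberately match the field names in the manifest so the dictionary can be merged into it directly.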

Write a one-paragraph "why 2-GPU TP is slower than single-GPU here" reflection in the manifest. Expected reasoning: model is tiny (~500k params), all-reduce volume per token is small but non-zero, NVLink, where present, helps but cannot make a 2-GPU TP run faster than a 1-GPU run for a model that wasn't memory-bound on a single GPU. The lesson generalizes: TP costs you latency at small scale; it buys you headroom at large scale.

Block F — scaling curve (synthetic)

The 2-GPU TP slowdown is expected for the grammar tutor. To make the lesson concrete, generate the forward-looking plot:

  • Compute predicted single-GPU latency for hypothetical models at \(d_{\text{model}} = 64, 256, 1024, 4096\) with the same TP pattern.
  • Compute predicted 2-GPU TP latency using the all-reduce bandwidth math from theory/03.
  • Plot crossover: at which \(d_{\text{model}}\) does TP become faster?

Commit experiments/35-tp-inference/scaling-curve.png with the crossover annotated. Expected: crossover somewhere between \(d_{\text{model}} = 1024\) and \(2048\) on the test hardware.
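
One way to produce the predicted curves is a simple latency-plus-bandwidth model. The sketch below is not the formula from theory/03, which is not reproduced here. Every constant is an assumption to be replaced with your hardware's numbers, and the model assumes decoding is bound by reading roughly 12·d² weights per layer, with two all-reduces per layer.

```python
BYTES_PER_PARAM = 2            # fp16 weights and activations (assumption)
MEM_BW = 1.0e12                # bytes/s of GPU memory bandwidth (assumption)
LINK_BW = 25.0e9               # bytes/s between the two GPUs (assumption)
ALLREDUCE_LATENCY_S = 10e-6    # fixed cost per all-reduce (assumption)
N_LAYERS = 4                   # assumption
ALLREDUCES_PER_LAYER = 2       # one after attention, one after the MLP


def single_gpu_latency(d_model):
    """Per-token seconds if decoding is bound by reading ~12*d^2 weights per layer."""
    weight_bytes = 12 * d_model ** 2 * BYTES_PER_PARAM
    return N_LAYERS * weight_bytes / MEM_BW


def tp_latency(d_model, world_size=2):
    """Per-token seconds with weights split across ranks plus all-reduce cost."""
    compute = single_gpu_latency(d_model) / world_size
    payload = d_model * BYTES_PER_PARAM                     # one token's activations
    traffic = 2 * (world_size - 1) / world_size * payload   # ring all-reduce volume
    comm = N_LAYERS * ALLREDUCES_PER_LAYER * (ALLREDUCE_LATENCY_S + traffic / LINK_BW)
    return compute + comm


candidates = [64, 128, 256, 512, 1024, 2048, 4096]
speedups = {d: single_gpu_latency(d) / tp_latency(d) for d in candidates}
crossover = next(d for d in candidates if speedups[d] > 1.0)
```

With these particular made-up constants, the fixed all-reduce latency dominates at small widths and the first power of two where TP wins is 2048. Different constants move the crossover, which is the point of annotating it on the plot.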

Block G — manifest

experiments/35-tp-inference/manifest.json:

{
  "seed": 35200,
  "lab": "02-tp-inference-cloud",
  "vendor": "<chosen>",
  "instance_type": "<e.g., 2x RTX 4090>",
  "nvlink_present": true,
  "instance_id": "<filled in>",
  "start_ts": "<filled in>",
  "end_ts": "<filled in>",
  "duration_hours": "<filled in>",
  "estimated_cost_usd": 3.00,
  "actual_cost_usd": "<filled in>",
  "terminated_screenshot": "experiments/35-tp-inference/proof-terminated.png",
  "single_gpu_per_token_ms": "<filled in>",
  "tp_per_token_ms": "<filled in>",
  "speedup": "<filled in: should be < 1>",
  "lesson_notes": "<the why-slower paragraph>",
  "crossover_d_model": "<filled in from scaling-curve>"
}

Constraints

  • Hard $3 cap. BudgetGuard.authorize enforces it. Beyond that, BudgetGuardExceeded is raised — do not catch it.
  • Spot tier only. If on-demand is the only option, the budget math doesn't work — don't run the lab; document the constraint. Some weeks the price will move and the lab is genuinely not executable on $3. That's an acceptable outcome — write up the analysis without the cloud run, marked EDUCATIONAL_STUB.
  • Single node, 2 GPUs. Multi-node is out of scope.
  • No NCCL tuning. No NCCL_DEBUG, no NCCL_IB_DISABLE. Defaults. The point is seeing the comm, not optimizing it.
  • Terminate before you commit. If the lab fails partway through, terminate the instance first, then debug locally.
  • Two-hour wall-clock cap. Beyond 2 hours of GPU runtime, terminate and analyze whatever you have. The cost-vs-learning rate has already tipped.

Stop conditions

You're done when:

  1. experiments/35-tp-inference/manifest.json has actual_cost_usd filled in and ≤ $3.00.
  2. proof-terminated.png shows the vendor console with the instance terminated.
  3. BudgetGuard.total_spent matches the manifest's actual cost (within rounding).
  4. The "why slower" paragraph and the crossover-curve plot are committed.
  5. src/minitrain/tensor_parallel.py tests pass on the local CPU mock.
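
Conditions 1 and 3 are mechanical enough to check in code before declaring the lab done. A minimal sketch, assuming the manifest has been loaded into a dict and that "within rounding" means one cent; `check_stop_conditions` is a hypothetical helper and the sample values are made up.

```python
def check_stop_conditions(manifest, total_spent, cap_usd=3.00, tol=0.01):
    """Return a list of problems; an empty list means conditions 1 and 3 hold."""
    problems = []
    actual = manifest.get("actual_cost_usd")
    if not isinstance(actual, (int, float)):
        problems.append("actual_cost_usd is not filled in")
        return problems
    if actual > cap_usd:
        problems.append(f"actual_cost_usd {actual:.2f} exceeds cap {cap_usd:.2f}")
    if abs(actual - total_spent) > tol:
        problems.append("BudgetGuard total does not match manifest")
    return problems


# Made-up values for illustration.
done = check_stop_conditions({"actual_cost_usd": 2.10}, total_spent=2.10)
pending = check_stop_conditions({"actual_cost_usd": "<filled in>"}, total_spent=0.0)
```

The screenshot, the reflection paragraph, and the plot still need a human check.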

Hint of last resort

If anything goes wrong: terminate first, debug second. A forgotten 2-GPU instance burns ~$0.70/hr = ~$17/day. If you wake up to "did I terminate?", terminate now, check later.
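
The burn-rate arithmetic is worth redoing with the price you are actually quoted. A small sketch, treating the $0.70/hr figure as illustrative, not as a current vendor price:

```python
HOURLY_RATE_USD = 0.70  # illustrative rate from the text; check the vendor's current price
CAP_USD = 3.00


def cost_usd(hours, hourly_rate=HOURLY_RATE_USD):
    """Cost of leaving the instance running for the given number of hours."""
    return hours * hourly_rate


forgotten_day = cost_usd(24)                  # a full day left running
hours_within_cap = CAP_USD / HOURLY_RATE_USD  # runtime the cap can pay for
```

At that rate a forgotten day costs about $16.80, and the 4-hour auto-termination from Block C comes to about $2.80, just under the cap.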

If the cloud run wedges and you can't ssh to terminate: use the vendor web console to terminate. Bookmark the terminate page before you start.

If the lab's analysis comes out garbled (e.g., the 2-GPU run mysteriously faster than single-GPU on a 500k-param model — that would be unphysical): your single-GPU baseline probably wasn't really single-GPU. Check CUDA_VISIBLE_DEVICES=0 was set for the baseline.

When to consult solutions/

After committing the experiment. Solution lives in solutions/02-tp-inference-ref.md — written at phase open after Borja's Phase 25 PyTorch internals are in. Solution intentionally does not include actual cloud-cost numbers — those depend on vendor pricing the day of execution; the solution explains the expected shape (TP slower than single-GPU at this scale) and the crossover math, not the dollar amount.


Next lab: lab/03-megatron-fsdp-reading.md.