
Lab 00 — Cloud budget, vendor survey, and budget guard

Goal: before you touch any cloud GPU, write down what you will spend and enforce it programmatically. Phase 35 is the only phase with real cloud cost. Get the discipline right first.

Estimated time: 60–90 minutes.

Prereq: nothing from Phase 35 yet. Borja has read theory 00–04.

What you produce

A single committed directory at experiments/35-cloud-budget/ containing:

vendor-survey.md — pricing snapshot at lab time, three vendors compared, recommended pick with one-paragraph justification.
budget.md — the $5 hard ceiling, broken into per-lab caps with buffer.
pre-flight-checklist.md — the routine to run before every cloud spinup (budget check, instance type confirmation, time-cap alarm, termination plan).
manifest.json — {seed, lab, hardware_target, vendor, currency, ceiling_usd} for reproducibility.

Plus a small piece of code added to the curriculum's existing src/minitrain/ package (per the A12 plan, Phase 35 does not introduce a new top-level module):

src/minitrain/budget_guard.py — a BudgetGuard class that reads LYNX_CLOUD_BUDGET_USD (env var) and a running spend log at experiments/35-cloud-budget/spend.jsonl, refusing to launch any operation whose estimated cost would breach the ceiling.

The guard is not a cloud-provider SDK; it is a local enforcement wrapper around subprocess.run(["runpodctl", ...]) or an equivalent launch command. It logs the estimated cost on the way in and the actual cost on the way out.

TODOs

Block A — vendor survey

Pick three plausible vendors for a 2-GPU short-burst job and snapshot today's prices:

RunPod (https://www.runpod.io) — spot 2× consumer or workstation GPUs (RTX 4090, RTX 6000 Ada) — record per-GPU $/hr.
Lambda Labs (https://lambdalabs.com) — on-demand 2× A100 or 2× A6000 — record per-GPU $/hr.
Vast.ai (https://vast.ai) — community spot 2× consumer GPUs — record per-GPU $/hr.

For each vendor, record: per-GPU hourly cost, available memory, network-bandwidth notes, and whether NVLink is present between the 2 GPUs in the same node. NVLink matters for tensor parallelism (TP); see theory file 03's bandwidth math.

Recommendation rule: pick the cheapest spot tier that has NVLink between the pair if available, otherwise the cheapest spot tier with single-node 2-GPU placement guaranteed. Document the choice in one paragraph.
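The recommendation rule can be written as a small filter, which makes the pick reproducible when prices move. This is a minimal sketch and not a lab deliverable: the Offer record, its field names, and the recommend function are hypothetical, and the prices in the sample list are placeholders, not real vendor quotes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """One row of the vendor survey (hypothetical shape)."""
    vendor: str
    gpu: str
    usd_per_gpu_hr: float
    spot: bool
    nvlink: bool            # NVLink between the two GPUs in the node
    single_node_pair: bool  # vendor guarantees both GPUs sit in one node


def recommend(offers: list[Offer]) -> Optional[Offer]:
    """Cheapest spot offer with NVLink, else cheapest spot offer with a guaranteed single-node pair."""
    spot = [o for o in offers if o.spot]
    with_nvlink = [o for o in spot if o.nvlink]
    if with_nvlink:
        return min(with_nvlink, key=lambda o: o.usd_per_gpu_hr)
    same_node = [o for o in spot if o.single_node_pair]
    if same_node:
        return min(same_node, key=lambda o: o.usd_per_gpu_hr)
    return None  # nothing satisfies the rule; revisit the survey


# Placeholder data for illustration only. These are NOT real prices.
offers = [
    Offer("vendor-a", "gpu-x", 0.35, spot=True, nvlink=False, single_node_pair=True),
    Offer("vendor-b", "gpu-y", 1.10, spot=False, nvlink=True, single_node_pair=True),
    Offer("vendor-c", "gpu-x", 0.30, spot=True, nvlink=False, single_node_pair=False),
]
print(recommend(offers))
```

With the placeholder data, no spot offer has NVLink, so the rule falls through to the cheapest spot offer that guarantees single-node placement. The one-paragraph justification in vendor-survey.md still has to be written by hand.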

Block B — budget breakdown

Allocate the $5 ceiling across the Phase 35 budget lines (lab 02 is the only lab that spins up cloud hardware):

Lab 02 (TP inference, 2-GPU)        ≤ $3.00
Lab 03 (Megatron+FSDP reading)      = $0.00 (no cloud)
Buffer for retries / setup tax      ≤ $2.00
─────────────────────────────────────────────
Total Phase 35 cloud budget         ≤ $5.00

Write this in budget.md. Include the reasoning, for example: "$3 for lab 02 = 2× RTX 4090 at $0.35/hr each × 3 hours = $2.10; rounded up to $3, which leaves roughly 40% buffer for setup tax, container pulls, and retries."
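The arithmetic behind the lab 02 cap is quick to check in code. The sketch below assumes the placeholder rate of $0.35/hr per GPU quoted above; lab_estimate and the caps dictionary are hypothetical names used only for this illustration.

```python
def lab_estimate(gpus: int, usd_per_gpu_hr: float, hours: float) -> float:
    """Estimated cost of one run: GPUs x hourly rate x wall-clock hours."""
    return gpus * usd_per_gpu_hr * hours


# Assumed placeholder rate, not a live price.
estimate = lab_estimate(gpus=2, usd_per_gpu_hr=0.35, hours=3)
cap = 3.00
buffer_fraction = (cap - estimate) / estimate

# Per-line caps from the budget table; they should add up to the ceiling.
caps = {"lab02": 3.00, "lab03": 0.00, "buffer": 2.00}
ceiling = 5.00

print(f"estimate=${estimate:.2f} cap=${cap:.2f} buffer={buffer_fraction:.0%}")
print(f"caps total=${sum(caps.values()):.2f} ceiling=${ceiling:.2f}")
```

With these inputs the margin over the $2.10 estimate comes out a little above 40%, which is the figure the reasoning sentence rounds to.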

Block C — implement `BudgetGuard`

# src/minitrain/budget_guard.py — skeleton (Borja writes the body)

import json
from pathlib import Path
import os

class BudgetGuardExceeded(Exception):
    """Raised when an operation would exceed the budget."""

class BudgetGuard:
    def __init__(self, ceiling_usd: float, log_path: Path):
        ...

    def authorize(self, op_label: str, estimated_usd: float) -> None:
        """Raises BudgetGuardExceeded if estimated total > ceiling. Logs intent."""
        ...

    def record_actual(self, op_label: str, actual_usd: float) -> None:
        """Append actual spend after the operation finishes."""
        ...

    @property
    def total_spent(self) -> float:
        ...

    @property
    def remaining(self) -> float:
        ...

Constraints:

No network calls in this module. Cloud-provider integration is out of scope for the guard itself; the guard only does bookkeeping. The user invokes it manually before and after runpodctl create ....
Append-only log. experiments/35-cloud-budget/spend.jsonl is the source of truth. Each line is a JSON object with the fields {ts, op_label, estimated, actual?, currency}; actual is optional. Never edit or delete existing lines.
Environment override. If LYNX_CLOUD_BUDGET_USD is set, it overrides the constructor argument, which lets you tighten the ceiling without touching code.
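Two of these constraints are mechanical enough to sketch in isolation: the environment override and the append-only log line. The helpers below are one possible shape under stated assumptions, not the reference solution. resolve_ceiling and append_record are hypothetical names, the ts format (UTC ISO 8601) is an assumption, and treating an empty variable as unset is a design choice the source does not specify.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

ENV_VAR = "LYNX_CLOUD_BUDGET_USD"


def resolve_ceiling(constructor_value: float) -> float:
    """Return the env-var ceiling if set, else the constructor argument."""
    raw = os.environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return constructor_value  # assumption: empty counts as unset
    return float(raw)  # a malformed value raises ValueError on purpose


def append_record(
    log_path: Path,
    op_label: str,
    estimated: float,
    actual: Optional[float] = None,
    currency: str = "USD",
) -> dict:
    """Append one JSON line to the spend log; existing lines are never rewritten."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "op_label": op_label,
        "estimated": estimated,
        "currency": currency,
    }
    if actual is not None:
        record["actual"] = actual  # optional field, per the log format
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:  # "a" = append-only
        fh.write(json.dumps(record) + "\n")
    return record
```

Opening the file in append mode keeps the log write path from ever touching earlier lines. Neither helper makes a network call, which keeps them within the first constraint.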

Block D — pre-flight checklist

Write pre-flight-checklist.md listing the steps to run every time before spinning up a cloud instance:

Check vendor console: any other instances running? (If yes, you forgot to terminate. Terminate them.)
Set cloud-console budget alert at 80% of the lab cap.
Set instance auto-termination at 4 hours (vendor-side, in addition to your own guard).
Confirm that BudgetGuard.remaining is at least the estimated cost plus 50% headroom (remaining ≥ 1.5 × estimated_cost).
Note the exact start timestamp; you'll subtract it from the end timestamp to compute actual cost.
After the lab: confirm instance terminated via vendor console screenshot; commit the screenshot to experiments/35-tp-inference/proof-terminated.png.
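The headroom check and the timestamp subtraction in this checklist reduce to two short functions. This is a sketch under assumptions: has_headroom reads "estimated cost plus 50% headroom" as 1.5 × the estimate, actual_cost assumes billing is linear in wall-clock time (real vendors may round up to a billing increment), and both function names are hypothetical.

```python
from datetime import datetime, timezone


def has_headroom(remaining_usd: float, estimated_usd: float, headroom: float = 0.50) -> bool:
    """True if the remaining budget covers the estimate plus the headroom fraction."""
    return remaining_usd >= estimated_usd * (1 + headroom)


def actual_cost(start: datetime, end: datetime, gpus: int, usd_per_gpu_hr: float) -> float:
    """Cost from recorded start/end timestamps, assuming linear per-hour billing."""
    hours = (end - start).total_seconds() / 3600
    return gpus * usd_per_gpu_hr * hours


# Made-up timestamps and the placeholder $0.35/hr rate, for illustration only.
start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)

print(has_headroom(remaining_usd=5.00, estimated_usd=3.00))  # True: 5.00 >= 4.50
print(f"${actual_cost(start, end, gpus=2, usd_per_gpu_hr=0.35):.2f}")
```

Recording timestamps in UTC avoids off-by-an-hour errors when the laptop and the vendor console disagree about time zones.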

Block E — write the manifest

manifest.json:

{
  "seed": 35000,
  "lab": "00-cloud-budget-and-tooling",
  "hardware_target": "to be filled in lab 02",
  "vendor": "<chosen>",
  "currency": "USD",
  "ceiling_usd": 5.00,
  "ceiling_per_lab": {"02": 3.00, "buffer": 2.00},
  "spend_log_path": "experiments/35-cloud-budget/spend.jsonl",
  "vendor_snapshot_date": "<lab date>",
  "vendor_snapshot_prices": {"<vendor>": "<per-gpu $/hr>"}
}
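A quick consistency check can catch a manifest whose per-lab caps no longer add up to the ceiling. The sketch below is optional and makes two assumptions that the lab text does not state as rules: that the six keys listed under "What you produce" are required, and that ceiling_per_lab should sum exactly to ceiling_usd. check_manifest is a hypothetical helper name.

```python
import json

REQUIRED_KEYS = ("seed", "lab", "hardware_target", "vendor", "currency", "ceiling_usd")


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means no problem was found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]
    per_lab = manifest.get("ceiling_per_lab")
    ceiling = manifest.get("ceiling_usd")
    if per_lab is not None and ceiling is not None:
        total = sum(per_lab.values())
        if abs(total - ceiling) > 1e-9:
            problems.append(f"ceiling_per_lab sums to {total:.2f}, expected {ceiling:.2f}")
    return problems


# The manifest template from this block, with its placeholders left in place.
example = json.loads("""
{
  "seed": 35000,
  "lab": "00-cloud-budget-and-tooling",
  "hardware_target": "to be filled in lab 02",
  "vendor": "<chosen>",
  "currency": "USD",
  "ceiling_usd": 5.00,
  "ceiling_per_lab": {"02": 3.00, "buffer": 2.00},
  "spend_log_path": "experiments/35-cloud-budget/spend.jsonl",
  "vendor_snapshot_date": "<lab date>",
  "vendor_snapshot_prices": {"<vendor>": "<per-gpu $/hr>"}
}
""")
print(check_manifest(example))
```

Returning a list of problems rather than raising on the first one makes it easier to fix everything in a single pass before committing.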

Constraints

No cloud spinup in this lab. This is the preparation lab. Spinup happens in lab 02 only.
Money is the boundary. If at any point in writing the budget you find yourself thinking "actually $5 is too tight, let's make it $20", stop and reread theory/00-motivation.md's budget paragraph. The point is to learn distributed parallelism on a learner budget. If you genuinely need a bigger budget, that's a curriculum-level discussion with the wider team.
Spot-tier preemption. Spot instances can be reclaimed by the vendor on short notice (plan for about 30 s; the exact window varies by vendor). Lab 02's experiment must therefore save partial results after each measurement so that a preemption doesn't waste the run.
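One way to meet the preemption constraint is to persist every measurement the moment it is taken and to tolerate a truncated final line when reading the file back. This is a sketch for lab 02 planning, not a prescribed design: save_measurement and load_measurements are hypothetical names, and the approach only helps if the file lives on storage that outlives the instance (an assumption that depends on the vendor's volume options).

```python
import json
import os
from pathlib import Path


def save_measurement(path: Path, measurement: dict) -> None:
    """Append one measurement and force it to disk before returning."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(measurement) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # do not leave the line sitting in an OS buffer


def load_measurements(path: Path) -> list[dict]:
    """Read back complete measurements, stopping at a line cut off by preemption."""
    if not path.exists():
        return []
    results = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            break  # partial write at the tail; everything before it is usable
    return results
```

On restart, the run can call load_measurements first and skip any configuration that already has a recorded result.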

Stop conditions

You're done when:

experiments/35-cloud-budget/{vendor-survey.md,budget.md,pre-flight-checklist.md,manifest.json} all exist.
src/minitrain/budget_guard.py exists, has tests under tests/minitrain/test_budget_guard.py, and pytest tests/minitrain/test_budget_guard.py passes (with BUDGET_GUARD_TEST_MODE=1 to disable file writes).
In a quick REPL test on a fresh BudgetGuard instance with the $5 ceiling, guard.authorize("lab02-tp-spinup", 3.00) succeeds and guard.authorize("hypothetical-lab", 10.00) raises BudgetGuardExceeded.
The pre-flight checklist reads as something a tired person could execute mechanically.

Hint of last resort

If after 60 minutes you're still stuck on vendor pricing (prices move week to week, so this lab will eventually go stale), use a vendor-agnostic placeholder: "Assume 2× consumer GPU at $0.35/hr per GPU = $0.70/hr total. Lab 02 runs ≤ 3 hours = $2.10. Lab 02 cap: $3.00, which leaves roughly 40% buffer." Note the assumption explicitly in vendor-survey.md, then move on.

When to consult `solutions/`

Consult it only after you have committed the budget files and BudgetGuard. The reference solution lives in solutions/00-cloud-budget-ref.md and was written at phase open with the vendor pricing current at that time. Compare; don't pre-read.

Next lab: lab/01-ddp-on-cpu.md.