Skip to content

English · Español

Lab 00 — Provision a Cloud GPU End-to-End

Goal: rent a GPU, SSH in, run nvidia-smi, run a one-line cuBLAS benchmark, terminate, all in under 30 minutes. Document the workflow that all subsequent labs use.

Estimated time: 30–60 minutes (first time), 5 minutes (thereafter).

Prereq: cloud platform decision resolved per PHASE_23_PLAN.md §7.a. Account created on chosen provider. Payment method on file. Budget cap configured.


What you produce

A directory experiments/23-provisioning/ containing:

  • provisioning_log.md — step-by-step what you did, with timestamps. Future-you's recipe.
  • nvidia-smi.txt — captured nvidia-smi output from your rented instance.
  • cublas_smoke_test.json — output of a one-line "this thing works" benchmark.
  • cost_log.md — start time, stop time, hourly rate, total cost.
  • manifest.json.

TODOs

Block A — pre-flight (5 min)

  • Confirm PHASE_23_PLAN.md §7.a resolved. Open the chosen provider's console.
  • If using SSH-based provider (RunPod / Vast / Lambda): generate an SSH key locally if you don't have one (ssh-keygen -t ed25519 -C "lynx-cortex-phase23"). Upload public key to provider. Do not commit private key.
  • Set a billing alert / spend cap at your monthly budget per learners/borja/profile.md (€50–80). Most providers support this.

Block B — launch (5 min)

  • Pick the smallest reasonable GPU. For Phase 23, an RTX 3090 (24 GB) or A10 (24 GB) is plenty. Spot pricing if available.
  • Pick the smallest reasonable region (lowest latency to you; Borja is in Spain — prefer EU).
  • Use a "CUDA pre-installed Ubuntu 22.04" template. Most providers have this. Do not install drivers yourself.
  • Launch. Note the start time in cost_log.md.

Block C — connect (5 min)

  • SSH in. Confirm: lsb_release -a (Ubuntu 22.04), nvidia-smi (driver + GPU detected).
  • Pipe nvidia-smi output to nvidia-smi.txt, scp back to your machine.
  • git clone git@github.com:borjatarraso/lynx-cortex.git on the instance. The repo is private per A16, so authenticate with a deploy key (read-only is enough for Phase 23) or a short-lived PAT — no anonymous HTTPS clone.
  • cd lynx-cortex && uv sync — install the project's deps in the instance. (uv must be installed: curl -LsSf https://astral.sh/uv/install.sh | sh.)

Block D — smoke test (5 min)

  • Run a one-line cuBLAS GEMM via cupy (or PyTorch — see Plan §7.g; for Phase 23 we use cupy):
    import cupy as cp, time
    A = cp.random.randn(4096, 4096, dtype=cp.float16)
    B = cp.random.randn(4096, 4096, dtype=cp.float16)
    cp.cuda.runtime.deviceSynchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        C = A @ B
    cp.cuda.runtime.deviceSynchronize()
    t1 = time.perf_counter()
    flops = 2 * 4096**3 * 100 / (t1 - t0)
    print(f"{flops / 1e12:.1f} TFLOPS fp16")
    
  • Save the result to cublas_smoke_test.json with {tflops_measured, gpu_name, dtype, matrix_size, iters}.
  • Confirm: number should be 30–80% of vendor peak for fp16 on the rented GPU. If <10%, something is wrong — check nvidia-smi for "default" mode (vs MIG-partitioned), check CUDA version vs your cupy build.

Block E — terminate (2 min — don't skip)

  • git add -A && git commit -m "phase: 23 provisioning smoke test" from the instance. Push.
  • Note the stop time in cost_log.md. Compute total cost = (stop - start) × hourly_rate. Write it down.
  • Terminate the instance from the provider console. Confirm in the console it's gone (not "stopped" — terminated; stopped instances on some providers still bill for storage).
  • Verify your spend on the provider's dashboard.

Block F — write the recipe (10 min)

In provisioning_log.md, write a future-you recipe with:

  • Exact commands you ran (copy-pasteable).
  • Anything that surprised you (e.g., "the spot instance was reclaimed once, had to relaunch").
  • The exact hourly rate you got (spot vs on-demand).
  • An estimate of "time-from-decision-to-running-benchmark" — this is what you'll save in future sessions.

Block G — manifest

manifest.json:

{
  "experiment": "23-provisioning",
  "date": "YYYY-MM-DD",
  "cloud_provider": "<runpod | lambda | vast | colab | other>",
  "region": "<eu-west | us-east | ...>",
  "gpu_name": "<RTX 3090 | A10 | A100 | ...>",
  "gpu_vram_gib": null,
  "pricing_model": "<spot | on-demand>",
  "hourly_rate_usd": null,
  "session_minutes": null,
  "total_cost_usd": null,
  "cuda_version": null,
  "driver_version": null,
  "ubuntu_version": "22.04",
  "smoke_test_tflops_fp16": null,
  "smoke_test_fraction_of_peak": null
}

Constraints

  • Do not install drivers yourself. Use the provider's CUDA-pre-installed image.
  • Do not commit SSH private keys, API tokens, or any provider credentials. .gitignore should already cover ~/.ssh, .env, etc.
  • Do not skip termination. A forgotten instance is a $50–500/day mistake.
  • Do not run training jobs in Phase 23. This phase is measurement only. Anything that takes >5 minutes is suspicious — review and stop.

Stop conditions

Done when:

  1. All six output files committed to experiments/23-provisioning/.
  2. The instance is terminated (confirmed on dashboard).
  3. The cost log shows the actual spend for this session.
  4. The smoke test TFLOPS is in the 30–80% of peak range.
  5. manifest.json complete.

Pitfalls

  • "Stopped" vs "terminated". Stopping a RunPod or Vast instance still bills you for storage. Terminate.
  • CUDA / cupy version mismatch. If cupy-cuda12x is installed but the image has CUDA 11.8, imports fail. Install the matching cupy build: pip install cupy-cuda11x or cupy-cuda12x matching the image's CUDA.
  • nvidia-smi shows GPU but kernel launches fail. Permissions issue; ensure your user can access /dev/nvidia*. The Ubuntu CUDA template handles this but custom AMIs may not.
  • Spot instance reclaimed mid-session. Lose your shell. Re-launch and re-run. Document if it happens.
  • First cudaMemcpy is 100× slower than steady-state. Context init overhead. Don't time the first iteration.

When to consult solutions/

After all stop conditions are met. The reference at solutions/00-provision-cloud-gpu-ref.md (written at phase open) shows the exact commands the reference implementation used on RunPod with an A10, plus the cost breakdown.


Next lab: lab/01-device-query.md.