Lab 00: Machine profile
Goal: capture the measured ground truth of your machine before doing any benchmarking. Without this, every later number is uninterpretable.
Estimated time: 30–45 minutes.
What you produce
A single committed file at experiments/01-machine-profile/profile.md containing:
- CPU model, microarchitecture name, clock (base + boost), core count, thread count.
- Cache sizes (L1d, L1i, L2, L3) and associativity, line size.
- Memory: total RAM, channel count, DDR generation, DIMM speed.
- SIMD ISA support (AVX, AVX2, AVX-512, FMA).
- OS, kernel version, CPU governor mode at time of capture.
- Computed peak FLOPS (your own derivation, written out).
- Computed peak DRAM bandwidth (your own derivation, written out).
- A short paragraph: "given this machine, the machine-balance arithmetic intensity is ≈ X FLOPs/byte; kernels below this are bandwidth-bound."
This file becomes the reference every later experiment cites.
TODOs
Block A: collect raw facts
- `lscpu`: paste full output. Highlight: model name, virtualization, vendor, cache sizes.
- `lstopo --no-io` (from the `hwloc` package; install if needed): paste the topology block. This shows the L1/L2/L3 layout per core.
- `sudo dmidecode --type 17`: paste DIMM info (DDR generation, speed in MT/s, manufacturer). Without `sudo`, fall back to `lshw -class memory`.
- `cat /proc/cpuinfo | grep -E 'model name|flags' | head -2`: confirm SIMD flags.
- `cpupower frequency-info` (or `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`): record the current CPU governor.
- `uname -r`: kernel version.
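The SIMD check in the list above can also be done programmatically. The following is a minimal sketch, assuming the x86 Linux `/proc/cpuinfo` layout in which a `flags` line lists feature names separated by spaces. The function name `widest_simd` and the shortened sample text are hypothetical and not part of the lab.

```python
def widest_simd(cpuinfo_text):
    """Return (widest SIMD ISA name or None, has_fma) from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            break  # assumption: every core reports the same flags, so the first entry is enough

    if "avx512f" in flags:      # AVX-512 Foundation
        isa = "AVX-512"
    elif "avx2" in flags:
        isa = "AVX2"
    elif "avx" in flags:
        isa = "AVX"
    else:
        isa = None
    return isa, "fma" in flags


# Hypothetical, shortened sample. On a real machine, read the file instead:
#   text = open("/proc/cpuinfo").read()
sample = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
print(widest_simd(sample))
```

The result still belongs in profile.md next to the pasted raw output, so the derivation in Block B can be traced back to what the machine actually reported.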
Block B: derive peak FLOPS by hand
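From lscpu extract: base clock in GHz and physical core count (cores, not threads). From flags extract: widest SIMD ISA (AVX2 / AVX-512). Use the formula in theory/03-roofline-model.md:
π_peak [GFLOPS] = cores × clock_GHz × SIMD_fp32_per_op × FMA_factor
Where: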
- SIMD_fp32_per_op = 8 for AVX2, 16 for AVX-512.
- FMA_factor = 2 if fma is in flags (a fused multiply-add performs a multiply and an add, 2 FLOPs, in one instruction); otherwise 1.
- clock_GHz = the base clock, not boost. Boost is not sustained.
Write out the arithmetic. Don't just state the answer.
Hint: for the i5-8250U: 1.6 GHz base × 4 cores × 8 fp32/AVX2 × 2 FMA = 102.4 GFLOPS. The 3.4 GHz boost is rarely sustained because of thermal limits on U-series chips.
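The arithmetic in the hint can be written as a small function so that each factor stays visible. This is a sketch of the lab's formula as stated (one SIMD FMA per core per cycle, base clock), not a measurement. The function name `peak_gflops` and its parameter names are illustrative.

```python
def peak_gflops(cores, clock_ghz, simd_fp32_per_op, has_fma):
    """Theoretical fp32 peak in GFLOPS: cores x clock x SIMD lanes x FMA factor."""
    fma_factor = 2 if has_fma else 1  # an FMA counts as a multiply plus an add
    return cores * clock_ghz * simd_fp32_per_op * fma_factor


# Values from the i5-8250U hint: 4 cores, 1.6 GHz base, AVX2 (8 fp32 lanes), FMA present.
pi_peak = peak_gflops(cores=4, clock_ghz=1.6, simd_fp32_per_op=8, has_fma=True)
print(f"pi_peak = 4 x 1.6 GHz x 8 x 2 = {pi_peak:.1f} GFLOPS")
```

The function only checks the arithmetic. profile.md should still contain the derivation written out by hand.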
Block C: derive peak DRAM bandwidth by hand
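From dmidecode: DDR speed in MT/s (millions of transfers per second). Channel count (typically 1, 2, or 4) is not reported by lscpu; infer it from the populated slots in the dmidecode output (Locator and Bank Locator fields) together with the platform's spec.
Each DDR channel is 64 bits (8 bytes) wide, so standard DDR4-2400 delivers 2400 MT/s × 8 bytes = 19.2 GB/s per channel. Multiply by the channel count to get β_peak.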
If you don't have `dmidecode` permission, default to DDR4-2400 for the i5-8250U (the platform's nominal max). Note this assumption in the file.
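The bandwidth derivation can be sketched the same way. The example below assumes that `dmidecode --type 17` prints one `Speed: N MT/s` line per populated DIMM (the unit string may differ between dmidecode versions) and that the platform runs dual-channel. The channel count is passed in by hand, because the number of DIMMs alone does not prove it. The function names and the sample text are hypothetical.

```python
import re

BYTES_PER_TRANSFER = 8  # one DDR channel is 64 bits wide


def dimm_speeds_mts(dmidecode_text):
    """Collect the 'Speed: N MT/s' values; 'Configured ... Speed' lines are skipped."""
    pattern = r"^[ \t]*Speed:[ \t]*(\d+)[ \t]*MT/s"
    return [int(v) for v in re.findall(pattern, dmidecode_text, re.MULTILINE)]


def peak_bandwidth_gbs(mts, channels):
    """Theoretical DRAM peak in GB/s: transfers/s x bytes per transfer x channels."""
    return mts * BYTES_PER_TRANSFER * channels / 1000


# Hypothetical, shortened dmidecode excerpt with two populated DIMMs.
sample = """\
Memory Device
\tSize: 8 GB
\tType: DDR4
\tSpeed: 2400 MT/s
\tConfigured Memory Speed: 2400 MT/s
Memory Device
\tSize: 8 GB
\tType: DDR4
\tSpeed: 2400 MT/s
"""

speeds = dimm_speeds_mts(sample)
# Use the slowest DIMM; channels=2 is the dual-channel assumption stated above.
beta_peak = peak_bandwidth_gbs(min(speeds), channels=2)
print(f"beta_peak = {min(speeds)} MT/s x 8 B x 2 channels = {beta_peak:.1f} GB/s")
```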
Block D: synthesize
Write the short paragraph: machine-balance intensity, what it implies, which Phase 1 labs will measure each ceiling. Two or three sentences total.
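As a rough check on the synthesis paragraph, the machine-balance intensity is the ratio of the two peaks. The sketch below plugs in the lab's i5-8250U figures (102.4 GFLOPS and 38.4 GB/s). The function names and the example kernel intensity of 0.25 FLOPs/byte are hypothetical, chosen only for illustration.

```python
def machine_balance(pi_peak_gflops, beta_peak_gbs):
    """Machine-balance arithmetic intensity in FLOPs/byte."""
    return pi_peak_gflops / beta_peak_gbs


def bound(kernel_intensity, balance):
    """Classify a kernel by comparing its arithmetic intensity to the machine balance."""
    return "bandwidth-bound" if kernel_intensity < balance else "compute-bound"


balance = machine_balance(102.4, 38.4)
print(f"machine balance = 102.4 / 38.4 = {balance:.2f} FLOPs/byte")

# Hypothetical kernel that performs 0.25 FLOPs per byte moved.
print(bound(0.25, balance))
```

With these inputs the balance comes out near 2.67 FLOPs/byte, so the hypothetical kernel sits on the bandwidth-bound side of the roofline.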
Constraints
- No tools beyond what's listed. Don't run `peakperf` here even if it's installed: labs 01–03 will measure peaks empirically; this lab is pure derivation.
- Show your work. If the file just lists "100 GFLOPS" with no derivation, redo it.
- Mark Borja-specific values in [brackets]. If you copy this lab to a different machine later, the bracketed slots are the things that change.
Stop conditions
You're done when:
- `experiments/01-machine-profile/profile.md` exists.
- Both `π_peak` and `β_peak` are written out, with the arithmetic visible.
- The "machine-balance intensity" sentence is in the file.
- You can read your file aloud to yourself and explain every number without consulting `lscpu` again.
Hint of last resort
If you've spent more than 45 minutes and dmidecode is fighting you: hard-code DDR4-2400 dual-channel (19.2 GB/s × 2 = 38.4 GB/s for the i5-8250U platform), note the assumption, and move on. The point of lab 00 is not to fight dmidecode; it's to write down a derivation you understand.
When to consult solutions/
After you have committed profile.md. The solution lives in solutions/00-machine-profile-ref.md — written at phase open, not pre-written, because it depends on what Borja's actual machine reports. Compare; don't pre-read.
Next lab: lab/01-memcpy-bandwidth.md.