English · Español

01 — From transistor to CPU¶

🇪🇸 De abajo a arriba: transistor → puerta → ALU → pipeline → ejecución especulativa. Nivel de detalle: suficiente para entender por qué hay caches y por qué mispredict cuesta ciclos, no para diseñar un chip.

Why bottom-up¶

You can use a CPU without knowing what's inside it. But every performance number you'll meet — 3 GHz, 4 cores, 8 threads, AVX2, branch mispredict penalty 12-20 cycles — is a leaky abstraction over the physical machine. When the leak matters (it will), you need to know which abstraction broke. So: bottom-up, fast.

The pace below is deliberately aggressive. Each section is one paragraph of mental model plus one or two numbers worth memorizing. We are not building a digital logic course; we are pinning down the vocabulary that Phase 1's measurements use.

Transistor¶

A transistor is a voltage-controlled switch. You apply voltage to one terminal (gate), and current flows or doesn't flow between the other two (source / drain). That's it. Everything else in your CPU is billions of these arranged into patterns.

Two numbers to remember: - Modern transistor count on a CPU die: tens of billions (Intel's i5-8250U is ~6B; an H100 is ~80B). - Transistor switching delay: picoseconds (10⁻¹² s).

The picosecond switching time is what makes GHz clocks possible. If switching took microseconds, computers would run at kHz.

Logic gate¶

Wire transistors into patterns and you get logic gates: NAND, NOR, AND, OR, XOR, NOT. NAND alone is functionally complete — every other operation can be built from NAND. Real chips use a mix because area and power optimization rewards diversity.

A handful of NANDs make a 1-bit adder. A bunch of 1-bit adders chained make an N-bit adder. A multiplexer + adder + a few control gates makes an ALU (Arithmetic Logic Unit) — the part of the CPU that actually computes a + b or a & b.

Clock¶

Logic gates need time to settle after their inputs change. The clock is a square-wave signal (high / low / high / low) that gates when results are read. Each clock cycle, the ALU takes inputs, computes, and the result is latched on the next clock edge.

Clock period = 1 / frequency. A 3 GHz CPU has a 333 ps clock period.
"Cycles per instruction" (CPI) is the number of clock cycles a given instruction takes. Modern CPUs can do CPI < 1 (multiple instructions per cycle) thanks to pipelining.

Pipeline¶

A naive CPU would: fetch instruction, decode it, execute it, write back result — one at a time. Pipelining notices that those four stages use different parts of the chip, so while stage 1 is fetching instruction N+1, stage 2 can decode instruction N, stage 3 can execute instruction N-1, stage 4 can write back instruction N-2. Four instructions in flight at once, one finishing per cycle.

Real CPUs have 10–20 pipeline stages, not 4. The longer the pipeline, the higher the clock can go (each stage does less work, so the clock period can be shorter) and the more painful pipeline stalls become.

A stall happens when stage k+1 needs the output of stage k before stage k is done. Example: instruction N+1 needs the result of instruction N. The pipeline has to wait — "bubble" — wasting cycles.

The big stall causes: - Data hazards. N+1 depends on N's result. - Control hazards. A branch (if, loop) — until the branch is resolved, the CPU doesn't know which instruction to fetch next. - Memory hazards. N+1 reads from memory; if the value isn't in cache, the pipeline waits 100s of cycles for DRAM.

Branch prediction & speculative execution¶

Control hazards are constant — every loop, every if — so CPUs use a branch predictor: a small piece of hardware that guesses, before the branch is resolved, which way it will go, and starts fetching down the predicted path. If the guess is right (>95% of the time for typical code), the pipeline never stalls. If it's wrong, the speculatively fetched instructions are discarded — mispredict penalty, ~12-20 cycles on modern CPUs.

This is why if-heavy code can be slower than arithmetic-heavy code: every unpredictable branch is a 15-cycle bubble. Loop bodies that do mostly arithmetic with predictable iteration counts run flat-out; loop bodies full of if (data[i] > threshold) on random data are 5–10× slower because the predictor is wrong half the time.

For Phase 1's purposes: branch mispredict cost is negligible compared to a cache miss. We mention it because it's vocabulary that appears in profiling output. Stay focused on memory.

Out-of-order execution¶

Beyond predicting branches, modern CPUs reorder instructions on the fly. If instruction N+1 doesn't depend on N, but N+2 doesn't depend on either and is ready to run, the CPU can execute N+2 before N. The end result, as observed by the program, is the same as in-order execution (architectural state is the same), but the pipeline stays full.

This is wonderful for performance and terrible for predictability. Profiling out-of-order CPUs is a small industry. The takeaway for Phase 1: you cannot reason about individual instruction timings anymore — only about kernels with enough instructions for the average behavior to dominate.

Cores, threads, hyperthreading¶

A core is a full pipeline — fetch, decode, execute, write-back — plus its own L1 and L2 caches. A 4-core CPU has 4 full pipelines running in parallel.

Hyperthreading (SMT) lets one core hold two sets of architectural registers and switch between them rapidly. When one thread stalls (cache miss), the other runs. To the OS this looks like 2 cores. To peak throughput it gives ~1.2-1.4× speedup, not 2×, because the underlying execution units are shared.

Borja's i5-8250U: 4 cores, 8 threads. nproc returns 8. For CPU-bound benchmarks (Phase 1's labs), use 4 threads as the upper bound, not 8. Lab 02 will catch this.

SIMD: AVX and friends¶

A normal ADD instruction adds two scalars. A SIMD (Single Instruction Multiple Data) instruction adds vectors of scalars in one cycle. AVX2 operates on 256-bit vectors — 8 fp32 values at once. AVX-512 doubles that to 16 fp32 at once (not present on i5-8250U).

This is where the "100 GFLOPS" number comes from on a CPU: clock × cores × fp32_per_AVX × FMA_factor. For an i5-8250U:

3.4 GHz × 4 cores × 8 fp32/AVX2 × 2 (fused multiply-add counts as 2 FLOPs)
≈ 218 GFLOPS  (peak, AVX2 thermal-limited)

Sustained is lower because AVX2 workloads thermal-throttle on the U-series. Lab 03 measures the actual ceiling on this machine.

Where caches enter the picture¶

Everything above runs in fractions of a nanosecond. Main RAM is ~70 ns away. Without caches, the CPU would spend 99% of its time waiting for memory. Caches are the answer. They are why Phase 1 spends two entire theory files on the memory hierarchy.

That's theory/02-memory-hierarchy.md. Read it next.

One-paragraph recap¶

A CPU is billions of transistors arranged into ALUs and control logic, run by a clock at GHz speeds, organized into a pipeline that overlaps fetch / decode / execute / write-back across many instructions in flight. It predicts branches, executes out-of-order, runs SIMD vector instructions, and has multiple cores. Every performance optimization the curriculum touches assumes this machine.

Next: theory/02-memory-hierarchy.md.