Lab 03: GGUF-like Export and Round-Trip
Goal: hand-write a GGUF-like binary export of MiniGPT, reload it, and verify that the dequantized weights match PyTorch's fake-quant within 1e-3.
Estimated time: 4–6 hours.
Prereq: labs 00–02 committed; per-group INT4 quantization working in src/miniquant/quantize.py.
What you produce
A directory experiments/26-gguf-export/ containing:
- export.py: script that writes minigpt.gguf-lite.
- load.py: script that reads minigpt.gguf-lite back into a PyTorch module structure.
- verify.py: script that runs forward on identical inputs in the original (PyTorch fake-quant) and reloaded paths; reports max abs error per layer.
- manifest.json.
- README.md: interpretation.
You also commit src/miniquant/gguf_io.py (read+write).
The format (simplified GGUF-lite)
GGUF (the format used by llama.cpp) is a binary container. The full spec is in ggerganov/ggml's repo. Our simplified version captures the structure without the legacy tags:
HEADER:
  magic         u32 = 0x474C4654 ('GLFT' = "GGUF-LiTe")
  version       u32 = 1
  n_tensors     u32
  metadata_len  u32 = number of bytes in metadata KV
METADATA:
  metadata_len bytes of key=value strings (utf-8), newline-separated
TENSOR DESCRIPTORS (repeated n_tensors times):
  name_len      u16
  name          name_len bytes (utf-8)
  n_dims        u8
  dims          n_dims × u32
  dtype         u8 (0=F32, 1=F16, 2=Q8_per_channel, 3=Q4_per_group_64)
  offset        u64 (offset into TENSOR DATA section)
TENSOR DATA:
  (concatenated, each tensor's bytes per its dtype)
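As a rough illustration of this layout, the sketch below serializes the header, metadata, and descriptors with Python's `struct` module and parses them back. The names `build_file` and `read_index` are hypothetical helpers, not part of the lab's required API, and the payloads are assumed to be already encoded per their dtype. Offsets are taken relative to the start of TENSOR DATA, as the descriptor table states.

```python
import struct

MAGIC = 0x474C4654  # written little-endian, so the first four bytes on disk read b"TFLG"
VERSION = 1

def pack_descriptor(name: str, dims: tuple, dtype: int, offset: int) -> bytes:
    """name_len u16, name, n_dims u8, dims as u32 each, dtype u8, offset u64."""
    raw = name.encode("utf-8")
    return (struct.pack("<H", len(raw)) + raw
            + struct.pack(f"<B{len(dims)}I", len(dims), *dims)
            + struct.pack("<BQ", dtype, offset))

def build_file(metadata: dict, tensors: list) -> bytes:
    """tensors: list of (name, dims, dtype, payload_bytes), payloads already encoded."""
    meta = "\n".join(f"{k}={v}" for k, v in metadata.items()).encode("utf-8")
    header = struct.pack("<IIII", MAGIC, VERSION, len(tensors), len(meta))
    descriptors, data, offset = [], [], 0
    for name, dims, dtype, payload in tensors:
        descriptors.append(pack_descriptor(name, dims, dtype, offset))
        data.append(payload)      # queue the data; it is written after all descriptors
        offset += len(payload)    # running offset inside TENSOR DATA
    return header + meta + b"".join(descriptors) + b"".join(data)

def read_index(buf: bytes):
    """Return (metadata, descriptors, data_start) parsed from the front of buf."""
    magic, version, n_tensors, meta_len = struct.unpack_from("<IIII", buf, 0)
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a GGUF-lite v1 file")
    pos = 16
    lines = buf[pos:pos + meta_len].decode("utf-8").split("\n") if meta_len else []
    metadata = dict(line.split("=", 1) for line in lines)
    pos += meta_len
    descriptors = []
    for _ in range(n_tensors):
        (name_len,) = struct.unpack_from("<H", buf, pos)
        name = buf[pos + 2:pos + 2 + name_len].decode("utf-8")
        pos += 2 + name_len
        (n_dims,) = struct.unpack_from("<B", buf, pos)
        dims = struct.unpack_from(f"<{n_dims}I", buf, pos + 1)
        pos += 1 + 4 * n_dims
        dtype, offset = struct.unpack_from("<BQ", buf, pos)
        pos += 9
        descriptors.append((name, dims, dtype, offset))
    return metadata, descriptors, pos  # pos is the first byte of TENSOR DATA
```

A reader then finds tensor bytes at `data_start + offset`. Keeping offsets relative means they can be assigned while walking the tensors, before the total descriptor size is known.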
For Q8_per_channel and Q4_per_group_64, the tensor data layout is:
Q8_per_channel:
  scales: out × f16
  values: out × in × i8
Q4_per_group_64:
  scales: out × (in / 64) × f16
  values: out × (in / 2) × u8 (two 4-bit values packed per byte; low nibble is index 0)
The 4-bit pack: lower nibble = even-index weight (signed 4-bit, two's complement, range [-8, 7]); upper nibble = odd-index weight.
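The nibble convention can be checked on a handful of values before touching real weights. The following is a minimal pure-Python sketch; `pack_int4` and `unpack_int4` are illustrative names, and a real implementation would presumably vectorize the same bit operations over a tensor.

```python
def pack_int4(values: list) -> bytes:
    """Pack signed 4-bit ints (range [-8, 7]) two per byte, even index in the low nibble."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value out of INT4 range")
        # Masking with 0x0F keeps the two's-complement low 4 bits of each value.
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(data: bytes) -> list:
    """Inverse of pack_int4: low nibble first, then high nibble, sign-extended."""
    out = []
    for byte in data:
        for nib in (byte & 0x0F, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out

packed = pack_int4([-8, 7, -1, 0])
print(packed.hex())          # 780f
print(unpack_int4(packed))   # [-8, 7, -1, 0]
```

Here -8 becomes nibble 0x8 and 7 stays 0x7, so the first byte is 0x78; -1 becomes 0xF with a zero high nibble, giving 0x0F.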
TODOs
Block A: implement the writer
- src/miniquant/gguf_io.py: write_gguf_lite(path: str, model: nn.Module, schemes: dict[str, str]). The schemes dict says which quantization to use per parameter name (e.g. {"layers.0.mlp.fc1.weight": "q4_per_group_64"}).
- Walk the state_dict; for each tensor, choose its dtype per the schemes map; quantize if needed; write the descriptor and queue the data.
- Pack INT4 weights two-per-byte. Care: even index → low nibble, odd → high nibble. Use bit-shifts, not arithmetic.
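One possible shape for the "quantize if needed" step, for a single Q4_per_group_64 tensor, is sketched below in pure Python. Several things here are assumptions rather than the lab's definition: the helper name `encode_q4_group64`, the symmetric scale convention `max|w| / 7`, and Python's `round` (which rounds halves to even). The lab's own quantize_symmetric_per_group may use a different convention, in which case the writer should reuse it instead.

```python
import struct

GROUP = 64

def encode_q4_group64(rows: list) -> bytes:
    """rows: `out` lists of `in` floats, with `in` a multiple of 64.
    Returns all scales (f16, row-major) followed by all packed 4-bit values."""
    scales, nibbles = [], []
    for row in rows:
        if len(row) % GROUP:
            raise ValueError("in-features must be a multiple of 64")
        for g in range(0, len(row), GROUP):
            group = row[g:g + GROUP]
            # Assumed convention: symmetric quantization with scale = max|w| / 7.
            scale = max(abs(w) for w in group) / 7.0
            # Round the scale to FP16 first so quantization uses the value stored on disk.
            scale = struct.unpack("<e", struct.pack("<e", scale))[0]
            scales.append(scale)
            for w in group:
                q = round(w / scale) if scale else 0
                nibbles.append(max(-8, min(7, q)) & 0x0F)
    # Even index goes to the low nibble, odd index to the high nibble.
    packed = bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))
    return struct.pack(f"<{len(scales)}e", *scales) + packed
```

Rounding the scale to FP16 before dividing is a design choice worth considering: it makes the stored integers consistent with the scale the reader will actually see.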
Block B: implement the reader
- read_gguf_lite(path: str) -> dict[str, Tensor]. Returns a dict mapping tensor name → dequantized FP32 tensor.
- For each tensor descriptor, seek to the offset, read the right number of bytes, dequantize per dtype.
- INT4 unpack: low nibble → even index; reinterpret as signed 4-bit (subtract 16 if ≥ 8).
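The dequantization step for a Q4_per_group_64 payload might look like the sketch below. It is pure Python for clarity and returns nested lists; `decode_q4_group64` is a hypothetical helper name, and the real reader would wrap the result in a tensor with the dims from the descriptor.

```python
import struct

GROUP = 64

def decode_q4_group64(payload: bytes, out_features: int, in_features: int) -> list:
    """Read f16 scales, then packed nibbles, and return `out_features` rows of floats."""
    n_groups = in_features // GROUP
    n_scales = out_features * n_groups
    scales = struct.unpack_from(f"<{n_scales}e", payload, 0)
    packed = payload[2 * n_scales:]
    if len(packed) != out_features * in_features // 2:
        raise ValueError("payload size does not match dims")
    rows = []
    for r in range(out_features):
        row = []
        for c in range(in_features):
            byte = packed[(r * in_features + c) // 2]
            nib = (byte >> 4) if c % 2 else (byte & 0x0F)   # odd index lives in the high nibble
            q = nib - 16 if nib >= 8 else nib               # sign-extend the 4-bit value
            row.append(q * scales[r * n_groups + c // GROUP])
        rows.append(row)
    return rows

# One row, one group, scale 0.5, every byte 0x78 (low nibble -8, high nibble 7).
demo = struct.pack("<e", 0.5) + bytes([0x78]) * 32
print(decode_q4_group64(demo, 1, 64)[0][:4])   # [-4.0, 3.5, -4.0, 3.5]
```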
Block C: verify round-trip
- Run write_gguf_lite then read_gguf_lite; compare the result to the original PyTorch fake-quant output at the level of dequantized FP32 weights. Per-tensor max abs error should be ≤ 1e-6 (just round-off in scale storage as FP16).
- Run a full MiniGPT forward on a fixed input on both: original quantized model in PyTorch, and a re-built model from the reloaded weights. Layer-wise activation max abs error should be ≤ 1e-3.
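A per-tensor error report for the weight-level check could be as small as the sketch below. It assumes each tensor has been flattened to a plain list of floats (for a PyTorch tensor, `t.flatten().tolist()` does this); `max_abs_error` is an illustrative name, not something the lab prescribes.

```python
def max_abs_error(reference: dict, reloaded: dict) -> dict:
    """Map tensor name -> max |reference - reloaded| over all elements."""
    if reference.keys() != reloaded.keys():
        missing = sorted(reference.keys() ^ reloaded.keys())
        raise KeyError(f"tensor names differ: {missing}")
    report = {}
    for name, ref in reference.items():
        got = reloaded[name]
        if len(ref) != len(got):
            raise ValueError(f"{name}: element count {len(got)} != {len(ref)}")
        report[name] = max((abs(a - b) for a, b in zip(ref, got)), default=0.0)
    return report

def over_tolerance(report: dict, tol: float) -> dict:
    """Subset of the report whose error exceeds tol; empty means the check passed."""
    return {name: err for name, err in report.items() if err > tol}
```

The same two functions can serve both checks by changing `tol`: the weight-level threshold for the first bullet, the activation-level threshold for the second.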
Block D: measure size
- Bytes-on-disk of the GGUF-lite file.
- Compare against a naive pickle of the same model (the PyTorch torch.save baseline).
- Compute the byte overhead of the GGUF header + tensor descriptors.
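The header and descriptor overhead follows directly from the field widths in the format section, so it can be computed without parsing the file. A small sketch, with hypothetical helper names and file paths supplied by the caller:

```python
import os

def index_overhead(metadata_len: int, tensors: list) -> int:
    """Bytes before TENSOR DATA. tensors: list of (name, dims) pairs."""
    header = 4 * 4  # magic, version, n_tensors, metadata_len (u32 each)
    descriptors = sum(
        2 + len(name.encode("utf-8"))   # name_len + name
        + 1 + 4 * len(dims)             # n_dims + dims
        + 1 + 8                         # dtype + offset
        for name, dims in tensors
    )
    return header + metadata_len + descriptors

def size_report(gguf_path: str, baseline_path: str, metadata_len: int, tensors: list) -> dict:
    """Compare on-disk sizes; baseline_path is the torch.save file written separately."""
    gguf_bytes = os.path.getsize(gguf_path)
    baseline_bytes = os.path.getsize(baseline_path)
    overhead = index_overhead(metadata_len, tensors)
    return {
        "gguf_lite_bytes": gguf_bytes,
        "torch_save_bytes": baseline_bytes,
        "ratio": baseline_bytes / gguf_bytes,
        "index_overhead_bytes": overhead,
        "index_overhead_fraction": overhead / gguf_bytes,
    }
```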
Block E: interpret in README.md
Three questions:
- What's the actual byte savings vs torch.save(model.state_dict())? Expect ~6–8× for INT4 schemes (a nominal 8× saving on the quantized weights relative to FP32, reduced by scales, unquantized tensors, and the amortized header).
- Where does most of the file go? Sum bytes by dtype. The largest contributor should be Q4 weights, not scales or metadata.
- Why doesn't INT4 give a clean 8× reduction vs FP32? Identify the overheads: scales (FP16), padding for alignment, the header, the non-quantized parts (embeddings, layer-norms).
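A quick back-of-the-envelope calculation helps frame the third question. Using only the layouts defined in the format section, each Q4_per_group_64 weight costs 4 bits plus its share of one FP16 scale per 64 weights. The sketch below ignores the header, descriptors, and any padding, and the function names are illustrative.

```python
def q4_group64_bits_per_weight(group_size: int = 64, scale_bits: int = 16) -> float:
    """4 bits per value plus one scale shared by each group."""
    return 4 + scale_bits / group_size

def q8_per_channel_bits_per_weight(in_features: int, scale_bits: int = 16) -> float:
    """8 bits per value plus one scale shared by each output row."""
    return 8 + scale_bits / in_features

def file_ratio(n_quant: int, n_fp32: int, bits_per_weight: float) -> float:
    """FP32 baseline size divided by mixed-precision size, ignoring header and descriptors."""
    baseline_bits = 32 * (n_quant + n_fp32)
    mixed_bits = bits_per_weight * n_quant + 32 * n_fp32
    return baseline_bits / mixed_bits

bpw = q4_group64_bits_per_weight()
print(bpw)                        # 4.25
print(round(32 / bpw, 2))         # 7.53
print(round(file_ratio(97, 3, bpw), 2))   # 6.3
```

So scales alone cap the ratio at about 7.53×. With a purely hypothetical 3% of parameters left in FP32 (embeddings, layer-norms), the ratio drops to roughly 6.3×; the real split for MiniGPT has to be measured.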
Constraints
- Little-endian. Borja's x86_64 is little-endian; record this in the magic-version comment but don't write byte-swap code unless asked.
- No pickle, no torch.save for the quantized format. The whole point is that you can read this from any language (C, Rust, Zig) that can parse a flat binary.
- No dependency on real ggml. Our format is GGUF-shaped but simplified; it's pedagogically connected to GGUF, not bit-exact.
Stop conditions
Done when:
- Writer and reader implemented; tests in tests/test_gguf_io.py pass.
- Full-model round-trip max abs error < 1e-3 per layer.
- File size ~3× smaller than torch.save(model.state_dict()) for INT8 schemes; ~6× smaller for INT4.
- README.md answers the three questions.
Pitfalls
- The reloaded model has wrong shapes. Did you write dims in the correct order (PyTorch is row-major, nn.Linear's weight is (out, in))? Document explicitly in the header.
- INT4 unpack returns wrong sign. Two's-complement 4-bit: values 8..15 are negative. Use int8(nibble) - 16 if nibble >= 8 else int8(nibble).
- Per-group scales don't line up after reload. The reshape in quantize_symmetric_per_group must be matched by the inverse reshape in the dequant path. Test on a (4, 8) toy tensor before scaling to the real model.
- Header offsets wrong. Compute the data offset after writing all descriptors; don't pre-commit to an offset.
Stretch goal: actual GGUF compatibility
If time allows, swap the magic value and dtype enum to match the real llama.cpp GGUF spec, and test loading via llama-cli. Not graded; it's a "see if it works" exercise.
When to consult solutions/
After all four stop conditions are met. solutions/03-gguf-export-ref.md (phase open) walks through the bit-packing carefully.
End of Phase 26 labs. Write PHASE_26_REPORT.md next.