
Lab 03 — GGUF-like Export and Round-Trip

Goal: hand-write a GGUF-like binary export of MiniGPT, reload it, and verify the dequantized weights match PyTorch's fake-quant within 1e-3.

Estimated time: 4–6 hours.

Prereq: labs 00–02 committed; per-group INT4 quantization working in src/miniquant/quantize.py.

What you produce

A directory experiments/26-gguf-export/ containing:

export.py — script that writes minigpt.gguf-lite.
load.py — script that reads minigpt.gguf-lite back into a PyTorch module structure.
verify.py — script that runs forward on identical inputs in the original (PyTorch fake-quant) and reloaded paths; reports max abs error per layer.
manifest.json.
README.md — interpretation.

You also commit src/miniquant/gguf_io.py (read+write).

The format (simplified GGUF-lite)

GGUF (the format used by llama.cpp) is a binary container. The full spec is in ggerganov/ggml's repo. Our simplified version captures the structure without the legacy tags:

HEADER:
  magic         u32   = 0x474C4654  ('GLFT' = "GGUF-LiTe"; stored little-endian, so the on-disk bytes are 54 46 4C 47)
  version       u32   = 1
  n_tensors     u32
  metadata_len  u32   = number of bytes in metadata KV
METADATA:
  metadata_len bytes of key=value strings (utf-8), newline-separated
TENSOR DESCRIPTORS (repeated n_tensors times):
  name_len      u16
  name          name_len bytes (utf-8)
  n_dims        u8
  dims          n_dims × u32
  dtype         u8     (0=F32, 1=F16, 2=Q8_per_channel, 3=Q4_per_group_64)
  offset        u64    (offset into TENSOR DATA section)
TENSOR DATA:
  (concatenated, each tensor's bytes per its dtype)
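A minimal sketch of how the HEADER and METADATA sections might be serialized with Python's struct module, assuming the little-endian layout above. The helper name pack_header and the example metadata key are illustrative and not part of the lab's API.

```python
import struct

MAGIC = 0x474C4654  # 'GLFT' when the constant is read most-significant byte first
VERSION = 1

def pack_header(n_tensors: int, metadata: dict[str, str]) -> bytes:
    """Serialize HEADER + METADATA. pack_header is a hypothetical helper name."""
    # Metadata is newline-separated key=value text, encoded as UTF-8.
    meta = "\n".join(f"{k}={v}" for k, v in metadata.items()).encode("utf-8")
    # "<" selects little-endian with no padding; "I" is an unsigned 32-bit int.
    return struct.pack("<IIII", MAGIC, VERSION, n_tensors, len(meta)) + meta

header = pack_header(n_tensors=1, metadata={"arch": "minigpt"})
print(header[:4])   # b'TFLG': the magic as it appears in a hex dump
print(len(header))  # 28: 16 fixed bytes plus 12 metadata bytes
```

Because the value is written little-endian, a hex dump shows the magic as TFLG rather than GLFT. A reader in another language should compare against the u32 value, not the ASCII string.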

For Q8_per_channel and Q4_per_group_64, the tensor data layout is:

Q8_per_channel:
  scales:  out × f16
  values:  out × in × i8
Q4_per_group_64:
  scales:  out × (in / 64) × f16
  values:  out × (in / 2) × u8  (two 4-bit values packed per byte; low nibble is index 0)

The 4-bit pack: lower nibble = even-index weight (signed 4-bit, two's complement, range [-8, 7]); upper nibble = odd-index weight.
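The nibble convention can be sketched in plain Python as follows. The function names pack_int4 and unpack_int4 are hypothetical. A real implementation would likely vectorize this over tensors, but the bit operations are the same.

```python
def pack_int4(values: list[int]) -> bytes:
    """Pack signed 4-bit ints in [-8, 7] two per byte, even index in the low nibble."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value outside [-8, 7]")
        # Masking with 0x0F keeps the two's-complement low 4 bits of each value.
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(data: bytes) -> list[int]:
    """Inverse of pack_int4: low nibble first, then high nibble."""
    out = []
    for byte in data:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out

packed = pack_int4([-8, 7, -1, 0])
print(packed.hex())         # 780f
print(unpack_int4(packed))  # [-8, 7, -1, 0]
```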

TODOs

Block A — implement the writer

src/miniquant/gguf_io.py: write_gguf_lite(path: str, model: nn.Module, schemes: dict[str, str]). The schemes dict says which quantization to use per parameter name (e.g. {"layers.0.mlp.fc1.weight": "q4_per_group_64"}).
Walk the state_dict; for each tensor, choose its dtype per the schemes map; quantize if needed; write the descriptor and queue the data.
Pack INT4 weights two-per-byte. Take care: even index → low nibble, odd → high nibble. Use bit-shifts and masks rather than multiplication and division.
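One possible shape for the "write the descriptor and queue the data" step, shown on already-encoded payloads so that the quantizer stays out of the picture. The function write_container and its tensors argument are assumptions made for this sketch; the lab's write_gguf_lite takes a model and a schemes dict instead.

```python
import io
import struct

def write_container(fh, tensors, metadata=b""):
    """tensors: dict name -> (dims, dtype_code, payload_bytes). Hypothetical helper."""
    fh.write(struct.pack("<IIII", 0x474C4654, 1, len(tensors), len(metadata)))
    fh.write(metadata)
    offset = 0   # relative to the start of the TENSOR DATA section
    queued = []  # payloads are held back until every descriptor is written
    for name, (dims, dtype, payload) in tensors.items():
        raw = name.encode("utf-8")
        fh.write(struct.pack("<H", len(raw)) + raw)
        fh.write(struct.pack(f"<B{len(dims)}I", len(dims), *dims))
        fh.write(struct.pack("<BQ", dtype, offset))
        queued.append(payload)
        offset += len(payload)
    data_start = fh.tell()  # absolute position of TENSOR DATA, known only now
    for payload in queued:
        fh.write(payload)
    return data_start

buf = io.BytesIO()
tensors = {
    "ln.weight": ((2,), 0, struct.pack("<2f", 1.0, 1.0)),
    "fc.weight": ((2, 2), 0, struct.pack("<4f", 0.5, -0.5, 0.25, 0.0)),
}
data_start = write_container(buf, tensors, b"arch=minigpt")
```

Keeping offsets relative to the data section means the descriptors can be written in a single pass without knowing how long the descriptor table will be.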

Block B — implement the reader

read_gguf_lite(path: str) -> dict[str, Tensor]. Returns a dict mapping tensor name → dequantized FP32 tensor.
For each tensor descriptor, seek to the offset, read the right number of bytes, dequantize per dtype.
INT4 unpack: low nibble → even index, high nibble → odd index; reinterpret each nibble as signed 4-bit (subtract 16 if ≥ 8).
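A sketch of the descriptor-parsing half of the reader, assuming the layout from the format section. The name read_descriptors is hypothetical, and dequantization per dtype is left out; the example only decodes an F32 payload.

```python
import struct

def read_descriptors(blob: bytes):
    """Return (descriptors, data_start) for a GGUF-lite byte string."""
    magic, version, n_tensors, meta_len = struct.unpack_from("<IIII", blob, 0)
    if magic != 0x474C4654:
        raise ValueError("not a GGUF-lite file")
    pos = 16 + meta_len  # skip the fixed header and the metadata text
    descs = []
    for _ in range(n_tensors):
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (n_dims,) = struct.unpack_from("<B", blob, pos)
        pos += 1
        dims = struct.unpack_from(f"<{n_dims}I", blob, pos)
        pos += 4 * n_dims
        dtype, offset = struct.unpack_from("<BQ", blob, pos)
        pos += 9
        descs.append((name, dims, dtype, offset))
    return descs, pos  # pos is now the absolute start of TENSOR DATA

# A hand-built file with one F32 tensor named "w" of shape (1, 2).
blob = (
    struct.pack("<IIII", 0x474C4654, 1, 1, 0)
    + struct.pack("<H", 1) + b"w"
    + struct.pack("<B2I", 2, 1, 2)
    + struct.pack("<BQ", 0, 0)
    + struct.pack("<2f", 1.5, -2.0)
)
descs, data_start = read_descriptors(blob)
name, dims, dtype, offset = descs[0]
values = struct.unpack_from("<2f", blob, data_start + offset)
```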

Block C — verify round-trip

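Run write_gguf_lite then read_gguf_lite; compare the result to the original PyTorch fake-quant output at the level of dequantized FP32 weights. Per-tensor max abs error should be ≤ 1e-6 when the fake-quant reference also rounds its scales to FP16. If the reference keeps FP32 scales, storing them as FP16 adds a relative error of up to about 2^-11 (≈ 4.9e-4) per scale, so the per-weight error is bounded by roughly |q| · scale · 2^-11 and can exceed 1e-6.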
Run a full MiniGPT forward on a fixed input on both: original quantized model in PyTorch, and a re-built model from the reloaded weights. Layer-wise activation max abs error should be ≤ 1e-3.
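The per-tensor comparison might look like the sketch below. Plain lists stand in for tensors so the example runs without PyTorch; on real tensors the analogous expression is (a - b).abs().max(). The helper names and the example scale of 0.1 are assumptions for illustration.

```python
import struct

def fp16_round(x: float) -> float:
    """Round a Python float to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def max_abs_error(reference, reloaded):
    """Per-tensor max |ref - got|; both map name -> flat sequence of floats."""
    if reference.keys() != reloaded.keys():
        raise KeyError("tensor names differ")
    report = {}
    for name, ref in reference.items():
        got = reloaded[name]
        if len(ref) != len(got):
            raise ValueError(f"{name}: size mismatch")
        report[name] = max((abs(a - b) for a, b in zip(ref, got)), default=0.0)
    return report

# Same integer codes, but the reloaded path used a scale that went through FP16.
scale = 0.1
codes = [7, -3]
reference = {"fc.weight": [q * scale for q in codes]}
reloaded = {"fc.weight": [q * fp16_round(scale) for q in codes]}
report = max_abs_error(reference, reloaded)
```

In this toy case the error comes entirely from rounding the scale to FP16, which is why the tolerance depends on whether the reference path applies the same rounding.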

Block D — measure size

Bytes-on-disk of the GGUF-lite file.
Compare against a naive pickle of the same model (the PyTorch torch.save baseline).
Compute the byte overhead of the GGUF header + tensor descriptors.
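The expected sizes can also be computed directly from the format definition, which gives a number to check the measured file against. The sketch below assumes the layouts from the format section; the helper names and the (256, 1024) shape are illustrative and not MiniGPT's actual dimensions.

```python
F32, F16, Q8_PER_CHANNEL, Q4_PER_GROUP_64 = 0, 1, 2, 3

def descriptor_nbytes(name: str, n_dims: int) -> int:
    # name_len (u16) + name + n_dims (u8) + dims (u32 each) + dtype (u8) + offset (u64)
    return 2 + len(name.encode("utf-8")) + 1 + 4 * n_dims + 1 + 8

def payload_nbytes(dims: tuple[int, ...], dtype: int) -> int:
    numel = 1
    for d in dims:
        numel *= d
    if dtype == F32:
        return 4 * numel
    if dtype == F16:
        return 2 * numel
    out, inp = dims  # quantized layouts are defined for 2-D (out, in) weights
    if dtype == Q8_PER_CHANNEL:
        return 2 * out + out * inp  # FP16 scales + int8 values
    if dtype == Q4_PER_GROUP_64:
        if inp % 64 != 0:
            raise ValueError("in must be a multiple of 64")
        return 2 * out * (inp // 64) + out * inp // 2  # FP16 scales + packed nibbles
    raise ValueError(f"unknown dtype {dtype}")

dims = (256, 1024)
fp32_bytes = payload_nbytes(dims, F32)
q4_bytes = payload_nbytes(dims, Q4_PER_GROUP_64)
ratio = fp32_bytes / q4_bytes
overhead = descriptor_nbytes("layers.0.mlp.fc1.weight", len(dims))
```

For this shape the Q4 payload is 139264 bytes against 1048576 for FP32, a ratio of about 7.53, and the descriptor adds 43 bytes.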

Block E — interpret in `README.md`

Three questions:

What are the actual byte savings vs torch.save(model.state_dict())? Expect ~6–8× for INT4 schemes (the nominal 8× weight saving over FP32, reduced by scales, the header, and the non-quantized tensors).
Where does most of the file go? Sum bytes by dtype. The largest contributor should be Q4 weights, not scales or metadata.
Why doesn't INT4 give a clean 8× reduction vs FP32? Identify the overheads: scales (FP16), padding for alignment, the header, the non-quantized parts (embeddings, layer-norms).

Constraints

Little-endian. Borja's x86_64 is little-endian; record this in the magic-version comment but don't write byte-swap code unless asked.
No pickle, no torch.save for the quantized format. The whole point is that you can read this from any language (C, Rust, Zig) that can parse a flat binary.
No dependency on real ggml. Our format is GGUF-shaped but simplified; it's pedagogically connected to GGUF, not bit-exact.

Stop conditions

Done when:

Writer and reader implemented; tests in tests/test_gguf_io.py pass.
Full-model round-trip max abs error ≤ 1e-3 per layer.
File size ~3× smaller than torch.save(model.state_dict()) for INT8 schemes; ~6× smaller for INT4.
README.md answers the three questions.

Pitfalls

The reloaded model has wrong shapes. Did you write dims in the correct order (PyTorch is row-major, nn.Linear's weight is (out, in))? Document the dims order explicitly in the header.
INT4 unpack returns wrong sign. Two's-complement 4-bit: values 8..15 are negative. Use int8(nibble) - 16 if nibble >= 8 else int8(nibble).
Per-group scales don't line up after reload. The reshape in quantize_symmetric_per_group must be matched by the inverse reshape in the dequant path. Test on a (4, 8) toy tensor before scaling to the real model.
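A toy version of the (4, 8) check, written with plain lists so the group indexing is visible. It assumes symmetric quantization with scale = max|w| / 7 stored as FP16 and uses a group size of 4 so that each row has two groups. The lab's quantize_symmetric_per_group may choose scales differently, and the function names here are hypothetical.

```python
import struct

def fp16(x: float) -> float:
    """Round to the nearest FP16 value, as the file stores scales."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_row(row, group_size):
    scales, q = [], []
    for g in range(0, len(row), group_size):
        group = row[g:g + group_size]
        amax = max(abs(v) for v in group)
        scale = fp16(amax / 7.0) if amax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the signed 4-bit range.
        q.extend(max(-8, min(7, round(v / scale))) for v in group)
    return scales, q

def dequantize_row(scales, q, group_size):
    # Element i belongs to group i // group_size; this is the inverse reshape.
    return [q[i] * scales[i // group_size] for i in range(len(q))]

base = [0.875, -0.25, 0.125, 0.0, 3.5, -1.0, 0.5, 1.5]
tensor = [[v * 2.0 ** -r for v in base] for r in range(4)]  # shape (4, 8)

reloaded = []
for row in tensor:
    scales, q = quantize_row(row, group_size=4)
    reloaded.append(dequantize_row(scales, q, group_size=4))
```

The values are chosen so that every scale and product is exactly representable, which means a correct group mapping reproduces the tensor exactly and any mismatch shows up as a large, obvious error.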
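Header offsets wrong. Each descriptor's offset is relative to the start of the TENSOR DATA section, and that section's absolute position is only known once every descriptor has been written. Compute it then; don't pre-commit to an absolute offset.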

Stretch goal — actual GGUF compatibility

If time allows, swap the magic value and dtype enum to match the real llama.cpp GGUF spec, and test loading via llama-cli. Not graded; it's a "see if it works" exercise.

When to consult `solutions/`

After all four stop conditions are met. solutions/03-gguf-export-ref.md (phase open) walks through the bit-packing carefully.

End of Phase 26 labs. Write PHASE_26_REPORT.md next.