Phase 09 — Quiz (human-readable mirror)

Human-readable mirror of the canonical file data/quizzes/phase-09-mlp-modules.yaml. The portal (Phase 41) consumes the YAML; this .md is for quick review outside the portal. Do not edit here: edit the YAML, from which this file can be regenerated.

Source: data/quizzes/phase-09-mlp-modules.yaml. Schema in src/miniportal/BLUEPRINT.md §1.


q-09-01 — Why does the Module class register Parameters explicitly? (single)

In the Module class you built, why must _parameters be registered explicitly instead of relying on __dict__ introspection?

  • Because Python's __dict__ is unordered before 3.7
  • To allow nested modules and lazy device-move semantics
  • Because Parameter objects don't have __hash__
  • It's a PyTorch convention copied for familiarity

Explicit registration lets a parent Module walk its children deterministically so .parameters() and .to(device) work without runtime surprises. PyTorch made the same choice for the same reason.
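A minimal sketch of what explicit registration buys, assuming a simplified design that is not necessarily the one you built in this phase. The `Parameter` stand-in, the `_modules` registry, and the `named_parameters` helper are illustrative assumptions; only `_parameters` and `.parameters()` are named in the quiz.

```python
class Parameter:
    """Hypothetical stand-in for a trainable tensor; holds a flat list of floats."""
    def __init__(self, data):
        self.data = list(data)


class Module:
    def __init__(self):
        # Explicit, insertion-ordered registries instead of scanning __dict__.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Route Parameters and child Modules into their registries on assignment.
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def named_parameters(self, prefix=""):
        # Own parameters first, then each child's, in registration order.
        for name, p in self._parameters.items():
            yield prefix + name, p
        for name, m in self._modules.items():
            yield from m.named_parameters(prefix + name + ".")

    def parameters(self):
        return [p for _, p in self.named_parameters()]


class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter([0.0] * (n_in * n_out))
        self.bias = Parameter([0.0] * n_out)


class MLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)


names = [name for name, _ in MLP().named_parameters()]
print(names)  # ['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias']
```

Because the walk follows the registries rather than whatever happens to be in `__dict__`, a device move or an optimizer can visit every nested parameter in a stable order.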


q-09-02 — Which initializations are commonly used for an MLP's hidden weights? (multi)

Select every initialization scheme that is appropriate as a default for the hidden linear layers of a ReLU-activated MLP.

  • Zeros for every weight
  • Kaiming (He) normal
  • Kaiming (He) uniform
  • Constant 1.0 for every weight

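Kaiming normal/uniform preserve activation variance under ReLU. All-zero or constant initializations fail to break symmetry: every unit in a layer computes the same output and receives the same gradient, so the units never differentiate. All-zero weights additionally pass zero gradient back to earlier layers.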
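The symmetry problem can be seen in a few lines. The sketch below is pure-Python and illustrative only: the helper names (`relu`, `linear`, `kaiming_normal`) are assumptions, and the 23-to-16 layer size is borrowed from the other questions in this quiz. The Kaiming normal standard deviation used here is the usual sqrt(2 / fan_in) for ReLU.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, x):
    # W is a list of rows with shape (out, in); returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def kaiming_normal(n_out, n_in, rng):
    # Kaiming (He) normal for ReLU: std = sqrt(2 / fan_in).
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(23)]

# Constant 1.0 weights: every row is identical, so every unit is identical.
W_const = [[1.0] * 23 for _ in range(16)]
h_const = relu(linear(W_const, x))

# Kaiming normal weights: rows differ, so units compute different features.
W_he = kaiming_normal(16, 23, rng)
h_he = relu(linear(W_he, x))

print(len(set(h_const)))  # 1: all 16 hidden units output the same value
print(len(set(h_he)))     # more than one distinct activation
```

With identical rows, the 16 hidden units behave as a single unit, and since their gradients are also identical, training cannot separate them.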


q-09-03 — What does Module.train(False) typically toggle? (free)

In one sentence, what does calling module.train(False) (eval mode) change in modules like Dropout and BatchNorm?

Expected to contain: eval, dropout.

Eval mode disables stochastic dropout and switches BatchNorm to use running statistics so inference is deterministic.
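As an illustration, here is a sketch of inverted dropout with a `training` flag, assuming a list-of-floats input and a seeded RNG for repeatability. It is not the course's implementation, and BatchNorm's switch to running statistics is left out for brevity.

```python
import random

class Dropout:
    """Inverted dropout sketch (hypothetical, list-based)."""
    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True
        self._rng = random.Random(seed)

    def train(self, mode=True):
        # train(False) is eval mode; returning self allows chaining.
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    def __call__(self, x):
        if not self.training or self.p == 0.0:
            return list(x)  # identity in eval mode: deterministic inference
        keep = 1.0 - self.p
        # Zero each element with probability p; rescale survivors by 1/keep
        # so the expected activation matches eval mode.
        return [xi / keep if self._rng.random() < keep else 0.0 for xi in x]

drop = Dropout(p=0.5)
x = [1.0, 2.0, 3.0, 4.0]
train_out = drop(x)              # each entry is either 0.0 or 2 * x[i]
eval_out = drop.train(False)(x)
print(eval_out)  # [1.0, 2.0, 3.0, 4.0]
```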


q-09-04 — Backward shape for Linear's weight gradient (single)

In a Linear(in=23, out=16) whose forward is Z = X @ W.T + b with X.shape == (4, 23), what shape does the weight gradient ∇_W L have?

  • (4, 23)
  • (16, 23)
  • (23, 16)
  • (16, 4)

∇_W = (∇_Z)^T @ X has shape (16, 4) @ (4, 23) = (16, 23), exactly W's shape. Phase 9 theory §04 warns that on square shapes (batch size, in, and out all equal) the buggy ∇_W = ∇_Z @ X^T also fits, so a shape check alone will not catch it.
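The shape bookkeeping can be checked mechanically. This sketch uses nested lists and hypothetical helpers (`shape`, `transpose`, `matmul`) rather than any particular tensor library; the values are placeholders, since only the shapes matter here.

```python
def shape(m):
    return (len(m), len(m[0]))

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Strict inner-dimension check, so a mis-ordered product fails loudly.
    if shape(a)[1] != shape(b)[0]:
        raise ValueError(f"shape mismatch: {shape(a)} @ {shape(b)}")
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

batch, n_in, n_out = 4, 23, 16
X = [[1.0] * n_in for _ in range(batch)]        # (4, 23)
W = [[0.5] * n_in for _ in range(n_out)]        # (16, 23)
grad_Z = [[1.0] * n_out for _ in range(batch)]  # (4, 16), upstream gradient

grad_W = matmul(transpose(grad_Z), X)           # (16, 4) @ (4, 23)
print(shape(grad_W))              # (16, 23)
print(shape(grad_W) == shape(W))  # True

try:
    matmul(grad_Z, transpose(X))                # (4, 16) @ (23, 4)
    buggy_fits = True
except ValueError:
    buggy_fits = False
print(buggy_fits)  # False
```

On these non-square shapes the buggy product is rejected outright, which is why non-square test shapes are a cheap guard against the transposition bug.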


q-09-05 — Find the bug: two Linears with no activation (free)

A learner writes Sequential(Linear(23, 16), Linear(16, 5)) for the §A13 tense classifier. Validation accuracy is acceptable but the model generalizes slightly worse than a GELU-equipped twin. What single property of the composite map explains this?

Expected to contain: linear.

The composition of two affine maps is affine: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The stack therefore collapses to a single affine map whose 5×23 weight matrix W2 W1 has rank ≤ 5. The §A13 task is linearly separable so accuracy survives, but capacity to fit subtle agreement is lost. Cross-link: break/00-break-gelu-as-identity.md.
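The collapse can be verified numerically. The sketch below assumes random weights and hypothetical helper names (`affine`, `rand_matrix`); it builds the single equivalent map W = W2 W1, b = W2 b1 + b2 and compares it against the two-layer stack.

```python
import random

def affine(W, b, x):
    # Computes W @ x + b for W given as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
W1, b1 = rand_matrix(16, 23, rng), [rng.uniform(-1, 1) for _ in range(16)]
W2, b2 = rand_matrix(5, 16, rng), [rng.uniform(-1, 1) for _ in range(5)]

# Collapsed map: W = W2 @ W1 with shape (5, 23), b = W2 @ b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(16)) for j in range(23)]
     for i in range(5)]
b = [sum(W2[i][k] * b1[k] for k in range(16)) + b2[i] for i in range(5)]

x = [rng.uniform(-1, 1) for _ in range(23)]
stacked = affine(W2, b2, affine(W1, b1, x))   # Linear(23, 16) then Linear(16, 5)
collapsed = affine(W, b, x)                   # one Linear(23, 5)
max_diff = max(abs(s - c) for s, c in zip(stacked, collapsed))
print(max_diff < 1e-9)  # True
```

The two outputs agree up to floating-point rounding, so the 16-unit hidden layer adds parameters without adding any expressive power until a nonlinearity sits between the two layers.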