Lab 03 — Version the corpus with dvc
Goal: add dvc to the project (per §A8), use it to track data/processed/, commit the .dvc pointer files, and confirm reproducibility from the manifest.
Estimated time: 45–60 minutes.
Prereq: lab 02 committed; data/processed/{train,val,test}.jsonl + data/MANIFEST.json exist.
What you produce
- dvc initialized in the repo (.dvc/ directory).
- data/processed/*.jsonl.dvc pointer files committed.
- data/processed/*.jsonl added to .gitignore.
- experiments/12-corpus-stats/ with three plots and a README.md.
- The end-of-phase PHASE_12_REPORT.md (per §7).
TODOs
Block A — install and initialize dvc
- Add dvc to pyproject.toml under a data opt-in group (per §A8's "pin now, install when needed" pattern). Run uv sync --extra data.
- Run dvc init in the repo root. Verify .dvc/ is created and .dvc/config is empty (no remote).
- git add .dvc/, commit with message chore: initialize dvc for data layer.
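The "no remote" check in Block A can also be done programmatically. The following is a minimal sketch, assuming the usual DVC config layout in which a remote appears as a section header such as ['remote "name"'] and a default remote as a remote key under [core]. The function names are hypothetical and not part of the project.

```python
from pathlib import Path


def declares_remote(config_text):
    """Return True if DVC config text mentions a remote section or a default remote."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        # Assumed layout: remote sections look like ['remote "name"'].
        if line.startswith("[") and "remote" in line:
            return True
        # Assumed layout: a default remote is set as `remote = name` under [core].
        if line.split("=")[0].strip() == "remote":
            return True
    return False


def repo_has_no_remote(root="."):
    """True when .dvc/config exists under root and declares no remote."""
    config = Path(root) / ".dvc" / "config"
    return config.is_file() and not declares_remote(config.read_text(encoding="utf-8"))
```

Calling repo_has_no_remote() from the repo root right after dvc init should return True; an empty config file passes trivially.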
Block B — track the corpus files
- Run dvc add data/processed/train.jsonl. Confirm:
  - data/processed/train.jsonl.dvc is created.
  - data/processed/train.jsonl is automatically added to data/processed/.gitignore (dvc writes the .gitignore next to the tracked file).
- Repeat for val.jsonl and test.jsonl.
- git add data/processed/*.dvc data/processed/.gitignore.
- git commit -m "chore: track corpus jsonl with dvc".
- git status: confirm data/processed/*.jsonl is not tracked, but the .dvc files are.
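The final check in Block B (jsonl files untracked, pointer files tracked) can be scripted on top of git ls-files. This is a rough sketch; the helper names are hypothetical, and it assumes the script runs from the repo root so that paths match the ones used in this lab.

```python
import subprocess


def tracked_paths(repo_root="."):
    """Return the set of paths git currently tracks, relative to repo_root."""
    result = subprocess.run(
        ["git", "ls-files"],
        cwd=repo_root,
        check=True,
        capture_output=True,
        text=True,
    )
    return set(result.stdout.splitlines())


def tracking_problems(tracked, splits=("train", "val", "test")):
    """List human-readable problems; an empty list means Block B is in the expected state."""
    problems = []
    for split in splits:
        data = f"data/processed/{split}.jsonl"
        pointer = data + ".dvc"
        if data in tracked:
            problems.append(f"{data} is tracked by git")
        if pointer not in tracked:
            problems.append(f"{pointer} is not tracked by git")
    return problems
```

Usage would be tracking_problems(tracked_paths()); keeping the check separate from the git call makes it easy to test with a hand-written set of paths.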
Block C — confirm reproducibility
- Delete data/processed/{train,val,test}.jsonl (move them to /tmp/ for safety; don't permanently delete).
- Run the full pipeline: python scripts/gen_corpus.py --seed 42 && python scripts/validate_corpus.py && python scripts/split_corpus.py --seed 42 && python scripts/build_manifest.py.
- Confirm the regenerated files have identical SHA256s to the values in the committed data/MANIFEST.json. If build_manifest.py rewrote the manifest, git diff data/MANIFEST.json should show no changes.
- If they don't match: stop here, debug, do not proceed. Theory 03 lists the likely root causes.
- Restore the (now-regenerated) files via dvc checkout if needed (or just leave the regenerated copies; they're byte-identical).
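The SHA256 comparison in Block C can be sketched as below. The layout of data/MANIFEST.json is not shown in this lab, so the sketch takes a plain mapping from relative path to expected hex digest; extracting that mapping from the manifest is left to whatever schema lab 02 produced. Function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large corpora never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(expected, root="."):
    """Compare files under root against expected digests.

    expected maps a relative path to its expected SHA256 hex digest.
    Returns {path: (expected, actual)} for every file that differs or is missing.
    """
    bad = {}
    for rel, want in expected.items():
        path = Path(root) / rel
        got = sha256_of(path) if path.is_file() else "<missing>"
        if got != want.lower():
            bad[rel] = (want, got)
    return bad
```

An empty result means the regenerated files are byte-identical to what the manifest recorded; anything else is the signal to stop and debug.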
Block D — plots in experiments/12-corpus-stats/
- Coverage heatmap. 20 verbs × 6 tense-surfaces, colored by total row count (correct + mis-conjugated) per cell. Per-person counts collapsed for visual clarity. Save as coverage_heatmap.png.
- Token-length histogram per tense. Use the Phase 11 BPE to tokenize every English row (for the official plot, the Block F rerun rather than the bootstrap vocab; see Pitfalls); group by tense. Six overlaid histograms (or a small grid of six). Save as token_length_per_tense.png.
- Mis-conjugation Pareto. Bar chart of mis-conjugation count per type, sorted from most to least frequent. Save as mis_conjugation_pareto.png.
- English vs Spanish length scatter. One point per row; x = English byte length, y = Spanish byte length; color by tense. Save as english_spanish_length_scatter.png.
- Write experiments/12-corpus-stats/README.md interpreting each plot in 2–4 sentences. Note any surprises.
- Write experiments/12-corpus-stats/manifest.json per CLAUDE.md §0.5 (seed, versions, inputs).
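The aggregation behind the coverage heatmap and the Pareto chart in Block D can be sketched without any plotting code. The row field names used here (verb, tense, error_type) are assumptions for illustration; substitute whatever keys the corpus jsonl actually uses.

```python
from collections import Counter


def cell_counts(rows, verbs, tenses):
    """Return a len(verbs) x len(tenses) grid of row counts.

    Person is ignored, so per-person counts collapse into one cell,
    and correct and mis-conjugated rows are counted together.
    """
    counts = Counter((row["verb"], row["tense"]) for row in rows)
    return [[counts[(verb, tense)] for tense in tenses] for verb in verbs]


def pareto(rows):
    """Return (type, count, cumulative_share) tuples, most frequent type first."""
    counts = Counter(row["error_type"] for row in rows if row.get("error_type"))
    total = sum(counts.values())
    ordered, running = [], 0
    for kind, n in counts.most_common():
        running += n
        ordered.append((kind, n, running / total))
    return ordered
```

The grid from cell_counts can be handed to an image-style plot (for example matplotlib's imshow, if that is the plotting library in use), and a zero anywhere in it is a coverage gap worth noting in the README.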
Block E — PHASE_12_REPORT.md
Follow the §7 ritual template. Include:
- Headline numbers: total rows, breakdown by label, cell coverage check status.
- Verb table review: any Spanish translations you'd revise post-implementation.
- Mis-conjugation taxonomy review: any types that turned out to be hard to generate cleanly; any types that overlap (e.g., subject_verb_disagreement vs bare_participle_missing_aux both apply to participle cells).
- Split balance: how skewed did the (verb, tense)-stratified split end up? (Some splits may have all the irregular verbs by chance.)
- DoD checklist: every item from PHASE_12_PLAN.md §6 ticked or explicitly waived.
- Open questions raised: anything not anticipated in PHASE_12_PLAN.md §7 that came up.
- Hand-off to Phase 13: what does the embedding lab need to know? E.g., "the Phase 11 BPE was retrained on the full corpus at the close of Phase 12; see experiments/12-bpe-rerun/ for the new vocab."
Block F — Phase 11 BPE rerun
- Per the Phase 11 lab-02 "Re-run this lab at the close of Phase 12" note: re-train the BPE on the full Phase 12 corpus.
- Output: experiments/12-bpe-rerun/vocabs/512/ with the new tokenizer.
- Compare top-30 merges to the bootstrap-corpus result; document any new morphology that surfaced (e.g., the -aba Spanish imperfect suffix if relevant, though it is not in our v1 tense scope; more likely: the trab stem, the gust stem, or going to as a single token).
- Update experiments/12-bpe-rerun/README.md with the comparison.
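The top-30 comparison in Block F reduces to a set difference that preserves rank. A small sketch follows, assuming each merge list is already loaded in rank order as (left, right) string pairs; how the Phase 11 tokenizer stores its merges on disk is not specified here, so loading is left out.

```python
def new_merges(bootstrap, rerun, top_n=30):
    """Return (rank, merge) pairs from the rerun's top_n that the bootstrap's top_n lacks.

    Both inputs are sequences of merges in rank order (best first).
    Ranks are 1-based and refer to the rerun list.
    """
    old = set(bootstrap[:top_n])
    return [
        (rank, merge)
        for rank, merge in enumerate(rerun[:top_n], start=1)
        if merge not in old
    ]
```

The returned pairs are the natural starting point for the README comparison: each one is a merge that only surfaced once the full corpus was available.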
Block G — learners/borja/phase-12/reflections.md
Per CLAUDE.md §3 and the per-phase ritual: write the reflection.
- What clicked? What didn't?
- Was the (verb, tense)-stratified split the right grain? Would you change it?
- What's the corpus's biggest blind spot for the Phase 32 tutor?
- Estimated time spent per lab vs the lab statements' estimates.
Constraints
- dvc is an optional dep group. uv sync (no extras) shouldn't pull it.
- No dvc push. No remote configured in v1 (per theory 03).
- No dvc.yaml pipeline. v1 keeps dvc to file tracking only.
- Reports are markdown. Plots are PNG. Manifests are JSON.
Stop conditions
Done when:
- dvc is initialized; data/processed/*.jsonl.dvc files committed.
- data/processed/*.jsonl is in .gitignore and is regenerable.
- Coverage heatmap, length-distribution, mis-conjugation Pareto plots committed in experiments/12-corpus-stats/.
- PHASE_12_REPORT.md written and committed.
- learners/borja/phase-12/reflections.md filled and committed.
- Phase 11 BPE rerun on the full corpus is committed in experiments/12-bpe-rerun/.
Pitfalls
- Forgetting .gitignore. dvc add auto-edits .gitignore, but if you accidentally git add data/processed/*.jsonl before dvc add, git already tracks the file and you end up with it claimed by both git and dvc. Confusing. Reset and redo.
- Plotting with the wrong tokenizer. The token-length histogram (Block D, second item) uses the Phase 11 BPE. If you use the bootstrap BPE (small vocab), token counts are inflated. Use the Phase-12-rerun BPE (Block F) for the official plot.
- Spanish length always > English. Expect this: Spanish is more morphologically rich (typically 4 syllables per verb vs 2 in English), and accented characters (á, é, í, ó, ú, ñ) take 2 bytes each in UTF-8. The scatter should show Spanish ≈ 1.3× English. If the two lengths are equal, the Spanish field is probably missing its accents.
- DoD item drift. When you fill the checklist in the report, re-read PHASE_12_PLAN.md §6 line-by-line. Don't summarize from memory.
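The byte-length point in the Spanish-vs-English pitfall is easy to verify directly. The sketch below measures UTF-8 byte length and counts the extra bytes contributed by non-ASCII characters; the helper names and the sample sentence pair are illustrative, not taken from the corpus.

```python
def byte_len(text):
    """Length of text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def extra_bytes(text):
    """Bytes beyond one per character; 0 means the text is pure ASCII."""
    return byte_len(text) - len(text)


def length_ratio(english, spanish):
    """Spanish-to-English byte-length ratio for one row."""
    return byte_len(spanish) / byte_len(english)


# "habló" has 5 characters but 6 bytes, because ó encodes to 2 bytes.
assert byte_len("habló") == 6
assert extra_bytes("habló") == 1
# Without the accent the word is plain ASCII: 5 characters, 5 bytes.
assert extra_bytes("hablo") == 0
```

If extra_bytes is 0 for every Spanish row, the accents were most likely stripped somewhere upstream, which is the failure the pitfall describes.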
Hint of last resort
If dvc add fails with a "file is tracked by git" error: git rm --cached data/processed/<file> first, then dvc add again. The file is then untracked-by-git but tracked-by-dvc.
When to consult solutions/
After everything is committed. Solution: solutions/03-version-with-dvc-ref.md (phase open) contains the canonical command sequence and a sample PHASE_12_REPORT.md.
Phase 12 complete. Next phase: docs/phase-13-embeddings/.