
Lab 03 — Version the corpus with `dvc`¶

Goal: add dvc to the project (per §A8), use it to track data/processed/, commit the .dvc pointer files, and confirm reproducibility from the manifest.

Estimated time: 45–60 minutes.

Prereq: lab 02 committed; data/processed/{train,val,test}.jsonl + data/MANIFEST.json exist.

What you produce¶

dvc initialized in the repo (.dvc/ directory).
data/processed/*.jsonl.dvc pointer files committed.
data/processed/*.jsonl added to .gitignore.
experiments/12-corpus-stats/ with the Block D plots, a README.md, and a manifest.json.
The end-of-phase PHASE_12_REPORT.md (per §7).

TODOs¶

Block A — install and initialize `dvc`¶

Add dvc to pyproject.toml under a data opt-in group (per §A8's "pin now, install when needed" pattern). Run uv sync --extra data.
dvc init in the repo root. Verify .dvc/ is created and .dvc/config is empty (no remote).
git add .dvc/ .dvcignore (dvc init also creates .dvcignore in the repo root), then commit with the message chore: initialize dvc for data layer.

Block B — track the corpus files¶

dvc add data/processed/train.jsonl. Confirm:
data/processed/train.jsonl.dvc is created.
data/processed/train.jsonl is automatically added to data/processed/.gitignore (dvc writes the ignore entry next to the tracked file).
Repeat for val.jsonl and test.jsonl.
git add data/processed/*.dvc data/processed/.gitignore.
git commit -m "chore: track corpus jsonl with dvc".
git status — confirm data/processed/*.jsonl is not tracked, but the .dvc files are.
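
The final check in this block can also be scripted. This is a minimal sketch, assuming it runs from the repo root with git on the PATH; `classify` and `git_tracked` are hypothetical helper names, and the sample list only illustrates the expected end state.

```python
import subprocess


def classify(tracked: list[str]) -> tuple[list[str], list[str]]:
    """Split git-tracked paths into raw corpus files and dvc pointer files."""
    raw = [p for p in tracked if p.endswith(".jsonl")]
    pointers = [p for p in tracked if p.endswith(".jsonl.dvc")]
    return raw, pointers


def git_tracked(directory: str = "data/processed") -> list[str]:
    """Paths under `directory` that git tracks (call from the repo root)."""
    result = subprocess.run(
        ["git", "ls-files", "--", directory],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


# Illustrative listing of what git should track after Block B.
sample = [
    "data/processed/.gitignore",
    "data/processed/train.jsonl.dvc",
    "data/processed/val.jsonl.dvc",
    "data/processed/test.jsonl.dvc",
]
raw, pointers = classify(sample)
print("raw corpus files tracked by git:", raw)
print("pointer files tracked by git:", pointers)
```

In the repo, `classify(git_tracked())` should return an empty first list and three pointer files.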

Block C — confirm reproducibility¶

Move data/processed/{train,val,test}.jsonl out of the working tree, e.g. to /tmp/; don't permanently delete them.
Run the full pipeline: python scripts/gen_corpus.py --seed 42 && python scripts/validate_corpus.py && python scripts/split_corpus.py --seed 42 && python scripts/build_manifest.py.
Confirm the regenerated files have SHA-256 digests identical to the values in data/MANIFEST.json.
If they don't match: stop here, debug, do not proceed. Theory 03 lists the likely root causes.
Keep the regenerated copies (they're byte-identical), or run dvc checkout to restore the tracked versions from the local dvc cache if needed.
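
The digest comparison in this block can be sketched as follows. The layout of data/MANIFEST.json is not specified here, so the sketch takes a plain mapping from relative path to expected hex digest; how that mapping is extracted from the manifest is left to the reader. The names `sha256_of` and `mismatches` are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(expected: dict[str, str], root: Path = Path(".")) -> list[str]:
    """Relative paths whose file is missing or whose digest differs from `expected`."""
    bad = []
    for rel_path, want in expected.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != want.lower():
            bad.append(rel_path)
    return bad
```

An empty list from `mismatches(...)` means every regenerated file matches its recorded digest; any entry in the list is a reason to stop and debug.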

Block D — plots in `experiments/12-corpus-stats/`¶

Coverage heatmap. 20 verbs × 6 tense-surfaces, colored by total row count (correct + mis-conjugated) per cell. Per-person counts collapsed for visual clarity. Save as coverage_heatmap.png.
Token-length histogram per tense. Use the Phase 11 BPE to tokenize every English row; group by tense. Six overlaid histograms (or a small grid of six). Save as token_length_per_tense.png.
Mis-conjugation Pareto. Bar chart of mis-conjugation count per type. Save as mis_conjugation_pareto.png.
English vs Spanish length scatter. One point per row; x = English byte length, y = Spanish byte length; color by tense. Save as english_spanish_length_scatter.png.
Write experiments/12-corpus-stats/README.md interpreting each plot in 2–4 sentences. Note any surprises.
Write experiments/12-corpus-stats/manifest.json per CLAUDE.md §0.5 (seed, versions, inputs).
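
For the coverage heatmap, one possible shape is to count rows per cell with the standard library and hand the grid to matplotlib. This is a sketch under assumptions: the row field names `verb` and `tense` are hypothetical (use whatever the corpus schema defines), and matplotlib is assumed to be installed for the plotting half only.

```python
from collections import Counter


def coverage_grid(rows, verbs, tenses):
    """Row counts per (verb, tense) cell, with persons and labels collapsed."""
    counts = Counter((row["verb"], row["tense"]) for row in rows)
    # Missing cells come back as 0, so coverage gaps stay visible in the plot.
    return [[counts[(verb, tense)] for tense in tenses] for verb in verbs]


def save_heatmap(grid, verbs, tenses, out_path="coverage_heatmap.png"):
    """Render the grid as a PNG (matplotlib is imported lazily)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend: write the file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 8))
    image = ax.imshow(grid, aspect="auto")
    ax.set_xticks(range(len(tenses)))
    ax.set_xticklabels(tenses, rotation=45, ha="right")
    ax.set_yticks(range(len(verbs)))
    ax.set_yticklabels(verbs)
    fig.colorbar(image, ax=ax, label="rows per cell")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Passing the full verb and tense lists explicitly, rather than deriving them from the rows, keeps an entirely empty row or column on the chart instead of silently dropping it.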

Block E — `PHASE_12_REPORT.md`¶

Follow the §7 ritual template. Include:

Headline numbers: total rows, breakdown by label, cell coverage check status.
Verb table review: any Spanish translations you'd revise post-implementation.
Mis-conjugation taxonomy review: any types that turned out to be hard to generate cleanly; any types that overlap (e.g., subject_verb_disagreement vs bare_participle_missing_aux both apply to participle cells).
Split balance: how skewed did the (verb, tense)-stratified split end up? (Some splits may have all the irregular verbs by chance.)
DoD checklist: every item from PHASE_12_PLAN.md §6 ticked or explicitly waived.
Open questions raised: anything not anticipated in PHASE_12_PLAN.md §7 that came up.
Hand-off to Phase 13: what does the embedding lab need to know? E.g., "the Phase 11 BPE was retrained on the full corpus at the close of Phase 12 — see experiments/12-bpe-rerun/ for the new vocab."
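
For the split-balance item, a small sketch like the one below can put a number on the skew. It assumes each row is a dict with a `verb` field (a hypothetical name) and reports, per verb, the fraction of its rows that landed in each split.

```python
from collections import Counter


def verb_shares(splits: dict[str, list[dict]]) -> dict[str, dict[str, float]]:
    """For each verb, the fraction of its rows that landed in each split."""
    per_split = {
        name: Counter(row["verb"] for row in rows) for name, rows in splits.items()
    }
    verbs = set().union(*per_split.values())
    shares = {}
    for verb in sorted(verbs):
        total = sum(counter[verb] for counter in per_split.values())
        shares[verb] = {
            name: counter[verb] / total for name, counter in per_split.items()
        }
    return shares
```

A verb whose share in val or test is 0.0 is absent from that split, which is the kind of skew worth calling out in the report.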

Block F — Phase 11 BPE rerun¶

Per the Phase 11 lab-02 "Re-run this lab at the close of Phase 12" note: re-train the BPE on the full Phase 12 corpus.
Output: experiments/12-bpe-rerun/vocabs/512/ with the new tokenizer.
Compare the top-30 merges to the bootstrap-corpus result and document any new morphology that surfaced. Likely candidates: the trab stem, the gust stem, or going to as a single token. The Spanish imperfect suffix -aba is outside our v1 tense scope, so note it only if it is relevant.
Update experiments/12-bpe-rerun/README.md with the comparison.
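
The top-30 comparison can be sketched as a plain list diff. The on-disk format of the Phase 11 merges is not described here, so the sketch assumes each run's merges are already loaded as an ordered list of pairs; `compare_merges` and the sample merges are illustrative only.

```python
def compare_merges(bootstrap, rerun, k=30):
    """Compare the first k merges of two BPE runs, preserving merge order."""
    old_top, new_top = bootstrap[:k], rerun[:k]
    old_set, new_set = set(old_top), set(new_top)
    return {
        "shared": [m for m in new_top if m in old_set],
        "new_in_rerun": [m for m in new_top if m not in old_set],
        "dropped_from_bootstrap": [m for m in old_top if m not in new_set],
    }


# Made-up merges, purely to show the shape of the result.
bootstrap = [("t", "h"), ("e", "r"), ("i", "n")]
rerun = [("t", "h"), ("t", "r"), ("tr", "ab")]
for key, merges in compare_merges(bootstrap, rerun).items():
    print(key, merges)
```

The `new_in_rerun` list is the natural place to look for stems or multi-word tokens that only surfaced on the full corpus.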

Block G — `learners/borja/phase-12/reflections.md`¶

Per CLAUDE.md §3 and the per-phase ritual: write the reflection.

What clicked? What didn't?
Was the (verb, tense)-stratified split the right grain? Would you change it?
What's the corpus's biggest blind spot for the Phase 32 tutor?
Estimated time spent per lab vs the lab statements' estimates.

Constraints¶

dvc lives in an optional dependency group. A plain uv sync (no extras) shouldn't pull it in.
No dvc push. No remote configured in v1 (per theory 03).
No dvc.yaml pipeline. v1 keeps dvc to file tracking only.
Reports are markdown. Plots are PNG. Manifests are JSON.

Stop conditions¶

Done when:

dvc is initialized; data/processed/*.jsonl.dvc files committed.
data/processed/*.jsonl is in .gitignore and is regenerable.
Coverage heatmap, length-distribution, mis-conjugation Pareto plots committed in experiments/12-corpus-stats/.
PHASE_12_REPORT.md written and committed.
learners/borja/phase-12/reflections.md filled and committed.
Phase 11 BPE rerun on the full corpus is committed in experiments/12-bpe-rerun/.

Pitfalls¶

Forgetting .gitignore. dvc add auto-edits .gitignore, but if you accidentally git add data/processed/*.jsonl before dvc add, git already tracks the file and dvc add will typically refuse it. Untrack it with git rm --cached (see the hint of last resort), then redo dvc add.
Plotting with the wrong tokenizer. The token-length histogram (Block D, item 2) uses the Phase 11 BPE. The bootstrap BPE (small vocab) inflates token counts, so complete Block F first and use the Phase-12-rerun BPE for the official plot.
Spanish length usually > English. Expect this: Spanish is more morphologically rich (typically 4 syllables per verb vs 2 in English), and accented characters take 2 bytes each in UTF-8. The scatter should show Spanish ≈ 1.3× English on average. If the two lengths are equal, check whether the Spanish field is missing its accents.
DoD item drift. When you fill the checklist in the report, re-read PHASE_12_PLAN.md §6 line-by-line. Don't summarize from memory.
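
The byte-length point in the Spanish-length pitfall above is easy to check directly. A minimal sketch, with illustrative helper names and a sample word chosen only to show the effect of one accent:

```python
def byte_len(text: str) -> int:
    """Length in bytes of the UTF-8 encoding, the unit used on both scatter axes."""
    return len(text.encode("utf-8"))


def length_ratio(pairs):
    """Total Spanish bytes over total English bytes for (english, spanish) pairs."""
    english = sum(byte_len(en) for en, _ in pairs)
    spanish = sum(byte_len(es) for _, es in pairs)
    return spanish / english  # raises ZeroDivisionError on an empty corpus


def has_non_ascii(text: str) -> bool:
    """Cheap accent check: any character outside ASCII (accents, the letter ñ, ...)."""
    return any(ord(ch) > 127 for ch in text)


plain = "hablo"
accented = "habl\u00f3"  # same letters with an accented final o
print(byte_len(plain), byte_len(accented))
print(has_non_ascii(plain), has_non_ascii(accented))
```

If `has_non_ascii` is false for every Spanish field in the corpus, the accents were most likely stripped somewhere upstream.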

Hint of last resort¶

If dvc add fails with a "file is tracked by git" error: git rm --cached data/processed/<file> first, then dvc add again. The file is then untracked-by-git but tracked-by-dvc.

When to consult `solutions/`¶

After everything is committed. Solution: solutions/03-version-with-dvc-ref.md (phase open) contains the canonical command sequence and a sample PHASE_12_REPORT.md.

Phase 12 complete. Next phase: docs/phase-13-embeddings/.
