Lab 03 — Version the corpus with dvc

Goal: add dvc to the project (per §A8), use it to track data/processed/, commit the .dvc pointer files, and confirm reproducibility from the manifest.

Estimated time: 45–60 minutes.

Prereq: lab 02 committed; data/processed/{train,val,test}.jsonl + data/MANIFEST.json exist.


What you produce

  • dvc initialized in the repo (.dvc/ directory).
  • data/processed/*.jsonl.dvc pointer files committed.
  • data/processed/*.jsonl added to .gitignore.
  • experiments/12-corpus-stats/ with four plots and a README.md.
  • The end-of-phase PHASE_12_REPORT.md (per §7).

TODOs

Block A — install and initialize dvc

  • Add dvc to pyproject.toml as an opt-in data extra under [project.optional-dependencies] (per §A8's "pin now, install when needed" pattern). Then run uv sync --extra data.
  • Run dvc init in the repo root. Verify that .dvc/ is created and that .dvc/config is empty (no remote).
  • git add .dvc/ .dvcignore (dvc init also creates .dvcignore in the repo root), then commit with message chore: initialize dvc for data layer.
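The verification in this block can be scripted. The sketch below assumes only what the step states: a .dvc/ directory exists and .dvc/config carries no settings. The helper name dvc_init_problems is hypothetical and is not part of dvc; it inspects files on disk and never calls the dvc CLI.

```python
from pathlib import Path


def dvc_init_problems(repo_root: str) -> list[str]:
    """Return human-readable problems with a freshly initialized repo.

    Hypothetical helper (not part of dvc): it only reads files on disk.
    """
    dvc_dir = Path(repo_root) / ".dvc"
    if not dvc_dir.is_dir():
        return [".dvc/ directory is missing; run `dvc init` in the repo root"]

    problems = []
    config = dvc_dir / "config"
    if config.exists():
        # Ignore blank lines and comments; anything else is a real setting,
        # for example a remote that v1 is not supposed to have.
        settings = [
            line.strip()
            for line in config.read_text(encoding="utf-8").splitlines()
            if line.strip() and not line.strip().startswith(("#", ";"))
        ]
        if settings:
            problems.append(f".dvc/config is not empty: {settings[0]!r}")
    return problems
```

An empty list from dvc_init_problems(".") means the repo matches the expected post-init state.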

Block B — track the corpus files

  • dvc add data/processed/train.jsonl. Confirm that:
      • data/processed/train.jsonl.dvc is created.
      • data/processed/train.jsonl is automatically added to data/processed/.gitignore (dvc writes the entry to the .gitignore in the tracked file's own directory).
  • Repeat for val.jsonl and test.jsonl.
  • git add data/processed/*.dvc data/processed/.gitignore.
  • git commit -m "chore: track corpus jsonl with dvc".
  • git status — confirm data/processed/*.jsonl is not tracked, but the .dvc files are.
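The Block B confirmations can be checked in one pass. This is a minimal sketch, assuming dvc's usual behavior of writing a /name entry into the .gitignore that sits beside the tracked file. The helper name tracking_problems is hypothetical. It does not replace git status, which remains the authoritative check of what git tracks.

```python
from pathlib import Path

SPLITS = ("train.jsonl", "val.jsonl", "test.jsonl")


def tracking_problems(processed_dir: str, names=SPLITS) -> list[str]:
    """List missing pointer files and missing .gitignore entries."""
    directory = Path(processed_dir)
    gitignore = directory / ".gitignore"
    ignored = set()
    if gitignore.exists():
        for line in gitignore.read_text(encoding="utf-8").splitlines():
            entry = line.strip()
            if entry and not entry.startswith("#"):
                # Treat "/train.jsonl" and "train.jsonl" as the same entry.
                ignored.add(entry.lstrip("/"))

    problems = []
    for name in names:
        if not (directory / f"{name}.dvc").is_file():
            problems.append(f"missing pointer file {name}.dvc")
        if name not in ignored:
            problems.append(f"{name} is not listed in {gitignore}")
    return problems
```

Call tracking_problems("data/processed") after the commit; an empty list means every split has a pointer file and an ignore entry.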

Block C — confirm reproducibility

  • Move data/processed/{train,val,test}.jsonl out of the working tree (for example into /tmp/) rather than deleting them permanently.
  • Run the full pipeline: python scripts/gen_corpus.py --seed 42 && python scripts/validate_corpus.py && python scripts/split_corpus.py --seed 42 && python scripts/build_manifest.py.
  • Confirm the regenerated files have identical SHA256s to the values in data/MANIFEST.json.
  • If they don't match: stop here, debug, do not proceed. Theory 03 lists the likely root causes.
  • If needed, restore the original files from the dvc cache with dvc checkout. Otherwise keep the regenerated copies: once the hashes match, they are byte-identical.
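The hash comparison can be sketched as follows. The layout of data/MANIFEST.json is not shown in this lab, so the function takes a plain mapping from relative path to expected SHA256 and leaves reading the manifest to the caller. The helper names sha256_of and mismatches are hypothetical. The .dvc pointer files record dvc's own content hash (md5 by default), so they do not replace this check.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large JSONL files stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(expected: dict, root: str = ".") -> dict:
    """Map each path whose digest differs to an (expected, actual) pair.

    `expected` maps a path relative to `root` to a hex SHA256 string.
    How that mapping is read out of MANIFEST.json is up to the caller.
    """
    differing = {}
    for rel_path, want in expected.items():
        target = Path(root) / rel_path
        got = sha256_of(target) if target.is_file() else "<missing>"
        if got != want.lower():
            differing[rel_path] = (want, got)
    return differing
```

An empty result means every regenerated file matches the manifest. Any entry in the result is the signal to stop and debug before continuing.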

Block D — plots in experiments/12-corpus-stats/

  • Coverage heatmap. 20 verbs × 6 tense-surfaces, colored by total row count (correct + mis-conjugated) per cell. Per-person counts collapsed for visual clarity. Save as coverage_heatmap.png.
  • Token-length histogram per tense. Use the Phase 11 BPE to tokenize every English row; group by tense. Six overlaid histograms (or a small grid of six). Save as token_length_per_tense.png.
  • Mis-conjugation Pareto. Bar chart of mis-conjugation count per type. Save as mis_conjugation_pareto.png.
  • English vs Spanish length scatter. One point per row; x = English byte length, y = Spanish byte length; color by tense. Save as english_spanish_length_scatter.png.
  • Write experiments/12-corpus-stats/README.md interpreting each plot in 2–4 sentences. Note any surprises.
  • Write experiments/12-corpus-stats/manifest.json per CLAUDE.md §0.5 (seed, versions, inputs).
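As a starting point for the coverage heatmap, the sketch below counts rows per (verb, tense) cell and renders the grid. The row field names verb and tense are assumptions about the JSONL schema, and the helper names are hypothetical. Adjust both to match the files produced in lab 02. matplotlib is imported lazily, so the counting code runs without it.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Load one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def coverage_matrix(rows, verbs, tenses):
    """Count rows per (verb, tense) cell, collapsing person and label."""
    counts = Counter((row["verb"], row["tense"]) for row in rows)
    return [[counts[(verb, tense)] for tense in tenses] for verb in verbs]


def save_heatmap(matrix, verbs, tenses, out_path="coverage_heatmap.png"):
    # Imported here so the counting code above works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # headless backend: write a file, open no window
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 8))
    image = ax.imshow(matrix, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(tenses)))
    ax.set_xticklabels(tenses, rotation=45, ha="right")
    ax.set_yticks(range(len(verbs)))
    ax.set_yticklabels(verbs)
    fig.colorbar(image, ax=ax, label="rows per cell")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

Pass the verb and tense lists explicitly, in the order they should appear on the axes. A cell with no rows then shows up as a zero instead of silently disappearing.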

Block E — PHASE_12_REPORT.md

Follow the §7 ritual template. Include:

  • Headline numbers: total rows, breakdown by label, cell coverage check status.
  • Verb table review: any Spanish translations you'd revise post-implementation.
  • Mis-conjugation taxonomy review: any types that turned out to be hard to generate cleanly; any types that overlap (e.g., subject_verb_disagreement vs bare_participle_missing_aux both apply to participle cells).
  • Split balance: how skewed did the (verb, tense)-stratified split end up? (Some splits may have all the irregular verbs by chance.)
  • DoD checklist: every item from PHASE_12_PLAN.md §6 ticked or explicitly waived.
  • Open questions raised: anything not anticipated in PHASE_12_PLAN.md §7 that came up.
  • Hand-off to Phase 13: what does the embedding lab need to know? E.g., "the Phase 11 BPE was retrained on the full corpus at the close of Phase 12 — see experiments/12-bpe-rerun/ for the new vocab."
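The headline numbers and the split-balance question can be answered from the same row lists. This is a minimal sketch, assuming each row carries label, verb, and tense fields; those field names and both helper names are hypothetical.

```python
from collections import Counter


def label_breakdown(rows):
    """Row count per label, for the headline numbers."""
    return dict(Counter(row["label"] for row in rows))


def split_shares(splits, key=lambda row: (row["verb"], row["tense"])):
    """For each cell, the fraction of its rows that landed in each split.

    `splits` maps a split name ("train", "val", "test") to its list of rows.
    """
    per_split = {
        name: Counter(key(row) for row in rows) for name, rows in splits.items()
    }
    cells = set().union(*per_split.values())
    shares = {}
    for cell in cells:
        total = sum(counter[cell] for counter in per_split.values())
        shares[cell] = {
            name: counter[cell] / total for name, counter in per_split.items()
        }
    return shares
```

To check whether one split collected a disproportionate share of particular verbs, pass key=lambda row: row["verb"] and look at the resulting per-verb shares.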

Block F — Phase 11 BPE rerun

  • Per the Phase 11 lab-02 "Re-run this lab at the close of Phase 12" note: re-train the BPE on the full Phase 12 corpus.
  • Output: experiments/12-bpe-rerun/vocabs/512/ with the new tokenizer.
  • Compare the top-30 merges to the bootstrap-corpus result and document any new morphology that surfaced. The Spanish imperfect suffix -aba is outside our v1 tense scope, so the more likely candidates are the trab stem, the gust stem, or going to as a single token.
  • Update experiments/12-bpe-rerun/README.md with the comparison.
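The merge comparison reduces to a set difference over the first 30 entries. The sketch assumes each tokenizer exposes its merges as a list of symbol pairs in learned order. How the Phase 11 tokenizer stores them on disk is not specified here, so loading is left out. The helper name new_top_merges is hypothetical.

```python
def new_top_merges(bootstrap_merges, rerun_merges, k=30):
    """Merges in the rerun's top-k that are absent from the bootstrap's top-k.

    Both arguments are merge lists in learned order (earliest merge first),
    each merge being a pair of symbols such as ("tr", "ab").
    """
    seen = {tuple(merge) for merge in bootstrap_merges[:k]}
    return [tuple(merge) for merge in rerun_merges[:k] if tuple(merge) not in seen]
```

The result keeps the rerun's ordering, so the earliest new merges, which are usually the most frequent pairs, come first in the README comparison.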

Block G — learners/borja/phase-12/reflections.md

Per CLAUDE.md §3 and the per-phase ritual: write the reflection.

  • What clicked? What didn't?
  • Was the (verb, tense)-stratified split the right grain? Would you change it?
  • What's the corpus's biggest blind spot for the Phase 32 tutor?
  • Estimated time spent per lab vs the lab statements' estimates.

Constraints

  • dvc is an optional dependency. A plain uv sync (no extras) shouldn't pull it in.
  • No dvc push. No remote configured in v1 (per theory 03).
  • No dvc.yaml pipeline. v1 keeps dvc to file tracking only.
  • Reports are markdown. Plots are PNG. Manifests are JSON.

Stop conditions

Done when:

  1. dvc is initialized; data/processed/*.jsonl.dvc files committed.
  2. data/processed/*.jsonl is in .gitignore and is regenerable.
  3. Coverage heatmap, token-length histogram, mis-conjugation Pareto, and English vs Spanish length scatter plots committed in experiments/12-corpus-stats/.
  4. PHASE_12_REPORT.md written and committed.
  5. learners/borja/phase-12/reflections.md filled and committed.
  6. Phase 11 BPE rerun on the full corpus is committed in experiments/12-bpe-rerun/.

Pitfalls

  • Forgetting .gitignore. dvc add edits .gitignore for you, but if you accidentally git add data/processed/*.jsonl before running dvc add, git already tracks the file and the two tools conflict. Untrack the file with git rm --cached, then run dvc add again.
  • Plotting with the wrong tokenizer. The token-length histogram (Block D, second plot) uses the Phase 11 BPE. The bootstrap BPE has a small vocabulary, so it inflates token counts. Produce the official plot with the Phase 12 rerun BPE from Block F, regenerating the plot after Block F if you drew it earlier.
  • Spanish length > English. Expect this: Spanish is more morphologically rich (typically 4 syllables per verb vs 2 in English), and each accented character takes 2 bytes in UTF-8. The scatter should show Spanish ≈ 1.3× English. If the two lengths come out equal across the board, check whether the Spanish field is missing its accents.
  • DoD item drift. When you fill the checklist in the report, re-read PHASE_12_PLAN.md §6 line-by-line. Don't summarize from memory.
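The byte-length point in the Spanish vs English pitfall is easy to verify directly. The sketch below measures UTF-8 byte length, which is what the scatter plot's axes use, and adds a coarse check for a Spanish field that has lost its accents. The helper names are hypothetical, and the example sentences in the note are illustrative, not rows from the corpus.

```python
def byte_length(text: str) -> int:
    """Length in UTF-8 bytes, the unit used on the scatter plot's axes."""
    return len(text.encode("utf-8"))


def length_ratio(english: str, spanish: str) -> float:
    """Spanish byte length divided by English byte length for one row."""
    return byte_length(spanish) / byte_length(english)


def looks_accent_stripped(spanish_texts) -> bool:
    """True when no text contains any non-ASCII character at all.

    A Spanish field of realistic size with zero non-ASCII characters
    suggests accents were dropped somewhere in the pipeline.
    """
    return all(text.isascii() for text in spanish_texts)
```

For example, "trabajó" has 7 characters but byte_length("trabajó") is 8, because ó encodes to 2 bytes. The pair "She worked" and "Ella trabajó" measures 10 and 13 bytes.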

Hint of last resort

If dvc add fails with a "file is tracked by git" error, run git rm --cached data/processed/<file> first, then run dvc add again. The file is then untracked by git but tracked by dvc.

When to consult solutions/

After everything is committed. Solution: solutions/03-version-with-dvc-ref.md (phase open) contains the canonical command sequence and a sample PHASE_12_REPORT.md.


Phase 12 complete. Next phase: docs/phase-13-embeddings/.