Lab 00: Curate the 50-chunk grammar-rule KB
A hand-written knowledge base of 50 chunks covering the 20 verbs × 5 tenses × 3 persons in English and Spanish. Each chunk has a `chunk_id`, a body of 1-3 sentences, tags (`tense`, `person`, `verbs`, `regularity`, `language`), and citations to references. This is the raw material for the retriever.
Objective
Build data/kb/grammar-rules/: a curated JSONL knowledge base of ~50 chunks covering the §A13 grammar scope, with structured metadata for filtering and evaluation. The KB must be self-contained — every query the eval set asks should be answerable from these chunks.
Setup
- Pure data work. No PyTorch, no models.
- The §A13 grid: 20 verbs (12 regular + 8 irregular) × 5 tenses × 3 persons × 2 languages.
- Phase 12 corpus exists (the example sentences); this lab produces the rules.
The chunk schema
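`data/kb/grammar-rules/chunks.jsonl` holds one JSON object per line, one per chunk. The example here is pretty-printed across several lines for readability; in the file it sits on a single line: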
{
  "chunk_id": "en-pres-3sg-regular-s-rule-001",
  "language": "en",
  "topic": "tense_rule",
  "tense": "present_simple",
  "person": "3sg",
  "regularity": "regular",
  "verbs": ["work", "play", "walk", "talk", "listen",
            "watch", "study", "finish", "start", "look",
            "want", "like"],
  "title": "Present simple, 3rd-person singular: add -s",
  "body": "In English present simple, the 3rd-person singular form of a regular verb adds -s to the base form. For example: 'She works at the office.' / 'He walks home.' Verbs ending in -y after a consonant change y → ies (study → studies); verbs ending in -sh, -ch, -ss, -x, -o add -es (watch → watches, finish → finishes).",
  "examples": [
    {"en": "She works at the office.", "es": "Ella trabaja en la oficina."},
    {"en": "He studies every day.", "es": "Él estudia todos los días."}
  ],
  "source": "hand",
  "references": ["Murphy 2019 §3", "RAE Diccionario panhispánico §5"]
}
Required fields:
- `chunk_id`: unique, stable, descriptive prefix (see naming convention below).
- `language`: `en` | `es` | `bilingual`.
- `topic`: `tense_rule` | `irregular_form` | `auxiliary` | `agreement` | `time_marker` | `spanish_equivalent`.
- `tense`: one of the 5 tenses, or `null` for cross-tense chunks.
- `person`: `1sg` | `2sg` | `3sg`, or `null` for cross-person rules.
- `regularity`: `regular` | `irregular` | `both`.
- `verbs`: list of the verbs this chunk applies to (subset of the 20).
- `title`: one-line summary, ≤ 80 chars.
- `body`: 1-3 sentences, ≤ 600 chars. Self-contained: readable without context.
Optional fields (recommended):
- `examples`: list of EN/ES paired example sentences.
- `source`: `hand` | `derived`.
- `references`: list of grammar-book / dictionary references for traceability.
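The required-field list above can be checked mechanically before any deeper validation. The following is a minimal sketch, not the lab's reference solution; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here for illustration.

```python
import json

# Field names copied from the required list above. `tense` and `person` may be
# null, but the keys themselves must still be present in every record.
REQUIRED_FIELDS = ("chunk_id", "language", "topic", "tense", "person",
                   "regularity", "verbs", "title", "body")


def missing_fields(line: str) -> list[str]:
    """Parse one JSONL line and return the required keys it lacks."""
    record = json.loads(line)
    return [name for name in REQUIRED_FIELDS if name not in record]
```

Checking key presence rather than truthiness matters here: a cross-person chunk legitimately carries `"person": null`, and that should not be reported as missing.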
Naming convention for `chunk_id`
Pattern: <lang>-<tense>-<person>-<regularity>-<topic>-<NNN>.
Examples:
- en-pres-3sg-regular-s-rule-001
- en-past-all-irregular-go-001
- es-pres-3sg-regular-er-conj-001
- bilingual-past-1sg-pair-001
Hyphen-separated, lowercase. The prefix lets grep filter chunks during debugging.
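A format check for ids can be sketched as below. The regular expression is an assumption: it only enforces what all four examples share (a language prefix, lowercase hyphen-separated segments, a three-digit counter) and deliberately leaves the middle segments loose, since the examples do not all carry the same number of segments. `is_valid_chunk_id` and `filter_by_prefix` are hypothetical helper names.

```python
import re

# Assumed shape: <lang>-<one or more lowercase segments>-<NNN>.
CHUNK_ID_RE = re.compile(r"^(en|es|bilingual)(-[a-z0-9]+)+-\d{3}$")


def is_valid_chunk_id(chunk_id: str) -> bool:
    """True if the id is lowercase, hyphen-separated, and ends in a 3-digit counter."""
    return CHUNK_ID_RE.fullmatch(chunk_id) is not None


def filter_by_prefix(chunk_ids: list[str], prefix: str) -> list[str]:
    """In-code equivalent of grepping the JSONL file for an id prefix."""
    return [cid for cid in chunk_ids if cid.startswith(prefix)]
```

For instance, `filter_by_prefix(ids, "en-past")` would narrow a list of ids to the English past-tense chunks.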
Coverage targets
| Category | Min chunks |
|---|---|
| English tense rules (5 tenses × 3 persons × 2 regularity) | ≥ 20 |
| Irregular forms (8 verbs × 3-4 forms each) | ≥ 12 |
| Spanish equivalents (per-tense translation rules) | ≥ 10 |
| Bilingual pairs / contrastive | ≥ 5 |
| Time markers (yesterday → past, tomorrow → future, etc.) | ≥ 3 |
| Total | ≥ 50 |
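The minima in this table can be enforced with a simple counter. The sketch below assumes a mapping from table rows to chunk fields (`language` and `topic`) that the lab does not spell out; the category keys and the `categories` / `coverage_errors` helpers are hypothetical. Under this mapping a chunk can count toward more than one row.

```python
from collections import Counter

# Minimum chunk counts, copied from the table above.
MINIMA = {
    "english_tense_rules": 20,
    "irregular_forms": 12,
    "spanish_equivalents": 10,
    "bilingual_pairs": 5,
    "time_markers": 3,
}
TOTAL_MIN = 50


def categories(chunk: dict) -> list[str]:
    """Assumed mapping from chunk fields to table rows (not defined by the lab)."""
    cats = []
    if chunk["language"] == "en" and chunk["topic"] == "tense_rule":
        cats.append("english_tense_rules")
    if chunk["topic"] == "irregular_form":
        cats.append("irregular_forms")
    if chunk["topic"] == "spanish_equivalent":
        cats.append("spanish_equivalents")
    if chunk["language"] == "bilingual":
        cats.append("bilingual_pairs")
    if chunk["topic"] == "time_marker":
        cats.append("time_markers")
    return cats


def coverage_errors(chunks: list[dict]) -> list[str]:
    """Return one message per unmet minimum; an empty list means the KB passes."""
    counts = Counter(cat for chunk in chunks for cat in categories(chunk))
    errors = [f"{name}: {counts[name]} < {minimum}"
              for name, minimum in MINIMA.items() if counts[name] < minimum]
    if len(chunks) < TOTAL_MIN:
        errors.append(f"total: {len(chunks)} < {TOTAL_MIN}")
    return errors
```

Returning messages instead of raising keeps the check composable with a validator that collects every error in one pass.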
Tasks
Part A: Write the coverage matrix
Make a spreadsheet (or just a markdown table) with rows = (tense, person, regularity, language) cells. Check off each cell as you write a chunk for it. Aim for ~50 chunks total; deeper coverage on common cells (present-simple-3sg-regular has lots of related rules) is fine.
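If you prefer to generate the empty matrix rather than type it, a sketch along these lines would work. It assumes the tense names from the loader's allowed set in Part C, and treats regularity as `regular` / `irregular` only; `coverage_table` is a hypothetical helper, and the column layout is one possible choice.

```python
from itertools import product

TENSES = ["infinitive", "present_simple", "past_simple", "past_participle", "future"]
PERSONS = ["1sg", "2sg", "3sg"]
REGULARITY = ["regular", "irregular"]
LANGUAGES = ["en", "es"]


def coverage_table(counts: dict[tuple[str, str, str, str], int]) -> str:
    """Render every (tense, person, regularity, language) cell as a markdown row."""
    lines = ["| Tense | Person | Regularity | Language | Chunks |",
             "|---|---|---|---|---|"]
    for cell in product(TENSES, PERSONS, REGULARITY, LANGUAGES):
        # Cells with no chunk yet show 0, which makes gaps easy to spot.
        lines.append("| " + " | ".join(cell) + f" | {counts.get(cell, 0)} |")
    return "\n".join(lines)
```

Calling `coverage_table({})` yields the blank 60-row grid; passing real per-cell counts later produces the populated table.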
Part B: Write 50+ chunks
For each chunk:
- Title that says what the rule is in 80 chars or fewer.
- Body of 1-3 sentences that teach the rule to a learner who doesn't know it. Include the rule, an example, and the edge case if relevant.
- At least 1-2 examples in the `examples` list.
- Filled metadata (don't skip the `regularity` field even when "n/a"; use `both`).
Style guide:
- Avoid jargon: "3rd-person singular" not "3PS"; "past simple" not "preterite-mode".
- One rule per chunk. If the rule has exceptions, those go in separate chunks with their own ids and a clear `references` cross-link.
- Bilingual paired chunks (where the rule explicitly compares EN and ES) live as `language: bilingual`.
Part C: Build the loader
src/minirag/chunk.py:
from dataclasses import dataclass, field
from pathlib import Path
import json

ALLOWED_TENSES = {"infinitive", "present_simple", "past_simple",
                  "past_participle", "future", None}
ALLOWED_PERSONS = {"1sg", "2sg", "3sg", None}
ALLOWED_REGULARITY = {"regular", "irregular", "both"}
ALLOWED_LANGUAGES = {"en", "es", "bilingual"}
ALLOWED_TOPICS = {"tense_rule", "irregular_form", "auxiliary",
                  "agreement", "time_marker", "spanish_equivalent"}


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    language: str
    topic: str
    tense: str | None
    person: str | None
    regularity: str
    verbs: tuple[str, ...]
    title: str
    body: str
    examples: tuple[dict, ...] = field(default_factory=tuple)
    source: str = "hand"
    references: tuple[str, ...] = field(default_factory=tuple)


def load_chunks(path: Path) -> list[Chunk]: ...


def validate_chunks(chunks: list[Chunk]) -> list[str]:
    """Returns list of errors; [] = pass."""
    ...
- `load_chunks` parses JSONL → `Chunk` dataclasses.
- `validate_chunks` checks:
  - schema correctness, required fields present.
  - `chunk_id` uniqueness.
  - field values in allowed sets.
  - `verbs` ⊆ §A13 20-verb list.
  - `body` is 1-3 sentences (count `.?!`, expect ≤ 3).
  - title ≤ 80 chars; body ≤ 600 chars.
  - coverage matrix minima are satisfied.
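The length and sentence-count checks could look like the sketch below. It is one possible reading of "count `.?!`", not the lab's reference validator; `count_sentences` and `text_errors` are hypothetical names. One caveat worth knowing before you rely on it: a plain terminator count treats each quoted example inside a body as its own sentence, so the sample chunk in the schema section would count four terminators. You may need to keep examples out of `body` or adapt the counting rule.

```python
import re

MAX_TITLE_CHARS = 80
MAX_BODY_CHARS = 600
MAX_SENTENCES = 3


def count_sentences(text: str) -> int:
    """Count runs of sentence terminators; '?!' or '...' counts once."""
    return len(re.findall(r"[.?!]+", text))


def text_errors(chunk_id: str, title: str, body: str) -> list[str]:
    """Return one message per violated text constraint for a single chunk."""
    errors = []
    if len(title) > MAX_TITLE_CHARS:
        errors.append(f"{chunk_id}: title has {len(title)} chars (max {MAX_TITLE_CHARS})")
    if len(body) > MAX_BODY_CHARS:
        errors.append(f"{chunk_id}: body has {len(body)} chars (max {MAX_BODY_CHARS})")
    sentences = count_sentences(body)
    if not 1 <= sentences <= MAX_SENTENCES:
        errors.append(f"{chunk_id}: body has {sentences} sentences (expected 1-{MAX_SENTENCES})")
    return errors
```

Prefixing each message with the `chunk_id` makes the error list usable as-is when `validate_chunks` concatenates the results across the whole KB.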
Part D: Tests in `tests/minirag/test_chunk.py`
- `test_load_clean_kb`: parses the seed file, count ≥ 50.
- `test_validate_clean_kb`: `validate_chunks` returns `[]`.
- `test_rejects_unknown_tense`: a chunk with `tense="conditional"` is rejected.
- `test_rejects_unknown_verb`: a chunk with `verbs=["run"]` (not in §A13) is rejected.
- `test_coverage_minima`: a stripped-down KB missing the Spanish chunks is rejected with a coverage error.
- `test_chunk_id_unique`: duplicate id rejected.
Deliverable
- `data/kb/grammar-rules/chunks.jsonl`: ≥ 50 chunks.
- `data/kb/grammar-rules/COVERAGE.md`: the coverage matrix (which cells are populated).
- `src/minirag/chunk.py`: schema + loader.
- `tests/minirag/test_chunk.py`: six tests passing.
- `data/kb/grammar-rules/manifest.json`: version, count, seed.
Acceptance
- `pytest tests/minirag/test_chunk.py -v` passes all six.
- KB has ≥ 50 chunks covering the matrix per the table above.
- No chunk has `body` > 600 chars or `title` > 80 chars.
- `COVERAGE.md` shows which cells are populated and how many chunks per cell.
Pitfalls
- Chunks that paraphrase each other. Two chunks saying "add -s for 3sg" with different examples is fine; two chunks saying the exact same thing in different words inflates the KB without adding signal. Aim for distinct rules per chunk.
- Chunks that are too long. A 3-paragraph chunk doesn't retrieve well — embedders work best on cohesive 1-3 sentence units. Split if you exceed the 600-char cap.
- Forgetting `null` for cross-cutting rules. A rule like "the past-participle form is used after `have`" applies to all persons. Don't pick one; use `person: null`.
- Hand-curated leak into eval. Lab 01's eval set should not be literally derivable from a single chunk. The KB is the answer; the queries should paraphrase it. If a query is a chunk title verbatim, it isn't testing retrieval, it's testing string match.
- Skipping Spanish. Half the §A13 scope is Spanish. An EN-only KB makes the next labs trivial.
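Two of these pitfalls lend themselves to a quick automated screen. The sketch below flags chunk pairs whose bodies are nearly identical and eval queries that match a chunk title or sit verbatim inside a body. It relies on character-level similarity from the standard library's `difflib`, so it only catches light rewording, not true paraphrases. The 0.9 threshold and the `normalize`, `near_duplicates`, and `leaked_queries` helpers are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial edits don't hide a match."""
    return " ".join(re.findall(r"\w+", text.lower()))


def near_duplicates(chunks: list[dict], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Return id pairs whose normalized bodies are at least `threshold` similar."""
    pairs = []
    for a, b in combinations(chunks, 2):
        ratio = SequenceMatcher(None, normalize(a["body"]), normalize(b["body"])).ratio()
        if ratio >= threshold:
            pairs.append((a["chunk_id"], b["chunk_id"]))
    return pairs


def leaked_queries(queries: list[str], chunks: list[dict]) -> list[str]:
    """Return queries that equal a chunk title or appear verbatim inside a body."""
    titles = {normalize(c["title"]) for c in chunks}
    bodies = [normalize(c["body"]) for c in chunks]
    leaked = []
    for query in queries:
        nq = normalize(query)
        if nq and (nq in titles or any(nq in body for body in bodies)):
            leaked.append(query)
    return leaked
```

A clean result from either function does not prove the KB is free of paraphrased duplicates or leaks; it only rules out the near-verbatim cases, so a manual read-through is still worthwhile.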
Stretch
- Add 10 adversarial chunks: rules that are almost right but contain a subtle error or outdated convention. Mark them `source: adversarial`. Used in Lab 03's faithfulness eval (a faithful reader doesn't cite an adversarial chunk as authoritative).
- Translate every English chunk to Spanish (`language: es`) preserving examples. Doubles the KB and exercises bilingual retrieval.
Next lab: lab/01-bi-encoder-baseline.md.