Lab 00 — Curate the 50-chunk grammar-rule KB

A knowledge base of 50 hand-written chunks covering the 20 verbs × 5 tenses × 3 persons in English and Spanish. Each chunk has a chunk_id, a body of 1-3 sentences, tags (tense, person, verb, regularity, language), and citations to references. This is the raw material for the retriever.

Objective

Build data/kb/grammar-rules/: a curated JSONL knowledge base of ~50 chunks covering the §A13 grammar scope, with structured metadata for filtering and evaluation. The KB must be self-contained — every query the eval set asks should be answerable from these chunks.

Setup

  • Pure data work. No PyTorch, no models.
  • The §A13 grid: 20 verbs (12 regular + 8 irregular) × 5 tenses × 3 persons × 2 languages.
  • Phase 12 corpus exists (the example sentences); this lab produces the rules.

The chunk schema

data/kb/grammar-rules/chunks.jsonl — one line per chunk:

{
  "chunk_id": "en-pres-3sg-regular-s-rule-001",
  "language": "en",
  "topic": "tense_rule",
  "tense": "present_simple",
  "person": "3sg",
  "regularity": "regular",
  "verbs": ["work", "play", "walk", "talk", "listen",
            "watch", "study", "finish", "start", "look",
            "want", "like"],
  "title": "Present simple, 3rd-person singular: add -s",
  "body": "In English present simple, the 3rd-person singular form of a regular verb adds -s to the base form. For example: 'She works at the office.' / 'He walks home.' Verbs ending in -y after a consonant change y → ies (study → studies); verbs ending in -sh, -ch, -ss, -x, -o add -es (watch → watches, finish → finishes).",
  "examples": [
    {"en": "She works at the office.", "es": "Ella trabaja en la oficina."},
    {"en": "He studies every day.", "es": "Él estudia todos los días."}
  ],
  "source": "hand",
  "references": ["Murphy 2019 §3", "RAE Diccionario panhispánico §5"]
}

Required fields:

  • chunk_id: unique, stable, with a descriptive prefix (see the naming convention below).
  • language: en | es | bilingual.
  • topic: tense_rule | irregular_form | auxiliary | agreement | time_marker | spanish_equivalent.
  • tense: one of the 5 tenses (infinitive, present_simple, past_simple, past_participle, future), or null for cross-tense chunks.
  • person: 1sg | 2sg | 3sg, or null for cross-person rules.
  • regularity: regular | irregular | both.
  • verbs: list of the verbs this chunk applies to (a subset of the 20).
  • title: one-line summary, ≤ 80 chars.
  • body: 1-3 sentences, ≤ 600 chars. Self-contained, so it reads without context.

Optional fields (recommended):

  • examples: list of EN/ES paired example sentences.
  • source: hand | derived.
  • references: list of grammar-book / dictionary references for traceability.
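
As a quick illustration of the field lists above, the sketch below checks one JSONL line for missing required keys. The helper name `missing_required_fields` and the sample record (including its title) are hypothetical, made up for this example; only the key names come from the schema.

```python
import json

# Required keys, copied from the "Required fields" list above.
REQUIRED_FIELDS = ("chunk_id", "language", "topic", "tense", "person",
                   "regularity", "verbs", "title", "body")


def missing_required_fields(line: str) -> list:
    """Return the required keys absent from one JSONL line (hypothetical helper)."""
    record = json.loads(line)
    # A key set to null counts as present: tense and person both allow null.
    return [name for name in REQUIRED_FIELDS if name not in record]


# Illustrative record only; "body" is left out on purpose.
line = json.dumps({
    "chunk_id": "en-past-all-irregular-go-001",
    "language": "en",
    "topic": "irregular_form",
    "tense": "past_simple",
    "person": None,
    "regularity": "irregular",
    "verbs": ["go"],
    "title": "Past simple of go: went",
})
print(missing_required_fields(line))  # ['body']
```

Checking key presence rather than truthiness matters here, because `person: null` is a legitimate value for cross-person rules.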

Naming convention for chunk_id

Pattern: <lang>-<tense>-<person>-<regularity>-<topic>-<NNN>.

Examples:

  • en-pres-3sg-regular-s-rule-001
  • en-past-all-irregular-go-001
  • es-pres-3sg-regular-er-conj-001
  • bilingual-past-1sg-pair-001

Hyphen-separated, lowercase. The prefix lets grep filter chunks during debugging.
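
A minimal sketch of how the convention could be checked and used for prefix filtering in Python. The regex is an assumption and deliberately lenient: it only requires a known language prefix, lowercase hyphen-separated tokens, and a three-digit counter, because the bilingual example above omits the regularity slot. Both function names are hypothetical.

```python
import re

# Assumed shape: <lang>, one or more lowercase tokens, then a 3-digit counter.
CHUNK_ID_RE = re.compile(r"^(en|es|bilingual)(-[a-z0-9]+)+-\d{3}$")


def is_valid_chunk_id(chunk_id: str) -> bool:
    """True if the id is lowercase, hyphen-separated, and ends in NNN."""
    return CHUNK_ID_RE.fullmatch(chunk_id) is not None


def filter_by_prefix(chunk_ids, prefix: str) -> list:
    """Python equivalent of grepping for ids that start with a prefix."""
    return [cid for cid in chunk_ids if cid.startswith(prefix)]


ids = [
    "en-pres-3sg-regular-s-rule-001",
    "en-past-all-irregular-go-001",
    "es-pres-3sg-regular-er-conj-001",
    "bilingual-past-1sg-pair-001",
]
print(all(is_valid_chunk_id(cid) for cid in ids))  # True
print(is_valid_chunk_id("EN-pres-3sg-regular-001"))  # False (uppercase)
print(filter_by_prefix(ids, "en-past"))  # ['en-past-all-irregular-go-001']
```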

Coverage targets

| Category | Min chunks |
|---|---|
| English tense rules (5 tenses × 3 persons × 2 regularity) | ≥ 20 |
| Irregular forms (8 verbs × 3-4 forms each) | ≥ 12 |
| Spanish equivalents (per-tense translation rules) | ≥ 10 |
| Bilingual pairs / contrastive | ≥ 5 |
| Time markers (yesterday → past, tomorrow → future, etc.) | ≥ 3 |
| Total | ≥ 50 |
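
The minima above can be checked mechanically. The sketch below is one possible way to do it; the mapping from a chunk's language and topic to a table category is an assumption (the lab does not spell it out), and the names `categorize` and `coverage_errors` are hypothetical.

```python
from collections import Counter
from typing import Optional

# Minimum chunk counts, copied from the coverage table.
MINIMA = {
    "english_tense_rules": 20,
    "irregular_forms": 12,
    "spanish_equivalents": 10,
    "bilingual_pairs": 5,
    "time_markers": 3,
}
TOTAL_MIN = 50


def categorize(chunk: dict) -> Optional[str]:
    """Assumed mapping from (language, topic) to a coverage category."""
    if chunk["language"] == "bilingual":
        return "bilingual_pairs"
    if chunk["topic"] == "tense_rule" and chunk["language"] == "en":
        return "english_tense_rules"
    by_topic = {
        "irregular_form": "irregular_forms",
        "spanish_equivalent": "spanish_equivalents",
        "time_marker": "time_markers",
    }
    return by_topic.get(chunk["topic"])  # None: counts toward the total only


def coverage_errors(chunks: list) -> list:
    """Return one message per unmet minimum; [] means the targets are met."""
    counts = Counter(categorize(c) for c in chunks)
    errors = [f"{name}: {counts[name]} < {minimum}"
              for name, minimum in MINIMA.items() if counts[name] < minimum]
    if len(chunks) < TOTAL_MIN:
        errors.append(f"total: {len(chunks)} < {TOTAL_MIN}")
    return errors
```

The five category minima sum to exactly 50, so a KB that meets each one also meets the total.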

Tasks

Part A — Write the coverage matrix

Make a spreadsheet (or just a markdown table) with rows = (tense, person, regularity, language) cells. Check off each cell as you write a chunk for it. Aim for ~50 chunks total; deeper coverage on common cells (present-simple-3sg-regular has lots of related rules) is fine.
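
If you prefer to generate the empty matrix rather than type it, a sketch along these lines would do. The tense names are taken from ALLOWED_TENSES in Part C; the column layout and the function name `empty_matrix` are assumptions, not a required format for COVERAGE.md.

```python
from itertools import product

TENSES = ["infinitive", "present_simple", "past_simple",
          "past_participle", "future"]
PERSONS = ["1sg", "2sg", "3sg"]
REGULARITY = ["regular", "irregular"]
LANGUAGES = ["en", "es"]


def empty_matrix() -> str:
    """Build a markdown table with one row per (tense, person, regularity, language) cell."""
    lines = [
        "| tense | person | regularity | language | chunks |",
        "|---|---|---|---|---|",
    ]
    for cell in product(TENSES, PERSONS, REGULARITY, LANGUAGES):
        # The last column starts at 0; bump it as chunks are written.
        lines.append("| " + " | ".join(cell) + " | 0 |")
    return "\n".join(lines)


table = empty_matrix()
print(len(table.splitlines()))  # 62: two header lines + 5 * 3 * 2 * 2 cells
```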

Part B — Write 50+ chunks

For each chunk:

  1. Title that says what the rule is in 80 chars or fewer.
  2. Body of 1-3 sentences that teach the rule to a learner who doesn't know it. Include the rule, an example, and the edge case if relevant.
  3. At least one example in the examples list, ideally two.
  4. Filled metadata. Don't skip the regularity field even when it doesn't apply; use both.

Style guide:

  • Avoid jargon: "3rd-person singular" not "3PS"; "past simple" not "preterite-mode".
  • One rule per chunk. If the rule has exceptions, those go in separate chunks with their own ids and a clear references cross-link.
  • Bilingual paired chunks (where the rule is explicitly comparing EN and ES) live as language: bilingual.
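
While drafting, the title and body limits can be checked with a small lint helper. This is a sketch under stated assumptions: the sentence counter is a heuristic that counts a terminator only when whitespace or the end of the text follows it, so periods inside quoted examples such as 'She works at the office.' are not counted. Abbreviations like "e.g." will still be miscounted. The function names are hypothetical.

```python
import re

TITLE_MAX = 80     # chars, from the schema
BODY_MAX = 600     # chars, from the schema
SENTENCE_MAX = 3   # sentences, from the schema

# Heuristic: ., ? or ! followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.?!]+(?=\s|$)")


def count_sentences(body: str) -> int:
    """Approximate sentence count; quoted examples ending in .' are skipped."""
    return len(SENTENCE_END.findall(body))


def draft_problems(title: str, body: str) -> list:
    """Return human-readable problems with a draft chunk; [] means it passes."""
    problems = []
    if len(title) > TITLE_MAX:
        problems.append(f"title is {len(title)} chars (max {TITLE_MAX})")
    if len(body) > BODY_MAX:
        problems.append(f"body is {len(body)} chars (max {BODY_MAX})")
    sentences = count_sentences(body)
    if not 1 <= sentences <= SENTENCE_MAX:
        problems.append(f"body has {sentences} sentences (expected 1-{SENTENCE_MAX})")
    return problems


body = "Regular verbs add -s. For example: 'He walks home.' Some add -es!"
print(count_sentences(body))                          # 2
print(draft_problems("Present simple: add -s", body))  # []
```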

Part C — Build the loader

src/minirag/chunk.py:

from dataclasses import dataclass, field
from pathlib import Path
import json

ALLOWED_TENSES = {"infinitive", "present_simple", "past_simple",
                  "past_participle", "future", None}
ALLOWED_PERSONS = {"1sg", "2sg", "3sg", None}
ALLOWED_REGULARITY = {"regular", "irregular", "both"}
ALLOWED_LANGUAGES = {"en", "es", "bilingual"}
ALLOWED_TOPICS = {"tense_rule", "irregular_form", "auxiliary",
                  "agreement", "time_marker", "spanish_equivalent"}

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    language: str
    topic: str
    tense: str | None
    person: str | None
    regularity: str
    verbs: tuple[str, ...]
    title: str
    body: str
    examples: tuple[dict, ...] = field(default_factory=tuple)
    source: str = "hand"
    references: tuple[str, ...] = field(default_factory=tuple)

def load_chunks(path: Path) -> list[Chunk]: ...

def validate_chunks(chunks: list[Chunk]) -> list[str]:
    """Returns list of errors; [] = pass."""
    ...

  • load_chunks parses JSONL → Chunk dataclasses.
  • validate_chunks checks:
      • schema correctness, required fields present.
      • chunk_id uniqueness.
      • field values in allowed sets.
      • verbs ⊆ §A13 20-verb list.
      • body is 1-3 sentences (count . ? !, expect ≤ 3). A raw character count gives 4 for the sample chunk above, because its quoted examples end in periods, so the counter should skip punctuation inside quotes.
      • title ≤ 80 chars; body ≤ 600 chars.
      • coverage matrix minima are satisfied.
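
One possible shape for the loader, plus a few of the checks listed above, is sketched here. It is not the reference solution: `parse_chunk` and `basic_errors` are hypothetical helper names, and only id uniqueness, the tense set, and the length caps are covered. The Chunk fields mirror the dataclass above.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

ALLOWED_TENSES = {"infinitive", "present_simple", "past_simple",
                  "past_participle", "future", None}


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    language: str
    topic: str
    tense: Optional[str]
    person: Optional[str]
    regularity: str
    verbs: tuple
    title: str
    body: str
    examples: tuple = ()
    source: str = "hand"
    references: tuple = ()


def parse_chunk(record: dict) -> Chunk:
    """Convert one decoded JSON object into a Chunk."""
    fields = dict(record)
    # JSON arrays decode to lists; the frozen dataclass stores tuples.
    for name in ("verbs", "examples", "references"):
        if name in fields:
            fields[name] = tuple(fields[name])
    return Chunk(**fields)  # raises TypeError on missing or unknown keys


def load_chunks(path: Path) -> list[Chunk]:
    """Read a JSONL file, one chunk per non-blank line."""
    chunks = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                chunks.append(parse_chunk(json.loads(line)))
    return chunks


def basic_errors(chunks: list[Chunk]) -> list[str]:
    """A subset of the validate_chunks checks; [] means these checks pass."""
    errors, seen = [], set()
    for chunk in chunks:
        if chunk.chunk_id in seen:
            errors.append(f"{chunk.chunk_id}: duplicate chunk_id")
        seen.add(chunk.chunk_id)
        if chunk.tense not in ALLOWED_TENSES:
            errors.append(f"{chunk.chunk_id}: unknown tense {chunk.tense!r}")
        if len(chunk.title) > 80:
            errors.append(f"{chunk.chunk_id}: title over 80 chars")
        if len(chunk.body) > 600:
            errors.append(f"{chunk.chunk_id}: body over 600 chars")
    return errors
```

Returning a list of messages instead of raising on the first failure matches the validate_chunks contract above ("[] = pass") and lets one run report every problem in the KB.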

Part D — Tests in tests/minirag/test_chunk.py

  1. test_load_clean_kb — parses the seed file, count ≥ 50.
  2. test_validate_clean_kb checks that validate_chunks returns [] on the seed file.
  3. test_rejects_unknown_tense — a chunk with tense="conditional" is rejected.
  4. test_rejects_unknown_verb — a chunk with verbs=["run"] (not in §A13) is rejected.
  5. test_coverage_minima — a stripped-down KB missing the Spanish chunks is rejected with a coverage error.
  6. test_chunk_id_unique — duplicate id rejected.
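
Tests 3, 4, and 6 each need a chunk that is valid except for one field. A small record factory keeps that readable. This is a sketch only: `BASE_RECORD` and `make_record` are hypothetical names, and the record's title and body are illustrative text, not part of the seed KB.

```python
import copy

# A record assumed to satisfy the schema; tests override one field at a time.
BASE_RECORD = {
    "chunk_id": "en-pres-3sg-regular-s-rule-001",
    "language": "en",
    "topic": "tense_rule",
    "tense": "present_simple",
    "person": "3sg",
    "regularity": "regular",
    "verbs": ["work"],
    "title": "Present simple, 3rd-person singular: add -s",
    "body": "Regular verbs add -s in the 3rd-person singular.",
}


def make_record(**overrides) -> dict:
    """Return a fresh copy of BASE_RECORD with the given fields replaced."""
    record = copy.deepcopy(BASE_RECORD)  # deep copy so list fields are not shared
    record.update(overrides)
    return record


bad_tense = make_record(tense="conditional")     # input for test 3
bad_verb = make_record(verbs=["run"])            # input for test 4
duplicate_pair = [make_record(), make_record()]  # input for test 6
```

Each test can then pass the record through the loader and assert that validate_chunks returns a non-empty error list.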

Deliverable

  • data/kb/grammar-rules/chunks.jsonl — ≥ 50 chunks.
  • data/kb/grammar-rules/COVERAGE.md — the coverage matrix (which cells are populated).
  • src/minirag/chunk.py — schema + loader.
  • tests/minirag/test_chunk.py — six tests passing.
  • data/kb/grammar-rules/manifest.json — version, count, seed.
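
The manifest can be written by hand or generated. The sketch below assumes the three keys named above and nothing more; what counts as a version string and what the seed refers to are not specified in this lab, so the values passed in are placeholders. `build_manifest` and `write_manifest` are hypothetical names.

```python
import json
from pathlib import Path


def build_manifest(chunks_path: Path, version: str, seed: int) -> dict:
    """Count non-blank JSONL lines and bundle them with version and seed."""
    with chunks_path.open(encoding="utf-8") as handle:
        count = sum(1 for line in handle if line.strip())
    return {"version": version, "count": count, "seed": seed}


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest as indented JSON with a trailing newline."""
    out_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")


# Usage (paths from the deliverable list; version and seed are placeholders):
# kb_dir = Path("data/kb/grammar-rules")
# write_manifest(build_manifest(kb_dir / "chunks.jsonl", "0.1.0", 0),
#                kb_dir / "manifest.json")
```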

Acceptance

  • pytest tests/minirag/test_chunk.py -v passes all six.
  • KB has ≥ 50 chunks covering the matrix per the table above.
  • No chunk has body > 600 chars or title > 80 chars.
  • COVERAGE.md shows which cells are populated and how many chunks per cell.

Pitfalls

  • Chunks that paraphrase each other. Two chunks saying "add -s for 3sg" with different examples is fine; two chunks saying the exact same thing in different words inflates the KB without adding signal. Aim for distinct rules per chunk.
  • Chunks that are too long. A 3-paragraph chunk doesn't retrieve well — embedders work best on cohesive 1-3 sentence units. Split if you exceed the 600-char cap.
  • Forgetting null for cross-cutting rules. A rule like "the past-participle form is used after have" applies to all persons. Don't pick one — use person: null.
  • Hand-curated leak into eval. Lab 01's eval queries should not be copied literally from a single chunk. The KB holds the answer; the queries should paraphrase it. A query that repeats a chunk title verbatim tests string matching, not retrieval.
  • Skipping Spanish. Half the §A13 scope is Spanish. An EN-only KB makes the next labs trivial.
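
The eval-leak pitfall can be screened for automatically. The sketch below flags queries that match a chunk title once case and punctuation are ignored. It assumes eval queries are plain strings (Lab 01 defines the real format), and it catches only verbatim or near-verbatim copies, not paraphrases that stay too close to a title. The function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace runs to single spaces."""
    # \W is Unicode-aware for str patterns, so accented Spanish letters survive.
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def leaked_queries(queries: list, titles: list) -> list:
    """Return the queries that equal some chunk title after normalization."""
    title_keys = {normalize(title) for title in titles}
    return [query for query in queries if normalize(query) in title_keys]


titles = ["Present simple, 3rd-person singular: add -s"]
queries = [
    "present simple 3rd-person singular add -s",
    "How do I conjugate 'work' for she?",
]
print(leaked_queries(queries, titles))
# ['present simple 3rd-person singular add -s']
```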

Stretch

  • Add 10 adversarial chunks: rules that are almost right but contain a subtle error or outdated convention. Mark them source: adversarial. Used in Lab 03's faithfulness eval (a faithful reader doesn't cite an adversarial chunk as authoritative).
  • Translate every English chunk to Spanish (language: es) preserving examples. Doubles the KB and exercises bilingual retrieval.

Next lab: lab/01-bi-encoder-baseline.md.