English · Español

Lab 04 — Quizzes, exams, and the SM-2 review loop¶

🇪🇸 El portal aprende a preguntar. Quizzes definidos en YAML por fase, con tipos de respuesta acotados al dominio §A13 (forma verbal, MCQ, código corto, ensayo corto). Cada error genera una tarjeta de repaso; el algoritmo SM-2 decide cuándo vuelve a aparecer. Lo que se falla, no se olvida.

Goal¶

Implement the quiz/exam pipeline: a YAML loader for data/quizzes/phase-NN-name.yaml, a grader for the four answer types (mcq, short, code, conjugation-form), a quiz_attempts persistence layer, automatic review_card creation on every wrong answer, and the daily "Today's reviews" page driven by SM-2 ease/interval updates. Exams are longer-form rubric-graded essays — the rubric is run by the model in LYNX_TEST_MODE=1 and by the teacher in production.

Why this lab exists¶

The portal's pedagogical loop is not "take a quiz, see your score, forget it." It is "take a quiz, mark the failures, surface them again tomorrow." Without the SM-2 surface, the quiz is decorative. With it, a wrong answer becomes a contract the system will honor on the learner's behalf until they get it right twice in a row.

The §A13 scope (20 verbs × 5 tenses × 3 persons) is small enough that the entire reviewable universe is bounded. SM-2 here is overkill in capacity but right in shape — Borja will use it through Phase 41 and reuse it for any later phase that adopts the same pattern.

Prerequisites¶

Labs 00–03 done.
Phase 20 evaluation harness exists and exposes grade_conjugation(prompt, given) -> GradeResult.
A13 verb data exists in data/verbs/ from Phase 12.
markdown-it-py + sanitizer from lab 03.

Deliverables¶

data/quizzes/phase-00-onboarding.yaml (example).
src/miniportal/quizzes/__init__.py — loader, models.
src/miniportal/quizzes/grader.py — dispatch by answer_type.
src/miniportal/quizzes/sm2.py — SM-2 ease/interval update.
Migrations: quiz_attempts, review_cards, exam_attempts.
src/miniportal/routes/quizzes.py — GET /quizzes, GET /quizzes/{phase}/{slug}, POST /quizzes/{...}/submit.
src/miniportal/routes/review.py — GET /review (today's due cards), POST /review/{card_id}/grade (SM-2 feedback button).
src/miniportal/routes/exams.py — exam routes (rubric-graded).
Templates: quiz_view.html.jinja, quiz_result.html.jinja, review_today.html.jinja, exam_view.html.jinja.
tests/portal/test_quiz_grading.py.
tests/portal/test_review_card_creation.py.
tests/portal/test_sm2_update.py.

Step 1 — Quiz YAML format¶

data/quizzes/phase-00-onboarding.yaml:

phase: 0
slug: onboarding
title_en: "Phase 0 onboarding check"
title_es: "Comprobación de incorporación a la fase 0"
items:
  - id: q1
    answer_type: mcq
    prompt_en: "Which command syncs the locked dependencies?"
    prompt_es: "¿Qué comando sincroniza las dependencias bloqueadas?"
    choices: ["pip install -r requirements.txt", "uv sync --frozen", "uv add fastapi", "poetry lock"]
    correct: 1
    rubric: "uv is mandatory per CLAUDE.md §2."
    tags: [tooling, uv]
  - id: q2
    answer_type: conjugation-form
    prompt_en: "Past simple of 'eat', 3rd singular?"
    prompt_es: "Pasado simple de 'eat', 3ª persona del singular."
    correct: "ate"
    rubric: "Irregular verb; same form for all persons."
    tags: [a13, irregular, past-simple]
  - id: q3
    answer_type: short
    prompt_en: "Name the four answer types supported by the grader."
    correct: ["mcq", "short", "code", "conjugation-form"]
    rubric: "Order independent; case insensitive."
    tags: [meta]

Constraint: prompt_en is required; prompt_es is optional but strongly encouraged (A2 bilingual policy). Every item must carry at least one tag — the review queue uses tags to balance daily load.

Step 2 — Loader + models¶

# src/miniportal/quizzes/__init__.py
from dataclasses import dataclass
from pathlib import Path
import yaml


@dataclass(frozen=True)
class QuizItem:
    id: str
    answer_type: str   # "mcq" | "short" | "code" | "conjugation-form"
    prompt_en: str
    prompt_es: str | None
    correct: object    # int (mcq) | str | list[str]
    rubric: str
    tags: tuple[str, ...]


@dataclass(frozen=True)
class Quiz:
    phase: int
    slug: str
    title_en: str
    title_es: str | None
    items: tuple[QuizItem, ...]


def load_quiz(path: Path) -> Quiz:
    raise NotImplementedError("Lab 04 step 2 — parse YAML, validate against schema, return frozen Quiz.")

The schema validation happens at load time (not at grading time). A malformed YAML must fail loud on app start, not on first quiz attempt.

Step 3 — The grader¶

# src/miniportal/quizzes/grader.py
from dataclasses import dataclass

from miniportal.quizzes import QuizItem


@dataclass(frozen=True)
class GradeResult:
    correct: bool
    given: object
    expected: object
    feedback: str  # short, learner-facing


def grade(item: QuizItem, given: object) -> GradeResult:
    if item.answer_type == "mcq":
        raise NotImplementedError("Lab 04 step 3a — given is an int index; compare to item.correct.")
    if item.answer_type == "short":
        raise NotImplementedError(
            "Lab 04 step 3b — given is str; item.correct is list[str]; case-insensitive set membership."
        )
    if item.answer_type == "conjugation-form":
        raise NotImplementedError(
            "Lab 04 step 3c — delegate to Phase 20's grade_conjugation; case-insensitive, strip whitespace."
        )
    if item.answer_type == "code":
        raise NotImplementedError(
            "Lab 04 step 3d — execute given snippet in a sandboxed subprocess (Phase 31 sandbox); "
            "compare stdout to item.correct. Time limit 2 s, memory limit 64 MB."
        )
    raise ValueError(f"unknown answer_type: {item.answer_type}")

The code branch is the only one that touches a subprocess. It reuses Phase 31's sandbox helper — do not roll a new sandbox here.

Step 4 — `quiz_attempts` and `review_cards`¶

# Migration sketch (Lab 04)
# quiz_attempts:
#   id PK, student_id, phase, slug, started_at, submitted_at,
#   score_correct INT, score_total INT, payload_json TEXT (item id -> given)
# review_cards:
#   id PK, student_id, quiz_phase, quiz_slug, item_id,
#   ease_factor REAL DEFAULT 2.5,
#   interval_days INT DEFAULT 0,
#   repetitions INT DEFAULT 0,
#   due_on DATE,
#   created_at, last_reviewed_at
#   UNIQUE(student_id, quiz_phase, quiz_slug, item_id)
# exam_attempts:
#   id PK, student_id, exam_id, body_md, rubric_grade JSON, graded_by ('model' | 'teacher'), graded_at

On every wrong item in a quiz submission:

INSERT OR IGNORE into review_cards with defaults.
If already present, leave it — SM-2 owns the schedule from here on.

Step 5 — SM-2 algorithm¶

# src/miniportal/quizzes/sm2.py
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewState:
    ease_factor: float
    interval_days: int
    repetitions: int


def update(state: ReviewState, quality: int) -> ReviewState:
    """Standard SM-2 (Wozniak 1990).

    quality ∈ {0..5} (0 = forgot completely, 5 = perfect recall).

    if quality < 3:
        repetitions = 0
        interval_days = 1
    else:
        repetitions += 1
        if repetitions == 1: interval_days = 1
        elif repetitions == 2: interval_days = 6
        else:                   interval_days = round(interval_days * ease_factor)

    ease_factor = max(1.3, ease_factor + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))

    Returns new ReviewState; pure function.
    """
    raise NotImplementedError("Lab 04 step 5 — implement the formula above; clamp ease_factor at 1.3.")

The portal's grade buttons are mapped: Again=0, Hard=2, Good=4, Easy=5. Quality 1 and 3 are deliberately unreachable to keep the UI to four buttons.

Step 6 — Routes¶

# src/miniportal/routes/quizzes.py
@router.get("/{phase}/{slug}")
async def show_quiz(phase: int, slug: str, student = Depends(current_student)):
    raise NotImplementedError("Lab 04 step 6 — load YAML, render quiz_view with prompts (locale-aware).")


@router.post("/{phase}/{slug}/submit")
async def submit_quiz(phase: int, slug: str, ...):
    """Side effects:
      1. INSERT quiz_attempt row.
      2. For each wrong item, INSERT OR IGNORE review_card.
      3. Update progress.status to 'in_progress' if not done.
      4. Render quiz_result.html with per-item feedback.
    """
    raise NotImplementedError("Lab 04 step 6 — implement the four side effects in a single transaction.")


# src/miniportal/routes/review.py
@router.get("")
async def review_today(student = Depends(current_student)):
    """Cards with due_on <= today, ordered by oldest due first.
    Cap at 30 cards/day (configurable) to avoid burnout."""
    raise NotImplementedError("Lab 04 step 6 — query due cards, render review_today.html.")


@router.post("/{card_id}/grade")
async def grade_review(card_id: int, quality: int = Form(...), student = Depends(current_student)):
    """quality ∈ {0,2,4,5} from the UI buttons. Computes SM-2 update, persists, returns next card."""
    raise NotImplementedError("Lab 04 step 6 — owner check, validate quality, sm2.update, persist, return JSON.")

Step 7 — Exams¶

# src/miniportal/routes/exams.py
@router.post("/{exam_id}/submit")
async def submit_exam(exam_id: int, body_md: str = Form(...), student = Depends(current_student)):
    """Rubric grading:
      - LYNX_TEST_MODE=1: invoke the model with the rubric as a prompt, store JSON grade.
      - Production: graded_by='teacher', status='pending'; teacher fills the rubric in lab 05's view.
    """
    raise NotImplementedError("Lab 04 step 7 — branch on test mode; persist exam_attempts row.")

The model-graded path is restricted to LYNX_TEST_MODE=1 to keep production grading human-anchored. The rubric prompt template lives at data/exams/rubric-template.md.

Step 8 — Tests¶

# tests/portal/test_quiz_grading.py
def test_mcq_correct(): raise NotImplementedError("MCQ index match.")
def test_mcq_wrong(): raise NotImplementedError("MCQ index mismatch.")
def test_short_case_insensitive(): raise NotImplementedError("'ate' == 'Ate'.")
def test_conjugation_form_delegates(): raise NotImplementedError("Mock Phase 20 grader; assert it's called.")
def test_code_sandbox_used(): raise NotImplementedError("Mock Phase 31 sandbox; assert the snippet is dispatched through it.")

# tests/portal/test_review_card_creation.py
def test_wrong_answer_creates_card(): raise NotImplementedError("Submit a quiz with one wrong item; assert one review_card row.")
def test_idempotent_resubmission(): raise NotImplementedError("Submit twice; only one review_card per (student, quiz, item).")
def test_correct_answer_no_card(): raise NotImplementedError("Submit all-correct; zero review_cards inserted.")

# tests/portal/test_sm2_update.py
def test_first_pass_interval_1():
    raise NotImplementedError("State(2.5, 0, 0) + quality 4 -> (≈2.5, 1, 1).")


def test_second_pass_interval_6():
    raise NotImplementedError("State(2.5, 1, 1) + quality 4 -> (≈2.5, 6, 2).")


def test_failure_resets_repetitions():
    raise NotImplementedError("State(2.5, 6, 2) + quality 0 -> (≈ slightly lower, 1, 0).")


def test_ease_floor_1_3():
    raise NotImplementedError("Repeated quality 0 inputs: ease_factor never goes below 1.3.")

What "done" looks like¶

At least one YAML quiz committed (phase-00-onboarding.yaml).
Loader rejects malformed YAML at app start.
All four answer types grade correctly; code runs through the Phase 31 sandbox.
Wrong answers create review cards; correct answers do not.
/review lists due cards capped at 30/day.
SM-2 formula matches the reference implementation; ease floor 1.3 enforced.
Exam route persists; model-grading gated behind LYNX_TEST_MODE=1.
Templates render bilingual prompts when student.locale != 'en' and prompt_es exists.
mypy --strict and bandit clean.

Common pitfalls¶

Rolling your own sandbox for code. Phase 31 already did the work. Reusing it keeps a single hardening surface.
Loading YAML at request time. A 50 ms parse on every quiz hit. Cache parsed quizzes at app start; reload only on file mtime change (or never reload in production).
Letting case sensitivity vary by answer type. conjugation-form is case-insensitive; code stdout is case-sensitive. Document explicitly in the grader's docstring.
Storing the payload_json with raw user input. Sanitize, or store and never render. The code snippets are particularly tempting to dump into a debug page.
A 5-button SM-2 UI. Four buttons (0/2/⅘). Quality 1 and 3 add cognitive load without behavioral difference.
Recreating a review card on every wrong answer. Use INSERT OR IGNORE. SM-2 owns the schedule once the card exists.
Ungated model-graded exam in production. Costs spiral; teacher loses oversight. LYNX_TEST_MODE=1 is the gate.

Next: lab/05-admin-teacher-view.md — the admin dashboard.