04 — Evaluating RAG: Retrieval Metrics, End-to-End Metrics, Faithfulness¶

Evaluating RAG requires three planes: does the retriever bring back the right document? (retrieval), is the answer factually correct? (accuracy), is the answer supported by the context? (faithfulness). Each can fail independently. Without a well-constructed eval set, the whole pipeline becomes a black box.

Three layers of evaluation¶

A RAG pipeline has three layers. Each can be evaluated independently:

Retrieval evaluation: Does the retriever return the correct chunk(s) in the top-k?
Reader evaluation: Given the correct context, does the reader produce a correct answer?
End-to-end evaluation: Given a query, does the full pipeline produce a correct, faithful answer?

A failure at any layer propagates. Diagnosing which layer failed requires evaluating each separately.

The eval set¶

You need a gold standard: a set of (query, expected_chunk_id, expected_answer) triples.

Construction for our KB:

Hand-write ~30-50 queries covering the §A13 verb-tense-person matrix.
"What's the past participle of eat?"
"How do you conjugate work for she in present simple?"
"What's the Spanish translation of I have gone?"
For each query, identify the expected chunk ID (the chunk that should answer it).
For each query, write the expected answer (the correct response).

Avoid: queries that are direct paraphrases of KB chunks (the model would just regurgitate). Prefer: queries that require synthesis or specific extraction from the chunk.

The eval set should be hand-curated, version-controlled, and treated as the ground truth for the phase.
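
A minimal sketch of how such triples might be stored, assuming a JSON Lines layout with one record per line so that diffs stay readable under version control. The `EvalExample` class, the two helper functions, and the chunk IDs are illustrative assumptions, not part of the phase materials.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalExample:
    """One gold triple: (query, expected_chunk_id, expected_answer)."""
    query: str
    expected_chunk_id: str
    expected_answer: str


def dump_jsonl(examples):
    """Serialize examples to JSON Lines text, one record per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


def load_jsonl(text):
    """Parse JSON Lines text back into EvalExample objects, skipping blank lines."""
    return [EvalExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Hypothetical chunk IDs; real ones come from the KB.
examples = [
    EvalExample("What's the past participle of eat?", "chunk-12", "eaten"),
    EvalExample("How do you conjugate work for she in present simple?", "chunk-07", "she works"),
    EvalExample("What's the Spanish translation of I have gone?", "chunk-31", "he ido"),
]

serialized = dump_jsonl(examples)
restored = load_jsonl(serialized)
```

Keeping the set in a line-oriented text format means each added or corrected query shows up as a one-line change in review.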

Retrieval metrics¶

Hit-rate@k¶

For each query, does the expected_chunk_id appear in the retriever's top-k?

\[ \text{hit-rate@k} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}[\text{expected\_chunk}(q) \in \text{top-k}(q)] \]

A binary signal per query. Easy to interpret.

Range: 0 to 1. Higher is better.

DoD target for Phase 29: hit-rate@5 ≥ 0.80, i.e. at least 80% of queries have the expected chunk in the top 5.
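
The formula above can be sketched in a few lines. This assumes the gold labels and the retriever output are both available as plain dictionaries keyed by query; the function name and the sample IDs are hypothetical.

```python
def hit_rate_at_k(expected, retrieved, k):
    """Fraction of queries whose gold chunk appears in the top-k results.

    expected:  dict mapping query -> gold chunk id
    retrieved: dict mapping query -> ranked list of chunk ids (best first)
    """
    if not expected:
        raise ValueError("eval set is empty")
    hits = sum(
        1 for query, gold in expected.items()
        if gold in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical data: q1 hits at rank 1, q2 at rank 3, q3 misses.
expected = {"q1": "chunk-12", "q2": "chunk-07", "q3": "chunk-31"}
retrieved = {
    "q1": ["chunk-12", "chunk-03"],
    "q2": ["chunk-01", "chunk-02", "chunk-07"],
    "q3": ["chunk-05", "chunk-06"],
}

print(hit_rate_at_k(expected, retrieved, k=1))  # one of three queries hits
print(hit_rate_at_k(expected, retrieved, k=3))  # two of three queries hit
```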

MRR (Mean Reciprocal Rank)¶

For each query, find the rank r of the expected chunk in the retriever's output (1 = top, 2 = second, etc.). If not retrieved at all: 1/r := 0.

\[ \text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q(\text{expected\_chunk})} \]

A continuous signal. Captures how close to top the expected chunk is.

Range: 0 to 1. Higher is better.

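DoD target: MRR ≥ 0.60. When every gold chunk is retrieved, an MRR of 0.60 corresponds to a harmonic-mean rank of about 1.7; each miss contributes 0, so the same MRR can also arise from a mix of rank-1 hits and outright misses.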
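
A minimal sketch of the MRR computation, using the same dictionary layout as an assumption (query to gold ID, query to ranked list). A query whose gold chunk is not retrieved contributes 0, matching the convention stated above. Function names and sample data are hypothetical.

```python
def reciprocal_rank(gold, ranked):
    """Return 1/rank of the gold chunk (rank starts at 1), or 0.0 if absent."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(expected, retrieved):
    """Average reciprocal rank over all queries in the eval set."""
    if not expected:
        raise ValueError("eval set is empty")
    total = sum(
        reciprocal_rank(gold, retrieved.get(query, []))
        for query, gold in expected.items()
    )
    return total / len(expected)


# Hypothetical data: gold at rank 2, gold at rank 4, gold missing.
expected = {"q1": "chunk-12", "q2": "chunk-07", "q3": "chunk-31"}
retrieved = {
    "q1": ["chunk-03", "chunk-12"],
    "q2": ["chunk-01", "chunk-02", "chunk-05", "chunk-07"],
    "q3": ["chunk-05", "chunk-06"],
}

mrr = mean_reciprocal_rank(expected, retrieved)  # (1/2 + 1/4 + 0) / 3 = 0.25
```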

Recall@k¶

For multi-relevance queries (multiple chunks could answer): fraction of relevant chunks retrieved.

For Phase 29's single-relevance queries, recall@k ≡ hit-rate@k.

Precision@k¶

For multi-relevance queries: fraction of top-k that are relevant.

For Phase 29's single-relevance, precision@k = 1/k if hit, 0 if miss.
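
For the multi-relevance case, a sketch under the assumption that each query comes with a set of relevant chunk IDs. Precision divides by k even when fewer than k results come back, which is what makes the single-relevance case equal 1/k on a hit. Names and IDs are hypothetical.

```python
def recall_at_k(relevant, ranked, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("query has no relevant chunks")
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k slots occupied by relevant chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    return sum(1 for chunk_id in ranked[:k] if chunk_id in relevant) / k


ranked = ["chunk-12", "chunk-40", "chunk-13", "chunk-02", "chunk-09"]

# Multi-relevance: two relevant chunks, both found within the top 5.
multi = {"chunk-12", "chunk-13"}
r5 = recall_at_k(multi, ranked, 5)      # 2 of 2 relevant retrieved
p5 = precision_at_k(multi, ranked, 5)   # 2 of 5 slots relevant

# Single-relevance: recall@k collapses to hit/miss, precision@k to 1/k on a hit.
single = {"chunk-12"}
r5_single = recall_at_k(single, ranked, 5)
p5_single = precision_at_k(single, ranked, 5)
```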

Reader metrics (open-book, on provided context)¶

Given the correct context (gold chunk + maybe a few distractors), does the reader produce a correct answer?

Exact match (EM)¶

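Does the answer, after normalization (case, punctuation, whitespace), equal the expected answer string? A looser variant only checks that the answer contains it.

Brittle: "the past participle is eaten" vs "eaten is the past participle": both are correct, but they cannot both equal a single gold string, so strict EM marks at least one of them wrong.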

Use for highly templated answers; less useful for free-form.
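
A small sketch of both flavours, strict equality and containment, assuming a simple normalization (lowercase, strip ASCII punctuation, collapse whitespace). The normalization rules and function names are assumptions for illustration; the lab may use different ones.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(predicted, gold):
    """Strict EM: the normalized strings must be identical."""
    return normalize(predicted) == normalize(gold)


def contains_match(predicted, gold):
    """Looser variant: the gold string appears as whole words in the prediction."""
    return f" {normalize(gold)} " in f" {normalize(predicted)} "


strict = exact_match("The past participle is eaten.", "eaten")     # False
loose = contains_match("The past participle is eaten.", "eaten")   # True
```

Padding with spaces in `contains_match` keeps "eat" from matching inside "eaten".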

F1 (token overlap)¶

F1 between expected answer tokens and predicted answer tokens.

For verb-grammar answers (often short, lexically constrained), F1 is reasonable.
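
Token-overlap F1 can be sketched as follows, assuming whitespace tokenization on lowercased text and counting repeated tokens with multiset intersection. Tokenization details (punctuation handling, stemming) are assumptions and will change the scores.

```python
from collections import Counter


def token_f1(predicted, gold):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: each token counts at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("she works", "works")   # precision 1/2, recall 1 -> 2/3
perfect = token_f1("he ido", "he ido")     # identical token sets
miss = token_f1("worked", "works")         # no shared token
```

Note that F1 ignores word order, so a reordered but lexically identical answer scores the same as the gold string.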

LLM-judge¶

Use a separate (large, strong) LLM as a judge. Prompt: "Given the question, the gold answer, and the candidate answer, score the candidate from 0-1."

Expensive: one large-LLM call per eval query. Common in production evaluation; mentioned here for completeness.
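
For readers curious what the judge loop might look like, here is a provider-agnostic sketch. It assumes the caller supplies `call_llm`, a hypothetical function that takes a prompt string and returns the model's text reply; no real client library is implied, and the prompt wording and parsing rule are illustrative only.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = (
    "Given the question, the gold answer, and the candidate answer, "
    "score the candidate from 0 to 1. Reply with the number only.\n\n"
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Candidate answer: {candidate}\n"
    "Score:"
)


def build_judge_prompt(question: str, gold: str, candidate: str) -> str:
    """Fill the judge template with one eval example."""
    return JUDGE_TEMPLATE.format(question=question, gold=gold, candidate=candidate)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from the reply; None if missing or out of range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 1.0 else None


def judge(question: str, gold: str, candidate: str,
          call_llm: Callable[[str], str]) -> Optional[float]:
    """Score one candidate using the caller-supplied LLM function."""
    return parse_score(call_llm(build_judge_prompt(question, gold, candidate)))


# Stub standing in for a real model call, for demonstration only.
score = judge("What's the past participle of eat?", "eaten", "eaten",
              call_llm=lambda prompt: "Score: 0.9")
```

Returning `None` for unparseable replies keeps judge failures visible instead of silently counting them as zeros.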

For Phase 29: lab 03 uses F1 + manual qualitative inspection on 30 hand-picked queries. No LLM-judge.

End-to-end metrics¶

Combine retrieval and reading: given query, produce an answer; evaluate the answer against the gold.

End-to-end scores are harder to interpret because a failure could come from retrieval or from reading. Always report retrieval-only metrics alongside them.

Faithfulness¶

The big idea: the answer should be supported by the retrieved context, not pulled from the model's weights.

Why this matters¶

If the answer matches the gold but isn't supported by the retrieved context, the model is using its own knowledge (closed-book), not RAG. This makes the system fragile:

Updates to the KB don't propagate to answers.
Hallucinations can creep in (the model says something not in the context).
Citations are inaccurate (the cited chunks don't actually contain the claim).

Heuristic faithfulness¶

For each (answer_sentence, retrieved_chunks_text) pair, compute token overlap. If the sentence's tokens are mostly covered by the chunks' tokens, mark it "supported"; otherwise mark it "unsupported".

Threshold: e.g., at least 70% of the sentence's tokens (excluding stopwords) must appear in the retrieved chunks.

This is brittle but quantitative. It captures the shape of grounding without verifying logical entailment.
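
A minimal sketch of the overlap heuristic. The stopword list, the tokenizer regex, and the function names are assumptions chosen for brevity; only the 70% threshold comes from the text above.

```python
import re

# Tiny illustrative stopword list; a real one would be longer.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "is", "are", "to", "in", "for", "and", "or", "it"}
)


def content_tokens(text):
    """Lowercased word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def support_ratio(sentence, chunks):
    """Share of the sentence's content tokens found in any retrieved chunk."""
    tokens = content_tokens(sentence)
    if not tokens:
        return 1.0  # nothing to verify
    vocabulary = set()
    for chunk in chunks:
        vocabulary.update(content_tokens(chunk))
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)


def is_supported(sentence, chunks, threshold=0.70):
    """Apply the threshold to decide supported vs unsupported."""
    return support_ratio(sentence, chunks) >= threshold


chunks = ["Present simple, third person singular: she works."]
ratio = support_ratio("She works every day.", chunks)  # "every", "day" are uncovered
grounded = is_supported("She works.", chunks)
```

The first sentence scores 0.5 because "every" and "day" never appear in the chunk, which illustrates how added detail lowers the score even when the core claim is grounded.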

LLM-judge faithfulness¶

A stronger LLM is given the answer and the chunks and asked: "Is every claim in the answer supported by the chunks?" The response is scored 0-1.

More accurate; more expensive.

For Phase 29 we use heuristic faithfulness. DoD: ≥ 70% on a 30-query qualitative eval.

Citation accuracy¶

Independent of faithfulness: do the citations actually refer to the chunks that contain the claim?

If the answer cites chunk-12 for a claim, does chunk-12 contain that claim?
If the answer makes a claim with no citation, is the claim still supported by any retrieved chunk?

For Phase 29, we ask the reader to output citations as [chunk-id] tags. Lab 03 verifies them.
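
One way such a check could be automated, sketched under several assumptions: tags look like `[chunk-12]`, each tag sits before the sentence's final punctuation, and "contains the claim" is approximated by token overlap with a 0.70 threshold. None of this is taken from lab 03; the function and verdict names are hypothetical.

```python
import re

CITATION = re.compile(r"\[(chunk-[A-Za-z0-9_-]+)\]")


def tokens(text):
    """Set of lowercased word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def audit_citations(answer, retrieved, threshold=0.70):
    """Return one (sentence, chunk_id, verdict) row per citation tag.

    retrieved: dict mapping chunk id -> chunk text for the retrieved chunks.
    """
    rows = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = tokens(CITATION.sub("", sentence))
        for chunk_id in CITATION.findall(sentence):
            if chunk_id not in retrieved:
                verdict = "not_retrieved"
            elif claim and len(claim & tokens(retrieved[chunk_id])) / len(claim) >= threshold:
                verdict = "supported"
            else:
                verdict = "unsupported"
            rows.append((sentence, chunk_id, verdict))
    return rows


retrieved = {"chunk-07": "Present simple of work: I work, you work, she works."}
answer = (
    "She works in present simple [chunk-07]. "
    "The past participle is worked [chunk-07]. "
    "It is regular [chunk-99]."
)
verdicts = [row[2] for row in audit_citations(answer, retrieved)]
```

Uncited sentences produce no rows here; checking them against all retrieved chunks is the second question in the list above and would need a separate pass.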

Where each metric typically fails¶

Metric	Common failure mode
hit-rate@5	retriever brings back semantically-close-but-wrong chunks
MRR	retriever finds the gold chunk but at low rank
F1	reader rephrases the answer; F1 misses lexical variations
Faithfulness (heuristic)	reader extracts from chunks but uses synonyms — heuristic misses overlap
Citation accuracy	reader cites a chunk it didn't actually use

A good eval includes all of them. Labs 02 and 03 together cover them.

Reading-only evaluation (oracle context)¶

A useful diagnostic: feed the reader the gold chunk directly (skipping retrieval). Measure F1.

If F1 is high in oracle mode but low in end-to-end mode → retrieval is the bottleneck.

If F1 is low even in oracle mode → reader is the bottleneck.

For Phase 29: lab 03's optional Block E does this. Expect reader F1 ≥ 0.80 in oracle mode (the task is simple given correct context); end-to-end F1 should be lower, roughly in proportion to the retrieval hit-rate.
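
The "in proportion to hit-rate" expectation can be made concrete with a back-of-the-envelope estimate. This sketch assumes that on a retrieval hit the reader performs as in oracle mode, and that on a miss it scores `miss_f1` (0 by default). Both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_end_to_end_f1(hit_rate, oracle_f1, miss_f1=0.0):
    """Estimate end-to-end F1 from hit-rate and oracle-mode reader F1."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * oracle_f1 + (1.0 - hit_rate) * miss_f1


def unexplained_gap(measured_f1, hit_rate, oracle_f1):
    """How far measured end-to-end F1 falls below the retrieval-only estimate."""
    return expected_end_to_end_f1(hit_rate, oracle_f1) - measured_f1


estimate = expected_end_to_end_f1(hit_rate=0.85, oracle_f1=0.80)  # about 0.68
gap = unexplained_gap(measured_f1=0.60, hit_rate=0.85, oracle_f1=0.80)
```

A gap near zero suggests retrieval misses explain the end-to-end loss; a clearly positive gap suggests the reader is also losing accuracy on queries where the gold chunk was retrieved, for example to distractors.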

Calibration of expectations¶

For a 50-doc KB with all-MiniLM-L6-v2:

hit-rate@5 ~ 0.85-0.95 (small KB, well-anchored chunks).
MRR ~ 0.65-0.80.
Reader F1 (oracle) ~ 0.80-0.90.
Reader F1 (end-to-end) ~ 0.70-0.85.
Heuristic faithfulness ~ 0.70-0.85.

These are expected ranges, not guarantees. Lab measurements may shift them.

Failure-mode taxonomy¶

When the end-to-end answer is wrong, classify which layer failed:

Layer	Symptom
Retrieval miss	Gold chunk not in top-k
Retrieval distract	Gold chunk in top-k but reader ignores it, picks distractor
Reader miss	Gold chunk passed to reader; reader extracts the wrong sub-fact
Reader hallucination	Reader adds details not in any retrieved chunk
Reader format	Answer is correct but in wrong format (missed exact match)

The phase report should classify every failure in the eval set by layer. A common case is retrieval-distract: the gold chunk is in the top 3, but a distractor is more "salient" to the reader.
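
A sketch of how the table could be turned into a tally for the report. It assumes each eval query has already been annotated (by hand or by the checks above) with five boolean flags; the flag names, the `Outcome` class, and the precedence order of the rules are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Outcome:
    """Per-query annotations; field names are hypothetical."""
    gold_in_top_k: bool          # retriever returned the gold chunk
    used_gold_chunk: bool        # reader's answer draws on the gold chunk
    has_unsupported_detail: bool # answer contains content found in no chunk
    answer_correct: bool         # content matches the expected answer
    format_ok: bool              # answer is in the expected format


def classify(outcome):
    """Map one annotated query to a failure layer, checking retrieval first."""
    if not outcome.gold_in_top_k:
        return "retrieval_miss"
    if outcome.has_unsupported_detail:
        return "reader_hallucination"
    if not outcome.used_gold_chunk:
        return "retrieval_distract"
    if not outcome.answer_correct:
        return "reader_miss"
    if not outcome.format_ok:
        return "reader_format"
    return "success"


outcomes = [
    Outcome(True, True, False, True, True),     # clean success
    Outcome(False, False, False, False, True),  # gold chunk never retrieved
    Outcome(True, False, False, False, True),   # reader followed a distractor
    Outcome(True, True, False, True, False),    # right content, wrong format
]
tally = Counter(classify(o) for o in outcomes)
```

Checking retrieval before the reader flags means a query is never blamed on the reader when the gold chunk was not available to it.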

Drill problems¶

Solutions are available at phase open in solutions/04-evaluation-ref.md.

A query has gold chunk at rank 3 of 10. What's MRR for this single query? What if it's at rank 1? At rank 10? Not retrieved?
Reader produces "eaten is the past participle of eat". Expected: "the past participle of eat is eaten". Compute F1 on tokens.
A failure-mode classification of 30 hand-picked queries gives: 5 retrieval miss, 8 retrieval distract, 4 reader miss, 13 success. What are hit-rate@5 and end-to-end accuracy?
Faithfulness heuristic: answer's tokens (excluding stopwords) must appear in retrieved chunks ≥ 70%. Apply to "the past participle of eat is eaten" given a chunk containing "the past participle of the irregular verb eat is eaten (Spanish: comido)". Pass or fail?
Argue: if hit-rate@5 is 90% but end-to-end F1 is 60%, where's the bottleneck? What would you do next?

One-paragraph recap¶

RAG evaluation has three layers: retrieval (hit-rate@k, MRR), reading (F1 on oracle context), and end-to-end (F1 on the full pipeline), plus faithfulness (is the answer supported by the retrieved context?) and citation accuracy (do the citations point to chunks that contain the claim?). Each can fail independently. The eval set is a hand-curated collection of (query, expected_chunk_id, expected_answer) triples: small but high-quality. For Phase 29 we target hit-rate@5 ≥ 0.80, MRR ≥ 0.60, faithfulness ≥ 70%. The phase report classifies failures by layer; this drives targeted improvements.

Next: lab/00-kb-curation.md.