# Pathrule Pattern: LLM Evaluations & Testing (1.0.0)
# ::pathrule:package:llm-evals

### [RULE] Gate prompt, model, and retrieval changes on an eval run  (path: /src/ai)
<!-- scope: folder | priority: high | advisory -->

A prompt is code: a small edit can improve one case and silently break ten others. Without an eval gate, you find out from users.

- Run the eval set on every change to a prompt, model id, temperature, tool definition, or retrieval config, and compare the scores to the committed baseline before merging.
- Treat a regression on the eval set like a failing test: it blocks the change. An improvement on your one hand-picked example is not evidence; the aggregate score on the dataset is.
- Pin the model version in the eval run. A provider silently changing a model under you is itself a regression you want the evals to catch.
- Keep eval runs in CI (or a pre-merge step) so the gate is enforced regardless of who makes the change. Record the score so the trend is visible over time.
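
The comparison against a committed baseline can be sketched as below. This is a minimal illustration, not a prescribed implementation: the function name `find_regressions`, the `TOLERANCE` value, and the idea of storing scores as a flat metric-to-score mapping are all assumptions made for the example.

```python
# Hypothetical tolerance: how far a score may drop before the gate fails.
TOLERANCE = 0.01


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = TOLERANCE) -> list[str]:
    """Return one message per metric that dropped by more than `tolerance`
    or that is missing from the current run. An empty list means the gate passes."""
    problems = []
    for metric, base_score in baseline.items():
        if metric not in current:
            problems.append(f"{metric}: missing from current run")
        elif current[metric] < base_score - tolerance:
            problems.append(f"{metric}: {base_score:.3f} -> {current[metric]:.3f}")
    return problems


def gate_exit_code(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Map the comparison to a process exit code so CI can block the merge."""
    problems = find_regressions(baseline, current)
    for problem in problems:
        print(f"REGRESSION {problem}")
    return 1 if problems else 0
```

A CI step could load the committed baseline and the fresh scores from JSON files and pass the result of `gate_exit_code` to `sys.exit`, so a regression fails the job the same way a failing test would.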

---

### [MEMORY] Build a labelled eval set that mirrors real usage  (path: /evals)

The eval set is the asset. The model and prompt will change; the dataset is what lets you tell whether a change is better.

- Curate inputs that mirror real traffic: common cases, important edge cases, and adversarial inputs (prompt injection, ambiguous or out-of-scope requests, inputs that should be refused). A dataset of only happy-path examples measures nothing useful.
- For each case, record an expected output, a reference answer, or explicit acceptance criteria. Some tasks have one right answer; many have a rubric instead, and that is fine as long as it is written down.
- Version the dataset alongside the code and grow it from production failures: every real hallucination or bad answer becomes a new eval case so the same regression cannot return unnoticed.
- Keep the set balanced and labelled honestly; do not overfit prompts to a tiny set of examples you keep re-reading. Aim for coverage of the behaviours that matter.
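
One possible shape for a versioned eval case is sketched below, assuming the dataset is stored as JSON Lines. The `EvalCase` fields, the three category names, and the `source` tag are illustrative choices, not a required schema.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical category labels mirroring the kinds of inputs listed above.
CATEGORIES = ("common", "edge", "adversarial")


@dataclass
class EvalCase:
    id: str
    input: str
    category: str
    expected: Optional[str] = None      # exact expected output, if one exists
    reference: Optional[str] = None     # reference answer to compare against
    criteria: list[str] = field(default_factory=list)  # written acceptance criteria
    source: str = "curated"             # e.g. "production-failure"

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"{self.id}: unknown category {self.category!r}")
        # Every case needs something written down to score against.
        if not (self.expected or self.reference or self.criteria):
            raise ValueError(f"{self.id}: no expected output, reference, or criteria")


def load_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-empty line into validated cases."""
    return [EvalCase(**json.loads(line))
            for line in jsonl_text.splitlines() if line.strip()]


def category_counts(cases: list[EvalCase]) -> dict[str, int]:
    """Count cases per category so an unbalanced set is easy to spot."""
    counts = {name: 0 for name in CATEGORIES}
    for case in cases:
        counts[case.category] += 1
    return counts
```

Validating at load time means a case added in a hurry after a production failure cannot enter the set without something to score it against.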

See /evals for the scoring memory and /src/ai for the eval-gate rule.

---

### [MEMORY] Score with deterministic checks first, then a calibrated judge  (path: /evals)

Pick the cheapest scoring method that actually measures the thing, and only reach for an LLM judge when the output is genuinely open-ended.

- Score deterministically wherever you can: exact match, schema/JSON validity, regex, contains-required-facts, executes-without-error, latency, and cost. These checks are nearly free, fast, and not themselves subject to model error.
- For open-ended quality (helpfulness, tone, faithfulness), use an LLM-as-judge: a separate model call that scores the output against a specific, written rubric, ideally returning a structured verdict with a reason, not a bare number.
- Calibrate the judge against human labels on a sample: if the judge does not agree with your team's judgments, fix the rubric before trusting it. An uncalibrated judge is just another opinion.
- For RAG and any grounded answer, score faithfulness explicitly: does the answer follow from the retrieved context, or did the model invent it? This is the direct measure of hallucination. (See the rag-embeddings pattern for retrieval quality.)
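
A few of the deterministic checks named above might look like the following. This is a sketch using only the Python standard library; the function names and the case-insensitive substring approach to required facts are assumptions for the example, and real checks would be tailored to the task.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def contains_required_facts(output: str, facts: list[str]) -> bool:
    """True if every required fact appears in the output (case-insensitive substring)."""
    lowered = output.lower()
    return all(fact.lower() in lowered for fact in facts)


def matches_pattern(output: str, pattern: str) -> bool:
    """True if the regular expression matches anywhere in the output."""
    return re.search(pattern, output) is not None
```

Checks like these can run on every case before any judge call is made, so the judge is reserved for the cases that deterministic scoring cannot settle.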

See /evals for the dataset memory and /src/ai for the eval-gate rule.

---

### [SKILL] llm-eval-set-builder  (path: /)

---
name: llm-eval-set-builder
description: Checklist for building or extending an LLM evaluation set and gating changes on it. Run when adding an LLM feature or after a production quality failure.
---

# LLM eval set builder

## Dataset
- [ ] Inputs mirror real usage: common cases, important edge cases, and adversarial inputs (injection, out-of-scope, must-refuse).
- [ ] Each case has an expected output, reference answer, or written acceptance criteria/rubric.
- [ ] Dataset is versioned with the code and grows from real production failures.
- [ ] Coverage is balanced; prompts are not overfit to a handful of examples.

## Scoring
- [ ] Deterministic checks used where outputs are verifiable (exact match, schema validity, required facts, runs-clean, latency, cost).
- [ ] LLM-as-judge used only for open-ended quality, with a specific written rubric and a structured verdict + reason.
- [ ] Judge calibrated against human labels on a sample; rubric revised until the judge agrees with them.
- [ ] Grounded answers scored for faithfulness (does the answer follow from the source) to catch hallucination.

## Gate
- [ ] Eval run triggers on any prompt/model/temperature/tool/retrieval change; model version pinned.
- [ ] Scores compared to a committed baseline; a regression blocks the change like a failing test.
- [ ] Eval runs in CI / pre-merge; scores recorded so the quality trend is visible.

---

### [SKILL] llm-as-judge-rubric  (path: /)

---
name: llm-as-judge-rubric
description: Guidance for writing a reliable LLM-as-judge scoring prompt. Use when building automated scoring for open-ended LLM outputs.
---

# LLM-as-judge rubric

Use when an output is too open-ended for a deterministic check. A judge is only as good as its rubric.

## Writing the rubric
- [ ] Define each criterion concretely (e.g. faithfulness, relevance, completeness, tone) with what a pass and a fail look like, not just a label.
- [ ] Prefer a small discrete scale (e.g. 1-5 or pass/fail per criterion) over an unanchored 0-100; anchor each level with a description.
- [ ] Ask the judge to give its reasoning first, citing the part of the input/source that justifies the score, and only then the score. Require a structured output (per-criterion verdict + reason).
- [ ] For faithfulness/grounding, give the judge the source context and ask explicitly whether each claim is supported by it.
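
The rubric items above could be turned into a judge prompt and a strict verdict parser roughly as follows. Everything here is illustrative: the `RUBRIC` contents, the prompt wording, and the JSON reply shape are assumptions, and the call to the judge model itself is deliberately left out because it depends on the provider.

```python
import json

# Hypothetical two-criterion rubric; each criterion spells out pass and fail.
RUBRIC = {
    "faithfulness": {
        "pass": "Every claim in the answer is supported by the source context.",
        "fail": "At least one claim is missing from or contradicted by the source context.",
    },
    "relevance": {
        "pass": "The answer addresses the question that was asked.",
        "fail": "The answer is off-topic or answers a different question.",
    },
}


def build_judge_prompt(question: str, answer: str, context: str,
                       rubric: dict = RUBRIC) -> str:
    """Assemble a judge prompt that asks for the reason before the verdict."""
    lines = ["You are grading an answer against a written rubric.", "", "Criteria:"]
    for name, levels in rubric.items():
        lines.append(f"- {name}: pass = {levels['pass']} fail = {levels['fail']}")
    lines += [
        "",
        f"Source context:\n{context}",
        f"Question:\n{question}",
        f"Answer:\n{answer}",
        "",
        "For each criterion, first quote the part of the source or answer that",
        "justifies your decision, then give the verdict. Reply with JSON only:",
        'one key per criterion, each mapping to {"reason": "...", "verdict": "pass" or "fail"}.',
    ]
    return "\n".join(lines)


def parse_verdict(raw: str, rubric: dict = RUBRIC) -> dict[str, dict]:
    """Validate the judge's JSON reply; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    for name in rubric:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing verdict for criterion {name!r}")
        if entry.get("verdict") not in ("pass", "fail"):
            raise ValueError(f"invalid verdict for criterion {name!r}")
        reason = entry.get("reason")
        if not isinstance(reason, str) or not reason.strip():
            raise ValueError(f"missing reason for criterion {name!r}")
    return {name: data[name] for name in rubric}
```

Rejecting malformed replies outright, rather than guessing at a score, keeps judge failures visible in the eval results instead of hiding them inside the aggregate.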

## Making it reliable
- [ ] Use a capable model as the judge; do not have a weak model grade a strong one.
- [ ] Calibrate: score a sample the team has labelled and measure agreement; revise the rubric until the judge matches human judgment.
- [ ] Watch for known judge biases (position, length, self-preference) and control for them (e.g. randomize order in pairwise comparisons).
- [ ] Keep the judge prompt and model versioned with the eval set; a judge change is itself an eval change.
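
Two of the reliability steps above, measuring agreement with human labels and randomizing order in pairwise comparisons, can be sketched as follows. The checklist only says to "measure agreement"; Cohen's kappa is used here as one common chance-corrected statistic and is a choice made for this example, as are the function names.

```python
import random


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of samples where the judge and the human label match."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    labels = set(judge) | set(human)
    expected = sum((judge.count(label) / n) * (human.count(label) / n)
                   for label in labels)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def randomized_pair(answer_a: str, answer_b: str,
                    rng: random.Random) -> tuple[str, str, bool]:
    """Return the two answers in random order plus a flag saying whether they
    were swapped, so the judge's preference can be mapped back afterwards."""
    swapped = rng.random() < 0.5
    if swapped:
        return answer_b, answer_a, True
    return answer_a, answer_b, False
```

For example, if the judge says pass, pass, fail, fail and the team says pass, fail, fail, fail, raw agreement is 0.75 but kappa is 0.5, which shows how much of the raw agreement chance alone would explain.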
