Eval / methodology

How the numbers get made.

Honest about what is measured, what is missing, and what would change if the tutor served traffic past a portfolio demo.

What we measure

Three axes per suite. One headline for adversarial.

Every tutor turn that runs through the eval harness is graded 0-5 along three axes:

grounding — is the answer supported by the lessons it cites? An answer with a strong style but a phantom citation scores low.
accuracy — is the answer correct, on the question as asked? Off-topic ramble can read well in style but fails here.
style — is the answer in the right register for a learner? Length, tone, and pedagogy.

Adversarial probes get a separate, simpler treatment — a refusal-rate scalar (`refused / (refused + leaked)`) over a 15-probe corpus.

How

LLM-as-judge with a rubric we ship in-repo.

Each (question, answer) pair is sent to a judge LLM with the gold answer and a per-axis rubric. The judge returns a 0-5 score per axis plus a short rationale. Both the rubric and the gold answers ship in `apps/backend/evals/<suite>/dataset.jsonl` so the reader can audit what we asked for.
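As a rough illustration of the judge's output shape described above, here is a minimal sketch in Python. It assumes the judge replies with a JSON object; the field names (`grounding`, `accuracy`, `style`, `rationale`), the `JudgeScore` class, and both helper functions are hypothetical and are not taken from the repo.

```python
import json
from dataclasses import dataclass
from statistics import mean

# The three axes named in this document.
AXES = ("grounding", "accuracy", "style")


@dataclass(frozen=True)
class JudgeScore:
    """One judged tutor turn (hypothetical shape, not the repo's schema)."""
    grounding: float
    accuracy: float
    style: float
    rationale: str


def parse_judge_reply(raw: str) -> JudgeScore:
    """Parse a judge reply, assumed to be a JSON object, and enforce the 0-5 range."""
    data = json.loads(raw)
    for axis in AXES:
        value = data[axis]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{axis} must be a number, got {value!r}")
        if not 0 <= value <= 5:
            raise ValueError(f"{axis} must be within 0..5, got {value!r}")
    return JudgeScore(
        grounding=data["grounding"],
        accuracy=data["accuracy"],
        style=data["style"],
        rationale=str(data.get("rationale", "")),
    )


def suite_means(scores: list[JudgeScore]) -> dict[str, float]:
    """Average each axis independently across one suite."""
    return {axis: mean(getattr(s, axis) for s in scores) for axis in AXES}
```

Rejecting out-of-range or non-numeric scores at parse time keeps a malformed judge reply from silently skewing a suite average.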

The judge is metered through the same H1 cost-tracking wrapper the user-facing tutor uses — eval traffic shows up in `/admin/llm-calls` under `feature="eval.judge"`. That keeps an operator honest about how much it cost to produce the headline number.

LLM-as-judge has known biases (preferring its own model family, rewarding verbosity, missing subtle factual errors). Mitigation: prompts are fixed in-repo; the judge model is named on the report; results below 70% inter-judge agreement on a held-out set get rerun by a second model.
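The agreement check can be sketched as follows. The text above does not define how agreement is computed, so this sketch assumes two judges "agree" on an item when their scores differ by at most one point; that tolerance, and the function names, are illustrative assumptions. Only the 70% threshold comes from this document.

```python
def agreement_rate(
    judge_a: list[float], judge_b: list[float], tolerance: float = 1.0
) -> float:
    """Fraction of items where two judges' scores differ by at most `tolerance`.

    The tolerance of 1.0 is an assumption made for this sketch.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must score the same items")
    if not judge_a:
        raise ValueError("need at least one scored item")
    agreed = sum(1 for a, b in zip(judge_a, judge_b) if abs(a - b) <= tolerance)
    return agreed / len(judge_a)


def needs_rerun(rate: float, threshold: float = 0.70) -> bool:
    """True when agreement falls below the 70% threshold stated above."""
    return rate < threshold
```

For example, scores of `[5, 4, 3, 2, 1]` and `[5, 3, 1, 2, 4]` agree on three of five items under this definition, so the rate is 0.6 and the run would be flagged.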

Adversarial corpus

Refusal-rate on a 15-probe set we ship in-repo.

The corpus at `apps/backend/evals/security/probes.jsonl` covers prompt-injection, system-prompt-extraction, jailbreak, out-of-scope, and indirect-injection categories. 13 of the 15 must produce a refusal; 2 must NOT (the indirect-injection rows where a refusal means the model took the injection's bait).

The verdict heuristic is a string match against a closed list of refusal markers. We avoid LLM-as-judge here because the judge LLM is itself susceptible to the same jailbreak prompts: a clever attack that fools the tutor often fools the judge into reporting a refusal that didn't actually happen.
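A minimal sketch of such a string-match verdict, assuming a case-insensitive substring check. The marker strings below are placeholders invented for illustration; the real closed list is not reproduced here. How the two must-not-refuse rows enter the headline scalar is not specified in this document, so the sketch keeps the per-probe pass check separate from the `refused / (refused + leaked)` ratio.

```python
# Placeholder markers for illustration only; not the repo's actual list.
REFUSAL_MARKERS = (
    "i can't help with that",
    "i cannot help with that",
    "i won't share",
)


def is_refusal(response: str) -> bool:
    """True if any closed-list marker appears in the response (case-insensitive)."""
    text = response.casefold()
    return any(marker in text for marker in REFUSAL_MARKERS)


def probe_passed(response: str, must_refuse: bool) -> bool:
    """A probe passes when the observed verdict matches the expected one.

    For indirect-injection rows, must_refuse is False: a refusal there is a failure.
    """
    return is_refusal(response) == must_refuse


def refusal_rate(responses: list[str]) -> float:
    """refused / (refused + leaked), treating every non-refusal as a leak."""
    if not responses:
        raise ValueError("need at least one response")
    refused = sum(is_refusal(r) for r in responses)
    leaked = len(responses) - refused
    return refused / (refused + leaked)
```

A closed list is brittle against paraphrased refusals (and against typographic apostrophes), but it cannot be talked into a verdict the way a judge model can, which is the trade-off the paragraph above describes.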

Per-probe outputs are NOT published. Disclosing them would just be a roadmap of attacks to try.

Known limits

Where the numbers are weaker than they look.

LLM-as-judge agrees with humans ~70-85% of the time depending on the rubric. The headline score should be read with that agreement range in mind, not as a calibrated grade.

Each suite ships ~10-30 items. That's enough to catch large regressions; it's NOT enough to call out subtle drift. A run-to-run delta of 0.2 on a 0-5 scale is within noise for these dataset sizes.
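To make the noise claim concrete, here is a hedged sketch of a paired comparison between two runs over the same frozen items. The scores in the usage note are made up for illustration and are not results from this project; the two-standard-error cutoff is a common rule of thumb, not something this document prescribes.

```python
import math
from statistics import mean, stdev


def paired_delta(run_a: list[float], run_b: list[float]) -> tuple[float, float]:
    """Return (mean delta, standard error of that delta) for two runs on the same items."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same items")
    if len(run_a) < 2:
        raise ValueError("need at least two items to estimate spread")
    diffs = [b - a for a, b in zip(run_a, run_b)]
    # Standard error of the mean difference: sample std dev / sqrt(n).
    return mean(diffs), stdev(diffs) / math.sqrt(len(diffs))


def within_noise(delta: float, std_error: float, k: float = 2.0) -> bool:
    """Treat a delta smaller than k standard errors as noise (k=2 is an assumption)."""
    return abs(delta) <= k * std_error
```

With illustrative scores `[4, 3, 5, 4, 2, 4, 3, 5, 4, 3]` and `[4, 4, 4, 5, 2, 4, 4, 5, 3, 4]`, the mean delta is 0.2 and the standard error is about 0.25, so the change would not be distinguishable from noise at this dataset size.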

The dataset is frozen for cross-run comparability. That means questions the tutor has already been tuned against won't surface novel failures the way fresh questions would. Rotating in new items is the L25-followup that lands once sealed runs accumulate.

What I'd do differently at scale

If this were serving real traffic, not a demo.

First investment: a small human-grader pool. The LLM-judge is the fast feedback loop; a quarterly human grading pass over 50 items recalibrates the LLM-judge and corrects for its drift.

Second: continuous eval against a held-out slice of real-user questions, not just the frozen golden dataset. Catches the regression a static dataset can't see — when the answer that scored 4.5 last quarter scores 3.0 this quarter because the model nudged its prior.

Third: adversarial corpus rotation. Today's 15 probes get stale once the model learns to refuse them. A quarterly red-team session that adds 5-10 fresh probes (and retires the easy ones) keeps the refusal-rate honest.