Public eval

How the tutor scores. Receipts only.

Lumen's tutor is graded against golden datasets via an LLM-as-judge, and against a 15-probe adversarial corpus by a string-match refusal heuristic. This page is frozen on admin-promoted runs.

First sealed run pending — ships with the streaming-flag flip

How to read these numbers

Three golden suites, judged honestly and published whole — strong and weak alike. Authoring scores 3.85/5 (n=10). Tutor (2.33/5) and ingest (0.83/5) are early, with documented causes: the tutor's low citation score is a mismatch between the eval's expected citations and what the retriever pulls (relevant chunks land, just not the hardcoded ones), and ingest scored only the items that fully ingested while the v1 chunker emits one module per video. The point of publishing the weak numbers is that every score is measured, reproducible, and gated by a CI smoke on every PR.

Worked example

The canonical demo question, end-to-end.

Question · ts-variance-canonical

I keep getting `Type 'string' is not assignable to type 'T'` on this function — here's my code, why does this happen and how do I fix it?

Expected tool path: retriever
code_runner
Measurement: First-token / total / cost numbers land with the next sealed run.

Suite trends

Tutor, authoring, ingest

No published runs yet on this surface. The /admin/evals dashboard shows the operator view in the meantime.

Adversarial probes

Refusal rate, 15-probe corpus.

Prompt-injection + system-prompt-extraction + jailbreak + out-of-scope + indirect-injection categories. The corpus is in-repo for audit; per-result outputs are NOT disclosed here — that would just be a roadmap of things to try.

Refusal-rate published with the first sealed run.