Skip to content
Lumen

Public eval

How the tutor scores. Receipts only.

Lumen's tutor is graded against golden datasets via an LLM-as-judge, and against a 15-probe adversarial corpus by a string-match refusal heuristic. This page is frozen on admin-promoted runs.

First sealed run pending — ships with the streaming-flag flip
Worked example

The canonical demo question, end-to-end.

Question · ts-variance-canonical

I keep getting `Type 'string' is not assignable to type 'T'` on this function — here's my code, why does this happen and how do I fix it?

Expected tool path
  • retriever
  • code_runner
Measurement
First-token / total / cost numbers land with the next sealed run.
Suite trends

Tutor, authoring, ingest

No published runs yet on this surface. The /admin/evals dashboard shows the operator view in the meantime.

Adversarial probes

Refusal rate, 15-probe corpus.

Prompt-injection + system-prompt-extraction + jailbreak + out-of-scope + indirect-injection categories. The corpus is in-repo for audit; per-result outputs are NOT disclosed here — that would just be a roadmap of things to try.

Refusal-rate published with the first sealed run.