Evals & eval-rigor

Judge your judges. Schema adherence, test-retest stability, and inter-judge agreement — before you trust a score.

Most platforms give you a single LLM-as-judge and a number. langprobe treats evaluation as a measurement problem: a score you can't reproduce isn't a score. The point isn't just to grade outputs — it's to tell you whether the judge is trustworthy.

The problem with single-judge scores

LLM judges are noisy. Published work documents intra-rater unreliability, recency and position bias, verbosity and self-preference bias, and large unexplained variance in how a judge applies a rubric. A single judge, run once, hides all of it behind one confident number.

Rigor as a default

Instead of a bare mean, langprobe reports the reliability measurements behind each score:

Panel-of-judges — score with several judges, not one, and report agreement.
Test–retest stability — run the same judge on the same output more than once and report how consistent it is (intra-rater reliability).
Inter-judge agreement — how much the judges agree with each other (e.g. a κ statistic), so you know whether the rubric is well-defined.
Schema adherence — did the judge actually return the structured verdict it was asked for, every time?
Reference answers & score anchors — reference outputs and per-score descriptions as first-class inputs, so judgments are anchored, not vibes.
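As a rough illustration of the schema-adherence and test-retest checks listed above, here is a minimal Python sketch. It is not langprobe's implementation: the verdict schema (`score`, `rationale`), the function names, and the "share of repeats matching the most common score" stability measure are all assumptions chosen for the example.

```python
import json
from collections import Counter

# Hypothetical verdict schema: field name -> required type (an assumption).
REQUIRED_FIELDS = {"score": int, "rationale": str}


def parse_verdict(raw):
    """Return the parsed verdict dict, or None if it violates the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def schema_adherence(raw_verdicts):
    """Fraction of raw judge responses that parse into a valid verdict.

    Assumes a non-empty list of response strings.
    """
    valid = sum(parse_verdict(raw) is not None for raw in raw_verdicts)
    return valid / len(raw_verdicts)


def retest_stability(scores):
    """Fraction of repeated runs that match the most common score.

    `scores` holds one judge's scores for the same output; assumes non-empty.
    """
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / len(scores)


# Four raw responses from one judge: two valid, one missing a field, one not JSON.
raw_responses = [
    '{"score": 4, "rationale": "accurate and concise"}',
    '{"score": 4}',
    "I would rate this a 4.",
    '{"score": 5, "rationale": "complete"}',
]
adherence = schema_adherence(raw_responses)
stability = retest_stability([4, 4, 5, 4])
```

A judge that often fails the schema check, or that scores the same output differently on repeat runs, is a measurement problem in its own right, whatever mean it reports.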

Together these tell you a score's confidence, not just its value: the difference between "8.2" and "8.2, but the judges agree only at κ = 0.41, so don't trust it."
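To make the κ figure concrete, the sketch below computes Cohen's kappa for two judges labelling the same outputs. Cohen's kappa is only one possible agreement statistic, and the source does not say which one langprobe uses; the function name and the sample labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative verdicts from two hypothetical judges on the same ten outputs.
judge_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "fail"]

kappa = cohens_kappa(judge_a, judge_b)  # roughly 0.4
```

Here the judges agree on 7 of 10 outputs, but their label frequencies alone would produce 50% agreement by chance, so kappa comes out near 0.4. For ordinal scores such as 1 to 10, a weighted kappa or Krippendorff's alpha may be a better fit than the unweighted version shown here.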

Judges

Evals run as eval-runs — a judge (or a panel) fans out over a set of runs and produces scored verdicts with the reliability metrics above. langprobe also supports small, purpose-built open-weight judge models for cheap, repeatable scoring alongside frontier-model judges.
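The fan-out described above can be pictured with a small Python sketch. This does not use the langprobe API: `run_panel`, `panel_means`, the run identifiers, and the two stand-in judge functions are hypothetical, and real judges would be model calls rather than deterministic rules.

```python
from statistics import mean


def run_panel(judges, outputs, repeats=2):
    """Score every output with every judge, `repeats` times each.

    Returns {judge_name: {output_id: [score, ...]}} so that both
    test-retest stability and inter-judge agreement can be computed later.
    """
    results = {}
    for name, judge in judges.items():
        results[name] = {
            output_id: [judge(text) for _ in range(repeats)]
            for output_id, text in outputs.items()
        }
    return results


def panel_means(results):
    """Mean score per output across all judges and repeats."""
    output_ids = next(iter(results.values())).keys()
    return {
        output_id: mean(
            score
            for per_judge in results.values()
            for score in per_judge[output_id]
        )
        for output_id in output_ids
    }


# Stand-in judges (assumptions): deterministic rules instead of model calls.
judges = {
    "length": lambda text: min(5, len(text.split())),
    "strict": lambda text: 5 if text.endswith(".") else 2,
}
outputs = {
    "run-1": "Paris is the capital of France.",
    "run-2": "Paris",
}

results = run_panel(judges, outputs)
means = panel_means(results)
```

Keeping the per-judge, per-repeat scores instead of collapsing straight to a mean is the design choice that matters: the reliability metrics need the raw verdicts.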

Closing the loop

Because a replay can be scored by the same judges as production, "the fix looks better" becomes "the fix scores at least as well, and the judges agree it does." Replay → diff → eval is one loop.

See the API Reference for the eval-runs, luna-judges, and eval-reliability endpoints.
