
Eval Gauntlet
LLM regression testing · in-browser
Synthetic benchmark, hand-authored mock model outputs. The parsing, scoring and regression logic is real and unit-tested.
Eval Gauntlet treats prompt changes the way CI treats code changes. The task under test is structured extraction: given a short infrastructure snippet, produce one security finding as strict JSON ({severity, control, evidence}). A hand-authored synthetic benchmark of 24 gold-labeled cases is scored against two static mock model output sets standing in for prompt v1 and prompt v2. Both sets are seeded with the failure modes that show up in real eval work: missed findings, wrong severities, hallucinated control IDs and malformed JSON (truncation, single quotes, trailing commas, prose preambles).

The harness parses each output through a strict two-layer validator (syntactic JSON, then schema shape) and runs four scorers: exact match, per-field precision/recall/F1, a severity-weighted score so critical cases count most, and a strict-JSON validity rate. It then compares the two runs case by case to produce improved/regressed/unchanged verdicts plus aggregate deltas.

The built-in comparison makes the core point of regression-aware evals: v2 lifts every aggregate while still breaking two cases that v1 got right, which only the per-case gate catches. The benchmark is synthetic by design; the parsing, scoring and comparison machinery is real, fully deterministic and covered by 27 unit tests.
- TypeScript
- React
- Next.js
- Vitest
- Strict JSON validation
- Precision / Recall / F1
Architecture · gold labels → strict parse → multi-scorer → regression gate
Gold benchmark
24 hand-authored synthetic cases: an infrastructure snippet plus the expected {severity, control, evidence} finding.
Strict parsing
Each raw model output passes two layers: syntactic JSON, then schema shape. Trailing commas, prose preambles and missing keys all fail loudly.
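As a rough illustration of the two layers, the sketch below runs `JSON.parse` first and only then checks the shape. It is not the project's actual validator: the names `Finding`, `ParseResult` and `parseFinding` are hypothetical, and the four severity levels are an assumption (the text above only confirms that "critical" exists).

```typescript
// Hypothetical types: only the three field names come from the description above.
type Severity = "low" | "medium" | "high" | "critical"; // assumed level set

interface Finding {
  severity: Severity;
  control: string;
  evidence: string;
}

type ParseResult =
  | { ok: true; finding: Finding }
  | { ok: false; layer: "syntax" | "schema"; reason: string };

const SEVERITIES: readonly string[] = ["low", "medium", "high", "critical"];

function parseFinding(raw: string): ParseResult {
  // Layer 1: syntactic JSON. JSON.parse already rejects trailing commas,
  // single-quoted strings, truncated objects and prose before the object.
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, layer: "syntax", reason };
  }

  // Layer 2: schema shape. The value must be a plain object with three string fields.
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { ok: false, layer: "schema", reason: "expected a JSON object" };
  }
  const obj = value as Record<string, unknown>;
  for (const key of ["severity", "control", "evidence"]) {
    if (typeof obj[key] !== "string") {
      return { ok: false, layer: "schema", reason: `missing or non-string key: ${key}` };
    }
  }
  if (!SEVERITIES.includes(obj.severity as string)) {
    return { ok: false, layer: "schema", reason: "unknown severity" };
  }
  return {
    ok: true,
    finding: {
      severity: obj.severity as Severity,
      control: obj.control as string,
      evidence: obj.evidence as string,
    },
  };
}
```

Returning a tagged result instead of throwing keeps the failing layer visible to the scorers, so a syntax failure and a schema failure can be counted separately.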
Multi-scorer
Exact match, per-field precision/recall/F1, severity-weighted accuracy and strict-JSON validity, computed per run.
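The four scorers could be sketched as below, under stated assumptions rather than as the harness's real definitions. Fields are compared by exact string equality, an output that failed strict parsing is represented as `null`, and the severity weights (1 to 4) are invented for illustration. The per-field counting rule is also an assumption: a wrong value counts as both a false positive and a false negative, and an unparseable output counts as a false negative only.

```typescript
// Hypothetical names and weights; only the scorer list comes from the text above.
type Severity = "low" | "medium" | "high" | "critical"; // assumed level set

interface Finding {
  severity: Severity;
  control: string;
  evidence: string;
}

interface ScoredCase {
  gold: Finding;
  predicted: Finding | null; // null means the output failed strict parsing
}

const FIELDS = ["severity", "control", "evidence"] as const;
type Field = (typeof FIELDS)[number];

const WEIGHT: Record<Severity, number> = { low: 1, medium: 2, high: 3, critical: 4 };

function isExact(c: ScoredCase): boolean {
  const p = c.predicted;
  return p !== null && FIELDS.every((f) => p[f] === c.gold[f]);
}

// Scorer 1: share of cases where all three fields match the gold label.
function exactMatchRate(cases: ScoredCase[]): number {
  return cases.length === 0 ? 0 : cases.filter(isExact).length / cases.length;
}

// Scorer 2: precision, recall and F1 for a single field.
function fieldPRF(cases: ScoredCase[], field: Field) {
  let tp = 0, fp = 0, fn = 0;
  for (const c of cases) {
    const p = c.predicted;
    if (p === null) { fn++; continue; }      // nothing predicted: a miss
    if (p[field] === c.gold[field]) tp++;
    else { fp++; fn++; }                     // wrong value: spurious and missed
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Scorer 3: exact match weighted by gold severity, so critical cases count most.
function severityWeighted(cases: ScoredCase[]): number {
  let earned = 0, total = 0;
  for (const c of cases) {
    const w = WEIGHT[c.gold.severity];
    total += w;
    if (isExact(c)) earned += w;
  }
  return total === 0 ? 0 : earned / total;
}

// Scorer 4: share of outputs that survived strict parsing.
function validityRate(cases: ScoredCase[]): number {
  return cases.length === 0 ? 0 : cases.filter((c) => c.predicted !== null).length / cases.length;
}
```

Keeping validity as its own scorer separates "the model was wrong" from "the model's output could not be read", which the other three numbers would otherwise blur together.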
Regression gate
Prompt v1 vs v2 compared case by case into improved/regressed/unchanged verdicts, so a better aggregate cannot hide broken cases.
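A minimal sketch of such a gate follows, assuming each case reduces to a single pass/fail flag per run. `CaseResult`, `Comparison` and `compareRuns` are hypothetical names, and blocking on any regression is one possible gate policy, not necessarily the one the project uses.

```typescript
// Hypothetical shapes: one pass/fail flag per case per run.
type Verdict = "improved" | "regressed" | "unchanged";

interface CaseResult {
  caseId: string;
  pass: boolean;
}

interface Comparison {
  verdicts: Map<string, Verdict>;
  improved: string[];
  regressed: string[];
  passRateDelta: number; // candidate pass rate minus baseline pass rate
  gatePassed: boolean;   // assumed policy: any regression blocks the change
}

function passRate(run: CaseResult[]): number {
  return run.length === 0 ? 0 : run.filter((r) => r.pass).length / run.length;
}

function compareRuns(baseline: CaseResult[], candidate: CaseResult[]): Comparison {
  if (baseline.length !== candidate.length) {
    throw new Error("runs must cover the same cases");
  }
  const before = new Map<string, boolean>(
    baseline.map((r): [string, boolean] => [r.caseId, r.pass]),
  );

  const verdicts = new Map<string, Verdict>();
  const improved: string[] = [];
  const regressed: string[] = [];

  for (const r of candidate) {
    const was = before.get(r.caseId);
    if (was === undefined) {
      throw new Error(`case ${r.caseId} is missing from the baseline run`);
    }
    if (!was && r.pass) {
      verdicts.set(r.caseId, "improved");
      improved.push(r.caseId);
    } else if (was && !r.pass) {
      verdicts.set(r.caseId, "regressed");
      regressed.push(r.caseId);
    } else {
      verdicts.set(r.caseId, "unchanged");
    }
  }

  return {
    verdicts,
    improved,
    regressed,
    passRateDelta: passRate(candidate) - passRate(baseline),
    gatePassed: regressed.length === 0,
  };
}
```

With a baseline that passes cases a and d and a candidate that passes b, c and d, the pass rate rises from 2/4 to 3/4 while case a regresses, so the aggregate delta is positive and the gate still fails.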
- Gold-labeled cases: 24
- Scorers: 4
- Mock outputs scored: 48 (24 × 2 prompts)
- Unit tests: 27 passing