Reproduce the eval

The repository ships a recorded eval bundle, so you can reproduce the eval scores deterministically, with no model calls, no quota, and no network. This is the same path CI runs on every merge.

Replay the recorded bundle

From a clone of the repository, the bundle lives at tests/data/eval:

gmat-copilot eval --recorded tests/data/eval

This replays the frozen provider completions and judge verdicts, re-runs the structural layer live (which is free and deterministic), and prints the per-prompt outcomes, the per-tier pass rates, and the aggregate pass rate:

leo_circular                 [easy  ] structural=PASS judge=True -> PASS
...
  easy  : ...
  medium: ...
  hard  : ...
pass-rate: ...

Because every input is frozen, the numbers are identical on every run. The bundle consists of three files (prompts.json, completions.json, and judge.json), as described in the evaluation protocol.
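To make the per-tier and aggregate figures in the report concrete, here is a minimal Python sketch of how such pass rates could be computed from per-prompt outcomes. It is not the tool's implementation. It assumes that a prompt passes only when both the structural check and the judge verdict pass (as the sample line above suggests) and that the aggregate is the pass rate over all prompts. The function name and the sample outcomes are hypothetical.

```python
from collections import defaultdict


def pass_rates(outcomes):
    """Compute per-tier and overall pass rates.

    outcomes: iterable of (tier, structural_ok, judge_ok) tuples.
    Returns (per_tier, overall), where per_tier maps tier -> pass rate.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, structural_ok, judge_ok in outcomes:
        totals[tier] += 1
        # Assumption: a prompt passes only if both layers pass.
        if structural_ok and judge_ok:
            passes[tier] += 1
    per_tier = {tier: passes[tier] / totals[tier] for tier in totals}
    total = sum(totals.values())
    # Assumption: the aggregate is computed over all prompts, not per tier.
    overall = sum(passes.values()) / total if total else 0.0
    return per_tier, overall


# Made-up outcomes for illustration only, not real eval results.
example = [
    ("easy", True, True),
    ("easy", True, False),
    ("hard", False, True),
    ("hard", True, True),
]
per_tier, overall = pass_rates(example)
```

Since the function depends only on its inputs, the same frozen outcomes always yield the same rates, which is why the replayed numbers do not drift between runs.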

Run it live

To evaluate fresh drafts and judge them live, pass --live, point --prompts at a prompt set, and choose a model with --model. This needs a reachable generation provider and a reachable judge (the judge defaults to openai/gpt-4.1-mini on the GitHub Models free tier):

gmat-copilot eval --live --prompts tests/data/eval/prompts.json \
  --model anthropic:claude-... --n 3 --pace 1.0

--n sets the number of judge votes per prompt (majority wins; a tie counts as a fail), and --pace inserts a delay between calls to stay within a free-tier per-minute budget.
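The "majority, fail on tie" rule can be sketched in a few lines of Python. This only illustrates the rule as described here and is not the tool's own code. The function name is hypothetical.

```python
def majority_verdict(votes):
    """Return True only if strictly more votes pass than fail.

    votes: list of booleans, one per judge vote.
    A tie (including an empty list) counts as a fail.
    """
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    return yes > no


# With --n 3 a tie cannot occur; with an even n it can, and it fails.
three_votes = majority_verdict([True, True, False])
tied_votes = majority_verdict([True, False])
```

An odd --n avoids ties altogether, so the fail-on-tie branch only matters for even vote counts.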

Freeze your own bundle

Run the live path once, freeze it into a reusable bundle, and replay it deterministically from then on:

gmat-copilot eval --record my-bundle --prompts tests/data/eval/prompts.json \
  --model anthropic:claude-...
gmat-copilot eval --recorded my-bundle    # same scores, no model calls

--record writes completions.json and judge.json next to your prompts.json; the prompt set itself is left untouched as the source of truth.
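Before replaying a bundle you froze yourself, it can help to confirm that the three bundle files are all present. The following is a minimal sketch, assuming the bundle is a directory that holds prompts.json, completions.json, and judge.json side by side. The helper name is hypothetical and is not part of gmat-copilot.

```python
from pathlib import Path

# File names taken from the bundle description on this page.
BUNDLE_FILES = ("prompts.json", "completions.json", "judge.json")


def missing_bundle_files(bundle_dir):
    """Return the bundle file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in BUNDLE_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "my-bundle" mirrors the example command; adjust to your own path.
    missing = missing_bundle_files("my-bundle")
    print("missing:", missing if missing else "none")
```

An empty result means all three files were found. It does not check that their contents are consistent with one another.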