API reference¶
Public surface¶
gmat_copilot
¶
gmat-copilot — model-agnostic natural-language → GMAT .script generation.
Retrieval-grounded generation through a provider abstraction, with a static lint gate and a
two-layer evaluation suite. The public surface is :func:draft and the :class:CopilotResult it
returns; the package is GMAT-free for generation and validation.
GmatExtraNotInstalled
¶
Bases: RuntimeError
The dry-run was called without the optional [gmat] extra (or without a GMAT install).
Generation and the static lint gate are GMAT-free; only the dynamic dry-run needs gmat-run and a
GMAT install. This is raised eagerly by :func:dry_run so a missing extra is a clear,
actionable error rather than an obscure import failure inside a subprocess.
DraftCancelled
¶
Bases: RuntimeError
Generation was cancelled at a repair-attempt boundary via the cancel callback.
Raised by :func:draft when the caller's cancel predicate returns true before an attempt
begins — the editor surface's cancellable-progress channel (decision D15) routes a user cancel
through this. The check is at attempt boundaries: an in-flight provider call or dry-run
subprocess runs to its own completion / timeout, so a single pass (repair=0) has no boundary
to cancel at and this is raised only once a repair retry would otherwise start.
DraftRejected
¶
Bases: RuntimeError
Strict :func:draft rejected the final draft for blocking diagnostics (decisions D5 / D13).
Raised only after the repair budget is spent. The offending
:class:~gmat_copilot.result.CopilotResult is attached as :attr:result, so the caller can
inspect the script, its lint report, and any dry-run verdict.
Outcome
dataclass
¶
Which draft won and how the run ended (decision D14).
winner indexes the final (returned) draft in :attr:Provenance.repair's attempts — always
the last attempt, recorded explicitly so the sidecar is self-describing. passed is whether
that draft validated clean (lint, plus the dry-run when it ran); strict records the active
mode, so a reader can tell a strict rejection (passed=False, strict=True) from a
permissive best-effort return (passed=False, strict=False). usage is the aggregate token
total across every attempt.
Provenance
dataclass
¶
A versioned record of how a draft was produced (decision D14).
Populated in memory by :func:gmat_copilot.draft for every run — it is the trace the result
already holds — and serialised to a .copilot.json sidecar only on request by the saving
surface (:meth:gmat_copilot.CopilotResult.save). Composes the existing
:class:~gmat_copilot.result.RetrievalTrace and :class:~gmat_copilot.result.RepairTrace; the
per-attempt draft history is :attr:repair's attempts.
CopilotResult
dataclass
¶
Everything a :func:gmat_copilot.draft call produces (decision D10).
save
¶
Write the generated :attr:script to path (UTF-8); return the written path.
With sidecar=True also write the provenance record (decision D14) as a .copilot.json
file next to the script (<path>.copilot.json). The sidecar is written only on request,
never silently; it needs a result from :func:gmat_copilot.draft (whose :attr:provenance
is populated).
Raises:
| Type | Description |
|---|---|
| TypeError | when |
DraftAttempt
dataclass
¶
One iteration of the repair loop (decision D13): a draft and how it validated.
The loop generates a draft, lints it, and — when the dynamic tier is enabled and lint is clean —
dry-runs it; feedback is what was fed into the next attempt's repair prompt.
DryRunReport
dataclass
¶
The dynamic gmat-run dry-run finding for a script — a separate tier from the lint report.
The dry-run runs only on a lint-clean script (decision D12): Mission.load is the config
tier, and mission.run + Results.converged the execution tier, entered only when the
script has a solver (Target / Optimize). Dry-run findings do not merge into
:class:LintReport — lint diagnostics are precise (rule / severity / line / column) and a
dry-run finding is coarser, so it lands here. ok is the blocking signal: a not-ok report
rejects in strict mode, just as a blocking lint diagnostic does.
LintDiagnostic
dataclass
¶
One linter finding mapped from a gmat_script diagnostic: rule, severity, and location.
LintReport
dataclass
¶
The lint diagnostics for a script, in source order, with severity-filtered views.
The strict/permissive decision lives in the validator (decision D5); this is the raw report.
blocking
¶
blocking(*, strict: bool) -> tuple[LintDiagnostic, ...]
Diagnostics that reject a draft under the given mode (decision D5).
Strict rejects on ERROR and WARNING — every WARNING-level rule is a hard GMAT load error. Permissive never blocks: it returns the best-effort script with all diagnostics attached.
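The strict/permissive rule above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: `Severity` and `Diagnostic` are hypothetical stand-ins for the real severity type and `LintDiagnostic`, and the `INFO` level exists here only to show a non-blocking finding.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Hypothetical stand-in for the package's severity type."""
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"  # assumption: some level below WARNING exists


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical stand-in for LintDiagnostic."""
    rule: str
    severity: Severity
    line: int


def blocking(diagnostics: tuple[Diagnostic, ...], *, strict: bool) -> tuple[Diagnostic, ...]:
    """Return the diagnostics that reject a draft under the given mode."""
    if not strict:
        # Permissive mode never blocks; the caller still sees every diagnostic.
        return ()
    # Strict mode blocks on ERROR and WARNING alike.
    return tuple(d for d in diagnostics if d.severity in (Severity.ERROR, Severity.WARNING))


diags = (
    Diagnostic("rule-a", Severity.ERROR, 1),
    Diagnostic("rule-b", Severity.WARNING, 2),
    Diagnostic("rule-c", Severity.INFO, 3),
)
strict_blockers = blocking(diags, strict=True)
permissive_blockers = blocking(diags, strict=False)
```

In this sketch an empty tuple means "nothing rejects the draft", so a caller can test the return value for truthiness.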
RepairTrace
dataclass
¶
The repair loop's per-attempt history and why it stopped (decision D13).
Attached to :attr:CopilotResult.provenance as the substrate the v0.2 provenance sidecar (D14)
formalises into a versioned record.
RetrievalChunk
dataclass
¶
One corpus chunk surfaced by the retriever, with its source and similarity score.
RetrievalTrace
dataclass
¶
The corpus chunks used to ground a generation, most-relevant first.
dry_run
¶
dry_run(
script: str,
*,
timeout: float = DRYRUN_TIMEOUT_S,
gmat_root: str | None = None,
) -> DryRunReport
Dry-run a lint-clean GMAT script against gmat-run in a fresh subprocess (decision D12).
The script is loaded with Mission.load (the config tier); when it declares a solver
(Target / Optimize) it is also run with mission.run and its Results.converged
checked (the execution tier). The verdict is a :class:~gmat_copilot.result.DryRunReport:
ok is True when the script loads (and, if a solver is present, runs and converges), and a
failure carries one actionable, path-free line distilled from GMAT's own diagnostics.
This is the dynamic tier only — it does not lint. Per decision D12 the dry-run runs only on
a lint-clean script, so callers gate it behind :func:gmat_copilot.validate.validate; the
repair loop sequences the two.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| script | str | GMAT mission-script source text (lint-clean; see above). | required |
| timeout | float | wall-clock budget in seconds for the subprocess; on expiry the verdict degrades to a | DRYRUN_TIMEOUT_S |
| gmat_root | str \| None | GMAT install root; defaults to | None |
Returns:
| Type | Description |
|---|---|
| DryRunReport | the dry-run verdict as a :class: |
Raises:
| Type | Description |
|---|---|
| GmatExtraNotInstalled | when the |
require_gmat_extra
¶
Raise :class:GmatExtraNotInstalled unless gmat-run is importable.
The eager guard the dynamic tier and its CLI surfaces call before attempting a dry-run, so a
missing [gmat] extra is a clear, actionable error rather than an obscure import failure.
draft
¶
draft(
request: str,
*,
model: str | None = None,
strict: bool = True,
temperature: float = 0.0,
max_tokens: int = 2048,
retriever: Retriever | None = None,
provider: Provider | None = None,
repair: int = 0,
dry_run: bool = False,
gmat_root: str | None = None,
dry_run_fn: DryRunFn | None = None,
cancel: Callable[[], bool] | None = None,
) -> CopilotResult
Generate a GMAT mission .script from a natural-language request.
Orchestrates retrieve → generate → validate, wrapped in a bounded repair loop (decision D13):
on a failing draft the failing tier's diagnostics are fed back and the model regenerates, up to
repair attempts, stopping on the first clean/runnable draft or on no-progress / oscillation.
Returns a :class:~gmat_copilot.result.CopilotResult for the final draft, with a versioned
:class:~gmat_copilot.provenance.Provenance record (request, retrieval, the per-attempt
history, and the outcome — decision D14) attached on provenance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request | str | what the script should do, in natural language. | required |
| model | str \| None | the | None |
| strict | bool | reject the final draft if it does not validate clean (lint ERROR and WARNING both block per decision D5, and a dry-run failure blocks when the dynamic tier is enabled) by raising :class: | True |
| temperature | float | sampling temperature passed to the provider. | 0.0 |
| max_tokens | int | maximum number of tokens to generate. | 2048 |
| retriever | Retriever \| None | corpus retriever used to ground generation; defaults to a :class: | None |
| provider | Provider \| None | model provider used to generate; defaults to the one model selects. | None |
| repair | int | the retry budget for the repair loop (decision D13). | 0 |
| dry_run | bool | enable the dynamic gmat-run dry-run tier (decision D12) in validation; needs the | False |
| gmat_root | str \| None | GMAT install root forwarded to the dry-run (else | None |
| dry_run_fn | DryRunFn \| None | a dynamic-tier dry-run to use in place of the real gmat-run subprocess (the eval's deterministic replay seam, decision D7); | None |
| cancel | Callable[[], bool] \| None | an optional predicate polled before each attempt begins and again after the provider returns, before validation (decision D15); when it returns true generation stops with :class: | None |
Returns:
| Type | Description |
|---|---|
| CopilotResult | the final draft's script, its lint report (and dry-run verdict), the retrieval trace, provider metadata, aggregate usage, and the provenance record on |
Raises:
| Type | Description |
|---|---|
| DraftCancelled | when cancel returns true before an attempt begins or after the provider returns. |
| DraftRejected | in strict mode, when the final draft still has blocking diagnostics. |
| ProviderError | when no model is resolved: either model is |
| ValueError | when repair is negative. |
read_sidecar
¶
read_sidecar(path: str | Path) -> Provenance
Read a .copilot.json sidecar back into a :class:Provenance (the inverse of writing).
write_sidecar
¶
write_sidecar(
provenance: Provenance, path: str | Path
) -> Path
Write provenance as JSON to path (UTF-8, \n newlines); return the written path.
path is written verbatim — derive the conventional location with :func:sidecar_path.
Result schema¶
result
¶
The result schema returned by :func:gmat_copilot.draft (decision D10).
One stable contract carries everything a generation request produces: the generated .script
text, the lint report, the retrieval trace, and the provider/model/usage that produced it. The
provenance field carries the versioned record of how the draft was produced — the request, the
retrieved chunks, the draft history, and the outcome (decision D14) — and :meth:CopilotResult.save
can serialise it to a .copilot.json sidecar next to the written script.
LintDiagnostic
dataclass
¶
One linter finding mapped from a gmat_script diagnostic: rule, severity, and location.
LintReport
dataclass
¶
The lint diagnostics for a script, in source order, with severity-filtered views.
The strict/permissive decision lives in the validator (decision D5); this is the raw report.
blocking
¶
blocking(*, strict: bool) -> tuple[LintDiagnostic, ...]
Diagnostics that reject a draft under the given mode (decision D5).
Strict rejects on ERROR and WARNING — every WARNING-level rule is a hard GMAT load error. Permissive never blocks: it returns the best-effort script with all diagnostics attached.
DryRunReport
dataclass
¶
The dynamic gmat-run dry-run finding for a script — a separate tier from the lint report.
The dry-run runs only on a lint-clean script (decision D12): Mission.load is the config
tier, and mission.run + Results.converged the execution tier, entered only when the
script has a solver (Target / Optimize). Dry-run findings do not merge into
:class:LintReport — lint diagnostics are precise (rule / severity / line / column) and a
dry-run finding is coarser, so it lands here. ok is the blocking signal: a not-ok report
rejects in strict mode, just as a blocking lint diagnostic does.
RetrievalChunk
dataclass
¶
One corpus chunk surfaced by the retriever, with its source and similarity score.
RetrievalTrace
dataclass
¶
The corpus chunks used to ground a generation, most-relevant first.
DraftAttempt
dataclass
¶
One iteration of the repair loop (decision D13): a draft and how it validated.
The loop generates a draft, lints it, and — when the dynamic tier is enabled and lint is clean —
dry-runs it; feedback is what was fed into the next attempt's repair prompt.
RepairTrace
dataclass
¶
The repair loop's per-attempt history and why it stopped (decision D13).
Attached to :attr:CopilotResult.provenance as the substrate the v0.2 provenance sidecar (D14)
formalises into a versioned record.
CopilotResult
dataclass
¶
Everything a :func:gmat_copilot.draft call produces (decision D10).
save
¶
Write the generated :attr:script to path (UTF-8); return the written path.
With sidecar=True also write the provenance record (decision D14) as a .copilot.json
file next to the script (<path>.copilot.json). The sidecar is written only on request,
never silently; it needs a result from :func:gmat_copilot.draft (whose :attr:provenance
is populated).
Raises:
| Type | Description |
|---|---|
| TypeError | when |
Provenance¶
provenance
¶
The versioned provenance record and its .copilot.json sidecar (decision D14).
D10 reserved a provenance field on :class:~gmat_copilot.result.CopilotResult; D13 filled it
with a :class:~gmat_copilot.result.RepairTrace (the per-attempt draft history). D14 formalises
that into a versioned record of the whole generation: the request, the resolved provider / model,
the retrieved grounding, the draft history, and the outcome — and serialises it to a
.copilot.json sidecar written next to a saved script (e.g. mission.script.copilot.json).
The in-memory :class:Provenance composes the existing dataclasses (it nests the RepairTrace
D13 already builds); the on-disk JSON follows D14's flat schema — schema_version, request,
provider, model, retrieval, drafts, outcome — and :func:to_json_dict /
:func:from_json_dict map between the two. The JSON is stable (sorted keys, a stamped
schema_version), so a recorded sidecar diffs cleanly, and it carries no credentials: the
record only ever holds the request, the provider / model names, the retrieval trace, the drafts,
and token usage — there is no field a key could enter through.
Outcome
dataclass
¶
Which draft won and how the run ended (decision D14).
winner indexes the final (returned) draft in :attr:Provenance.repair's attempts — always
the last attempt, recorded explicitly so the sidecar is self-describing. passed is whether
that draft validated clean (lint, plus the dry-run when it ran); strict records the active
mode, so a reader can tell a strict rejection (passed=False, strict=True) from a
permissive best-effort return (passed=False, strict=False). usage is the aggregate token
total across every attempt.
Provenance
dataclass
¶
A versioned record of how a draft was produced (decision D14).
Populated in memory by :func:gmat_copilot.draft for every run — it is the trace the result
already holds — and serialised to a .copilot.json sidecar only on request by the saving
surface (:meth:gmat_copilot.CopilotResult.save). Composes the existing
:class:~gmat_copilot.result.RetrievalTrace and :class:~gmat_copilot.result.RepairTrace; the
per-attempt draft history is :attr:repair's attempts.
to_json_dict
¶
to_json_dict(provenance: Provenance) -> dict[str, Any]
Render provenance to D14's flat JSON shape (the draft history flattened to drafts).
from_json_dict
¶
from_json_dict(data: dict[str, Any]) -> Provenance
Reconstruct a :class:Provenance from D14's JSON shape, checking the schema version.
Raises:
| Type | Description |
|---|---|
| ValueError | when |
dumps
¶
dumps(provenance: Provenance) -> str
Serialise provenance to stable JSON text — sorted keys, indented, trailing newline.
Sorted keys make a recorded sidecar diff cleanly run to run; ensure_ascii=False keeps any
Unicode in the script or feedback readable (the sidecar is a UTF-8 file, not console output).
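The serialisation settings described above map directly onto the standard `json` module. The sketch below is an illustration only: it takes a plain dict (such as `to_json_dict` might produce) rather than a `Provenance`, the function name is hypothetical, and `indent=2` is an assumption, since the text says only "indented".

```python
import json
from typing import Any


def dumps_sketch(record: dict[str, Any]) -> str:
    """Serialise a record to stable JSON text (hypothetical helper)."""
    # sort_keys gives a deterministic key order, so two runs diff cleanly.
    # ensure_ascii=False writes Unicode as-is instead of \uXXXX escapes.
    # indent=2 is an assumption; the documented behaviour is just "indented".
    return json.dumps(record, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


text = dumps_sketch({"request": "Δv burn to GEO", "model": "example-model", "schema_version": 1})
```

Because the key order does not depend on insertion order, two records with the same content always produce byte-identical text.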
sidecar_path
¶
The sidecar path for a saved script: <script> -> <script>.copilot.json (D14).
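The mapping appends the suffix to the whole file name and does not replace the `.script` extension. A minimal sketch with `pathlib`, using a hypothetical function name rather than the package's own code:

```python
from pathlib import Path
from typing import Union


def sidecar_path_sketch(script_path: Union[str, Path]) -> Path:
    """Derive <script>.copilot.json from <script> (hypothetical helper)."""
    p = Path(script_path)
    # with_suffix() would drop ".script"; appending to the name keeps it.
    return p.with_name(p.name + ".copilot.json")


example = sidecar_path_sketch("out/mission.script")
```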
write_sidecar
¶
write_sidecar(
provenance: Provenance, path: str | Path
) -> Path
Write provenance as JSON to path (UTF-8, \n newlines); return the written path.
path is written verbatim — derive the conventional location with :func:sidecar_path.
read_sidecar
¶
read_sidecar(path: str | Path) -> Provenance
Read a .copilot.json sidecar back into a :class:Provenance (the inverse of writing).
Validation¶
validate
¶
The lint validation gate — the v0.1 validator (decision D5).
Generated scripts are checked with the gmat-script static linter: GMAT-free, instant, and
deterministic. Strict mode rejects on lint ERROR and WARNING (every WARNING-level rule is a hard
GMAT load error); permissive mode returns the best-effort script with all diagnostics attached.
The dynamic gmat-run dry-run and the repair loop are a later, gated capability behind the [gmat]
extra; this module is the GMAT-free tier.
validate
¶
validate(
script: str, *, target_version: str | None = None
) -> LintReport
Lint script and return a :class:~gmat_copilot.result.LintReport.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| script | str | GMAT mission-script source text. | required |
| target_version | str \| None | GMAT catalogue version to lint against; defaults to the newest shipped catalogue. | None |
Returns:
| Type | Description |
|---|---|
| LintReport | the diagnostics in source order. Use :meth: |
Dry-run¶
dryrun
¶
The dynamic gmat-run dry-run tier — the optional [gmat] half of validation (decision D12).
Where :mod:gmat_copilot.validate is the static, GMAT-free, instant lint gate (decision D5), this
is the dynamic backstop: it drives GMAT's own loader and engine over a lint-clean script to catch
the defects a tree-sitter parse cannot — bad numerics, malformed epochs, missing data files, the
undeclared-reference case the linter is too conservative to flag, and solver non-convergence.
The dry-run is tiered (decision D12): Mission.load is the config tier; mission.run +
Results.converged is the execution tier, entered only when the script has a solver
(Target / Optimize), because a script can load and run yet leave a solver unconverged. Each
dry-run runs in a fresh subprocess — gmatpy holds one process-global Moderator and cannot
re-bootstrap in a single interpreter — so a crash or timeout degrades to a failure verdict rather
than taking down the caller, and a repair loop can dry-run several drafts back to back.
The runner imports gmat-run only inside the worker subprocess (:mod:gmat_copilot._dryrun_worker);
this module stays import-safe with the [gmat] extra absent, raising
:class:GmatExtraNotInstalled only when :func:dry_run is actually called without it. The static
lint gate and all of generation remain GMAT-free.
GmatExtraNotInstalled
¶
Bases: RuntimeError
The dry-run was called without the optional [gmat] extra (or without a GMAT install).
Generation and the static lint gate are GMAT-free; only the dynamic dry-run needs gmat-run and a
GMAT install. This is raised eagerly by :func:dry_run so a missing extra is a clear,
actionable error rather than an obscure import failure inside a subprocess.
strip_paths
¶
Replace any absolute POSIX path (/dir/.../name) with its basename (name).
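One way to get this behaviour is a single regular-expression substitution. The pattern below is an assumption made for illustration, not the package's actual expression: it treats any whitespace-free run that starts with `/` at a word boundary as an absolute path.

```python
import re

# A "/" not preceded by a word character or "." starts an absolute path;
# the final capture group is the basename that survives the substitution.
_ABS_POSIX_PATH = re.compile(r"(?<![\w.])/(?:[^\s/]+/)*([^\s/]+)")


def strip_paths_sketch(text: str) -> str:
    """Replace each absolute POSIX path with its basename (hypothetical helper)."""
    return _ABS_POSIX_PATH.sub(r"\1", text)


cleaned = strip_paths_sketch("Cannot open /home/user/data/ephem.bsp for reading")
```

Relative fragments such as `a/b` or `./rel/path` are left alone, because their `/` follows a word character or a dot.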
extract_feedback_line
¶
Return one actionable line from a raw GMAT log or gmat-run error string.
Prefers the first substantive ERROR / Interpreter-Exception line; falls back to the first
WARNING, then to the first non-blank line. Strips the script-path prefix, the sequence-number
prefix, and the trailing in line: noise GMAT appends, and sanitises any absolute path to its
basename.
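The preference order (error first, then warning, then the first non-blank line) can be sketched as below. This shows the selection step only. The marker strings are assumptions about how such lines are recognised, the sample lines are placeholders rather than real GMAT output, and the prefix stripping and path sanitising described above are left out.

```python
from typing import Optional

# Assumed markers, checked case-insensitively, in order of preference.
_MARKER_GROUPS = (("error", "interpreter exception"), ("warning",))


def pick_feedback_line(log: str) -> Optional[str]:
    """Pick one line from a log by severity preference (hypothetical helper)."""
    lines = [ln.strip() for ln in log.splitlines() if ln.strip()]
    for group in _MARKER_GROUPS:
        for ln in lines:
            if any(marker in ln.lower() for marker in group):
                return ln
    # No error or warning found: fall back to the first non-blank line.
    return lines[0] if lines else None


log = "\nstartup banner\nWARNING: placeholder warning\nERROR: placeholder failure\n"
picked = pick_feedback_line(log)
```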
require_gmat_extra
¶
Raise :class:GmatExtraNotInstalled unless gmat-run is importable.
The eager guard the dynamic tier and its CLI surfaces call before attempting a dry-run, so a
missing [gmat] extra is a clear, actionable error rather than an obscure import failure.
dry_run
¶
dry_run(
script: str,
*,
timeout: float = DRYRUN_TIMEOUT_S,
gmat_root: str | None = None,
) -> DryRunReport
Dry-run a lint-clean GMAT script against gmat-run in a fresh subprocess (decision D12).
The script is loaded with Mission.load (the config tier); when it declares a solver
(Target / Optimize) it is also run with mission.run and its Results.converged
checked (the execution tier). The verdict is a :class:~gmat_copilot.result.DryRunReport:
ok is True when the script loads (and, if a solver is present, runs and converges), and a
failure carries one actionable, path-free line distilled from GMAT's own diagnostics.
This is the dynamic tier only — it does not lint. Per decision D12 the dry-run runs only on
a lint-clean script, so callers gate it behind :func:gmat_copilot.validate.validate; the
repair loop sequences the two.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| script | str | GMAT mission-script source text (lint-clean; see above). | required |
| timeout | float | wall-clock budget in seconds for the subprocess; on expiry the verdict degrades to a | DRYRUN_TIMEOUT_S |
| gmat_root | str \| None | GMAT install root; defaults to | None |
Returns:
| Type | Description |
|---|---|
| DryRunReport | the dry-run verdict as a :class: |
Raises:
| Type | Description |
|---|---|
| GmatExtraNotInstalled | when the |
Repair loop¶
repair
¶
The bounded repair loop's building blocks (decision D13).
The loop itself lives in :func:gmat_copilot.generate.draft (it owns prompt construction and the
provider call); this module supplies the pieces it needs: a combined lint-then-dry-run
:func:evaluate, the :func:build_repair_prompt that feeds a failing draft's diagnostics back to
the model, and small helpers for usage aggregation and the no-progress / oscillation hash check.
Validation is lint-first (decision D13): lint is precise and free, so a lint failure is reported
without paying for the dry-run; the dry-run's one-line finding is the backstop for the
lint-clean-but-unrunnable drafts the loop exists for. evaluate calls the dynamic tier only when
it is enabled and the draft is lint-clean, so the GMAT-free path never touches gmat-run.
Verdict
dataclass
¶
The outcome of validating one draft: pass/fail plus the diagnostics to feed forward.
evaluate
¶
evaluate(
script: str,
*,
dry_run: bool,
gmat_root: str | None = None,
timeout: float = 300.0,
dry_run_fn: DryRunFn | None = None,
) -> Verdict
Validate script lint-first, then (if enabled and lint-clean) dry-run it (decision D13).
A draft passes when it is lint-clean — no ERROR and no WARNING, every WARNING being a hard
GMAT load error (decision D5) — and, when dry_run is enabled, the dynamic tier is ok. On
failure the verdict carries the failing tier's diagnostics as feedback for the next attempt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| script | str | the draft to validate. | required |
| dry_run | bool | whether to run the dynamic gmat-run tier on a lint-clean draft. | required |
| gmat_root | str \| None | GMAT install root forwarded to the dry-run (else | None |
| timeout | float | wall-clock budget forwarded to the dry-run. | 300.0 |
| dry_run_fn | DryRunFn \| None | a dynamic-tier dry-run to use in place of the real gmat-run subprocess; the replay seam the recorded eval drives the loop with (decision D7). | None |
build_repair_prompt
¶
The repair request: the original intent + the failing draft + the diagnostics to fix.
The result is a new request string fed back through the normal generation prompt (system framing, retrieval grounding, output contract), so a repair attempt is an ordinary generation that additionally sees the prior attempt and why it failed.
draft_hash
¶
A stable content hash of a draft, for the no-progress / oscillation stop conditions.
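A content hash makes both stop conditions cheap to check: a new draft that hashes to a value already seen is either a repeat of the previous attempt or a return to an earlier one. The sketch below assumes SHA-256 over the UTF-8 text and treats any repeat of an earlier draft as a stop signal. The real algorithm and the exact stop rule are not stated here, so both are assumptions, as are the function names.

```python
import hashlib


def draft_hash_sketch(script: str) -> str:
    """Hash a draft's text; SHA-256 is an assumption, not the documented choice."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


def repeats_earlier_draft(history: list[str], candidate: str) -> bool:
    """True when the candidate is identical to any draft already produced."""
    seen = {draft_hash_sketch(s) for s in history}
    return draft_hash_sketch(candidate) in seen


history = ["draft A", "draft B"]
oscillating = repeats_earlier_draft(history, "draft A")  # back to the first draft
progressing = repeats_earlier_draft(history, "draft C")  # genuinely new text
```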
aggregate_usage
¶
aggregate_usage(
attempts: tuple[DraftAttempt, ...],
) -> dict[str, int]
Sum the per-attempt token usage across the loop, key by key.
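Key-by-key summation is what `collections.Counter` does when updated with a mapping. The sketch below works on plain usage dicts rather than `DraftAttempt` objects, and the key names are hypothetical; providers may report different ones.

```python
from collections import Counter


def aggregate_usage_sketch(usages: list[dict[str, int]]) -> dict[str, int]:
    """Sum token counts across attempts, key by key (hypothetical helper)."""
    total = Counter()
    for usage in usages:
        # Counter.update adds counts per key; keys missing from one attempt
        # simply contribute nothing for that attempt.
        total.update(usage)
    return dict(total)


combined = aggregate_usage_sketch(
    [{"input_tokens": 10, "output_tokens": 5}, {"input_tokens": 3}]
)
```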
Providers¶
providers
¶
The model-agnostic provider abstraction (decisions D4, D7).
One thin :class:Provider protocol with four real adapters (Anthropic, OpenAI, Ollama, GitHub
Models) plus a :class:RecordedProvider that replays committed fixtures for deterministic,
zero-quota CI. There is no default model: selection is explicit ("provider:model"); with
none given, :func:select errors and lists the providers it can reach from configured credentials
— it never auto-picks or recommends one. Credentials come from the environment, never committed.
Each real adapter's complete performs the provider call through its optional extra
([anthropic] / [openai] / [ollama]; GitHub Models needs none) and raises a clear,
actionable error when that extra is not installed or the credential is absent. The protocol,
credential discovery, no-default selection, and the recorded replay path round out the surface;
:class:RecordingProvider captures live completions into the fixture shape the recorded path
replays.
Completion
dataclass
¶
A single provider completion: the text plus the provider/model/usage that produced it.
ProviderError
¶
Bases: RuntimeError
A provider could not satisfy a request (missing credential, unreachable, or unknown).
Provider
¶
Bases: Protocol
The contract every adapter satisfies.
reachable
¶
Whether a call could succeed now — i.e. a credential/host is configured.
complete
¶
complete(
prompt: str,
*,
model: str,
temperature: float = 0.0,
max_tokens: int = 1024,
) -> Completion
Generate a completion for prompt with model.
AnthropicProvider
¶
A Claude model via the user's ANTHROPIC_API_KEY (the [anthropic] extra).
OpenAIProvider
¶
An OpenAI model with the user's key (OPENAI_API_KEY). Needs the [openai] extra.
OllamaProvider
¶
A local Ollama server (OLLAMA_HOST, default http://localhost:11434).
GitHubModelsProvider
¶
GitHub Models (OpenAI-compatible), authenticated with GH_TOKEN / MODELS_PAT.
The free-tier path the eval and CI use; no provider SDK required.
RecordedProvider
¶
Replays committed fixtures keyed by (provider, model, prompt) — fully deterministic.
The CI inference path (decision D7): zero model calls, zero quota. A fixture records whatever
real provider produced it; the replay reports provider == "recorded".
RecordingProvider
¶
Wraps a real provider and records every completion as a replayable fixture (D7 record mode).
A drop-in :class:Provider: :meth:complete delegates to the wrapped provider and stores the
result keyed by (provider, model, prompt) in the shape :class:RecordedProvider and the
eval bundle replay. :meth:save writes the accumulated fixtures to disk, merging with any
already there — the record mode that captures new fixtures for the deterministic CI path.
save
¶
Write the recorded fixtures to path as JSON, merging with any already present.
prompt_key
¶
The deterministic fixture key for a (provider, model, prompt) triple.
reachable_providers
¶
The real providers reachable now from configured credentials, in registry order.
Retrieval¶
rag
¶
Corpus ingest, the FAISS index, and the retriever (decisions D2, D3).
Retrieval grounds generation in the GMAT help pages, the stock sample scripts, the GmatFunctions,
the gmat-script field catalogue, and a curated domain-notes tier. The corpus text and a prebuilt
index for the default embedder are extracted by maintainers at build time (:mod:.build) and
shipped with the package; the runtime :func:.load_corpus loads them with no GMAT install and no
network, rebuilding the index on first use only as a fallback. sentence-transformers / faiss
are imported lazily so importing the package stays light.
BgeEmbedder
¶
The default :class:Embedder: a lazily-loaded BGE sentence-transformer.
sentence_transformers is imported on first use, not at construction, so neither importing
the package nor loading the shipped index (built for this model) pays the model-load cost.
Embedder
¶
Bases: Protocol
Embeds passages and queries into a shared vector space.
Implementations normalise their output so a flat inner-product index measures cosine similarity.
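The reason normalisation matters: for unit-length vectors the inner product equals the cosine of the angle between them, so a flat inner-product index ranks by cosine similarity without further work. A small standard-library check of that identity follows; the vectors are arbitrary examples, and a real embedder would also need to handle a zero vector.

```python
import math


def normalise(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    """Plain inner product."""
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
# Cosine similarity computed the long way, from the raw vectors.
cosine = dot(a, b) / (math.hypot(*a) * math.hypot(*b))
# Inner product of the normalised vectors.
inner = dot(normalise(a), normalise(b))
```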
CorpusIndex
¶
The loaded corpus: the chunks, the FAISS index, and the embedder they were built for.
SearchHit
dataclass
¶
A retrieved corpus chunk and its similarity score (higher is closer).
Retriever
¶
Embeds a query and returns the most relevant corpus chunks (decision D2).
Loads the shipped corpus and prebuilt index for the default embedder (rebuilding on first use
only as a fallback for a non-default embedder or a corpus change), then runs a top-k search and
trims the result to a token budget. The corpus and model load lazily on the first
:meth:retrieve, so constructing a Retriever is cheap.
retrieve
¶
retrieve(
query: str, *, top_k: int | None = None
) -> RetrievalTrace
Return the corpus chunks that ground query, most relevant first.
Trims the ranked hits to :attr:token_budget, keeping whole chunks in rank order and always
retaining at least the top hit. To curb hallucinated command syntax, it also pins one
worked-example chunk — the best-ranked passage that actually shows a mission sequence —
into the grounding even when pure relevance ranks it below the kept set (it commonly does
for setup-heavy queries, which retrieve resource definitions but no command syntax). The
returned trace is exactly the set :func:assemble_context formats into the grounding block.
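The budget trim described above can be sketched as a single pass over the ranked hits. Two assumptions are made for illustration: tokens are counted by whitespace splitting, and trimming stops at the first chunk that does not fit rather than skipping it to try smaller ones. The worked-example pinning is not shown.

```python
def trim_to_budget(ranked_chunks: list[str], token_budget: int) -> list[str]:
    """Keep whole chunks in rank order within a budget (hypothetical helper)."""
    kept = []
    used = 0
    for rank, chunk in enumerate(ranked_chunks):
        cost = len(chunk.split())  # crude token count; an assumption
        # The top hit (rank 0) is always kept, even if it alone exceeds the budget.
        if rank > 0 and used + cost > token_budget:
            break
        kept.append(chunk)
        used += cost
    return kept


over_budget_top = trim_to_budget(["a b c d e", "f g", "h"], 3)
fits_two = trim_to_budget(["a b", "c d", "e f g"], 4)
```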
CorpusChunk
dataclass
¶
One retrieval passage with its source provenance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | the passage embedded and returned as grounding. | required |
| kind | ChunkKind | the corpus tier the passage belongs to. | required |
| origin | str | the source identifier: a help-page name, sample/GmatFunction file name, catalogue type name, or domain-note name. | required |
| section | str | a finer locator within the origin: a help field-section heading or a sample | '' |
load_corpus
¶
load_corpus(
embedder: Embedder | None = None,
*,
corpus_dir: Path | None = None,
) -> CorpusIndex
Load the shipped corpus and its index (decision D2).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| embedder | Embedder \| None | the embedder retrieval will use. | None |
| corpus_dir | Path \| None | the corpus directory; defaults to the shipped package data. | None |
assemble_context
¶
assemble_context(trace: RetrievalTrace) -> str
Format a retrieval trace into a bounded, source-attributed grounding block.
Each chunk is rendered under its source label so generation (and a reader of the result) can see where the grounding came from. Empty when the trace has no chunks.
Evaluation¶
eval
¶
The evaluation suite: prompt set, structural scorer, LLM judge, and scorer (D6/D7).
BoardRow
dataclass
¶
One provider:model entry: public anchor, held-out headline, and close-the-loop cell.
overfit_gap
property
¶
public - held_out — the overfit tell; None until the held-out has been scored.
LeaderboardError
¶
Bases: Exception
A board failed an integrity check (a leak, or a non-reproducing row).
DraftScore
dataclass
¶
One draft's close-the-loop score: the two static layers plus the dynamic dry-run verdict.
LiftReport
dataclass
¶
The close-the-loop outcomes and the per-tier dry-run-agreement and repair-lift aggregates.
base_runnable_by_tier
property
¶
The close-the-loop pass-rate per tier at repair = 0 (the single-pass baseline).
repaired_runnable_by_tier
property
¶
The close-the-loop pass-rate per tier at the repair budget.
lift_by_tier
property
¶
The repair-loop lift per tier: repaired pass-rate minus base pass-rate (decision D13).
dry_run_agreement_by_tier
property
¶
Per tier, the fraction of statically-accepted base drafts that also pass the dry-run.
The denominator is the drafts the v0.1 static eval would pass (structural ∧ judge) at
repair = 0; the numerator, those whose dry-run is also ok. None for a tier with
no statically-accepted draft to compare (the agreement is undefined, not 0). The shortfall
below 1.0 is the static-vs-dynamic gap the dry-run tier exists to surface (decision D12).
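For a single tier, the agreement figure reduces to a filtered ratio. The sketch below represents each base draft as a hypothetical `(statically_accepted, dry_run_ok)` pair instead of the package's row types, and returns `None` when the denominator is empty, as described above.

```python
from typing import Optional


def dry_run_agreement(rows: list[tuple[bool, bool]]) -> Optional[float]:
    """Fraction of statically accepted drafts whose dry-run is also ok."""
    # Denominator: only the drafts the static eval would pass.
    accepted = [dry_ok for static_ok, dry_ok in rows if static_ok]
    if not accepted:
        return None  # undefined, not 0
    return sum(accepted) / len(accepted)


tier_rows = [(True, True), (True, False), (False, True)]
agreement = dry_run_agreement(tier_rows)
```

The third row does not count toward either side of the ratio: it was not statically accepted, so its dry-run verdict is irrelevant to the agreement figure.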
repaired_runnable
property
¶
The overall close-the-loop pass-rate at the repair budget.
LiftRow
dataclass
¶
One prompt's close-the-loop outcome at repair = 0 (base) and the budget (repaired).
base is the single-pass (v0.1) draft; repaired is the draft the bounded loop converged to.
RecordedDryRun
dataclass
¶
A :data:~gmat_copilot.repair.DryRunFn that replays recorded verdicts keyed by draft hash.
The deterministic, GMAT-free dynamic tier with which the recorded close-the-loop eval drives the
real loop (decision D7): :func:gmat_copilot.draft calls it instead of the gmat-run subprocess.
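The replay idea can be sketched as a callable that looks up a verdict by the hash of the draft text. This is an assumption-laden illustration: the SHA-256 key derivation, the `ReplayDryRun` name, and the dict-shaped verdict are hypothetical, and the real `DryRunFn` signature is not shown here.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any


def draft_hash(script: str) -> str:
    # Assumption: SHA-256 of the UTF-8 script text; the real key derivation may differ.
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


@dataclass
class ReplayDryRun:
    """Hypothetical replay callable: returns the recorded verdict for a draft."""
    verdicts: dict[str, dict[str, Any]] = field(default_factory=dict)

    def __call__(self, script: str) -> dict[str, Any]:
        key = draft_hash(script)
        if key not in self.verdicts:
            # An unseen draft cannot be replayed; fail loudly instead of guessing.
            raise KeyError(f"no recorded dry-run verdict for draft {key[:16]}")
        return self.verdicts[key]["dry_run"]
```

Because the lookup depends only on the draft text, the same trajectory always yields the same verdicts, which is what makes the recorded eval reproducible without a GMAT install.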
EvalPrompt
dataclass
¶
One eval prompt: the request sent to the model, its intent, and its structural spec.
StructuralSpec
dataclass
¶
What the deterministic structural layer asserts about a candidate script.
EvalReport
dataclass
¶
The outcomes for an eval run and the aggregate pass-rate.
pass_rate_by_tier
property
¶
The pass-rate within each difficulty tier (decision D6 aggregates per tier).
PromptOutcome
dataclass
¶
The scored outcome for one prompt: structural, judge, and combined verdicts.
StructuralResult
dataclass
¶
The structural verdict for one candidate script and the specific checks it failed.
judge_verdicts
¶
judge_verdicts(
intent: str,
script: str,
*,
model: str = JUDGE_MODEL,
n: int = 3,
provider: Provider | None = None,
pace: float = 0.0,
) -> list[bool | None]
Run the judge n times and return the raw per-run verdicts (decision D6).
The list of verdicts is what the recorded bundle freezes; :func:majority reduces it to the
gate decision. pace seconds are slept between calls to respect the free-tier per-minute budget
when recording live; unit tests leave it at 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| provider | Provider \| None | the model provider; defaults to a :class: | None |
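The repeat-and-pace behavior can be sketched with a plain callable standing in for one judge call. `judge_once` and `collect_verdicts` are hypothetical names; the real function builds the judge request from `intent` and `script` and sends it through a provider.

```python
import time
from typing import Callable, Optional


def collect_verdicts(
    judge_once: Callable[[], Optional[bool]],
    n: int = 3,
    pace: float = 0.0,
) -> list[Optional[bool]]:
    """Call the judge n times, sleeping `pace` seconds between calls."""
    verdicts: list[Optional[bool]] = []
    for i in range(n):
        if i and pace > 0:
            time.sleep(pace)  # pause between calls, never before the first one
        verdicts.append(judge_once())
    return verdicts
```

Returning the raw list, including any `None` entries, keeps the recorded bundle faithful to what the judge actually said; the reduction to a single decision happens separately.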
majority
¶
Majority vote over verdicts, ignoring None; FAIL on a tie (decision D6).
Returns:
| Type | Description |
|---|---|
| bool \| None | the majority boolean, or |
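A minimal sketch of the voting rule as described: `None` entries are ignored and a tie fails. Returning `None` when no readable vote remains is an assumption suggested by the `bool | None` return type, not something this reference states.

```python
from typing import Optional, Sequence


def majority_sketch(verdicts: Sequence[Optional[bool]]) -> Optional[bool]:
    """Majority vote over verdicts, ignoring None; a tie counts as a fail."""
    votes = [v for v in verdicts if v is not None]
    if not votes:
        return None  # assumption: nothing readable to vote on
    yes = sum(votes)
    no = len(votes) - yes
    return yes > no  # strict majority, so a tie returns False
```

With the default of three judge runs, a tie can only arise when one run is unreadable and the other two disagree.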
parse_verdict
¶
Extract the binary verdict from a judge completion.
Prefers the constrained {"satisfies_intent": bool} object; falls back to an unambiguous
bare true/false in the prose. Returns None when the verdict cannot be read — a
None is dropped by :func:majority, never counted as a vote.
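The two-step extraction can be sketched as follows. The regular expressions and the rule for what counts as "unambiguous" (exactly one distinct bare value in the text) are assumptions; the real parser may be stricter or more lenient.

```python
import json
import re
from typing import Optional

_OBJECT = re.compile(r"\{[^{}]*\}")  # flat JSON objects only
_BARE = re.compile(r"\b(true|false)\b", re.IGNORECASE)


def parse_verdict_sketch(completion: str) -> Optional[bool]:
    """Prefer the {"satisfies_intent": bool} object, else a lone bare true/false."""
    for match in _OBJECT.finditer(completion):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next candidate
        value = obj.get("satisfies_intent")
        if isinstance(value, bool):
            return value
    bare = {m.group(1).lower() for m in _BARE.finditer(completion)}
    if len(bare) == 1:
        return bare.pop() == "true"
    return None  # unreadable or contradictory completion
```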
build_from_config
¶
build_from_config(
config: dict[str, Any],
*,
root: Path,
generated_at: str,
tool_version: str,
held_out_root: Path | None = None,
) -> tuple[dict[str, Any], list[str]]
Build a board from a seeds config, resolving bundle paths under root.
Returns the board payload and a list of human-readable notes for seeds that were skipped (their public bundle was absent or did not carry the model — e.g. a live-only seed run offline). A seed is scored against its held-out bundle only when held_out_root is given and that bundle exists, so offline and in per-merge CI every held-out cell is pending. Held-out golds are never committed: the held-out bundles live under held_out_root (a gitignored cache fetched in gated CI), never in the repo tree.
build_leaderboard
¶
build_leaderboard(
rows: Iterable[BoardRow],
*,
eval_protocol_version: str,
generated_at: str,
judge_model: str,
public_set: dict[str, Any],
held_out_set: dict[str, Any],
) -> dict[str, Any]
Assemble a ranked leaderboard.json payload — ranked on the held-out headline (D16).
generated_at is injected (never read from the clock) so the payload is byte-deterministic. The public anchor is shown alongside the headline; a pending held-out sorts last.
dumps
¶
Canonical board bytes: sorted-key, 2-space JSON with a trailing newline (stable diffs).
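A sketch of the canonical form using the standard library. The function name is hypothetical, and any serializer options beyond sorted keys, 2-space indentation, and the trailing newline (for example non-ASCII handling) are left at `json.dumps` defaults as an assumption.

```python
import json
from typing import Any


def canonical_dumps(board: dict[str, Any]) -> str:
    """Sorted-key, 2-space JSON with a trailing newline."""
    return json.dumps(board, sort_keys=True, indent=2) + "\n"
```

Sorting keys removes any dependence on dict insertion order, so two boards with equal content serialize to identical text and produce empty diffs.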
score_entry
¶
score_entry(
model: str,
*,
public_bundle: str | Path,
tool_version: str,
held_out_bundle: str | Path | None = None,
lift_bundle: str | Path | None = None,
provider: str = "recorded",
kind: str = "seed",
n_votes: int = 3,
judge_model: str = JUDGE_MODEL,
submitted_by: str = "seed",
verified: bool = True,
) -> BoardRow
Score one provider:model into a :class:BoardRow through the shipped recorded scorer.
The public number comes from replaying public_bundle (deterministic, quota-free, D7); the
held-out number, when held_out_bundle is given, from replaying it the same way (gated CI
only — the bundle is never committed); the close-the-loop cell, when lift_bundle is given,
from the recorded lift replay. run_recorded raises :class:ProviderError if model is
missing from a bundle — the caller decides whether to skip the seed or fail.
summarize
¶
summarize(report: EvalReport) -> Aggregate
The aggregate-only view of an eval report — the only thing a board row carries from it.
run_live_lift
¶
run_live_lift(
prompts: Sequence[EvalPrompt],
*,
model: str,
judge_model: str = JUDGE_MODEL,
n: int = 3,
budget: int = DEFAULT_BUDGET,
provider: Provider | None = None,
retriever: Retriever | None = None,
judge_provider: Provider | None = None,
gmat_root: str | None = None,
pace: float = 0.0,
) -> LiftReport
Run the close-the-loop eval live: a real model, a real GMAT dry-run, and a live judge.
The on-demand / fixture-refresh path (decision D7) — needs a reachable generation provider, the
[gmat] extra with a GMAT install, and a reachable judge. pace seconds are slept between
prompts to respect the free-tier daily budget. No fixtures are written.
run_recorded_lift
¶
run_recorded_lift(
bundle_dir: str | Path, *, budget: int = DEFAULT_BUDGET
) -> LiftReport
Replay the recorded close-the-loop bundle and return its :class:LiftReport (decision D7).
Deterministic and quota-free: a :class:_TrajectoryProvider replays each prompt's recorded
repair trajectory through the real loop, a :class:RecordedDryRun replays the dynamic tier, and
the recorded judge verdicts settle the semantic layer — zero model calls, zero GMAT.
A bundle directory holds:
- prompts.json: the prompt set (see :func:gmat_copilot.eval.prompts.load_prompts).
- trajectory.json: {prompt_id: [draft_script, ...]}, the recorded repair sequence.
- verdicts.json: {draft_hash: {"dry_run": {...}, "judge": [verdict, ...]}}.
load_prompts
¶
load_prompts(path: str | Path) -> list[EvalPrompt]
Load an eval prompt set from a JSON file.
record_bundle
¶
record_bundle(
prompts: Sequence[EvalPrompt],
bundle_dir: str | Path,
*,
model: str,
judge_model: str = JUDGE_MODEL,
n: int = 3,
provider: Provider | None = None,
retriever: Retriever | None = None,
judge_provider: Provider | None = None,
pace: float = 0.0,
) -> EvalReport
Run the live eval once and freeze it as a recorded bundle in bundle_dir (decision D7).
Writes completions.json (the generated scripts, keyed for :class:RecordedProvider) and
judge.json (the raw per-run judge verdicts). prompts.json is the authored source of
truth and is left untouched; :func:run_recorded on the same directory then reproduces this
run's scores deterministically. Returns the live :class:EvalReport.
run_live
¶
run_live(
prompts: Sequence[EvalPrompt],
*,
model: str,
judge_model: str = JUDGE_MODEL,
n: int = 3,
provider: Provider | None = None,
retriever: Retriever | None = None,
judge_provider: Provider | None = None,
pace: float = 0.0,
) -> EvalReport
Generate and judge each prompt live, returning the :class:EvalReport (decision D6).
Needs a reachable generation provider (model is a "provider:model" selector unless
provider is given) and a reachable judge. pace seconds are slept between model calls to
respect the free-tier per-minute budget. No fixtures are written — use :func:record_bundle
to freeze a run.
run_recorded
¶
run_recorded(
bundle_dir: str | Path, *, model: str
) -> EvalReport
Replay the recorded eval bundle for model and return its :class:EvalReport.
Deterministic and quota-free: the structural layer re-scores the recorded completion text and the judge layer replays the recorded verdicts (decision D7).
structural_score
¶
structural_score(
script_text: str, spec: StructuralSpec
) -> StructuralResult
Score script_text against spec with the deterministic structural checks.
Leaderboard¶
leaderboard
¶
The per-model leaderboard engine: a ranked board over the eval suite (decision D16).
Where :mod:gmat_copilot.eval.runner scores one provider:model into an
:class:~gmat_copilot.eval.runner.EvalReport, this sweeps a set of explicit provider:model selectors
through the same shipped recorded scorer (no new scoring math) and assembles a ranked
leaderboard.json. Two roles for the eval set decide the ranking:
- the committed public prompt set is the reproducibility anchor — its number reproduces byte-for-byte offline from the recorded bundle (decision D7), pinned by the bundle's content hash;
- a never-committed held-out set is the headline: the board ranks on it, so overfitting the
public prompts buys no rank. A large public - held_out gap is the overfit tell.
The board carries aggregates only (per-tier pass-rates, the close-the-loop figures, usage, and a
run block); no prompt text, intent, or judge verdict ever reaches it, so a held-out gold cannot leak
through the published JSON (:func:assert_aggregate_only). Held-out scoring runs only in gated CI
against a private store; offline and in per-merge CI the held-out is pending and the public anchor
stands alone.
The engine is pure and injection-driven: generated_at and tool_version are passed in (never
read from the clock), so a built board is byte-deterministic and testable.
LeaderboardError
¶
Bases: Exception
A board failed an integrity check (a leak, or a non-reproducing row).
Aggregate
dataclass
¶
An aggregate-only score for one prompt set: the headline rate plus its per-tier breakdown.
CloseTheLoop
dataclass
¶
The v0.2 close-the-loop figures for a model (decisions D12, D13), as a board cell.
RunMeta
dataclass
¶
The reproducibility block pinning how a row was produced (decision D16).
BoardRow
dataclass
¶
One provider:model entry: public anchor, held-out headline, and close-the-loop cell.
overfit_gap
property
¶
public - held_out — the overfit tell; None until the held-out has been scored.
summarize
¶
summarize(report: EvalReport) -> Aggregate
The aggregate-only view of an eval report — the only thing a board row carries from it.
close_the_loop_from_lift
¶
close_the_loop_from_lift(
report: LiftReport,
) -> CloseTheLoop
Fold a :class:LiftReport into the board's close-the-loop cell (decisions D12, D13).
bundle_sha16
¶
The 16-hex content hash over names in bundle — pins a recorded result (decision D7).
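One way such a pin could be computed is sketched below. The choice of SHA-256, the inclusion of file names in the digest, the sorted order, and the function signature are all assumptions for illustration; only the 16-hex-digit length and the "hash over named files in a bundle" idea come from the description.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def content_hash16(bundle: Path, names: Iterable[str]) -> str:
    """First 16 hex digits of a digest over the named files' names and bytes."""
    digest = hashlib.sha256()
    for name in sorted(names):  # sorted so the caller's ordering cannot matter
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between a name and its content
        digest.update((bundle / name).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:16]
```

Any edit to a recorded file changes the hash, so a board row that quotes the pin can be checked against the bundle it claims to reproduce.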
recorded_usage
¶
Sum model's recorded generation usage in bundle, plus the implied judge-call count.
Judge usage is not recorded, so judge_calls is derived as generations times the vote count —
the free-tier transparency the board publishes (decision D16), not a measured token total.
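The derivation above can be sketched with a list of per-generation usage records. The record keys (`input_tokens`, `output_tokens`) and the output layout are hypothetical; only the rule that judge calls equal generations times the vote count comes from the description.

```python
def usage_summary(records: list[dict[str, int]], n_votes: int = 3) -> dict[str, int]:
    """Sum recorded generation usage and derive the implied judge-call count."""
    totals = {"generations": len(records), "input_tokens": 0, "output_tokens": 0}
    for record in records:
        totals["input_tokens"] += record.get("input_tokens", 0)
        totals["output_tokens"] += record.get("output_tokens", 0)
    # Derived, not measured: one judge call per vote for every generation.
    totals["judge_calls"] = totals["generations"] * n_votes
    return totals
```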
score_entry
¶
score_entry(
model: str,
*,
public_bundle: str | Path,
tool_version: str,
held_out_bundle: str | Path | None = None,
lift_bundle: str | Path | None = None,
provider: str = "recorded",
kind: str = "seed",
n_votes: int = 3,
judge_model: str = JUDGE_MODEL,
submitted_by: str = "seed",
verified: bool = True,
) -> BoardRow
Score one provider:model into a :class:BoardRow through the shipped recorded scorer.
The public number comes from replaying public_bundle (deterministic, quota-free, D7); the
held-out number, when held_out_bundle is given, from replaying it the same way (gated CI
only — the bundle is never committed); the close-the-loop cell, when lift_bundle is given,
from the recorded lift replay. run_recorded raises :class:ProviderError if model is
missing from a bundle — the caller decides whether to skip the seed or fail.
build_leaderboard
¶
build_leaderboard(
rows: Iterable[BoardRow],
*,
eval_protocol_version: str,
generated_at: str,
judge_model: str,
public_set: dict[str, Any],
held_out_set: dict[str, Any],
) -> dict[str, Any]
Assemble a ranked leaderboard.json payload — ranked on the held-out headline (D16).
generated_at is injected (never read from the clock) so the payload is byte-deterministic. The public anchor is shown alongside the headline; a pending held-out sorts last.
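The ranking rule can be sketched over simplified rows. The tuple shape and the tie-break on model name are assumptions; the description only fixes that rows rank on the held-out rate and that a pending held-out sorts last.

```python
from typing import Optional

# Hypothetical simplified rows: (model, public_rate, held_out_rate or None if pending).
Row = tuple[str, float, Optional[float]]


def rank_rows(rows: list[Row]) -> list[Row]:
    """Order rows by held-out rate, highest first, with pending rows last."""
    def key(row: Row) -> tuple[bool, float, str]:
        model, _public, held_out = row
        pending = held_out is None
        # False sorts before True, so scored rows come first; negate for descending.
        return (pending, -(held_out if held_out is not None else 0.0), model)

    return sorted(rows, key=key)
```

Note that the public rate plays no part in the sort key: a model with the best public number still ranks below one with a better held-out number.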
dumps
¶
Canonical board bytes: sorted-key, 2-space JSON with a trailing newline (stable diffs).
assert_aggregate_only
¶
Raise :class:LeaderboardError if a row's public/held-out cell carries a non-aggregate key.
The firewall the published board must satisfy (decision D16): a board carries pass-rates and metadata only, never a prompt, an intent, or a judge verdict — so a held-out gold cannot leak through it. A row exposing an unexpected key in an aggregate cell is rejected.
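A structural check of this kind can be sketched as an allow-list over cell keys. The allowed key names, the row layout, and the exception class below are hypothetical stand-ins, not the package's actual schema.

```python
from typing import Any


class BoardIntegrityError(Exception):
    """Hypothetical stand-in for the package's LeaderboardError."""


# Assumed set of aggregate keys; the real allow-list is not shown in this reference.
ALLOWED_KEYS = frozenset({"pass_rate", "by_tier", "n"})


def check_aggregate_only(row: dict[str, Any]) -> None:
    """Reject a row whose public or held-out cell carries an unexpected key."""
    for cell_name in ("public", "held_out"):
        cell = row.get(cell_name)
        if cell is None:
            continue  # a pending cell carries nothing to leak
        extra = sorted(set(cell) - ALLOWED_KEYS)
        if extra:
            raise BoardIntegrityError(f"{cell_name} cell has non-aggregate keys: {extra}")
```

An allow-list fails closed: a newly added field is rejected until someone deliberately marks it as safe to publish.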
assert_no_leak
¶
Raise if any held-out secret string appears in the serialized board (a hard leak check).
held_out_secrets
¶
The held-out gold strings the published board must never contain — every held-out prompt's request and intent text, read from the private store under held_out_root.
Fed to :func:assert_no_leak so the firewall scans the bytes that ship, not just their key
names (the structural :func:assert_aggregate_only check). Held-out bundles that have not been
fetched are skipped, so this is a no-op offline and a real content scan only in gated CI.
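The content scan can be sketched as a substring search over the serialized board. The function name is hypothetical, and checking the JSON-escaped form of each secret as well as the raw form is a design assumption added here, since quotes and non-ASCII characters are escaped when a string is embedded in JSON.

```python
import json
from typing import Iterable


def find_leaks(serialized_board: str, secrets: Iterable[str]) -> list[str]:
    """Return every secret that appears in the board text, raw or JSON-escaped."""
    leaked: list[str] = []
    for secret in secrets:
        if not secret:
            continue  # an empty string would match everything
        escaped = json.dumps(secret)[1:-1]  # the secret as it would look inside JSON
        if secret in serialized_board or escaped in serialized_board:
            leaked.append(secret)
    return leaked
```

With no fetched held-out bundles the secrets list is empty and the scan finds nothing, which matches the offline no-op described above.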
build_from_config
¶
build_from_config(
config: dict[str, Any],
*,
root: Path,
generated_at: str,
tool_version: str,
held_out_root: Path | None = None,
) -> tuple[dict[str, Any], list[str]]
Build a board from a seeds config, resolving bundle paths under root.
Returns the board payload and a list of human-readable notes for seeds that were skipped (their public bundle was absent or did not carry the model — e.g. a live-only seed run offline). A seed is scored against its held-out bundle only when held_out_root is given and that bundle exists, so offline and in per-merge CI every held-out cell is pending. Held-out golds are never committed: the held-out bundles live under held_out_root (a gitignored cache fetched in gated CI), never in the repo tree.