API reference

Public surface

gmat_copilot

gmat-copilot — model-agnostic natural-language → GMAT .script generation.

Retrieval-grounded generation through a provider abstraction, with a static lint gate and a two-layer evaluation suite. The public surface is :func:draft and the :class:CopilotResult it returns; the package is GMAT-free for generation and validation.

GmatExtraNotInstalled

Bases: RuntimeError

The dry-run was called without the optional [gmat] extra (or without a GMAT install).

Generation and the static lint gate are GMAT-free; only the dynamic dry-run needs gmat-run and a GMAT install. This is raised eagerly by :func:dry_run so a missing extra is a clear, actionable error rather than an obscure import failure inside a subprocess.

DraftCancelled

Bases: RuntimeError

Generation was cancelled via the cancel callback, either at an attempt boundary or between the generate and validate phases.

Raised by :func:draft when the caller's cancel predicate returns true before an attempt begins or after the provider returns, before validation; the editor surface's cancellable-progress channel (decision D15) routes a user cancel through this. The predicate is polled only at those points: an in-flight provider call or dry-run subprocess runs to its own completion / timeout, but a cancel observed after generation skips the dry-run, so even a single pass (repair=0) can be cancelled between its generate and validate phases.

DraftRejected

Bases: RuntimeError

Strict :func:draft rejected the final draft for blocking diagnostics (decisions D5 / D13).

Raised only after the repair budget is spent. The offending :class:~gmat_copilot.result.CopilotResult is attached as :attr:result, so the caller can inspect the script, its lint report, and any dry-run verdict.

Outcome dataclass

Which draft won and how the run ended (decision D14).

winner indexes the final (returned) draft in :attr:Provenance.repair's attempts — always the last attempt, recorded explicitly so the sidecar is self-describing. passed is whether that draft validated clean (lint, plus the dry-run when it ran); strict records the active mode, so a reader can tell a strict rejection (passed=False, strict=True) from a permissive best-effort return (passed=False, strict=False). usage is the aggregate token total across every attempt.

Provenance dataclass

A versioned record of how a draft was produced (decision D14).

Populated in memory by :func:gmat_copilot.draft for every run — it is the trace the result already holds — and serialised to a .copilot.json sidecar only on request by the saving surface (:meth:gmat_copilot.CopilotResult.save). Composes the existing :class:~gmat_copilot.result.RetrievalTrace and :class:~gmat_copilot.result.RepairTrace; the per-attempt draft history is :attr:repair's attempts.

CopilotResult dataclass

Everything a :func:gmat_copilot.draft call produces (decision D10).

save

save(path: str | Path, *, sidecar: bool = False) -> Path

Write the generated :attr:script to path (UTF-8); return the written path.

With sidecar=True also write the provenance record (decision D14) as a .copilot.json file next to the script (<path>.copilot.json). The sidecar is written only on request, never silently; it needs a result from :func:gmat_copilot.draft (whose :attr:provenance is populated).

Raises:

TypeError: when sidecar=True but :attr:provenance is not a populated record (e.g. a hand-built result).

DraftAttempt dataclass

One iteration of the repair loop (decision D13): a draft and how it validated.

The loop generates a draft, lints it, and — when the dynamic tier is enabled and lint is clean — dry-runs it; feedback is what was fed into the next attempt's repair prompt.

DryRunReport dataclass

The dynamic gmat-run dry-run finding for a script — a separate tier from the lint report.

The dry-run runs only on a lint-clean script (decision D12): Mission.load is the config tier, and mission.run + Results.converged the execution tier, entered only when the script has a solver (Target / Optimize). Dry-run findings do not merge into :class:LintReport — lint diagnostics are precise (rule / severity / line / column) and a dry-run finding is coarser, so it lands here. ok is the blocking signal: a not-ok report rejects in strict mode, just as a blocking lint diagnostic does.

LintDiagnostic dataclass

One linter finding mapped from a gmat_script diagnostic: rule, severity, and location.

LintReport dataclass

The lint diagnostics for a script, in source order, with severity-filtered views.

The strict/permissive decision lives in the validator (decision D5); this is the raw report.

clean property

clean: bool

True when the linter reported nothing at all.

blocking

blocking(*, strict: bool) -> tuple[LintDiagnostic, ...]

Diagnostics that reject a draft under the given mode (decision D5).

Strict rejects on ERROR and WARNING — every WARNING-level rule is a hard GMAT load error. Permissive never blocks: it returns the best-effort script with all diagnostics attached.

RepairTrace dataclass

The repair loop's per-attempt history and why it stopped (decision D13).

Attached to :attr:CopilotResult.provenance as the substrate the v0.2 provenance sidecar (D14) formalises into a versioned record.

RetrievalChunk dataclass

One corpus chunk surfaced by the retriever, with its source and similarity score.

RetrievalTrace dataclass

The corpus chunks used to ground a generation, most-relevant first.

dry_run

dry_run(
    script: str,
    *,
    timeout: float = DRYRUN_TIMEOUT_S,
    gmat_root: str | None = None,
) -> DryRunReport

Dry-run a lint-clean GMAT script against gmat-run in a fresh subprocess (decision D12).

The script is loaded with Mission.load (the config tier); when it declares a solver (Target / Optimize) it is also run with mission.run and its Results.converged checked (the execution tier). The verdict is a :class:~gmat_copilot.result.DryRunReport: ok is True when the script loads (and, if a solver is present, runs and converges), and a failure carries one actionable, path-free line distilled from GMAT's own diagnostics.

This is the dynamic tier only — it does not lint. Per decision D12 the dry-run runs only on a lint-clean script, so callers gate it behind :func:gmat_copilot.validate.validate; the repair loop sequences the two.

Parameters:

script (str, required): GMAT mission-script source text (lint-clean; see above).

timeout (float, default DRYRUN_TIMEOUT_S): wall-clock budget in seconds for the subprocess; on expiry the verdict degrades to a "timeout" failure (decision D12 default: 300 s, to bound a runaway solver).

gmat_root (str | None, default None): GMAT install root; defaults to GMAT_ROOT / standard-location discovery (gmat-run's locate_gmat), which runs inside the worker subprocess.

Returns:

DryRunReport: the dry-run verdict as a :class:~gmat_copilot.result.DryRunReport.

Raises:

GmatExtraNotInstalled: when the [gmat] extra (gmat-run) is not importable.

require_gmat_extra

require_gmat_extra() -> None

Raise :class:GmatExtraNotInstalled unless gmat-run is importable.

The eager guard the dynamic tier and its CLI surfaces call before attempting a dry-run, so a missing [gmat] extra is a clear, actionable error rather than an obscure import failure.

draft

draft(
    request: str,
    *,
    model: str | None = None,
    strict: bool = True,
    temperature: float = 0.0,
    max_tokens: int = 2048,
    retriever: Retriever | None = None,
    provider: Provider | None = None,
    repair: int = 0,
    dry_run: bool = False,
    gmat_root: str | None = None,
    dry_run_fn: DryRunFn | None = None,
    cancel: Callable[[], bool] | None = None,
) -> CopilotResult

Generate a GMAT mission .script from a natural-language request.

Orchestrates retrieve → generate → validate, wrapped in a bounded repair loop (decision D13): on a failing draft the failing tier's diagnostics are fed back and the model regenerates, retrying at most repair times and stopping on the first clean/runnable draft or on no-progress / oscillation. Returns a :class:~gmat_copilot.result.CopilotResult for the final draft, with a versioned :class:~gmat_copilot.provenance.Provenance record (request, retrieval, the per-attempt history, and the outcome; decision D14) attached on provenance.

Parameters:

request (str, required): what the script should do, in natural language.

model (str | None, default None): the "provider:model" selector (decision D4: there is no default; selection is always explicit). When provider is supplied this is the bare model name handed to it; otherwise it is resolved with :func:~gmat_copilot.providers.select, which errors and lists the reachable providers when it is None.

strict (bool, default True): reject the final draft if it does not validate clean, by raising :class:DraftRejected once the budget is spent; lint ERROR and WARNING both block (decision D5), and a dry-run failure blocks when the dynamic tier is enabled. Permissive (strict=False) returns the best-effort draft with every diagnostic attached.

temperature (float, default 0.0): sampling temperature passed to the provider.

max_tokens (int, default 2048): maximum number of tokens to generate.

retriever (Retriever | None, default None): corpus retriever used to ground generation; defaults to a :class:~gmat_copilot.rag.Retriever. Retrieval is computed once from request and reused across repair attempts.

provider (Provider | None, default None): model provider used to generate; defaults to the one model selects.

repair (int, default 0): the retry budget for the repair loop (decision D13). 0 (the default) is a single pass, the v0.1 behaviour.

dry_run (bool, default False): enable the dynamic gmat-run dry-run tier (decision D12) in validation; needs the [gmat] extra and a GMAT install. Off by default, keeping generation GMAT-free.

gmat_root (str | None, default None): GMAT install root forwarded to the dry-run (else GMAT_ROOT / discovery).

dry_run_fn (DryRunFn | None, default None): a dynamic-tier dry-run to use in place of the real gmat-run subprocess (the eval's deterministic replay seam, decision D7); None uses the real dry-run.

cancel (Callable[[], bool] | None, default None): an optional predicate polled before each attempt begins and again after the provider returns, before validation (decision D15); when it returns true generation stops with :class:DraftCancelled. The in-flight provider call (and a running dry-run) still completes, since cancelling does not abort an HTTP request mid-flight, but a cancel observed after generation skips the potentially expensive dry-run, so even a single pass (repair=0) is cancellable between its generate and validate phases.

Returns:

CopilotResult: the final draft's script, its lint report (and dry-run verdict), the retrieval trace, provider metadata, aggregate usage, and the provenance record on provenance.

Raises:

DraftCancelled: when cancel returns true before an attempt begins or after the provider returns.

DraftRejected: in strict mode, when the final draft still has blocking diagnostics.

ProviderError: when no model is resolved: either model is None with no provider to apply it to, or :func:~gmat_copilot.providers.select cannot resolve the selector.

ValueError: when repair is negative.
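The control flow described for draft can be sketched as a small loop. This is an illustration under stated assumptions, not the package's implementation: generate, check, and Cancelled are hypothetical stand-ins for the provider call, the validator, and DraftCancelled, the repair prompt layout is invented, and "stop when a draft repeats" is only one plausible reading of the no-progress / oscillation rule.

```python
from typing import Callable, Optional


class Cancelled(RuntimeError):
    """Hypothetical stand-in for DraftCancelled."""


def repair_loop(
    request: str,
    generate: Callable[[str], str],   # stand-in for the provider call
    check: Callable[[str], tuple],    # returns feedback lines; empty means the draft passed
    repair: int = 0,
    cancel: Optional[Callable[[], bool]] = None,
) -> tuple:
    """Return the (script, feedback) history; the last entry is the final draft."""
    if repair < 0:
        raise ValueError("repair must be non-negative")
    seen = set()
    prompt = request
    attempts = []
    for _ in range(repair + 1):
        # First poll: before the attempt begins.
        if cancel is not None and cancel():
            raise Cancelled("cancelled before an attempt began")
        script = generate(prompt)
        # Second poll: after generation, before the (possibly expensive) validation.
        if cancel is not None and cancel():
            raise Cancelled("cancelled after generation, before validation")
        feedback = tuple(check(script))
        attempts.append((script, feedback))
        # Stop on a passing draft, or when the model repeats an earlier draft.
        if not feedback or script in seen:
            break
        seen.add(script)
        # Assumed prompt layout: original request, failing draft, diagnostics.
        prompt = f"{request}\n\nPrevious draft:\n{script}\n\nFix:\n" + "\n".join(feedback)
    return tuple(attempts)
```

With repair=0 the loop body runs once, which is why the second poll is the only place a single pass can be cancelled.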

read_sidecar

read_sidecar(path: str | Path) -> Provenance

Read a .copilot.json sidecar back into a :class:Provenance (the inverse of writing).

write_sidecar

write_sidecar(
    provenance: Provenance, path: str | Path
) -> Path

Write provenance as JSON to path (UTF-8, \n newlines); return the written path.

path is written verbatim — derive the conventional location with :func:sidecar_path.

Result schema

result

The result schema returned by :func:gmat_copilot.draft (decision D10).

One stable contract carries everything a generation request produces: the generated .script text, the lint report, the retrieval trace, and the provider/model/usage that produced it. The provenance field carries the versioned record of how the draft was produced — the request, the retrieved chunks, the draft history, and the outcome (decision D14) — and :meth:CopilotResult.save can serialise it to a .copilot.json sidecar next to the written script.

LintDiagnostic dataclass

One linter finding mapped from a gmat_script diagnostic: rule, severity, and location.

LintReport dataclass

The lint diagnostics for a script, in source order, with severity-filtered views.

The strict/permissive decision lives in the validator (decision D5); this is the raw report.

clean property

clean: bool

True when the linter reported nothing at all.

blocking

blocking(*, strict: bool) -> tuple[LintDiagnostic, ...]

Diagnostics that reject a draft under the given mode (decision D5).

Strict rejects on ERROR and WARNING — every WARNING-level rule is a hard GMAT load error. Permissive never blocks: it returns the best-effort script with all diagnostics attached.
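The strict/permissive gate can be sketched as a simple filter. Diagnostic and Severity below are simplified stand-ins for LintDiagnostic and its severity levels; the INFO level in particular is an assumption, included only to show a severity that never blocks.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; the real linter's set may differ.
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Diagnostic:
    """Stand-in for LintDiagnostic: rule, severity, and location."""
    rule: str
    severity: Severity
    line: int
    column: int


def blocking(diagnostics: tuple, *, strict: bool) -> tuple:
    """Return the diagnostics that reject a draft under the given mode."""
    if not strict:
        return ()  # permissive never blocks
    return tuple(
        d for d in diagnostics
        if d.severity in (Severity.ERROR, Severity.WARNING)
    )
```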

DryRunReport dataclass

The dynamic gmat-run dry-run finding for a script — a separate tier from the lint report.

The dry-run runs only on a lint-clean script (decision D12): Mission.load is the config tier, and mission.run + Results.converged the execution tier, entered only when the script has a solver (Target / Optimize). Dry-run findings do not merge into :class:LintReport — lint diagnostics are precise (rule / severity / line / column) and a dry-run finding is coarser, so it lands here. ok is the blocking signal: a not-ok report rejects in strict mode, just as a blocking lint diagnostic does.

RetrievalChunk dataclass

One corpus chunk surfaced by the retriever, with its source and similarity score.

RetrievalTrace dataclass

The corpus chunks used to ground a generation, most-relevant first.

DraftAttempt dataclass

One iteration of the repair loop (decision D13): a draft and how it validated.

The loop generates a draft, lints it, and — when the dynamic tier is enabled and lint is clean — dry-runs it; feedback is what was fed into the next attempt's repair prompt.

RepairTrace dataclass

The repair loop's per-attempt history and why it stopped (decision D13).

Attached to :attr:CopilotResult.provenance as the substrate the v0.2 provenance sidecar (D14) formalises into a versioned record.

CopilotResult dataclass

Everything a :func:gmat_copilot.draft call produces (decision D10).

save

save(path: str | Path, *, sidecar: bool = False) -> Path

Write the generated :attr:script to path (UTF-8); return the written path.

With sidecar=True also write the provenance record (decision D14) as a .copilot.json file next to the script (<path>.copilot.json). The sidecar is written only on request, never silently; it needs a result from :func:gmat_copilot.draft (whose :attr:provenance is populated).

Raises:

TypeError: when sidecar=True but :attr:provenance is not a populated record (e.g. a hand-built result).

Provenance

provenance

The versioned provenance record and its .copilot.json sidecar (decision D14).

D10 reserved a provenance field on :class:~gmat_copilot.result.CopilotResult; D13 filled it with a :class:~gmat_copilot.result.RepairTrace (the per-attempt draft history). D14 formalises that into a versioned record of the whole generation: the request, the resolved provider / model, the retrieved grounding, the draft history, and the outcome — and serialises it to a .copilot.json sidecar written next to a saved script (e.g. mission.script.copilot.json).

The in-memory :class:Provenance composes the existing dataclasses (it nests the RepairTrace D13 already builds); the on-disk JSON follows D14's flat schema — schema_version, request, provider, model, retrieval, drafts, outcome — and :func:to_json_dict / :func:from_json_dict map between the two. The JSON is stable (sorted keys, a stamped schema_version), so a recorded sidecar diffs cleanly, and it carries no credentials: the record only ever holds the request, the provider / model names, the retrieval trace, the drafts, and token usage — there is no field a key could enter through.

Outcome dataclass

Which draft won and how the run ended (decision D14).

winner indexes the final (returned) draft in :attr:Provenance.repair's attempts — always the last attempt, recorded explicitly so the sidecar is self-describing. passed is whether that draft validated clean (lint, plus the dry-run when it ran); strict records the active mode, so a reader can tell a strict rejection (passed=False, strict=True) from a permissive best-effort return (passed=False, strict=False). usage is the aggregate token total across every attempt.
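How a sidecar reader might tell the three end states apart from passed and strict alone, as a minimal sketch. OutcomeSketch carries only the two flags discussed here; it is not the real dataclass, and describe is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeSketch:
    """Only the two flags discussed above; the real Outcome has more fields."""
    passed: bool
    strict: bool


def describe(outcome: OutcomeSketch) -> str:
    """Classify how a run ended from the recorded flags."""
    if outcome.passed:
        return "validated clean"
    if outcome.strict:
        return "strict rejection"
    return "permissive best-effort return"
```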

Provenance dataclass

A versioned record of how a draft was produced (decision D14).

Populated in memory by :func:gmat_copilot.draft for every run — it is the trace the result already holds — and serialised to a .copilot.json sidecar only on request by the saving surface (:meth:gmat_copilot.CopilotResult.save). Composes the existing :class:~gmat_copilot.result.RetrievalTrace and :class:~gmat_copilot.result.RepairTrace; the per-attempt draft history is :attr:repair's attempts.

to_json_dict

to_json_dict(provenance: Provenance) -> dict[str, Any]

Render provenance to D14's flat JSON shape (the draft history flattened to drafts).

from_json_dict

from_json_dict(data: dict[str, Any]) -> Provenance

Reconstruct a :class:Provenance from D14's JSON shape, checking the schema version.

Raises:

ValueError: when schema_version is absent or not :data:SCHEMA_VERSION (a newer sidecar than this reader understands).
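The version check can be sketched as a guard that runs before any other field is read. The value of SCHEMA_VERSION here is a placeholder, and check_schema_version is a hypothetical helper name, not part of the package.

```python
SCHEMA_VERSION = 1  # placeholder value; the real constant lives in the provenance module


def check_schema_version(data: dict) -> None:
    """Raise ValueError unless data carries the schema version this reader understands."""
    version = data.get("schema_version")
    if version is None:
        raise ValueError("sidecar has no schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(
            f"unsupported schema_version {version!r}; this reader understands {SCHEMA_VERSION}"
        )
```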

dumps

dumps(provenance: Provenance) -> str

Serialise provenance to stable JSON text — sorted keys, indented, trailing newline.

Sorted keys make a recorded sidecar diff cleanly run to run; ensure_ascii=False keeps any Unicode in the script or feedback readable (the sidecar is a UTF-8 file, not console output).
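The described output shape corresponds to standard json.dumps options. A minimal sketch on a plain dict; the indent width of 2 is an assumption, since the text only says "indented".

```python
import json


def dumps_sketch(record: dict) -> str:
    """Stable JSON text: sorted keys, indented, Unicode kept readable, trailing newline."""
    return json.dumps(record, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Because keys are sorted, two records with the same content serialise identically regardless of insertion order, which is what keeps sidecar diffs small.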

sidecar_path

sidecar_path(script_path: str | Path) -> Path

The sidecar path for a saved script: <script> -> <script>.copilot.json (D14).
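The naming rule appends a suffix to the full file name rather than replacing the extension. A sketch with pathlib; sidecar_path_sketch is an illustrative re-implementation, not the package's code.

```python
from pathlib import Path


def sidecar_path_sketch(script_path) -> Path:
    """Map <script> to <script>.copilot.json, keeping the original extension."""
    p = Path(script_path)
    return p.with_name(p.name + ".copilot.json")
```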

write_sidecar

write_sidecar(
    provenance: Provenance, path: str | Path
) -> Path

Write provenance as JSON to path (UTF-8, \n newlines); return the written path.

path is written verbatim — derive the conventional location with :func:sidecar_path.

read_sidecar

read_sidecar(path: str | Path) -> Provenance

Read a .copilot.json sidecar back into a :class:Provenance (the inverse of writing).

Validation

validate

The lint validation gate — the v0.1 validator (decision D5).

Generated scripts are checked with the gmat-script static linter: GMAT-free, instant, and deterministic. Strict mode rejects on lint ERROR and WARNING (every WARNING-level rule is a hard GMAT load error); permissive mode returns the best-effort script with all diagnostics attached.

The dynamic gmat-run dry-run and the repair loop are a later, gated capability behind the [gmat] extra; this module is the GMAT-free tier.

validate

validate(
    script: str, *, target_version: str | None = None
) -> LintReport

Lint script and return a :class:~gmat_copilot.result.LintReport.

Parameters:

script (str, required): GMAT mission-script source text.

target_version (str | None, default None): GMAT catalogue version to lint against; defaults to the newest shipped catalogue.

Returns:

LintReport: the diagnostics in source order. Use :meth:LintReport.blocking to apply the strict/permissive gate.

Dry-run

dryrun

The dynamic gmat-run dry-run tier — the optional [gmat] half of validation (decision D12).

Where :mod:gmat_copilot.validate is the static, GMAT-free, instant lint gate (decision D5), this is the dynamic backstop: it drives GMAT's own loader and engine over a lint-clean script to catch the defects a tree-sitter parse cannot — bad numerics, malformed epochs, missing data files, the undeclared-reference case the linter is too conservative to flag, and solver non-convergence.

The dry-run is tiered (decision D12): Mission.load is the config tier; mission.run + Results.converged is the execution tier, entered only when the script has a solver (Target / Optimize), because a script can load and run yet leave a solver unconverged. Each dry-run runs in a fresh subprocess — gmatpy holds one process-global Moderator and cannot re-bootstrap in a single interpreter — so a crash or timeout degrades to a failure verdict rather than taking down the caller, and a repair loop can dry-run several drafts back to back.

The runner imports gmat-run only inside the worker subprocess (:mod:gmat_copilot._dryrun_worker); this module stays import-safe with the [gmat] extra absent, raising :class:GmatExtraNotInstalled only when :func:dry_run is actually called without it. The static lint gate and all of generation remain GMAT-free.

GmatExtraNotInstalled

Bases: RuntimeError

The dry-run was called without the optional [gmat] extra (or without a GMAT install).

Generation and the static lint gate are GMAT-free; only the dynamic dry-run needs gmat-run and a GMAT install. This is raised eagerly by :func:dry_run so a missing extra is a clear, actionable error rather than an obscure import failure inside a subprocess.

strip_paths

strip_paths(text: str) -> str

Replace any absolute POSIX path (/dir/.../name) with its basename (name).
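One way to get this behaviour is a single regular-expression substitution. The pattern below is an assumption about what counts as an absolute path (a leading slash not preceded by a word character, dot, or slash, followed by at least one directory component); the real function may draw the line differently.

```python
import re

# A "/" that does not continue a relative path, then one or more "dir/" parts,
# then the final name, which is captured and kept.
_ABS_PATH = re.compile(r"(?<![\w./])/(?:[^\s/]+/)+([^\s/]+)")


def strip_paths_sketch(text: str) -> str:
    """Replace each absolute POSIX path in text with its basename."""
    return _ABS_PATH.sub(r"\1", text)
```

The lookbehind keeps relative forms such as a/b/c untouched, so only paths that could leak a local directory layout are rewritten.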

extract_feedback_line

extract_feedback_line(raw: str) -> str

Return one actionable line from a raw GMAT log or gmat-run error string.

Prefers the first substantive ERROR / Interpreter-Exception line; falls back to the first WARNING, then to the first non-blank line. Strips the script-path prefix, the sequence-number prefix, and the trailing "in line:" noise GMAT appends, and sanitises any absolute path to its basename.
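The preference order alone can be sketched as follows. This simplified version matches on plain substrings, which is an assumption about how GMAT marks its lines, and it leaves out the prefix stripping and path sanitising the real function performs.

```python
def _first_with(lines, markers):
    """Return the first line containing any marker, or None."""
    return next((ln for ln in lines if any(m in ln for m in markers)), None)


def first_actionable_line(raw: str) -> str:
    """Pick one line: an error first, then a warning, then any non-blank line."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    return (
        _first_with(lines, ("ERROR", "Interpreter Exception"))
        or _first_with(lines, ("WARNING",))
        or (lines[0] if lines else "")
    )
```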

require_gmat_extra

require_gmat_extra() -> None

Raise :class:GmatExtraNotInstalled unless gmat-run is importable.

The eager guard the dynamic tier and its CLI surfaces call before attempting a dry-run, so a missing [gmat] extra is a clear, actionable error rather than an obscure import failure.
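An eager guard of this kind can be written with importlib.util.find_spec, which reports whether a top-level module is importable without importing it. ExtraNotInstalled and require_module are hypothetical names; the module name the real guard probes is not stated here, so it is left as a parameter.

```python
import importlib.util


class ExtraNotInstalled(RuntimeError):
    """Stand-in for GmatExtraNotInstalled."""


def require_module(name: str, extra: str) -> None:
    """Raise a clear error unless the top-level module `name` is importable."""
    if importlib.util.find_spec(name) is None:
        raise ExtraNotInstalled(
            f"{name!r} is not importable; install the [{extra}] extra"
        )
```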

dry_run

dry_run(
    script: str,
    *,
    timeout: float = DRYRUN_TIMEOUT_S,
    gmat_root: str | None = None,
) -> DryRunReport

Dry-run a lint-clean GMAT script against gmat-run in a fresh subprocess (decision D12).

The script is loaded with Mission.load (the config tier); when it declares a solver (Target / Optimize) it is also run with mission.run and its Results.converged checked (the execution tier). The verdict is a :class:~gmat_copilot.result.DryRunReport: ok is True when the script loads (and, if a solver is present, runs and converges), and a failure carries one actionable, path-free line distilled from GMAT's own diagnostics.

This is the dynamic tier only — it does not lint. Per decision D12 the dry-run runs only on a lint-clean script, so callers gate it behind :func:gmat_copilot.validate.validate; the repair loop sequences the two.

Parameters:

script (str, required): GMAT mission-script source text (lint-clean; see above).

timeout (float, default DRYRUN_TIMEOUT_S): wall-clock budget in seconds for the subprocess; on expiry the verdict degrades to a "timeout" failure (decision D12 default: 300 s, to bound a runaway solver).

gmat_root (str | None, default None): GMAT install root; defaults to GMAT_ROOT / standard-location discovery (gmat-run's locate_gmat), which runs inside the worker subprocess.

Returns:

DryRunReport: the dry-run verdict as a :class:~gmat_copilot.result.DryRunReport.

Raises:

GmatExtraNotInstalled: when the [gmat] extra (gmat-run) is not importable.
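The isolation pattern (a fresh interpreter per run, with a crash or timeout degraded to a failure verdict) can be sketched with subprocess.run. Report and run_isolated are illustrative names, the worker here just executes a code string, and the field names are assumptions rather than DryRunReport's actual fields.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Stand-in for DryRunReport; field names are illustrative."""
    ok: bool
    message: str = ""


def run_isolated(code: str, *, timeout: float) -> Report:
    """Run code in a fresh Python subprocess and never raise on worker failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when the budget expires
        )
    except subprocess.TimeoutExpired:
        return Report(ok=False, message="timeout")
    if proc.returncode != 0:
        # Keep only the last stderr line as a one-line finding.
        lines = proc.stderr.strip().splitlines()
        return Report(ok=False, message=lines[-1] if lines else "worker failed")
    return Report(ok=True)
```

Because every call starts a new process, several drafts can be checked back to back without any state carried over from a previous run.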

Repair loop

repair

The bounded repair loop's building blocks (decision D13).

The loop itself lives in :func:gmat_copilot.generate.draft (it owns prompt construction and the provider call); this module supplies the pieces it needs: a combined lint-then-dry-run :func:evaluate, the :func:build_repair_prompt that feeds a failing draft's diagnostics back to the model, and small helpers for usage aggregation and the no-progress / oscillation hash check.

Validation is lint-first (decision D13): lint is precise and free, so a lint failure is reported without paying for the dry-run; the dry-run's one-line finding is the backstop for the lint-clean-but-unrunnable drafts the loop exists for. evaluate calls the dynamic tier only when it is enabled and the draft is lint-clean, so the GMAT-free path never touches gmat-run.

Verdict dataclass

The outcome of validating one draft: pass/fail plus the diagnostics to feed forward.

evaluate

evaluate(
    script: str,
    *,
    dry_run: bool,
    gmat_root: str | None = None,
    timeout: float = 300.0,
    dry_run_fn: DryRunFn | None = None,
) -> Verdict

Validate script lint-first, then (if enabled and lint-clean) dry-run it (decision D13).

A draft passes when it is lint-clean — no ERROR and no WARNING, every WARNING being a hard GMAT load error (decision D5) — and, when dry_run is enabled, the dynamic tier is ok. On failure the verdict carries the failing tier's diagnostics as feedback for the next attempt.

Parameters:

script (str, required): the draft to validate.

dry_run (bool, required): whether to run the dynamic gmat-run tier on a lint-clean draft.

gmat_root (str | None, default None): GMAT install root forwarded to the dry-run (else GMAT_ROOT / discovery).

timeout (float, default 300.0): wall-clock budget forwarded to the dry-run.

dry_run_fn (DryRunFn | None, default None): a dynamic-tier dry-run to use in place of the real gmat-run subprocess; the replay seam the recorded eval drives the loop with (decision D7). None uses the real :func:gmat_copilot.dryrun.dry_run.
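The lint-first ordering can be sketched with two injected callables. This is a simplification: the sketch folds the dry_run flag and dry_run_fn into one optional argument, and lint and the (ok, message) return shape are hypothetical, not the real Verdict.

```python
from typing import Callable, Optional


def evaluate_sketch(
    script: str,
    lint: Callable[[str], tuple],                         # returns blocking lint messages
    dry_run_fn: Optional[Callable[[str], tuple]] = None,  # returns (ok, message); None disables the tier
) -> tuple:
    """Return (passed, feedback) for one draft."""
    lint_feedback = tuple(lint(script))
    if lint_feedback:
        # Lint failed: report it and never pay for the dry-run.
        return False, lint_feedback
    if dry_run_fn is None:
        return True, ()
    ok, message = dry_run_fn(script)
    return (True, ()) if ok else (False, (message,))
```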

build_repair_prompt

build_repair_prompt(
    request: str,
    prev_script: str,
    feedback: tuple[str, ...],
) -> str

The repair request: the original intent + the failing draft + the diagnostics to fix.

The result is a new request string fed back through the normal generation prompt (system framing, retrieval grounding, output contract), so a repair attempt is an ordinary generation that additionally sees the prior attempt and why it failed.

draft_hash

draft_hash(script: str) -> str

A stable content hash of a draft, for the no-progress / oscillation stop conditions.
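A content hash makes both stop conditions cheap to check against the attempt history. The sketch assumes SHA-256 and assumes that "no-progress" means the draft equals the previous one while "oscillation" means it equals an earlier one; neither detail is confirmed by the text above, and stop_reason is a hypothetical helper.

```python
import hashlib


def draft_hash_sketch(script: str) -> str:
    """A stable content hash (SHA-256 here, as an assumption)."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


def stop_reason(history: list, script: str):
    """Compare a new draft against the hashes of earlier attempts."""
    digest = draft_hash_sketch(script)
    if history and history[-1] == digest:
        return "no-progress"   # identical to the previous draft
    if digest in history:
        return "oscillation"   # identical to an older draft
    return None
```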

aggregate_usage

aggregate_usage(
    attempts: tuple[DraftAttempt, ...],
) -> dict[str, int]

Sum the per-attempt token usage across the loop, key by key.
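Key-by-key summation maps directly onto collections.Counter. The sketch takes plain usage dicts rather than DraftAttempt objects, and the key names in the usage note are hypothetical.

```python
from collections import Counter


def aggregate_usage_sketch(usages) -> dict:
    """Sum token counts across attempts; keys missing from an attempt count as zero."""
    total = Counter()
    for usage in usages:
        total.update(usage)  # Counter.update adds counts instead of replacing them
    return dict(total)
```

For example, aggregating {"input_tokens": 10, "output_tokens": 5} and {"input_tokens": 3} gives an input_tokens total of 13 and leaves output_tokens at 5.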

Providers

providers

The model-agnostic provider abstraction (decisions D4, D7).

One thin :class:Provider protocol with four real adapters (Anthropic, OpenAI, Ollama, GitHub Models) plus a :class:RecordedProvider that replays committed fixtures for deterministic, zero-quota CI. There is no default model: selection is explicit ("provider:model"); with none given, :func:select errors and lists the providers it can reach from configured credentials — it never auto-picks or recommends one. Credentials come from the environment, never committed.

Each real adapter's complete performs the provider call through its optional extra ([anthropic] / [openai] / [ollama]; GitHub Models needs none) and raises a clear, actionable error when that extra is not installed or the credential is absent. The protocol, credential discovery, no-default selection, and the recorded replay path round out the surface; :class:RecordingProvider captures live completions into the fixture shape the recorded path replays.

Completion dataclass

A single provider completion: the text plus the provider/model/usage that produced it.

ProviderError

Bases: RuntimeError

A provider could not satisfy a request (missing credential, unreachable, or unknown).

Provider

Bases: Protocol

The contract every adapter satisfies.

reachable

reachable() -> bool

Whether a call could succeed now — i.e. a credential/host is configured.

complete

complete(
    prompt: str,
    *,
    model: str,
    temperature: float = 0.0,
    max_tokens: int = 1024,
) -> Completion

Generate a completion for prompt with model.
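Because the contract is a Protocol, any class with matching methods qualifies without inheriting from it. A sketch with a toy adapter; CompletionSketch, ProviderSketch, and EchoProvider are illustrative, and the completion's field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class CompletionSketch:
    """Illustrative fields; the real Completion may differ."""
    text: str
    provider: str
    model: str
    usage: dict = field(default_factory=dict)


@runtime_checkable
class ProviderSketch(Protocol):
    def reachable(self) -> bool: ...

    def complete(
        self, prompt: str, *, model: str,
        temperature: float = 0.0, max_tokens: int = 1024,
    ) -> CompletionSketch: ...


class EchoProvider:
    """A toy adapter: no network, it echoes the prompt back."""

    def reachable(self) -> bool:
        return True

    def complete(self, prompt, *, model, temperature=0.0, max_tokens=1024):
        return CompletionSketch(text=prompt, provider="echo", model=model)
```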

AnthropicProvider

A Claude model via the user's ANTHROPIC_API_KEY (the [anthropic] extra).

OpenAIProvider

An OpenAI model with the user's key (OPENAI_API_KEY). Needs the [openai] extra.

OllamaProvider

A local Ollama server (OLLAMA_HOST, default http://localhost:11434).

GitHubModelsProvider

GitHub Models (OpenAI-compatible), authenticated with GH_TOKEN / MODELS_PAT.

The free-tier path the eval and CI use; no provider SDK required.

RecordedProvider

Replays committed fixtures keyed by (provider, model, prompt) — fully deterministic.

The CI inference path (decision D7): zero model calls, zero quota. A fixture records whatever real provider produced it; the replay reports provider == "recorded".

RecordingProvider

Wraps a real provider and records every completion as a replayable fixture (D7 record mode).

A drop-in :class:Provider: :meth:complete delegates to the wrapped provider and stores the result keyed by (provider, model, prompt) in the shape :class:RecordedProvider and the eval bundle replay. :meth:save writes the accumulated fixtures to disk, merging with any already there — the record mode that captures new fixtures for the deterministic CI path.

save

save(path: str | Path) -> Path

Write the recorded fixtures to path as JSON, merging with any already present.

prompt_key

prompt_key(provider: str, model: str, prompt: str) -> str

The deterministic fixture key for a (provider, model, prompt) triple.
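One way to derive such a key is to hash an unambiguous encoding of the triple. Both the JSON-list encoding and SHA-256 are assumptions made for this sketch; the real key format is not documented here.

```python
import hashlib
import json


def prompt_key_sketch(provider: str, model: str, prompt: str) -> str:
    """Hash a JSON list of the triple so field boundaries cannot be confused."""
    payload = json.dumps([provider, model, prompt], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Replay is then a dictionary lookup on the same key.
fixtures = {prompt_key_sketch("echo", "m1", "hello"): "recorded text"}
replayed = fixtures[prompt_key_sketch("echo", "m1", "hello")]
```

Encoding the triple as a list, rather than joining with a separator, avoids collisions when a model name itself contains the separator.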

reachable_providers

reachable_providers() -> list[str]

The real providers reachable now from configured credentials, in registry order.

select

select(spec: str | None) -> tuple[Provider, str]

Resolve a "provider:model" spec to a (provider, model) pair — no default (D4).

Raises:

ProviderError: if spec is missing (lists the reachable providers), malformed, or names an unknown provider.
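The three failure cases can be sketched with a parser that splits on the first colon only. SpecError, KNOWN, and parse_spec are hypothetical; the registry names are invented for illustration, and the sketch returns the provider name rather than a Provider instance.

```python
class SpecError(RuntimeError):
    """Stand-in for ProviderError."""


KNOWN = ("anthropic", "openai", "ollama", "github")  # hypothetical registry names


def parse_spec(spec, reachable=()):
    """Split "provider:model" into its parts; there is no default to fall back on."""
    if spec is None:
        listed = ", ".join(reachable) or "none"
        raise SpecError(f"no model selected; reachable providers: {listed}")
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise SpecError(f"malformed spec {spec!r}; expected 'provider:model'")
    if provider not in KNOWN:
        raise SpecError(f"unknown provider {provider!r}")
    return provider, model
```

Splitting on the first colon keeps model names that contain a colon of their own intact, for example a tag suffix after the model name.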

Retrieval

rag

Corpus ingest, the FAISS index, and the retriever (decisions D2, D3).

Retrieval grounds generation in the GMAT help pages, the stock sample scripts, the GmatFunctions, the gmat-script field catalogue, and a curated domain-notes tier. The corpus text and a prebuilt index for the default embedder are extracted by maintainers at build time (:mod:.build) and shipped with the package; the runtime :func:.load_corpus loads them with no GMAT install and no network, rebuilding the index on first use only as a fallback. sentence-transformers / faiss are imported lazily so importing the package stays light.

BgeEmbedder

The default :class:Embedder: a lazily-loaded BGE sentence-transformer.

sentence_transformers is imported on first use, not at construction, so neither importing the package nor loading the shipped index (built for this model) pays the model-load cost.

Embedder

Bases: Protocol

Embeds passages and queries into a shared vector space.

Implementations normalise their output so a flat inner-product index measures cosine similarity.
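Why normalisation matters: for unit-length vectors the inner product equals cosine similarity. A pure-Python check, assuming non-zero input vectors; the helper names are illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
same = math.isclose(dot(normalise(a), normalise(b)), cosine(a, b))
```

This is why a flat inner-product index over normalised embeddings ranks by cosine similarity with no extra division at query time.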

dim property

dim: int

The embedding dimension.

encode

encode(
    texts: Sequence[str], *, is_query: bool = False
) -> NDArray[float32]

Embed texts, returning one row per text. Set is_query for retrieval queries.

CorpusIndex

The loaded corpus: the chunks, the FAISS index, and the embedder they were built for.

search

search(
    query: str, *, embedder: Embedder, k: int = 8
) -> list[SearchHit]

Embed query and return the k most similar chunks, most relevant first.

embedder must be the one the index was built for (the same model that loaded or rebuilt it), so the query lands in the same vector space.

SearchHit dataclass

A retrieved corpus chunk and its similarity score (higher is closer).

Retriever

Embeds a query and returns the most relevant corpus chunks (decision D2).

Loads the shipped corpus and prebuilt index for the default embedder (rebuilding on first use only as a fallback for a non-default embedder or a corpus change), then runs a top-k search and trims the result to a token budget. The corpus and model load lazily on the first :meth:retrieve, so constructing a Retriever is cheap.

retrieve

retrieve(
    query: str, *, top_k: int | None = None
) -> RetrievalTrace

Return the corpus chunks that ground query, most relevant first.

Trims the ranked hits to :attr:token_budget, keeping whole chunks in rank order and always retaining at least the top hit. To curb hallucinated command syntax, it also pins one worked-example chunk — the best-ranked passage that actually shows a mission sequence — into the grounding even when pure relevance ranks it below the kept set (it commonly does for setup-heavy queries, which retrieve resource definitions but no command syntax). The returned trace is exactly the set :func:assemble_context formats into the grounding block.
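The trimming and pinning rules can be sketched on a ranked list with precomputed token counts. Several details are assumptions of this sketch: trimming stops at the first chunk that would overflow the budget, the pinned example is appended last, and the pin is allowed to exceed the budget. Hit and trim_to_budget are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Stand-in for a ranked chunk with a precomputed token count."""
    text: str
    tokens: int
    is_example: bool = False  # whether the passage shows a mission sequence


def trim_to_budget(ranked, budget):
    """Keep whole chunks in rank order within budget, then pin one worked example."""
    kept, used = [], 0
    for i, hit in enumerate(ranked):
        # The top hit (i == 0) is always kept, even if it alone exceeds the budget.
        if i > 0 and used + hit.tokens > budget:
            break
        kept.append(hit)
        used += hit.tokens
    if not any(h.is_example for h in kept):
        # Pin the best-ranked worked example that relevance alone left out.
        example = next((h for h in ranked if h.is_example), None)
        if example is not None:
            kept.append(example)
    return kept
```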

CorpusChunk dataclass

One retrieval passage with its source provenance.

Parameters:

  • text (str, required): the passage embedded and returned as grounding.
  • kind (ChunkKind, required): the corpus tier the passage belongs to.
  • origin (str, required): the source identifier, that is, a help-page name, sample/GmatFunction file name, catalogue type name, or domain-note name.
  • section (str, default ''): a finer locator within the origin, either a help field-section heading or a sample %---- banner label; empty for whole-file tiers.

load_corpus

load_corpus(
    embedder: Embedder | None = None,
    *,
    corpus_dir: Path | None = None,
) -> CorpusIndex

Load the shipped corpus and its index (decision D2).

Parameters:

  • embedder (Embedder | None, default None): the embedder retrieval will use. None selects the default the index was built for, so the prebuilt index is loaded directly. A non-default embedder triggers a one-time fallback rebuild, cached under the XDG cache directory.
  • corpus_dir (Path | None, default None): the corpus directory; defaults to the shipped package data.

assemble_context

assemble_context(trace: RetrievalTrace) -> str

Format a retrieval trace into a bounded, source-attributed grounding block.

Each chunk is rendered under its source label so generation (and a reader of the result) can see where the grounding came from. Empty when the trace has no chunks.
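
As a rough illustration of source-attributed rendering, the sketch below formats chunks under a label built from origin and section. The Chunk class, the format_grounding name, and the bracketed label layout are all assumptions made for this example; the actual block format is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical subset of CorpusChunk: only the fields the label uses."""
    text: str
    origin: str
    section: str = ""


def format_grounding(chunks: list[Chunk]) -> str:
    """Render each chunk under its source label; no chunks gives ''."""
    blocks = []
    for chunk in chunks:
        # Label layout is an assumption: "origin" or "origin / section".
        label = chunk.origin if not chunk.section else f"{chunk.origin} / {chunk.section}"
        blocks.append(f"[{label}]\n{chunk.text}")
    return "\n\n".join(blocks)


block = format_grounding([
    Chunk("Create Spacecraft Sat;", "Spacecraft", "Fields"),
    Chunk("Propagate Prop(Sat) {Sat.ElapsedDays = 1};", "Ex_HohmannTransfer.script"),
])
```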

Evaluation

eval

The evaluation suite: prompt set, structural scorer, LLM judge, and scorer (D6/D7).

BoardRow dataclass

One provider:model entry: public anchor, held-out headline, and close-the-loop cell.

overfit_gap property

overfit_gap: float | None

public - held_out — the overfit tell; None until the held-out has been scored.

LeaderboardError

Bases: Exception

A board failed an integrity check (a leak, or a non-reproducing row).

DraftScore dataclass

One draft's close-the-loop score: the two static layers plus the dynamic dry-run verdict.

static_pass property

static_pass: bool

What the v0.1 static eval accepts: structurally clean and judged intent-correct.

runnable property

runnable: bool

The full close-the-loop pass: static-accepted and the dry-run is ok.

LiftReport dataclass

The close-the-loop outcomes and the per-tier dry-run-agreement and repair-lift aggregates.

base_runnable_by_tier property

base_runnable_by_tier: dict[str, float]

The close-the-loop pass-rate per tier at repair = 0 (the single-pass baseline).

repaired_runnable_by_tier property

repaired_runnable_by_tier: dict[str, float]

The close-the-loop pass-rate per tier at the repair budget.

lift_by_tier property

lift_by_tier: dict[str, float]

The repair-loop lift per tier: repaired pass-rate minus base pass-rate (decision D13).
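
The arithmetic behind the per-tier figures is small enough to show directly. The rows and tier names below are made up for illustration, and rate_by_tier is a hypothetical helper rather than a library function.

```python
from collections import defaultdict

# Hypothetical rows: (tier, runnable at repair = 0, runnable at the budget).
rows = [
    ("easy", True, True),
    ("easy", False, True),
    ("hard", False, False),
    ("hard", False, True),
]


def rate_by_tier(rows, column: int) -> dict[str, float]:
    """Fraction of True values in one column, grouped by tier."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in rows:
        total[row[0]] += 1
        passed[row[0]] += bool(row[column])
    return {tier: passed[tier] / total[tier] for tier in total}


base = rate_by_tier(rows, 1)        # single-pass baseline
repaired = rate_by_tier(rows, 2)    # after the bounded repair loop
lift = {tier: repaired[tier] - base[tier] for tier in base}
```

In this toy data the repair loop lifts both tiers by 0.5: easy goes from 0.5 to 1.0 and hard from 0.0 to 0.5.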

dry_run_agreement_by_tier property

dry_run_agreement_by_tier: dict[str, float | None]

Per tier, the fraction of statically-accepted base drafts that also pass the dry-run.

The denominator is the drafts the v0.1 static eval would pass (structural ∧ judge) at repair = 0; the numerator, those whose dry-run is also ok. None for a tier with no statically-accepted draft to compare (the agreement is undefined, not 0). The shortfall below 1.0 is the static-vs-dynamic gap the dry-run tier exists to surface (decision D12).
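
A minimal sketch of that ratio, including the undefined case, might look like the following. The draft tuples and the agreement_by_tier helper are hypothetical; only the denominator and numerator rules come from the description above.

```python
from typing import Optional

# Hypothetical base drafts: (tier, statically accepted, dry-run ok) at repair = 0.
drafts = [
    ("easy", True, True),
    ("easy", True, False),
    ("easy", False, False),
    ("hard", False, True),
]


def agreement_by_tier(drafts) -> dict[str, Optional[float]]:
    """Share of statically-accepted drafts whose dry-run is also ok, per tier."""
    accepted: dict[str, int] = {}
    agreed: dict[str, int] = {}
    for tier, static_ok, dry_ok in drafts:
        accepted.setdefault(tier, 0)
        agreed.setdefault(tier, 0)
        if static_ok:  # only statically-accepted drafts enter the denominator
            accepted[tier] += 1
            agreed[tier] += int(dry_ok)
    # A tier with an empty denominator is undefined (None), not 0.
    return {
        tier: (agreed[tier] / accepted[tier]) if accepted[tier] else None
        for tier in accepted
    }


agreement = agreement_by_tier(drafts)
```

The hard tier has no statically-accepted draft, so its agreement is None even though one of its drafts happened to pass the dry-run.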

base_runnable property

base_runnable: float

The overall close-the-loop pass-rate at repair = 0.

repaired_runnable property

repaired_runnable: float

The overall close-the-loop pass-rate at the repair budget.

lift property

lift: float

The overall repair-loop lift: repaired minus base pass-rate.

LiftRow dataclass

One prompt's close-the-loop outcome at repair = 0 (base) and the budget (repaired).

base is the single-pass (v0.1) draft; repaired is the draft the bounded loop converged to.

RecordedDryRun dataclass

A :data:~gmat_copilot.repair.DryRunFn that replays recorded verdicts keyed by draft hash.

The deterministic, GMAT-free dynamic tier the recorded close-the-loop eval drives the real loop with (decision D7): :func:gmat_copilot.draft calls it instead of the gmat-run subprocess.

EvalPrompt dataclass

One eval prompt: the request sent to the model, its intent, and its structural spec.

StructuralSpec dataclass

What the deterministic structural layer asserts about a candidate script.

EvalReport dataclass

The outcomes for an eval run and the aggregate pass-rate.

pass_rate_by_tier property

pass_rate_by_tier: dict[str, float]

The pass-rate within each difficulty tier (decision D6 aggregates per tier).

PromptOutcome dataclass

The scored outcome for one prompt: structural, judge, and combined verdicts.

StructuralResult dataclass

The structural verdict for one candidate script and the specific checks it failed.

judge_verdicts

judge_verdicts(
    intent: str,
    script: str,
    *,
    model: str = JUDGE_MODEL,
    n: int = 3,
    provider: Provider | None = None,
    pace: float = 0.0,
) -> list[bool | None]

Run the judge n times and return the raw per-run verdicts (decision D6).

The list of verdicts is what the recorded bundle freezes; :func:majority reduces it to the gate decision. pace seconds are slept between calls to respect the free-tier per-minute budget when recording live; unit tests leave it at 0.

Parameters:

  • provider (Provider | None, default None): the model provider; defaults to a :class:~gmat_copilot.providers.GitHubModelsProvider (the free-tier path the judge is specified against, decision D7).

majority

majority(verdicts: Sequence[bool | None]) -> bool | None

Majority vote over verdicts, ignoring None; FAIL on a tie (decision D6).

Returns:

  • bool | None: the majority boolean, or None if there are no non-None verdicts.
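
The voting rule can be sketched as follows. This is an illustrative reimplementation under the stated rules (None is ignored, a tie fails, no votes yields None); majority_vote is a hypothetical name and the library's own code may differ.

```python
from collections.abc import Sequence
from typing import Optional


def majority_vote(verdicts: Sequence[Optional[bool]]) -> Optional[bool]:
    """Majority over the non-None verdicts; a tie counts as a fail."""
    votes = [v for v in verdicts if v is not None]
    if not votes:
        return None  # nothing readable to vote on
    passes = sum(votes)
    fails = len(votes) - passes
    return passes > fails  # strict majority, so a tie returns False


decision = majority_vote([True, None, True, False])
```

Because None verdicts are dropped before counting, an unreadable judge run shrinks the electorate rather than counting as a fail.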

parse_verdict

parse_verdict(content: str) -> bool | None

Extract the binary verdict from a judge completion.

Prefers the constrained {"satisfies_intent": bool} object; falls back to an unambiguous bare true/false in the prose. Returns None when the verdict cannot be read — a None is dropped by :func:majority, never counted as a vote.
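
One way to read that two-step parse is sketched below. It is not the library's parser: read_verdict is a hypothetical name, and "unambiguous" is interpreted here, as an assumption, to mean that only one of the words true/false occurs in the text.

```python
import json
import re
from typing import Optional


def read_verdict(content: str) -> Optional[bool]:
    """Return the judge's binary verdict, or None when it cannot be read."""
    # Preferred path: the constrained {"satisfies_intent": bool} object.
    try:
        obj = json.loads(content)
    except json.JSONDecodeError:
        obj = None
    if isinstance(obj, dict) and isinstance(obj.get("satisfies_intent"), bool):
        return obj["satisfies_intent"]
    # Fallback: a bare true/false, accepted only if a single value occurs.
    words = {w.lower() for w in re.findall(r"\b(true|false)\b", content, re.IGNORECASE)}
    if len(words) == 1:
        return words.pop() == "true"
    return None


verdict = read_verdict('{"satisfies_intent": false}')
```

A completion that mentions both words, such as "true or false?", yields None in this sketch and would therefore be dropped from the vote.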

build_from_config

build_from_config(
    config: dict[str, Any],
    *,
    root: Path,
    generated_at: str,
    tool_version: str,
    held_out_root: Path | None = None,
) -> tuple[dict[str, Any], list[str]]

Build a board from a seeds config, resolving bundle paths under root.

Returns the board payload and a list of human-readable notes for seeds that were skipped (their public bundle was absent or did not carry the model — e.g. a live-only seed run offline). A seed is scored against its held-out bundle only when held_out_root is given and that bundle exists, so offline and in per-merge CI every held-out cell is pending. Held-out golds are never committed: the held-out bundles live under held_out_root (a gitignored cache fetched in gated CI), never in the repo tree.

build_leaderboard

build_leaderboard(
    rows: Iterable[BoardRow],
    *,
    eval_protocol_version: str,
    generated_at: str,
    judge_model: str,
    public_set: dict[str, Any],
    held_out_set: dict[str, Any],
) -> dict[str, Any]

Assemble a ranked leaderboard.json payload — ranked on the held-out headline (D16).

generated_at is injected (never read from the clock) so the payload is byte-deterministic. The public anchor is shown alongside the headline; a pending held-out sorts last.

dumps

dumps(board: dict[str, Any]) -> str

Canonical board bytes: sorted-key, 2-space JSON with a trailing newline (stable diffs).

score_entry

score_entry(
    model: str,
    *,
    public_bundle: str | Path,
    tool_version: str,
    held_out_bundle: str | Path | None = None,
    lift_bundle: str | Path | None = None,
    provider: str = "recorded",
    kind: str = "seed",
    n_votes: int = 3,
    judge_model: str = JUDGE_MODEL,
    submitted_by: str = "seed",
    verified: bool = True,
) -> BoardRow

Score one provider:model into a :class:BoardRow through the shipped recorded scorer.

The public number comes from replaying public_bundle (deterministic, quota-free, D7); the held-out number, when held_out_bundle is given, from replaying it the same way (gated CI only — the bundle is never committed); the close-the-loop cell, when lift_bundle is given, from the recorded lift replay. run_recorded raises :class:ProviderError if model is missing from a bundle — the caller decides whether to skip the seed or fail.

summarize

summarize(report: EvalReport) -> Aggregate

The aggregate-only view of an eval report — the only thing a board row carries from it.

run_live_lift

run_live_lift(
    prompts: Sequence[EvalPrompt],
    *,
    model: str,
    judge_model: str = JUDGE_MODEL,
    n: int = 3,
    budget: int = DEFAULT_BUDGET,
    provider: Provider | None = None,
    retriever: Retriever | None = None,
    judge_provider: Provider | None = None,
    gmat_root: str | None = None,
    pace: float = 0.0,
) -> LiftReport

Run the close-the-loop eval live: a real model, a real GMAT dry-run, and a live judge.

The on-demand / fixture-refresh path (decision D7) — needs a reachable generation provider, the [gmat] extra with a GMAT install, and a reachable judge. pace seconds are slept between prompts to respect the free-tier daily budget. No fixtures are written.

run_recorded_lift

run_recorded_lift(
    bundle_dir: str | Path, *, budget: int = DEFAULT_BUDGET
) -> LiftReport

Replay the recorded close-the-loop bundle and return its :class:LiftReport (decision D7).

Deterministic and quota-free: a :class:_TrajectoryProvider replays each prompt's recorded repair trajectory through the real loop, a :class:RecordedDryRun replays the dynamic tier, and the recorded judge verdicts settle the semantic layer — zero model calls, zero GMAT.

A bundle directory holds:

  • prompts.json: the prompt set (see :func:gmat_copilot.eval.prompts.load_prompts).
  • trajectory.json: {prompt_id: [draft_script, ...]}, the recorded repair sequence.
  • verdicts.json: {draft_hash: {"dry_run": {...}, "judge": [verdict, ...]}}.

load_prompts

load_prompts(path: str | Path) -> list[EvalPrompt]

Load an eval prompt set from a JSON file.

record_bundle

record_bundle(
    prompts: Sequence[EvalPrompt],
    bundle_dir: str | Path,
    *,
    model: str,
    judge_model: str = JUDGE_MODEL,
    n: int = 3,
    provider: Provider | None = None,
    retriever: Retriever | None = None,
    judge_provider: Provider | None = None,
    pace: float = 0.0,
) -> EvalReport

Run the live eval once and freeze it as a recorded bundle in bundle_dir (decision D7).

Writes completions.json (the generated scripts, keyed for :class:RecordedProvider) and judge.json (the raw per-run judge verdicts). prompts.json is the authored source of truth and is left untouched; :func:run_recorded on the same directory then reproduces this run's scores deterministically. Returns the live :class:EvalReport.

run_live

run_live(
    prompts: Sequence[EvalPrompt],
    *,
    model: str,
    judge_model: str = JUDGE_MODEL,
    n: int = 3,
    provider: Provider | None = None,
    retriever: Retriever | None = None,
    judge_provider: Provider | None = None,
    pace: float = 0.0,
) -> EvalReport

Generate and judge each prompt live, returning the :class:EvalReport (decision D6).

Needs a reachable generation provider (model is a "provider:model" selector unless provider is given) and a reachable judge. pace seconds are slept between model calls to respect the free-tier per-minute budget. No fixtures are written — use :func:record_bundle to freeze a run.

run_recorded

run_recorded(
    bundle_dir: str | Path, *, model: str
) -> EvalReport

Replay the recorded eval bundle for model and return its :class:EvalReport.

Deterministic and quota-free: the structural layer re-scores the recorded completion text and the judge layer replays the recorded verdicts (decision D7).

structural_score

structural_score(
    script_text: str, spec: StructuralSpec
) -> StructuralResult

Score script_text against spec with the deterministic structural checks.

Leaderboard

leaderboard

The per-model leaderboard engine: a ranked board over the eval suite (decision D16).

Where :mod:gmat_copilot.eval.runner scores one provider:model into an :class:~gmat_copilot.eval.runner.EvalReport, this sweeps a set of explicit provider:model selectors through the same shipped recorded scorer (no new scoring math) and assembles a ranked leaderboard.json. Two roles for the eval set decide the ranking:

  • the committed public prompt set is the reproducibility anchor — its number reproduces byte-for-byte offline from the recorded bundle (decision D7), pinned by the bundle's content hash;
  • a never-committed held-out set is the headline — the board ranks on it, so overfitting the public prompts buys no rank. A large public - held_out gap is the overfit tell.

The board carries aggregates only (per-tier pass-rates, the close-the-loop figures, usage, and a run block); no prompt text, intent, or judge verdict ever reaches it, so a held-out gold cannot leak through the published JSON (:func:assert_aggregate_only). Held-out scoring runs only in gated CI against a private store; offline and in per-merge CI the held-out is pending and the public anchor stands alone.

The engine is pure and injection-driven: generated_at and tool_version are passed in (never read from the clock), so a built board is byte-deterministic and testable.

LeaderboardError

Bases: Exception

A board failed an integrity check (a leak, or a non-reproducing row).

Aggregate dataclass

An aggregate-only score for one prompt set: the headline rate plus its per-tier breakdown.

CloseTheLoop dataclass

The v0.2 close-the-loop figures for a model (decisions D12, D13), as a board cell.

RunMeta dataclass

The reproducibility block pinning how a row was produced (decision D16).

BoardRow dataclass

One provider:model entry: public anchor, held-out headline, and close-the-loop cell.

overfit_gap property

overfit_gap: float | None

public - held_out — the overfit tell; None until the held-out has been scored.
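
The pending case is the part worth spelling out. In the sketch below, Row is a hypothetical reduction of BoardRow to two plain floats (the real row holds aggregate cells, not bare numbers); it only illustrates the None-until-scored behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical reduction of BoardRow to the two headline rates."""
    public: float
    held_out: Optional[float] = None  # None while the held-out is pending

    @property
    def overfit_gap(self) -> Optional[float]:
        """public - held_out, or None until the held-out has been scored."""
        if self.held_out is None:
            return None
        return self.public - self.held_out


pending = Row(public=0.75)
scored = Row(public=0.75, held_out=0.5)
```

Here pending.overfit_gap is None, while scored.overfit_gap is 0.25: the model does a quarter better on the prompts it could have seen.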

summarize

summarize(report: EvalReport) -> Aggregate

The aggregate-only view of an eval report — the only thing a board row carries from it.

close_the_loop_from_lift

close_the_loop_from_lift(
    report: LiftReport,
) -> CloseTheLoop

Fold a :class:LiftReport into the board's close-the-loop cell (decisions D12, D13).

bundle_sha16

bundle_sha16(
    bundle: Path, names: Sequence[str] = STATIC_BUNDLE_FILES
) -> str

The 16-hex content hash over names in bundle — pins a recorded result (decision D7).
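
A content hash of this shape could be computed as sketched below. Everything specific here is an assumption: the digest algorithm (SHA-256), the way names and bytes are fed to it, and the sorted iteration order are not stated above, and content_hash16 is a hypothetical name. The file names are used only as sample inputs.

```python
import hashlib
import tempfile
from collections.abc import Sequence
from pathlib import Path


def content_hash16(bundle: Path, names: Sequence[str]) -> str:
    """First 16 hex characters of a digest over the named files' bytes."""
    digest = hashlib.sha256()  # assumption: the real algorithm is not stated
    for name in sorted(names):  # sorted so argument order cannot change the hash
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between a name and its content
        digest.update((bundle / name).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:16]


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    names = ["completions.json", "judge.json"]
    (bundle / "completions.json").write_text("{}")
    (bundle / "judge.json").write_text("{}")
    before = content_hash16(bundle, names)
    # Any edit to a recorded file changes the pin.
    (bundle / "judge.json").write_text('{"changed": true}')
    after = content_hash16(bundle, names)
```

The point of the pin is visible in the last lines: editing any hashed file yields a different 16-character value, so a recorded result cannot drift silently.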

recorded_usage

recorded_usage(
    bundle: Path, *, model: str, n_votes: int
) -> dict[str, int]

Sum model's recorded generation usage in bundle, plus the implied judge-call count.

Judge usage is not recorded, so judge_calls is derived as generations times the vote count — the free-tier transparency the board publishes (decision D16), not a measured token total.

score_entry

score_entry(
    model: str,
    *,
    public_bundle: str | Path,
    tool_version: str,
    held_out_bundle: str | Path | None = None,
    lift_bundle: str | Path | None = None,
    provider: str = "recorded",
    kind: str = "seed",
    n_votes: int = 3,
    judge_model: str = JUDGE_MODEL,
    submitted_by: str = "seed",
    verified: bool = True,
) -> BoardRow

Score one provider:model into a :class:BoardRow through the shipped recorded scorer.

The public number comes from replaying public_bundle (deterministic, quota-free, D7); the held-out number, when held_out_bundle is given, from replaying it the same way (gated CI only — the bundle is never committed); the close-the-loop cell, when lift_bundle is given, from the recorded lift replay. run_recorded raises :class:ProviderError if model is missing from a bundle — the caller decides whether to skip the seed or fail.

build_leaderboard

build_leaderboard(
    rows: Iterable[BoardRow],
    *,
    eval_protocol_version: str,
    generated_at: str,
    judge_model: str,
    public_set: dict[str, Any],
    held_out_set: dict[str, Any],
) -> dict[str, Any]

Assemble a ranked leaderboard.json payload — ranked on the held-out headline (D16).

generated_at is injected (never read from the clock) so the payload is byte-deterministic. The public anchor is shown alongside the headline; a pending held-out sorts last.
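
The ordering rule can be sketched with a single sort key. The row dictionaries, their keys, and the model names below are hypothetical, and tie-breaking between equal held-out rates is left to the stable sort here, which is an assumption.

```python
def rank(rows: list[dict]) -> list[dict]:
    """Order rows by held-out rate, highest first; pending rows go last."""
    return sorted(
        rows,
        # False sorts before True, so scored rows precede pending ones;
        # negating the rate gives descending order among the scored rows.
        key=lambda r: (r["held_out"] is None, -(r["held_out"] or 0.0)),
    )


rows = [
    {"model": "a:one", "public": 0.9, "held_out": None},
    {"model": "b:two", "public": 0.6, "held_out": 0.7},
    {"model": "c:three", "public": 0.8, "held_out": 0.5},
]
ranked = rank(rows)
order = [r["model"] for r in ranked]
```

In this example a:one has the best public number but sorts last because its held-out is pending, and b:two leads despite the weakest public number.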

dumps

dumps(board: dict[str, Any]) -> str

Canonical board bytes: sorted-key, 2-space JSON with a trailing newline (stable diffs).
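
A serializer with those three properties can be written with the standard json module. This is a sketch of the described format, not necessarily the library's exact call; canonical_dumps is a hypothetical name and options such as ensure_ascii are left at their defaults as an assumption.

```python
import json
from typing import Any


def canonical_dumps(board: dict[str, Any]) -> str:
    """Sorted keys, 2-space indent, and exactly one trailing newline."""
    return json.dumps(board, sort_keys=True, indent=2) + "\n"


first = canonical_dumps({"rows": [], "generated_at": "2025-01-01T00:00:00Z"})
second = canonical_dumps({"generated_at": "2025-01-01T00:00:00Z", "rows": []})
```

Both calls produce identical text regardless of key insertion order, which is what keeps diffs of the committed board stable.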

assert_aggregate_only

assert_aggregate_only(board: dict[str, Any]) -> None

Raise :class:LeaderboardError if a row's public/held-out cell carries a non-aggregate key.

The firewall the published board must satisfy (decision D16): a board carries pass-rates and metadata only, never a prompt, an intent, or a judge verdict — so a held-out gold cannot leak through it. A row exposing an unexpected key in an aggregate cell is rejected.
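
The structural check amounts to a key allow-list over each aggregate cell. In the sketch below the board layout ("rows", "public", "held_out"), the permitted keys, and the names BoardIntegrityError and check_aggregate_only are all assumptions for illustration; the real schema is not documented in this section.

```python
from typing import Any


class BoardIntegrityError(Exception):
    """Hypothetical stand-in for LeaderboardError."""


# Assumption: the real set of permitted aggregate keys is not documented here.
ALLOWED_CELL_KEYS = frozenset({"pass_rate", "by_tier", "n"})


def check_aggregate_only(board: dict[str, Any]) -> None:
    """Raise if any public/held-out cell carries a key outside the allow-list."""
    for row in board.get("rows", []):
        for cell_name in ("public", "held_out"):
            cell = row.get(cell_name)
            if cell is None:  # a pending held-out cell has nothing to inspect
                continue
            unexpected = sorted(set(cell) - ALLOWED_CELL_KEYS)
            if unexpected:
                raise BoardIntegrityError(
                    f"{cell_name} cell carries non-aggregate keys: {unexpected}"
                )


clean = {"rows": [{"public": {"pass_rate": 0.8, "by_tier": {}, "n": 10}, "held_out": None}]}
leaky = {"rows": [{"public": {"pass_rate": 0.8, "intent": "raise apogee"}, "held_out": None}]}
check_aggregate_only(clean)  # passes silently
```

Calling check_aggregate_only(leaky) raises, because the "intent" key is not an aggregate.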

assert_no_leak

assert_no_leak(
    serialized: str, secrets: Iterable[str]
) -> None

Raise if any held-out secret string appears in the serialized board (a hard leak check).
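
A content scan of this kind is a verbatim substring search over the bytes that ship. The sketch below is an assumption-laden illustration: check_no_leak is a hypothetical name, ValueError stands in for whatever exception the library raises, and empty strings are skipped so they cannot match everything.

```python
from collections.abc import Iterable


def check_no_leak(serialized: str, secrets: Iterable[str]) -> None:
    """Raise if any non-empty secret occurs verbatim in the serialized board."""
    for secret in secrets:
        # Skip empty strings: "" is a substring of every document.
        if secret and secret in serialized:
            # The message deliberately does not echo the secret itself.
            raise ValueError("held-out text found in the serialized board")


published = '{"rows": [{"model": "a:one", "public": {"pass_rate": 0.8}}]}'
secrets = ["Raise the apogee to GEO with two burns", ""]
check_no_leak(published, secrets)  # passes: no held-out text present
```

A verbatim scan only catches text that survives serialization unchanged; a string containing characters that JSON escapes would need to be compared in its escaped form.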

held_out_secrets

held_out_secrets(
    config: dict[str, Any], held_out_root: Path
) -> list[str]

The held-out gold strings the published board must never contain — every held-out prompt's request and intent text, read from the private store under held_out_root.

Fed to :func:assert_no_leak so the firewall scans the bytes that ship, not just their key names (the structural :func:assert_aggregate_only check). Held-out bundles that have not been fetched are skipped, so this is a no-op offline and the realistic content scan only in gated CI.

build_from_config

build_from_config(
    config: dict[str, Any],
    *,
    root: Path,
    generated_at: str,
    tool_version: str,
    held_out_root: Path | None = None,
) -> tuple[dict[str, Any], list[str]]

Build a board from a seeds config, resolving bundle paths under root.

Returns the board payload and a list of human-readable notes for seeds that were skipped (their public bundle was absent or did not carry the model — e.g. a live-only seed run offline). A seed is scored against its held-out bundle only when held_out_root is given and that bundle exists, so offline and in per-merge CI every held-out cell is pending. Held-out golds are never committed: the held-out bundles live under held_out_root (a gitignored cache fetched in gated CI), never in the repo tree.