API reference¶

Public surface¶

gmat_copilot ¶

gmat-copilot — model-agnostic natural-language → GMAT .script generation.

Retrieval-grounded generation through a provider abstraction, with a static lint gate and a two-layer evaluation suite. The public surface is :func:draft and the :class:CopilotResult it returns; generation and the static lint gate need no GMAT install, and only the optional dynamic dry-run does.

GmatExtraNotInstalled ¶

Bases: RuntimeError

The dry-run was called without the optional [gmat] extra installed (or without a GMAT install).

Generation and the static lint gate are GMAT-free; only the dynamic dry-run needs gmat-run and a GMAT install. This is raised eagerly by :func:dry_run so a missing extra is a clear, actionable error rather than an obscure import failure inside a subprocess.

DraftCancelled ¶

Bases: RuntimeError

Generation was cancelled via the cancel callback, either at an attempt boundary or between generation and validation.

Raised by :func:draft when the caller's cancel predicate returns true; the editor surface's cancellable-progress channel (decision D15) routes a user cancel through this. The predicate is polled before each attempt begins and again after the provider returns, before validation. An in-flight provider call or dry-run subprocess still runs to its own completion or timeout, but a cancel observed after generation skips the dry-run, so even a single pass (repair=0) can be cancelled between its generate and validate phases.

DraftRejected ¶

Bases: RuntimeError

Strict :func:draft rejected the final draft for blocking diagnostics (decisions D5 / D13).

Raised only after the repair budget is spent. The offending :class:~gmat_copilot.result.CopilotResult is attached as :attr:result, so the caller can inspect the script, its lint report, and any dry-run verdict.

Outcome `dataclass` ¶

Which draft won and how the run ended (decision D14).

winner indexes the final (returned) draft in :attr:Provenance.repair's attempts — always the last attempt, recorded explicitly so the sidecar is self-describing. passed is whether that draft validated clean (lint, plus the dry-run when it ran); strict records the active mode, so a reader can tell a strict rejection (passed=False, strict=True) from a permissive best-effort return (passed=False, strict=False). usage is the aggregate token total across every attempt.

Provenance `dataclass` ¶

A versioned record of how a draft was produced (decision D14).

Populated in memory by :func:gmat_copilot.draft for every run — it is the trace the result already holds — and serialised to a .copilot.json sidecar only on request by the saving surface (:meth:gmat_copilot.CopilotResult.save). Composes the existing :class:~gmat_copilot.result.RetrievalTrace and :class:~gmat_copilot.result.RepairTrace; the per-attempt draft history is :attr:repair's attempts.

CopilotResult `dataclass` ¶

Everything a :func:gmat_copilot.draft call produces (decision D10).

save ¶

save(path: str | Path, *, sidecar: bool = False) -> Path

Write the generated :attr:script to path (UTF-8); return the written path.

With sidecar=True also write the provenance record (decision D14) as a .copilot.json file next to the script (<path>.copilot.json). The sidecar is written only on request, never silently; it needs a result from :func:gmat_copilot.draft (whose :attr:provenance is populated).

Raises:

Type	Description
`TypeError`	when `sidecar=True` but :attr:`provenance` is not a populated record (e.g. a hand-built result).

DraftAttempt `dataclass` ¶

One iteration of the repair loop (decision D13): a draft and how it validated.

The loop generates a draft, lints it, and — when the dynamic tier is enabled and lint is clean — dry-runs it; feedback is what was fed into the next attempt's repair prompt.

DryRunReport `dataclass` ¶

The dynamic gmat-run dry-run finding for a script — a separate tier from the lint report.

The dry-run runs only on a lint-clean script (decision D12): Mission.load is the config tier, and mission.run + Results.converged the execution tier, entered only when the script has a solver (Target / Optimize). Dry-run findings do not merge into :class:LintReport — lint diagnostics are precise (rule / severity / line / column) and a dry-run finding is coarser, so it lands here. ok is the blocking signal: a not-ok report rejects in strict mode, just as a blocking lint diagnostic does.

LintDiagnostic `dataclass` ¶

One linter finding mapped from a gmat_script diagnostic: rule, severity, and location.

LintReport `dataclass` ¶

The lint diagnostics for a script, in source order, with severity-filtered views.

The strict/permissive decision lives in the validator (decision D5); this is the raw report.

clean `property` ¶

clean: bool

True when the linter reported nothing at all.

blocking ¶

blocking(*, strict: bool) -> tuple[LintDiagnostic, ...]

Diagnostics that reject a draft under the given mode (decision D5).

Strict rejects on ERROR and WARNING — every WARNING-level rule is a hard GMAT load error. Permissive never blocks: it returns the best-effort script with all diagnostics attached.

RepairTrace `dataclass` ¶

The repair loop's per-attempt history and why it stopped (decision D13).

Nested in the :class:Provenance record held on :attr:CopilotResult.provenance (as :attr:Provenance.repair); it is the substrate that the provenance sidecar (D14) formalises into a versioned record.

RetrievalChunk `dataclass` ¶

One corpus chunk surfaced by the retriever, with its source and similarity score.

RetrievalTrace `dataclass` ¶

The corpus chunks used to ground a generation, most-relevant first.

dry_run ¶

dry_run(
    script: str,
    *,
    timeout: float = DRYRUN_TIMEOUT_S,
    gmat_root: str | None = None,
) -> DryRunReport

Dry-run a lint-clean GMAT script against gmat-run in a fresh subprocess (decision D12).

The script is loaded with Mission.load (the config tier); when it declares a solver (Target / Optimize) it is also run with mission.run and its Results.converged checked (the execution tier). The verdict is a :class:~gmat_copilot.result.DryRunReport: ok is True when the script loads (and, if a solver is present, runs and converges), and a failure carries one actionable, path-free line distilled from GMAT's own diagnostics.

This is the dynamic tier only — it does not lint. Per decision D12 the dry-run runs only on a lint-clean script, so callers gate it behind :func:gmat_copilot.validate.validate; the repair loop sequences the two.

Parameters:

Name	Type	Description	Default
`script`	`str`	GMAT mission-script source text (lint-clean — see above).	required
`timeout`	`float`	wall-clock budget in seconds for the subprocess; on expiry the verdict degrades to a `"timeout"` failure (decision D12 default: 300 s, to bound a runaway solver).	`DRYRUN_TIMEOUT_S`
`gmat_root`	`str \| None`	GMAT install root; defaults to `GMAT_ROOT` / standard-location discovery (gmat-run's `locate_gmat`), which runs inside the worker subprocess.	`None`

Returns:

Type	Description
`DryRunReport`	the dry-run verdict as a :class:`~gmat_copilot.result.DryRunReport`.

Raises:

Type	Description
`GmatExtraNotInstalled`	when the `[gmat]` extra (gmat-run) is not importable.

require_gmat_extra ¶

require_gmat_extra() -> None

Raise :class:GmatExtraNotInstalled unless gmat-run is importable.

The eager guard the dynamic tier and its CLI surfaces call before attempting a dry-run, so a missing [gmat] extra is a clear, actionable error rather than an obscure import failure.

draft ¶

draft(
    request: str,
    *,
    model: str | None = None,
    strict: bool = True,
    temperature: float = 0.0,
    max_tokens: int = 2048,
    retriever: Retriever | None = None,
    provider: Provider | None = None,
    repair: int = 0,
    dry_run: bool = False,
    gmat_root: str | None = None,
    dry_run_fn: DryRunFn | None = None,
    cancel: Callable[[], bool] | None = None,
) -> CopilotResult

Generate a GMAT mission .script from a natural-language request.

Orchestrates retrieve → generate → validate, wrapped in a bounded repair loop (decision D13): when a draft fails, the failing tier's diagnostics are fed back and the model regenerates, for at most repair retries, stopping at the first draft that validates clean (and, with the dry-run enabled, runs) or on no-progress / oscillation. Returns a :class:~gmat_copilot.result.CopilotResult for the final draft, with a versioned :class:~gmat_copilot.provenance.Provenance record (request, retrieval, the per-attempt history, and the outcome; decision D14) attached on provenance.

Parameters:

Name	Type	Description	Default
`request`	`str`	what the script should do, in natural language.	required
`model`	`str \| None`	the `"provider:model"` selector (decision D4 — there is no default; selection is always explicit). When provider is supplied this is the bare model name handed to it; otherwise it is resolved with :func:`~gmat_copilot.providers.select`, which errors and lists the reachable providers when it is `None`.	`None`
`strict`	`bool`	reject the final draft if it does not validate clean — lint ERROR and WARNING both block (decision D5), and a dry-run failure blocks when the dynamic tier is enabled — by raising :class:`DraftRejected` once the budget is spent. Permissive (`strict=False`) returns the best-effort draft with every diagnostic attached.	`True`
`temperature`	`float`	sampling temperature passed to the provider.	`0.0`
`max_tokens`	`int`	maximum number of tokens to generate.	`2048`
`retriever`	`Retriever \| None`	corpus retriever used to ground generation; defaults to a :class:`~gmat_copilot.rag.Retriever`. Retrieval is computed once from request and reused across repair attempts.	`None`
`provider`	`Provider \| None`	model provider used to generate; defaults to the one model selects.	`None`
`repair`	`int`	the retry budget for the repair loop (decision D13). `0` (the default) is a single pass — the v0.1 behaviour.	`0`
`dry_run`	`bool`	enable the dynamic gmat-run dry-run tier (decision D12) in validation; needs the `[gmat]` extra and a GMAT install. Off by default, keeping generation GMAT-free.	`False`
`gmat_root`	`str \| None`	GMAT install root forwarded to the dry-run (else `GMAT_ROOT` / discovery).	`None`
`dry_run_fn`	`DryRunFn \| None`	a dynamic-tier dry-run to use in place of the real gmat-run subprocess (the eval's deterministic replay seam, decision D7); `None` uses the real dry-run.	`None`
`cancel`	`Callable[[], bool] \| None`	an optional predicate polled before each attempt begins and again after the provider returns, before validation (decision D15); when it returns true generation stops with :class:`DraftCancelled`. The in-flight provider call (and a running dry-run) still completes — cancelling does not abort an HTTP request mid-flight — but a cancel observed after generation skips the potentially expensive dry-run, so even a single pass (`repair=0`) is cancellable between its generate and validate phases.	`None`

Returns:

Type	Description
`CopilotResult`	the final draft's script, its lint report (and dry-run verdict), the retrieval trace, provider metadata, aggregate usage, and the provenance record on `provenance`.

Raises:

Type	Description
`DraftCancelled`	when cancel returns true before an attempt begins or after the provider returns.
`DraftRejected`	in strict mode, when the final draft still has blocking diagnostics.
`ProviderError`	when no model is resolved — either model is `None` with no provider to apply it to, or :func:`~gmat_copilot.providers.select` cannot resolve the selector.
`ValueError`	when repair is negative.
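The control flow described above (one initial pass, a bounded number of retries, and a cancel predicate polled before each attempt and again before validation) can be sketched in isolation. This is a minimal illustration and not the package's implementation: `repair_loop_sketch`, `DraftCancelledSketch`, and the fake generate / validate callables are hypothetical stand-ins, and retrieval, strict rejection, and the no-progress / oscillation stop conditions are left out.

```python
from typing import Callable, Optional


class DraftCancelledSketch(RuntimeError):
    """Hypothetical stand-in for DraftCancelled."""


def repair_loop_sketch(
    generate: Callable[[tuple], str],
    validate: Callable[[str], tuple],
    *,
    repair: int = 0,
    cancel: Optional[Callable[[], bool]] = None,
) -> list:
    """Return the (script, feedback) history; empty feedback means a clean draft."""
    if repair < 0:
        raise ValueError("repair must be non-negative")
    attempts = []
    feedback: tuple = ()
    for _ in range(repair + 1):  # one initial pass plus at most `repair` retries
        if cancel is not None and cancel():
            raise DraftCancelledSketch("cancelled before an attempt began")
        script = generate(feedback)
        if cancel is not None and cancel():
            raise DraftCancelledSketch("cancelled before validation")
        feedback = validate(script)
        attempts.append((script, feedback))
        if not feedback:  # clean draft: stop early
            break
    return attempts


def fake_generate(feedback: tuple) -> str:
    # Pretend the model only gets it right once it has seen diagnostics.
    return "good" if feedback else "bad"


def fake_validate(script: str) -> tuple:
    return () if script == "good" else ("fix the epoch format",)


single_pass = repair_loop_sketch(fake_generate, fake_validate, repair=0)
with_retry = repair_loop_sketch(fake_generate, fake_validate, repair=2)
```

With `repair=0` the history holds the one failing draft; with a budget of 2 the loop stops after the second attempt because that draft validates clean, leaving the third attempt unused.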

read_sidecar ¶

read_sidecar(path: str | Path) -> Provenance

Read a .copilot.json sidecar back into a :class:Provenance (the inverse of :func:write_sidecar).

write_sidecar ¶

write_sidecar(
    provenance: Provenance, path: str | Path
) -> Path

Write provenance as JSON to path (UTF-8, \n newlines); return the written path.

path is written verbatim — derive the conventional location with :func:sidecar_path.

Result schema¶

result ¶

The result schema returned by :func:gmat_copilot.draft (decision D10).

One stable contract carries everything a generation request produces: the generated .script text, the lint report, the retrieval trace, and the provider/model/usage that produced it. The provenance field carries the versioned record of how the draft was produced — the request, the retrieved chunks, the draft history, and the outcome (decision D14) — and :meth:CopilotResult.save can serialise it to a .copilot.json sidecar next to the written script.

LintDiagnostic `dataclass` ¶

One linter finding mapped from a gmat_script diagnostic: rule, severity, and location.

LintReport `dataclass` ¶

The lint diagnostics for a script, in source order, with severity-filtered views.

The strict/permissive decision lives in the validator (decision D5); this is the raw report.

clean `property` ¶

clean: bool

True when the linter reported nothing at all.

blocking ¶

blocking(*, strict: bool) -> tuple[LintDiagnostic, ...]

Diagnostics that reject a draft under the given mode (decision D5).

Strict rejects on ERROR and WARNING — every WARNING-level rule is a hard GMAT load error. Permissive never blocks: it returns the best-effort script with all diagnostics attached.
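As a rough illustration of the strict / permissive gate described above, the following standalone sketch filters diagnostics by severity. The `Severity` enum, the `Diag` dataclass, the rule names, and the existence of an INFO level are assumptions made for the example; they are not the package's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):  # hypothetical stand-in for the linter's severities
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Diag:  # hypothetical stand-in for LintDiagnostic
    rule: str
    severity: Severity
    line: int


def blocking(diags: tuple, *, strict: bool) -> tuple:
    """Strict blocks on ERROR and WARNING; permissive never blocks."""
    if not strict:
        return ()
    return tuple(
        d for d in diags if d.severity in (Severity.ERROR, Severity.WARNING)
    )


report = (
    Diag("unknown-field", Severity.WARNING, 12),
    Diag("style-hint", Severity.INFO, 20),
    Diag("syntax", Severity.ERROR, 31),
)
strict_hits = blocking(report, strict=True)
permissive_hits = blocking(report, strict=False)
```

The filter preserves source order, so the blocking view lists findings in the same order as the full report.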

DryRunReport `dataclass` ¶

The dynamic gmat-run dry-run finding for a script — a separate tier from the lint report.

The dry-run runs only on a lint-clean script (decision D12): Mission.load is the config tier, and mission.run + Results.converged the execution tier, entered only when the script has a solver (Target / Optimize). Dry-run findings do not merge into :class:LintReport — lint diagnostics are precise (rule / severity / line / column) and a dry-run finding is coarser, so it lands here. ok is the blocking signal: a not-ok report rejects in strict mode, just as a blocking lint diagnostic does.

RetrievalChunk `dataclass` ¶

One corpus chunk surfaced by the retriever, with its source and similarity score.

RetrievalTrace `dataclass` ¶

The corpus chunks used to ground a generation, most-relevant first.

DraftAttempt `dataclass` ¶

One iteration of the repair loop (decision D13): a draft and how it validated.

The loop generates a draft, lints it, and — when the dynamic tier is enabled and lint is clean — dry-runs it; feedback is what was fed into the next attempt's repair prompt.

RepairTrace `dataclass` ¶

The repair loop's per-attempt history and why it stopped (decision D13).

Nested in the :class:Provenance record held on :attr:CopilotResult.provenance (as :attr:Provenance.repair); it is the substrate that the provenance sidecar (D14) formalises into a versioned record.

CopilotResult `dataclass` ¶

Everything a :func:gmat_copilot.draft call produces (decision D10).

save ¶

save(path: str | Path, *, sidecar: bool = False) -> Path

Write the generated :attr:script to path (UTF-8); return the written path.

With sidecar=True also write the provenance record (decision D14) as a .copilot.json file next to the script (<path>.copilot.json). The sidecar is written only on request, never silently; it needs a result from :func:gmat_copilot.draft (whose :attr:provenance is populated).

Raises:

Type	Description
`TypeError`	when `sidecar=True` but :attr:`provenance` is not a populated record (e.g. a hand-built result).

Provenance¶

provenance ¶

The versioned provenance record and its .copilot.json sidecar (decision D14).

D10 reserved a provenance field on :class:~gmat_copilot.result.CopilotResult; D13 filled it with a :class:~gmat_copilot.result.RepairTrace (the per-attempt draft history). D14 formalises that into a versioned record of the whole generation: the request, the resolved provider / model, the retrieved grounding, the draft history, and the outcome — and serialises it to a .copilot.json sidecar written next to a saved script (e.g. mission.script.copilot.json).

The in-memory :class:Provenance composes the existing dataclasses (it nests the RepairTrace D13 already builds); the on-disk JSON follows D14's flat schema — schema_version, request, provider, model, retrieval, drafts, outcome — and :func:to_json_dict / :func:from_json_dict map between the two. The JSON is stable (sorted keys, a stamped schema_version), so a recorded sidecar diffs cleanly, and it carries no credentials: the record only ever holds the request, the provider / model names, the retrieval trace, the drafts, and token usage — there is no field a key could enter through.

Outcome `dataclass` ¶

Which draft won and how the run ended (decision D14).

winner indexes the final (returned) draft in :attr:Provenance.repair's attempts — always the last attempt, recorded explicitly so the sidecar is self-describing. passed is whether that draft validated clean (lint, plus the dry-run when it ran); strict records the active mode, so a reader can tell a strict rejection (passed=False, strict=True) from a permissive best-effort return (passed=False, strict=False). usage is the aggregate token total across every attempt.

Provenance `dataclass` ¶

A versioned record of how a draft was produced (decision D14).

Populated in memory by :func:gmat_copilot.draft for every run — it is the trace the result already holds — and serialised to a .copilot.json sidecar only on request by the saving surface (:meth:gmat_copilot.CopilotResult.save). Composes the existing :class:~gmat_copilot.result.RetrievalTrace and :class:~gmat_copilot.result.RepairTrace; the per-attempt draft history is :attr:repair's attempts.

to_json_dict ¶

to_json_dict(provenance: Provenance) -> dict[str, Any]

Render provenance to D14's flat JSON shape (the draft history flattened to drafts).

from_json_dict ¶

from_json_dict(data: dict[str, Any]) -> Provenance

Reconstruct a :class:Provenance from D14's JSON shape, checking the schema version.

Raises:

Type	Description
`ValueError`	when `schema_version` is absent or does not equal :data:`SCHEMA_VERSION`, for example a sidecar newer than this reader understands.
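The version gate can be pictured as a single comparison made before any other field is read. This is a sketch under stated assumptions: the value `1` for `SCHEMA_VERSION` and the helper name `check_schema_version` are hypothetical, and the real reader goes on to rebuild the full record.

```python
SCHEMA_VERSION = 1  # hypothetical value; the real constant lives in the package


def check_schema_version(data: dict) -> int:
    """Reject a sidecar whose schema_version is missing or not the expected one."""
    version = data.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(
            f"unsupported schema_version {version!r}; "
            f"this reader understands {SCHEMA_VERSION}"
        )
    return version
```

Checking the version first means a mismatched file fails with one clear message instead of a confusing error about a missing or renamed key further down.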

dumps ¶

dumps(provenance: Provenance) -> str

Serialise provenance to stable JSON text — sorted keys, indented, trailing newline.

Sorted keys make a recorded sidecar diff cleanly run to run; ensure_ascii=False keeps any Unicode in the script or feedback readable (the sidecar is a UTF-8 file, not console output).
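The stable-text properties named above map directly onto standard `json.dumps` options. A minimal sketch, assuming a two-space indent (the source only says "indented") and using a plain dict in place of a real provenance record:

```python
import json


def dumps_sketch(record: dict) -> str:
    """Stable JSON text: sorted keys, indented, Unicode kept, trailing newline."""
    return json.dumps(record, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Two dicts with the same content but different insertion order.
first = {"request": "Δv burn to raise apogee", "model": "example-model"}
second = {"model": "example-model", "request": "Δv burn to raise apogee"}
text = dumps_sketch(first)
```

Because the keys are sorted, both dicts serialise to identical text, which is what lets two recorded sidecars be compared with an ordinary diff.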

sidecar_path ¶

sidecar_path(script_path: str | Path) -> Path

The sidecar path for a saved script: <script> -> <script>.copilot.json (D14).
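The mapping appends a suffix to the full file name and leaves the existing extension in place. A small sketch of that convention with `pathlib`; `sidecar_path_sketch` is a hypothetical name used only for illustration:

```python
from pathlib import Path


def sidecar_path_sketch(script_path) -> Path:
    """Append .copilot.json to the script's file name, keeping its extension."""
    path = Path(script_path)
    return path.with_name(path.name + ".copilot.json")


example = sidecar_path_sketch("out/mission.script")
```

Appending (instead of swapping the extension with `with_suffix`) keeps `mission.script` and `mission.txt` from colliding on the same sidecar name.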

write_sidecar ¶

write_sidecar(
    provenance: Provenance, path: str | Path
) -> Path

Write provenance as JSON to path (UTF-8, \n newlines); return the written path.

path is written verbatim — derive the conventional location with :func:sidecar_path.

read_sidecar ¶

read_sidecar(path: str | Path) -> Provenance

Read a .copilot.json sidecar back into a :class:Provenance (the inverse of :func:write_sidecar).

Validation¶

validate ¶

The lint validation gate — the v0.1 validator (decision D5).

Generated scripts are checked with the gmat-script static linter: GMAT-free, instant, and deterministic. Strict mode rejects on lint ERROR and WARNING (every WARNING-level rule is a hard GMAT load error); permissive mode returns the best-effort script with all diagnostics attached.

The dynamic gmat-run dry-run (gated behind the [gmat] extra) and the repair loop are later capabilities; this module is the GMAT-free tier.

validate ¶

validate(
    script: str, *, target_version: str | None = None
) -> LintReport

Lint script and return a :class:~gmat_copilot.result.LintReport.

Parameters:

Name	Type	Description	Default
`script`	`str`	GMAT mission-script source text.	required
`target_version`	`str \| None`	GMAT catalogue version to lint against; defaults to the newest shipped catalogue.	`None`

Returns:

Type	Description
`LintReport`	the diagnostics in source order. Use :meth:`LintReport.blocking` to apply the strict/permissive gate.

Dry-run¶

dryrun ¶

The dynamic gmat-run dry-run tier — the optional [gmat] half of validation (decision D12).

Where :mod:gmat_copilot.validate is the static, GMAT-free, instant lint gate (decision D5), this is the dynamic backstop: it drives GMAT's own loader and engine over a lint-clean script to catch the defects a tree-sitter parse cannot — bad numerics, malformed epochs, missing data files, the undeclared-reference case the linter is too conservative to flag, and solver non-convergence.

The dry-run is tiered (decision D12): Mission.load is the config tier; mission.run + Results.converged is the execution tier, entered only when the script has a solver (Target / Optimize), because a script can load and run yet leave a solver unconverged. Each dry-run runs in a fresh subprocess — gmatpy holds one process-global Moderator and cannot re-bootstrap in a single interpreter — so a crash or timeout degrades to a failure verdict rather than taking down the caller, and a repair loop can dry-run several drafts back to back.

The runner imports gmat-run only inside the worker subprocess (:mod:gmat_copilot._dryrun_worker); this module stays import-safe with the [gmat] extra absent, raising :class:GmatExtraNotInstalled only when :func:dry_run is actually called without it. The static lint gate and all of generation remain GMAT-free.
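The tiering described above can be sketched as plain control flow with the GMAT calls injected as callables. Everything here is illustrative: `load` and `run_converges` stand in for Mission.load and mission.run + Results.converged, the regular expression is only an assumed way to detect a solver command, and the script snippets are sample text.

```python
import re
from typing import Callable, Optional, Tuple

# Assumption: a solver is present when a line starts with Target or Optimize.
_SOLVER_CMD = re.compile(r"^\s*(?:Target|Optimize)\b", re.MULTILINE)


def has_solver(script: str) -> bool:
    return _SOLVER_CMD.search(script) is not None


def tiered_dry_run(
    script: str,
    load: Callable[[str], bool],
    run_converges: Callable[[str], bool],
) -> Tuple[bool, Optional[str]]:
    """Return (ok, failing_tier); the execution tier runs only for solver scripts."""
    if not load(script):
        return (False, "config")
    if has_solver(script) and not run_converges(script):
        return (False, "execution")
    return (True, None)


PLAIN = "Create Spacecraft Sat;\nBeginMissionSequence;\nPropagate Prop(Sat);\n"
WITH_SOLVER = PLAIN + "Target DC;\nEndTarget;\n"
```

A script without a solver passes as soon as it loads, even if the injected `run_converges` would have failed, because that tier is never entered.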

GmatExtraNotInstalled ¶

Bases: RuntimeError

The dry-run was called without the optional [gmat] extra installed (or without a GMAT install).

Generation and the static lint gate are GMAT-free; only the dynamic dry-run needs gmat-run and a GMAT install. This is raised eagerly by :func:dry_run so a missing extra is a clear, actionable error rather than an obscure import failure inside a subprocess.

strip_paths ¶

strip_paths(text: str) -> str

Replace any absolute POSIX path (/dir/.../name) with its basename (name).
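One way to get this behaviour is a single regular-expression substitution. The pattern below is an assumption about what counts as an absolute path (a `/` that does not follow a word character, `/`, or `.`, then whitespace-free segments); the real function may draw the boundary differently.

```python
import re

# Assumed pattern: leading "/" not preceded by a word char, "/" or ".",
# any number of "segment/" parts, then the final segment (the basename).
_ABS_PATH = re.compile(r"(?<![\w/.])/(?:[^/\s]+/)*([^/\s]+)")


def strip_paths_sketch(text: str) -> str:
    """Replace each absolute POSIX path with its last component."""
    return _ABS_PATH.sub(r"\1", text)


cleaned = strip_paths_sketch("cannot open /home/user/data/de405.bsp for reading")
```

The lookbehind keeps relative fragments such as `a/b` untouched, so only paths that start at the filesystem root are shortened.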

extract_feedback_line ¶

extract_feedback_line(raw: str) -> str

Return one actionable line from a raw GMAT log or gmat-run error string.

Prefers the first substantive ERROR / Interpreter-Exception line; falls back to the first WARNING, then to the first non-blank line. Strips the script-path prefix, the sequence-number prefix, and the trailing "in line:" noise that GMAT appends, and sanitises any absolute path to its basename.
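The preference order (error first, then warning, then any non-blank line) can be sketched as a two-tier scan. This covers only the line selection: the prefix stripping and path sanitising are omitted, the substring matching is an assumption about how lines are classified, and the log text is made-up sample input, not real GMAT output.

```python
def pick_feedback_line(raw: str) -> str:
    """Pick one line: first error-like, else first warning, else first non-blank."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    tiers = (("error", "exception"), ("warning",))
    for markers in tiers:
        for ln in lines:
            lowered = ln.lower()
            if any(marker in lowered for marker in markers):
                return ln
    return lines[0] if lines else ""


sample_log = (
    "Loading script\n"
    "WARNING: step size reduced\n"
    "ERROR: epoch format is invalid\n"
)
picked = pick_feedback_line(sample_log)
```

Scanning by tier instead of by position is what lets an error on the third line win over a warning on the second.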

require_gmat_extra ¶

require_gmat_extra() -> None

Raise :class:GmatExtraNotInstalled unless gmat-run is importable.

The eager guard the dynamic tier and its CLI surfaces call before attempting a dry-run, so a missing [gmat] extra is a clear, actionable error rather than an obscure import failure.
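An eager guard of this kind is commonly written with `importlib.util.find_spec`, which reports whether a module can be found without importing it. The sketch below is an assumption about the approach, not the package's code; `require_extra`, `ExtraNotInstalledSketch`, and the module names passed in are hypothetical.

```python
import importlib.util


class ExtraNotInstalledSketch(RuntimeError):
    """Hypothetical stand-in for GmatExtraNotInstalled."""


def require_extra(module_name: str, extra: str) -> None:
    """Raise a clear error unless the top-level module `module_name` is importable."""
    if importlib.util.find_spec(module_name) is None:
        raise ExtraNotInstalledSketch(
            f"the optional [{extra}] extra is not installed "
            f"(module {module_name!r} was not found)"
        )
```

Calling the guard before spawning any worker means the caller sees the actionable message directly, not a traceback from a child process.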

dry_run ¶

dry_run(
    script: str,
    *,
    timeout: float = DRYRUN_TIMEOUT_S,
    gmat_root: str | None = None,
) -> DryRunReport

Dry-run a lint-clean GMAT script against gmat-run in a fresh subprocess (decision D12).

The script is loaded with Mission.load (the config tier); when it declares a solver (Target / Optimize) it is also run with mission.run and its Results.converged checked (the execution tier). The verdict is a :class:~gmat_copilot.result.DryRunReport: ok is True when the script loads (and, if a solver is present, runs and converges), and a failure carries one actionable, path-free line distilled from GMAT's own diagnostics.

This is the dynamic tier only — it does not lint. Per decision D12 the dry-run runs only on a lint-clean script, so callers gate it behind :func:gmat_copilot.validate.validate; the repair loop sequences the two.

Parameters:

Name	Type	Description	Default
`script`	`str`	GMAT mission-script source text (lint-clean — see above).	required
`timeout`	`float`	wall-clock budget in seconds for the subprocess; on expiry the verdict degrades to a `"timeout"` failure (decision D12 default: 300 s, to bound a runaway solver).	`DRYRUN_TIMEOUT_S`
`gmat_root`	`str \| None`	GMAT install root; defaults to `GMAT_ROOT` / standard-location discovery (gmat-run's `locate_gmat`), which runs inside the worker subprocess.	`None`

Returns:

Type	Description
`DryRunReport`	the dry-run verdict as a :class:`~gmat_copilot.result.DryRunReport`.

Raises:

Type	Description
`GmatExtraNotInstalled`	when the `[gmat]` extra (gmat-run) is not importable.
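The "fresh subprocess with a wall-clock budget" pattern can be illustrated with the standard library alone. This sketch runs an arbitrary Python snippet where the real runner would start its worker; `run_isolated` and the shape of its return value are assumptions for the example.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float):
    """Run `code` in a fresh interpreter; degrade a crash or timeout to a verdict."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return (False, "timeout")
    if proc.returncode != 0:
        return (False, f"worker exited with code {proc.returncode}")
    return (True, proc.stdout.strip())
```

A crash or a runaway worker ends up as an ordinary `(False, reason)` value in the parent, which is what allows several drafts to be dry-run back to back.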

Repair loop¶

repair ¶

The bounded repair loop's building blocks (decision D13).

The loop itself lives in :func:gmat_copilot.generate.draft (it owns prompt construction and the provider call); this module supplies the pieces it needs: a combined lint-then-dry-run :func:evaluate, the :func:build_repair_prompt that feeds a failing draft's diagnostics back to the model, and small helpers for usage aggregation and the no-progress / oscillation hash check.

Validation is lint-first (decision D13): lint is precise and free, so a lint failure is reported without paying for the dry-run; the dry-run's one-line finding is the backstop for the lint-clean but unrunnable drafts the loop exists to fix. evaluate calls the dynamic tier only when it is enabled and the draft is lint-clean, so the GMAT-free path never touches gmat-run.

Verdict `dataclass` ¶

The outcome of validating one draft: pass/fail plus the diagnostics to feed forward.

evaluate ¶

evaluate(
    script: str,
    *,
    dry_run: bool,
    gmat_root: str | None = None,
    timeout: float = 300.0,
    dry_run_fn: DryRunFn | None = None,
) -> Verdict

Validate script lint-first, then (if enabled and lint-clean) dry-run it (decision D13).

A draft passes when it is lint-clean — no ERROR and no WARNING, every WARNING being a hard GMAT load error (decision D5) — and, when dry_run is enabled, the dynamic tier is ok. On failure the verdict carries the failing tier's diagnostics as feedback for the next attempt.

Parameters:

Name	Type	Description	Default
`script`	`str`	the draft to validate.	required
`dry_run`	`bool`	whether to run the dynamic gmat-run tier on a lint-clean draft.	required
`gmat_root`	`str \| None`	GMAT install root forwarded to the dry-run (else `GMAT_ROOT` / discovery).	`None`
`timeout`	`float`	wall-clock budget forwarded to the dry-run.	`300.0`
`dry_run_fn`	`DryRunFn \| None`	a dynamic-tier dry-run to use in place of the real gmat-run subprocess; the replay seam the recorded eval drives the loop with (decision D7). `None` uses the real :func:`gmat_copilot.dryrun.dry_run`.	`None`
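The lint-first ordering can be sketched with both tiers injected as callables. This is an illustration only: `VerdictSketch`, the `lint` callable returning a list of messages, and a `dry_run_fn` returning an `(ok, line)` pair are assumptions, not the package's actual signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class VerdictSketch:  # hypothetical stand-in for Verdict
    passed: bool
    feedback: Tuple[str, ...]


def evaluate_sketch(
    script: str,
    *,
    lint: Callable[[str], list],
    dry_run_fn: Optional[Callable[[str], Tuple[bool, str]]] = None,
) -> VerdictSketch:
    findings = tuple(lint(script))
    if findings:  # lint failed: report it and never pay for the dry-run
        return VerdictSketch(False, findings)
    if dry_run_fn is not None:
        ok, line = dry_run_fn(script)
        if not ok:
            return VerdictSketch(False, (line,))
    return VerdictSketch(True, ())


dry_calls = []


def fake_lint(script: str) -> list:
    return ["L3: unknown field"] if "Bogus" in script else []


def fake_dry_run(script: str) -> Tuple[bool, str]:
    dry_calls.append(script)
    return (False, "solver did not converge")


lint_fail = evaluate_sketch("Sat.Bogus = 1;", lint=fake_lint, dry_run_fn=fake_dry_run)
dry_fail = evaluate_sketch("Sat.SMA = 7000;", lint=fake_lint, dry_run_fn=fake_dry_run)
lint_only = evaluate_sketch("Sat.SMA = 7000;", lint=fake_lint)
```

The recorded `dry_calls` show that the dynamic tier was reached only for the lint-clean draft.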

build_repair_prompt ¶

build_repair_prompt(
    request: str,
    prev_script: str,
    feedback: tuple[str, ...],
) -> str

The repair request: the original intent + the failing draft + the diagnostics to fix.

The result is a new request string fed back through the normal generation prompt (system framing, retrieval grounding, output contract), so a repair attempt is an ordinary generation that additionally sees the prior attempt and why it failed.

draft_hash ¶

draft_hash(script: str) -> str

A stable content hash of a draft, for the no-progress / oscillation stop conditions.
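A content hash makes both stop conditions cheap set-style checks. The sketch below assumes SHA-256 over the UTF-8 text, and it assumes "no-progress" means the new draft equals the previous one while "oscillation" means it equals some earlier one; the source does not spell out either definition.

```python
import hashlib
from typing import List, Optional


def draft_hash_sketch(script: str) -> str:
    """Assumed hash: SHA-256 hex digest of the UTF-8 script text."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


def stop_reason(seen: List[str], script: str) -> Optional[str]:
    """Classify a new draft against the hashes of earlier attempts, oldest first."""
    digest = draft_hash_sketch(script)
    if seen and digest == seen[-1]:
        return "no-progress"  # assumed meaning: identical to the previous draft
    if digest in seen:
        return "oscillation"  # assumed meaning: identical to an older draft
    return None


seen = [draft_hash_sketch("draft A"), draft_hash_sketch("draft B")]
```

Comparing digests instead of full scripts keeps the history small and the check independent of script length.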

aggregate_usage ¶

aggregate_usage(
    attempts: tuple[DraftAttempt, ...],
) -> dict[str, int]

Sum the per-attempt token usage across the loop, key by key.
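Key-by-key summation is a natural fit for `collections.Counter`. The sketch works on plain usage dicts instead of DraftAttempt objects, and the key names `input_tokens` / `output_tokens` are assumptions for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_usage_sketch(usages: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Add up token counts per key; a key missing from an attempt counts as zero."""
    total: Counter = Counter()
    for usage in usages:
        total.update(usage)  # Counter.update adds to existing counts
    return dict(total)


combined = aggregate_usage_sketch(
    [
        {"input_tokens": 10, "output_tokens": 5},
        {"input_tokens": 3},
    ]
)
```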

Providers¶

providers ¶

The model-agnostic provider abstraction (decisions D4, D7).

One thin :class:Provider protocol with four real adapters (Anthropic, OpenAI, Ollama, GitHub Models) plus a :class:RecordedProvider that replays committed fixtures for deterministic, zero-quota CI. There is no default model: selection is explicit ("provider:model"); with none given, :func:select errors and lists the providers it can reach from configured credentials — it never auto-picks or recommends one. Credentials come from the environment, never committed.

Each real adapter's complete performs the provider call through its optional extra ([anthropic] / [openai] / [ollama]; GitHub Models needs none) and raises a clear, actionable error when that extra is not installed or the credential is absent. The protocol, credential discovery, no-default selection, and the recorded replay path round out the surface; :class:RecordingProvider captures live completions into the fixture shape the recorded path replays.

Completion `dataclass` ¶

A single provider completion: the text plus the provider/model/usage that produced it.

ProviderError ¶

Bases: RuntimeError

A provider could not satisfy a request (missing credential, unreachable, or unknown).

Provider ¶

Bases: Protocol

The contract every adapter satisfies.

reachable ¶

reachable() -> bool

Whether a call could succeed now — i.e. a credential/host is configured.

complete ¶

complete(
    prompt: str,
    *,
    model: str,
    temperature: float = 0.0,
    max_tokens: int = 1024,
) -> Completion

Generate a completion for prompt with model.

AnthropicProvider ¶

A Claude model via the user's ANTHROPIC_API_KEY (the [anthropic] extra).

OpenAIProvider ¶

An OpenAI model with the user's key (OPENAI_API_KEY). Needs the [openai] extra.

OllamaProvider ¶

A local Ollama server (OLLAMA_HOST, default http://localhost:11434).

GitHubModelsProvider ¶

GitHub Models (OpenAI-compatible), authenticated with GH_TOKEN / MODELS_PAT.

The free-tier path the eval and CI use; no provider SDK required.

RecordedProvider ¶

Replays committed fixtures keyed by (provider, model, prompt) — fully deterministic.

The CI inference path (decision D7): zero model calls, zero quota. A fixture records whatever real provider produced it; the replay reports provider == "recorded".

RecordingProvider ¶

Wraps a real provider and records every completion as a replayable fixture (D7 record mode).

A drop-in :class:Provider: :meth:complete delegates to the wrapped provider and stores the result keyed by (provider, model, prompt) in the shape :class:RecordedProvider and the eval bundle replay. :meth:save writes the accumulated fixtures to disk, merging with any already there — the record mode that captures new fixtures for the deterministic CI path.

save ¶

save(path: str | Path) -> Path

Write the recorded fixtures to path as JSON, merging with any already present.

prompt_key ¶

prompt_key(provider: str, model: str, prompt: str) -> str

The deterministic fixture key for a (provider, model, prompt) triple.
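A deterministic key for a triple is typically a hash over an unambiguous encoding of its parts. The sketch below assumes SHA-256 over a JSON array; the actual key format is not stated in the source, and the provider name, model name, and fixture text are placeholders.

```python
import hashlib
import json


def prompt_key_sketch(provider: str, model: str, prompt: str) -> str:
    """Assumed key: SHA-256 of the JSON-encoded [provider, model, prompt] list."""
    payload = json.dumps([provider, model, prompt], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# A tiny in-memory fixture store, standing in for committed fixture files.
fixtures = {
    prompt_key_sketch(
        "example-provider", "example-model", "Write a LEO script"
    ): "Create Spacecraft Sat;"
}


def replay(provider: str, model: str, prompt: str) -> str:
    """Look up a recorded completion; a missing fixture raises KeyError."""
    return fixtures[prompt_key_sketch(provider, model, prompt)]
```

Encoding the triple as a JSON array before hashing keeps the parts from running together, so `("ab", "c", ...)` and `("a", "bc", ...)` yield different keys.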

reachable_providers ¶

reachable_providers() -> list[str]

The real providers reachable now from configured credentials, in registry order.

select ¶

select(spec: str | None) -> tuple[Provider, str]

Resolve a "provider:model" spec to a (provider, model) pair — no default (D4).

Raises:

Type	Description
`ProviderError`	if spec is missing (lists the reachable providers), malformed, or names an unknown provider.
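The three failure cases in the table (missing, malformed, unknown provider) can be sketched as a small parser. This is illustrative only: it returns provider and model names, not a Provider instance, and `KNOWN_PROVIDERS`, `ProviderErrorSketch`, and the registry names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

KNOWN_PROVIDERS = ("anthropic", "openai", "ollama", "github")  # hypothetical names


class ProviderErrorSketch(RuntimeError):
    """Hypothetical stand-in for ProviderError."""


def parse_spec(spec: Optional[str], reachable: Iterable[str] = ()) -> Tuple[str, str]:
    if spec is None:
        listed = ", ".join(reachable) or "none"
        raise ProviderErrorSketch(f"no model selected; reachable providers: {listed}")
    # Split on the first colon only, so model names may themselves contain colons.
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ProviderErrorSketch(f"malformed spec {spec!r}; expected 'provider:model'")
    if provider not in KNOWN_PROVIDERS:
        raise ProviderErrorSketch(f"unknown provider {provider!r}")
    return provider, model
```

Splitting on the first colon matters for model tags that carry their own colon, such as a size suffix.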

Retrieval¶

rag ¶

Corpus ingest, the FAISS index, and the retriever (decisions D2, D3).

Retrieval grounds generation in the GMAT help pages, the stock sample scripts, the GmatFunctions, the gmat-script field catalogue, and a curated domain-notes tier. The corpus text and a prebuilt index for the default embedder are extracted by maintainers at build time (:mod:.build) and shipped with the package; the runtime :func:.load_corpus loads them with no GMAT install and no network, rebuilding the index on first use only as a fallback. sentence-transformers / faiss are imported lazily so importing the package stays light.

BgeEmbedder ¶

The default :class:Embedder: a lazily-loaded BGE sentence-transformer.

sentence_transformers is imported on first use, not at construction, so neither importing the package nor loading the shipped index (built for this model) pays the model-load cost.

Embedder ¶

Bases: Protocol

Embeds passages and queries into a shared vector space.

Implementations normalise their output so a flat inner-product index measures cosine similarity.

dim `property` ¶

dim: int

The embedding dimension.

encode ¶

encode(
    texts: Sequence[str], *, is_query: bool = False
) -> NDArray[float32]

Embed texts, returning one row per text. Set is_query for retrieval queries.
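The reason normalised output matters: for unit-length vectors the inner product equals cosine similarity, so a flat inner-product index ranks by cosine. A dependency-free sketch of that identity, with arbitrary example vectors:

```python
import math


def normalise(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))


passage = [3.0, 4.0, 0.0]
query = [1.0, 2.0, 2.0]
score = inner(normalise(passage), normalise(query))
```

Here `score` and `cosine(passage, query)` agree up to floating-point rounding, which is why the index never needs to store vector lengths.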

CorpusIndex ¶

The loaded corpus: the chunks, the FAISS index, and the embedder they were built for.

search ¶

search(
    query: str, *, embedder: Embedder, k: int = 8
) -> list[SearchHit]

Embed query and return the k most similar chunks, most relevant first.

embedder must be the one the index was built for (the same model that loaded or rebuilt it), so the query lands in the same vector space.

SearchHit `dataclass` ¶

A retrieved corpus chunk and its similarity score (higher is closer).

Retriever ¶

Embeds a query and returns the most relevant corpus chunks (decision D2).

Loads the shipped corpus and prebuilt index for the default embedder (rebuilding on first use only as a fallback for a non-default embedder or a corpus change), then runs a top-k search and trims the result to a token budget. The corpus and model load lazily on the first :meth:retrieve, so constructing a Retriever is cheap.

retrieve ¶

retrieve(
    query: str, *, top_k: int | None = None
) -> RetrievalTrace

Return the corpus chunks that ground query, most relevant first.

Trims the ranked hits to :attr:token_budget, keeping whole chunks in rank order and always retaining at least the top hit. To curb hallucinated command syntax, it also pins one worked-example chunk — the best-ranked passage that actually shows a mission sequence — into the grounding even when pure relevance ranks it below the kept set (it commonly does for setup-heavy queries, which retrieve resource definitions but no command syntax). The returned trace is exactly the set :func:assemble_context formats into the grounding block.
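
The trim-then-pin behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the `Hit` record, its `shows_sequence` flag, the whitespace token count, and the choice to stop at the first chunk that overflows the budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical ranked hit; the real SearchHit fields are not listed in this reference."""

    text: str
    score: float
    shows_sequence: bool  # assumed flag: the passage shows mission-sequence commands


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens as a cheap proxy for the real token count.
    return len(text.split())


def trim_and_pin(hits: list[Hit], token_budget: int) -> list[Hit]:
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    kept: list[Hit] = []
    used = 0
    for hit in ranked:
        cost = count_tokens(hit.text)
        # Keep whole chunks in rank order; the top hit is always retained.
        if kept and used + cost > token_budget:
            break
        kept.append(hit)
        used += cost
    if not any(h.shows_sequence for h in kept):
        # Pin the best-ranked worked example even though relevance alone dropped it.
        example = next((h for h in ranked if h.shows_sequence), None)
        if example is not None:
            kept.append(example)
    return kept


hits = [
    Hit("Spacecraft resource fields table", 0.9, False),
    Hit("ForceModel resource fields table", 0.8, False),
    Hit("Propagate Prop(Sat) {Sat.Apoapsis};", 0.1, True),
]
grounding = trim_and_pin(hits, token_budget=5)
```

With a budget of five tokens only the top hit fits, and the low-ranked command example is still appended so the grounding contains some command syntax.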

CorpusChunk `dataclass` ¶

One retrieval passage with its source provenance.

Parameters:

Name	Type	Description	Default
`text`	`str`	the passage embedded and returned as grounding.	required
`kind`	`ChunkKind`	the corpus tier the passage belongs to.	required
`origin`	`str`	the source identifier — a help-page name, sample/GmatFunction file name, catalogue type name, or domain-note name.	required
`section`	`str`	a finer locator within the origin — a help field-section heading or a sample `%----` banner label; empty for whole-file tiers.	`''`

load_corpus ¶

load_corpus(
    embedder: Embedder | None = None,
    *,
    corpus_dir: Path | None = None,
) -> CorpusIndex

Load the shipped corpus and its index (decision D2).

Parameters:

Name	Type	Description	Default
`embedder`	`Embedder \| None`	the embedder retrieval will use. `None` selects the default the index was built for, so the prebuilt index is loaded directly. A non-default embedder triggers a one-time fallback rebuild, cached under the XDG cache directory.	`None`
`corpus_dir`	`Path \| None`	the corpus directory; defaults to the shipped package data.	`None`

assemble_context ¶

assemble_context(trace: RetrievalTrace) -> str

Format a retrieval trace into a bounded, source-attributed grounding block.

Each chunk is rendered under its source label so generation (and a reader of the result) can see where the grounding came from. Empty when the trace has no chunks.

Evaluation¶

eval ¶

The evaluation suite: prompt set, structural scorer, LLM judge, and scorer (D6/D7).

BoardRow `dataclass` ¶

One provider:model entry: public anchor, held-out headline, and close-the-loop cell.

overfit_gap `property` ¶

overfit_gap: float | None

public - held_out — the overfit tell; None until the held-out has been scored.

LeaderboardError ¶

Bases: Exception

A board failed an integrity check (a leak, or a non-reproducing row).

DraftScore `dataclass` ¶

One draft's close-the-loop score: the two static layers plus the dynamic dry-run verdict.

static_pass `property` ¶

static_pass: bool

What the v0.1 static eval accepts: structurally clean and judged intent-correct.

runnable `property` ¶

runnable: bool

The full close-the-loop pass: static-accepted and the dry-run is ok.

LiftReport `dataclass` ¶

The close-the-loop outcomes and the per-tier dry-run-agreement and repair-lift aggregates.

base_runnable_by_tier `property` ¶

base_runnable_by_tier: dict[str, float]

The close-the-loop pass-rate per tier at repair = 0 (the single-pass baseline).

repaired_runnable_by_tier `property` ¶

repaired_runnable_by_tier: dict[str, float]

The close-the-loop pass-rate per tier at the repair budget.

lift_by_tier `property` ¶

lift_by_tier: dict[str, float]

The repair-loop lift per tier: repaired pass-rate minus base pass-rate (decision D13).

dry_run_agreement_by_tier `property` ¶

dry_run_agreement_by_tier: dict[str, float | None]

Per tier, the fraction of statically-accepted base drafts that also pass the dry-run.

The denominator is the drafts the v0.1 static eval would pass (structural ∧ judge) at repair = 0; the numerator, those whose dry-run is also ok. None for a tier with no statically-accepted draft to compare (the agreement is undefined, not 0). The shortfall below 1.0 is the static-vs-dynamic gap the dry-run tier exists to surface (decision D12).
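
The agreement and lift arithmetic can be made concrete with a short sketch. The input shape, a list of `(tier, static_pass, dry_run_ok)` tuples for the base drafts, is an assumption chosen for illustration; the real `LiftReport` works from its own rows.

```python
from __future__ import annotations

from collections import defaultdict


def agreement_by_tier(rows: list[tuple[str, bool, bool]]) -> dict[str, float | None]:
    """rows: (tier, static_pass, dry_run_ok) per base draft (hypothetical shape)."""
    accepted: dict[str, int] = defaultdict(int)
    agreed: dict[str, int] = defaultdict(int)
    tiers: set[str] = set()
    for tier, static_pass, dry_run_ok in rows:
        tiers.add(tier)
        if static_pass:
            accepted[tier] += 1  # denominator: statically accepted drafts
            if dry_run_ok:
                agreed[tier] += 1  # numerator: those that also dry-run ok
    # A tier with nothing statically accepted is undefined (None), not 0.
    return {t: (agreed[t] / accepted[t] if accepted[t] else None) for t in sorted(tiers)}


def lift(base_rate: float, repaired_rate: float) -> float:
    # Repair-loop lift: repaired pass-rate minus base pass-rate.
    return repaired_rate - base_rate


base_rows = [
    ("easy", True, True),
    ("easy", True, False),
    ("hard", False, False),
]
agreement = agreement_by_tier(base_rows)
```

Here the `easy` tier agrees on one of two statically accepted drafts (0.5), while `hard` has no accepted draft and so reports `None`.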

base_runnable `property` ¶

base_runnable: float

The overall close-the-loop pass-rate at repair = 0.

repaired_runnable `property` ¶

repaired_runnable: float

The overall close-the-loop pass-rate at the repair budget.

lift `property` ¶

lift: float

The overall repair-loop lift: repaired minus base pass-rate.

LiftRow `dataclass` ¶

One prompt's close-the-loop outcome at repair = 0 (base) and the budget (repaired).

base is the single-pass (v0.1) draft; repaired is the draft the bounded loop converged to.

RecordedDryRun `dataclass` ¶

A :data:~gmat_copilot.repair.DryRunFn that replays recorded verdicts keyed by draft hash.

The deterministic, GMAT-free dynamic tier the recorded close-the-loop eval drives the real loop with (decision D7): :func:gmat_copilot.draft calls it instead of the gmat-run subprocess.
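
Replaying verdicts keyed by draft hash can be sketched as below. The `ReplayDryRun` class, the plain-dict verdict shape, and the SHA-256 `toy_draft_hash` are assumptions for illustration; the package's `draft_hash` and `RecordedDryRun` may differ in both hashing and return type.

```python
import hashlib
from dataclasses import dataclass, field


def toy_draft_hash(script: str) -> str:
    # Assumption: a SHA-256 hex digest of the script text; the real draft_hash may differ.
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


@dataclass
class ReplayDryRun:
    """Hypothetical callable that replays recorded verdicts instead of launching GMAT."""

    verdicts: dict[str, dict] = field(default_factory=dict)

    def __call__(self, script: str) -> dict:
        key = toy_draft_hash(script)
        if key not in self.verdicts:
            # An unrecorded draft means the replay has diverged from the recording.
            raise KeyError(f"no recorded dry-run verdict for draft {key[:12]}")
        return self.verdicts[key]


script = "Create Spacecraft Sat;\n"
replay = ReplayDryRun({toy_draft_hash(script): {"ok": True}})
verdict = replay(script)
```

Keying on the draft text keeps the replay deterministic: the same script always yields the same recorded verdict, and an unseen draft fails loudly instead of silently passing.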

EvalPrompt `dataclass` ¶

One eval prompt: the request sent to the model, its intent, and its structural spec.

StructuralSpec `dataclass` ¶

What the deterministic structural layer asserts about a candidate script.

EvalReport `dataclass` ¶

The outcomes for an eval run and the aggregate pass-rate.

pass_rate_by_tier `property` ¶

pass_rate_by_tier: dict[str, float]

The pass-rate within each difficulty tier (decision D6 aggregates per tier).

PromptOutcome `dataclass` ¶

The scored outcome for one prompt: structural, judge, and combined verdicts.

StructuralResult `dataclass` ¶

The structural verdict for one candidate script and the specific checks it failed.

judge_verdicts ¶

judge_verdicts(
    intent: str,
    script: str,
    *,
    model: str = JUDGE_MODEL,
    n: int = 3,
    provider: Provider | None = None,
    pace: float = 0.0,
) -> list[bool | None]

Run the judge n times and return the raw per-run verdicts (decision D6).

The list of verdicts is what the recorded bundle freezes; :func:majority reduces it to the gate decision. pace seconds are slept between calls to respect the free-tier per-minute budget when recording live; unit tests leave it at 0.

Parameters:

Name	Type	Description	Default
`provider`	`Provider \| None`	the model provider; defaults to a :class:`~gmat_copilot.providers.GitHubModelsProvider` (the free-tier path the judge is specified against, decision D7).	`None`

majority ¶

majority(verdicts: Sequence[bool | None]) -> bool | None

Majority vote over verdicts, ignoring None; FAIL on a tie (decision D6).

Returns:

Type	Description
`bool \| None`	the majority boolean, or `None` if there are no non-`None` verdicts.
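
The voting rule stated above can be written in a few lines. This is a sketch of the described behaviour (ignore `None`, fail on a tie, `None` when nothing is readable), under the hypothetical name `majority_sketch`; it is not copied from the package source.

```python
from __future__ import annotations

from collections.abc import Sequence


def majority_sketch(verdicts: Sequence[bool | None]) -> bool | None:
    # Unreadable verdicts (None) are dropped, never counted as a vote.
    votes = [v for v in verdicts if v is not None]
    if not votes:
        return None
    yes = sum(votes)
    no = len(votes) - yes
    # A strict majority is needed to pass, so a tie fails.
    return yes > no


passed = majority_sketch([True, True, False])
tied = majority_sketch([True, False, None])
unreadable = majority_sketch([None, None, None])
```

With the default of three judge runs a tie can only arise when at least one verdict is unreadable, which is why the tie rule matters in practice.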

parse_verdict ¶

parse_verdict(content: str) -> bool | None

Extract the binary verdict from a judge completion.

Prefers the constrained {"satisfies_intent": bool} object; falls back to an unambiguous bare true/false in the prose. Returns None when the verdict cannot be read — a None is dropped by :func:majority, never counted as a vote.
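
One way to implement this two-step parse is sketched below. The regular expressions and the rule that "unambiguous" means exactly one of `true` / `false` appears in the text are assumptions; the package's own `parse_verdict` may apply different heuristics.

```python
from __future__ import annotations

import json
import re


def parse_verdict_sketch(content: str) -> bool | None:
    # 1. Prefer the constrained JSON object, wherever it sits in the completion.
    for match in re.finditer(r"\{[^{}]*\}", content):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        value = obj.get("satisfies_intent") if isinstance(obj, dict) else None
        if isinstance(value, bool):
            return value
    # 2. Fall back to a bare true/false, but only when exactly one of them appears.
    words = set(re.findall(r"\b(true|false)\b", content.lower()))
    if len(words) == 1:
        return words.pop() == "true"
    # 3. Otherwise the verdict cannot be read.
    return None


from_json = parse_verdict_sketch('Verdict: {"satisfies_intent": false} since no burn is applied.')
from_prose = parse_verdict_sketch("The script satisfies the intent: true.")
ambiguous = parse_verdict_sketch("It could be true or false.")
```

Returning `None` for the ambiguous case keeps a garbled completion from being counted as either a pass or a fail.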

build_from_config ¶

build_from_config(
    config: dict[str, Any],
    *,
    root: Path,
    generated_at: str,
    tool_version: str,
    held_out_root: Path | None = None,
) -> tuple[dict[str, Any], list[str]]

Build a board from a seeds config, resolving bundle paths under root.

Returns the board payload and a list of human-readable notes for seeds that were skipped (their public bundle was absent or did not carry the model — e.g. a live-only seed run offline). A seed is scored against its held-out bundle only when held_out_root is given and that bundle exists, so offline and in per-merge CI every held-out cell is pending. Held-out golds are never committed: the held-out bundles live under held_out_root (a gitignored cache fetched in gated CI), never in the repo tree.

build_leaderboard ¶

build_leaderboard(
    rows: Iterable[BoardRow],
    *,
    eval_protocol_version: str,
    generated_at: str,
    judge_model: str,
    public_set: dict[str, Any],
    held_out_set: dict[str, Any],
) -> dict[str, Any]

Assemble a ranked leaderboard.json payload — ranked on the held-out headline (D16).

generated_at is injected (never read from the clock) so the payload is byte-deterministic. The public anchor is shown alongside the headline; a pending held-out sorts last.

dumps ¶

dumps(board: dict[str, Any]) -> str

Canonical board bytes: sorted-key, 2-space JSON with a trailing newline (stable diffs).

score_entry ¶

score_entry(
    model: str,
    *,
    public_bundle: str | Path,
    tool_version: str,
    held_out_bundle: str | Path | None = None,
    lift_bundle: str | Path | None = None,
    provider: str = "recorded",
    kind: str = "seed",
    n_votes: int = 3,
    judge_model: str = JUDGE_MODEL,
    submitted_by: str = "seed",
    verified: bool = True,
) -> BoardRow

Score one provider:model into a :class:BoardRow through the shipped recorded scorer.

The public number comes from replaying public_bundle (deterministic, quota-free, D7); the held-out number, when held_out_bundle is given, from replaying it the same way (gated CI only — the bundle is never committed); the close-the-loop cell, when lift_bundle is given, from the recorded lift replay. run_recorded raises :class:ProviderError if model is missing from a bundle — the caller decides whether to skip the seed or fail.

summarize ¶

summarize(report: EvalReport) -> Aggregate

The aggregate-only view of an eval report — the only thing a board row carries from it.

run_live_lift ¶

run_live_lift(
    prompts: Sequence[EvalPrompt],
    *,
    model: str,
    judge_model: str = JUDGE_MODEL,
    n: int = 3,
    budget: int = DEFAULT_BUDGET,
    provider: Provider | None = None,
    retriever: Retriever | None = None,
    judge_provider: Provider | None = None,
    gmat_root: str | None = None,
    pace: float = 0.0,
) -> LiftReport

Run the close-the-loop eval live: a real model, a real GMAT dry-run, and a live judge.

The on-demand / fixture-refresh path (decision D7) — needs a reachable generation provider, the [gmat] extra with a GMAT install, and a reachable judge. pace seconds are slept between prompts to respect the free-tier daily budget. No fixtures are written.

run_recorded_lift ¶

run_recorded_lift(
    bundle_dir: str | Path, *, budget: int = DEFAULT_BUDGET
) -> LiftReport

Replay the recorded close-the-loop bundle and return its :class:LiftReport (decision D7).

Deterministic and quota-free: a :class:_TrajectoryProvider replays each prompt's recorded repair trajectory through the real loop, a :class:RecordedDryRun replays the dynamic tier, and the recorded judge verdicts settle the semantic layer — zero model calls, zero GMAT.

A bundle directory holds:

prompts.json — the prompt set (see :func:gmat_copilot.eval.prompts.load_prompts).
trajectory.json — {prompt_id: [draft_script, ...]}, the recorded repair sequence.
verdicts.json — {draft_hash: {"dry_run": {...}, "judge": [verdict, ...]}}.

load_prompts ¶

load_prompts(path: str | Path) -> list[EvalPrompt]

Load an eval prompt set from a JSON file.

record_bundle ¶

record_bundle(
    prompts: Sequence[EvalPrompt],
    bundle_dir: str | Path,
    *,
    model: str,
    judge_model: str = JUDGE_MODEL,
    n: int = 3,
    provider: Provider | None = None,
    retriever: Retriever | None = None,
    judge_provider: Provider | None = None,
    pace: float = 0.0,
) -> EvalReport

Run the live eval once and freeze it as a recorded bundle in bundle_dir (decision D7).

Writes completions.json (the generated scripts, keyed for :class:RecordedProvider) and judge.json (the raw per-run judge verdicts). prompts.json is the authored source of truth and is left untouched; :func:run_recorded on the same directory then reproduces this run's scores deterministically. Returns the live :class:EvalReport.

run_live ¶

run_live(
    prompts: Sequence[EvalPrompt],
    *,
    model: str,
    judge_model: str = JUDGE_MODEL,
    n: int = 3,
    provider: Provider | None = None,
    retriever: Retriever | None = None,
    judge_provider: Provider | None = None,
    pace: float = 0.0,
) -> EvalReport

Generate and judge each prompt live, returning the :class:EvalReport (decision D6).

Needs a reachable generation provider (model is a "provider:model" selector unless provider is given) and a reachable judge. pace seconds are slept between model calls to respect the free-tier per-minute budget. No fixtures are written — use :func:record_bundle to freeze a run.

run_recorded ¶

run_recorded(
    bundle_dir: str | Path, *, model: str
) -> EvalReport

Replay the recorded eval bundle for model and return its :class:EvalReport.

Deterministic and quota-free: the structural layer re-scores the recorded completion text and the judge layer replays the recorded verdicts (decision D7).

structural_score ¶

structural_score(
    script_text: str, spec: StructuralSpec
) -> StructuralResult

Score script_text against spec with the deterministic structural checks.

Leaderboard¶

leaderboard ¶

The per-model leaderboard engine: a ranked board over the eval suite (decision D16).

Where :mod:gmat_copilot.eval.runner scores one provider:model into an :class:~gmat_copilot.eval.runner.EvalReport, this sweeps a set of explicit provider:model selectors through the same shipped recorded scorer (no new scoring math) and assembles a ranked leaderboard.json. Two roles for the eval set decide the ranking:

the committed public prompt set is the reproducibility anchor — its number reproduces byte-for-byte offline from the recorded bundle (decision D7), pinned by the bundle's content hash;
a never-committed held-out set is the headline — the board ranks on it, so overfitting the public prompts buys no rank. A large public - held_out gap is the overfit tell.

The board carries aggregates only (per-tier pass-rates, the close-the-loop figures, usage, and a run block); no prompt text, intent, or judge verdict ever reaches it, so a held-out gold cannot leak through the published JSON (:func:assert_aggregate_only). Held-out scoring runs only in gated CI against a private store; offline and in per-merge CI the held-out is pending and the public anchor stands alone.

The engine is pure and injection-driven: generated_at and tool_version are passed in (never read from the clock), so a built board is byte-deterministic and testable.

LeaderboardError ¶

Bases: Exception

A board failed an integrity check (a leak, or a non-reproducing row).

Aggregate `dataclass` ¶

An aggregate-only score for one prompt set: the headline rate plus its per-tier breakdown.

CloseTheLoop `dataclass` ¶

The v0.2 close-the-loop figures for a model (decisions D12, D13), as a board cell.

RunMeta `dataclass` ¶

The reproducibility block pinning how a row was produced (decision D16).

BoardRow `dataclass` ¶

One provider:model entry: public anchor, held-out headline, and close-the-loop cell.

overfit_gap `property` ¶

overfit_gap: float | None

public - held_out — the overfit tell; None until the held-out has been scored.

summarize ¶

summarize(report: EvalReport) -> Aggregate

The aggregate-only view of an eval report — the only thing a board row carries from it.

close_the_loop_from_lift ¶

close_the_loop_from_lift(
    report: LiftReport,
) -> CloseTheLoop

Fold a :class:LiftReport into the board's close-the-loop cell (decisions D12, D13).

bundle_sha16 ¶

bundle_sha16(
    bundle: Path, names: Sequence[str] = STATIC_BUNDLE_FILES
) -> str

The 16-hex content hash over names in bundle — pins a recorded result (decision D7).
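
A content hash of this kind can be sketched as follows. The choice of SHA-256, the name-plus-bytes framing, and sorting the names are all assumptions made for the illustration; only the 16-hex-character result is taken from the description above, and `STATIC_BUNDLE_FILES` is not reproduced here.

```python
import hashlib
import tempfile
from collections.abc import Sequence
from pathlib import Path


def sha16_sketch(bundle: Path, names: Sequence[str]) -> str:
    digest = hashlib.sha256()  # assumed algorithm
    for name in sorted(names):  # sorted so the argument order cannot change the hash
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update((bundle / name).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:16]


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "prompts.json").write_text("[]", encoding="utf-8")
    (bundle / "judge.json").write_text("{}", encoding="utf-8")
    first = sha16_sketch(bundle, ["prompts.json", "judge.json"])
    reordered = sha16_sketch(bundle, ["judge.json", "prompts.json"])
    # Any edit to a hashed file must change the pin.
    (bundle / "judge.json").write_text('{"p1": [true]}', encoding="utf-8")
    changed = sha16_sketch(bundle, ["prompts.json", "judge.json"])
```

A row that records this value can later be checked against the bundle on disk: if the hash no longer matches, the recorded result is not the one that was scored.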

recorded_usage ¶

recorded_usage(
    bundle: Path, *, model: str, n_votes: int
) -> dict[str, int]

Sum model's recorded generation usage in bundle, plus the implied judge-call count.

Judge usage is not recorded, so judge_calls is derived as generations times the vote count — the free-tier transparency the board publishes (decision D16), not a measured token total.

score_entry ¶

score_entry(
    model: str,
    *,
    public_bundle: str | Path,
    tool_version: str,
    held_out_bundle: str | Path | None = None,
    lift_bundle: str | Path | None = None,
    provider: str = "recorded",
    kind: str = "seed",
    n_votes: int = 3,
    judge_model: str = JUDGE_MODEL,
    submitted_by: str = "seed",
    verified: bool = True,
) -> BoardRow

Score one provider:model into a :class:BoardRow through the shipped recorded scorer.

The public number comes from replaying public_bundle (deterministic, quota-free, D7); the held-out number, when held_out_bundle is given, from replaying it the same way (gated CI only — the bundle is never committed); the close-the-loop cell, when lift_bundle is given, from the recorded lift replay. run_recorded raises :class:ProviderError if model is missing from a bundle — the caller decides whether to skip the seed or fail.

build_leaderboard ¶

build_leaderboard(
    rows: Iterable[BoardRow],
    *,
    eval_protocol_version: str,
    generated_at: str,
    judge_model: str,
    public_set: dict[str, Any],
    held_out_set: dict[str, Any],
) -> dict[str, Any]

Assemble a ranked leaderboard.json payload — ranked on the held-out headline (D16).

generated_at is injected (never read from the clock) so the payload is byte-deterministic. The public anchor is shown alongside the headline; a pending held-out sorts last.
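
The ranking rule (held-out headline first, pending rows last) can be illustrated with plain dictionaries. The row shape and the tie-break on the model name are assumptions for this sketch; the real function ranks `BoardRow` objects.

```python
from __future__ import annotations


def overfit_gap(public: float, held_out: float | None) -> float | None:
    # None until the held-out has been scored.
    return None if held_out is None else public - held_out


def rank_rows(rows: list[dict]) -> list[dict]:
    # Scored rows come first (held-out descending); pending rows sort last.
    # Breaking ties on the model name is an assumption made for determinism.
    return sorted(
        rows,
        key=lambda r: (r["held_out"] is None, -(r["held_out"] or 0.0), r["model"]),
    )


rows = [
    {"model": "a", "public": 0.90, "held_out": None},
    {"model": "b", "public": 0.70, "held_out": 0.60},
    {"model": "c", "public": 0.80, "held_out": 0.75},
]
ranked = [r["model"] for r in rank_rows(rows)]
```

Model `a` has the best public score but no held-out result yet, so it ranks below both scored rows; a high public number alone buys no rank.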

dumps ¶

dumps(board: dict[str, Any]) -> str

Canonical board bytes: sorted-key, 2-space JSON with a trailing newline (stable diffs).
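
The canonical form described above maps directly onto the standard `json` module. This is a sketch under the hypothetical name `dumps_sketch`; details the description does not state, such as `ensure_ascii`, are left at the library defaults here and may differ in the package.

```python
import json
from typing import Any


def dumps_sketch(board: dict[str, Any]) -> str:
    # Sorted keys and a fixed indent make the output independent of insertion order.
    return json.dumps(board, sort_keys=True, indent=2) + "\n"


one = dumps_sketch({"b": 1, "a": 2})
two = dumps_sketch({"a": 2, "b": 1})
```

Both calls produce the same bytes, so a regenerated board only diffs where a value actually changed.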

assert_aggregate_only ¶

assert_aggregate_only(board: dict[str, Any]) -> None

Raise :class:LeaderboardError if a row's public/held-out cell carries a non-aggregate key.

The firewall the published board must satisfy (decision D16): a board carries pass-rates and metadata only, never a prompt, an intent, or a judge verdict — so a held-out gold cannot leak through it. A row exposing an unexpected key in an aggregate cell is rejected.
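
A structural check of this kind amounts to an allowlist over each cell's keys. The sketch below is illustrative only: the `rows` payload key, the `ALLOWED_CELL_KEYS` set, and the `BoardIntegrityError` class are hypothetical, since this reference does not list the real board schema.

```python
from __future__ import annotations

from typing import Any

# Hypothetical allowlist; the real set of aggregate keys is not given in this reference.
ALLOWED_CELL_KEYS = frozenset({"pass_rate", "by_tier", "n_prompts"})


class BoardIntegrityError(Exception):
    """Stand-in for LeaderboardError in this sketch."""


def check_aggregate_only(board: dict[str, Any]) -> None:
    for row in board.get("rows", []):  # "rows" is an assumed payload key
        for cell_name in ("public", "held_out"):
            cell = row.get(cell_name)
            if not isinstance(cell, dict):
                continue  # a pending held-out cell carries nothing to check
            unexpected = sorted(set(cell) - ALLOWED_CELL_KEYS)
            if unexpected:
                raise BoardIntegrityError(
                    f"{cell_name} cell carries non-aggregate keys: {unexpected}"
                )


clean_board = {"rows": [{"public": {"pass_rate": 0.8, "by_tier": {"easy": 1.0}}, "held_out": None}]}
leaky_board = {"rows": [{"public": {"pass_rate": 0.8, "intent": "raise apoapsis"}, "held_out": None}]}
check_aggregate_only(clean_board)  # passes silently
```

An allowlist rejects anything unexpected by default, so a newly added field cannot reach the published board without being reviewed first.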

assert_no_leak ¶

assert_no_leak(
    serialized: str, secrets: Iterable[str]
) -> None

Raise if any held-out secret string appears in the serialized board (a hard leak check).
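
The content scan is, in essence, a substring search over the serialized bytes. A minimal sketch, assuming a hypothetical `check_no_leak` that raises `ValueError` and skips empty secrets; the exception type and the handling of empty strings in the real function are not stated here.

```python
from collections.abc import Iterable


def check_no_leak(serialized: str, secrets: Iterable[str]) -> None:
    for secret in secrets:
        # Assumption: empty strings are skipped, since "" is a substring of everything.
        if secret and secret in serialized:
            # The message deliberately does not echo the secret itself.
            raise ValueError("held-out text found in the serialized board")


board_json = '{"rows": [{"model": "recorded:demo", "public": {"pass_rate": 0.8}}]}'
held_out_texts = ["Raise apoapsis to 42164 km with a single burn", ""]
check_no_leak(board_json, held_out_texts)  # passes silently
```

Scanning the shipped string complements the key-level check: it catches a gold that slipped into a value under an otherwise permitted key.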

held_out_secrets ¶

held_out_secrets(
    config: dict[str, Any], held_out_root: Path
) -> list[str]

The held-out gold strings the published board must never contain — every held-out prompt's request and intent text, read from the private store under held_out_root.

Fed to :func:assert_no_leak so the firewall scans the bytes that ship, not just their key names (the structural :func:assert_aggregate_only check). Held-out bundles that have not been fetched are skipped, so this returns nothing to scan offline and performs the real content scan only in gated CI.

build_from_config ¶

build_from_config(
    config: dict[str, Any],
    *,
    root: Path,
    generated_at: str,
    tool_version: str,
    held_out_root: Path | None = None,
) -> tuple[dict[str, Any], list[str]]

Build a board from a seeds config, resolving bundle paths under root.

Returns the board payload and a list of human-readable notes for seeds that were skipped (their public bundle was absent or did not carry the model — e.g. a live-only seed run offline). A seed is scored against its held-out bundle only when held_out_root is given and that bundle exists, so offline and in per-merge CI every held-out cell is pending. Held-out golds are never committed: the held-out bundles live under held_out_root (a gitignored cache fetched in gated CI), never in the repo tree.
