Surviving a kill: partial manifest recovery and programmatic resume¶
The durability claim is narrow but load-bearing: the JSON Lines manifest survives a mid-sweep Ctrl-C, the partial DataFrame is loadable from disk, and the CLI inspector reports the partial state. This notebook walks through that exact flow end-to-end:
- Launch `gmat-sweep run` as a subprocess against a small mission.
- Poll the manifest until a few runs have completed.
- Send the subprocess `SIGINT` (the same signal Ctrl-C raises).
- Inspect the partial manifest with `gmat-sweep show`.
- Reload the manifest in-process and aggregate the partial DataFrame from the Parquet files on disk.
- Programmatically resume the sweep — re-run only the missing `run_id`s — via `Sweep.from_manifest(...).resume()` and aggregate the now-complete DataFrame.
Prerequisites. A local GMAT install (R2026a is the primary development target). This notebook does not depend on the [examples] extra (no plots).
Platform note. The SIGINT-via-subprocess approach used below is POSIX-only. On Windows, the same recovery flow works after a real Ctrl-C in an interactive gmat-sweep run session — but reproducing the kill from a notebook cell needs different signalling primitives.
Set up the run¶
Resolve the GMAT install once and confirm the small mission script that ships next to this notebook is where we expect it. The script propagates for 60 seconds with point-mass Earth — every per-run cost is sub-second so the kill-and-inspect demo finishes inside the smoke step's wall-clock budget.
import signal
import subprocess
import tempfile
import time
from pathlib import Path
from gmat_run import locate_gmat
from gmat_sweep import Manifest, lazy_multiindex
install = locate_gmat()
script_path = Path("leo_short.script").resolve()
print(f"GMAT version: {install.version}")
print(f"Script: {script_path.name}")
print(f"Exists: {script_path.exists()}")
GMAT version: R2026a
Script: leo_short.script
Exists: True
Launch the sweep as a subprocess¶
We dispatch 20 runs with --workers 1 so completions are serialised — that makes the polling loop below predictable. Each completed run appends one fsync'd line to manifest.jsonl.
tmpdir = tempfile.TemporaryDirectory(prefix="killed-sweep-")
out_dir = Path(tmpdir.name)
manifest_path = out_dir / "manifest.jsonl"
proc = subprocess.Popen(
[
"gmat-sweep",
"run",
"--grid",
"Sat.SMA=7000:8000:20",
"--workers",
"1",
"--out",
str(out_dir),
str(script_path),
],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
print(f"Spawned gmat-sweep pid={proc.pid}, output dir = {out_dir}")
Spawned gmat-sweep pid=4159, output dir = /tmp/killed-sweep-6r0ph7wt
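The durability claim above hinges on what "one fsync'd line" means. A minimal sketch of that append pattern (illustrative only, not the library's actual code; `append_entry` is a hypothetical helper):

```python
import json
import os

def append_entry(path: str, entry: dict) -> None:
    """Append one JSON line and force it to the device before returning."""
    line = json.dumps(entry, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()             # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())  # drain the OS page cache to the storage device
```

Because the `fsync` completes before the worker moves on, a kill can cost at most the entry being written at that instant; every earlier line is already durable.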
Wait for a few runs to finish, then SIGINT¶
Poll the manifest until at least 5 runs have completed. Each fsync'd append is durable, so a parse mid-poll always sees a consistent prefix.
Once the threshold is reached we send the child SIGINT — the same signal an interactive Ctrl-C would raise — and wait for it to exit. The orchestrator does not catch KeyboardInterrupt, so the subprocess exits with a non-zero code; that is the expected outcome.
TARGET_ENTRIES = 5
DEADLINE_S = 60.0
deadline = time.monotonic() + DEADLINE_S
def _entries_on_disk(path: Path) -> int:
if not path.exists():
return 0
# Header line + one per-run line. Subtract 1 for the header.
return max(0, sum(1 for _ in path.open("r", encoding="utf-8")) - 1)
while time.monotonic() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"gmat-sweep exited before SIGINT could be sent (rc={proc.returncode}); "
f"stderr: {proc.stderr.read().decode(errors='replace') if proc.stderr else ''}"
)
if _entries_on_disk(manifest_path) >= TARGET_ENTRIES:
break
time.sleep(0.1)
else:
proc.kill()
raise RuntimeError(
f"gmat-sweep did not append {TARGET_ENTRIES} manifest entries within {DEADLINE_S} s"
)
n_before_kill = _entries_on_disk(manifest_path)
proc.send_signal(signal.SIGINT)
rc = proc.wait(timeout=30)
n_after_kill = _entries_on_disk(manifest_path)
print(f"Subprocess exited with rc={rc}")
print(f"Manifest entries at kill: {n_before_kill}")
print(f"Manifest entries on disk: {n_after_kill}")
Subprocess exited with rc=-2
Manifest entries at kill: 5
Manifest entries on disk: 5
Inspect the partial manifest with gmat-sweep show¶
The CLI's show subcommand prints a one-line summary: counts per status bucket, total wall-clock duration, output directory. It loads the manifest with the same Manifest.load used in-process, so a torn final line (if the kill landed mid-write) is tolerated — at most one entry is lost.
show = subprocess.run(
["gmat-sweep", "show", str(manifest_path)],
capture_output=True,
text=True,
check=True,
)
print(show.stdout.strip())
5 runs (5 ok) in 1.44 s — output: /tmp/killed-sweep-6r0ph7wt
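The torn-line tolerance mentioned above is simple to state precisely. A sketch of the idea (hypothetical `load_entries` helper; `Manifest.load`'s real logic may differ): parse every line, and treat a decode failure as fatal unless it occurs on the final line, where it means the kill landed mid-write.

```python
import json

def load_entries(path: str) -> list[dict]:
    """Parse a JSON Lines file, tolerating a torn final line."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    entries = []
    for i, line in enumerate(lines):
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            if i == len(lines) - 1:
                break  # torn final line: the kill landed mid-write, drop it
            raise      # corruption anywhere else is a real error
    return entries
```

At most one entry is lost, and only the one that was being written when the process died.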
Reload the manifest in-process¶
Manifest.load returns the full record — header fingerprint plus one ManifestEntry per completed run, in the order they finished.
manifest = Manifest.load(manifest_path)
print(f"Header.run_count (planned): {manifest.run_count}")
print(f"Entries on disk (completed): {len(manifest.entries)}")
print(f"GMAT install version: {manifest.gmat_install_version}")
print(f"Script SHA-256 (canonical): {manifest.script_sha256[:16]}...")
for entry in manifest.entries[:3]:
print(f" run_id={entry.run_id} status={entry.status} overrides={entry.overrides}")
Header.run_count (planned): 20
Entries on disk (completed): 5
GMAT install version: R2026a
Script SHA-256 (canonical): 79ecf7dcc0108e9a...
run_id=0 status=ok overrides={'Sat.SMA': 7000.0}
run_id=1 status=ok overrides={'Sat.SMA': 7052.631578947368}
run_id=2 status=ok overrides={'Sat.SMA': 7105.263157894737}
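The override values printed above are reproducible by hand. Assuming the `start:stop:count` convention that the grid spec `Sat.SMA=7000:8000:20` and the recorded overrides imply (20 evenly spaced values, endpoints inclusive):

```python
start, stop, count = 7000.0, 8000.0, 20

# Endpoints inclusive, so the step is (stop - start) / (count - 1).
step = (stop - start) / (count - 1)
sma_values = [start + i * step for i in range(count)]
```

`sma_values[0]`, `sma_values[1]`, and `sma_values[2]` match the overrides recorded for `run_id` 0, 1, and 2.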
Aggregate the partial DataFrame¶
lazy_multiindex is the same aggregator sweep() uses internally to produce its return value. Pointing it at a partial manifest yields a partial DataFrame — only the runs that completed before the kill are represented, but every one of them is intact.
df = lazy_multiindex(manifest, out_dir)
print(f"Partial DataFrame shape: {df.shape}")
print(f"Distinct run_ids: {df.index.get_level_values('run_id').nunique()}")
df.head()
Partial DataFrame shape: (15, 5)
Distinct run_ids: 5
| run_id | time | Sat.UTCGregorian | Sat.X | Sat.Y | Sat.Z | __status |
|---|---|---|---|---|---|---|
| 0 | 2026-01-01 00:00:00 | 2026-01-01 00:00:00 | 6993.000000 | 0.000000 | 0.000000 | ok |
| | 2026-01-01 00:00:30 | 2026-01-01 00:00:30 | 6989.332373 | 199.112254 | 108.109133 | ok |
| | 2026-01-01 00:01:00 | 2026-01-01 00:01:00 | 6978.333351 | 398.015650 | 216.104866 | ok |
| 1 | 2026-01-01 00:00:00 | 2026-01-01 00:00:00 | 7045.578947 | 0.000000 | 0.000000 | ok |
| | 2026-01-01 00:00:30 | 2026-01-01 00:00:30 | 7041.965850 | 198.368677 | 107.705404 | ok |
Resume the sweep¶
Sweep.from_manifest rebuilds the run iterable from the manifest's recorded parameter_spec and re-validates the on-disk script's canonical SHA-256 against the manifest header. Sweep.resume submits only the union of find_failed() and find_missing(...) — the runs that never produced an ok entry — and appends new manifest entries with the same run_ids. Manifest.load folds the duplicates last-wins, so the final aggregated DataFrame carries one row per (run_id, time) exactly like a fresh sweep would.
The pool's lifecycle is the caller's: a with LocalJoblibPool(...) block ensures workers are torn down whether resume succeeds or raises.
from gmat_sweep import Sweep
from gmat_sweep.backends.joblib import LocalJoblibPool
with LocalJoblibPool(workers=1) as pool:
resumed_df = (
Sweep.from_manifest(
manifest_path,
script_path,
backend=pool,
progress=False,
)
.resume()
.to_dataframe()
)
resumed_manifest = Manifest.load(manifest_path)
print(f"Manifest entries after resume: {len(resumed_manifest.entries)}")
print(f"Distinct run_ids in DataFrame: {resumed_df.index.get_level_values('run_id').nunique()}")
print("Status counts:")
print(resumed_df["__status"].value_counts())
Manifest entries after resume: 20
Distinct run_ids in DataFrame: 20
Status counts:
__status
ok    60
Name: count, dtype: int64
Why the resumed DataFrame is identical to a never-killed sweep¶
- The manifest is append-only with `fsync` after every entry, so the original five `ok` rows are still on disk byte-for-byte and their Parquet files are reused as-is.
- `Sweep.from_manifest` re-derives the run iterable from the manifest's `parameter_spec` — a grid sweep rebuilds the same cartesian product; a Monte Carlo or Latin hypercube rebuilds bit-equal draws because the expanders are deterministic in `(perturb, n, seed)`.
- The canonical script-hash check refuses to mix outputs from two different scripts; the `allow_script_drift` escape hatch is available when the change is known to be benign.
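The last-wins fold is what makes appending duplicate `run_id`s safe. A sketch of the semantics (illustrative; `fold_last_wins` is a hypothetical stand-in for what `Manifest.load` does internally):

```python
def fold_last_wins(entries: list[dict]) -> list[dict]:
    """Collapse duplicate run_ids; the entry appended last wins."""
    by_id: dict[int, dict] = {}
    for entry in entries:  # file order is append order
        by_id[entry["run_id"]] = entry
    return [by_id[run_id] for run_id in sorted(by_id)]

history = [
    {"run_id": 0, "status": "ok"},      # survived the kill
    {"run_id": 1, "status": "failed"},  # pre-kill failure
    {"run_id": 1, "status": "ok"},      # appended by resume
]
```

After folding, `run_id` 1 carries only its resumed `ok` entry, which is why the aggregated DataFrame ends up with exactly one row per `(run_id, time)`.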
Where to next¶
- Resume reference. Resume documents the last-wins entry semantics, the script-drift gate, and the single-machine limitation. The same flow is also available from the shell as `gmat-sweep resume`.
- Manifest schema. Manifest schema documents every field the JSON Lines records carry, including the canonical script-hash convention the resume gate relies on.
- CLI reference. The `gmat-sweep run` and `gmat-sweep show` flags.