Surviving a kill: partial manifest recovery and programmatic resume¶
The durability claim is narrow but load-bearing: the JSON Lines manifest survives a mid-sweep Ctrl-C, the partial DataFrame is loadable from disk, and the CLI inspector reports the partial state. This notebook walks through that exact flow end-to-end:
- Launch `gmat-sweep run` as a subprocess against a small mission.
- Poll the manifest until a few runs have completed.
- Send the subprocess `SIGINT` (the same signal Ctrl-C raises).
- Inspect the partial manifest with `gmat-sweep show`.
- Reload the manifest in-process and aggregate the partial DataFrame from the Parquet files on disk.
- Programmatically resume the sweep — re-run only the missing `run_id`s — via `Sweep.from_manifest(...).resume()` and aggregate the now-complete DataFrame.
Prerequisites. A local GMAT install (R2026a is the primary development target). This notebook does not depend on the [examples] extra (no plots).
Platform note. The SIGINT-via-subprocess approach used below is POSIX-only. On Windows, the same recovery flow works after a real Ctrl-C in an interactive gmat-sweep run session — but reproducing the kill from a notebook cell needs different signalling primitives.
Set up the run¶
Resolve the GMAT install once and confirm the small mission script that ships next to this notebook is where we expect it. The script propagates for 60 seconds with point-mass Earth — every per-run cost is sub-second so the kill-and-inspect demo finishes inside the smoke step's wall-clock budget.
import signal
import subprocess
import tempfile
import time
from pathlib import Path
from gmat_run import locate_gmat
from gmat_sweep import Manifest, lazy_multiindex
install = locate_gmat()
script_path = Path("leo_short.script").resolve()
print(f"GMAT version: {install.version}")
print(f"Script: {script_path.name}")
print(f"Exists: {script_path.exists()}")
GMAT version: R2026a
Script: leo_short.script
Exists: True
Launch the sweep as a subprocess¶
We dispatch 20 runs with --workers 1 so completions are serialised — that makes the polling loop below predictable. Each completed run appends one fsync'd line to manifest.jsonl.
tmpdir = tempfile.TemporaryDirectory(prefix="killed-sweep-")
out_dir = Path(tmpdir.name)
manifest_path = out_dir / "manifest.jsonl"
proc = subprocess.Popen(
[
"gmat-sweep",
"run",
"--grid",
"Sat.SMA=7000:8000:20",
"--workers",
"1",
"--out",
str(out_dir),
str(script_path),
],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
print(f"Spawned gmat-sweep pid={proc.pid}, output dir = {out_dir}")
Spawned gmat-sweep pid=4159, output dir = /tmp/killed-sweep-6r0ph7wt
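The durability claim above hinges on what "one fsync'd line" means. A minimal sketch of that append pattern (illustrative only, not the library's actual code; `append_entry` is a hypothetical helper):

```python
import json
import os

def append_entry(path: str, entry: dict) -> None:
    """Append one JSON line and force it to the device before returning."""
    line = json.dumps(entry, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()             # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())  # drain the OS page cache to the storage device
```

Because the `fsync` completes before the worker moves on, a kill can cost at most the entry being written at that instant; every earlier line is already durable.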
Wait for a few runs to finish, then SIGINT¶
Poll the manifest until at least 5 runs have completed. Each fsync'd append is durable, so a parse mid-poll always sees a consistent prefix.
Once the threshold is reached we send the child SIGINT — the same signal an interactive Ctrl-C would raise — and wait for it to exit. The orchestrator does not catch KeyboardInterrupt, so the subprocess exits with a non-zero code; that is the expected outcome.
TARGET_ENTRIES = 5
DEADLINE_S = 60.0
deadline = time.monotonic() + DEADLINE_S
def _entries_on_disk(path: Path) -> int:
if not path.exists():
return 0
# Header line + one per-run line. Subtract 1 for the header.
return max(0, sum(1 for _ in path.open("r", encoding="utf-8")) - 1)
while time.monotonic() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"gmat-sweep exited before SIGINT could be sent (rc={proc.returncode}); "
f"stderr: {proc.stderr.read().decode(errors='replace') if proc.stderr else ''}"
)
if _entries_on_disk(manifest_path) >= TARGET_ENTRIES:
break
time.sleep(0.1)
else:
proc.kill()
raise RuntimeError(
f"gmat-sweep did not append {TARGET_ENTRIES} manifest entries within {DEADLINE_S} s"
)
n_before_kill = _entries_on_disk(manifest_path)
proc.send_signal(signal.SIGINT)
rc = proc.wait(timeout=30)
n_after_kill = _entries_on_disk(manifest_path)
print(f"Subprocess exited with rc={rc}")
print(f"Manifest entries at kill: {n_before_kill}")
print(f"Manifest entries on disk: {n_after_kill}")
Subprocess exited with rc=-2
Manifest entries at kill: 5
Manifest entries on disk: 5
Inspect the partial manifest with gmat-sweep show¶
The CLI's show subcommand prints a one-line summary: counts per status bucket, total wall-clock duration, output directory. It loads the manifest with the same Manifest.load used in-process, so a torn final line (if the kill landed mid-write) is tolerated — at most one entry is lost.
show = subprocess.run(
["gmat-sweep", "show", str(manifest_path)],
capture_output=True,
text=True,
check=True,
)
print(show.stdout.strip())
5 runs (5 ok) in 1.44 s — output: /tmp/killed-sweep-6r0ph7wt
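The torn-line tolerance mentioned above is simple to state precisely. A sketch of the idea (hypothetical `load_entries` helper; `Manifest.load`'s real logic may differ): parse every line, and treat a decode failure as fatal unless it occurs on the final line, where it means the kill landed mid-write.

```python
import json

def load_entries(path: str) -> list[dict]:
    """Parse a JSON Lines file, tolerating a torn final line."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    entries = []
    for i, line in enumerate(lines):
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            if i == len(lines) - 1:
                break  # torn final line: the kill landed mid-write, drop it
            raise      # corruption anywhere else is a real error
    return entries
```

At most one entry is lost, and only the one that was being written when the process died.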
Reload the manifest in-process¶
Manifest.load returns the full record — header fingerprint plus one ManifestEntry per completed run, in the order they finished.
manifest = Manifest.load(manifest_path)
print(f"Header.run_count (planned): {manifest.run_count}")
print(f"Entries on disk (completed): {len(manifest.entries)}")
print(f"GMAT install version: {manifest.gmat_install_version}")
print(f"Script SHA-256 (canonical): {manifest.script_sha256[:16]}...")
for entry in manifest.entries[:3]:
print(f" run_id={entry.run_id} status={entry.status} overrides={entry.overrides}")
Header.run_count (planned): 20
Entries on disk (completed): 5
GMAT install version: R2026a
Script SHA-256 (canonical): 79ecf7dcc0108e9a...
run_id=0 status=ok overrides={'Sat.SMA': 7000.0}
run_id=1 status=ok overrides={'Sat.SMA': 7052.631578947368}
run_id=2 status=ok overrides={'Sat.SMA': 7105.263157894737}
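The override values printed above are reproducible by hand. Assuming the `start:stop:count` convention that the grid spec `Sat.SMA=7000:8000:20` and the recorded overrides imply (20 evenly spaced values, endpoints inclusive):

```python
start, stop, count = 7000.0, 8000.0, 20

# Endpoints inclusive, so the step is (stop - start) / (count - 1).
step = (stop - start) / (count - 1)
sma_values = [start + i * step for i in range(count)]
```

`sma_values[0]`, `sma_values[1]`, and `sma_values[2]` match the overrides recorded for `run_id` 0, 1, and 2.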
Aggregate the partial DataFrame¶
lazy_multiindex is the same aggregator sweep() uses internally to produce its return value. Pointing it at a partial manifest yields a partial DataFrame — only the runs that completed before the kill are represented, but every one of them is intact.
df = lazy_multiindex(manifest, out_dir)
print(f"Partial DataFrame shape: {df.shape}")
print(f"Distinct run_ids: {df.index.get_level_values('run_id').nunique()}")
df.head()
Partial DataFrame shape: (15, 5)
Distinct run_ids: 5
| run_id | time | Sat.UTCGregorian | Sat.X | Sat.Y | Sat.Z | __status |
|---|---|---|---|---|---|---|
| 0 | 2026-01-01 00:00:00 | 2026-01-01 00:00:00 | 6993.000000 | 0.000000 | 0.000000 | ok |
| | 2026-01-01 00:00:30 | 2026-01-01 00:00:30 | 6989.332373 | 199.112254 | 108.109133 | ok |
| | 2026-01-01 00:01:00 | 2026-01-01 00:01:00 | 6978.333351 | 398.015650 | 216.104866 | ok |
| 1 | 2026-01-01 00:00:00 | 2026-01-01 00:00:00 | 7045.578947 | 0.000000 | 0.000000 | ok |
| | 2026-01-01 00:00:30 | 2026-01-01 00:00:30 | 7041.965850 | 198.368677 | 107.705404 | ok |
Resume the sweep¶
Sweep.from_manifest rebuilds the run iterable from the manifest's recorded parameter_spec and re-validates the on-disk script's canonical SHA-256 against the manifest header. Sweep.resume submits only the union of find_failed() and find_missing(...) — the runs that never produced an ok entry — and appends new manifest entries with the same run_ids. Manifest.load folds the duplicates last-wins, so the final aggregated DataFrame carries one row per (run_id, time) exactly like a fresh sweep would.
The pool's lifecycle is the caller's: a with LocalJoblibPool(...) block ensures workers are torn down whether resume succeeds or raises.
from gmat_sweep import Sweep
from gmat_sweep.backends.joblib import LocalJoblibPool
with LocalJoblibPool(workers=1) as pool:
resumed_df = (
Sweep.from_manifest(
manifest_path,
script_path,
backend=pool,
progress=False,
)
.resume()
.to_dataframe()
)
resumed_manifest = Manifest.load(manifest_path)
print(f"Manifest entries after resume: {len(resumed_manifest.entries)}")
print(f"Distinct run_ids in DataFrame: {resumed_df.index.get_level_values('run_id').nunique()}")
print("Status counts:")
print(resumed_df["__status"].value_counts())
Manifest entries after resume: 20
Distinct run_ids in DataFrame: 20
Status counts:
__status
ok    60
Name: count, dtype: int64
Why the resumed DataFrame is identical to a never-killed sweep¶
- The manifest is append-only with `fsync` after every entry, so the original five `ok` rows are still on disk byte-for-byte and their Parquet files are reused as-is.
- `Sweep.from_manifest` re-derives the run iterable from the manifest's `parameter_spec` — a grid sweep rebuilds the same cartesian product; a Monte Carlo or Latin hypercube rebuilds bit-equal draws because the expanders are deterministic in `(perturb, n, seed)`.
- The canonical script-hash check refuses to mix outputs from two different scripts; the `allow_script_drift` escape hatch is available when the change is known to be benign.
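The last-wins fold is what makes appending duplicate `run_id`s safe. A sketch of the semantics (illustrative; `fold_last_wins` is a hypothetical stand-in for what `Manifest.load` does internally):

```python
def fold_last_wins(entries: list[dict]) -> list[dict]:
    """Collapse duplicate run_ids; the entry appended last wins."""
    by_id: dict[int, dict] = {}
    for entry in entries:  # file order is append order
        by_id[entry["run_id"]] = entry
    return [by_id[run_id] for run_id in sorted(by_id)]

history = [
    {"run_id": 0, "status": "ok"},      # survived the kill
    {"run_id": 1, "status": "failed"},  # pre-kill failure
    {"run_id": 1, "status": "ok"},      # appended by resume
]
```

After folding, `run_id` 1 carries only its resumed `ok` entry, which is why the aggregated DataFrame ends up with exactly one row per `(run_id, time)`.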
Where to next¶
- Resume reference. Resume documents the last-wins entry semantics, the script-drift gate, and the single-machine limitation. The same flow is also available from the shell as `gmat-sweep resume`.
- Manifest schema. Manifest schema documents every field the JSON Lines records carry, including the canonical script-hash convention the resume gate relies on.
- CLI reference. The `gmat-sweep run` and `gmat-sweep show` flags.