Archive bundle — pack a sweep for handoff¶
A finished sweep is the script, the JSON Lines manifest, and the per-run Parquet outputs. Sweep.archive() packs all three into one .zip — suitable for archival deposit (Zenodo, JOSS supplementary material) or internal handoff. The bundle is self-describing: paths inside the zip are rewritten to bundle-relative form, a MANIFEST.hash carries SHA-256 of every member, and a README.md documents the layout.
This notebook walks through producing a bundle, inspecting its contents, and re-aggregating the per-run DataFrame from the unzipped tree without a re-run.
Prerequisites. A local GMAT install (R2026a is the primary development target; see Supported versions). This notebook does not depend on the [examples] extra (no plots).
Set up the run¶
A small Sat.SMA grid against the leo_short.script fixture — sub-second per run, ten runs total. The bundle is small enough to inspect line-by-line below.
import tempfile
import zipfile
from pathlib import Path
from gmat_run import locate_gmat
from gmat_sweep import LocalJoblibPool, Manifest, Sweep, lazy_multiindex, sweep
install = locate_gmat()
script_path = Path("leo_short.script").resolve()
print(f"GMAT version: {install.version}")
print(f"Script: {script_path.name}")
print(f"Exists: {script_path.exists()}")
GMAT version: R2026a
Script: leo_short.script
Exists: True
Run the sweep¶
Pass out= so the per-run Parquet files and the JSON Lines manifest survive past the call — the bundle reads from that directory. The default out=None would tie everything to the DataFrame's lifetime, so the archive call would have nothing to read.
tmpdir = tempfile.TemporaryDirectory(prefix="archive-bundle-")
out_dir = Path(tmpdir.name)
df = sweep(
    script_path,
    grid={"Sat.SMA": [7000, 7100, 7200, 7300, 7400, 7500, 7600, 7700, 7800, 7900]},
    out=out_dir,
    progress=False,
)
df["__status"].value_counts()
__status
ok    30
Name: count, dtype: int64
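Every run came back `ok` here, but a larger sweep can fail partway. A minimal sketch for isolating the failed grid points, using a synthetic frame since this sweep succeeded everywhere; the `"error"` status label is an assumption, only `"ok"` is confirmed by the output above:

```python
import pandas as pd

# Synthetic stand-in for a sweep frame where two runs failed.
# Real sweep frames carry a "__status" column with "ok" on success;
# the "error" label below is illustrative, not the library's exact value.
frame = pd.DataFrame(
    {
        "__status": ["ok", "ok", "error", "ok", "error"],
        "Sat.SMA": [7000, 7100, 7200, 7300, 7400],
    }
)

# Keep only the rows whose status is anything other than "ok".
failed = frame[frame["__status"] != "ok"]
print(failed["Sat.SMA"].tolist())  # the grid points worth re-running
```

The same filter works on the real `df` from the cell above, where it returns an empty frame.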
Pack the bundle¶
Sweep.from_manifest reconstructs a Sweep object from the manifest on disk; Sweep.archive writes a .zip next to it. include_logs=False (the default) drops the per-run worker.log files to keep the bundle small; flip to True for a forensic-grade bundle that retains every log.
with LocalJoblibPool(max_workers=1) as pool:
    sweep_obj = Sweep.from_manifest(out_dir / "manifest.jsonl", script_path, backend=pool)
    bundle_path = sweep_obj.archive(out=out_dir / "sweep_bundle.zip")
print(f"Bundle: {bundle_path.name}")
print(f"Bundle size: {bundle_path.stat().st_size:,} bytes")
Bundle: sweep_bundle.zip
Bundle size: 20,497 bytes
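Before handing a bundle off, a cheap sanity check is the standard library's `ZipFile.testzip()`, which re-reads every member and returns the name of the first one with a bad CRC (or `None` when all members check out). A sketch against a tiny in-memory zip standing in for `sweep_bundle.zip`:

```python
import io
import zipfile

# Build a tiny in-memory zip as a stand-in for sweep_bundle.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("README.md", "layout notes")
    zf.writestr("manifest.jsonl", "{}\n")

# testzip() re-reads every member; None means no CRC mismatches.
with zipfile.ZipFile(buf) as zf:
    bad = zf.testzip()

print("bundle ok" if bad is None else f"corrupt member: {bad}")
```

This only checks the zip's own CRCs; the stronger per-member SHA-256 check comes from `MANIFEST.hash`, covered in the next section.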
Inspect the layout¶
The bundle's top-level entries are deliberately stable so a downstream consumer can find the script, the manifest, and the per-run Parquet files without negotiating a per-bundle convention. The accompanying README.md documents the layout; MANIFEST.hash carries SHA-256 of every member so a corrupted member is detectable on extract.
with zipfile.ZipFile(bundle_path) as zf:
    members = sorted(zf.namelist())
print(f"Total members: {len(members)}")
for member in members[:12]:
    print(f"  {member}")
if len(members) > 12:
    print(f"  ... ({len(members) - 12} more)")
Total members: 14
  MANIFEST.hash
  README.md
  manifest.jsonl
  runs/run-0/report__RF.parquet
  runs/run-1/report__RF.parquet
  runs/run-2/report__RF.parquet
  runs/run-3/report__RF.parquet
  runs/run-4/report__RF.parquet
  runs/run-5/report__RF.parquet
  runs/run-6/report__RF.parquet
  runs/run-7/report__RF.parquet
  runs/run-8/report__RF.parquet
  ... (2 more)
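A consumer can use `MANIFEST.hash` to verify the bundle without trusting the zip's CRCs alone. The sketch below assumes a `sha256sum`-style line format (`<hex digest>  <member path>`); the real bundler's exact format may differ, but the mechanics (recompute each member's SHA-256, compare to the recorded digest) carry over. It builds a toy bundle in memory rather than reading `sweep_bundle.zip`:

```python
import hashlib
import io
import zipfile

# Toy bundle with a sha256sum-style MANIFEST.hash; the real
# bundler's line format is an assumption here.
members = {"README.md": b"layout notes", "manifest.jsonl": b"{}\n"}
manifest_lines = [
    f"{hashlib.sha256(data).hexdigest()}  {name}"
    for name, data in members.items()
]

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name, data in members.items():
        zf.writestr(name, data)
    zf.writestr("MANIFEST.hash", "\n".join(manifest_lines))

# Verify: recompute each member's SHA-256 and compare to the record.
verified = 0
with zipfile.ZipFile(buf) as zf:
    for line in zf.read("MANIFEST.hash").decode().splitlines():
        digest, name = line.split(maxsplit=1)
        assert hashlib.sha256(zf.read(name)).hexdigest() == digest
        verified += 1

print(f"verified {verified} members")
```

A mismatch raises immediately, which is exactly the behavior you want before trusting an extracted tree.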
Re-aggregate from the unzipped bundle¶
Extract the bundle to a fresh directory, reload the bundled manifest, and pass it back through lazy_multiindex to rebuild the per-run DataFrame without re-running anything. The aggregated frame is bit-equal to df — every per-run Parquet is copied verbatim into the bundle.
extract_dir = Path(tmpdir.name) / "extracted"
extract_dir.mkdir(exist_ok=True)
with zipfile.ZipFile(bundle_path) as zf:
    zf.extractall(extract_dir)
bundled_manifest = Manifest.load(extract_dir / "manifest.jsonl")
df_from_bundle = lazy_multiindex(bundled_manifest, extract_dir)
print(f"Original frame: {df.shape}")
print(f"Bundle frame: {df_from_bundle.shape}")
original_ids = sorted(df.index.unique("run_id").tolist())
bundle_ids = sorted(df_from_bundle.index.unique("run_id").tolist())
print(f"Run IDs match: {original_ids == bundle_ids}")
Original frame: (30, 5)
Bundle frame: (30, 5)
Run IDs match: True
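Matching shapes and run IDs are necessary but not sufficient for the bit-equality claim above. `pandas.testing.assert_frame_equal` gives a strict check covering values, dtypes, and index structure. A sketch with synthetic stand-ins for the original and bundle-rebuilt frames (GMAT isn't re-run here); on the real notebook variables the call would be `assert_frame_equal(df, df_from_bundle)`:

```python
import pandas as pd
import pandas.testing as pdt

# Synthetic stand-ins for the original and bundle-rebuilt frames,
# with a run_id-level MultiIndex like the sweep frames use.
idx = pd.MultiIndex.from_product(
    [["run-0", "run-1"], [0, 1, 2]], names=["run_id", "row"]
)
original = pd.DataFrame({"x": range(6)}, index=idx)
rebuilt = pd.DataFrame({"x": range(6)}, index=idx)

# Strict comparison: values, dtypes, and index must all match,
# raising AssertionError on the first difference.
pdt.assert_frame_equal(original, rebuilt)
print("frames identical")
```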
Where to next¶
- Manifest schema. Manifest schema documents every field the JSON Lines records carry, including the `output_dir` rewrite rule the bundler applies.
- Resume from a partial sweep. Notebook 03 walks through partial-manifest recovery, using the same `Sweep.from_manifest` entry point used here.
- Bundling with logs for forensics. Pass `include_logs=True` to retain every per-run `worker.log`. The bundled manifest's `log_path` keeps pointing at the bundled file, so a downstream investigator can correlate run outcomes with their log lines without re-running anything.