Design decisions (D1–D17)¶
The frozen decision record. It consolidates the prerequisite analysis (the dataset-redistribution, label-source, detectability-floor, Δv-inversion, irregular-sampling, foundation-model, and leaderboard-integrity studies) and the project charter into the decisions that fix the shape of the dataset, the benchmark, and the library contract. D1–D10 fix the v0.1 surface; D11–D13 the v0.2 additions — the learned baselines, the public leaderboard, and the dataset-growth pass; D14–D17 the v0.3 additions — the foundation-model baseline, the expanded dataset with the new IGSO class, the hidden-label competition design, and the score-protocol and calibration bump. Each decision states the call and its rationale; the implementable benchmark contract is on the benchmark protocol page, and the output schema and Δv inversion on the schema reference page.
The record is frozen by release — each release's surface matches it, and any change is a version bump with a documented rationale.
D1 — Package layout + stack¶
Package maneuver_detect, Python ≥ 3.10, Hatchling, py.typed:
- the public surface —
detect, thedatasetsaccessor, and the canonical maneuver type; data/— the catalogue fetchers (CelesTrak, Space-Track), elset cleaning, and series assembly;labels/— one module per label source plus the epoch-to-gap labeller;features/— mean-element feature engineering;detectors/— the classical reference detector in v0.1; learned detectors arrive in later releases;benchmark/— splits, the matching rule, the metrics, and the scorer;physics.py— the Δv inversion;cli.py— themaneuver-detectcommand line.
Stack: numpy / pandas; sgp4; astropy (internal only); and the PyTorch / Lightning modelling stack and the
Hugging Face Hub / datasets libraries, on which the learned baselines and the Hub-distributed artifacts
build. The time-series foundation-model stack is an optional [foundation] extra so the base install stays
light. Dev tooling: pytest, pytest-cov, ruff, mypy, mkdocs-material, mkdocstrings.
D2 — Dataset distribution — recipe-first hybrid¶
Publish operator labels, a pinned reconstruction recipe (the fetch code, the NORAD catalogue, the per-object date ranges and query parameters, and a per-series SHA-256 content-hash manifest), and directly-shipped data only where it is openly licensed. Raw Space-Track data — and analysis derived from it — is never redistributed: the Space-Track User Agreement reaches derived analysis, and while TLEs are public-domain U.S.-Government works, the terms of use bind redistribution. Multi-year training history therefore comes from Space-Track via the recipe, with each user reconstructing it from their own account. Reconstruction is byte-deterministic. A recipe entry's epoch window scopes both the series fetch and that object's maneuver labels, so the committed label set is a function of the whole recipe, not the full announced history.
D3 — Label sources + class scope¶
Sources: DORIS/IDS maneuver files (LEO altimetry, with Δv — the Δv-labelled core), ILRS maneuver
history (which links to the same DORIS/IDS files), and GPS NANUs of the FCSTDV type (MEO,
U.S.-Government public domain, epoch-only; SVN/PRN resolved to NORAD via the CelesTrak crosswalk). The
Shorten benchmark is a development-only cross-check (unlicensed, not redistributed). SpotGEO is rejected
(optical object detection, not maneuver labels). Class scope: LEO (primary, Δv-labelled) + MEO
(epoch-only) + GEO (best-effort, epoch-only); HEO is deferred — sparse, a later addition.
D4 — Labelling granularity + matching tolerance + detectability floor¶
A label is the inter-elset gap that brackets a maneuver epoch (per-gap, not per-second), because a maneuver is observable only as a discontinuity between consecutive elsets. The detection-matching tolerance is the labelled gap plus one adjacent gap on each side (≈ ±2 days), set by the ~1-day median TLE cadence. The detectability floor (in-track Δv) is ~cm/s in LEO and ~0.05–0.15 m/s in GEO, calibrated per object/class because it is TLE-quality-dependent; MEO is analytical and HEO not applicable. The false-positive unit is false alarms per satellite-year, with a primary operating point of 1 FA/sat-year, reported as a curve. Detection must be multi-element — mean motion alone catches only a small fraction of maneuvers — and the benchmark scores precision/recall at a fixed false-alarm rate over the above-floor population.
D5 — Δv inversion + type rule + tolerance¶
The inversion reads a Δv back out of the mean-element step across the gap: vis-viva for the in-track
component (from the semi-major-axis change), the Gauss relations for the cross-track component (from the
inclination and node change), and the residual eccentricity change for the weakly-observable radial
component. The reported magnitude is |Δv| and the type is the dominant component — in-track ↔ Δa,
cross-track ↔ Δi / ΔΩ, radial ↔ Δe. Element steps are detrended first to remove secular drift (notably
the J2 nodal regression), which would otherwise read as a large spurious cross-track Δv. The implementation
uses the full Gauss variational equations, not the linearised circular form. Δv is reported only above
the floor, to within about ±25%; radial-dominated maneuvers are reported low-confidence.
D6 — Canonical output schema + Detector interface¶
The canonical maneuver record (and the DataFrame detect returns) is epoch (UTC), confidence
(calibrated, [0, 1]), type (in-track / cross-track / radial), delta_v_estimate (m/s), plus the
provenance norad_id, elset_epoch_before, elset_epoch_after (the bounding epochs of the
inter-elset gap). This is frozen as the library contract. A Detector interface returns that schema,
detect(history, model=...) dispatches by name, and the datasets accessor exposes
tle_history(...). v0.1 ships the classical detector behind detect(); learned models arrive later behind
the same interface.
D7 — Benchmark protocol¶
Per the benchmark protocol: leak-free splits by satellite and time (no satellite and no overlapping time window shared across train/val/test), seeded and byte-stable; the matching rule (the D4 tolerance); the metric — precision/recall at a fixed false-alarm rate per class over the above-floor population, plus per-class type confusion; and a deterministic scorer. The protocol, the splits, and the schema are frozen by release.
D8 — Reproducibility / versioning¶
Seeded, byte-stable splits; a pinned, reconstructable dataset with a content-hash manifest (D2); the dataset and checkpoints versioned in lockstep, each checkpoint carrying a model card (training data, splits, metrics, intended use); and a deterministic scorer that reproduces the reported baseline numbers from committed prediction files.
D9 — Licensing¶
Code is MIT (the org convention). Authored dataset artifacts — the label mapping, the splits, the manifests, the recipe, and features derived from open data — are CC-BY-4.0; openly-licensed pass-through data keeps its upstream licence; raw Space-Track data is not redistributed. Model weights are MIT or CC-BY-4.0, with foundation-model fine-tunes inheriting their base licence. All runtime dependencies are permissive. Because the label sources are open or U.S.-Government public domain, the dataset licence is not forced restrictive; the development-only Shorten labels stay out of the distribution.
D10 — Decoupling guarantee¶
GMAT-free — no GMAT dependency at runtime, in tests, or in the build. The foundation-model stack is an
optional [foundation] extra the base install excludes. The deferred charter studies — irregular-sampling
model input, foundation-model applicability, and leaderboard integrity / compute budget — belong to later
milestones, not v0.1.
D11 — Irregular-sampling input encoding¶
The frozen encoding the learned baselines consume from the unevenly-sampled element series:
time-encoded element deltas, no interpolation. Resampling to a fixed grid is rejected — interpolating
across a gap fabricates values on the very interval a maneuver lives in and roughly halves above-floor
recall. Element deltas are fed signed (the non-linear model recovers the burn magnitude itself, and the
sign carries direction for the D5 type classification); timing is carried as a bounded Δt (time2vec)
block. Secular drift (the J2 nodal regression and apsidal precession) is removed by a two-sided local-linear
fit before the delta, angles are carried as the eccentricity vector plus an unwrapped node, and
normalisation is per-class robust (median/IQR) on train-split statistics only. Δt stays in the input
because step-rate and detrending need it; it carries only a modest, structural correlation with the label, so
the benchmark reports a timing-only baseline as the "cheating floor" a submission must beat (D7) and keeps
the headline metric as recall over the above-floor population (D4).
D12 — Leaderboard integrity and compute budget¶
The public leaderboard is a Gradio Space on the free Hugging Face CPU tier — scoring is pure
element-arithmetic (the D4 matching and per-class counts), CPU-cheap and deterministic (D8), so the board
needs no GPU; the only GPU spend in the project is offline baseline training. Submissions are a
predictions.json of canonical maneuver records (the schema), and two surfaces keep it safe:
the response is aggregate-only (per-class above-floor recall at the operating point plus the timing-only
floor, never the per-label match table) and the submission is fixed-schema (the reader rejects any
non-prediction payload, so a submission cannot carry a query). A courtesy rate limit of 5 scored
submissions per user per UTC day guards against floods. Compute budget: both baselines are small
(transformer ≈ 10⁷ parameters, BiLSTM ≈ 1–3 × 10⁶) on an O(10⁵)-window set, so a full v0.2 run — both
baselines, a small sweep, and the finals — is under ~1 GPU-day and trains in hours on a single ≤ 24 GB
GPU, no multi-GPU; a wall-clock and peak-memory acceptance gate is recorded on each checkpoint's model
card.
Amendment — v0.2 ships a reproducibility board, not a hidden-label competition. The integrity design
assumed hidden test labels, but the open dataset (D8/D9) publishes the full answer key:
dataset/v0.2/labels.json commits every label and splits.json marks the test objects, and git history
makes that irretractable. The hidden-label firewall is therefore unbuildable on the v0.2 test set and is
dropped — the public/private-subset split is removed, and the aggregate-only response and the rate limit
are kept as courtesy / abuse guards, not integrity guarantees. The board is a reproducibility /
convenience board on the public splits (see the leaderboard guide); a true hidden-label
competition would need a separate, never-committed forward holdout and is deferred. One D2 consequence: the
scorer's matching windows are real elset epochs (derived Space-Track data, not redistributed), so the board's
scoring fixture is built offline from a credentialed reconstruction and supplied as private deploy-time data,
not committed.
D13 — Expanded label sources for v0.2 dataset growth¶
The v0.2 dataset-growth pass extends the D3 source set, resolving the "no public GEO maneuver-label source" gap and the open question on non-GPS GNSS notices:
- MEO — add Galileo NAGUs (
PLN_MANV): a second, independent MEO operator beyond GPS — public, no-auth, machine-ingestible, GSAT → NORAD via the CelesTrak Galileo crosswalk, epoch-only. The terms are an attribution-required reuse grant (© EU), so the labels are redistribution-clean and shipped. GLONASS is excluded — its terms cap public reproduction more tightly than Space-Track and fail D2. - GEO — self-labelled longitude-shift. No openly-licensed GEO maneuver-label file exists to redistribute, so v0.2 labels GEO best-effort from longitude-shift inspection of the reconstructed series; GEO stays epoch-only. The operator-announced BeiDou feed remains a recipe-first option (the D2 pattern — ship the fetch recipe and parser, never redistributed labels), not shipped data.
- Scope and licence otherwise unchanged (D3 / D9): LEO primary and Δv-labelled, HEO deferred; attribution stacks per source, and no new redistribution restriction attaches.
D14 — Foundation-model baseline + the [foundation] extra¶
v0.3 adds a foundation-model baseline: a forecast-residual detector that replaces the classical detector's
hand-built quiet-dynamics prior with a pretrained time-series model. It forecasts each object's mean-element
series, standardises the forecast residual, and thresholds it per orbit class; the D4 matcher, the D5 Δv
inversion, the D6 schema, and the D7 scorer are all reused unchanged. The shipped backend is Chronos
(amazon/chronos-bolt-small), chosen over TimesFM — which was wired as a second entry but removed when its
zero-shot forecast proved unusable on noisy sub-daily LEO series. Both families publish Apache-2.0 checkpoints,
so a fine-tune is redistributable (D9) and the dependency is permissive; the licence is confirmed per
checkpoint revision at ingest. The detector lives behind the optional [foundation] extra
(= chronos-forecasting) the base install excludes (D1 / D10), imports its backend lazily, and pulls the
checkpoint from the Hub at runtime. One implementation finding fixed the recipe: the residual is standardised
by a robust per-object MAD, not the model's predictive interval, which collapses toward zero on confident gaps.
The v0.3 baseline ships zero-shot — it trains nothing, so it runs on the free CPU/GPU tiers; a light
Chronos fine-tune is the optional polish, not the default.
D15 — Expanded label sources + the IGSO class for v0.3 dataset growth¶
The v0.3 dataset-growth pass extends the D13 source set and adds a new scored orbit class, resolving the v0.2 caveats that the GEO labels were self-derived (circular) and the second MEO operator (Galileo) was thin:
- IGSO + GEO — add QZSS via the OHI files. Japan's Cabinet Office publishes a per-satellite Operational History Information file carrying each Quasi-Zenith satellite's executed orbit-maintenance Δv — the only surveyed operator feed besides DORIS that ships an executed Δv. QZS-2/4/1R are inclined and eccentric — a new IGSO class (magnitude-only Δv, since the IGSO files omit the burn-direction marker); QZS-3/6 are equatorial GEO and carry an operator Δv with a north-south / east-west type marker. Shipped under CC-BY-4.0 ("Source: Quasi-Zenith Satellite System website").
- GEO — add NOAA GOES operator epochs. The NOAA OSPO navigation summary names each GOES bird's last-maneuver day (US-Government public domain, shipped); the history is replayed from its Internet-Archive snapshots. With QZSS, this moves GEO from self-labelled to operator-announced and breaks the v0.2 self-label circularity.
- MEO — Galileo back-catalogue. The same NAGU feed (D13), crawled over the full window, thickens MEO.
- HEO — reserved, no objects in v0.3. No ingestible operator feed exists for the high-eccentricity regime even credentialed, and self-labelling the noisy deep-space TLEs is perturbation-dominated, so HEO ships as a reserved, empty class — IGSO is the v0.3 new scored class.
- Scope / licence (D3 / D9) otherwise unchanged: LEO primary and Δv-labelled; attribution stacks per source (NOAA public domain, QZSS CC-BY, Galileo © EU); no new restriction attaches (BeiDou / GLONASS / EUMETSAT excluded). A lockstep v0.3 dataset bump; the class-generic splits and per-class scorer carry the new IGSO class through reporting automatically.
D16 — Hidden-label competition track via a never-committed forward holdout¶
The design of the true hidden-label competition the D12 amendment deferred. The open dataset publishes
every committed label, so a competition needs labels that were never committed: a forward holdout of
maneuvers with epoch strictly after the public dataset's freeze, reconstructed from the same operator feeds via
the D2 recipe and never written to labels.json / splits.json. Disjointness is purely temporal (a single
epoch > freeze cut), so the same object may appear on both sides. The holdout is rolling, keyed to the
release cadence — at each release the matured window is revealed into the next public dataset and a fresh
forward window becomes the new private holdout. On a never-committed holdout the full D12.3 firewall is
restorable: hidden labels, a public/private split scored once at the reveal, the rate limit now an integrity
bound, and the aggregate-only / fixed-schema round trip. LEO and MEO carry the competition from launch; the
D15 operator feeds (QZSS, NOAA GOES) populate GEO/IGSO for real. The board itself is a follow-up build,
gated on a v0.3 release first freezing the public dataset to define epoch > freeze; this decision fixes its
shape, not its code.
D17 — v0.3 score-protocol bump: per-class operating point + baked-in calibration¶
The publish half of the uncertainty-calibration work, applied to the real v0.3 baselines (the calibration machinery — reliability diagrams, temperature scaling, split-conformal — landed earlier as model-agnostic code; see the benchmark protocol):
- Per-class operating point in the report.
ScoreReport.to_jsongains a per-classoperating_point_confidence— the confidence cut admitted within the false-alarm budget at the headline operating point. It is additive (every prior field byte-for-byte unchanged) but it changes the frozen scorer artifact, so it is a v0.3-boundary change, not a v0.2 patch. - Calibration baked into the bundle, not re-fit at load. Each published detector carries a calibrator (temperature scaling + a split-conformal predictor) fit on the val split only — never the test labels — frozen into its bundle, so a loaded model emits calibrated confidence with no calibration data at inference (the same can't-drift discipline as the model card). Old bundles without a calibrator load unchanged.
- Reliability + operating points are published, not asserted. The per-class reliability curve and the calibrated operating point are recorded into the bundle and rendered onto the model card; consistent with the no-committed-real-data convention, the committed docs carry the methodology and a bundle→diagram helper, and the real-data figures are rendered at the credentialed release-cut run.
- The leaderboard tolerates the field by ignoring it — its public response stays a strict aggregate subset (headline recall per class, the operating point, the timing floor), so the new field neither leaks into a response nor changes scoring.
Derived from the v0.1, v0.2, and v0.3 prerequisite studies and the project charter.