Kubernetes Job-per-run¶
Wire KubernetesJobPool into a
Kubernetes cluster. Every run becomes one batch/v1 Job, every Pod is
a fresh Python interpreter, and the cluster's scheduler decides where
each Job lands. There is no Dask or Ray middleware — the pool talks to
the Kubernetes API directly.
Two Kubernetes paths
The companion recipe Kubernetes pod-per-worker wires
DaskPool into a Dask-managed pool of worker Pods. Pick that when
your stack already wants a Dask client; pick this page when you
want native scheduling and one less middleware layer.
Prerequisites¶
- A reachable Kubernetes cluster — managed (GKE, EKS, AKS, …), self-hosted, or local (kind, k3d, Docker Desktop). kubectl get nodes should succeed from the machine running the driver.
- A container image with gmat-sweep[k8s] plus a working GMAT install. See Image for the build pattern.
- A PersistentVolumeClaim visible to both the driver and the workers at the same path. Pure ReadWriteMany (EFS, Filestore, Azure Files, NFS-CSI, GCS Fuse) is the typical answer; a ReadWriteOnce PVC works when the driver runs as a Pod on the same node as the workers.
- Python with gmat-sweep[k8s] installed in the driver env.
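A quick preflight from the driver environment, sketched with the official kubernetes Python client (assuming the [k8s] extra makes it importable):

# Preflight: confirm the driver environment can reach the cluster.
# Assumes the official `kubernetes` client is installed (the [k8s]
# extra presumably pulls it in).
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a Pod
for node in client.CoreV1Api().list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)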
Image¶
KubernetesJobPool requires image= — the pool ships no default. The
canonical image at ghcr.io/astro-tools/gmat:<tag> carries GMAT and
gmatpy but not gmat-sweep itself — downstream consumers add
their own packages on top:
FROM ghcr.io/astro-tools/gmat:R2026a
RUN python3.12 -m pip install --no-cache-dir "gmat-sweep[k8s]==<your-version>"
Pin both the GMAT image tag and the gmat-sweep version. A mismatch
between driver and worker gmat-sweep versions is a serialisation hazard
the Pool ABC's bit-equal contract cannot recover from.
Push the image somewhere both the cluster and CI can reach
(ghcr.io/<your-org>/gmat-sweep, ECR, GCR, …). Pass the fully-pinned
reference to KubernetesJobPool(image=...).
PVC layout¶
Per-run spec / outcome JSON files travel through the PVC, not over the
Kubernetes API. The pool writes
<driver_mount_path>/_specs/<run_id>.json on the driver side; the Pod
reads <pvc_mount_path>/_specs/<run_id>.json. Outcome JSON travels back
the same way under _outcomes/. Per-run Parquet outputs go into the
out= directory of the sweep.
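Concretely, a sweep rooted at /sweep leaves a layout like this on the PVC (the mission and sweep paths are the ones from the worked example below; only _specs/ and _outcomes/ belong to the pool):

/sweep/
  _specs/<run_id>.json       # RunSpec JSON, written by the driver
  _outcomes/<run_id>.json    # outcome JSON, written by the Pod
  missions/mission.script    # your GMAT script (illustrative path)
  sweeps/sma-scan/           # out= directory; per-run Parquet lands here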
If the driver runs as a Pod, mount the PVC at the same path as the
workers (/sweep is the default). Then pvc_mount_path and
driver_mount_path are the same string and you don't pass
driver_mount_path at all.
If the driver runs out-of-cluster (laptop, CI runner), mount the same
storage on the driver side at the same path the Pods see — typically
via an NFS / CIFS mount that exports the PVC's backing storage. Setting
driver_mount_path to a different local path is supported, but
RunSpec.script_path and RunSpec.output_dir must already be Pod-side
paths in either case — the pool does not rewrite them.
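The translation is a plain prefix swap on the pool's own control files, nothing more. A sketch (the helper is hypothetical; the real logic lives in gmat_sweep/backends/kubernetes.py):

from pathlib import Path

# Hypothetical illustration of the prefix swap the pool applies to its
# own _specs/_outcomes files. RunSpec paths are never rewritten.
def driver_side(pod_path: str, pvc_mount: str = "/sweep",
                driver_mount: str = "/mnt/gmat-sweep") -> Path:
    return Path(driver_mount) / Path(pod_path).relative_to(pvc_mount)

print(driver_side("/sweep/_specs/17.json"))  # /mnt/gmat-sweep/_specs/17.json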
A minimal PersistentVolume + PersistentVolumeClaim pair backed by
NFS:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gmat-sweep-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: nfs.internal
    path: /export/gmat-sweep
  storageClassName: gmat-sweep
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gmat-sweep
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 100Gi
  storageClassName: gmat-sweep
  volumeName: gmat-sweep-pv
Authentication¶
KubernetesJobPool autodetects between the two auth paths:
- In-cluster — the driver runs as a Pod with a ServiceAccount that has RBAC to create/list/watch batch/v1 Jobs and read Pod logs in the target namespace. The pool calls load_incluster_config(). Triggered when /var/run/secrets/kubernetes.io/serviceaccount/token exists.
- Out-of-cluster — the driver runs anywhere with a kubeconfig. The pool calls load_kube_config(config_file=...), defaulting to ~/.kube/config. Triggered when the in-cluster token file is absent.
Force one path explicitly with in_cluster=True / in_cluster=False.
Pass kubeconfig= to point at a non-default kubeconfig path.
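For example, a CI runner forcing the out-of-cluster path against a dedicated kubeconfig (paths illustrative):

from gmat_sweep.backends import KubernetesJobPool

# Skip autodetection: out-of-cluster auth with an explicit kubeconfig.
pool = KubernetesJobPool(
    image="ghcr.io/your-org/gmat-sweep:0.4.0-gmat-R2026a",
    pvc_name="gmat-sweep",
    pvc_mount_path="/sweep",
    in_cluster=False,
    kubeconfig="/etc/ci/kubeconfig",  # illustrative path
)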
A minimal Role for the in-cluster path:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gmat-sweep-driver
  namespace: <your-namespace>
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
Worked example¶
In-cluster driver¶
from gmat_sweep import sweep
from gmat_sweep.backends import KubernetesJobPool

with KubernetesJobPool(
    image="ghcr.io/your-org/gmat-sweep:0.4.0-gmat-R2026a",
    pvc_name="gmat-sweep",
    pvc_mount_path="/sweep",
    namespace="missions",
    parallelism=32,
    default_resources={"cpu": "1", "memory": "4Gi"},
) as pool:
    df = sweep(
        "/sweep/missions/mission.script",
        grid={"Sat.SMA": [7000.0, 7100.0, 7200.0, 7300.0]},
        backend=pool,
        out="/sweep/sweeps/sma-scan",
    )

print(df.head())
The driver runs as a Pod whose serviceAccountName is bound to the
gmat-sweep-driver Role above. parallelism=32 caps in-flight Jobs at
32; the cluster decides where each Pod lands.
Out-of-cluster driver¶
from gmat_sweep import sweep
from gmat_sweep.backends import KubernetesJobPool

with KubernetesJobPool(
    image="ghcr.io/your-org/gmat-sweep:0.4.0-gmat-R2026a",
    pvc_name="gmat-sweep",
    pvc_mount_path="/sweep",
    driver_mount_path="/mnt/gmat-sweep",
    in_cluster=False,
    parallelism=8,
) as pool:
    df = sweep(
        "/sweep/missions/mission.script",
        grid={"Sat.SMA": [7000.0, 7100.0, 7200.0, 7300.0]},
        backend=pool,
        out="/sweep/sweeps/sma-scan",
    )
The driver mounts the same NFS export at /mnt/gmat-sweep; the Pods
mount the PVC at /sweep. The paths handed to the sweep (script_path, output_dir) are Pod-side absolute paths.
Per-run resources¶
resources= accepts a mapping keyed by run_id or a callable
(RunSpec) -> dict. Both compose with default_resources=:
# Per-run map
KubernetesJobPool(
    ...,
    default_resources={"cpu": "1", "memory": "4Gi"},
    resources={
        17: {"cpu": "8", "memory": "16Gi"},  # one heavy run
        42: {"cpu": "4", "memory": "8Gi", "nvidia.com/gpu": "1"},
    },
)

# Callable
def by_spec(spec):
    if spec.overrides.get("Sat.SMA", 0) > 30000:
        return {"cpu": "4", "memory": "8Gi"}
    return None  # fall through to default_resources

KubernetesJobPool(..., resources=by_spec, default_resources={"cpu": "1"})
Each resolved value populates V1ResourceRequirements.requests. Pass a
dict with both requests and limits keys to set limits explicitly:
resources={
    0: {
        "requests": {"cpu": "2", "memory": "4Gi"},
        "limits": {"cpu": "4", "memory": "8Gi"},
    }
}
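On the kubernetes client side this is a straightforward normalization into V1ResourceRequirements. A sketch of the plausible mapping (the helper is hypothetical, not the pool's actual code):

from kubernetes import client

# Hypothetical sketch: a flat dict becomes requests-only; an explicit
# requests/limits dict sets both fields.
def to_requirements(resolved: dict) -> client.V1ResourceRequirements:
    if "requests" in resolved or "limits" in resolved:
        return client.V1ResourceRequirements(
            requests=resolved.get("requests"),
            limits=resolved.get("limits"),
        )
    return client.V1ResourceRequirements(requests=resolved)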
Local development with kind¶
For iterating on the pool itself, kind is the lightest cluster:
kind create cluster --name gmat-sweep
docker build -t gmat-sweep:dev -f Dockerfile .
kind load docker-image gmat-sweep:dev --name gmat-sweep
kubectl apply -f sweep-pvc.yaml
The extraMounts field in kind-config.yaml bind-mounts a host
directory into the kind node, so a hostPath PV on the node and the
driver-side path resolve to the same files. See tests/k8s/ for the
configs used in CI.
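A kind-config.yaml along these lines does the bind-mount (host path illustrative; the CI configs under tests/k8s/ are authoritative):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /tmp/gmat-sweep          # driver-side directory, illustrative
        containerPath: /export/gmat-sweep  # backs a hostPath PV on the node

Pass it at cluster creation with kind create cluster --config kind-config.yaml.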
Caveats¶
backoffLimit=0 — no silent retries¶
The pool sets backoffLimit=0 so a Pod failure maps 1:1 to a
RunOutcome.failed. Setting it higher would let Kubernetes silently
re-run a failed Pod, which breaks gmat-sweep's failure-as-row contract.
If you need retries, do them at the sweep layer with
Sweep.from_manifest(...).resume().
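For example (the manifest path and the package-root import of Sweep are assumptions; point the call at wherever your sweep actually wrote its manifest):

from gmat_sweep import Sweep

# Hedged sketch: re-run only the rows that failed, at the sweep layer.
# The manifest location below is assumed, not prescribed.
Sweep.from_manifest("/sweep/sweeps/sma-scan/manifest.json").resume()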
ttlSecondsAfterFinished=300¶
Completed Jobs auto-GC after 5 minutes. Long enough for a kubectl inspection window, short enough not to leak Job objects across long sweeps. Override via ttl_seconds_after_finished= if you need more.
job_deadline_seconds=3600¶
Driver-side wall-clock deadline per Job. A Job that has not reached a
terminal status (succeeded or failed) within this many seconds is
deleted (propagationPolicy=Background) and folded into a synthetic
RunOutcome.failed. The check exists to break the otherwise-silent
hang on Pods stuck in Pending / ImagePullBackOff / Unschedulable,
which never produce a status event and so never satisfy the watch
loop. Granularity is bounded by the watch reconnect cadence (~60 s);
the deadline is a hang preventer, not a tight SLA. Override via
job_deadline_seconds= for shorter (test) or longer (multi-hour
solver) runs.
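Both knobs are constructor arguments, so a multi-hour solver sweep might stretch both at once (values illustrative):

from gmat_sweep.backends import KubernetesJobPool

pool = KubernetesJobPool(
    image="ghcr.io/your-org/gmat-sweep:0.4.0-gmat-R2026a",
    pvc_name="gmat-sweep",
    pvc_mount_path="/sweep",
    ttl_seconds_after_finished=1800,  # keep finished Jobs for 30 minutes
    job_deadline_seconds=6 * 3600,    # allow six hours before reaping
)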
close() deletes in-flight Jobs¶
Calling close() (whether from a with block exit, a Ctrl-C, or
explicitly mid-sweep) issues a background-propagation delete for
every Job still in flight at close time. Without this, an aborted
sweep would leave Pods running until the TTL kicked in after they
finished — burning your namespace quota and your bill. ApiException
on a per-Job delete is swallowed so a Job already reaped by the TTL
controller (or by a concurrent kubectl) doesn't propagate out.
Pod failure modes¶
A Pod that terminates without writing outcome JSON (OOMKill, eviction, image pull failure) is folded into a synthetic RunOutcome.failed carrying the Pod's logs as stderr. Same posture every other backend takes for transport-layer failures.
Image discipline¶
The driver and the Pods must run the same gmat-sweep version, the
same gmat-run version, and the same image — same as the dask-kubernetes
recipe documents. A pinned image tag is the simplest enforcement.
Storage shape¶
The PVC must be visible to every worker Pod and to the driver. A
ReadWriteOnce PVC with a single-node cluster works in development;
production typically wants ReadWriteMany (EFS, Filestore, Azure Files,
NFS-CSI, GCS Fuse) so worker Pods can land anywhere.
When this isn't enough¶
CRD-driven sweep management, custom CSI drivers, multi-cluster
federation, or per-Pod sidecars — those exit the recipe and become
custom Pool work against the Pool ABC.
The KubernetesJobPool source under
gmat_sweep/backends/kubernetes.py is a working template.