feza-ai/spark

Spark

Single-binary Go pod orchestrator for GPU hosts. Accepts Kubernetes manifests and runs workloads via Podman on a single node. Connects to NATS for messaging.

Features

  • Kubernetes manifest support -- Pod, Job, CronJob, Deployment, StatefulSet
  • NATS messaging -- apply, delete, get, list pods over request/reply subjects
  • HTTP REST API -- health checks, resource queries, and pod CRUD (v1.2.0)
  • Resource-aware scheduling -- CPU, memory, and GPU tracking with allocatable limits
  • Priority-based preemption -- lower-priority pods evicted to make room for higher-priority work
  • SQLite state persistence -- WAL-mode database for crash recovery (v1.1.0)
  • Pod re-discovery -- recovers running pods from Podman after restart (v1.1.0)
  • Retention pruning -- automatic cleanup of completed/failed pods (v1.1.0)
  • Resource reconciliation -- periodic sync of actual vs requested CPU/memory/GPU usage (v1.2.0)
  • Graceful shutdown -- configurable drain timeout with ordered teardown (v1.2.0)
  • GPU detection -- NVIDIA GPU support with unified memory fallback (GB10)
  • CronJob scheduling -- cron expression parsing with Allow/Forbid/Replace concurrency policies
  • Directory watcher -- manifest hot-loading with SHA-256 change detection
  • Heartbeat publishing -- periodic node status over NATS
  • Event and log streaming -- lifecycle events and container logs published over NATS
  • Prometheus metrics -- /metrics endpoint with node, pod, and scheduler metrics (v1.3.0)
  • HTTP authentication -- bearer token auth with configurable token file (v1.3.0)
  • Pod logs via HTTP -- tail and SSE streaming of pod logs (v1.3.0)
  • Pod events via HTTP -- lifecycle event history with time filtering (v1.3.0)
  • Structured JSON logging -- configurable log format for aggregation (v1.3.0)
  • EmptyDir volumes -- tmpfs-backed scratch volumes for pods (v1.3.0)
  • Pod exec -- execute commands inside running containers via HTTP (v1.4.0)
  • Container port mapping -- expose container ports to the host via podman --publish (v1.4.0)
  • Init containers -- sequential initialization containers before main containers (v1.4.0)
  • GPU device assignment -- per-pod GPU device isolation via NVIDIA_VISIBLE_DEVICES (v1.4.0)
  • Image management -- list and pull container images via HTTP API (v1.4.0)
  • Security context -- runAsUser, privileged, capabilities add/drop forwarded to podman (v1.5.0)
  • Manifest removal -- file deletion stops pods, releases resources, unregisters cron jobs (v1.5.0)
  • CronJob registration on all paths -- NATS, HTTP, and filesystem all register cron jobs (v1.5.0)
  • Stuck pod recovery -- Scheduled and Preempted pods recovered after timeout (v1.5.0)
  • GPU count-based scheduling -- nvidia.com/gpu: N allocates N device slots, separate from GPU memory (v1.6.0)
  • Liveness probes -- exec and HTTP probes with configurable thresholds; reconciler restarts on failure (v1.6.0)
  • CronJob HTTP management -- list, inspect, and unregister cron jobs via REST API (v1.6.0)
  • Node info endpoint -- GPU model, device count, device IDs, CPU, memory, OS via HTTP (v1.6.0)
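The priority-based preemption feature listed above can be pictured with a small sketch. This is not Spark's actual algorithm (that is described in ADR 005, which is not reproduced here). It is a minimal illustration that assumes victims are chosen lowest priority first and that nothing is evicted unless the incoming pod would then fit. The `pod` type and `pickVictims` function are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, simplified view of a scheduled workload.
type pod struct {
	Name      string
	Priority  int
	CPUMillis int64
	MemoryMB  int64
}

// pickVictims returns the names of running pods to evict so that incoming fits
// into the free CPU and memory. Only pods with a strictly lower priority are
// considered, lowest priority first (an assumption of this sketch).
// It returns ok=false, and evicts nothing, when eviction cannot make room.
func pickVictims(running []pod, incoming pod, freeCPU, freeMem int64) (victims []string, ok bool) {
	fits := func() bool { return incoming.CPUMillis <= freeCPU && incoming.MemoryMB <= freeMem }
	if fits() {
		return nil, true // no preemption needed
	}
	// Copy the candidates so the caller's slice is not reordered.
	var candidates []pod
	for _, p := range running {
		if p.Priority < incoming.Priority {
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	for _, c := range candidates {
		freeCPU += c.CPUMillis
		freeMem += c.MemoryMB
		victims = append(victims, c.Name)
		if fits() {
			return victims, true
		}
	}
	return nil, false
}

func main() {
	running := []pod{
		{Name: "batch", Priority: 1, CPUMillis: 4000, MemoryMB: 8192},
		{Name: "train", Priority: 5, CPUMillis: 4000, MemoryMB: 8192},
		{Name: "serve", Priority: 10, CPUMillis: 4000, MemoryMB: 8192},
	}
	incoming := pod{Name: "urgent", Priority: 7, CPUMillis: 6000, MemoryMB: 4096}
	fmt.Println(pickVictims(running, incoming, 1000, 100000))
}
```

In this example "serve" is never a candidate because its priority is higher than the incoming pod's.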

Quick Start

go build ./cmd/spark
./spark --nats nats://localhost:4222

Spark will detect system resources (CPU, memory, GPU), connect to NATS, and begin watching /etc/spark/manifests for Kubernetes manifests.

CLI Flags

Flag                            Default                  Description
--nats                          nats://localhost:4222    NATS server URL
--node-id                       hostname                 Node identifier
--manifest-dir                  /etc/spark/manifests     Directory to watch for manifests
--gpu-max                       1                        Max concurrent GPU pods
--heartbeat-interval            10s                      Heartbeat publish interval
--reconcile-interval            5s                       Reconciliation loop interval
--system-reserve-cpu            2000                     CPU millicores reserved for system
--system-reserve-memory         4096                     MB of RAM reserved for system
--state-db                      /var/lib/spark/state.db  Path to SQLite database file
--pod-retention                 168h                     Retention period for completed/failed pods
--http-addr                     :8080                    HTTP listen address
--shutdown-timeout              30s                      Max time to drain pods on shutdown
--reconcile-resources-interval  60s                      Resource reconciliation interval
--log-format                    text                     Log output format (text or json)
--api-token-file                (empty)                  Path to file containing API bearer token
--housekeeping-interval         1m                       Housekeeping loop interval
--completed-pod-ttl             1h                       TTL after which Completed pods are reaped (0 disables)
--failed-pod-ttl                24h                      TTL after which Failed pods are reaped (0 disables)
--orphan-reap-ttl               1h                       TTL after which terminal-state orphan podman pods are reaped (0 disables)
--image-prune-interval          24h                      Interval between podman image prune -f runs (0 disables)

Per-pod TTL override is available via the spark.feza.ai/ttl-after-finished annotation (any value parseable by time.ParseDuration; 0s disables cleanup for that pod).

HTTP API

All endpoints are served on the address specified by --http-addr (default :8080).

Health Check

curl http://localhost:8080/healthz
{"status": "ok"}

List Resources

curl http://localhost:8080/api/v1/resources
{
  "total": {"cpu_millis": 20000, "memory_mb": 131072, "gpu_memory_mb": 131072},
  "reserved": {"cpu_millis": 2000, "memory_mb": 4096, "gpu_memory_mb": 0},
  "allocated": {"cpu_millis": 4000, "memory_mb": 8192, "gpu_memory_mb": 65536},
  "available": {"cpu_millis": 14000, "memory_mb": 118784, "gpu_memory_mb": 65536}
}

List Pods

curl http://localhost:8080/api/v1/pods
[
  {"name": "myapp", "status": "Running", "created_at": "2026-03-19T10:00:00Z"}
]

Get Pod

curl http://localhost:8080/api/v1/pods/myapp

Returns the full pod record including spec, status, events, and timestamps.

Apply Pod

curl -X POST http://localhost:8080/api/v1/pods \
  -H "Content-Type: application/yaml" \
  -d @pod.yaml

Accepts a Kubernetes manifest (YAML) in the request body. Supports Pod, Job, CronJob, Deployment, and StatefulSet kinds.

Delete Pod

curl -X DELETE http://localhost:8080/api/v1/pods/myapp
{"deleted": "myapp"}

Prometheus Metrics

curl http://localhost:8080/metrics

Returns node and pod metrics in Prometheus text exposition format (v0.0.4).

Pod Logs

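curl "http://localhost:8080/api/v1/pods/myapp/logs?tail=50"

Returns the last 50 lines of pod logs as text/plain. Add follow=true to stream logs as Server-Sent Events (SSE). Quote the URL so that shells such as zsh do not treat ? as a glob character.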

Pod Events

curl http://localhost:8080/api/v1/pods/myapp/events

Returns pod lifecycle events as JSON. Use ?since=2026-03-19T00:00:00Z to filter by time.

Pod Exec

curl -X POST http://localhost:8080/api/v1/pods/myapp/exec \
  -H "Content-Type: application/json" \
  -d '{"command":["ls","-la"]}'
{"stdout":"total 0\ndrwxr-xr-x ...","stderr":"","exit_code":0}

List Images

curl http://localhost:8080/api/v1/images

Pull Image

curl -X POST http://localhost:8080/api/v1/images/pull \
  -H "Content-Type: application/json" \
  -d '{"image":"localhost:5000/mymodel:latest"}'

Node Info

curl http://localhost:8080/api/v1/node
{
  "hostname": "dgx-spark",
  "os": "linux",
  "arch": "arm64",
  "cpu_cores": 72,
  "memory_total_mb": 131072,
  "gpu_model": "NVIDIA GH200",
  "gpu_count": 1,
  "gpu_device_ids": [0],
  "gpu_memory_mb": 131072
}

List CronJobs

curl http://localhost:8080/api/v1/cronjobs
[
  {"name": "train-nightly", "schedule": "0 2 * * *", "next_run": "2026-03-21T02:00:00Z", "run_count": 14}
]

Get CronJob

curl http://localhost:8080/api/v1/cronjobs/train-nightly

Delete CronJob

curl -X DELETE http://localhost:8080/api/v1/cronjobs/train-nightly

Authentication

When --api-token-file is set, all HTTP endpoints except /healthz and /metrics require a bearer token:

curl -H "Authorization: Bearer <token>" http://localhost:8080/api/v1/pods

Without the header, requests return 401 Unauthorized. If --api-token-file is not set, authentication is disabled.

NATS Subjects

Subject                   Purpose
req.spark.apply           Apply a pod manifest (request/reply)
req.spark.delete          Delete a pod (request/reply)
req.spark.get             Get pod status (request/reply)
req.spark.list            List all pods (request/reply)
evt.spark.event.{pod}     Pod lifecycle events
log.spark.{pod}           Container log streaming
heartbeat.spark.{node}    Node heartbeat with resource usage

Deployment

The deploy/ directory contains everything needed to run Spark on a DGX or similar GPU host:

File                  Purpose
setup-dgx.sh          Full DGX setup: installs NATS, Spark, and systemd services
setup-registry.sh     Sets up a local OCI registry on port 5000
spark.service         Systemd unit for Spark
nats-server.service   Systemd unit for NATS
registry.service      Systemd unit for local OCI registry
spark.env             Environment variables for the Spark service
install.sh            Binary installation script
nfpm/                 Deb packaging configuration

DGX Deployment

ssh ndungu@192.168.86.250
sudo bash deploy/setup-dgx.sh

This installs Spark and NATS as systemd services, configures the manifest directory, and starts both services.

Local OCI Registry

Spark uses a local OCI registry at localhost:5000 to store and serve container images, avoiding remote pulls during workload execution.

# Set up the registry
sudo bash deploy/setup-registry.sh

# Push an image
podman build -t myapp:latest .
podman tag myapp:latest localhost:5000/myapp:latest
podman push localhost:5000/myapp:latest

Reference images in pod manifests with the localhost:5000 prefix:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: localhost:5000/myapp:latest

Architecture

cmd/spark/          Entry point: flags, startup, signal handling
internal/
  api/              HTTP REST API handlers (health, resources, node, pods, exec, logs, events, images, cronjobs, metrics, auth)
  bus/              NATS bus abstraction, protocol handlers, event/log publishers
  cron/             Cron expression parser and scheduled job trigger
  executor/         Podman interface: pod create, stop, exec, logs, image pull, stats, liveness probes
  gpu/              GPU detection (nvidia-smi), device enumeration, and system resource detection
  lifecycle/        Graceful shutdown coordinator with pod draining
  manifest/         K8s YAML parser (Pod, Job, CronJob, Deployment, StatefulSet) with ports, init containers, securityContext, livenessProbe
  metrics/          Prometheus metrics collector and text renderer
  reconciler/       Desired-state reconciliation loop, pod recovery, resource sync, liveness probe polling
  scheduler/        Resource-aware scheduling with priority preemption and GPU count-based device slot tracking
  state/            Pod state store (in-memory + SQLite WAL persistence) with source path tracking
  watcher/          Manifest directory poller (SHA-256 change detection)
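To make the watcher's SHA-256 change detection concrete, here is a minimal sketch of one polling pass. It is an illustration under stated assumptions, not the code in internal/watcher: the `change` type and `detectChanges` function are hypothetical, and file contents are passed in as a map so the example needs no filesystem access.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// change describes what happened to one manifest file between two polls.
type change struct {
	Path string
	Kind string // "added", "modified", or "removed"
}

// detectChanges compares the current file contents against the digests seen on
// the previous poll, updates seen in place, and returns the changes sorted by path.
func detectChanges(seen map[string][32]byte, current map[string][]byte) []change {
	var out []change
	for path, data := range current {
		sum := sha256.Sum256(data)
		prev, known := seen[path]
		switch {
		case !known:
			out = append(out, change{path, "added"})
		case prev != sum:
			out = append(out, change{path, "modified"})
		}
		seen[path] = sum
	}
	for path := range seen {
		if _, present := current[path]; !present {
			out = append(out, change{path, "removed"})
			delete(seen, path) // deleting during range is safe in Go
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Path < out[j].Path })
	return out
}

func main() {
	seen := map[string][32]byte{}
	fmt.Println(detectChanges(seen, map[string][]byte{
		"a.yaml": []byte("v1"),
		"b.yaml": []byte("v1"),
	}))
	fmt.Println(detectChanges(seen, map[string][]byte{
		"a.yaml": []byte("v2"), // content changed; b.yaml was deleted
	}))
}
```

Hashing content rather than comparing modification times means a file that is rewritten with identical bytes produces no change event.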

Build / Test / Lint

# Build
go build ./cmd/spark

# Test
go test ./... -race -timeout 120s

# Lint
go vet ./...
staticcheck ./...

Constraints

  • Go standard library only, except github.com/nats-io/nats.go and modernc.org/sqlite.
  • Podman, not Docker.
  • Standard flag package for CLI flags.
  • HTTP routing via net/http.ServeMux with Go 1.22+ method-aware patterns.

Architecture Decisions

ADR Title
001 Go standard library only
002 NATS protocol design
003 Local OCI registry
004 K8s manifest compatibility
005 Priority preemption algorithm
006 Resource-aware scheduling
007 Ubuntu deb packaging
008 SQLite state persistence
009 HTTP API design
010 Prometheus metrics via stdlib
011 HTTP bearer token authentication

License

MIT License. See LICENSE for details.
