8 stable releases
| Version | Date |
|---|---|
| 2026.1.203 | Apr 16, 2026 |
| 2026.1.176 | Apr 2, 2026 |
| 2026.1.130 | Mar 28, 2026 |
A distributed workload scheduler for large-scale scientific computing, AI/ML training, inference services, and regulated workloads. Lattice schedules both finite jobs (batch training, simulations) and infinite jobs (inference services, monitoring) on shared HPC infrastructure with topology-aware placement and a unified API for human users and autonomous agents.
Design Principles
- Full-node scheduling — the scheduler reasons about whole nodes; the node agent handles intra-node packing via Sarus and uenv
- Intent-based API with Slurm compatibility — agents declare what they need, users can use familiar sbatch-like commands (see the sketch after this list)
- Distributed control plane — Raft quorum with persistent WAL, snapshots, and backup/restore; per-vCluster schedulers for workload-specific policies
- uenv-native software delivery — SquashFS user environments as the default, OCI containers when isolation is needed
- Regulated workload support — user-level node claims, dedicated nodes, encrypted storage, full audit trail with 7-year retention
- Federation as opt-in — multi-site operation via Sovra trust layer, fully functional without it
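To make the intent-based model concrete, here is a minimal sketch of what declaring a workload could look like. All type and field names below are illustrative assumptions, not the published lattice-common schema:

```rust
// Illustrative sketch only: hypothetical types, not the real Lattice API.
#![allow(dead_code)]

/// Finite jobs run to completion; infinite jobs run until cancelled.
enum JobKind {
    Finite { max_walltime_secs: u64 }, // batch training, simulations
    Service { min_replicas: u32 },     // inference services, monitoring
}

/// A declarative request: what the workload needs, not where it runs.
struct AllocationIntent {
    vcluster: String,          // logical cluster to schedule into
    kind: JobKind,
    nodes: u32,                // whole nodes (full-node scheduling)
    gpus_per_node: u32,
    uenv: Option<String>,      // SquashFS user environment (the default)
    oci_image: Option<String>, // OCI container when isolation is needed
}

fn main() {
    // A finite batch-training job declared by intent, not by node list.
    let intent = AllocationIntent {
        vcluster: "ml-research".into(), // hypothetical vCluster name
        kind: JobKind::Finite { max_walltime_secs: 4 * 3600 },
        nodes: 16,
        gpus_per_node: 4,
        uenv: Some("pytorch/25.1:v1".into()), // hypothetical uenv label
        oci_image: None,
    };
    println!("requesting {} nodes in {}", intent.nodes, intent.vcluster);
}
```

The scheduler picks the nodes; a Slurm-style front end can translate familiar sbatch flags into the same intent.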
Architecture
| Layer | Components |
|---|---|
| User Plane | lattice-cli + lattice-api (OIDC via hpc-auth) |
| Software Plane | uenv (SquashFS) + Sarus (OCI) + Registry |
| Scheduling Plane | Raft Quorum + vCluster Schedulers (knapsack) |
| Data Plane | VAST (NFS/S3) tiered storage + data mover |
| Network Fabric | Slingshot / Ultra Ethernet (libfabric) |
| Node Plane | Node Agent + mount namespaces + eBPF telemetry |
| Infrastructure | OpenCHAMI (Redfish BMC, boot, inventory) |
The scheduler uses a multi-dimensional knapsack algorithm with a composite cost function covering priority, wait time, fair share, topology fitness, data readiness, backlog pressure, energy cost, checkpoint efficiency, and conformance fitness. Weights are tunable per vCluster and testable offline with the RM-Replay simulator.
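As a rough sketch of the idea, assuming lower cost wins, benefit-like terms (priority, locality, readiness) can enter with negative weights and penalty-like terms (backlog, energy) with positive ones; every name and number below is a placeholder, not the actual lattice-scheduler evaluator:

```rust
// Minimal sketch of a weighted composite cost; weights are per-vCluster
// tunables and the values here are invented placeholders.
use std::collections::HashMap;

fn composite_cost(weights: &HashMap<&str, f64>, terms: &HashMap<&str, f64>) -> f64 {
    // Sum weight * term over the dimensions present for this candidate.
    terms
        .iter()
        .map(|(k, v)| weights.get(k).copied().unwrap_or(0.0) * v)
        .sum()
}

fn main() {
    // Negative weight = benefit (lowers cost), positive = penalty.
    let weights = HashMap::from([
        ("priority", -1.0),
        ("wait_time", -0.01),
        ("fair_share", -0.5),
        ("topology_fitness", -2.0),
        ("data_readiness", -1.0),
        ("checkpoint_efficiency", -0.5),
        ("conformance_fitness", -3.0),
        ("backlog_pressure", 0.2),
        ("energy_cost", 0.001),
    ]);
    // Normalized term values for one candidate placement.
    let candidate = HashMap::from([
        ("priority", 0.8),
        ("wait_time", 1800.0),
        ("topology_fitness", 0.9),
        ("data_readiness", 1.0),
        ("energy_cost", 42.0),
    ]);
    println!("composite cost: {:.3}", composite_cost(&weights, &candidate));
}
```

Because the weights are plain data, a vCluster can be re-tuned and the change replayed offline in RM-Replay before it touches production.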
See docs/architecture/ for detailed design documents and docs/decisions/ for ADRs.
Repository Structure
crates/
├── lattice-common/ Shared types, config, protobuf bindings
├── lattice-quorum/ Raft consensus, persistent WAL, snapshots, backup/restore
├── lattice-scheduler/ vCluster schedulers, knapsack solver, cost function
├── lattice-api/ gRPC (tonic) + REST (axum) server, OIDC, RBAC, mTLS
├── lattice-checkpoint/ Checkpoint broker, cost evaluator
├── lattice-node-agent/ Per-node daemon, GPU/memory discovery, eBPF telemetry, data staging
├── lattice-cli/ CLI binary (submit/status/cancel/session/telemetry)
├── lattice-test-harness/ Shared mocks, fixtures, builders
└── lattice-acceptance/ BDD scenarios (cucumber) + property tests (proptest)
proto/ Protobuf definitions (API contract)
sdk/python/ Python SDK (httpx REST client)
tools/rm-replay/ Scheduler simulator (real + simple cost modes)
infra/ Dockerfiles, docker-compose, systemd, Grafana, alerting
config/ Configuration files (minimal + production)
examples/ Usage examples (CLI, Python SDK, DAGs, configs, Slurm migration)
scripts/ Release version patching
docs/
├── architecture/ 30 design documents
├── decisions/ Architecture Decision Records
└── references/ External references
Building
# Build the workspace
cargo build --workspace
# Run tests (~1368 unit/integration + 274 BDD scenarios)
just test # default: skips slow multi-node Raft tests
just test-all # full suite including slow tests
just test-slow # only slow tests
# Or without just
cargo nextest run --workspace --exclude lattice-acceptance # unit/integration
cargo test -p lattice-acceptance # BDD acceptance tests
cargo test --workspace -- --include-ignored # everything including slow
# Lint and check
cargo fmt --all -- --check
cargo clippy --workspace --all-targets
cargo deny check
# Or use just
just all
Python SDK
cd sdk/python
pip install -e .
pytest
Docker
cd infra/docker
docker compose up
Technology Stack
| Component | Language | Key Dependencies |
|---|---|---|
| Scheduler core (10 crates) | Rust | tokio, tonic, prost, openraft, axum, flate2, tar |
| Security | Rust | jsonwebtoken (OIDC), rcgen (mTLS), RBAC middleware |
| Observability | Rust | prometheus, eBPF (Linux), nvml-wrapper (NVIDIA) |
| User SDK | Python | httpx, pytest |
| Simulator | Rust | Real cost evaluator from lattice-scheduler |
| Protobuf | buf | Rust (tonic/prost) generation |
| Deployment | Docker/systemd | Multi-stage builds, Grafana dashboards |
External Integrations
| System | Role |
|---|---|
| OpenCHAMI | Infrastructure management (Redfish BMC) |
| FirecREST | Optional compatibility gateway (hybrid Slurm deployments) |
| uenv / Sarus | Software delivery (SquashFS / OCI) |
| Sovra | Federation trust (optional) |
| VAST | Tiered storage (NFS + S3) |
| Waldur | Accounting and billing (optional) |
Contributing with Claude Code
This project includes structured Claude Code profiles for different development phases: analyst, architect, adversary, implementer, integrator, and auditor. Each profile constrains Claude to a specific role in the workflow defined in `.claude/CLAUDE.md`, which is loaded automatically.
License
Citation
If you use this in research, please cite:
@software{lattice,
  title = {Lattice: A scheduler for HPC and AI workloads},
  author = {Pim Witlox},
  year = {2026},
  url = {https://github.com/witlox/lattice}
}
lib.rs: lattice-common
Shared types, configuration, and error handling for the Lattice scheduler.
This crate defines the core domain model used across all Lattice components:
- Allocation: The universal work unit (replaces Slurm job + K8s pod)
- Node: Physical compute node with capabilities and ownership
- Tenant: Organizational boundary with quotas
- VCluster: Logical cluster with its own scheduling policy
- TopologyModel: Slingshot/UE dragonfly group structure
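A rough sketch of how these types might relate; the field names are assumptions for illustration, with the authoritative definitions in this crate and the protobuf contract under proto/:

```rust
// Illustrative sketch only: hypothetical fields, not the real definitions.
#![allow(dead_code)]

struct Tenant {
    name: String,
    node_hour_quota: f64,       // organizational quota boundary
}

struct VCluster {
    name: String,
    tenant: String,             // owning tenant
    policy: String,             // per-vCluster scheduling policy
}

struct Node {
    hostname: String,
    gpus: u32,
    memory_gib: u64,
    dragonfly_group: u32,       // place in the Slingshot/UE topology model
    claimed_by: Option<String>, // user-level claim on a dedicated node
}

/// The universal work unit, replacing both a Slurm job and a K8s pod.
struct Allocation {
    id: u64,
    vcluster: String,
    nodes: Vec<String>,         // whole nodes, per full-node scheduling
}

fn main() {
    let alloc = Allocation {
        id: 1,
        vcluster: "ml-research".into(),
        nodes: vec!["nid001".into(), "nid002".into()],
    };
    println!("allocation {} spans {} nodes", alloc.id, alloc.nodes.len());
}
```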