Related layers: Context • Containers • Code
Goals: zero single‑point‑of‑failure, fast worker discovery, and resilient routing under churn.
- Gateway → Orchestrator: RPC calls via @hyperswarm/rpc for scheduling.
- Orchestrator → Worker: RPC calls via @hyperswarm/rpc to
infer. - Quotas/Reservations: Persisted in PostgreSQL as part of job state when applicable.
- Affinity score per request:
(model match) + (region proximity) + (gpuClass match) + (queueDepth inverse) + (warmCache bonus). - Backpressure: token‑bucket at Gateway (per tenant) + queue depth thresholds at workers → if exceeded, shed or degrade (return lower‑latency model/params when policy allows).
- Retries: exponential backoff with retryToken (idempotency) and circuit‑breaker trips per model/region.
- Node keypairs; presence and model announcements are signed and nonce‑protected to mitigate replay.
- Transport uses HTTPS/TLS where configured; optional end‑to‑end payload encryption is planned as a tenant policy.
- Content‑addressed (digest) with signatures; N≥3 replicas.
- Workers verify signatures before load; support warm pools & on‑the‑fly adapter merges (e.g., LoRA) with cache eviction by LRU + size budget.
- Trace every request (correlationId, traceId) via Gateway → Orchestrator → Worker.
- SLIs: P50/P95 latency by model, success rate, throttle rate, stale‑cache rate.
- Error budgets gate rollouts (progressive canary on workers via control topic).
@startuml Component-Orchestrator
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Component.puml
LAYOUT_WITH_LEGEND()
SHOW_LEGEND(true)
Container_Boundary(orch, "Orchestrator/Scheduler") {
Component(discSub, "DiscoverySub", "Node.js/TypeScript", "Hyperswarm DHT peer discovery and health checks")
Component(scheduler, "Scheduler", "Node.js/TypeScript", "Model capability matching and load balancing")
Component(policy, "Policy Engine", "Node.js/TypeScript", "Quota, rate limit, model routing policies")
Component(cb, "Circuit Breaker", "Node.js/TypeScript", "Failure detection using opossum library")
Component(retry, "Retry Manager", "Node.js/TypeScript", "Exponential backoff with p-retry")
Component(resv, "Reservation Manager", "Node.js/TypeScript", "Reserves capacity; writes job state to Metadata Store")
Component(otel, "TelemetrySDK", "Node.js/TypeScript", "OTel traces, metrics, logs")
}
Container_Ext(gateway, "RPC Gateway", "Edge Adapter")
Container_Ext(worker, "Inference Worker", "ModelRuntime")
ContainerDb_Ext(meta, "Metadata Store", "Postgres/SQLite-cluster")
ContainerQueue_Ext(dlq, "Error Queue / DLQ", "Queue")
Container_Ext(observ, "OTel Collector", "Telemetry Aggregation")
Rel(gateway, scheduler, "Dispatch request", "RPC (@hyperswarm/rpc)")
Rel(discSub, scheduler, "Fleet snapshot (in‑memory cache)")
Rel(scheduler, policy, "Evaluate policy/quotas")
Rel(scheduler, cb, "Check breaker state")
Rel(scheduler, retry, "Plan retries/backoff")
Rel(scheduler, resv, "Reserve capacity; persist job", "SQL")
Rel(resv, meta, "Write job + reservation", "SQL")
Rel(scheduler, worker, "Assign job / process request", "RPC (@hyperswarm/rpc)")
Rel(scheduler, dlq, "Poison/failed jobs → enqueue")
Rel(otel, observ, "Export telemetry", "OTel")
SHOW_LEGEND()
@endumlNotes
DiscoverySubsubscribes tohaif/presence/<region>andhaif/models/<model>/<version>/<region>.Schedulerranks candidates using: region proximity, gpuClass, warmCache, queueDepth, breaker state, tenant policy.Reservation Managerensures idempotency via(correlationId, retryToken)and persists state before dispatch.
@startuml Component-Worker
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Component.puml
LAYOUT_WITH_LEGEND()
SHOW_LEGEND(true)
Container_Boundary(worker, "Inference Worker") {
Component(base, "BaseWorker", "Node.js/TypeScript", "Hyperswarm RPC endpoint handler")
Component(loader, "ModelLoader", "Python", "Dynamic model loading with transformers/torch")
Component(exec, "ExecutionEngine", "Python", "GPU/CPU inference with vLLM or PyTorch")
Component(health, "HealthReporter", "Python", "GPU memory and CPU usage tracking")
Component(quota, "QuotaGuard", "Node.js/TypeScript", "Per‑tenant limits; local backpressure")
Component(otelw, "TelemetrySDK", "Node.js/Python", "OTel traces, metrics, logs")
}
Container_Ext(orch, "Orchestrator/Scheduler")
Container_Ext(artifact, "Model Artifact Store", "S3/Hypercore")
Rel(orch, base, "infer() RPC", "@hyperswarm/rpc")
Rel(loader, artifact, "Fetch by digest; verify sig", "HTTPS/P2P")
Rel(health, orch, "Announce on presence + model topics")
Rel(exec, base, "Stream chunks → Gateway via Orchestrator")
Rel(quota, base, "Check local tokens / shed if needed")
Rel(otelw, orch, "Export telemetry via Collector", "OTel")
SHOW_LEGEND()
@endumlNotes
HealthReporterpublishes every 5s; worker expires after 15s silence.ModelLoaderpins hot models; eviction policy LRU + VRAM budget; adapters can be merged on load.ExecutionEnginesupports graceful stop on shutdown; in‑flight jobs finalize or reroute.
- DiscoverySub: Uses
hyperswarmDHT for peer discovery, maintains worker registry in memory with periodic health checks - Scheduler: Implements weighted round-robin and capability-based routing using custom scoring algorithms
- Policy Engine: Rule-based system with JSON configuration, supports rate limiting via
bottlenecklibrary - Circuit Breaker: Implements the circuit breaker pattern using
opossumwith configurable thresholds - Retry Manager: Exponential backoff retry logic using
p-retrywith jitter and maximum attempts - Reservation Manager: PostgreSQL integration via
pgfor job state persistence and capacity tracking
- BaseWorker: HTTP/RPC server using
hyperswarmfor P2P communication, handles authentication and routing - QuotaGuard: Token bucket rate limiting per tenant, integrates with Redis for distributed quotas
- ModelLoader: Dynamic model loading using
transformers,torch, andacceleratelibraries - ExecutionEngine: Inference execution with
vLLMfor LLMs or direct PyTorch for other models - HealthReporter: System monitoring using
psutilandpynvmlfor GPU metrics
- TelemetrySDK (OTel): Unified tracing/metrics/logs across Gateway, Orchestrator and Workers
- Exporters: OTLP exporters to OTel Collector; metrics to Prometheus, logs to Loki
- Dashboards/Alerts: Grafana for visualization; Alertmanager for alerting on SLO violations
Service Discovery & Communication
- DiscoverySub ingests signed heartbeats and capability announcements from Workers over Hyperswarm topics. Scheduler uses an affinity score to route requests; transport is encrypted and per‑request RPC topics stream results.
Data Storage & Replication
- Reservation Manager persists idempotent job state in PostgreSQL with regional primaries and replicas. Model artifacts are content‑addressed and signed, stored with ≥3 replicas; Workers verify and cache locally.
Scalability & Robustness
- Circuit Breaker and Retry Manager implement failure isolation and jittered backoff. Sharding by model/tenant/region and warm caches reduce tail latency. DLQ stores poison messages for post‑mortem.
Local AI Execution
- ExecutionEngine runs models locally (GPU/CPU) and streams results; ModelLoader verifies signatures, loads weights/adapters, and maintains warm pools.