A solution inspired by kube-prometheus-stack, with the intention of providing an 'all-in-one' AI platform. At the moment it's geared towards my homelab, but I will ensure it stays agnostic to any LLMOps/MLOps Kubernetes environment. Feedback is welcome!
- Multi-Model Support: Deploy and manage multiple LLM models simultaneously on limited hardware
- GPU Agnostic: Supports GPU operators through resource requests/limits (examples with AMD)
- Auto-scaling: Intelligent scale-to-zero capabilities via KubeElasti integration
- AI Gateway: LiteLLM proxy for consistent model access and routing
- Monitoring: Built-in Prometheus metrics and healthcheck monitoring
- Persistent Storage: Model caching with configurable PVC storage
- Model Customization: Jinja2 prompt-template ConfigMaps and a bring-your-own-container approach to support diverse backends and hardware
- HuggingFace Integration: Seamless model loading from HuggingFace Hub
- MLOps: MLflow subchart for traditional ML experimentation and model registry
- LLMOps: Arize Phoenix subchart for GenAI tracing and evaluation
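As a rough illustration of how these features come together, a per-model entry in `values.yaml` might look like the sketch below. The key names (`models`, `huggingface.repo`, `scaleToZero`, etc.) are assumptions for illustration, not the chart's confirmed schema — check the chart's own `values.yaml` for the real structure.

```yaml
# Hypothetical values.yaml fragment -- key names are illustrative,
# not the chart's actual schema.
models:
  - name: example-llm
    image: ghcr.io/example/vllm-rocm:latest   # bring-your-own-container backend
    huggingface:
      repo: example-org/example-model         # loaded from HuggingFace Hub
    resources:
      limits:
        amd.com/gpu: 1                        # GPU-agnostic: any operator's resource name
    persistence:
      enabled: true                           # PVC-backed model cache
      size: 50Gi
    scaleToZero:
      enabled: true                           # via KubeElasti
```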
```mermaid
graph TB
    subgraph "kube-ai-stack Architecture"
        subgraph "LLM Model Deployments"
            LLM1["LLM Model 1<br/>Deployment<br/>(GPU-enabled)"]
            LLM2["LLM Model 2<br/>Deployment<br/>(GPU-enabled)"]
            LLMN["LLM Model N<br/>Deployment<br/>(GPU-enabled)"]
        end
        subgraph "Model Services"
            SVC1["Service<br/>(Model API)"]
            SVC2["Service<br/>(Model API)"]
            SVCN["Service<br/>(Model API)"]
        end
        subgraph "Storage Layer"
            PVC1["Persistent Volume Claim<br/>(Model Cache)"]
            PVC2["Persistent Volume Claim<br/>(Model Cache)"]
            PVCN["Persistent Volume Claim<br/>(Model Cache)"]
        end
        LITELLM["LiteLLM Proxy/Gateway<br/>(Unified API Endpoint)"]
        subgraph "Monitoring & Auto-scaling Layer"
            SERVICEMONITOR["ServiceMonitor<br/>(Prometheus)"]
            ELASTISERVICE["ElastiService<br/>(Auto-scaling)"]
        end
        subgraph "Observability & Experiment Layer"
            MLFLOW["MLflow<br/>(Experiment Tracking)"]
            PHOENIX["Phoenix<br/>(ML Observability)"]
        end
        %% Connections
        LLM1 --> SVC1
        LLM2 --> SVC2
        LLMN --> SVCN
        SVC1 --> PVC1
        SVC2 --> PVC2
        SVCN --> PVCN
        SVC1 --> ELASTISERVICE
        SVC2 --> ELASTISERVICE
        SVCN --> ELASTISERVICE
        SERVICEMONITOR --> LITELLM
        ELASTISERVICE --> SERVICEMONITOR
        LITELLM --> MLFLOW
        LITELLM --> PHOENIX
    end
```
- This is what it was originally created for
- Enables testing the latest OSS models and making them quickly available through the LiteLLM gateway
- Optimize the runtime for your hardware, but standardize the serving layer
- Scale-to-zero to enable multiple models on limited hardware
- Clusters in lower environments used by ML/AI Engineers
- Ability to quickly test new OSS models
- Scale-to-zero enabled for cost effectiveness in off-hours
- At the moment, I wouldn't necessarily recommend this
- However, in theory, with KubeElasti scaling disabled, this is feasible; it could also work if you can tolerate cold-start latency
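For that kind of setup, a hedged sketch of what disabling scale-to-zero in an override file might look like is below. The `scaleToZero` and `replicas` key names are assumptions, not the chart's confirmed schema:

```yaml
# Illustrative override -- key names are assumptions, check the chart's values.yaml.
models:
  - name: always-on-model
    scaleToZero:
      enabled: false      # keep replicas warm to avoid cold-start latency
    replicas: 2           # fixed replica count instead of scale-to-zero
```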
- Kubernetes cluster (v1.19+)
- GPU operators installed
- Helm 3.x
- kubectl configured for your cluster
```bash
# Clone the repository
git clone <repository-url>
cd kube-ai-stack

# Install the Helm chart
helm install my-ai-stack charts/kube-ai-stack

# Or install with custom configuration
helm install my-ai-stack charts/kube-ai-stack -f my-values.yaml
```

```bash
# Check deployment status
kubectl get deployments -n models

# Check services
kubectl get services -n models

# Check pod status
kubectl get pods -n models
```

The stack is highly configurable through Helm values. Key configuration areas include:
- Global Settings: Namespace, image repository, resource defaults, subcharts
- Model Definitions: Individual model configurations and parameters
- LiteLLM Settings: Gateway configuration and routing behavior
- Auto-scaling: Scaling policies and trigger conditions
- Resource Management: CPU, memory, and GPU allocation
- Storage: PVC configuration and storage class selection
- Subcharts: Customize subcharts like MLflow and Arize Phoenix
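Putting a few of those areas together, a custom `my-values.yaml` might look like the sketch below. All key names here are illustrative assumptions; consult the chart README for the actual values schema.

```yaml
# Illustrative my-values.yaml -- key names are assumptions, see the chart README.
global:
  namespace: models
  storageClass: fast-nvme     # storage class for model-cache PVCs
litellm:
  enabled: true               # unified gateway endpoint and routing
mlflow:
  enabled: true               # MLOps experiment tracking subchart
phoenix:
  enabled: false              # disable GenAI tracing if unused
```

An override file like this would be passed at install time with `helm install my-ai-stack charts/kube-ai-stack -f my-values.yaml`.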
For detailed configuration options, see the chart README.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
- In the short term, integrate the remaining subcharts leveraged in my homelab cluster: qdrant, litellm, openwebui, searxng
- In the medium term, add automated CI/CD, linting, and pre-commit hooks for testing and publishing images
- In the long term, add integration testing and publish getting-started docs