A solution inspired by kube-prometheus-stack, with the intention of providing an 'all-in-one' AI platform. At the moment it's geared towards my homelab, but I will ensure it stays agnostic to any LLMOps/MLOps Kubernetes environment. Feedback is welcome!
- Multi-Model Support: Deploy and manage multiple LLM models simultaneously on limited hardware
- GPU Agnostic: Supports GPU operators through resource requests/limits (examples with AMD)
- Auto-scaling: Intelligent scale-to-zero capabilities via KubeElasti integration
- AI Gateway: LiteLLM proxy for consistent model access and routing
- Monitoring: Built-in Prometheus metrics and healthcheck monitoring
- Persistent Storage: Model caching with configurable PVC storage
- Model Customization: Jinja2 prompt-template ConfigMaps and a bring-your-own-container approach to support diverse backends and hardware
- HuggingFace Integration: Seamless model loading from HuggingFace Hub
- MLOps: MLflow subchart for traditional ML experimentation and model registry
- LLMOps: Arize Phoenix subchart for GenAI tracing and evaluation
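As a rough illustration of how these features come together, a per-model entry in `values.yaml` might look like the sketch below. The key names (`models`, `huggingface.repo`, `scaleToZero`, etc.) are assumptions for illustration, not the chart's confirmed schema — check the chart's own `values.yaml` for the real structure.

```yaml
# Hypothetical values.yaml fragment -- key names are illustrative,
# not the chart's actual schema.
models:
  - name: example-llm
    image: ghcr.io/example/vllm-rocm:latest   # bring-your-own-container backend
    huggingface:
      repo: example-org/example-model         # loaded from HuggingFace Hub
    resources:
      limits:
        amd.com/gpu: 1                        # GPU-agnostic: any operator's resource name
    persistence:
      enabled: true                           # PVC-backed model cache
      size: 50Gi
    scaleToZero:
      enabled: true                           # via KubeElasti
```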
```mermaid
graph TB
    subgraph "kube-ai-stack Architecture"
        subgraph "LLM Model Deployments"
            LLM1["LLM Model 1<br/>Deployment<br/>(GPU-enabled)"]
            LLM2["LLM Model 2<br/>Deployment<br/>(GPU-enabled)"]
            LLMN["LLM Model N<br/>Deployment<br/>(GPU-enabled)"]
        end
        subgraph "Model Services"
            SVC1["Service<br/>(Model API)"]
            SVC2["Service<br/>(Model API)"]
            SVCN["Service<br/>(Model API)"]
        end
        subgraph "Storage Layer"
            PVC1["Persistent Volume Claim<br/>(Model Cache)"]
            PVC2["Persistent Volume Claim<br/>(Model Cache)"]
            PVCN["Persistent Volume Claim<br/>(Model Cache)"]
        end
        LITELLM["LiteLLM Proxy/Gateway<br/>(Unified API Endpoint)"]
        subgraph "Monitoring & Auto-scaling Layer"
            SERVICEMONITOR["ServiceMonitor<br/>(Prometheus)"]
            ELASTISERVICE["ElastiService<br/>(Auto-scaling)"]
        end
        subgraph "Observability & Experiment Layer"
            MLFLOW["MLflow<br/>(Experiment Tracking)"]
            PHOENIX["Phoenix<br/>(ML Observability)"]
        end
        %% Connections
        LLM1 --> SVC1
        LLM2 --> SVC2
        LLMN --> SVCN
        SVC1 --> PVC1
        SVC2 --> PVC2
        SVCN --> PVCN
        SVC1 --> ELASTISERVICE
        SVC2 --> ELASTISERVICE
        SVCN --> ELASTISERVICE
        SERVICEMONITOR --> LITELLM
        ELASTISERVICE --> SERVICEMONITOR
        LITELLM --> MLFLOW
        LITELLM --> PHOENIX
    end
```
- This is what it was originally created for
- Enables testing the latest OSS models and making them quickly available through the LiteLLM gateway
- Optimize the runtime for your hardware, but standardize the serving layer
- Scale-to-zero to enable multiple models on limited hardware
- Clusters in lower environments used by ML/AI Engineers
- Ability to quickly test new OSS models
- Scale-to-zero enabled for cost effectiveness in off-hours
- At the moment, I wouldn't necessarily recommend this
- However, in theory, with KubeElasti scaling disabled, this is feasible; it could also work if you can tolerate cold-start latency
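For that kind of setup, a hedged sketch of what disabling scale-to-zero in an override file might look like is below. The `scaleToZero` and `replicas` key names are assumptions, not the chart's confirmed schema:

```yaml
# Illustrative override -- key names are assumptions, check the chart's values.yaml.
models:
  - name: always-on-model
    scaleToZero:
      enabled: false      # keep replicas warm to avoid cold-start latency
    replicas: 2           # fixed replica count instead of scale-to-zero
```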
- Kubernetes cluster (v1.19+)
- GPU operators installed
- Helm 3.x
- kubectl configured for your cluster
```bash
# Clone the repository
git clone <repository-url>
cd kube-ai-stack

# Install the Helm chart
helm install my-ai-stack charts/kube-ai-stack

# Or install with custom configuration
helm install my-ai-stack charts/kube-ai-stack -f my-values.yaml
```

```bash
# Check deployment status
kubectl get deployments -n models

# Check services
kubectl get services -n models

# Check pod status
kubectl get pods -n models
```

The stack is highly configurable through Helm values. Key configuration areas include:
- Global Settings: Namespace, image repository, resource defaults, subcharts
- Model Definitions: Individual model configurations and parameters
- LiteLLM Settings: Gateway configuration and routing behavior
- Auto-scaling: Scaling policies and trigger conditions
- Resource Management: CPU, memory, and GPU allocation
- Storage: PVC configuration and storage class selection
- Subcharts: Customize subcharts like MLflow and Arize Phoenix
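Putting a few of those areas together, a custom `my-values.yaml` might look like the sketch below. All key names here are illustrative assumptions; consult the chart README for the actual values schema.

```yaml
# Illustrative my-values.yaml -- key names are assumptions, see the chart README.
global:
  namespace: models
  storageClass: fast-nvme     # storage class for model-cache PVCs
litellm:
  enabled: true               # unified gateway endpoint and routing
mlflow:
  enabled: true               # MLOps experiment tracking subchart
phoenix:
  enabled: false              # disable GenAI tracing if unused
```

An override file like this would be passed at install time with `helm install my-ai-stack charts/kube-ai-stack -f my-values.yaml`.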
For detailed configuration options, see the chart README.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
- In the short term, integrate the remaining subcharts leveraged in my homelab cluster: qdrant, litellm, openwebui, searxng
- In the medium term, add automated CI/CD, linting, and pre-commit hooks for testing and publishing images
- In the long term, add integration testing and publish getting-started docs