This guide walks you through deploying Large Language Models (LLMs) with Ollama on Azure Kubernetes Service (AKS). The setup includes both the Ollama server (a REST API server for running LLM models) and the Open-WebUI client for easy interaction with your LLM.
- Azure CLI installed and configured
- kubectl installed
- SSH key pair generated
- zsh shell (bash commands may vary slightly)
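If you still need to generate the SSH key pair, or want to verify your tooling before starting, the following commands are one way to do it (the `my_ssh_key` file name simply matches the key referenced later in this guide):

```bash
# Check that the Azure CLI and kubectl are installed
az version
kubectl version --client

# Generate an SSH key pair if you don't already have one
# (creates my_ssh_key and my_ssh_key.pub in the current directory)
ssh-keygen -t rsa -b 4096 -f ./my_ssh_key -N ""
```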
```
KubernetesLLM/
├── bicep/                          # Bicep Infrastructure as Code files
│   ├── main.bicep                  # Main Bicep template for AKS deployment
│   ├── kubernetes-resources.bicep  # Bicep module for Kubernetes resources
│   └── main.parameters.json        # Parameters for Bicep deployment
├── images/                         # Documentation images
├── namespace.yaml                  # Kubernetes namespace manifest
├── ollama-service.yaml             # Ollama service manifest
├── ollama-statefulset.yaml         # Ollama StatefulSet manifest
├── webui-deployment.yaml           # Open-WebUI deployment manifest
├── webui-ingress.yaml              # Open-WebUI ingress manifest
├── webui-pvc.yaml                  # Open-WebUI persistent volume claim
├── webui-service.yaml              # Open-WebUI service manifest
├── deploy.sh                       # Deployment script for Bicep
├── my_ssh_key.pub                  # SSH public key for AKS nodes
└── README.md                       # This documentation
```
You can deploy this solution either with Azure CLI commands directly or with Azure Bicep for Infrastructure as Code.
Set the required environment variables:
```bash
export AKS_RG="llama3-aks-rg"
export AKS_NAME="llm-aks-cluster"
```

Create the resource group:

```bash
az group create -n $AKS_RG -l eastus2
```

Note: this guide uses the Standard_B2s VM size (2 vCPUs, 4 GB RAM) for small LLM testing.
Create the AKS cluster:

```bash
az aks create -n $AKS_NAME -g $AKS_RG \
  --network-plugin azure \
  --network-plugin-mode overlay \
  -k 1.30.3 \
  --node-count 1 \
  --node-vm-size Standard_B2s \
  --ssh-key-value ./my_ssh_key.pub
```

Retrieve the cluster credentials:

```bash
az aks get-credentials -n $AKS_NAME -g $AKS_RG --overwrite-existing
```

Verify that the node is ready:

```bash
kubectl get nodes
```

Deploy the Kubernetes manifests:

```bash
kubectl apply -f .
```

Check all resources in the ollama namespace:
```bash
kubectl get all,pv,pvc -n ollama
```

List the running Ollama processes:

```bash
kubectl exec ollama-0 -n ollama -it -- ollama ps
```

Install and run an LLM model (example using llama3.2:3b):

```bash
kubectl exec ollama-0 -n ollama -it -- ollama run llama3.2:3b
```

Get the public IP of the Open-WebUI service:

```bash
kubectl get svc -n ollama
```

You can now navigate to the public IP of the client service in a browser to chat with the model.
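Since Ollama exposes a REST API (as mentioned at the top of this guide), you can also query the model without the Web UI. The sketch below assumes the service is named `ollama` (as suggested by `ollama-service.yaml`) and listens on Ollama's default port 11434, and that the model has already been pulled:

```bash
# Forward the Ollama service to your local machine
# (assumes the Kubernetes service is named "ollama")
kubectl port-forward svc/ollama 11434:11434 -n ollama &

# Send a prompt to the Ollama REST API (assumes llama3.2:3b is already pulled)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```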
Here are some example models from the Ollama library that you can run:
| Model | Parameters | Size | Download |
|---|---|---|---|
| Llama 3.1 | 8B | 4.7GB | ollama run llama3.1 |
| Llama 3.1 | 70B | 40GB | ollama run llama3.1:70b |
| Llama 3.1 | 405B | 231GB | ollama run llama3.1:405b |
| Phi 3 Mini | 3.8B | 2.3GB | ollama run phi3 |
| Phi 3 Medium | 14B | 7.9GB | ollama run phi3:medium |
| Gemma 2 | 2B | 1.6GB | ollama run gemma2:2b |
| Gemma 2 | 9B | 5.5GB | ollama run gemma2 |
| Gemma 2 | 27B | 16GB | ollama run gemma2:27b |
| Mistral | 7B | 4.1GB | ollama run mistral |
| Moondream 2 | 1.4B | 829MB | ollama run moondream |
| Neural Chat | 7B | 4.1GB | ollama run neural-chat |
| Starling | 7B | 4.1GB | ollama run starling-lm |
| Code Llama | 7B | 3.8GB | ollama run codellama |
| Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
| LLaVA | 7B | 4.5GB | ollama run llava |
| Solar | 10.7B | 6.1GB | ollama run solar |
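Any of these models can be pre-downloaded into the running pod without opening an interactive chat session, for example:

```bash
# Pre-download a model without starting an interactive session
kubectl exec ollama-0 -n ollama -it -- ollama pull phi3

# List the models currently stored in the Ollama pod
kubectl exec ollama-0 -n ollama -it -- ollama list
```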
- The `ollama` server runs on CPU only in this setup, but it can also run on a GPU or NPU.
- LLM model files are large, so it is recommended to use a VM with plenty of disk space.
- During inference, the model consumes a lot of memory and CPU, so a VM with ample memory and CPU is recommended (the commands after this list show how to check actual usage).
- The deployment uses Azure CNI networking with overlay mode.
- The minimum recommended VM size for testing small LLMs is Standard_B2s (2 vCPUs, 4 GB RAM).
- Adjust resources according to your LLM size and performance requirements.
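To see how much memory and CPU the model is actually consuming during inference, you can use the Kubernetes metrics commands below (assuming the metrics server is available in your cluster, which it is by default on AKS):

```bash
# CPU and memory usage of the pods in the ollama namespace
kubectl top pods -n ollama

# Overall node utilization
kubectl top nodes
```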
This project includes Azure Bicep templates in the `bicep/` directory for deploying the entire infrastructure in a repeatable, version-controlled way.
- Make sure you have the latest Azure CLI installed with Bicep support:

  ```bash
  az bicep install
  az bicep upgrade
  ```

- Review and customize the parameters in `bicep/main.parameters.json` if needed.

- Run the deployment script:

  ```bash
  ./deploy.sh
  ```
The script will:
- Create a resource group if it doesn't exist
- Read your SSH public key from the `my_ssh_key.pub` file
- Deploy the AKS cluster and Kubernetes resources using Bicep
- Configure kubectl to connect to your new cluster
The Bicep deployment consists of the following files:
- `bicep/main.bicep` - The main template that deploys the AKS cluster
  - Defines the AKS cluster with the specified VM size, node count, and Kubernetes version
  - Configures networking with Azure CNI in overlay mode
  - Sets up SSH access using the provided public key
- `bicep/kubernetes-resources.bicep` - A module that deploys the Kubernetes resources
  - Creates the ollama namespace
  - Deploys the Ollama StatefulSet and service
  - Sets up the Open-WebUI deployment, service, and persistent volume claim
  - Configures the necessary connections between components
- `bicep/main.parameters.json` - Parameters for the deployment
  - Defines default values for the AKS cluster name, location, VM size, etc.
  - Can be customized to match your requirements
- `deploy.sh` - A script in the root directory to simplify the deployment process (a rough sketch is shown after this list)
  - Creates the Azure resource group
  - Reads your SSH public key
  - Creates a temporary parameters file with your SSH key
  - Deploys the Bicep templates
  - Configures kubectl to connect to your cluster
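As a rough illustration, a script that performs these steps could look like the sketch below. This is not the exact content of `deploy.sh`; in particular, the `sshPublicKey` parameter name and the inline parameter override are assumptions, and the real script creates a temporary parameters file instead:

```bash
#!/bin/bash
set -euo pipefail

AKS_RG="llama3-aks-rg"
LOCATION="eastus2"

# Create the resource group if it doesn't exist
az group create -n "$AKS_RG" -l "$LOCATION"

# Read the SSH public key that will be injected into the AKS nodes
SSH_KEY=$(cat ./my_ssh_key.pub)

# Deploy the AKS cluster and Kubernetes resources with Bicep;
# "sshPublicKey" is an assumed parameter name -- check main.bicep for the real one
az deployment group create \
  -g "$AKS_RG" \
  -f bicep/main.bicep \
  -p bicep/main.parameters.json \
  -p sshPublicKey="$SSH_KEY"

# Configure kubectl to talk to the new cluster
az aks get-credentials -n llm-aks-cluster -g "$AKS_RG" --overwrite-existing
```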
You can customize the deployment by:
- Modifying the parameters in `bicep/main.parameters.json`
- Editing the `deploy.sh` script to change deployment variables
- Directly modifying the Bicep templates for more advanced customizations
For example, to deploy a larger VM size for running bigger LLM models, you can change the `nodeVmSize` parameter in the parameters file or deployment script.
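For instance, if you invoke the Bicep deployment directly rather than through the script, an override on the command line might look like this (`Standard_D8s_v3` is only an example size; choose one that fits your model):

```bash
# Deploy with a larger node size for bigger models (example value)
az deployment group create \
  -g llama3-aks-rg \
  -f bicep/main.bicep \
  -p bicep/main.parameters.json \
  -p nodeVmSize=Standard_D8s_v3
```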