Kubernetes LLM Deployment Guide

This comprehensive guide walks you through deploying Large Language Models (LLMs) with Ollama on Azure Kubernetes Service (AKS). The setup includes the Ollama server (a REST API server for running LLMs) and the Open-WebUI client for easy interaction with your LLM.

Open-WebUI Interface

Prerequisites

  • Azure CLI installed and configured
  • kubectl installed
  • SSH key pair generated
  • zsh shell (commands may vary slightly in bash)

Project Structure

KubernetesLLM/
├── bicep/                      # Bicep Infrastructure as Code files
│   ├── main.bicep              # Main Bicep template for AKS deployment
│   ├── kubernetes-resources.bicep  # Bicep module for Kubernetes resources
│   └── main.parameters.json    # Parameters for Bicep deployment
├── images/                     # Documentation images
├── namespace.yaml              # Kubernetes namespace manifest
├── ollama-service.yaml         # Ollama service manifest
├── ollama-statefulset.yaml     # Ollama StatefulSet manifest
├── webui-deployment.yaml       # Open-WebUI deployment manifest
├── webui-ingress.yaml          # Open-WebUI ingress manifest
├── webui-pvc.yaml              # Open-WebUI persistent volume claim
├── webui-service.yaml          # Open-WebUI service manifest
├── deploy.sh                   # Deployment script for Bicep
├── my_ssh_key.pub              # SSH public key for AKS nodes
└── README.md                   # This documentation

Environment Setup

You can deploy this solution either with Azure CLI commands directly or with Azure Bicep for Infrastructure as Code.

Option 1: Using Azure CLI

Set the required environment variables:

export AKS_RG="llama3-aks-rg"
export AKS_NAME="llm-aks-cluster"

Deployment Steps

1. Create Azure Resource Group

az group create -n $AKS_RG -l eastus2

2. Create AKS Cluster

Note: Using the Standard_B2s VM size (2 vCPUs, 4GB RAM) for small LLM testing

az aks create -n $AKS_NAME -g $AKS_RG \
    --network-plugin azure \
    --network-plugin-mode overlay \
    -k 1.30.3 \
    --node-count 1 \
    --node-vm-size Standard_B2s \
    --ssh-key-value ./my_ssh_key.pub

3. Configure kubectl

az aks get-credentials -n $AKS_NAME -g $AKS_RG --overwrite-existing

4. Verify Cluster Connection

kubectl get nodes

5. Deploy Ollama and Open-WebUI

kubectl apply -f .
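
Before moving on, you can wait for the workloads to finish rolling out. The StatefulSet name below matches the ollama-0 pod used later in this guide; the Open-WebUI deployment name is an assumption based on the manifest filename, so verify it with kubectl get deploy -n ollama if the command reports nothing.

# Wait for the Ollama StatefulSet and the Open-WebUI deployment to become ready
kubectl rollout status statefulset/ollama -n ollama
kubectl rollout status deployment/open-webui -n ollama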

6. Monitor Deployment

Check all resources in the ollama namespace:

kubectl get all,pv,pvc -n ollama

7. Managing LLM Models

List running Ollama processes:

kubectl exec ollama-0 -n ollama -it -- ollama ps

Install and run an LLM model (example using llama3.2:3b):

kubectl exec ollama-0 -n ollama -it -- ollama run llama3.2:3b
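
To download a model without starting an interactive session, use ollama pull, and check what is already installed with ollama list:

# Download the model only (no interactive chat session)
kubectl exec ollama-0 -n ollama -it -- ollama pull llama3.2:3b
# List the models already installed on the server
kubectl exec ollama-0 -n ollama -it -- ollama list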

8. Access the Web Interface

Get the public IP for the Open-WebUI service:

kubectl get svc -n ollama

Navigate to the external IP of the Open-WebUI service in your browser to chat with the model.
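
You can also call the Ollama REST API directly, which is handy for scripting. The sketch below assumes the Ollama service is named ollama and listens on Ollama's default port 11434; adjust the service name to whatever kubectl get svc -n ollama reports.

# Forward the Ollama service to your workstation, then send a single prompt
kubectl port-forward svc/ollama -n ollama 11434:11434 &
curl http://localhost:11434/api/generate \
    -d '{"model": "llama3.2:3b", "prompt": "Say hello in one sentence.", "stream": false}'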

Here are some example models from the Ollama library that can be used:

Model               Parameters   Size     Download
Llama 3.1           8B           4.7GB    ollama run llama3.1
Llama 3.1           70B          40GB     ollama run llama3.1:70b
Llama 3.1           405B         231GB    ollama run llama3.1:405b
Phi 3 Mini          3.8B         2.3GB    ollama run phi3
Phi 3 Medium        14B          7.9GB    ollama run phi3:medium
Gemma 2             2B           1.6GB    ollama run gemma2:2b
Gemma 2             9B           5.5GB    ollama run gemma2
Gemma 2             27B          16GB     ollama run gemma2:27b
Mistral             7B           4.1GB    ollama run mistral
Moondream 2         1.4B         829MB    ollama run moondream
Neural Chat         7B           4.1GB    ollama run neural-chat
Starling            7B           4.1GB    ollama run starling-lm
Code Llama          7B           3.8GB    ollama run codellama
Llama 2 Uncensored  7B           3.8GB    ollama run llama2-uncensored
LLaVA               7B           4.5GB    ollama run llava
Solar               10.7B        6.1GB    ollama run solar

Important notes

  • The Ollama server runs on CPU only in this setup; it can also run on GPU or NPU if the nodes provide that hardware.
  • LLM models are large, so use a VM size with enough disk space.
  • Inference consumes significant memory and CPU, so choose a VM with enough of both.
  • The deployment uses Azure CNI networking with overlay mode
  • Minimum recommended VM size is Standard_B2s (2 vCPUs, 4GB RAM) for testing small LLMs
  • Adjust resources according to your LLM size and performance requirements (see the example after this list)
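
If you want to experiment with resource settings without editing ollama-statefulset.yaml, kubectl can adjust them on the running cluster. The StatefulSet name matches the ollama-0 pod used earlier; the CPU and memory values are purely illustrative, so size them to your model and node.

# Set illustrative requests/limits on the Ollama StatefulSet (triggers a rolling restart)
kubectl set resources statefulset/ollama -n ollama \
    --requests=cpu=500m,memory=2Gi \
    --limits=cpu=2,memory=3Gi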

Option 2: Using Azure Bicep

This project includes Azure Bicep templates in the bicep/ directory for deploying the entire infrastructure in a repeatable, version-controlled way.

  1. Make sure you have the latest Azure CLI installed with Bicep support:

    az bicep install
    az bicep upgrade
  2. Review and customize the parameters in bicep/main.parameters.json if needed.

  3. Run the deployment script:

    ./deploy.sh

The script will:

  • Create a resource group if it doesn't exist
  • Read your SSH public key from the my_ssh_key.pub file
  • Deploy the AKS cluster and Kubernetes resources using Bicep
  • Configure kubectl to connect to your new cluster
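
For reference, the core steps the script automates can also be run by hand with standard Azure CLI commands. This is only a sketch of the equivalent flow, not the script's exact contents (for instance, it omits the temporary parameters file that injects your SSH key):

# Roughly the flow deploy.sh automates
az group create -n "$AKS_RG" -l eastus2
az deployment group create \
    --resource-group "$AKS_RG" \
    --template-file bicep/main.bicep \
    --parameters @bicep/main.parameters.json
az aks get-credentials -n "$AKS_NAME" -g "$AKS_RG" --overwrite-existing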

Bicep Deployment Details

The Bicep deployment consists of the following files in the bicep/ directory:

  1. main.bicep - The main template that deploys the AKS cluster

    • Defines the AKS cluster with the specified VM size, node count, and Kubernetes version
    • Configures networking with Azure CNI in overlay mode
    • Sets up SSH access using the provided public key
  2. kubernetes-resources.bicep - A module that deploys the Kubernetes resources

    • Creates the ollama namespace
    • Deploys the Ollama StatefulSet and service
    • Sets up the Open-WebUI deployment, service, and persistent volume claim
    • Configures the necessary connections between components
  3. main.parameters.json - Parameters for the deployment

    • Defines default values for the AKS cluster name, location, VM size, etc.
    • Can be customized to match your requirements
  4. deploy.sh - A script in the root directory to simplify the deployment process

    • Creates the Azure resource group
    • Reads your SSH public key
    • Creates a temporary parameters file with your SSH key
    • Deploys the Bicep templates
    • Configures kubectl to connect to your cluster

Customizing the Bicep Deployment

You can customize the deployment by:

  1. Modifying the parameters in bicep/main.parameters.json
  2. Editing the deploy.sh script to change deployment variables
  3. Directly modifying the Bicep templates for more advanced customizations

For example, to deploy a larger VM size for running bigger LLM models, you can change the nodeVmSize parameter in the parameters file or deployment script.
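
As a concrete example, the VM size can also be overridden at deployment time without editing any files; the nodeVmSize parameter name comes from this guide, while the target size below is only an illustration.

# Later --parameters values override earlier ones
az deployment group create \
    --resource-group "$AKS_RG" \
    --template-file bicep/main.bicep \
    --parameters @bicep/main.parameters.json \
    --parameters nodeVmSize=Standard_D8s_v3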
