Terraform/Ansible code to build the ARC LLM (and image generation) hosting
This is a slightly involved process.
You will need to get a tenancy with GPUs attached to it. Once you have done that, there is terraform code in inference-host which will build your VMs. Having created the VMs, you need to use attach_gpu.py to attach a GPU to each VM, RESTART NOT REBOOT the VMs so that they migrate to the right machine and then use ansible to install the base environment. Finally, you need to use the appropriate roles to install the right containers and set them to run as daemons.
Some things to note:
- It is assumed that the image is in your tenancy. This is because when I started there were no images in the general store on
sl-g01. usernameis just a tagkeynamesets which SSH key is injected into your VMs.
You may also want to modify the tags that configure ingress because by default the subdomains are inf01, inf02 etc.
cd inference-host
terraform init
terraform apply
We don't have a standardised nice way of doing this so I have written a fairly shonky Python script which does it.
python3 attach_gpu.py <VM id> <GPU id>
e.g.
python3 attach_gpu.py uccaoke-inference-01-e3190ff38b sl-g01-10-000031000
This script will ask Condenser what type that GPU is and add it to the kubevirt config for the VM.
You then need to RESTART NOT REBOOT the VM, either through the Rancher UI or through kubectl virt/virtctl
As with other things I have built:
ansible-playbook -i generate_inventory.py full.yaml
This will:
- Install Docker
- Install Python3.12 (not actually necessary now development has been done)
- Install the Nvdia drivers and useful tools.
- Install and configure Ngnix as a transparent proxy
Step 4. is important for a number of reasons:
- VLLM only supports one API key for authentication. This is bad.
- Worse, for "reasons", the API key only protects
/v1/API calls, and not any of the other API calls, like the ones that let you shut down the server.
So what this does is configure Nginx in the following way:
- Use TLS so that traffic to ingress is encrypted.
- Block all requests except those that include a bearer token from
/etc/nginx/nginx.authwhich is auto-generated by ansible. By default it creates 3 API keys and puts copies in your ansible build machine in.apikey1,.apikey2and.apikey3. - Proxy requests to
/v1to port 5500 on localhost. This is where we will put our vllm/image server docker containers - Configure logging so that
/var/log/nginx/llm.loghas a log of which key was used for each request.
It is worth changing the contents of /etc/nginx/nginx.auth so that keys are named so that they appear in the log. You can also add/delete new keys.
For example:
map $http_authorization $api_user {
default "__unauthenticated__";
"Bearer HMSWarlockIsBest" "keith.drummond";
"Bearer ILoveHMSSaracen" "richard.chesnaye";
}
This would result in requests using the bearer token "ILovedHMSSaracen" being able to access /v1/ on the Docker container and being logged by nginx as from richard.chesnaye and so on. The default item is the "fail" condition, passing "__unauthenticated__" as the api_user which means it has NO access to anything. DO NOT REMOVE THIS ENTRY
This is obviously not a "production ready" scalable setup as that would require some sort of service to manage and issue tokens.
If you look at the role roles/vllm-qwen-coder you should see a role that can deploy the vllm docker container serving an OpenAPI compatible endpoint.
NOTE: this is super not production ready! I wrote this with some help from Claude!
Image serving is done by building a docker container that uses FastAPI and the Huggingface diffusers library to serve a model - by default Stable Diffusion XL Turbo which is defined for fast inferences.
You can configure the model by either editing roles/diffusers-sdxl/files/imagesrv.ini before building the container, or by editing it on the VM and restarting the container as it is bind-mounted into the container. For models like Flux pay attention to the setting of variant - this should be empty.
By default, the containers are run as daemons and set to restart unless explicitly stopped. The container is called inference so you can manage it knowing this fact.
For example, to watch vllm's logs on a vllm server:
docker logs inference
There is a very simple chat client in examples/chat
Modify llm.ini appropriately, e.g. Keith Drummond in the authorisation example might make it look like:
[OPENAI]
endpoint = https://inf01.arc-llm.condenser.arc.ucl.ac.uk/v1
model = Qwen-Coder
api_key = HMSWarlockIsBest
Create a virtualenv, install requirements and run the client:
python3.12 -m venv runtime
source runtime/bin/activate
pip install -r requirements.txt
<...>
python3 chat.py
Starting up - LLM endpoint = https://inf01.arc-llm.condenser.arc.ucl.ac.uk/v1/Qwen-Coder
? hello
---
🤖 : Hello! How can I help you today?
--- [2.8368186950683594 seconds]
If he wanted to use the LLM endpoint with Claude code he should set:
export ANTHROPIC_BASE_URL=https://inf01.arc-llm.condenser.arc.ucl.ac.uk/
export ANTHROPIC_MODEL=Qwen-Coder
export ANTHROPIC_AUTH_TOKEN=HMSWarlockIsBest
export ANTHROPIC_API_KEY=dummykey
The last variable (ANTHROPIC_API_KEY) is important the first time he runs claude because it bypasses the part of the claude setup that requires you to log in. He should unset it thereafter otherwise it will conflict with ANTHROPIC_AUTH_TOKEN.
To use these LLMs with OpenCode, update your opencode.json configuration file on these lines:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ucl-arc-qwen": {
"npm": "@ai-sdk/openai-compatible",
"name": "UCL ARC Qwen",
"options": {
"baseURL": "https://inf01.arc-llm.condenser.arc.ucl.ac.uk/v1",
"apiKkey": "{env:UCL_ARC_API_KEY}"
},
"models": {
"Qwen": {
"name": "UCL/Qwen/Qwen3.6-35B-A3B-FP8"
}
}
},
"ucl-arc-qwen-coder": {
"npm": "@ai-sdk/openai-compatible",
"name": "UCL ARC Qwen-Coder",
"options": {
"baseURL": "https://inf03.arc-llm.condenser.arc.ucl.ac.uk/v1",
"apiKkey": "{env:UCL_ARC_API_KEY}"
},
"models": {
"Qwen-Coder": {
"name": "UCL/Qwen3-Coder-30B-A3B-Instruct-FP8"
}
}
}
}
}
It will read your API key from the UCL_ARC_API_KEY environment variable.
You can also mention your key here, or you can provide it to the /connect command, in which case it will store it in ~/.local/share/opencode/auth.json (on Linux).
Since these methods store your key as plain-text, they are insecure and not recommended.
Note that the ID you set when you run the /connect command should match the ID you set in the JSON configuration file (ucl-arc-qwen, for example).
When you run /models in OpenCode, you should see the models listed.
You can also call the image service using the openai python library.
token = "HMSWarlockIsBest"
from openai import OpenAI
import base64
import io
from PIL import Image
endpoint = "https://inf01.arc-llm.condenser.arc.ucl.ac.uk/"
client = OpenAI(base_url=f"{endpoint}/v1", api_key=token)
response = client.images.generate(prompt="a white cup of pink tea", n=6, size="512x512")
images = []
for a in response.data:
images.append(Image.open(io.BytesIO(base64.b64decode(a.b64_json))))
images will then be a list of PIL/Pillow images you can do whatever you want with.