arc-llms

Terraform/Ansible code to build the ARC LLM (and image generation) hosting

This is a slightly involved process.

You will need to get a tenancy with GPUs attached to it. Once you have done that, there is terraform code in inference-host which will build your VMs. Having created the VMs, you need to use attach_gpu.py to attach a GPU to each VM, RESTART NOT REBOOT the VMs so that they migrate to the right machine and then use ansible to install the base environment. Finally, you need to use the appropriate roles to install the right containers and set them to run as daemons.

Walkthrough

1. Obtain tenancy and GPU IDs.

2. Modify `inference-host/variables.tf` to suit your environment.

Some things to note:

It is assumed that the image is in your tenancy. This is because when I started there were no images in the general store on sl-g01.
username is just a tag
keyname sets which SSH key is injected into your VMs.

You may also want to modify the tags that configure ingress because by default the subdomains are inf01, inf02 etc.

3. Use Terraform to build your VMs.

cd inference-host 
terraform init
terraform apply

4. Attach GPUs to VMs

We don't have a standardised nice way of doing this so I have written a fairly shonky Python script which does it.

python3 attach_gpu.py <VM id> <GPU id>

e.g.

python3 attach_gpu.py uccaoke-inference-01-e3190ff38b sl-g01-10-000031000

This script will ask Condenser what type that GPU is and add it to the kubevirt config for the VM.

You then need to RESTART NOT REBOOT the VM, either through the Rancher UI or through kubectl virt/virtctl

5. Install the base software

As with other things I have built:

ansible-playbook -i generate_inventory.py full.yaml

This will:

Install Docker
Install Python3.12 (not actually necessary now development has been done)
Install the Nvdia drivers and useful tools.
Install and configure Ngnix as a transparent proxy

Step 4. is important for a number of reasons:

VLLM only supports one API key for authentication. This is bad.
Worse, for "reasons", the API key only protects /v1/ API calls, and not any of the other API calls, like the ones that let you shut down the server.

So what this does is configure Nginx in the following way:

Use TLS so that traffic to ingress is encrypted.
Block all requests except those that include a bearer token from /etc/nginx/nginx.auth which is auto-generated by ansible. By default it creates 3 API keys and puts copies in your ansible build machine in .apikey1, .apikey2 and .apikey3.
Proxy requests to /v1 to port 5500 on localhost. This is where we will put our vllm/image server docker containers
Configure logging so that /var/log/nginx/llm.log has a log of which key was used for each request.

It is worth changing the contents of /etc/nginx/nginx.auth so that keys are named so that they appear in the log. You can also add/delete new keys.

For example:

    map $http_authorization $api_user {
         default                     "__unauthenticated__";
         "Bearer HMSWarlockIsBest"   "keith.drummond";
         "Bearer ILoveHMSSaracen"    "richard.chesnaye";
    }

This would result in requests using the bearer token "ILovedHMSSaracen" being able to access /v1/ on the Docker container and being logged by nginx as from richard.chesnaye and so on. The default item is the "fail" condition, passing "__unauthenticated__" as the api_user which means it has NO access to anything. DO NOT REMOVE THIS ENTRY

This is obviously not a "production ready" scalable setup as that would require some sort of service to manage and issue tokens.

6. Apply roles for different inference servers

LLMs

If you look at the role roles/vllm-qwen-coder you should see a role that can deploy the vllm docker container serving an OpenAPI compatible endpoint.

Image serving

NOTE: this is super not production ready! I wrote this with some help from Claude!

Image serving is done by building a docker container that uses FastAPI and the Huggingface diffusers library to serve a model - by default Stable Diffusion XL Turbo which is defined for fast inferences.

You can configure the model by either editing roles/diffusers-sdxl/files/imagesrv.ini before building the container, or by editing it on the VM and restarting the container as it is bind-mounted into the container. For models like Flux pay attention to the setting of variant - this should be empty.

Managing the containers

By default, the containers are run as daemons and set to restart unless explicitly stopped. The container is called inference so you can manage it knowing this fact.

For example, to watch vllm's logs on a vllm server:

docker logs inference

7. Connecting to your endpoints

Using the LLM:

There is a very simple chat client in examples/chat

Modify llm.ini appropriately, e.g. Keith Drummond in the authorisation example might make it look like:

[OPENAI]
endpoint = https://inf01.arc-llm.condenser.arc.ucl.ac.uk/v1
model = Qwen-Coder
api_key = HMSWarlockIsBest

Create a virtualenv, install requirements and run the client:

python3.12 -m venv runtime
source runtime/bin/activate
pip install -r requirements.txt
<...>
python3 chat.py
Starting up - LLM endpoint = https://inf01.arc-llm.condenser.arc.ucl.ac.uk/v1/Qwen-Coder
? hello
---
🤖 : Hello! How can I help you today?
--- [2.8368186950683594 seconds]

If he wanted to use the LLM endpoint with Claude code he should set:

export ANTHROPIC_BASE_URL=https://inf01.arc-llm.condenser.arc.ucl.ac.uk/
export ANTHROPIC_MODEL=Qwen-Coder
export ANTHROPIC_AUTH_TOKEN=HMSWarlockIsBest
export ANTHROPIC_API_KEY=dummykey

The last variable (ANTHROPIC_API_KEY) is important the first time he runs claude because it bypasses the part of the claude setup that requires you to log in. He should unset it thereafter otherwise it will conflict with ANTHROPIC_AUTH_TOKEN.

OpenCode

To use these LLMs with OpenCode, update your opencode.json configuration file on these lines:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ucl-arc-qwen": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "UCL ARC Qwen",
      "options": {
        "baseURL": "https://inf01.arc-llm.condenser.arc.ucl.ac.uk/v1",
        "apiKkey": "{env:UCL_ARC_API_KEY}"
      },
      "models": {
        "Qwen": {
          "name": "UCL/Qwen/Qwen3.6-35B-A3B-FP8"
        }
      }
    },
    "ucl-arc-qwen-coder": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "UCL ARC Qwen-Coder",
      "options": {
        "baseURL": "https://inf03.arc-llm.condenser.arc.ucl.ac.uk/v1",
        "apiKkey": "{env:UCL_ARC_API_KEY}"
      },
      "models": {
        "Qwen-Coder": {
          "name": "UCL/Qwen3-Coder-30B-A3B-Instruct-FP8"
        }
      }
    }
  }
}

It will read your API key from the UCL_ARC_API_KEY environment variable.

You can also mention your key here, or you can provide it to the /connect command, in which case it will store it in ~/.local/share/opencode/auth.json (on Linux). Since these methods store your key as plain-text, they are insecure and not recommended.

Note that the ID you set when you run the /connect command should match the ID you set in the JSON configuration file (ucl-arc-qwen, for example). When you run /models in OpenCode, you should see the models listed.

Using the image service

You can also call the image service using the openai python library.

token = "HMSWarlockIsBest"
from openai import OpenAI
import base64
import io
from PIL import Image
endpoint = "https://inf01.arc-llm.condenser.arc.ucl.ac.uk/"
client = OpenAI(base_url=f"{endpoint}/v1", api_key=token)
response = client.images.generate(prompt="a white cup of pink tea", n=6, size="512x512")
images = []
for a in response.data:
    images.append(Image.open(io.BytesIO(base64.b64decode(a.b64_json))))

images will then be a list of PIL/Pillow images you can do whatever you want with.

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
examples/chat		examples/chat
inference-host		inference-host
litellm		litellm
openwebui		openwebui
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

arc-llms

Walkthrough

1. Obtain tenancy and GPU IDs.

2. Modify `inference-host/variables.tf` to suit your environment.

3. Use Terraform to build your VMs.

4. Attach GPUs to VMs

5. Install the base software

6. Apply roles for different inference servers

LLMs

Image serving

Managing the containers

7. Connecting to your endpoints

Using the LLM:

OpenCode

Using the image service

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

arc-llms

Walkthrough

1. Obtain tenancy and GPU IDs.

2. Modify inference-host/variables.tf to suit your environment.

3. Use Terraform to build your VMs.

4. Attach GPUs to VMs

5. Install the base software

6. Apply roles for different inference servers

LLMs

Image serving

Managing the containers

7. Connecting to your endpoints

Using the LLM:

OpenCode

Using the image service

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

2. Modify `inference-host/variables.tf` to suit your environment.

Packages