worthmining/olla

Olla - Smart LLM Load Balancer & Proxy

Native support: Ollama, LM Studio, vLLM, SGLang, LiteLLM, Lemonade SDK
OpenAI-compatible: LM Deploy


Recorded with VHS - see demo tape

Documentation • Issues • Releases

Important

Olla is currently in active development. While it is usable, we are still finalising some features and optimisations. Your feedback is invaluable! Open an issue and let us know which features you'd like to see in the future.

Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes - including Ollama, LM Studio and OpenAI-compatible endpoints like vLLM. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
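To make the idea of a unified model catalogue more concrete, here is a minimal Python sketch. It is not Olla's implementation (Olla is written in Go); the `Endpoint`, `build_catalogue` and `Router` names, the model names and the round-robin selection are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Endpoint:
    """Hypothetical inference node: a name, the models it serves, a health flag."""
    name: str
    models: set
    healthy: bool = True


def build_catalogue(endpoints):
    """Map each model name to the healthy endpoints that can serve it."""
    catalogue = {}
    for ep in endpoints:
        if not ep.healthy:
            continue  # unhealthy nodes never appear in the catalogue
        for model in sorted(ep.models):
            catalogue.setdefault(model, []).append(ep)
    return catalogue


class Router:
    """Round-robin over the healthy endpoints that advertise the requested model."""

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self._counter = count()

    def route(self, model):
        candidates = build_catalogue(self.endpoints).get(model, [])
        if not candidates:
            raise LookupError(f"no healthy endpoint serves {model!r}")
        return candidates[next(self._counter) % len(candidates)]
```

With two nodes that both serve `llama3`, repeated calls to `Router.route("llama3")` alternate between them, while a model served by only one healthy node always lands on that node.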

Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability or Olla for maximum performance with advanced features like circuit breakers and connection pooling.
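For readers unfamiliar with circuit breakers and failover, the following Python sketch shows the general pattern. It is an illustration under stated assumptions, not the Olla engine's code: the `CircuitBreaker` class, the `call_with_failover` helper, and the threshold and cooldown defaults are hypothetical.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows trial requests after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a request may be sent to the endpoint."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let trial requests through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            # Open the circuit, or re-open it after a failed trial request.
            self.opened_at = self.clock()


def call_with_failover(endpoints, breakers, send):
    """Try endpoints in order, skipping open circuits; return the first successful reply."""
    for ep in endpoints:
        breaker = breakers[ep]
        if not breaker.allow():
            continue
        try:
            reply = send(ep)
        except OSError:  # includes ConnectionError and timeouts
            breaker.record_failure()
            continue
        breaker.record_success()
        return reply
    raise RuntimeError("all endpoints are unavailable")
```

The point of the breaker is that a failing node stops receiving traffic for a while instead of slowing down every request, and the failover loop keeps the client unaware of the outage as long as one endpoint is still healthy.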

A single CLI application and a config file are all you need to go Olla!

Key Features

Platform Support

Olla runs on multiple platforms and architectures:

| Platform | AMD64 | ARM64 | Notes |
|----------|-------|-------|-------|
| Linux | ✅ | ✅ | Full support including Raspberry Pi 4+ |
| macOS | ✅ | ✅ | Intel and Apple Silicon (M1/M2/M3/M4) |
| Windows | ✅ | ✅ | Windows 10/11 and Windows on ARM |
| Docker | ✅ | ✅ | Multi-architecture images (amd64/arm64) |
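A tool that needs to pick the right build could map the host it runs on to a row and column of the table above. The Python sketch below shows one way to do that with the standard `platform` module; the `detect_platform` function and the alias tables are hypothetical and do not describe how Olla's own tooling detects platforms.

```python
import platform

# Names that different operating systems report for the two architectures in the table.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}
OS_ALIASES = {"linux": "linux", "darwin": "macos", "windows": "windows"}


def detect_platform(system=None, machine=None):
    """Return (os, arch) using lowercase labels from the table, or raise if unsupported."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    try:
        return OS_ALIASES[system], ARCH_ALIASES[machine]
    except KeyError:
        raise ValueError(f"unsupported platform: {system}/{machine}") from None
```

Calling `detect_platform()` with no arguments inspects the current host; passing explicit values, such as `detect_platform("Linux", "aarch64")`, is useful for testing.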

Quick Start

Installation

# Download latest release (auto-detects your platform)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)
# Docker (automatically pulls correct architecture)
docker run -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Or explicitly specify platform (e.g., for ARM64)
docker run --platform linux/arm64 -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest
# Install via Go
go install github.com/thushan/olla@latest
# Build from source
git clone https://github.com/thushan/olla.git && cd olla && make build-release
# Run Olla
./bin/olla

When you have everything running, you can check it's all working with:

# Check health of Olla
curl http://localhost:40114/internal/health

# Check endpoints
curl http://localhost:40114/internal/status/endpoints

# Check models available
curl http://localhost:40114/internal/status/models
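
The same three checks can be scripted, for example as a readiness probe. This Python sketch uses only the standard library and the URLs from the curl commands above; the function names are hypothetical, and because the response format is not described here, the code assumes it may or may not be JSON and handles both.

```python
import json
import urllib.error
import urllib.request

# Paths taken from the curl commands above.
STATUS_PATHS = (
    "/internal/health",
    "/internal/status/endpoints",
    "/internal/status/models",
)


def status_urls(base="http://localhost:40114"):
    """Build the full URL for each status path."""
    return [base.rstrip("/") + path for path in STATUS_PATHS]


def fetch(url, timeout=5.0):
    """Return (ok, body); body is parsed JSON when possible, else raw text or the error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        # Connection failures and non-2xx responses both end up here.
        return False, str(exc)
    try:
        return True, json.loads(text)
    except json.JSONDecodeError:
        return True, text


def check_all(base="http://localhost:40114", fetcher=fetch):
    """Query every status URL and return {url: (ok, body)}."""
    return {url: fetcher(url) for url in status_urls(base)}
```

A short loop such as `for url, (ok, body) in check_all().items(): print(url, "OK" if ok else "FAILED")` prints one line per check.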

For detailed installation and deployment options, see the Getting Started Guide.

Examples

We've also got ready-to-use Docker Compose setups for common scenarios:

Common Architectures

  • Home Lab: Olla → Multiple Ollama (or OpenAI-compatible, e.g. vLLM) instances across your machines
  • Hybrid Cloud: Olla → Local endpoints + LiteLLM → Cloud APIs (OpenAI, Anthropic, Bedrock, etc.)
  • Enterprise: Olla → GPUStack cluster + vLLM servers + LiteLLM (cloud overflow)
  • Development: Olla → Local + Shared team endpoints + LiteLLM (API access)

See integration patterns for detailed architectures.

🌐 OpenWebUI Integration

A complete setup with OpenWebUI and Olla that load-balances multiple Ollama instances or unifies all OpenAI-compatible models.

  • See: examples/ollama-openwebui/
  • Services: OpenWebUI (web UI) + Olla (proxy/load balancer)
  • Use Case: Web interface with intelligent load balancing across multiple Ollama servers
  • Quick Start:
    cd examples/ollama-openwebui
    # Edit olla.yaml to configure your Ollama endpoints
    docker compose up -d
    # Access OpenWebUI at http://localhost:3000

You can learn more about OpenWebUI Ollama with Olla or see OpenWebUI OpenAI with Olla.

Documentation

Full documentation is available at https://thushan.github.io/olla/

🀝 Contributing

We welcome contributions! Please open an issue first to discuss major changes.

🤖 AI Disclosure

This project has been built with the assistance of AI tools. We've utilised GitHub Copilot, Anthropic Claude, JetBrains Junie and OpenAI ChatGPT for documentation, code reviews, test refinement and troubleshooting.

🙏 Acknowledgements

📄 License

Licensed under the Apache License 2.0. See LICENSE for details.

🎯 Roadmap

  • Circuit breakers: Advanced fault tolerance (Olla engine)
  • Connection pooling: Per-endpoint connection management (Olla engine)
  • Object pooling: Reduced GC pressure for high throughput (Olla engine)
  • Model routing: Route based on model requested
  • Authenticated Endpoints: Support calling authenticated endpoints (bearer) like OpenAI/Groq/OpenRouter as endpoints
  • Auto endpoint discovery: Add endpoints, let Olla determine the type
  • Model benchmarking: Benchmark models across multiple endpoints easily
  • Metrics export: Prometheus/OpenTelemetry integration
  • Dynamic configuration: API-driven endpoint management
  • TLS termination: Built-in SSL support
  • Olla Admin Panel: View Olla metrics easily within the browser
  • Model caching: Intelligent model preloading
  • Advanced Connection Management: Authenticated endpoints (via SSH tunnels, OAuth, Tokens)
  • OpenRouter Support: Support OpenRouter calls within Olla (divert to free models on OpenRouter etc)

Let us know what you want to see!


Made with ❤️ for the LLM community

🏠 Homepage • 📖 Documentation • 🐛 Issues • 🚀 Releases

About

Lightweight and fast AI inference proxy for self-hosted LLM backends like Ollama, LM Studio and others. Designed for speed, simplicity and local-first deployments.
