# moe
Here are 8 public repositories matching this topic...
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
Topics: bloom, falcon, moe, gemma, mistral, mixture-of-experts, model-quantization, multi-gpu-inference, m2m100, llamacpp, llm-inference, internlm, llama2, qwen, baichuan2, mixtral, phi-2, deepseek, minicpm
Updated Mar 15, 2024 · C++
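The repositories under this topic serve mixture-of-experts (MoE) models, which route each token through a small subset of experts chosen by a learned gate. Below is a minimal sketch of top-k gating in isolation; every name and shape in it is an illustrative assumption, not code from any listed repository.

```python
# Illustrative sketch: top-k gating as used in mixture-of-experts (MoE) layers.
# Each token is routed to the k experts with the highest gate scores, and the
# selected expert outputs are combined with softmax-normalized gate weights.
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """x: (d,) token vector; gate_w: (n_experts, d); experts: list of callables."""
    logits = gate_w @ x                          # one gate score per expert
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax over the selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 4 experts, each a random linear map over 8-dimensional tokens.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, d))
print(moe_forward(rng.normal(size=d), gate_w, experts))
```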
⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, + iOS iPhone app.
Updated Apr 3, 2026 · C++
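The entry above advertises an OpenAI-compatible API. A minimal client sketch follows, assuming the server is reachable at http://localhost:8080/v1 and serves a model named my-moe-model; both values are illustrative assumptions, not details taken from the repository.

```python
# Minimal sketch: POST a chat completion request to an OpenAI-compatible
# endpoint. The base URL, port, and model name are assumptions for
# illustration, not values taken from the repository.
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"   # assumed local server address
payload = {
    "model": "my-moe-model",             # assumed model identifier
    "messages": [
        {"role": "user", "content": "Explain mixture-of-experts in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```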
🚀 Run any LLM on any hardware. 130% faster MoE inference with ExpertFlow + TurboQuant KV compression. Ollama-compatible API. Built on llama.cpp.
Topics: machine-learning, ai, cpp, gpu, optimization, cuda, inference, moe, quantization, rocm, nvidia-gpu, amd-gpu, mixture-of-experts, openai-api, llm, llama-cpp, local-llm, llm-inference, ollama
Updated Apr 1, 2026 · C++
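This entry advertises an Ollama-compatible API built on llama.cpp. Below is a minimal sketch of a non-streaming request to the standard Ollama generate endpoint, assuming the server listens on the default Ollama port 11434 and hosts a model named mixtral; the port and model name are assumptions, not taken from the repository.

```python
# Minimal sketch: call an Ollama-compatible /api/generate endpoint.
# The host, port, and model name are assumptions for illustration;
# 11434 is the default Ollama port, which a compatible server may also use.
import json
import urllib.request

payload = {
    "model": "mixtral",        # assumed model name
    "prompt": "Briefly describe mixture-of-experts routing.",
    "stream": False,           # request a single JSON response instead of a stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```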