94 models. 30 providers. One command to find what runs on your hardware.
A terminal tool that right-sizes LLMs to your system's RAM, CPU, and GPU. It detects your hardware, scores each model across quality, speed, fit, and context dimensions, and tells you which ones will actually run well on your machine.
Ships with an interactive TUI (default) and a classic CLI mode. Supports multi-GPU setups, MoE architectures, dynamic quantization selection, and speed estimation.
(Demo screenshots: a medium performance home laptop, and models with Mixture-of-Experts architectures.)

Quick install -- downloads the latest release binary from GitHub and installs it to /usr/local/bin (or ~/.local/bin):

curl -fsSL https://llmfit.axjns.dev/install.sh | sh

Or via Homebrew:

brew tap AlexsJones/llmfit
brew install llmfit

Or via Cargo:

cargo install llmfit

Or build from source:

git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# binary is at target/release/llmfit

llmfit

Launches the interactive terminal UI. Your system specs (CPU, RAM, GPU name, VRAM, backend) are shown at the top. Models are listed in a scrollable table sorted by composite score. Each row shows the model's score, estimated tok/s, best quantization for your hardware, run mode, memory usage, and use-case category.
| Key | Action |
|---|---|
| Up / Down or j / k | Navigate models |
| / | Enter search mode (partial match on name, provider, params, use case) |
| Esc or Enter | Exit search mode |
| Ctrl-U | Clear search |
| f | Cycle fit filter: All, Runnable, Perfect, Good, Marginal |
| 1-9 | Toggle provider visibility |
| Enter | Toggle detail view for selected model |
| PgUp / PgDn | Scroll by 10 |
| g / G | Jump to top / bottom |
| q | Quit |
Use --cli or any subcommand to get classic table output:
# Table of all models ranked by fit
llmfit --cli
# Only perfectly fitting models, top 5
llmfit fit --perfect -n 5
# Show detected system specs
llmfit system
# List all models in the database
llmfit list
# Search by name, provider, or size
llmfit search "llama 8b"
# Detailed view of a single model
llmfit info "Mistral-7B"
# Top 5 recommendations (JSON, for agent/script consumption)
llmfit recommend --json --limit 5
# Recommendations filtered by use case
llmfit recommend --json --use-case coding --limit 3

Add --json to any subcommand for machine-readable output:
llmfit --json system # Hardware specs as JSON
llmfit --json fit -n 10 # Top 10 fits as JSON
llmfit recommend --json          # Top 5 recommendations (JSON is default for recommend)
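Agents and scripts in other languages can consume this output directly. Below is a minimal Rust sketch, assuming `serde_json` as a dependency; the exact output schema isn't documented in this README, so it simply parses into a generic value and pretty-prints it:

```rust
use std::process::Command;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Run llmfit and capture its machine-readable output.
    let output = Command::new("llmfit")
        .args(["recommend", "--json", "--limit", "5"])
        .output()?;

    // Parse into a generic serde_json::Value since no schema is assumed here.
    let recommendations: serde_json::Value = serde_json::from_slice(&output.stdout)?;
    println!("{}", serde_json::to_string_pretty(&recommendations)?);
    Ok(())
}
```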
- Hardware detection -- Reads total/available RAM via `sysinfo`, counts CPU cores, and probes for GPUs:
  - NVIDIA -- Multi-GPU support via `nvidia-smi`. Aggregates VRAM across all detected GPUs. Falls back to VRAM estimation from the GPU model name if reporting fails.
  - AMD -- Detected via `rocm-smi`.
  - Intel Arc -- Discrete VRAM via sysfs, integrated via `lspci`.
  - Apple Silicon -- Unified memory via `system_profiler`. VRAM = system RAM.
  - Backend detection -- Automatically identifies the acceleration backend (CUDA, Metal, ROCm, SYCL, CPU ARM, CPU x86) for speed estimation.
- Model database -- 94 models sourced from the HuggingFace API, stored in `data/hf_models.json` and embedded at compile time. Memory requirements are computed from parameter counts across a quantization hierarchy (Q8_0 through Q2_K). VRAM is the primary constraint for GPU inference; system RAM is the fallback for CPU-only execution.
- MoE support -- Models with Mixture-of-Experts architectures (Mixtral, DeepSeek-V2/V3) are detected automatically. Only a subset of experts is active per token, so the effective VRAM requirement is much lower than the total parameter count suggests. For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token, reducing VRAM from 23.9 GB to ~6.6 GB with expert offloading (the arithmetic is sketched after this list).
- Dynamic quantization -- Instead of assuming a fixed quantization, llmfit tries the best quality quantization that fits your hardware. It walks a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed), picking the highest quality that fits in available memory. If nothing fits at full context, it tries again at half context (see the sketch after this list).
- Multi-dimensional scoring -- Each model is scored across four dimensions (0–100 each):

  | Dimension | What it measures |
  |---|---|
  | Quality | Parameter count, model family reputation, quantization penalty, task alignment |
  | Speed | Estimated tokens/sec based on backend, params, and quantization |
  | Fit | Memory utilization efficiency (sweet spot: 50–80% of available memory) |
  | Context | Context window capability vs target for the use case |

  Dimensions are combined into a weighted composite score. Weights vary by use-case category (General, Coding, Reasoning, Chat, Multimodal, Embedding). For example, Chat weights Speed higher (0.35) while Reasoning weights Quality higher (0.55). Models are ranked by composite score, with unrunnable models (Too Tight) always at the bottom.
- Speed estimation -- Estimated tokens per second using backend-specific constants (see the sketch after this list):

  | Backend | Speed constant |
  |---|---|
  | CUDA | 220 |
  | Metal | 160 |
  | ROCm | 180 |
  | SYCL | 100 |
  | CPU (ARM) | 90 |
  | CPU (x86) | 70 |

  Formula: `K / params_b × quant_speed_multiplier`, with penalties for CPU offload (0.5×), CPU-only (0.3×), and MoE expert switching (0.8×).

- Fit analysis -- Each model is evaluated for memory compatibility:
Run modes:
- GPU -- Model fits in VRAM. Fast inference.
- MoE -- Mixture-of-Experts with expert offloading. Active experts in VRAM, inactive in RAM.
- CPU+GPU -- VRAM insufficient, spills to system RAM with partial GPU offload.
- CPU -- No GPU. Model loaded entirely into system RAM.
Fit levels:
- Perfect -- Recommended memory met on GPU. Requires GPU acceleration.
- Good -- Fits with headroom. Best achievable for MoE offload or CPU+GPU.
- Marginal -- Tight fit, or CPU-only (CPU-only always caps here).
- Too Tight -- Not enough VRAM or system RAM anywhere.
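To make the MoE bullet concrete, here is the Mixtral 8x7B arithmetic from above as a tiny Rust sketch. The bytes-per-parameter figure is derived from the numbers quoted in that bullet (roughly a 4-bit quantization); it is an illustration, not llmfit's internal calculation:

```rust
fn main() {
    let total_params_b = 46.7;  // Mixtral 8x7B total parameters (billions)
    let active_params_b = 12.9; // parameters active per token (billions)
    let dense_gb = 23.9;        // memory needed if treated as dense, per the bullet above

    // ~0.51 bytes per parameter, i.e. roughly a 4-bit quantization
    let bytes_per_param = dense_gb / total_params_b;
    let moe_gb = active_params_b * bytes_per_param;
    println!("effective VRAM with expert offloading: ~{moe_gb:.1} GB"); // ~6.6 GB
}
```

The dynamic-quantization walk can be pictured like the sketch below. Q8_0 and Q2_K are named in the text above; the intermediate levels, all size figures, and the context overhead are assumed round numbers for illustration, not llmfit's real table:

```rust
/// Approximate bytes per parameter for each quantization level (assumed values).
const QUANTS: &[(&str, f64)] = &[
    ("Q8_0", 1.07),
    ("Q6_K", 0.82),
    ("Q5_K_M", 0.70),
    ("Q4_K_M", 0.59),
    ("Q3_K_M", 0.46),
    ("Q2_K", 0.35),
];

/// Walk from best quality to most compressed, returning the first quant that fits.
/// `context_gb` stands in for KV-cache overhead at the requested context length.
fn pick_quant(params_b: f64, available_gb: f64, context_gb: f64) -> Option<(&'static str, bool)> {
    for halved in [false, true] {
        let ctx = if halved { context_gb / 2.0 } else { context_gb };
        for &(name, bytes_per_param) in QUANTS {
            if params_b * bytes_per_param + ctx <= available_gb {
                return Some((name, halved));
            }
        }
    }
    None // nothing fits at any quant or context -> "Too Tight"
}

fn main() {
    // Example: an 8B model on a 12 GB GPU with ~2 GB of context overhead.
    match pick_quant(8.0, 12.0, 2.0) {
        Some((quant, halved)) => println!("best fit: {quant} (half context: {halved})"),
        None => println!("does not fit"),
    }
}
```

And the speed heuristic, using the backend constants and penalty factors from the table above. The quantization multiplier is left at 1.0 because per-quant values aren't listed here, and using the active-parameter count for MoE models is an assumption:

```rust
/// tok/s ≈ K / params_b × quant_speed_multiplier × penalty
fn estimated_tok_per_s(backend_k: f64, params_b: f64, quant_mult: f64, penalty: f64) -> f64 {
    backend_k / params_b * quant_mult * penalty
}

fn main() {
    let (cuda, metal) = (220.0, 160.0); // constants from the table above

    // 7B dense model fully in VRAM on CUDA: no penalty.
    println!("CUDA, 7B, GPU:     ~{:.0} tok/s", estimated_tok_per_s(cuda, 7.0, 1.0, 1.0));
    // Same model spilling into system RAM (CPU offload): 0.5x penalty.
    println!("CUDA, 7B, CPU+GPU: ~{:.0} tok/s", estimated_tok_per_s(cuda, 7.0, 1.0, 0.5));
    // MoE on Apple Silicon, ~12.9B active params, 0.8x expert-switching penalty.
    println!("Metal, MoE:        ~{:.0} tok/s", estimated_tok_per_s(metal, 12.9, 1.0, 0.8));
}
```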
The model list is generated by scripts/scrape_hf_models.py, a standalone Python script (stdlib only, no pip dependencies) that queries the HuggingFace REST API. 94 models across 30 providers including Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi, DeepSeek, IBM Granite, Allen Institute OLMo, xAI Grok, Cohere, BigCode, 01.ai, Upstage, TII Falcon, HuggingFace, Zhipu GLM, Moonshot Kimi, Baidu ERNIE, and more. The scraper automatically detects MoE architectures via model config (num_local_experts, num_experts_per_tok) and known architecture mappings.
Model categories span general purpose, coding (CodeLlama, StarCoder2, WizardCoder, Qwen2.5-Coder, Qwen3-Coder), reasoning (DeepSeek-R1, Orca-2), multimodal/vision (Llama 3.2 Vision, Llama 4 Scout/Maverick, Qwen2.5-VL), chat, enterprise (IBM Granite), and embedding (nomic-embed, bge).
See MODELS.md for the full list.
To refresh the model database:
# Automated update (recommended)
make update-models
# Or run the script directly
./scripts/update_models.sh
# Or manually
python3 scripts/scrape_hf_models.py
cargo build --release

The scraper writes `data/hf_models.json`, which is baked into the binary via `include_str!`. The automated update script backs up existing data, validates JSON output, and rebuilds the binary.
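For reference, the compile-time embedding works roughly like this minimal sketch -- illustrative rather than llmfit's exact code, assuming `serde_json` is available (it is in the dependency table below) and that the database root is a JSON array:

```rust
// The path is resolved relative to the source file at compile time,
// so data/hf_models.json must exist when the crate is built.
static MODEL_DB_JSON: &str = include_str!("../data/hf_models.json");

fn main() {
    let db: serde_json::Value =
        serde_json::from_str(MODEL_DB_JSON).expect("embedded model database is valid JSON");
    // Assumes the top-level value is an array of model entries.
    let count = db.as_array().map_or(0, |models| models.len());
    println!("embedded model database contains {count} entries");
}
```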
src/
main.rs -- CLI argument parsing, entrypoint, TUI launch
hardware.rs -- System RAM/CPU/GPU detection (multi-GPU, backend identification)
models.rs -- Model database, quantization hierarchy, dynamic quant selection
fit.rs -- Multi-dimensional scoring (Q/S/F/C), speed estimation, MoE offloading
display.rs -- Classic CLI table rendering + JSON output
tui_app.rs -- TUI application state, filters, navigation
tui_ui.rs -- TUI rendering (ratatui)
tui_events.rs -- TUI keyboard event handling (crossterm)
data/
hf_models.json -- Model database (94 models)
skills/
llmfit-advisor/ -- OpenClaw skill for hardware-aware model recommendations
scripts/
scrape_hf_models.py -- HuggingFace API scraper
update_models.sh -- Automated database update script
install-openclaw-skill.sh -- Install the OpenClaw skill
Makefile -- Build and maintenance commands
The Cargo.toml already includes the required metadata (description, license, repository). To publish:
# Dry run first to catch issues
cargo publish --dry-run
# Publish for real (requires a crates.io API token)
cargo login
cargo publish

Before publishing, make sure:
- The version in `Cargo.toml` is correct (bump with each release).
- A `LICENSE` file exists in the repo root. Create one if missing:

  # For MIT license:
  curl -sL https://opensource.org/license/MIT -o LICENSE
  # Or write your own. The Cargo.toml declares license = "MIT".

- `data/hf_models.json` is committed. It is embedded at compile time and must be present in the published crate.
- The `exclude` list in `Cargo.toml` keeps `target/`, `scripts/`, and `demo.gif` out of the published crate to keep the download small.
To publish updates:
# Bump version
# Edit Cargo.toml: version = "0.2.0"
cargo publish

| Crate | Purpose |
|---|---|
| `clap` | CLI argument parsing with derive macros |
| `sysinfo` | Cross-platform RAM and CPU detection |
| `serde` / `serde_json` | JSON deserialization for model database |
| `tabled` | CLI table formatting |
| `colored` | CLI colored output |
| `ratatui` | Terminal UI framework |
| `crossterm` | Terminal input/output backend for ratatui |
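To give a feel for the `sysinfo`-based RAM/CPU detection, here is a minimal sketch. It assumes a recent `sysinfo` release (inherent methods, memory reported in bytes); llmfit's real detection additionally probes GPUs and picks an acceleration backend:

```rust
use sysinfo::System;

fn main() {
    let mut sys = System::new_all();
    sys.refresh_all();

    // Recent sysinfo versions report memory in bytes.
    let total_gb = sys.total_memory() as f64 / 1e9;
    let avail_gb = sys.available_memory() as f64 / 1e9;
    let cores = sys.cpus().len();

    println!("RAM: {avail_gb:.1} GB available of {total_gb:.1} GB, {cores} CPU cores");
}
```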
- Linux -- Full support. GPU detection via `nvidia-smi` (NVIDIA), `rocm-smi` (AMD), and sysfs/`lspci` (Intel Arc).
- macOS (Apple Silicon) -- Full support. Detects unified memory via `system_profiler`. VRAM = system RAM (shared pool). Models run via Metal GPU acceleration.
- macOS (Intel) -- RAM and CPU detection works. Discrete GPU detection if `nvidia-smi` is available.
- Windows -- RAM and CPU detection works. NVIDIA GPU detection via `nvidia-smi` if installed.
| Vendor | Detection method | VRAM reporting |
|---|---|---|
| NVIDIA | `nvidia-smi` | Exact dedicated VRAM |
| AMD | `rocm-smi` | Detected (VRAM may be unknown) |
| Intel Arc (discrete) | sysfs (`mem_info_vram_total`) | Exact dedicated VRAM |
| Intel Arc (integrated) | `lspci` | Shared system memory |
| Apple Silicon | `system_profiler` | Unified memory (= system RAM) |
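The NVIDIA multi-GPU aggregation above can be sketched roughly as follows, shelling out to `nvidia-smi` and summing per-GPU totals. This is a simplified illustration rather than llmfit's implementation; the model-name fallback and the other vendors are omitted:

```rust
use std::process::Command;

/// Sum dedicated VRAM (in MiB) across all GPUs reported by nvidia-smi.
fn total_vram_mib() -> Option<u64> {
    let output = Command::new("nvidia-smi")
        .args(["--query-gpu=memory.total", "--format=csv,noheader,nounits"])
        .output()
        .ok()?;
    if !output.status.success() {
        return None; // no NVIDIA driver / GPU available
    }
    // One line per GPU, each a plain MiB number, e.g. "24576".
    let stdout = String::from_utf8_lossy(&output.stdout);
    let total = stdout
        .lines()
        .filter_map(|line| line.trim().parse::<u64>().ok())
        .sum();
    Some(total)
}

fn main() {
    match total_vram_mib() {
        Some(mib) => println!("aggregate VRAM: {:.1} GiB", mib as f64 / 1024.0),
        None => println!("no NVIDIA GPU detected"),
    }
}
```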
Contributions are welcome, especially new models.
- Add the model's HuggingFace repo ID (e.g., `meta-llama/Llama-3.1-8B`) to the `TARGET_MODELS` list in `scripts/scrape_hf_models.py`.
- If the model is gated (requires HuggingFace authentication to access metadata), add a fallback entry to the `FALLBACKS` list in the same script with the parameter count and context length.
- Run the automated update script: `make update-models` (or `./scripts/update_models.sh`).
- Verify the updated model list: `./target/release/llmfit list`
- Update MODELS.md by running `python3 << 'EOF' < scripts/...` (see commit history for the generator script).
- Open a pull request.
See MODELS.md for the current list and AGENTS.md for architecture details.
llmfit ships as an OpenClaw skill that lets the agent recommend hardware-appropriate local models and auto-configure Ollama/vLLM/LM Studio providers.
# From the llmfit repo
./scripts/install-openclaw-skill.sh
# Or manually
cp -r skills/llmfit-advisor ~/.openclaw/skills/

Once installed, ask your OpenClaw agent things like:
- "What local models can I run?"
- "Recommend a coding model for my hardware"
- "Set up Ollama with the best models for my GPU"
The agent will call llmfit recommend --json under the hood, interpret the results, and offer to configure your openclaw.json with optimal model choices.
The skill teaches the OpenClaw agent to:
- Detect your hardware via `llmfit --json system`
- Get ranked recommendations via `llmfit recommend --json`
- Map HuggingFace model names to Ollama/vLLM/LM Studio tags
- Configure `models.providers.ollama.models` in `openclaw.json`
See skills/llmfit-advisor/SKILL.md for the full skill definition.
If you're looking for a different approach, check out llm-checker -- a Node.js CLI tool with Ollama integration that can pull and benchmark models directly. It takes a more hands-on approach by actually running models on your hardware via Ollama, rather than estimating from specs. Good if you already have Ollama installed and want to test real-world performance. Note that it doesn't support MoE (Mixture-of-Experts) architectures -- all models are treated as dense, so memory estimates for models like Mixtral or DeepSeek-V3 will reflect total parameter count rather than the smaller active subset.
MIT