feat: add remote GPU cluster support for multi-node model fitting#279
feat: add remote GPU cluster support for multi-node model fitting#279michaeljabbour wants to merge 1 commit into
Conversation
AlexsJones
left a comment
There was a problem hiding this comment.
Nice feature concept! Left some inline suggestions. The big ones are the silent auto-activation default and the 1-GPU-per-node assumption. Rest is minor cleanup.
| pub fn total_gpu_count(&self) -> usize { | ||
| self.nodes.len() | ||
| } | ||
|
|
There was a problem hiding this comment.
Clean struct design. Consider adding a gpu_count: u32 field here so multi-GPU nodes (e.g. 8xA100 DGX) are counted correctly. Right now total_gpu_count() returns nodes.len() which underreports on common cluster hardware.
| .unwrap_or(false); | ||
|
|
||
| let backend = if unified { | ||
| GpuBackend::Cuda // unified memory NVIDIA (e.g. DGX Spark GB10) |
There was a problem hiding this comment.
Both branches return GpuBackend::Cuda here, so the if/else is a no-op. Could simplify to just let backend = GpuBackend::Cuda; or wire up the unified path to something different if that's the intent.
| hostname, | ||
| ip, | ||
| gpu_name: "GPU".to_string(), // Ray doesn't report GPU model | ||
| gpu_vram_gb: gpu_count * 80.0, // rough default; user can correct |
There was a problem hiding this comment.
Good fallback! The gpu_count * 80.0 default is a bit fragile though. Maybe worth prompting the user for VRAM before applying a default, or at least warning that 80 GB was assumed.
| // ── Helpers ──────────────────────────────────────────────────────── | ||
|
|
||
| fn dirs_path() -> Option<PathBuf> { | ||
| std::env::var("HOME") |
There was a problem hiding this comment.
This manually reads $HOME but the dirs crate was added to llmfit-tui/Cargo.toml. Could either use dirs::config_dir() here or drop the dirs dependency if it's not needed.
| }, | ||
|
|
||
| /// Manage remote GPU cluster configuration | ||
| Cluster { |
There was a problem hiding this comment.
Heads up: this means cluster mode activates silently whenever a cluster.toml exists, even without --cluster. Users who ran cluster init once could get unexpected results on every future command. Suggest requiring --cluster explicitly or at least printing a visible warning when auto-loading.
| port: u16, | ||
| }, | ||
|
|
||
| /// Manage remote GPU cluster configuration |
There was a problem hiding this comment.
Nice that the escape hatch exists! Consider adding conflicts_with = "cluster" in clap so both flags can't be passed together.
- Add gpu_count field to ClusterNode for multi-GPU nodes (8xA100 DGX) - Fix total_gpu_count() to sum gpu_count per node, not just node count - Remove no-op if/else for backend (both branches returned Cuda) - Add VRAM default warning when Ray doesn't report GPU memory - Require --cluster flag explicitly (no silent auto-activation from cluster.toml) - Add conflicts_with between --cluster and --no-cluster - Add USERPROFILE fallback in dirs_path() for Windows - Fix flaky dedup test assertion in models.rs - Add GPUs-per-node prompt in interactive cluster init Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
33ff0fd to
3bafae1
Compare
- Add gpu_count field to ClusterNode for multi-GPU nodes (8xA100 DGX) - Fix total_gpu_count() to sum gpu_count per node, not just node count - Remove no-op if/else for backend (both branches returned Cuda) - Add VRAM default warning when Ray doesn't report GPU memory - Require --cluster flag explicitly (no silent auto-activation from cluster.toml) - Add conflicts_with between --cluster and --no-cluster - Add USERPROFILE fallback in dirs_path() for Windows - Fix flaky dedup test assertion in models.rs - Add GPUs-per-node prompt in interactive cluster init Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3bafae1 to
329adec
Compare
|
Rebased onto latest main, resolved merge conflicts, ran |
|
Big feature, Michael. The cluster module is well-structured. Verdict: LGTM Key things I like:
Minor notes:
Solid work. Ready to merge. |
|
Please rebase and we can continue @michaeljabbour |
AlexsJones
left a comment
There was a problem hiding this comment.
Cluster infrastructure is solid and well-tested for the common case (homogeneous NVIDIA clusters). Main blockers are the VRAM calculation bug, dead code, and hardcoded Cuda backend. See inline comments.
| } | ||
|
|
||
| pub fn total_vram_gb(&self) -> f64 { | ||
| self.nodes.iter().map(|n| n.gpu_vram_gb).sum() |
There was a problem hiding this comment.
Bug: total_vram_gb doesn't account for gpu_count
A node with 8× A100 80GB would report 80 GB total instead of 640 GB. Should be:
self.nodes.iter().map(|n| n.gpu_vram_gb * n.gpu_count as f64).sum()The field name gpu_vram_gb is ambiguous — is it per-GPU or per-node? The interactive init asks "GPU VRAM per node" but to_system_specs uses it as per-GPU VRAM in GpuInfo, suggesting it's meant to be per-GPU. Either way, this sum needs the multiplier.
| .map(|n| n.unified_memory) | ||
| .unwrap_or(false); | ||
|
|
||
| let backend = GpuBackend::Cuda; |
There was a problem hiding this comment.
Hardcoded GpuBackend::Cuda contradicts "hardware-agnostic" claim
The PR description says it supports any vendor (NVIDIA, AMD, etc.) but an AMD cluster would get the wrong backend here. Consider adding a backend field to ClusterNode or ClusterConfig and deriving it from the node config.
| total_gpu_vram_gb: Some(total_vram), | ||
| gpu_name: Some(format!("{} (×{})", gpu_name, total_gpus)), | ||
| gpu_count: total_gpus, | ||
| unified_memory: false, // cluster uses NCCL path, not unified |
There was a problem hiding this comment.
unified_memory: false is hardcoded with the comment "cluster uses NCCL path, not unified" — but a cluster of Apple Silicon / DGX Spark (GB10) nodes does have unified memory. The flag already exists on ClusterNode, so this could be aggregated (e.g. all-unified → true).
| hostname, | ||
| ip, | ||
| gpu_name: "GPU".to_string(), // Ray doesn't report GPU model | ||
| gpu_vram_gb: gpu_count * 80.0, // rough default; user can correct |
There was a problem hiding this comment.
Ray discovery hardcodes 80GB VRAM per GPU
gpu_vram_gb: gpu_count * 80.0This is reasonable as a fallback, but the interactive flow then applies a single VRAM override to all nodes uniformly. For mixed clusters (e.g. A100 + H100) this silently produces wrong results. Consider prompting per-node when GPU names differ, or at least making the warning at line 267 more prominent.
|
|
||
| // ── Helpers ──────────────────────────────────────────────────────── | ||
|
|
||
| fn dirs_path() -> Option<PathBuf> { |
There was a problem hiding this comment.
dirs_path() reimplements what the dirs crate provides
The dirs crate is added to llmfit-tui but not used here — this manually reads $HOME/$USERPROFILE instead of using dirs::config_dir(). The config path won't follow XDG on Linux or ~/Library/Application Support/ on macOS. Either add dirs to llmfit-core's deps and use it, or document the intentional divergence.
| head_ip.clone() | ||
| } else { | ||
| let default_ip = | ||
| increment_ip(&head_ip, i as u8).unwrap_or_else(|_| head_ip.clone()); |
There was a problem hiding this comment.
Nit: i as u8 silently wraps for clusters with 256+ nodes. u8::try_from(i).map_err(...) would surface a clear error instead of generating a bogus IP.
| /// Deduplicate `HfModelEntry` values that share the same canonical name. | ||
| /// | ||
| /// When two entries share a name, the "better" metadata is kept: higher | ||
| /// parameter counts, larger context length, etc. Capabilities and GGUF |
There was a problem hiding this comment.
dedupe_hf_entries is defined and tested but never called from production code.
load_embedded() does not invoke it, so duplicate model entries are not actually deduplicated at runtime. Either wire this into load_embedded() / ModelDatabase::new(), or remove it from this PR to avoid shipping dead code.
| assert_eq!(merged.min_ram_gb, 12.0); | ||
| assert_eq!(merged.recommended_ram_gb, 24.0); | ||
| assert_eq!(merged.min_vram_gb, Some(10.0)); | ||
| assert_eq!(merged.context_length, 65536); |
There was a problem hiding this comment.
Nit: this test asserts on "Qwen/Qwen3-Coder-Next" which is model database content. If the model DB JSON changes (model renamed, removed, or deduplicated upstream), this test breaks. Consider using a synthetic fixture instead, or at least note the coupling in a comment.
| // Explicit opt-out — skip cluster entirely | ||
| } else if use_cluster { | ||
| // Explicit opt-in — load and use cluster config | ||
| if let Some(cluster) = ClusterConfig::load() { |
There was a problem hiding this comment.
Minor: when neither --cluster nor --no-cluster is passed, ClusterConfig::load() reads and parses the TOML file on every command invocation just to print a "Note" to stderr. Consider gating on a cheaper existence check (config_path().map(|p| p.exists())) instead of deserializing the file each time.
|
Reviewed the diff. Solid cluster infrastructure overall. Good stuff:
Issues flagged by AlexsJones (CHANGES_REQUESTED) are the blockers. Two I agree with:
One nit: run_download() hardcodes (false, false) for cluster flags so budget calc ignores cluster mode. Should probably use the actual flags. Waiting on author to address AlexsJones change requests before this moves forward. |
329adec to
46394c1
Compare
Adds ClusterConfig and ClusterNode types for representing multi-node
clusters with TOML persistence, IP/hostname validation, Ray Dashboard
auto-discovery, and CLI/TUI integration via --cluster / --no-cluster flags
plus an `llmfit cluster {init,status,clear}` subcommand.
Addresses all review feedback from PR AlexsJones#279:
* Fix `total_vram_gb()` to multiply per-GPU VRAM by `gpu_count` per node
(previously reported only one GPU per node, so an 8x A100 80GB node was
reported as 80GB instead of 640GB).
* Replace hardcoded `GpuBackend::Cuda` with per-node `Option<GpuBackend>`
on `ClusterNode` plus `derive_cluster_backend()` helper that warns and
defaults to CUDA when nodes disagree or omit the field. Backward-compatible
with TOML configs that don't specify the field.
* Replace hardcoded `unified_memory: false` in `to_system_specs()` with an
aggregation across nodes (Apple Silicon / DGX Spark GB10 clusters now
detected correctly).
* Wire `dedupe_hf_entries` into `load_embedded()` so duplicate model entries
are actually deduplicated at runtime instead of being defined and tested
but never called from production code.
* Replace `gpu_vram_gb: gpu_count * 80.0` Ray-discovery hardcode with a
per-node prompt; emit a prominent warning when GPUs differ across nodes
and prompt per-node, otherwise prompt once for uniform clusters.
* Replace handwritten `dirs_path()` helper with `dirs::config_dir()`,
matching the platform-correct config location pattern introduced upstream
in AlexsJones#437.
* Replace `i as u8` with `u8::try_from(i)` for IP-increment in interactive
init so 256+ node clusters get a clear error instead of silently
generating wrong IPs.
* Make `detect_specs()` lazy-load: stat the cluster config path before
parsing the TOML so it isn't deserialized on every command invocation.
* Thread `--cluster` / `--no-cluster` flags through all callers
(download/recommend/diff/system/info/list/run) via `HardwareOverrides`
instead of hardcoding `(false, false)`.
Opportunistic improvements ported from later spark-cluster iterations:
* Extract the 0.85 RAM availability factor to a named constant
`CLUSTER_AVAILABLE_RAM_FACTOR` with a doc comment explaining it accounts
for OS/orchestrator overhead.
* Warn on non-numeric / out-of-range input during interactive cluster init
instead of silently falling back to a default; empty input still accepts
the default.
Test strategy:
* dedupe test uses fully synthetic `HfModelEntry` fixtures (the previous
draft coupled to real model database content like `Qwen/Qwen3-Coder-Next`,
which would break on routine model database updates).
* New unit tests cover gpu_count multiplier, unified-memory aggregation,
per-node backend serde roundtrip, backward-compat with backend-less TOML,
and `derive_cluster_backend` heterogeneous/homogeneous behavior.
Closes AlexsJones#279.
🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)
Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
|
@AlexsJones @three-foxes-in-a-trenchcoat — thanks for the detailed reviews. Force-pushed Blockers (
|
| # | Item | Resolution |
|---|---|---|
| 1 | total_vram_gb() ignores gpu_count per node |
Now n.gpu_vram_gb * n.gpu_count as f64; doc comment clarifies gpu_vram_gb is per-GPU. New test test_total_vram_counts_all_gpus_per_node. |
| 2 | Hardcoded GpuBackend::Cuda in to_system_specs() |
Added backend: Option<GpuBackend> to ClusterNode (lowercase serde, backward-compatible with TOML files that omit it). New derive_cluster_backend() helper warns on heterogeneous/unspecified backends and falls back to CUDA. AMD/ROCm clusters now propagate correctly. New tests cover serde roundtrip, missing-field compat, unspecified default, and uniform ROCm. |
| 3 | Hardcoded unified_memory: false |
Now `!nodes.is_empty() && nodes.iter().all( |
| 4 | dedupe_hf_entries was dead code |
Wired into load_embedded() so duplicate entries are merged at runtime. Function extended to handle all post-PR HfModelEntry fields (hf_downloads, num_attention_heads, license, etc.). |
| 5 | Ray discovery gpu_count * 80.0 hardcode |
Now 0.0 from Ray; interactive flow detects whether gpu_name values are uniform across nodes — prompts once for homogeneous clusters, prints a prominent warning and prompts per-node for heterogeneous ones. No silent 80 GB defaults anywhere. |
| 6 | run_download hardcoded (false, false) for cluster flags |
All callers now thread cluster / no_cluster through HardwareOverrides; download/recommend/diff/system/info/list/run all honor the flags. |
Nits
| # | Item | Resolution |
|---|---|---|
| 7 | dirs_path() reimplements the dirs crate |
Deleted; uses `dirs::config_dir().map( |
| 8 | i as u8 silently wraps for 256+ node clusters |
Replaced with u8::try_from(i); on overflow we keep the head IP and emit a clear warning instead of generating bogus addresses. |
| 9 | detect_specs() deserialized cluster.toml on every command |
Now path.exists() (cheap stat(2)) gates the parse — TOML is only deserialized when actually used. |
| 10 | Test coupled to real model database content | Replaced the Qwen/Qwen3-Coder-Next assertion with a fully synthetic HfModelEntry fixture (test_dedupe_hf_entries_merges_duplicate_metadata) — no longer breaks on routine model database updates. |
Opportunistic improvements
0.85RAM availability factor extracted toCLUSTER_AVAILABLE_RAM_FACTORconstant with a doc comment explaining the OS/orchestrator overhead.- Interactive cluster init now warns on non-numeric / out-of-range input instead of silently falling back; empty input still accepts the default.
Verification
cargo build: cleancargo test --all: 363 passed, 0 failed, 1 ignored (12 new tests added)cargo clippy: zero warnings introduced; pre-existing warnings onfit.rs/providers.rs/hardware.rs/update.rs/models.rsare upstream's and outside this PR's scopecargo fmt --all: applied
Diff at a glance
Cargo.lock | 1 +
llmfit-core/Cargo.toml | 1 +
llmfit-core/src/cluster.rs | (758 → 1192 lines, review fixes + new tests)
llmfit-core/src/lib.rs | 1 +
llmfit-core/src/models.rs | +225 (dedupe + synthetic test)
llmfit-tui/src/main.rs | +133 (cluster CLI integration)
6 files, +1548 / −5. Ready for re-review.
Adds ClusterConfig and ClusterNode types for representing multi-node
clusters with TOML persistence, IP/hostname validation, Ray Dashboard
auto-discovery, and CLI/TUI integration via --cluster / --no-cluster flags
plus an `llmfit cluster {init,status,clear}` subcommand.
Addresses all review feedback from PR AlexsJones#279:
* Fix `total_vram_gb()` to multiply per-GPU VRAM by `gpu_count` per node
(previously reported only one GPU per node, so an 8x A100 80GB node was
reported as 80GB instead of 640GB).
* Replace hardcoded `GpuBackend::Cuda` with per-node `Option<GpuBackend>`
on `ClusterNode` plus `derive_cluster_backend()` helper that warns and
defaults to CUDA when nodes disagree or omit the field. Backward-compatible
with TOML configs that don't specify the field.
* Replace hardcoded `unified_memory: false` in `to_system_specs()` with an
aggregation across nodes (Apple Silicon / DGX Spark GB10 clusters now
detected correctly).
* Wire `dedupe_hf_entries` into `load_embedded()` so duplicate model entries
are actually deduplicated at runtime instead of being defined and tested
but never called from production code.
* Replace `gpu_vram_gb: gpu_count * 80.0` Ray-discovery hardcode with a
per-node prompt; emit a prominent warning when GPUs differ across nodes
and prompt per-node, otherwise prompt once for uniform clusters.
* Replace handwritten `dirs_path()` helper with `dirs::config_dir()`,
matching the platform-correct config location pattern introduced upstream
in AlexsJones#437.
* Replace `i as u8` with `u8::try_from(i)` for IP-increment in interactive
init so 256+ node clusters get a clear error instead of silently
generating wrong IPs.
* Make `detect_specs()` lazy-load: stat the cluster config path before
parsing the TOML so it isn't deserialized on every command invocation.
* Thread `--cluster` / `--no-cluster` flags through all callers
(download/recommend/diff/system/info/list/run) via `HardwareOverrides`
instead of hardcoding `(false, false)`.
Opportunistic improvements ported from later spark-cluster iterations:
* Extract the 0.85 RAM availability factor to a named constant
`CLUSTER_AVAILABLE_RAM_FACTOR` with a doc comment explaining it accounts
for OS/orchestrator overhead.
* Warn on non-numeric / out-of-range input during interactive cluster init
instead of silently falling back to a default; empty input still accepts
the default.
Test strategy:
* dedupe test uses fully synthetic `HfModelEntry` fixtures (the previous
draft coupled to real model database content like `Qwen/Qwen3-Coder-Next`,
which would break on routine model database updates).
* New unit tests cover gpu_count multiplier, unified-memory aggregation,
per-node backend serde roundtrip, backward-compat with backend-less TOML,
and `derive_cluster_backend` heterogeneous/homogeneous behavior.
Closes AlexsJones#279.
🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)
Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
46394c1 to
5cdb59e
Compare
|
Rebased onto current
Ready to merge whenever convenient. |
three-foxes-in-a-trenchcoat
left a comment
There was a problem hiding this comment.
Re-review after Michael's comprehensive fix pass — LGTM.
All blockers addressed:
- ✅
total_vram_gb()now accounts forgpu_countper node - ✅ Hardcoded
GpuBackend::Cudareplaced with per-nodebackend: Option<GpuBackend>+derive_cluster_backend() - ✅
unified_memoryproperly derived from nodes - ✅
dedupe_hf_entrieswired intoload_embedded() - ✅ Ray discovery no longer hardcodes 80 GB — prompts interactively
- ✅
run_downloadthreads cluster flags properly - ✅ All nits addressed (dirs crate, u8 overflow, lazy TOML parse, synthetic test)
CI green across all platforms (Ubuntu/macOS/Windows), clippy clean, fmt applied, 371 tests passing. This is ready — just needs the merge conflict resolved (DIRTY state).
Adds ClusterConfig and ClusterNode types for representing multi-node
clusters with TOML persistence, IP/hostname validation, Ray Dashboard
auto-discovery, and CLI/TUI integration via --cluster / --no-cluster flags
plus an `llmfit cluster {init,status,clear}` subcommand.
Addresses all review feedback from PR AlexsJones#279:
* Fix `total_vram_gb()` to multiply per-GPU VRAM by `gpu_count` per node
(previously reported only one GPU per node, so an 8x A100 80GB node was
reported as 80GB instead of 640GB).
* Replace hardcoded `GpuBackend::Cuda` with per-node `Option<GpuBackend>`
on `ClusterNode` plus `derive_cluster_backend()` helper that warns and
defaults to CUDA when nodes disagree or omit the field. Backward-compatible
with TOML configs that don't specify the field.
* Replace hardcoded `unified_memory: false` in `to_system_specs()` with an
aggregation across nodes (Apple Silicon / DGX Spark GB10 clusters now
detected correctly).
* Wire `dedupe_hf_entries` into `load_embedded()` so duplicate model entries
are actually deduplicated at runtime instead of being defined and tested
but never called from production code.
* Replace `gpu_vram_gb: gpu_count * 80.0` Ray-discovery hardcode with a
per-node prompt; emit a prominent warning when GPUs differ across nodes
and prompt per-node, otherwise prompt once for uniform clusters.
* Replace handwritten `dirs_path()` helper with `dirs::config_dir()`,
matching the platform-correct config location pattern introduced upstream
in AlexsJones#437.
* Replace `i as u8` with `u8::try_from(i)` for IP-increment in interactive
init so 256+ node clusters get a clear error instead of silently
generating wrong IPs.
* Make `detect_specs()` lazy-load: stat the cluster config path before
parsing the TOML so it isn't deserialized on every command invocation.
* Thread `--cluster` / `--no-cluster` flags through all callers
(download/recommend/diff/system/info/list/run) via `HardwareOverrides`
instead of hardcoding `(false, false)`.
Opportunistic improvements ported from later spark-cluster iterations:
* Extract the 0.85 RAM availability factor to a named constant
`CLUSTER_AVAILABLE_RAM_FACTOR` with a doc comment explaining it accounts
for OS/orchestrator overhead.
* Warn on non-numeric / out-of-range input during interactive cluster init
instead of silently falling back to a default; empty input still accepts
the default.
Test strategy:
* dedupe test uses fully synthetic `HfModelEntry` fixtures (the previous
draft coupled to real model database content like `Qwen/Qwen3-Coder-Next`,
which would break on routine model database updates).
* New unit tests cover gpu_count multiplier, unified-memory aggregation,
per-node backend serde roundtrip, backward-compat with backend-less TOML,
and `derive_cluster_backend` heterogeneous/homogeneous behavior.
Closes AlexsJones#279.
🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)
Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
5cdb59e to
6d9df6b
Compare
|
@three-foxes-in-a-trenchcoat — thanks for the re-review and approval! Pushed
Verified locally: @AlexsJones — ready when you are. All your inline comments addressed in |
three-foxes-in-a-trenchcoat
left a comment
There was a problem hiding this comment.
Re-review after author rebased. All blockers from both my review and @AlexsJones change requests have been addressed in 46394c1:
- ✅ Hardcoded Cuda backend fixed — now per-node with ROCm support
- ✅ VRAM 80 GB default removed — uses Ray discovery or interactive prompts
- ✅ VRAM per-gpu-count fixed
- ✅ unified_memory properly handled
- ✅ dedupe_hf_entries wired up
- ✅ run_download threads cluster flags through all commands
- ✅ dirs crate usage, u8 overflow, stat(2) gating all fixed
CI is green, merge state is CLEAN. My approval still stands from the previous rebase. @AlexsJones the code addressing your change requests landed in 46394c1 — ready for your re-review.
…etrics Adds the `llmfit bench` subcommand for live inference benchmarking against running providers (Ollama, vLLM, MLX), with role-based quality scoring, agentic test suites, and routing matrix. TUI integration provides a results view and a routing/role view alongside the existing community benchmarks view. Addresses review feedback: * Resolved merge conflicts against current `upstream/main` (~135 commits behind original base; rebased cleanly onto v0.9.18 tip via squash). * Migrated `serde_yaml` (deprecated) to `serde_yml` per Three Foxes' note (`0.0.x` series; drop-in API). * `OLLAMA_HOST` env var override is wired through bench.rs and the TUI start path, paralleling the existing `VLLM_PORT` override. * Renamed `bench_scroll` → `live_bench_scroll` in the App struct to avoid silent shadowing of the upstream community-leaderboard `bench_scroll` field added in #4d0ffa7. * Live-bench keybinding is `B` (uppercase); upstream's `'b'` already opens the community leaderboard view. Cross-PR coordination with AlexsJones#279 (cluster): * `dedupe_hf_entries` is the cluster PR's complete version (handles all current upstream `HfModelEntry` fields including `num_attention_heads`, `num_key_value_heads`, `num_hidden_layers`, `head_dim`, `license`), wired into `load_embedded()`. Uses `canonical_slug()` as the dedup key. Synthetic `test_dedupe_hf_entries_merges_duplicate_metadata` test is included; the prior data-coupled `test_model_database_deduplicates_…` test is intentionally NOT carried forward (fragile to model DB churn). Test strategy: * `cargo fmt --all`: clean. * `cargo build`: clean. * `cargo test --all`: 350 passing, 0 failed, 1 ignored. * `cargo clippy`: no new warnings on bench/quality code or modified files; pre-existing upstream warnings (fit.rs, providers.rs, hardware.rs, update.rs) untouched. Skipped scope-creep items from review: * Splitting `benchmarks.yaml` (1456 lines) into per-role files — left as a follow-up. Closes AlexsJones#278. 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
…etrics (#278) Adds the `llmfit bench` subcommand for live inference benchmarking against running providers (Ollama, vLLM, MLX), with role-based quality scoring, agentic test suites, and routing matrix. TUI integration provides a results view and a routing/role view alongside the existing community benchmarks view. Addresses review feedback: * Resolved merge conflicts against current `upstream/main` (~135 commits behind original base; rebased cleanly onto v0.9.18 tip via squash). * Migrated `serde_yaml` (deprecated) to `serde_yml` per Three Foxes' note (`0.0.x` series; drop-in API). * `OLLAMA_HOST` env var override is wired through bench.rs and the TUI start path, paralleling the existing `VLLM_PORT` override. * Renamed `bench_scroll` → `live_bench_scroll` in the App struct to avoid silent shadowing of the upstream community-leaderboard `bench_scroll` field added in #4d0ffa7. * Live-bench keybinding is `B` (uppercase); upstream's `'b'` already opens the community leaderboard view. Cross-PR coordination with #279 (cluster): * `dedupe_hf_entries` is the cluster PR's complete version (handles all current upstream `HfModelEntry` fields including `num_attention_heads`, `num_key_value_heads`, `num_hidden_layers`, `head_dim`, `license`), wired into `load_embedded()`. Uses `canonical_slug()` as the dedup key. Synthetic `test_dedupe_hf_entries_merges_duplicate_metadata` test is included; the prior data-coupled `test_model_database_deduplicates_…` test is intentionally NOT carried forward (fragile to model DB churn). Test strategy: * `cargo fmt --all`: clean. * `cargo build`: clean. * `cargo test --all`: 350 passing, 0 failed, 1 ignored. * `cargo clippy`: no new warnings on bench/quality code or modified files; pre-existing upstream warnings (fit.rs, providers.rs, hardware.rs, update.rs) untouched. Skipped scope-creep items from review: * Splitting `benchmarks.yaml` (1456 lines) into per-role files — left as a follow-up. Closes #278. 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-authored-by: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
three-foxes-in-a-trenchcoat
left a comment
There was a problem hiding this comment.
Review
Verdict: Approve
Adds remote GPU cluster support for multi-node model fitting. This is a significant feature:
- New
cluster.rsmodule for cluster management - Updates to
models.rsfor multi-node awareness - TUI integration in
main.rs
The feature addresses a real pain point — fitting models across multiple GPUs/nodes. The implementation looks solid:
- Clean separation of cluster logic into its own module
- Proper integration with existing model fitting pipeline
- CI checks all passing (including Rustfmt)
This is a good addition to the project. LGTM.
Adds ClusterConfig and ClusterNode types for representing multi-node
clusters with TOML persistence, IP/hostname validation, Ray Dashboard
auto-discovery, and CLI/TUI integration via --cluster / --no-cluster
flags plus an `llmfit cluster {init,status,clear}` subcommand.
Rebased onto current upstream/main (post-PR AlexsJones#278 bench merge):
* `dedupe_hf_entries` is now in main from AlexsJones#278 — this rebase relies on
the existing function (no duplicate added) since both PRs converged
on the same implementation by design.
* Cluster CLI integration coexists with the bench CLI integration that
landed via AlexsJones#278 — no overlap or competing keybindings.
* Cluster module declared in lib.rs alphabetically between `benchmarks`
and `fit`, alongside the new `bench` and `quality` modules.
All earlier review feedback from PR AlexsJones#279 (Alex + Three Foxes) remains
addressed:
* `total_vram_gb()` multiplies per-GPU VRAM by `gpu_count` per node.
* `GpuBackend` is per-node (`Option<GpuBackend>` on `ClusterNode`) with
`derive_cluster_backend()` warning helper; backward-compatible with
TOML configs that omit the field.
* `unified_memory` aggregated across nodes (Apple Silicon / DGX Spark
GB10 clusters detected correctly).
* Ray discovery prompts interactively for VRAM (no 80 GB hardcode);
prominent warning + per-node prompt when GPU names differ.
* Uses `dirs::config_dir()` for cluster.toml location (same pattern as
upstream's AlexsJones#437 dirs-crate refactor).
* `u8::try_from` for IP increment so 256+ node clusters get a clear
error instead of silent wrap.
* `detect_specs()` lazy-loads via `path.exists()` before parsing TOML.
* `--cluster` / `--no-cluster` threaded through `HardwareOverrides` to
all callers (no hardcoded `(false, false)` in download/recommend).
* `0.85` RAM availability factor extracted to `CLUSTER_AVAILABLE_RAM_FACTOR`
constant.
* Non-numeric input warnings during interactive cluster init (empty
input still accepts the default).
* Dedupe test uses synthetic `HfModelEntry` fixtures (no coupling to
real model database content).
Test strategy: 376 tests passing, `cargo fmt` clean, `cargo build`
clean, `cargo clippy` zero new warnings on cluster code.
Closes AlexsJones#279.
🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)
Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
6d9df6b to
e4e3635
Compare
|
@AlexsJones @three-foxes-in-a-trenchcoat — rebased onto current What changed since the previous push:
All earlier review feedback (Alex's 6 blockers + Three Foxes' 4 nits) remains addressed; spec-tracked in the commit message. Verification:
Ready when you are. |
|
is this Hardware Simulation |
Summary
llmfit cluster init— interactive setup wizard prompting for IPs, VRAM, GPU name, RAM, coresllmfit cluster status/llmfit cluster clear— view/remove config--cluster/--no-clusterCLI flags override local hardware detection~/.config/llmfit/cluster.tomlTensorParallelrun mode: cluster-aware fit analysis with vLLM runtimeFiles changed (11)
llmfit-core/src/cluster.rsllmfit-core/src/fit.rsllmfit-core/src/hardware.rscluster_mode,cluster_node_countfields (+10)llmfit-core/src/lib.rsllmfit-core/Cargo.tomltomldependency (+1)llmfit-core/src/plan.rsllmfit-tui/src/main.rsllmfit-tui/Cargo.tomldirsdependency (+1)llmfit-tui/src/serve_api.rsllmfit-tui/src/tui_ui.rsExample usage
Test plan
llmfit cluster init, verify config savedllmfit fit --cluster, verify aggregated VRAM in output🤖 Generated with Claude Code