Tags: unslothai/unsloth
Fix macOS Apple Silicon installs resolving torch against x86_64 (#5976)

* Fix macOS Apple Silicon installs that resolve torch against x86_64

On Apple Silicon, `uv venv --python 3.13` can reuse a cached x86_64 (Rosetta) CPython, often because uv itself is an x86_64 build. The resulting venv reports macosx_*_x86_64 to the wheel resolver, but PyTorch has shipped no macOS x86_64 wheels since 2.2.2, so the torch install fails with "no wheels with a matching platform tag (macosx_..._x86_64)".

Two changes, both scoped to macOS arm64 and additive (no other install path is affected):

- Create the venv with an arch-explicit `cpython-X.Y-macos-aarch64-none` request on Apple Silicon (no --python override), so uv cannot fall back to a cached x86_64 interpreter.
- Harden the existing x86_64 venv guard: when the venv python cannot be executed (x86_64 binary on a Mac without Rosetta), the platform.machine() probe returns empty and the recreate was silently skipped. Fall back to reading the binary's Mach-O arch via lipo/file so migrated or pre-existing x86_64 venvs are still recreated as arm64.

* Harden arm64 static-arch fallback: file -L and set -e safety

Address review feedback on the lipo/file fallback:

- uv symlinks the venv's bin/python to the base interpreter; plain `file` reports the symlink ("symbolic link to ...") and the arch substring never matches. Use `file -L` to dereference (lipo already follows the link).
- Append `|| true` so the command substitution cannot abort the installer under set -e on a Mac that has neither lipo nor file.

---------

Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
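The static-arch fallback described in this commit lives in the shell installer. The following is only a rough Python sketch of the same idea, assuming `lipo -archs` and `file -L` are the probes. The function names `classify_macho_arch` and `probe_binary_arch` are hypothetical and do not come from the repository.

```python
import subprocess


def classify_macho_arch(tool_output: str) -> str:
    """Classify text printed by `lipo -archs` or `file -L` (hypothetical helper).

    Returns "arm64", "x86_64", "universal", or "" when no arch is recognised,
    e.g. when plain `file` only reports "symbolic link to ...".
    """
    has_arm = "arm64" in tool_output
    has_x86 = "x86_64" in tool_output
    if has_arm and has_x86:
        return "universal"
    if has_arm:
        return "arm64"
    if has_x86:
        return "x86_64"
    return ""


def probe_binary_arch(python_path: str) -> str:
    """Read the interpreter's arch without executing it (illustrative only)."""
    probes = (["lipo", "-archs", python_path], ["file", "-L", python_path])
    for cmd in probes:
        try:
            out = subprocess.run(
                cmd, capture_output=True, text=True, check=False
            ).stdout
        except OSError:
            # Tool not installed: try the next probe, mirroring `|| true`.
            continue
        if cmd[0] == "file":
            # Drop the "path:" prefix so an arch-like path name cannot match.
            out = out.partition(":")[2]
        arch = classify_macho_arch(out)
        if arch:
            return arch
    return ""
```

A caller would recreate the venv only when `probe_binary_arch` returns `"x86_64"` on an arm64 host. An empty result means the probe was inconclusive and should not trigger a rebuild on its own.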
Studio: clearer MCP server validation when stdio is disabled (#5928)

Improve the rejection message when an MCP server address is not an http(s) URL. It now points to the expected http(s):// form with an example, and only mentions that local commands are disabled when the value contains whitespace (a reliable command signal), since a lone token may just be a scheme-less URL. Wording is host-scoped rather than desktop-only because self-hosted hosts can opt in via an env var. Backend only, with tests; accepted input is unchanged.
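A minimal sketch of the message-selection rule described above, assuming a validator that returns either `None` (accepted) or a rejection string. The function name, the example URL, and the exact wording are all hypothetical. The real backend code and messages are not shown in this log, and treating any whitespace as a rejection is an assumption of the sketch.

```python
from typing import Optional
from urllib.parse import urlsplit


def validate_mcp_address(value: str) -> Optional[str]:
    """Return None if the address is accepted, else a rejection message."""
    candidate = value.strip()
    has_space = any(ch.isspace() for ch in candidate)
    try:
        parts = urlsplit(candidate)
        is_http_url = parts.scheme in ("http", "https") and bool(parts.netloc)
    except ValueError:
        # urlsplit raises on some malformed inputs (e.g. a bad IPv6 literal).
        is_http_url = False
    if is_http_url and not has_space:
        return None
    message = "Expected an http(s):// URL, for example https://example.com/mcp."
    if has_space:
        # Whitespace is treated as the reliable "this is a command" signal.
        message += " Local commands are disabled on this host."
    return message
```

A lone token such as `example.com/mcp` gets only the URL hint, while something like `npx -y some-server` also gets the note about local commands.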
test(install): cover the Apple Silicon venv arch rebuild guard

Extracts the real guard block from install.sh and asserts: clean arm64 venv untouched, x86_64 venv rebuilt as arm64, the x86_64-then-3.13.8 corner case, arm64 3.13.8 downgrade preserved, --python skip, and Intel/Rosetta no-op.

Co-authored-by: Ramakrishna Bachu <ramankrishna10@gmail.com>
studio: engage draft-mtp on vision MTP GGUFs (drop incorrect vision gate) (#5560)

* studio: engage draft-mtp on vision MTP GGUFs

The draft-mtp auto-promotion in LlamaCppBackend.load_model was gated on not effective_is_vision, and the spec-emit branch repeated the same guard. Every Unsloth -MTP GGUF repo ships an mmproj projector, so effective_is_vision was always True for those repos and the MTP speedup silently never engaged out of the box.

llama.cpp #22673 explicitly states MTP is compatible with vision input. The bundled b9204 server happily loads both: a manual run with --mmproj ... --spec-type draft-mtp --spec-draft-n-max 6 logs "loaded multimodal model" followed by "adding speculative implementation 'draft-mtp'".

Drop the vision gate from both sites and rewrite the matching short circuit in _already_in_target_state so reload checks reach the auto promotion path on vision MTP loads. Add three regression tests covering vision MTP match (auto and default), and non MTP vision repo unaffected.

Verified on a B200 with unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL: base decode 179.7 t/s vs MTP decode 253.8 t/s, draft acceptance 0.57, 1.41x speedup on a 255 token completion. mmproj still loads and image input remains available.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: prefer Qwen3.5 -MTP GGUF variants in default model lists

With the vision gate dropped in the previous commit, draft-mtp now auto-engages on -MTP GGUF repos out of the box.
Swap the four Qwen3.5 recommended entries in DEFAULT_MODELS_GGUF and DEFAULT_MODELS_STANDARD to their -MTP-GGUF counterparts so new users get the speedup by default:

  unsloth/Qwen3.5-4B-GGUF -> unsloth/Qwen3.5-4B-MTP-GGUF
  unsloth/Qwen3.5-9B-GGUF -> unsloth/Qwen3.5-9B-MTP-GGUF
  unsloth/Qwen3.5-35B-A3B-GGUF -> unsloth/Qwen3.5-35B-A3B-MTP-GGUF
  unsloth/Qwen3.5-0.8B-GGUF -> unsloth/Qwen3.5-0.8B-MTP-GGUF

All four HF repos exist (HEAD 200) and ship the same UD-Q4_K_XL quant layout as the non-MTP variants. Non-Qwen3.5 entries are untouched.

* bump version to 2026.5.4

Picks up the studio MTP vision-gate fix and the Qwen3.5 -MTP default swap in this PR.

* studio: prefer Qwen3.6-35B-A3B-MTP-GGUF in default model lists

Same rationale as the previous Qwen3.5 swap. The Qwen3.6 MTP variant exists at unsloth/Qwen3.6-35B-A3B-MTP-GGUF (HF HEAD 200) and now auto-engages draft-mtp out of the box with the gate fix.

* studio: drop --spec-draft-n-max from 6 to 3 for draft-mtp

n=6 is too greedy: on Qwen3.6 the draft has to guess 6 tokens ahead and acceptance crashes to ~0.45, leaving only ~14% throughput gain. PR ggml-org/llama.cpp#22673's author benched n=3 at ~0.72 acceptance and 2 to 3x speedup on the same Qwen3.6 family, and the README sample command uses n=2 or n=3. Match that. CPU/Mac branch already uses n=3, so this aligns both paths.

* studio: set --spec-draft-n-max back to 6 for draft-mtp on GPU

Reverts the n=3 tuning. n=6 is the original default; user-side comparisons hold the larger draft window steady so the toggle (next commit) is the primary on/off lever.

* studio: add Speculative Decoding toggle under Max Tokens

Adds a top-level kill switch (panel-switch under Max Tokens, mirroring Auto-Healing Tool Calls) that forces the /load request's speculative_type to "off" when disabled. The backend "off" branch in LlamaCppBackend.load_model skips both the draft-mtp auto-promotion and the spec-emit branch, so neither --spec-type draft-mtp nor --spec-default reaches llama-server.
Wiring:

- chat-runtime-store: new speculativeDecodingEnabled bool, default true, persisted to localStorage under unsloth_speculative_decoding, plus a setSpeculativeDecodingEnabled setter.
- chat-settings-sheet: SpeculativeDecodingToggle rendered immediately beneath the Max Tokens slider for non-external models.
- use-chat-model-runtime: when speculativeDecodingEnabled is false, override speculative_type to "off" in the loadModel call so the switch wins over any pre-existing speculativeType state (including the existing per-model toggle in Model Settings).

Verified end to end on unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL: toggle ON emits --spec-type draft-mtp --spec-draft-n-max 6; toggle OFF emits zero --spec-* flags on the same MTP GGUF.

* studio: relocate Speculative Decoding toggle into Model Settings

Move the toggle out from under Max Tokens and back into the Model Settings section, directly beneath KV Cache Dtype, where the existing Apply/Reset workflow already drives a reload on dirty. This way flipping the switch in the UI actually picks up: the section becomes dirty, Apply re-runs /load with the new speculative_type.

Drop the !currentModelIsMultimodal gate so vision MTP GGUFs can also disable speculative decoding from the UI. Switch the toggle's off-value from null to "off" so the backend's "off" short-circuit fires for MTP models too (null normalises to None which re-triggers the draft-mtp auto-promotion). Tooltip now reads "Faster generation with 0% accuracy hit".

Remove the now-redundant speculativeDecodingEnabled bool + setter from the runtime store and the load-time override in use-chat-model-runtime; the toggle binds directly to speculativeType.

* studio: restore OOM/TIGHT badge on recommended GGUF rows

The recommended-list row passed vramStatus=null for any GGUF repo because the existing useRecommendedModelVram hook reads safetensors totals from HF model info, which GGUF-only repos do not expose.
As a result, an OOM Q-quant repo would render with only a "GGUF" badge and no visual signal that nothing in it fits.

Add useGgufRecommendedFit: per repo, fetch the variant list via the existing /api/models/gguf-variants endpoint, take the smallest variant's size_bytes, and classify with the same 0.7*GPU + 0.7*RAM thresholds as GgufVariantExpander. Session-scoped cache + in-flight dedup so a repo is requested at most once. Wire the result into the three GGUF row sites in pickers.tsx so OOM and TIGHT badges show on the collapsed cards.

* Revert "studio: restore OOM/TIGHT badge on recommended GGUF rows"

This reverts commit 07793b1240df72b13e51d6dc15f63c4ee8c6cba9.

The new useGgufRecommendedFit hook was treating the symptom. PR #5561 identified the real root cause: useGpuInfo was calling /api/system with plain fetch instead of authFetch, so the session-auth check failed silently and gpu.available stayed false everywhere. With no GPU info, every fit check (variant expander, recommended carousel) fell back to "no signal" and dropped the OOM/TIGHT badges.

Reverting the over-engineered hook and applying the authFetch fix in the next commit, which restores the existing badges with one line.

* chore: replace qwen suggested with MTP variant

* fix: restore GPU info auth for GGUF fit badges

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
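The speculative-decoding behaviour this PR ends up with can be summarised as a small decision function. This is a sketch built only from the commit text, not from the actual LlamaCppBackend.load_model code. The name `spec_flags` and its signature are hypothetical, and speculative types other than draft-mtp are deliberately out of scope.

```python
from typing import List, Optional

# GPU draft window the PR settles on after the n=3 experiment was reverted.
DRAFT_N_MAX = 6


def spec_flags(speculative_type: Optional[str], is_mtp_gguf: bool) -> List[str]:
    """Return the llama-server --spec-* flags for a load request (illustrative)."""
    if speculative_type == "off":
        # Explicit kill switch: skip auto-promotion and the spec-emit branch.
        return []
    if speculative_type is None and is_mtp_gguf:
        # Auto-promotion; after this PR it is no longer gated on vision.
        speculative_type = "draft-mtp"
    if speculative_type == "draft-mtp":
        return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(DRAFT_N_MAX)]
    # Other speculative types are not modelled in this sketch.
    return []
```

This also shows why the toggle's off-value had to change from null to "off": a null request on an MTP GGUF is promoted back to draft-mtp, whereas "off" short-circuits before the promotion.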
studio/chat: release stuck IME flag when compositionend never fires (#5551)

* studio/chat: release stuck IME flag when compositionend never fires

Chrome on Windows talking to a WSL-hosted Studio (issue #5546) fires compositionstart + compositionupdate but no compositionend after the IME commits. The earlier hardening in #5327 cleared the stale flag on the next non-composing input event, which never arrives in this sequence, so composingRef stays true forever and the Send button stays disabled even though the committed CJK text is already in the textarea.

Add a watchdog in both useImeComposerInputHandlers (main + edit composer) and SharedComposer (compare mode) that runs the same reset the missing compositionend would have done. The timer is rearmed on every compositionupdate and on every non-composing input so it only fires when the IME pipeline has actually gone quiet: normal candidate selection keeps it alive, the WSL stuck case lets it expire.

Extends the existing IME Playwright smoke with a stuck-compositionend repro and adds a static guard so the watchdog can't be removed without the regression tests catching it.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio/chat: re-pin composing flag on IME keydown to close #5546 watchdog gap

The stuck-compositionend watchdog (PR #5551) releases composingRef after 2500 ms of IME silence so Send unwedges in the WSL+Chrome case. The same release also fires during a long candidate-window pause in healthy IMEs, which lets a subsequent IME-confirm Enter slip preedit text through handleSubmit (main composer) or click-Send through send() (compare composer).

Add a keydown gate to both composers: when the browser still reports nativeEvent.isComposing or keyCode 229, re-pin composingRef and cancel any pending watchdog so the next form-submit / send() guard refuses.
The Send button stays visually enabled (avoids re-introducing the stuck-UI bug) but the submit path is blocked until a real compositionend or non-composing input arrives. Mirrors the existing isComposing guard shape in shared-composer.onKeyDown.

Tests:

- tests/studio/test_composer_rtl_bidi_attribute.py: two new static guards asserting the keydown gate wiring in both composer files.
- tests/studio/playwright_chat_ime_i18n.py: new section 6c repro that fires the IME-confirm keydown after the watchdog has cleared, then triggers form.requestSubmit() and asserts the preedit text is not cleared (would indicate a leaked submit).

Verified across Chromium / Firefox / WebKit via a side-by-side pre-PR vs post-PR simulation (54 scenarios, zero pageerror or console.error). The #5546 stuck-end repro still passes (Send re-enables 2.5-3 s after the silent commit) and the new keydown-repin probe confirms the submit gate refuses on all three engines.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio/chat: re-arm IME watchdog after keydown re-pin (Codex P1)

The keydown re-pin added in 2c3c979 closed the watchdog-race for healthy IMEs, but on the same WSL+Chrome no-compositionend path this PR targets it would re-lock Send permanently: setting composingRef=true and only *clearing* the watchdog leaves the flag pinned forever if no follow-up compositionend or non-composing input ever arrives.

Swap clearStuckTimer/clearStuckImeTimer for refreshStuckTimer/refreshStuckImeTimer in both composer keydown gates so the watchdog fires once more after every IME keypress. Same visual contract (Send stays enabled); the submit gate just keeps a 2.5s window before re-releasing instead of staying locked.

Extends the playwright IME smoke with section 6d: clears composing via the watchdog, fires an IME keydown, then waits past the re-armed watchdog window and asserts the form submit actually flushes the textarea.
Two new static guards in test_composer_rtl_bidi_attribute lock the refresh call into both keydown handlers.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
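The final watchdog behaviour of this PR (release after 2500 ms of IME silence, re-pin and re-arm on an IME keydown) can be modelled as a small state machine with an explicit clock. The real implementation is React/TypeScript code using timers. The Python class below is only a deterministic model for reasoning about the sequence. Its name and methods are hypothetical, and it omits the non-composing input path.

```python
from dataclasses import dataclass
from typing import Optional

STUCK_IME_TIMEOUT_MS = 2500  # silence window quoted in the commit message


@dataclass
class ComposerImeModel:
    composing: bool = False
    deadline_ms: Optional[int] = None  # when the watchdog would fire

    def _refresh(self, now_ms: int) -> None:
        # Re-arm the watchdog (the refreshStuckTimer role in the commit text).
        self.deadline_ms = now_ms + STUCK_IME_TIMEOUT_MS

    def _tick(self, now_ms: int) -> None:
        # Fire the watchdog if the IME pipeline has been quiet long enough.
        if self.deadline_ms is not None and now_ms >= self.deadline_ms:
            self.composing = False
            self.deadline_ms = None

    def composition_start(self, now_ms: int) -> None:
        self._tick(now_ms)
        self.composing = True

    def composition_update(self, now_ms: int) -> None:
        self._tick(now_ms)
        self.composing = True
        self._refresh(now_ms)

    def composition_end(self, now_ms: int) -> None:
        self.composing = False
        self.deadline_ms = None

    def keydown(self, now_ms: int, is_composing: bool) -> None:
        self._tick(now_ms)
        if is_composing:
            # Re-pin the flag and re-arm (not clear) the watchdog, so the
            # flag cannot stay pinned forever if compositionend never comes.
            self.composing = True
            self._refresh(now_ms)

    def can_submit(self, now_ms: int) -> bool:
        self._tick(now_ms)
        return not self.composing
```

Replaying the WSL+Chrome sequence (start, update, no end) shows submit unblocking once the window elapses. A later IME keydown blocks submit again, but only for another 2.5 s window.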