
Tags: Blaizzy/mlx-vlm


v0.5.0

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Speculative decoding fixes: auto-detect drafter kind and preserve multimodal prefill (#1125)

* Auto-detect drafter kind from HF model_type, improve error message

When --draft-model points at a Gemma 4 assistant checkpoint but the user
omits --draft-kind mtp, the kind silently defaults to "dflash" and the
DFlash round-loop crashes deep inside draft_block with an opaque
"set_shared_kv() must be called" RuntimeError (issue #1122).

- load_drafter() now peeks at the drafter's HF config.json model_type and
  overrides --draft-kind when the value implies a specific round-loop
  ("gemma4_assistant" → "mtp"). Returns (model, resolved_kind) so callers
  use the right round-loop downstream.
- CLI (generate.py) and server pick up the resolved kind and print a
  clear note when an override happens.
- The fallback RuntimeError in gemma4_assistant.draft_block now points at
  the actual fix ("pass --draft-kind mtp on the CLI / set
  MLX_VLM_DRAFT_KIND=mtp on the server").
- Add 7 unit tests for the resolver, covering the override, no-op,
  missing-config, and malformed-config cases.
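The resolver described above can be sketched roughly as follows. This is an illustrative stand-in, not the real `load_drafter()`: the function name `resolve_draft_kind`, the mapping constant, and the return shape are assumptions made for the sketch.

```python
import json
from pathlib import Path

# Hypothetical sketch of the model_type-based override; the actual
# load_drafter() in mlx-vlm may differ in names and structure.
KIND_BY_MODEL_TYPE = {"gemma4_assistant": "mtp"}  # model_type -> round-loop

def resolve_draft_kind(drafter_path, requested_kind="dflash"):
    """Return (kind, overridden): the round-loop kind implied by the
    drafter's HF config.json, falling back to the requested kind when
    the config is missing, malformed, or implies nothing."""
    config_file = Path(drafter_path) / "config.json"
    try:
        model_type = json.loads(config_file.read_text()).get("model_type")
    except (OSError, ValueError):
        return requested_kind, False  # missing/malformed config: no override
    implied = KIND_BY_MODEL_TYPE.get(model_type)
    if implied is not None and implied != requested_kind:
        return implied, True  # caller prints a note about the override
    return requested_kind, False
```

The `overridden` flag is what lets the CLI and server print the clear note mentioned above instead of silently switching kinds.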

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix MTP repetition loop on long generations (#1122)

rollback_speculative_cache skipped trim when is_trimmable() returned
False. For sliding-attention layers (RotatingKVCache /
BatchRotatingKVCache), is_trimmable() returns False once
offset >= max_size (= sliding_window). After ~1024 tokens, sliding
caches stopped getting rolled back while full-attention caches kept
trimming — rejected drafts' K/V leaked in as fake history, corrupting
attention and pushing the model into a repetition loop.

Drop the gate; trim() is just a counter decrement and the next verify
forward's _update_concat correctly drops the stale slots.
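The gate removal can be illustrated with toy stand-ins (the real `RotatingKVCache` and `trim()` live in mlx-lm and are more involved; the classes below exist only to show the control-flow change):

```python
# Toy caches illustrating the rollback fix; not the real mlx-lm classes.
class ToyCache:
    def __init__(self, sliding=False, max_size=4):
        self.offset = 0
        self.sliding = sliding
        self.max_size = max_size

    def is_trimmable(self):
        # Sliding caches report untrimmable once the window is full.
        return not (self.sliding and self.offset >= self.max_size)

    def trim(self, n):
        self.offset -= n  # just a counter decrement

def rollback_speculative_cache(caches, n_rejected):
    for c in caches:
        # Old (buggy): `if c.is_trimmable(): c.trim(n_rejected)` — sliding
        # layers stopped rolling back once full, leaking rejected drafts'
        # K/V into history. New: trim unconditionally.
        c.trim(n_rejected)
```

Because `trim()` only decrements the offset, unconditional rollback is safe: the stale slots are overwritten by the next verify forward either way.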

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Make --draft-kind default to auto-detect

When --draft-kind is omitted, infer the round-loop from the drafter's
HF model_type (gemma4_assistant → mtp; everything else → dflash).
Explicit --draft-kind still wins, with the existing override+warning if
it disagrees with the drafter's model_type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* format

* Fix MTP+image: thread inputs_embeds through speculative prefill

The speculative prefill path called the language model with raw input_ids,
discarding the merged inputs_embeds (and per_layer_inputs) returned by
_gpu_embed. Image/audio features were therefore lost and the model
attended to vision-token placeholders as if they were normal text — the
output described the image as a "block of repetitive Unicode characters".

Pad each request's pre-computed inputs_embeds and per_layer_inputs to
max_len, batch them, and pass through to lm() instead of letting it
re-embed input_ids.
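The batching step can be sketched with NumPy standing in for `mx.array` (the helper name `batch_inputs_embeds` is hypothetical; the real prefill code also carries `per_layer_inputs` through the same path):

```python
import numpy as np

# Illustrative sketch: left-pad each request's precomputed inputs_embeds
# to max_len and stack them, instead of letting lm() re-embed input_ids
# (which would discard the merged image/audio features).
def batch_inputs_embeds(per_request_embeds, pad_value=0.0):
    """per_request_embeds: list of (T_i, D) arrays -> (B, max_len, D)."""
    max_len = max(e.shape[0] for e in per_request_embeds)
    dim = per_request_embeds[0].shape[1]
    batch = np.full((len(per_request_embeds), max_len, dim), pad_value)
    for row, embeds in enumerate(per_request_embeds):
        batch[row, max_len - embeds.shape[0]:] = embeds  # left-pad short rows
    return batch
```

Left-padding matches the layout assumed by the mask fixes further down: short rows have their real tokens at the end of the buffer.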

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix speculative multimodal prefill batching

* Fix Gemma 4 MTP drafter masks under left-padded batches

The drafter masks treated 'valid keys' as the FIRST kv_valid_len buffer
positions (k_idx < kv_valid_len). The target's batched prefill *left-pads*
short rows, so the real keys actually live at the END of the buffer
(positions [kv_len - kv_valid_len, kv_len)). The drafter therefore
attended to padded zero-K/V slots and ignored the real prefix —
producing degenerate per-row output ('*   The user. is. asking.     *
for. a.') in mixed-length batches.

Flip the validity check to 'k_idx >= kv_len - kv_valid_len' for the full
mask, and shift the SWA mask's k coords by left_padding so the q-k
distance is computed in RoPE space (where real keys are at positions
0..T_real-1, padded keys at <0).
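The flipped validity check for the full mask reduces to a one-liner; this NumPy sketch shows only the boolean condition, not the full additive attention mask or the SWA shift:

```python
import numpy as np

# Minimal sketch of the flipped check: under left-padding the real keys
# occupy the LAST kv_valid_len buffer slots, not the first.
def key_valid_mask(kv_len, kv_valid_len):
    k_idx = np.arange(kv_len)
    # Old (buggy): k_idx < kv_valid_len  -> attended padded zero-K/V slots.
    return k_idx >= kv_len - kv_valid_len
```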

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add left-padding support for batched shared K/V normalization

Introduce a new function to normalize batched shared K/V states under left-padding, so the drafter's prefix-valid layout correctly accounts for left-padded rows in mixed-length batches. Update Gemma4AssistantDraftModel and the associated masks to use it, and add tests covering normalization with left-padding.

Co-Authored-By: Claude Opus 4.7 (1M context)

* Fix batched Qwen DFlash rollback

* Add eos_token_id support to TextConfig and ModelConfig for qwen_3_5

* Add resolve_qwen_eos_token_id function to streamline EOS token handling in ModelConfig

Introduce resolve_qwen_eos_token_id to centralize eos_token_id handling in ModelConfig for both qwen3_5 and qwen3_5_moe. The function ensures QWEN_CHAT_EOS_TOKEN_ID is included in the eos_token_id list when not already present. Update tests to match the new logic.
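The normalization logic plausibly looks like the following sketch. The constant's value here is a placeholder for illustration, not necessarily the real `QWEN_CHAT_EOS_TOKEN_ID`:

```python
# Placeholder value for the sketch; the real constant lives in mlx-vlm.
QWEN_CHAT_EOS_TOKEN_ID = 151645

def resolve_qwen_eos_token_id(eos_token_id):
    """Normalize eos_token_id (None, int, or list) to a list that is
    guaranteed to include the chat EOS id."""
    if eos_token_id is None:
        return [QWEN_CHAT_EOS_TOKEN_ID]
    if isinstance(eos_token_id, int):
        eos_token_id = [eos_token_id]
    if QWEN_CHAT_EOS_TOKEN_ID not in eos_token_id:
        eos_token_id = [*eos_token_id, QWEN_CHAT_EOS_TOKEN_ID]
    return eos_token_id
```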

* format

* refactor speculative tests

* Refactor test_utils.py by removing the get_class_predicate test and enhancing the MockProcessor class with a more detailed DummyTokenizer implementation. Update load_image tests to improve error handling and mock response behavior.

* Fix speculative server test helper

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v0.4.4

Optimize TurboQuant Metal kernels: 0.85-1.90x baseline with 89% KV savings (#909)

* Enhance TurboQuant performance with new fused decode kernel and optimizations (0.59x vs baseline)

- Introduced `_fully_fused_decode_kernel` for scoring, online softmax, value accumulation, and normalization in a single Metal dispatch, reducing dispatch count from 7 to 1.
- Optimized `_prod_score_repeat_kernel` by grouping q_rot by codebook index, minimizing inner loop computations.
- Added precomputed masks and entry counts to streamline processing and improve efficiency.
- Updated comments for clarity on new implementations and their benefits.

These changes significantly enhance the TurboQuant decoding process, improving performance and reducing memory overhead.
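For intuition, the online-softmax-plus-value-accumulation loop that such a fused decode kernel performs in one dispatch can be written in pure Python (a reference sketch for one query row, not the Metal implementation):

```python
import math

# Streaming (online) softmax: track a running max and rescale the running
# denominator and value accumulator whenever the max increases, so keys
# and values can be consumed in a single pass.
def online_softmax_attend(scores, values):
    """scores: list of floats; values: list of equal-length vectors."""
    m = -math.inf            # running max score
    denom = 0.0              # running sum of exp(score - m)
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # rescale old state (0.0 on first step)
        w = math.exp(s - m_new)
        denom = denom * scale + w
        acc = [a * scale + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

The result matches a two-pass softmax-weighted sum exactly, which is what lets the kernel fuse scoring, softmax, and value accumulation without materializing the full score row.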

* Implement tiled and fused MSE score kernels for TurboQuant (10% faster)

- Introduced `_mse_score_tiled_kernel` to optimize MSE score computation by preloading query values into registers and processing tokens in tiles, significantly reducing dispatch overhead.
- Added `_mse_dequant_rotated_kernel` for efficient fused dequantization in rotated space, combining unpacking, codebook lookup, and norm scaling in a single Metal dispatch.
- Enhanced `_fast_mse_dequant_rotated` to utilize the new kernel for improved performance in MSE state dequantization.
- Updated the `_metal_mse_score` function to leverage the tiled kernel for better efficiency in scoring operations.

These changes enhance the performance and efficiency of TurboQuant operations, particularly in handling MSE computations.

* Enhance TurboQuant fused MSE decode kernel for improved performance

- Updated `_fused_mse_decode_kernel` to align with MLX SDPA architecture, optimizing threadgroup and SIMD configurations for better parallelism.
- Refactored source code to streamline processing of key and value tokens, incorporating shared memory for cross-simdgroup reductions.
- Improved online softmax and value accumulation logic, enhancing overall efficiency in the decoding process.

These changes significantly boost the performance of TurboQuant operations, particularly in handling MSE decoding tasks.

* Implement two-pass fused MSE decode kernels for TurboQuant

- Added `_fused_mse_decode_2pass_1_kernel` for block-parallel quantized attention, optimizing key and value processing with pre-rotated queries and shared memory.
- Introduced `_fused_mse_decode_2pass_2_kernel` to reduce partial block results via cross-block online softmax, enhancing efficiency in the decoding process.
- Updated comments for clarity on the new implementations and their alignment with MLX SDPA architecture.

These changes significantly improve the performance and scalability of TurboQuant operations, particularly in handling MSE decoding tasks.

* Enhance quantization process in TurboQuantMSECodec

- Updated the `_quantize_unit` method to improve efficiency by eliminating unnecessary D×D rotations when the estimate is not computed.
- Introduced handling for cases where `self.bits` is zero, returning an appropriately shaped zero array.
- Added detailed docstring to clarify the purpose and functionality of the updated method.

These changes optimize the quantization process, contributing to better performance in TurboQuant operations.

* Add unrolled extraction and scoring functions for TurboQuant

- Introduced `_gen_unrolled_extract`, `_gen_unrolled_score`, and `_gen_unrolled_value` functions to facilitate MLX-style unrolled byte extraction and accumulation for key and value processing.
- Updated the `_fused_mse_decode_kernel` and `_fused_mse_decode_2pass_1_kernel` to utilize the new unrolled functions, enhancing performance and efficiency in the decoding process.
- Improved comments for clarity on the new implementations and their alignment with TurboQuant architecture.

These changes significantly optimize the TurboQuant decoding process, particularly in handling key and value extraction and scoring operations.

* Refactor fused MSE quantization kernel for improved packing and clarity

- Enhanced the `_fused_mse_quantize_kernel` by implementing a more efficient packing mechanism for indices, allowing for per-dimension handling and reducing complexity.
- Updated comments to clarify the new packing logic and its alignment with TurboQuant architecture.
- Adjusted the `quantize` method in `_TurboQuantMSECodec` to reflect support for all bit widths, improving flexibility in quantization operations.

These changes optimize the quantization process, contributing to better performance and clarity in TurboQuant operations.

* Implement fused key-value quantization kernel for TurboQuant

- Introduced `_fused_kv_quantize_kernel` to optimize key and value quantization in a single dispatch, reducing the number of required calls and improving performance.
- Enhanced the `_fused_mse_quantize_kernel` with updated comments for clarity on the new packing logic and its efficiency.
- Adjusted the Metal source code for both kernels to streamline processing and improve parallelism.

These changes significantly enhance the quantization process in TurboQuant, contributing to better performance and reduced dispatch overhead.

* Refactor RHT forward and inverse functions in TurboQuant for improved performance

- Updated the `_rht_forward` and `_rht_inverse` functions to utilize `mx.hadamard_transform`, replacing the previous `_fast_walsh_hadamard` calls for better efficiency.
- Enhanced docstrings to clarify the use of `mx.hadamard_transform` and its impact on performance, particularly for dimensions that are powers of 2.
- Adjusted the handling of padding and scaling to streamline the operations and improve clarity.

These changes optimize the RHT operations in TurboQuant, contributing to enhanced performance and reduced computational overhead.
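A NumPy sketch of the randomized Hadamard transform pair, with a naive `fwht` standing in for `mx.hadamard_transform` (which, as noted above, requires power-of-2 dimensions; the sign vector is the "random" part of RHT):

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a 1-D vector
    whose length is a power of 2 (stand-in for mx.hadamard_transform)."""
    x = np.array(x, dtype=np.float64)
    h, n = 1, x.shape[0]
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def rht_forward(x, signs):
    # Random sign flip, then orthonormal Hadamard: y = H(x*s)/sqrt(n).
    return fwht(x * signs) / np.sqrt(x.shape[0])

def rht_inverse(y, signs):
    # H/sqrt(n) is its own inverse; undo the sign flip afterwards.
    return fwht(y) / np.sqrt(y.shape[0]) * signs
```

Because the normalized Hadamard matrix is orthogonal and the signs are ±1, the round trip is exact, which is what makes the rotation free to invert after dequantization.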

* Add fused non-rotated quantization kernel for TurboQuant

- Introduced `_fused_norot_quantize_kernel` to optimize quantization of pre-rotated vectors without internal rotation, enhancing performance when used with `mx.hadamard_transform`.
- Updated the `quantize` method in `_TurboQuantMSECodec` to utilize the new kernel for efficient single-token decoding, improving handling of packed data.
- Enhanced comments for clarity on the new kernel's functionality and its integration within the TurboQuant architecture.

These changes significantly improve the quantization process, contributing to better performance in TurboQuant operations.

* format

* remove dead code

* format

* Remove unused kernel functions from TurboQuant

- Deleted the `_prod_score_multi_kernel`, `_mse_weighted_rot_multi_kernel`, and `_mse_dequant_rotated_kernel` functions, which were not utilized in the current implementation.
- Cleaned up the codebase by eliminating dead code, enhancing maintainability and readability.

These changes streamline the TurboQuant module by removing unnecessary complexity.

* Fix decoding divergence

- Refactored `_gen_unrolled_extract`, `_gen_unrolled_score`, and `_gen_unrolled_value` to support runtime bit offsets, improving flexibility in byte extraction.
- Updated `_fused_mse_decode_kernel` to include a dimension parameter, ensuring compatibility with varying input sizes and enhancing performance.
- Adjusted test cases to reflect changes in codec usage, switching to the new MSE-only codec for improved speed and quality.

These modifications streamline the TurboQuant module, enhancing its efficiency and adaptability for different scenarios.

* format

* Refactor kernel documentation in TurboQuant

- Updated docstrings for several fused decode and quantization kernels to enhance clarity and conciseness.
- Removed redundant details while retaining essential information about kernel functionality and architecture.
- Improved overall readability of the codebase, making it easier for future developers to understand the purpose and operation of each kernel.

These changes streamline the documentation within the TurboQuant module, facilitating better comprehension and maintenance.

* Enhance README with TurboQuant KV cache quantization details

- Added new command-line options `--kv-bits` and `--kv-quant-scheme` for KV cache quantization configuration.
- Included example usage for running the server with TurboQuant settings.
- Updated documentation on the quantization process for keys and values to improve clarity.

These changes provide users with better guidance on utilizing TurboQuant features effectively.

* Refactor quantization logic in KV cache handling

- Updated the `quantize_entry` function to skip `RotatingKVCache` entries, optimizing the quantization process.
- Changed the handling of lists in `quantize_entry` to modify entries in place for improved performance.
- Added logic to skip quantization for the last layer in `prompt_cache`, addressing sensitivity issues in deep models.

These changes enhance the efficiency and effectiveness of the quantization process within the TurboQuant module.
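The skip rules above amount to a small loop; this sketch uses a stand-in class for the real mlx-lm `RotatingKVCache` and a caller-supplied `quantize_entry`:

```python
# Stand-in for the sliding-window cache class in mlx-lm.
class RotatingKVCache:
    pass

def quantize_prompt_cache(prompt_cache, quantize_entry):
    """Quantize cache entries in place, skipping RotatingKVCache entries
    and the final layer (left full-precision for sensitivity in deep
    models, per the commit above)."""
    for i, entry in enumerate(prompt_cache):
        if isinstance(entry, RotatingKVCache):
            continue  # sliding caches are not quantized
        if i == len(prompt_cache) - 1:
            continue  # last layer stays full-precision
        prompt_cache[i] = quantize_entry(entry)
```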

v0.4.3

Remove TurboQuant benchmark artifacts and add README docs (#894)

* Remove benchmark test files and add TurboQuant README section

- Remove NIAH test data, runner scripts, PPL evaluation, and plot tools
  (these were development/benchmarking artifacts, not production tests)
- Keep test_turboquant.py (unit tests for the codec)
- Add TurboQuant KV Cache section to README with:
  - Quick start CLI and Python examples
  - How it works (rotation + codebook quantization)
  - Performance tables (Qwen3.5-4B, gemma-4-31b)
  - Supported bit widths guide
  - Compatibility notes for different cache types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add all kv_bits options (2, 3, 3.5, 4) to TurboQuant README section

Show CLI examples for each bit width and expand the supported bit widths
table with key/value bit breakdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "Add all kv_bits options (2, 3, 3.5, 4) to TurboQuant README section"

This reverts commit 52569e9.

* Update README.md to reflect performance metrics for TurboQuant 3.5-bit, replacing outdated decode and prefill rates with peak memory usage statistics. This change highlights a significant reduction in KV memory usage and provides clearer insights into TurboQuant's efficiency.

* Update README.md to correct the performance metrics section for TurboQuant 3.5-bit, changing "Active Memory" to "Peak Memory" for clarity and accuracy in reporting memory usage statistics.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.4.2

Fix PaliGemma processor kwarg routing (#877)

* Fix PaliGemma processor kwarg routing (#870)

* Update .gitignore

v0.4.1

Add molmo point (#844)

v0.4.0

Update dependencies and version number (#809)

- Bump mlx-lm version to 0.31.0 in requirements.txt.
- Update package version to 0.4.0 in version.py.

v0.3.12

Add dots-ocr (#749)

* add dots-ocr

* add docs

v0.3.11

Refactor Attention class to calculate n_kv_heads dynamically based on head_dim, improving clarity and maintainability of the code. (#713)

v0.3.10

TFMS v5 RC3 + Fix processor registry (#693)

* Refactor DeepSeek OCR models to standardize processor naming and enhance functionality

- Renamed `DeepseekVLV2Processor` to `DeepseekOCRProcessor` in the original DeepSeek OCR model for consistency.
- Introduced a new `DeepseekOCR2Processor` in the `deepseekocr_2` module, implementing a patch for `AutoProcessor` to correctly handle model loading.
- Updated the `get_input_embeddings` method to improve input handling and ensure compatibility with new processor classes.
- Enhanced model initialization and query handling to support dynamic resolution and improved feature extraction.

* Enhance test_smoke.py for improved memory management and cleanup

- Added garbage collection and synchronization steps to ensure proper memory handling during tests.
- Updated cleanup process to include the configuration object, enhancing resource management after test execution.

* Update dependencies and configuration files

- Removed `mlx-audio` from optional dependencies in `pyproject.toml`.
- Updated `transformers` version from `5.0.0rc1` to `5.0.0rc3` and `mlx-lm` version from `0.30.2` to `0.30.5` in `requirements.txt`.
- Enhanced `uv.lock` with additional resolution markers for Python version compatibility and updated package distribution details.

* Fix device info retrieval in wired_limit function

- Updated the method of accessing the maximum recommended working set size from the MX library, changing from `mx.metal.device_info()` to `mx.device_info()`. This change ensures compatibility with the current MX library structure and improves the accuracy of memory management in the model.

* Refactor version retrieval in test_smoke.py

- Updated the method of retrieving package versions by replacing direct imports with `importlib.metadata.version`. This change enhances compatibility and standardizes version retrieval for MLX, MLX-VLM, and Transformers packages in the test suite.

v0.3.9

Add ministral3 (#611)

* Add to_dict method to BaseModelConfig for improved serialization

* Filter out consolidated model weights when loading .safetensors files in load_model function
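The filter in question is plausibly a glob exclusion along these lines (an illustrative sketch; the real `load_model` weight discovery may differ):

```python
from pathlib import Path

# Skip `consolidated*.safetensors` shards (combined-weight exports) so
# only the split per-shard weight files are loaded.
def weight_files(model_path):
    return sorted(
        p for p in Path(model_path).glob("*.safetensors")
        if not p.name.startswith("consolidated")
    )
```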

* add ministral 3

* Add tests for Mistral3 and Ministral3 models in test_models.py

* bump version

* remove unused