
Tags: Blaizzy/mlx-vlm


v0.5.0

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Speculative decoding fixes: auto-detect drafter kind and preserve multimodal prefill (#1125)

* Auto-detect drafter kind from HF model_type, improve error message

When --draft-model points at a Gemma 4 assistant checkpoint but the user
omits --draft-kind mtp, the kind silently defaults to "dflash" and the
DFlash round-loop crashes deep inside draft_block with an opaque
"set_shared_kv() must be called" RuntimeError (issue #1122).

- load_drafter() now peeks at the drafter's HF config.json model_type and
  overrides --draft-kind when the value implies a specific round-loop
  ("gemma4_assistant" → "mtp"). Returns (model, resolved_kind) so callers
  use the right round-loop downstream.
- CLI (generate.py) and server pick up the resolved kind and print a
  clear note when an override happens.
- The fallback RuntimeError in gemma4_assistant.draft_block now points at
  the actual fix ("pass --draft-kind mtp on the CLI / set
  MLX_VLM_DRAFT_KIND=mtp on the server").
- Add 7 unit tests for the resolver, covering the override, no-op,
  missing-config, and malformed-config cases.
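The resolver described above can be sketched roughly as follows. This is an illustrative stand-in, not the real `load_drafter()`: the function name `resolve_draft_kind`, the mapping constant, and the return shape are assumptions made for the sketch.

```python
import json
from pathlib import Path

# Hypothetical sketch of the model_type-based override; the actual
# load_drafter() in mlx-vlm may differ in names and structure.
KIND_BY_MODEL_TYPE = {"gemma4_assistant": "mtp"}  # model_type -> round-loop

def resolve_draft_kind(drafter_path, requested_kind="dflash"):
    """Return (kind, overridden): the round-loop kind implied by the
    drafter's HF config.json, falling back to the requested kind when
    the config is missing, malformed, or implies nothing."""
    config_file = Path(drafter_path) / "config.json"
    try:
        model_type = json.loads(config_file.read_text()).get("model_type")
    except (OSError, ValueError):
        return requested_kind, False  # missing/malformed config: no override
    implied = KIND_BY_MODEL_TYPE.get(model_type)
    if implied is not None and implied != requested_kind:
        return implied, True  # caller prints a note about the override
    return requested_kind, False
```

The `overridden` flag is what lets the CLI and server print the clear note mentioned above instead of silently switching kinds.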

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix MTP repetition loop on long generations (#1122)

rollback_speculative_cache skipped trim when is_trimmable() returned
False. For sliding-attention layers (RotatingKVCache /
BatchRotatingKVCache), is_trimmable() returns False once
offset >= max_size (= sliding_window). After ~1024 tokens, sliding
caches stopped getting rolled back while full-attention caches kept
trimming — rejected drafts' K/V leaked in as fake history, corrupting
attention and pushing the model into a repetition loop.

Drop the gate; trim() is just a counter decrement and the next verify
forward's _update_concat correctly drops the stale slots.
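The gate removal can be illustrated with toy stand-ins (the real `RotatingKVCache` and `trim()` live in mlx-lm and are more involved; the classes below exist only to show the control-flow change):

```python
# Toy caches illustrating the rollback fix; not the real mlx-lm classes.
class ToyCache:
    def __init__(self, sliding=False, max_size=4):
        self.offset = 0
        self.sliding = sliding
        self.max_size = max_size

    def is_trimmable(self):
        # Sliding caches report untrimmable once the window is full.
        return not (self.sliding and self.offset >= self.max_size)

    def trim(self, n):
        self.offset -= n  # just a counter decrement

def rollback_speculative_cache(caches, n_rejected):
    for c in caches:
        # Old (buggy): `if c.is_trimmable(): c.trim(n_rejected)` — sliding
        # layers stopped rolling back once full, leaking rejected drafts'
        # K/V into history. New: trim unconditionally.
        c.trim(n_rejected)
```

Because `trim()` only decrements the offset, unconditional rollback is safe: the stale slots are overwritten by the next verify forward either way.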

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Make --draft-kind default to auto-detect

When --draft-kind is omitted, infer the round-loop from the drafter's
HF model_type (gemma4_assistant → mtp; everything else → dflash).
Explicit --draft-kind still wins, with the existing override+warning if
it disagrees with the drafter's model_type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* format

* Fix MTP+image: thread inputs_embeds through speculative prefill

The speculative prefill path called the language model with raw input_ids,
discarding the merged inputs_embeds (and per_layer_inputs) returned by
_gpu_embed. Image/audio features were therefore lost and the model
attended to vision-token placeholders as if they were normal text — the
output described the image as a "block of repetitive Unicode characters".

Pad each request's pre-computed inputs_embeds and per_layer_inputs to
max_len, batch them, and pass through to lm() instead of letting it
re-embed input_ids.
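The batching step can be sketched with NumPy standing in for `mx.array` (the helper name `batch_inputs_embeds` is hypothetical; the real prefill code also carries `per_layer_inputs` through the same path):

```python
import numpy as np

# Illustrative sketch: left-pad each request's precomputed inputs_embeds
# to max_len and stack them, instead of letting lm() re-embed input_ids
# (which would discard the merged image/audio features).
def batch_inputs_embeds(per_request_embeds, pad_value=0.0):
    """per_request_embeds: list of (T_i, D) arrays -> (B, max_len, D)."""
    max_len = max(e.shape[0] for e in per_request_embeds)
    dim = per_request_embeds[0].shape[1]
    batch = np.full((len(per_request_embeds), max_len, dim), pad_value)
    for row, embeds in enumerate(per_request_embeds):
        batch[row, max_len - embeds.shape[0]:] = embeds  # left-pad short rows
    return batch
```

Left-padding matches the layout assumed by the mask fixes further down: short rows have their real tokens at the end of the buffer.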

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix speculative multimodal prefill batching

* Fix Gemma 4 MTP drafter masks under left-padded batches

The drafter masks treated 'valid keys' as the FIRST kv_valid_len buffer
positions (k_idx < kv_valid_len). The target's batched prefill *left-pads*
short rows, so the real keys actually live at the END of the buffer
(positions [kv_len - kv_valid_len, kv_len)). The drafter therefore
attended to padded zero-K/V slots and ignored the real prefix —
producing degenerate per-row output ('*   The user. is. asking.     *
for. a.') in mixed-length batches.

Flip the validity check to 'k_idx >= kv_len - kv_valid_len' for the full
mask, and shift the SWA mask's k coords by left_padding so the q-k
distance is computed in RoPE space (where real keys are at positions
0..T_real-1, padded keys at <0).
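The flipped validity check for the full mask reduces to a one-liner; this NumPy sketch shows only the boolean condition, not the full additive attention mask or the SWA shift:

```python
import numpy as np

# Minimal sketch of the flipped check: under left-padding the real keys
# occupy the LAST kv_valid_len buffer slots, not the first.
def key_valid_mask(kv_len, kv_valid_len):
    k_idx = np.arange(kv_len)
    # Old (buggy): k_idx < kv_valid_len  -> attended padded zero-K/V slots.
    return k_idx >= kv_len - kv_valid_len
```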

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add left-padding support for batched shared K/V normalization

Introduce a new function to normalize batched shared K/V states under left-padding, so the drafter's prefix-valid layout correctly accounts for left-padded rows in mixed-length batches. Update Gemma4AssistantDraftModel and the associated masks to use it, and add tests covering normalization with left-padding.

Co-Authored-By: Claude Opus 4.7 (1M context)

* Fix batched Qwen DFlash rollback

* Add eos_token_id support to TextConfig and ModelConfig for qwen_3_5

* Add resolve_qwen_eos_token_id function to streamline EOS token handling in ModelConfig

Introduce resolve_qwen_eos_token_id to centralize eos_token_id handling in ModelConfig for both qwen3_5 and qwen3_5_moe. The function ensures QWEN_CHAT_EOS_TOKEN_ID is included in the eos_token_id list when not already present. Update tests to match the new logic.
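The normalization logic plausibly looks like the following sketch. The constant's value here is a placeholder for illustration, not necessarily the real `QWEN_CHAT_EOS_TOKEN_ID`:

```python
# Placeholder value for the sketch; the real constant lives in mlx-vlm.
QWEN_CHAT_EOS_TOKEN_ID = 151645

def resolve_qwen_eos_token_id(eos_token_id):
    """Normalize eos_token_id (None, int, or list) to a list that is
    guaranteed to include the chat EOS id."""
    if eos_token_id is None:
        return [QWEN_CHAT_EOS_TOKEN_ID]
    if isinstance(eos_token_id, int):
        eos_token_id = [eos_token_id]
    if QWEN_CHAT_EOS_TOKEN_ID not in eos_token_id:
        eos_token_id = [*eos_token_id, QWEN_CHAT_EOS_TOKEN_ID]
    return eos_token_id
```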

* format

* refactor speculative tests

* Refactor test_utils.py by removing the get_class_predicate test and enhancing the MockProcessor class with a more detailed DummyTokenizer implementation. Update load_image tests to improve error handling and mock response behavior.

* Fix speculative server test helper

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v0.4.4

Optimize TurboQuant Metal kernels: 0.85-1.90x baseline with 89% KV savings (#909)

* Enhance TurboQuant performance with new fused decode kernel and optimizations (0.59x vs baseline)

- Introduced `_fully_fused_decode_kernel` for scoring, online softmax, value accumulation, and normalization in a single Metal dispatch, reducing dispatch count from 7 to 1.
- Optimized `_prod_score_repeat_kernel` by grouping q_rot by codebook index, minimizing inner loop computations.
- Added precomputed masks and entry counts to streamline processing and improve efficiency.
- Updated comments for clarity on new implementations and their benefits.

These changes significantly enhance the TurboQuant decoding process, improving performance and reducing memory overhead.
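For intuition, the online-softmax-plus-value-accumulation loop that such a fused decode kernel performs in one dispatch can be written in pure Python (a reference sketch for one query row, not the Metal implementation):

```python
import math

# Streaming (online) softmax: track a running max and rescale the running
# denominator and value accumulator whenever the max increases, so keys
# and values can be consumed in a single pass.
def online_softmax_attend(scores, values):
    """scores: list of floats; values: list of equal-length vectors."""
    m = -math.inf            # running max score
    denom = 0.0              # running sum of exp(score - m)
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # rescale old state (0.0 on first step)
        w = math.exp(s - m_new)
        denom = denom * scale + w
        acc = [a * scale + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

The result matches a two-pass softmax-weighted sum exactly, which is what lets the kernel fuse scoring, softmax, and value accumulation without materializing the full score row.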

* Implement tiled and fused MSE score kernels for TurboQuant (10% faster)

- Introduced `_mse_score_tiled_kernel` to optimize MSE score computation by preloading query values into registers and processing tokens in tiles, significantly reducing dispatch overhead.
- Added `_mse_dequant_rotated_kernel` for efficient fused dequantization in rotated space, combining unpacking, codebook lookup, and norm scaling in a single Metal dispatch.
- Enhanced `_fast_mse_dequant_rotated` to utilize the new kernel for improved performance in MSE state dequantization.
- Updated the `_metal_mse_score` function to leverage the tiled kernel for better efficiency in scoring operations.

These changes enhance the performance and efficiency of TurboQuant operations, particularly in handling MSE computations.

* Enhance TurboQuant fused MSE decode kernel for improved performance

- Updated `_fused_mse_decode_kernel` to align with MLX SDPA architecture, optimizing threadgroup and SIMD configurations for better parallelism.
- Refactored source code to streamline processing of key and value tokens, incorporating shared memory for cross-simdgroup reductions.
- Improved online softmax and value accumulation logic, enhancing overall efficiency in the decoding process.

These changes significantly boost the performance of TurboQuant operations, particularly in handling MSE decoding tasks.

* Implement two-pass fused MSE decode kernels for TurboQuant

- Added `_fused_mse_decode_2pass_1_kernel` for block-parallel quantized attention, optimizing key and value processing with pre-rotated queries and shared memory.
- Introduced `_fused_mse_decode_2pass_2_kernel` to reduce partial block results via cross-block online softmax, enhancing efficiency in the decoding process.
- Updated comments for clarity on the new implementations and their alignment with MLX SDPA architecture.

These changes significantly improve the performance and scalability of TurboQuant operations, particularly in handling MSE decoding tasks.

* Enhance quantization process in TurboQuantMSECodec

- Updated the `_quantize_unit` method to improve efficiency by eliminating unnecessary D×D rotations when the estimate is not computed.
- Introduced handling for cases where `self.bits` is zero, returning an appropriately shaped zero array.
- Added detailed docstring to clarify the purpose and functionality of the updated method.

These changes optimize the quantization process, contributing to better performance in TurboQuant operations.

* Add unrolled extraction and scoring functions for TurboQuant

- Introduced `_gen_unrolled_extract`, `_gen_unrolled_score`, and `_gen_unrolled_value` functions to facilitate MLX-style unrolled byte extraction and accumulation for key and value processing.
- Updated the `_fused_mse_decode_kernel` and `_fused_mse_decode_2pass_1_kernel` to utilize the new unrolled functions, enhancing performance and efficiency in the decoding process.
- Improved comments for clarity on the new implementations and their alignment with TurboQuant architecture.

These changes significantly optimize the TurboQuant decoding process, particularly in handling key and value extraction and scoring operations.

* Refactor fused MSE quantization kernel for improved packing and clarity

- Enhanced the `_fused_mse_quantize_kernel` by implementing a more efficient packing mechanism for indices, allowing for per-dimension handling and reducing complexity.
- Updated comments to clarify the new packing logic and its alignment with TurboQuant architecture.
- Adjusted the `quantize` method in `_TurboQuantMSECodec` to reflect support for all bit widths, improving flexibility in quantization operations.

These changes optimize the quantization process, contributing to better performance and clarity in TurboQuant operations.

* Implement fused key-value quantization kernel for TurboQuant

- Introduced `_fused_kv_quantize_kernel` to optimize key and value quantization in a single dispatch, reducing the number of required calls and improving performance.
- Enhanced the `_fused_mse_quantize_kernel` with updated comments for clarity on the new packing logic and its efficiency.
- Adjusted the Metal source code for both kernels to streamline processing and improve parallelism.

These changes significantly enhance the quantization process in TurboQuant, contributing to better performance and reduced dispatch overhead.

* Refactor RHT forward and inverse functions in TurboQuant for improved performance

- Updated the `_rht_forward` and `_rht_inverse` functions to utilize `mx.hadamard_transform`, replacing the previous `_fast_walsh_hadamard` calls for better efficiency.
- Enhanced docstrings to clarify the use of `mx.hadamard_transform` and its impact on performance, particularly for dimensions that are powers of 2.
- Adjusted the handling of padding and scaling to streamline the operations and improve clarity.

These changes optimize the RHT operations in TurboQuant, contributing to enhanced performance and reduced computational overhead.
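A NumPy sketch of the randomized Hadamard transform pair, with a naive `fwht` standing in for `mx.hadamard_transform` (which, as noted above, requires power-of-2 dimensions; the sign vector is the "random" part of RHT):

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a 1-D vector
    whose length is a power of 2 (stand-in for mx.hadamard_transform)."""
    x = np.array(x, dtype=np.float64)
    h, n = 1, x.shape[0]
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def rht_forward(x, signs):
    # Random sign flip, then orthonormal Hadamard: y = H(x*s)/sqrt(n).
    return fwht(x * signs) / np.sqrt(x.shape[0])

def rht_inverse(y, signs):
    # H/sqrt(n) is its own inverse; undo the sign flip afterwards.
    return fwht(y) / np.sqrt(y.shape[0]) * signs
```

Because the normalized Hadamard matrix is orthogonal and the signs are ±1, the round trip is exact, which is what makes the rotation free to invert after dequantization.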

* Add fused non-rotated quantization kernel for TurboQuant

- Introduced `_fused_norot_quantize_kernel` to optimize quantization of pre-rotated vectors without internal rotation, enhancing performance when used with `mx.hadamard_transform`.
- Updated the `quantize` method in `_TurboQuantMSECodec` to utilize the new kernel for efficient single-token decoding, improving handling of packed data.
- Enhanced comments for clarity on the new kernel's functionality and its integration within the TurboQuant architecture.

These changes significantly improve the quantization process, contributing to better performance in TurboQuant operations.

* format

* remove dead code

* format

* Remove unused kernel functions from TurboQuant

- Deleted the `_prod_score_multi_kernel`, `_mse_weighted_rot_multi_kernel`, and `_mse_dequant_rotated_kernel` functions, which were not utilized in the current implementation.
- Cleaned up the codebase by eliminating dead code, enhancing maintainability and readability.

These changes streamline the TurboQuant module by removing unnecessary complexity.

* Fix decoding divergence

- Refactored `_gen_unrolled_extract`, `_gen_unrolled_score`, and `_gen_unrolled_value` to support runtime bit offsets, improving flexibility in byte extraction.
- Updated `_fused_mse_decode_kernel` to include a dimension parameter, ensuring compatibility with varying input sizes and enhancing performance.
- Adjusted test cases to reflect changes in codec usage, switching to the new MSE-only codec for improved speed and quality.

These modifications streamline the TurboQuant module, enhancing its efficiency and adaptability for different scenarios.

* format

* Refactor kernel documentation in TurboQuant

- Updated docstrings for several fused decode and quantization kernels to enhance clarity and conciseness.
- Removed redundant details while retaining essential information about kernel functionality and architecture.
- Improved overall readability of the codebase, making it easier for future developers to understand the purpose and operation of each kernel.

These changes streamline the documentation within the TurboQuant module, facilitating better comprehension and maintenance.

* Enhance README with TurboQuant KV cache quantization details

- Added new command-line options `--kv-bits` and `--kv-quant-scheme` for KV cache quantization configuration.
- Included example usage for running the server with TurboQuant settings.
- Updated documentation on the quantization process for keys and values to improve clarity.

These changes provide users with better guidance on utilizing TurboQuant features effectively.

* Refactor quantization logic in KV cache handling

- Updated the `quantize_entry` function to skip `RotatingKVCache` entries, optimizing the quantization process.
- Changed the handling of lists in `quantize_entry` to modify entries in place for improved performance.
- Added logic to skip quantization for the last layer in `prompt_cache`, addressing sensitivity issues in deep models.

These changes enhance the efficiency and effectiveness of the quantization process within the TurboQuant module.
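The skip rules above amount to a small loop; this sketch uses a stand-in class for the real mlx-lm `RotatingKVCache` and a caller-supplied `quantize_entry`:

```python
# Stand-in for the sliding-window cache class in mlx-lm.
class RotatingKVCache:
    pass

def quantize_prompt_cache(prompt_cache, quantize_entry):
    """Quantize cache entries in place, skipping RotatingKVCache entries
    and the final layer (left full-precision for sensitivity in deep
    models, per the commit above)."""
    for i, entry in enumerate(prompt_cache):
        if isinstance(entry, RotatingKVCache):
            continue  # sliding caches are not quantized
        if i == len(prompt_cache) - 1:
            continue  # last layer stays full-precision
        prompt_cache[i] = quantize_entry(entry)
```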

v0.4.3

Remove TurboQuant benchmark artifacts and add README docs (#894)

* Remove benchmark test files and add TurboQuant README section

- Remove NIAH test data, runner scripts, PPL evaluation, and plot tools
  (these were development/benchmarking artifacts, not production tests)
- Keep test_turboquant.py (unit tests for the codec)
- Add TurboQuant KV Cache section to README with:
  - Quick start CLI and Python examples
  - How it works (rotation + codebook quantization)
  - Performance tables (Qwen3.5-4B, gemma-4-31b)
  - Supported bit widths guide
  - Compatibility notes for different cache types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add all kv_bits options (2, 3, 3.5, 4) to TurboQuant README section

Show CLI examples for each bit width and expand the supported bit widths
table with key/value bit breakdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "Add all kv_bits options (2, 3, 3.5, 4) to TurboQuant README section"

This reverts commit 52569e9.

* Update README.md to reflect performance metrics for TurboQuant 3.5-bit, replacing outdated decode and prefill rates with peak memory usage statistics. This change highlights a significant reduction in KV memory usage and provides clearer insights into TurboQuant's efficiency.

* Update README.md to correct the performance metrics section for TurboQuant 3.5-bit, changing "Active Memory" to "Peak Memory" for clarity and accuracy in reporting memory usage statistics.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

v0.4.2

Fix PaliGemma processor kwarg routing (#877)

* Fix PaliGemma processor kwarg routing (#870)

* Update .gitignore

v0.4.1

Add molmo point (#844)

v0.4.0

Update dependencies and version number (#809)

- Bump mlx-lm version to 0.31.0 in requirements.txt.
- Update package version to 0.4.0 in version.py.

v0.3.12

Add dots-ocr (#749)

* add dots-ocr

* add docs

v0.3.11

Refactor Attention class to calculate n_kv_heads dynamically based on head_dim, improving clarity and maintainability of the code. (#713)

v0.3.10

TFMS v5 RC3 + Fix processor registry (#693)

* Refactor DeepSeek OCR models to standardize processor naming and enhance functionality

- Renamed `DeepseekVLV2Processor` to `DeepseekOCRProcessor` in the original DeepSeek OCR model for consistency.
- Introduced a new `DeepseekOCR2Processor` in the `deepseekocr_2` module, implementing a patch for `AutoProcessor` to correctly handle model loading.
- Updated the `get_input_embeddings` method to improve input handling and ensure compatibility with new processor classes.
- Enhanced model initialization and query handling to support dynamic resolution and improved feature extraction.

* Enhance test_smoke.py for improved memory management and cleanup

- Added garbage collection and synchronization steps to ensure proper memory handling during tests.
- Updated cleanup process to include the configuration object, enhancing resource management after test execution.

* Update dependencies and configuration files

- Removed `mlx-audio` from optional dependencies in `pyproject.toml`.
- Updated `transformers` version from `5.0.0rc1` to `5.0.0rc3` and `mlx-lm` version from `0.30.2` to `0.30.5` in `requirements.txt`.
- Enhanced `uv.lock` with additional resolution markers for Python version compatibility and updated package distribution details.

* Fix device info retrieval in wired_limit function

- Updated the method of accessing the maximum recommended working set size from the MX library, changing from `mx.metal.device_info()` to `mx.device_info()`. This change ensures compatibility with the current MX library structure and improves the accuracy of memory management in the model.

* Refactor version retrieval in test_smoke.py

- Updated the method of retrieving package versions by replacing direct imports with `importlib.metadata.version`. This change enhances compatibility and standardizes version retrieval for MLX, MLX-VLM, and Transformers packages in the test suite.

v0.3.9

Add ministral3 (#611)

* Add to_dict method to BaseModelConfig for improved serialization

* Filter out consolidated model weights when loading .safetensors files in load_model function
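The filter in question is plausibly a glob exclusion along these lines (an illustrative sketch; the real `load_model` weight discovery may differ):

```python
from pathlib import Path

# Skip `consolidated*.safetensors` shards (combined-weight exports) so
# only the split per-shard weight files are loaded.
def weight_files(model_path):
    return sorted(
        p for p in Path(model_path).glob("*.safetensors")
        if not p.name.startswith("consolidated")
    )
```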

* add ministral 3

* Add tests for Mistral3 and Ministral3 models in test_models.py

* bump version

* remove unused