RFC: Integration of KleidiAI 4-Bit MatMul Kernels into PyTorch #137830
Comments
@ezyang could you let us know why the RFC has been moved to torchao?
@jerryzh168 @vkuzo who should answer this?
can you give more context here? personally I feel a separate op might make more sense since the packing etc. are pretty different from the existing op
The existing 1.
can you just add low bit related kernels to torchao? I think even for fp32/fp16/bf16 kernels that are not related to low bits, do these require the same kind of packing? if so, maybe it's better to just add to torchao as well
Thanks for your inputs. We had this discussion with @digantdesai and @malfet, and they asked us to pursue this integration and raise an RFC.
Thanks for the RFC. Here's my high-level thinking, some of which we already discussed asynchronously, so it shouldn't be a surprise. I'm listing some nuances/feedback for the proposed approach in this RFC, as well as for an alternative approach I suggested, which involves housing custom ops in TorchAO. Regarding the proposed approach in this RFC:
Regarding the TorchAO custom ops approach:
I also discussed with @malfet and he has some different opinions; I tried to capture some of them here. And given he has better visibility into the native op developments, I would rely on his judgement on what is the best path forward here.
Unfortunately, even the existing convert/mm op APIs are not really consistent: each backend outputs its tensor in a different form for the prepack op. @malfet this is the thing that you highlighted first and it has stuck with me ever since. We shouldn't really have backend-specific prepacking hiding behind the same aten op, since each is doing a different thing.
Agree with @jerryzh168 here. Also, I think a backend-specific custom op, in this case an Arm-specific custom op, makes more sense. This makes the API clear that these ops either output packed weights (prepack op) or accept packed weights (mm op) that are only interpretable by a specific implementation, e.g. Arm's implementation. Thus a CUDA or x86 impl of the mm op cannot accept weights packed for Arm's implementation. And the best way to make this clear is to have separate custom ops. This also helps clarify in the API that activations are quantized dynamically.
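To make the backend-specific custom op suggestion concrete, here is a minimal sketch using `torch.library.custom_op`. The `arm_kleidiai` namespace, the op names/signatures, and the placeholder bodies are illustrative assumptions, not the actual KleidiAI bindings.

```python
import torch
from torch.library import custom_op

# Hypothetical Arm/KleidiAI-specific ops; namespace, names, and signatures are
# illustrative only.

@custom_op("arm_kleidiai::convert_weight_to_int4pack", mutates_args=(), device_types="cpu")
def convert_weight_to_int4pack(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    # Placeholder packing: a real implementation would interleave int4 weights,
    # scales, and bias into the backend-specific 1-D packed buffer.
    return weight.reshape(-1).to(torch.uint8).clone()

@custom_op("arm_kleidiai::weight_int4pack_mm", mutates_args=(), device_types="cpu")
def weight_int4pack_mm(x: torch.Tensor, packed_weight: torch.Tensor,
                       group_size: int, n: int) -> torch.Tensor:
    # Placeholder: a real implementation would dynamically quantize `x` to INT8 and
    # call the KleidiAI micro-kernels; here we only return a correctly shaped output.
    return x.new_zeros(x.shape[0], n)

# Weights packed by this op pair are only meaningful to this backend:
w = torch.randint(0, 256, (64, 64), dtype=torch.uint8)
packed = torch.ops.arm_kleidiai.convert_weight_to_int4pack(w, 32)
y = torch.ops.arm_kleidiai.weight_int4pack_mm(torch.randn(8, 128), packed, 32, 64)
```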
RFC: Integration of KleidiAI 4-Bit MatMul Kernels into PyTorch
Hardware Platform: aarch64
Motivation
PyTorch Integration PR: #134124
torch/ao Quantization Change: ng-05/ao@cbcf915
Overview of 4-Bit MatMul Operations
The 4-bit matmul process consists of two main operations:
1. Weight packing, which produces a `packed_weight` buffer.
2. Matrix multiplication, which consumes the `packed_weight` buffer.

Current Target Operations
We aim to enhance two existing PyTorch operations:
- `_convert_weight_to_int4pack_cpu`: This operation is responsible for converting weights to a packed format.
- `_weight_int4pack_mm_cpu`: This operation performs the matrix multiplication using the packed weights.
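For reference, the current two-step flow looks roughly like the sketch below. The shapes, dtypes, and `inner_k_tiles` value are illustrative; the exact tensor layout these ops expect has changed across PyTorch versions, so consult the op schema for your build.

```python
import torch

N, K, group_size, inner_k_tiles = 64, 128, 32, 2

# Step 1: prepack (dispatches to _convert_weight_to_int4pack_cpu on CPU).
# Recent builds expect a uint8 [N, K/2] tensor holding two int4 values per byte;
# older builds expected an int32 [N, K] tensor.
weight_int4 = torch.randint(0, 256, (N, K // 2), dtype=torch.uint8)
packed_weight = torch.ops.aten._convert_weight_to_int4pack(weight_int4, inner_k_tiles)

# Step 2: matmul against the packed weights (dispatches to _weight_int4pack_mm_cpu on CPU).
x = torch.randn(8, K, dtype=torch.bfloat16)
scales_and_zeros = torch.ones(K // group_size, N, 2, dtype=torch.bfloat16)
y = torch.ops.aten._weight_int4pack_mm(x, packed_weight, group_size, scales_and_zeros)
print(y.shape)  # torch.Size([8, 64])
```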
Issues Identified

Modification of Operation Signatures:

- `_convert_weight_to_int4pack_cpu`
- `_weight_int4pack_mm_cpu`: the `packed_weight` shape is 1-dimensional and accounts for scales, bias, and weights. We need the row count (N) to perform the matmul correctly.
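To illustrate why N cannot be inferred from the packed buffer, here is a toy size calculation for a hypothetical 1-D layout that interleaves int4 weights, per-group scales, and a bias. The layout is an assumption for illustration only, not KleidiAI's actual packing.

```python
# Hypothetical 1-D packed layout: int4 weights + per-group FP32 scales + FP32 bias.
def packed_buffer_size_bytes(n: int, k: int, group_size: int) -> int:
    weights = n * k // 2                 # two int4 values per byte
    scales = n * (k // group_size) * 4   # one FP32 scale per group per row
    bias = n * 4                         # one FP32 bias per row
    return weights + scales + bias

# Different (N, K) configurations yield buffers whose flat length does not uniquely
# determine N, so the matmul op needs N passed in explicitly.
print(packed_buffer_size_bytes(64, 256, 32))   # 10496
print(packed_buffer_size_bytes(128, 128, 32))  # 10752
```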
Support for Channelwise and Groupwise Quantization:

Data Type Handling in Operations:
The existing `_weight_int4pack_mm_cpu` operator performs multiplication and accumulation in FP32/BF16. In contrast, our kernels dynamically quantize the FP32 input to INT8, use INT8 for the multiplication, and accumulate the results in FP32. This might introduce noticeable (but within an acceptable error range [mean error: 0.0064]) accuracy changes for the same operation across platforms.
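The following is a numeric sketch of that dataflow in plain PyTorch (integer values are held in FP32 tensors for brevity, and accumulation is done in FP32 rather than INT32); it illustrates the quantization scheme described above, not the KleidiAI kernel itself.

```python
import torch

torch.manual_seed(0)
M, K, N, group = 8, 128, 64, 32
x = torch.randn(M, K)
w = torch.randn(N, K)

# FP32 reference, roughly what the existing kernel's dataflow produces.
ref = x @ w.t()

# Dynamic per-row symmetric INT8 quantization of the activations.
x_scale = x.abs().amax(dim=1, keepdim=True) / 127.0          # [M, 1]
x_q = torch.clamp(torch.round(x / x_scale), -127, 127)        # integer-valued

# Groupwise symmetric INT4 quantization of the weights.
w_g = w.view(N, K // group, group)
w_scale = w_g.abs().amax(dim=-1, keepdim=True) / 7.0          # [N, K/group, 1]
w_q = torch.clamp(torch.round(w_g / w_scale), -8, 7)          # integer-valued

# Integer multiply per group, accumulate, then rescale back to FP32.
x_qg = x_q.view(M, K // group, group)
acc = torch.einsum('mgk,ngk->mng', x_qg, w_q)                  # [M, N, K/group]
out = (acc * w_scale.squeeze(-1)).sum(dim=-1) * x_scale        # [M, N]

print((out - ref).abs().mean())  # small mean error vs. the FP32 reference
```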
Initial Approach:

Why KleidiAI in PyTorch?
Proposed Path Forward
To resolve the above issues and successfully integrate our kernels into `_convert_weight_to_int4pack_cpu` and `_weight_int4pack_mm_cpu`, we seek suggestions and collaboration on the following:

E2E Flow
Our torchao quantizer implementation can directly replace the GPTFast block in the E2E flow diagram.
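For context, here is a minimal sketch of what the user-facing E2E flow could look like, assuming a torchao-style `quantize_` entry point with an int8-dynamic-activation/int4-weight config; the actual quantizer from ng-05/ao@cbcf915 may expose a different API, so treat these names and parameters as assumptions.

```python
import torch
import torch.nn as nn
# Assumed torchao entry points; the KleidiAI-specific quantizer may differ.
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64)).eval()

# Swap Linear weights for int4-packed buffers; activations are quantized to INT8
# dynamically at runtime, matching the kernel behaviour described in this RFC.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))

with torch.inference_mode():
    out = model(torch.randn(8, 256))
```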
Existing 4-bit matmul kernel functionality in the `_weight_int4pack_mm_cpu` operation:

KleidiAI 4-bit matmul kernel functionality in the `_weight_int4pack_mm_cpu` operation:

Your feedback and suggestions are highly welcome!
cc: @malfet @digantdesai @jgong5 @sanchitintel @cfRod @milpuz01
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel @msaroufim