Add hqq support for Int4TilePackedTo4dTensor #2912
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2912
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 12aeb58 with merge base 568c193.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
(force-pushed from cf6526f to 4705e33)
Review thread on torchao/quantization/quantize_/workflows/int4/int4_tile_packed_to_4d_tensor.py (outdated; resolved)
(force-pushed from 9c7f7e1 to 953f1f9)
```python
class Int4ChooseQParamsAlgorithm(str, Enum):
    """Variant of quantization algorithm to calculate scale and zero_point

    * tinygemm: the choose qparams algorithm native for tinygemm kernel
    """
```
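For context, here is a minimal self-contained sketch of the enum under discussion. The option names (`TINYGEMM`, `HQQ`) are taken from this PR's summary; the in-tree docstrings are longer, so treat this as illustrative rather than the actual source:

```python
from enum import Enum


class Int4ChooseQParamsAlgorithm(str, Enum):
    """Variant of quantization algorithm used to calculate scale and zero_point.

    Sketch only -- option names come from this PR's summary.
    """

    TINYGEMM = "tinygemm"  # choose-qparams convention native to the tinygemm kernel
    HQQ = "hqq"            # half-quadratic quantization; typically better accuracy
```

Deriving from `str` as well as `Enum` means the members compare equal to their string values, which keeps serialized configs readable.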
Can we describe what this actually does?
Sure, updated with some pseudo code describing the core logic
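As a rough illustration of what such pseudo code might look like: the tinygemm convention uses a *floating-point* zero_point placed at the midpoint of the quant range, rather than an integer zero_point. This is a hedged NumPy sketch of that core logic, not torchao's actual quant primitive (the function name is made up for illustration):

```python
import numpy as np


def choose_qparams_tinygemm_sketch(w, group_size=32, n_bits=4):
    """Sketch of tinygemm-style choose-qparams (illustrative only).

    Per group of `group_size` elements:
      scale      = (max - min) / (2**n_bits - 1)
      zero_point = min + scale * 2**(n_bits - 1)   # float, at the mid-point
    """
    qmax = 2**n_bits - 1
    mid = 2 ** (n_bits - 1)
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = np.maximum((wmax - wmin) / qmax, 1e-8)
    zero_point = wmin + scale * mid
    q = np.clip(np.round((w - wmin) / scale), 0, qmax)
    # dequant convention: (q - mid) * scale + zero_point == q * scale + wmin
    deq = (q - mid) * scale + zero_point
    return q, scale, zero_point, deq
```

The mid-point convention means dequantization only needs one fused multiply-add per element inside the kernel.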
(force-pushed from ad5e93c to b8db855)
See inline comment
```
`packing_format`: the packing format for the int4 tensor, available from version 2 and above
`version`: version of the config to use; a different subset of the above args is valid for version 1 and for version 2; default is 1, see Note for more details

Note:
```
cc @mergennachin I added some more docs here, please let me know if this helps. Only a subset of args will be used for each version right now.
Documentation is good, but why not also add an assertion?
If ignored fields are present, don't you want to throw an exception for the developer?
Okay, talked offline with @jerryzh168.
The current approach seems fine. The increased complexity doesn't seem worth it, since we would have to change the types to Optional.
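To illustrate the trade-off being discussed: rejecting version-2-only args under version 1 requires an "unset" sentinel, which in practice means making those fields Optional. A hypothetical sketch (class name, fields, and defaults are illustrative, not torchao's actual config):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Int4ConfigSketch:
    # Hypothetical config: names and defaults are illustrative only.
    group_size: int = 128
    version: int = 1
    # Must be Optional so "not provided" is distinguishable from a real
    # value -- this is the added complexity mentioned above.
    packing_format: Optional[str] = None

    def __post_init__(self):
        if self.version == 1 and self.packing_format is not None:
            raise ValueError("packing_format is only valid for version >= 2")
```

With plain non-Optional fields there is no way to tell "user passed the default" apart from "user didn't pass anything", so the assertion can't be implemented without this type change.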
Summary:

* Added an `Int4ChooseQparamsAlgorithm` enum that has `TINYGEMM` and `HQQ` options; by default tensors will use the `TINYGEMM` option
* Enabled the `Int4ChooseQparamsAlgorithm.HQQ` option for `Int4TilePackedTo4dTensor`. Instead of calling the tinygemm quant primitive ops to quantize the high precision tensor, the `use_hqq=True` path quantizes with `_choose_qparams_and_quantize_affine_hqq`, which helps improve accuracy for int4 weight-only quantization while still reusing the tinygemm kernel for speedup

Test Plan:

```
python test/quantization/quantize_/workflows/int4/test_int4_tile_packed_to_4d_tensor.py
```

Accuracy test (sanity check) to make sure hqq improves accuracy:

```
sh release.sh --model_id Qwen/Qwen3-8B --quants INT4 --push_to_hub
```

no hqq checkpoint: https://huggingface.co/jerryzh168/Qwen3-8B-INT4-non-hqq
hqq checkpoint: https://huggingface.co/jerryzh168/Qwen3-8B-INT4

```
export MODEL=jerryzh168/Qwen3-8B-INT4-non-hqq
export TASK=mmlu
lm_eval --model hf --model_args pretrained=$MODEL --tasks $TASK --device cuda:0 --batch_size auto
```

| Groups            | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|-------------------|--------:|--------|--------|--------|---|-------:|---|-------:|
| mmlu              |       2 | none   |        | acc    | ↑ | 0.7019 | ± | 0.0036 |
| - humanities      |       2 | none   |        | acc    | ↑ | 0.6036 | ± | 0.0066 |
| - other           |       2 | none   |        | acc    | ↑ | 0.7403 | ± | 0.0076 |
| - social sciences |       2 | none   |        | acc    | ↑ | 0.8083 | ± | 0.0070 |
| - stem            |       2 | none   |        | acc    | ↑ | 0.7069 | ± | 0.0078 |

```
export MODEL=jerryzh168/Qwen3-8B-INT4
lm_eval --model hf --model_args pretrained=$MODEL --tasks $TASK --device cuda:0 --batch_size auto
```

| Groups            | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|-------------------|--------:|--------|--------|--------|---|-------:|---|-------:|
| mmlu              |       2 | none   |        | acc    | ↑ | 0.7040 | ± | 0.0036 |
| - humanities      |       2 | none   |        | acc    | ↑ | 0.5962 | ± | 0.0065 |
| - other           |       2 | none   |        | acc    | ↑ | 0.7470 | ± | 0.0075 |
| - social sciences |       2 | none   |        | acc    | ↑ | 0.8177 | ± | 0.0069 |
| - stem            |       2 | none   |        | acc    | ↑ | 0.7114 | ± | 0.0078 |

hqq improves the accuracy for mmlu slightly.

Reviewers:

Subscribers:

Tasks:

Tags:
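For readers unfamiliar with HQQ: the core idea is to keep the same affine quantization form but refine the zero-point by alternating minimization, with a sparsity-promoting prior on the residual so that outlier errors are absorbed. The following is a heavily simplified NumPy sketch of that idea (l1 soft-thresholding in place of the paper's lp proximal operator; this is NOT torchao's `_choose_qparams_and_quantize_affine_hqq`, and the function name is made up):

```python
import numpy as np


def quantize_hqq_sketch(w, group_size=32, n_bits=4, iters=20, lam=0.05):
    """Simplified HQQ-style int4 quantization (illustrative only)."""
    qmin, qmax = 0, 2**n_bits - 1
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = np.maximum((wmax - wmin) / (qmax - qmin), 1e-8)
    zero = -wmin / scale  # initial plain-affine zero-point

    def quant(z):
        return np.clip(np.round(w / scale + z), qmin, qmax)

    def dequant(q, z):
        return (q - z) * scale

    for _ in range(iters):
        q = quant(zero)
        err = w - dequant(q, zero)
        # l1 prox: soft-threshold the residual into a sparse "outlier" part
        e = np.sign(err) * np.maximum(np.abs(err) - lam, 0.0)
        # refit the zero-point so dequant matches (w - e) on average per group
        zero = np.mean(q - (w - e) / scale, axis=1, keepdims=True)

    q = quant(zero)
    return q, scale, zero, dequant(q, zero)
```

Because only scale/zero_point change while the quantized layout stays the same, the result can still be packed for and consumed by the tinygemm kernel, which is why the PR gets the accuracy benefit without giving up the speedup.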
I'm merging this now to unblock future PRs; please feel free to add more comments if there is anything else that should be fixed.