
Conversation

**@jerryzh168** (Contributor) commented on Aug 29, 2025:

Summary:

  • Added the `Int4ChooseQParamsAlgorithm` enum with `TINYGEMM` and `HQQ` options; by default, tensors use the `TINYGEMM` option.
  • Enabled the `Int4ChooseQParamsAlgorithm.HQQ` option for `Int4TilePackedTo4dTensor`. Instead of calling the tinygemm quant primitive ops to quantize the high-precision tensor, the HQQ path quantizes with `_choose_qparams_and_quantize_affine_hqq`, which helps improve accuracy for int4 weight-only quantization while still reusing the tinygemm kernel for speedup.
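
To make the new option concrete, here is a minimal usage sketch. `quantize_` and `Int4WeightOnlyConfig` are existing torchao APIs; the `int4_choose_qparams_algorithm` argument name is an assumption for illustration based on the enum this PR adds, not confirmed API.

```python
# Hedged sketch: selecting the HQQ choose-qparams path for int4 weight-only
# quantization. The `int4_choose_qparams_algorithm` argument name is an
# assumption based on the enum added in this PR, not confirmed API.
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=128,
        version=2,  # version 2 exposes the new packing/algorithm options
        int4_choose_qparams_algorithm="hqq",  # assumed knob; default is tinygemm
    ),
)
```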

Test Plan:
```
python test/quantization/quantize_/workflows/int4/test_int4_tile_packed_to_4d_tensor.py
```

Accuracy test (sanity check) to make sure hqq improves accuracy:

```
sh release.sh --model_id Qwen/Qwen3-8B --quants INT4 --push_to_hub
```

no hqq checkpoint: https://huggingface.co/jerryzh168/Qwen3-8B-INT4-non-hqq
hqq checkpoint: https://huggingface.co/jerryzh168/Qwen3-8B-INT4

```
export MODEL=jerryzh168/Qwen3-8B-INT4-non-hqq
export TASK=mmlu
lm_eval --model hf --model_args pretrained=$MODEL --tasks $TASK --device cuda:0 --batch_size auto
```

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7019|±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.6036|±  |0.0066|
| - other          |      2|none  |      |acc   |↑  |0.7403|±  |0.0076|
| - social sciences|      2|none  |      |acc   |↑  |0.8083|±  |0.0070|
| - stem           |      2|none  |      |acc   |↑  |0.7069|±  |0.0078|

```
export MODEL=jerryzh168/Qwen3-8B-INT4
lm_eval --model hf --model_args pretrained=$MODEL --tasks $TASK --device cuda:0 --batch_size auto
```

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7040|±  |0.0036|
| - humanities     |      2|none  |      |acc   |↑  |0.5962|±  |0.0065|
| - other          |      2|none  |      |acc   |↑  |0.7470|±  |0.0075|
| - social sciences|      2|none  |      |acc   |↑  |0.8177|±  |0.0069|
| - stem           |      2|none  |      |acc   |↑  |0.7114|±  |0.0078|

HQQ improves mmlu accuracy slightly (0.7019 → 0.7040 overall).
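
For background, the gain comes from HQQ refining the quantization parameters to minimize weight reconstruction error rather than using raw per-group min/max. Below is a toy sketch of the zero-point refinement, loosely following the HQQ paper's alternating optimization; it is not the actual `_choose_qparams_and_quantize_affine_hqq` implementation, and all constants are illustrative.

```python
import torch

def shrink_lp(x: torch.Tensor, beta: float, p: float = 0.7) -> torch.Tensor:
    # Shrinkage (proximal) operator for the l_p norm with p < 1, as in the HQQ paper.
    return torch.sign(x) * torch.relu(torch.abs(x) - torch.abs(x).pow(p - 1) / beta)

def hqq_refine_zero_point(W, scale, zero, iters=20, beta=10.0, kappa=1.01):
    # W: (n_groups, group_size); scale, zero: (n_groups, 1).
    # Alternate between estimating a sparse reconstruction error and solving
    # the zero point in closed form; beta anneals the shrinkage strength.
    for _ in range(iters):
        Wq = torch.clamp(torch.round(W / scale + zero), 0, 15)  # 4-bit codes
        Wr = (Wq - zero) * scale                  # dequantized reconstruction
        We = shrink_lp(W - Wr, beta)              # sparse error estimate
        zero = torch.mean(Wq - (W - We) / scale, dim=1, keepdim=True)
        beta *= kappa
    return zero
```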


**@pytorch-bot** (bot) commented on Aug 29, 2025:
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2912

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 12aeb58 with merge base 568c193:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the `CLA Signed` label on Aug 29, 2025.
@jerryzh168 added the `topic: new feature` label on Aug 29, 2025.
@jerryzh168 force-pushed the add-hqq branch 2 times, most recently from cf6526f to 4705e33 on September 2, 2025.
@jerryzh168 force-pushed the add-hqq branch 2 times, most recently from 9c7f7e1 to 953f1f9 on September 4, 2025.

```python
class Int4ChooseQParamsAlgorithm(str, Enum):
    """Variant of quantization algorithm to calculate scale and zero_point
    * tinygemm: the choose qparams algorithm native to the tinygemm kernel
    * hqq: the HQQ algorithm, which refines qparams to reduce reconstruction error
    """

    # Members reconstructed from the PR summary; the string values are assumed.
    TINYGEMM = "tinygemm"
    HQQ = "hqq"
```
**Contributor:** Can we describe what this actually does?

**@jerryzh168 (Author):**
Sure, updated with some pseudo code describing the core logic
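
For readers of this thread, a minimal sketch of the core tinygemm choose-qparams logic being documented here, assuming the standard per-group asymmetric min/max scheme with a float-domain zero point (a paraphrase, not the exact torchao implementation):

```python
import torch

def tinygemm_choose_qparams(x: torch.Tensor, group_size: int = 128):
    # Per-group asymmetric min/max quantization parameters, tinygemm-style:
    # the zero point stays in the floating-point domain.
    groups = x.reshape(-1, group_size)
    min_val = groups.amin(dim=1, keepdim=True)
    max_val = groups.amax(dim=1, keepdim=True)
    # 4-bit asymmetric quantization has 2**4 - 1 = 15 steps between min and max.
    scale = (max_val - min_val).clamp(min=1e-6) / 15
    # The float zero point is centered at the quantized mid-point, 2**(4 - 1) == 8.
    zero_point = min_val + scale * 8
    return scale, zero_point
```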

@jerryzh168 force-pushed the add-hqq branch 2 times, most recently from ad5e93c to b8db855 on September 4, 2025.
**@mergennachin** left a comment: See inline comment.


```
`packing_format`: the packing format for the int4 tensor, available from version 2 and above
`version`: the config version to use; only a subset of the above args is valid for version 1,
    and a subset is valid for version 2. Default is 1; see the Note for more details.
Note:
```
**@jerryzh168 (Author):** cc @mergennachin I added some more docs here, please let me know if this helps; only a subset of the args is used for each version right now.

**Reviewer:** Documentation is good, but why not also add an assertion?

If ignored fields are present, don't you want to raise an exception for the developer?

**Reviewer:** Okay, talked offline with @jerryzh168.

The current approach seems fine; the added complexity isn't worth it, since we would have to change the type to `Optional`.
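
To make the version note above concrete, a hedged sketch of the two config shapes; argument names other than `packing_format` and `version`, and the packing-format string value, are illustrative assumptions:

```python
from torchao.quantization import Int4WeightOnlyConfig

# version 1 (the default): the original config surface; version-2-only args
# such as `packing_format` are simply ignored rather than asserted on.
config_v1 = Int4WeightOnlyConfig(group_size=128)

# version 2: `packing_format` becomes meaningful.
config_v2 = Int4WeightOnlyConfig(
    group_size=128,
    packing_format="tile_packed_to_4d",  # assumed string value for illustration
    version=2,
)
```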

**@jerryzh168 (Author):** I'm merging this now to unblock future PRs; please feel free to add more comments if there is anything else that should be fixed.

@jerryzh168 merged commit 2dacd7f into pytorch:main on Sep 5, 2025; 18 checks passed.