Add support for fbgemm fp8 kernels by jerryzh168 · Pull Request #2276 · pytorch/ao

jerryzh168 · 2025-05-30T03:26:52Z

Summary:
Adds support for fbgemm fp8 kernels. For now, only fp8 per-row quantized weights with fp8 dynamic per-row activation quantization are supported.
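To make "per-row" concrete, here is a minimal pure-Python sketch of per-row scaling. It is not the fbgemm kernel or the torchao implementation: the helper names are hypothetical, the constant 448.0 assumes the float8 e4m3fn format, and the final rounding of each value to an fp8 number is deliberately omitted. "Dynamic" means the same scale computation is applied to the activations at run time, per input, rather than once ahead of time as for the weights.

```python
# Illustrative sketch only: per-row scaling as used in fp8 quantization.
# Assumption: the target format is float8 e4m3fn, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def per_row_scales(matrix, eps=1e-12):
    """Return one scale per row so that row / scale fits in [-448, 448]."""
    return [max(max(abs(x) for x in row), eps) / FP8_E4M3_MAX for row in matrix]


def scale_rows(matrix, scales):
    """Divide each row by its scale and clamp to the fp8 range.

    A real kernel would also round every value to the nearest representable
    fp8 number; that step is left out of this sketch.
    """
    return [
        [min(max(x / s, -FP8_E4M3_MAX), FP8_E4M3_MAX) for x in row]
        for row, s in zip(matrix, scales)
    ]


def unscale_rows(scaled, scales):
    """Multiply each row by its scale to return to the original range."""
    return [[q * s for q in row] for row, s in zip(scaled, scales)]


# Two rows with very different magnitudes each get their own scale.
weights = [[0.5, -2.0, 1.0], [100.0, 25.0, -50.0]]
scales = per_row_scales(weights)
scaled = scale_rows(weights, scales)
restored = unscale_rows(scaled, scales)
```

Because each row has its own scale, a row of small values uses the full fp8 range instead of being squeezed by a large value elsewhere in the matrix.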

| Configuration - batch size | Overall tokens/sec | TTFT (s) | Peak Memory | Model Size |
|---|---|---|---|---|
| baseline - 1 | 131.65 | 0.0220 | 16.24 GB | 15.01 GB |
| baseline - 128 | 76.38 | 0.0544 | 26.92 GB | 15.01 GB |
| float8dq-row - 1 (no compile) | 95.95 | 0.0525 | 9.01 GB | 7.51 GB |
| float8dq-row - 128 (no compile) | 94.29 | 0.0655 | 19.90 GB | 7.51 GB |
| float8dq-row - 1 (compile) | 138.60 | 0.0599 | 9.01 GB | 7.51 GB |
| float8dq-row - 128 (compile) | 77.25 | 0.0658 | 19.90 GB | 7.51 GB |
| fbgemm-fp8 - 1 (no compile) | 37.02 | 0.0486 | 16.76 GB | 7.51 GB |
| fbgemm-fp8 - 128 (no compile) | 11.70 | 0.0768 | 21.92 GB | 7.51 GB |
| fbgemm-fp8 - 1 (compile) | 176.34 | 0.0251 | 9.38 GB | 7.51 GB |
| fbgemm-fp8 - 128 (compile) | 99.12 | 0.0516 | 19.97 GB | 7.51 GB |
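For a quick read of the table, the throughput of each configuration can be divided by the baseline at the same batch size. The snippet below is a small sketch that does this with the numbers copied from the table; the variable and function names are made up for illustration.

```python
# Overall tokens/sec copied from the table above.
baseline = {1: 131.65, 128: 76.38}

# Keys are (configuration, batch size, compiled?).
results = {
    ("float8dq-row", 1, False): 95.95,
    ("float8dq-row", 128, False): 94.29,
    ("float8dq-row", 1, True): 138.60,
    ("float8dq-row", 128, True): 77.25,
    ("fbgemm-fp8", 1, False): 37.02,
    ("fbgemm-fp8", 128, False): 11.70,
    ("fbgemm-fp8", 1, True): 176.34,
    ("fbgemm-fp8", 128, True): 99.12,
}


def speedup(tokens_per_sec, batch_size):
    """Throughput relative to the baseline at the same batch size."""
    return tokens_per_sec / baseline[batch_size]


speedups = {key: speedup(tps, key[1]) for key, tps in results.items()}

for (name, batch_size, compiled), ratio in speedups.items():
    mode = "compile" if compiled else "no compile"
    print(f"{name} - {batch_size} ({mode}): {ratio:.2f}x baseline")
```

With compile, fbgemm-fp8 comes out at roughly 1.34x the baseline at batch size 1 and roughly 1.30x at batch size 128; without compile it is well below the baseline at both batch sizes.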

Test Plan:
python test/dtypes/test_fbgemm_fp8.py

In the torchao/_models/llama folder:

export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder 
export MODEL_REPO=meta-llama/Meta-Llama-3.1-8B-Instruct

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --quantization fbgemm-fp8 --batch_size 1

compile doesn't work yet

fbgemm-fp8, batch size 1 (no compile)

Average overall tokens/sec: 37.02
Average decode tokens/sec: 37.4586 s
Average TTFT: 0.0486 s
Average tokens/sec: 37.02
Average Bandwidth: 278.05 GB/s
Peak Memory Usage: 16.76 GB
Model Size: 7.51 GB

fbgemm-fp8, batch size 128 (no compile)

Average overall tokens/sec: 11.70
Average decode tokens/sec: 11.7677 s
Average TTFT: 0.0768 s
Average tokens/sec: 11.70
Average tokens/sec including batches 1498.09
Average Bandwidth: 87.91 GB/s
Peak Memory Usage: 21.92 GB
Model Size: 7.51 GB

float8dq-row batch size 1, no compile
Average overall tokens/sec: 95.95
Average decode tokens/sec: 99.0108 s
Average TTFT: 0.0525 s
Average tokens/sec: 95.95
Average Bandwidth: 720.68 GB/s
Peak Memory Usage: 9.01 GB
Model Size: 7.51 GB

float8dq-row batch size 128, no compile
Average overall tokens/sec: 94.29
Average decode tokens/sec: 97.3500 s
Average TTFT: 0.0655 s
Average tokens/sec: 94.29
Average tokens/sec including batches 12069.68
Average Bandwidth: 708.26 GB/s
Peak Memory Usage: 19.90 GB
Model Size: 7.51 GB
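The logged figures appear to be related by simple arithmetic: the reported bandwidth is close to model size times tokens/sec, and "tokens/sec including batches" is close to tokens/sec times batch size. This is an observation from the numbers above, not a statement of how generate.py computes them. The sketch below checks both relations; all names are hypothetical.

```python
# Values copied from the benchmark logs above.
MODEL_SIZE_GB = 7.51

runs = [
    {"name": "fbgemm-fp8", "batch_size": 1, "tok_s": 37.02,
     "bandwidth_gb_s": 278.05, "batched_tok_s": None},
    {"name": "fbgemm-fp8", "batch_size": 128, "tok_s": 11.70,
     "bandwidth_gb_s": 87.91, "batched_tok_s": 1498.09},
    {"name": "float8dq-row", "batch_size": 1, "tok_s": 95.95,
     "bandwidth_gb_s": 720.68, "batched_tok_s": None},
    {"name": "float8dq-row", "batch_size": 128, "tok_s": 94.29,
     "bandwidth_gb_s": 708.26, "batched_tok_s": 12069.68},
]


def estimated_bandwidth(tok_s, model_size_gb=MODEL_SIZE_GB):
    """Assumed relation: every generated token reads the whole model once."""
    return tok_s * model_size_gb


def estimated_batched_tok_s(tok_s, batch_size):
    """Assumed relation: aggregate throughput across all sequences in a batch."""
    return tok_s * batch_size


def relative_error(estimate, logged):
    return abs(estimate - logged) / logged


bandwidth_errors = [
    relative_error(estimated_bandwidth(r["tok_s"]), r["bandwidth_gb_s"])
    for r in runs
]
batched_errors = [
    relative_error(estimated_batched_tok_s(r["tok_s"], r["batch_size"]),
                   r["batched_tok_s"])
    for r in runs
    if r["batched_tok_s"] is not None
]

for r, err in zip(runs, bandwidth_errors):
    print(f"{r['name']} bs={r['batch_size']}: bandwidth estimate off by {err:.3%}")
```

The small residual differences are consistent with the logs rounding tokens/sec to two decimals before printing.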

Reviewers:

Subscribers:

Tasks:

Tags:

pytorch-bot · 2025-05-30T03:26:57Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2276

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ff66c46 with merge base d963a88:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


facebook-github-bot added the CLA Signed label May 30, 2025

jerryzh168 requested review from HDCharles and drisspg May 30, 2025 03:27

jerryzh168 force-pushed the fbgemm-fp8 branch 2 times, most recently from f197ab8 to 678e836 Compare May 30, 2025 03:29

jerryzh168 requested review from jiawenliu64 and jwfromm May 30, 2025 03:29

jerryzh168 added the topic: new feature label May 30, 2025

jerryzh168 force-pushed the fbgemm-fp8 branch from 678e836 to 33c4b27 Compare May 31, 2025 00:56

jerryzh168 changed the title ~~Add support for fbgemm fp8 kernels~~ [WIP] Add support for fbgemm fp8 kernels May 31, 2025

jerryzh168 force-pushed the fbgemm-fp8 branch from 33c4b27 to ff66c46 Compare June 3, 2025 05:21

jerryzh168 changed the title ~~[WIP] Add support for fbgemm fp8 kernels~~ Add support for fbgemm fp8 kernels Jun 4, 2025

drisspg approved these changes Jun 4, 2025


jerryzh168 merged commit 35ffb26 into pytorch:main Jun 5, 2025
19 checks passed

jerryzh168 mentioned this pull request Jul 8, 2025

Tutorial for benchmarking #2499

Merged
