[BUG] Tensor Parallel async_chunk=4 mismatch async_chunk=1 result when sequence length longer than 16K #174

@Achazwl

Is there an existing issue for this?

  • I have searched the existing issues

Description of the Bug

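The tensor-parallel (TP) linear layer with async_chunk=4 produces a result that does not match the async_chunk=1 result when the sequence length is longer than 16K. The two results match when the sequence length is <= 8K.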

Environment Information

- GCC version: 7.5.0
- Torch version: 1.13.1
- Linux system version: Ubuntu 18.04.6 LTS
- CUDA version: 11.6
- Torch's CUDA version (as per `torch.version.cuda`): 11.6

To Reproduce

Setting `CUDA_LAUNCH_BLOCKING=1`, which forces CUDA kernel launches to run synchronously, makes the mismatch go away.
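For anyone trying the workaround above from a launcher script, a minimal sketch follows. The helper names `blocking_env` and `run_blocking` are hypothetical, not part of any library. The only fact relied on is that `CUDA_LAUNCH_BLOCKING` must be present in the environment before the process initializes CUDA, which is why it is set on the child process rather than after startup.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    `base` defaults to the current process environment; it is never mutated.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run `cmd` (a list of strings) with synchronous CUDA launches.

    Returns the exit code of the child process.
    """
    return subprocess.run(cmd, env=blocking_env()).returncode
```

A call such as `run_blocking(["python", "train.py"])` (with `train.py` standing in for the actual script) would then run the job with the variable set. Note that this serializes kernel launches and slows the run, so it is a diagnostic aid rather than a fix.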

Expected Behavior

The async_chunk=4 result should match the async_chunk=1 result at every sequence length.
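The expected behavior can be checked with a small comparison helper, sketched below. It is pure Python over flat sequences of floats so that it runs without a GPU; real outputs would first be flattened into lists. The function name `outputs_match` and the default tolerances are assumptions for illustration and are not taken from the library.

```python
import math


def outputs_match(a, b, rtol=1e-3, atol=1e-5):
    """Compare two equally sized flat sequences of floats.

    Returns (ok, max_abs_diff). An element passes when
    |a - b| <= atol + rtol * |b|, the usual allclose-style rule.
    Any NaN difference counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("outputs have different sizes")
    ok = True
    max_diff = 0.0
    for x, y in zip(a, b):
        diff = abs(x - y)
        if math.isnan(diff):
            return False, math.nan
        max_diff = max(max_diff, diff)
        if diff > atol + rtol * abs(y):
            ok = False
    return ok, max_diff
```

Reporting the maximum absolute difference alongside the pass/fail flag helps tell rounding-level noise apart from the large, non-deterministic deviations that a missing synchronization would typically produce.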

Screenshots

No response

Additional Information

No response

Confirmation

  • I have reviewed and verified all the information provided in this report.

Labels

bug (Something isn't working)
