Skip to content

Conversation

kwen2501
Copy link
Contributor

@kwen2501 kwen2501 commented Jan 30, 2024

Copy link

pytorch-bot bot commented Jan 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118674

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit db92422 with merge base 9afd539 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jan 30, 2024
@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 30, 2024
@kwen2501 kwen2501 marked this pull request as ready for review January 30, 2024 20:56
@wconstab
Copy link
Contributor

this SGTM! a couple thoughts on landing. Should we first land a change that just starts asserting the condition we want to achieve (assert that all those loops/vectors are size 1) and ensure no breakages with minimal code change effort? Then the remaining PR would literally be dead code removal. (and could even be landed in a few smaller chunks if its easier to review)

@kwen2501
Copy link
Contributor Author

Thanks, I think that's a good idea. But I also just realized that a lot of our tests were written in pg.all_reduce([tensors]) style, instead of dist.all_reduce(tensor) (the documented way). So, even for assert == 1 to become green, there needs quite some efforts to refactor our tests first.

@wconstab
Copy link
Contributor

is this actually a deprecation and a bc cycle kind of thing? (and are you thinking its not worth doing the normal BC process bc we don't think any users are actually using this approach?)

@kwen2501
Copy link
Contributor Author

We actually followed the deprecation process (2 release cycles at least), and recently removed <coll>_multigpu APIs from the distributed package. So if the distributed package is used per documentation, the assert == 1 would be true.

The problem is in some of our tests, which used unsupported/documented style.

@wconstab
Copy link
Contributor

Ok. So more work...

@wanchaol
Copy link
Collaborator

This is awesome! I think we should probably try to import this and see if internal tests are happy given this is a large refactor?

@kwen2501 kwen2501 added the module: bc-breaking Related to a BC-breaking change label Feb 2, 2024
@kwen2501 kwen2501 changed the title [WIP][c10d] Simplify ProcessGroupNCCL into single-device style [c10d] Simplify ProcessGroupNCCL into single-device style Feb 2, 2024
@kwen2501
Copy link
Contributor Author

kwen2501 commented Feb 3, 2024

@wconstab @wanchaol breaking this into smaller PRs as you suggested:
First one: #119099

smalltalkman pushed a commit to smalltalkman/pytorch that referenced this pull request Feb 6, 2024
Breaking pytorch#118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: pytorch#119099
Approved by: https://github.com/wconstab
pytorchmergebot pushed a commit that referenced this pull request Feb 7, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
pytorchmergebot pushed a commit that referenced this pull request Feb 9, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
@kwen2501
Copy link
Contributor Author

kwen2501 commented Feb 9, 2024

Closing via #119099 and #119421

@kwen2501 kwen2501 closed this Feb 9, 2024
pytorchmergebot pushed a commit that referenced this pull request Feb 12, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
@github-actions github-actions bot deleted the single_dev_pg branch March 11, 2024 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: bc-breaking Related to a BC-breaking change oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[c10d] Refactor ProcessGroupNCCL to use single device
3 participants