update _unsafe_set_version_counter to accept lists of tensors #137921

bdhirsh · 2024-10-14T17:57:14Z

See the comment here (cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec). This PR updates _unsafe_set_version_counter to accept a list of tensors. It is meant for overhead-sensitive users (e.g. distributed) who need to hide version counter (VC) bumps from autograd on a large list of tensors without paying the python->C++ call overhead separately for every tensor in the list.

I left the binding in pybind and used a std::vector. If we really need to reduce overhead even further, we could write a manual CPython binding.

I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all all_gather_buffer tensors in its call to split_with_sizes_copy.out(all_gather_buffers).
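To make the overhead argument concrete, here is a minimal pure-Python sketch. It does not call PyTorch: `ToyTensor` and `ToyBinding` are hypothetical stand-ins invented for illustration, and the `calls` counter only simulates the number of python->C++ crossings. It assumes, as the description says, that the batched API takes a list of tensors together with a matching list of versions.

```python
from typing import Sequence


class ToyTensor:
    """Hypothetical stand-in for a tensor; models only the version counter."""

    def __init__(self) -> None:
        self._version = 0

    def mutate_(self) -> "ToyTensor":
        # Every in-place mutation bumps the version counter.
        self._version += 1
        return self


class ToyBinding:
    """Hypothetical binding layer that counts simulated python->C++ calls."""

    def __init__(self) -> None:
        self.calls = 0

    def set_version(
        self, tensors: Sequence[ToyTensor], versions: Sequence[int]
    ) -> None:
        if len(tensors) != len(versions):
            raise ValueError("tensors and versions must have the same length")
        self.calls += 1  # one crossing, however many tensors are passed
        for tensor, version in zip(tensors, versions):
            tensor._version = version


buffers = [ToyTensor().mutate_() for _ in range(4)]

# Old style: one call per tensor, so N crossings for N tensors.
per_tensor = ToyBinding()
for buf in buffers:
    per_tensor.set_version([buf], [buf._version - 1])

# Mutate again, then hide the bumps with a single batched call.
for buf in buffers:
    buf.mutate_()
batched = ToyBinding()
batched.set_version(buffers, [buf._version - 1 for buf in buffers])
```

In this toy model the per-tensor loop crosses the boundary once per buffer, while the batched call crosses it once in total; the real saving depends on the actual binding cost, which this sketch does not measure.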

Stack from ghstack (oldest at bottom):

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

[ghstack-poisoned]

pytorch-bot · 2024-10-14T17:57:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137921

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 1 Cancelled Job, 10 Unrelated Failures

As of commit ed893f9 with merge base 932ae13:

NEW FAILURES - The following jobs have failed:

pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 1, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-focal-py3.11-clang10 / test (default, 1, 4, lf.linux.4xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-focal-py3.12-clang10 / test (default, 1, 4, lf.linux.4xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-focal-py3.12-clang10-experimental-split-build / test (default, 1, 3, linux.4xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-focal-py3.9-clang10 / test (default, 1, 4, lf.linux.4xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-jammy-py3.10-clang15-asan / test (default, 1, 6, lf.linux.4xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2
pull / linux-jammy-py3.9-gcc11 / test (default, 1, 4, lf.linux.2xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_unsafe_set_version_counter2

CANCELLED JOB - The following job was cancelled. Please retry:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh)

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 3, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-py3.11-clang10 / test (crossref, 2, 2, lf.linux.2xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-py3.11-clang10 / test (default, 3, 4, lf.linux.4xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-py3.12-clang10 / test (default, 3, 4, lf.linux.4xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-py3.12-clang10-experimental-split-build / test (default, 3, 3, linux.4xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-py3.9-clang10 / test (crossref, 2, 2, lf.linux.2xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-focal-py3.9-clang10 / test (default, 3, 4, lf.linux.4xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-jammy-py3.10-clang15-asan / test (default, 3, 6, lf.linux.4xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names
pull / linux-jammy-py3.9-gcc11 / test (default, 3, 4, lf.linux.2xlarge) (gh) (trunk failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names

This comment was automatically generated by Dr. CI and updates every 15 minutes.


albanD

SGTM, only nits

albanD · 2024-10-14T18:51:28Z

torch/autograd/grad_mode.py

-    def __init__(self, tensor: torch.Tensor) -> None:
-        self.tensor = tensor
-        self.prev_version = tensor._version
+    def __init__(self, tensors: Union[torch.Tensor, List[torch.Tensor]]) -> None:


Suggested change

-    def __init__(self, tensors: Union[torch.Tensor, List[torch.Tensor]]) -> None:
+    def __init__(self, tensors: Union[torch.Tensor, Tuple[torch.Tensor, ...]]) -> None:

You use both in your tests :p I would settle on tuple unless we often need the flexibility of a list.

fair, will do

albanD · 2024-10-14T18:53:06Z

torch/distributed/_composable/fsdp/_fsdp_collectives.py

-            )
+        # decrement version counters of all outputs at once
+        old_versions = [x._version - 1 for x in fsdp_param.all_gather_outputs]
+        torch._C._autograd._unsafe_set_version_counter(


nit: Use the preserve API here to get proper error handling?

yeah agreed, will change (I remember @awgu saying he was worried about the overhead of the context manager, but we will now be doing a single call through the context manager, vs before we were doing N individual calls to pybind, so this should net out to being faster either way?)

I feel like I saw another PR from someone else doing that change. @awgu am I dreaming here? :p

I used it in https://github.com/pytorch/pytorch/pull/137496/files#diff-1db25273fb600ec451c57e5a26c98898152448157d593b7f5bc7a35b2bf1b22eR314-R316
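The "preserve API" suggested in this thread can be pictured as a context manager that records versions on entry and restores them on exit, including when the body raises. The following is a pure-Python sketch under that assumption, not PyTorch's implementation: `ToyTensor` and `preserve_versions` are hypothetical names, and the only detail taken from the review is the `Union[tensor, Tuple[tensor, ...]]` argument shape.

```python
from typing import Tuple, Union


class ToyTensor:
    """Hypothetical stand-in for a tensor; models only the version counter."""

    def __init__(self) -> None:
        self._version = 0

    def mutate_(self) -> "ToyTensor":
        # Every in-place mutation bumps the version counter.
        self._version += 1
        return self


class preserve_versions:
    """Hypothetical context manager: restore version counters on exit."""

    def __init__(self, tensors: Union[ToyTensor, Tuple[ToyTensor, ...]]) -> None:
        # Normalize a single tensor to a one-element tuple.
        self.tensors = (tensors,) if isinstance(tensors, ToyTensor) else tuple(tensors)
        self.prev_versions = tuple(t._version for t in self.tensors)

    def __enter__(self) -> "preserve_versions":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions, so the bumps stay hidden.
        for tensor, version in zip(self.tensors, self.prev_versions):
            tensor._version = version
        return False  # do not swallow exceptions


outputs = (ToyTensor(), ToyTensor())
with preserve_versions(outputs):
    for out in outputs:
        out.mutate_()

single = ToyTensor()
try:
    with preserve_versions(single):
        single.mutate_()
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass
```

Compared with computing `x._version - 1` by hand, this shape does not depend on knowing how many bumps happened inside the block, and the restore step is not skipped when the body fails.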



bdhirsh · 2024-10-16T20:29:33Z

@pytorchbot merge

pytorchmergebot · 2024-10-16T20:31:19Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-16T20:56:54Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-focal-py3.12-clang10 / test (default, 3, 4, linux.4xlarge)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

update _unsafe_set_version_counter to accept lists of tensors

4a7e0c7

[ghstack-poisoned]

bdhirsh requested review from albanD and soulitzer as code owners October 14, 2024 17:57

bdhirsh mentioned this pull request Oct 14, 2024

log ViewAndMutationMeta to trace_structured #133784

Closed

pytorch-bot bot added ciflow/inductor module: dynamo oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Oct 14, 2024

This was referenced Sep 18, 2024

autograd codegen: bump VC properly for mutable ops with no returns #133044

Open

track number of cpp->python exceptions thrown in torch.compile benchmark suite #131481

Open

github-actions bot requested review from antoniojkim, ezyang, miladm and SherlockNoMad October 14, 2024 17:57

awgu approved these changes Oct 14, 2024


awgu mentioned this pull request Oct 14, 2024

[FSDP2 Related]torch.split_with_sizes_copy of the GPU does not update the version counter of out correctly. #132014

Open

albanD approved these changes Oct 14, 2024


bdhirsh added 3 commits October 14, 2024 12:35

pytorch-bot bot added the module: inductor label Oct 16, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 16, 2024

pytorchmergebot added the merging label Oct 16, 2024

pytorchmergebot removed the merging label Oct 16, 2024
