ModuleTracker: Add explicit garbage collection #139214

Open
wants to merge 2 commits into
base: main

pragupta
Contributor

@pragupta pragupta commented Oct 29, 2024

When running an FSDP model with FlopCounterMode, we see a memory leak that comes from the ModuleTracker class. Even though ModuleTracker keeps weak references to the operators, the tensors/operators are not freed after the backward pass. To force these tensors/operators to be freed after the forward pass, I added an explicit garbage collection call in the post-forward hook.
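For readers unfamiliar with why weak references alone may not release memory promptly, here is a minimal pure-Python sketch. It assumes the retained objects are part of a reference cycle, which the description above does not confirm. `FakeTensor`, `make_cycle`, and `post_forward_hook` are hypothetical names for illustration and are not part of ModuleTracker or PyTorch.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a tensor/operator object (not a PyTorch class)."""

    def __init__(self, name):
        self.name = name
        self.self_ref = None


def make_cycle(name):
    """Create an object that refers to itself and return only a weak reference."""
    obj = FakeTensor(name)
    obj.self_ref = obj  # reference cycle: the refcount never drops to zero
    return weakref.ref(obj)


def post_forward_hook():
    """Hypothetical hook body: run an explicit collection pass."""
    return gc.collect()


gc.disable()  # keep the automatic collector from running during the demo
try:
    ref = make_cycle("activation")
    alive_before = ref() is not None  # the cycle keeps the object alive
    post_forward_hook()
    alive_after = ref() is not None   # the cycle collector has freed it
finally:
    gc.enable()

print(alive_before, alive_after)
```

In this sketch the weak reference stays live until `gc.collect()` runs, because reference counting cannot reclaim a cycle on its own. Calling the collector on every forward pass has a runtime cost, so it is a trade-off rather than a free fix.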

Fixes #ISSUE_NUMBER


pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139214

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3cca8cb with merge base f14f245:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jithunnair-amd added the rocm and ciflow/rocm labels Oct 29, 2024
@pruthvistony added the topic: not user facing label Oct 30, 2024
@pruthvistony
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased pg-flop-counter onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout pg-flop-counter && git pull --rebase)

Collaborator

@pruthvistony pruthvistony left a comment

LGTM. Please check the lint error.

When running an FSDP model with FlopCounterMode, we see a memory leak
that comes from the ModuleTracker class. Even though ModuleTracker keeps
weak references to the operators, the tensors/operators are not freed
after the backward pass. To force these tensors/operators to be freed
after the backward pass, I added an explicit garbage collection call in
the post-forward hook.
@pragupta pragupta marked this pull request as ready for review October 30, 2024 16:59
Labels
ciflow/rocm, open source, rocm, topic: not user facing
6 participants