Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

init kineto after torch module initialized #131448

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Conversation

staugust
Copy link

@staugust staugust commented Jul 23, 2024

Fixes #131020

As discussed in the issue thread, we can use KINETO_DAEMON_INIT_DELAY_S to delay the initialization of kineto in case kineto is initialized before libtorch_cuda.so.

It's not clear to set a proper value of environmental variable KINETO_DAEMON_INIT_DELAY_S, here's a trick to make the initialization of kineto after the initialization of module torch. I'm not sure whether this is an acceptable trick, please take a look at this pr, thanks.

Copy link

pytorch-bot bot commented Jul 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131448

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Pending, 12 Unrelated Failures

As of commit 1f56a28 with merge base 1797a20 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link

linux-foundation-easycla bot commented Jul 23, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 23, 2024
@sraikund16
Copy link
Contributor

This seems reasonable to me. @briancoutinho do you see any issue with initializing the client later in the stub?

Also @staugust do you mind adding a test to make sure KINETO_DAEMON_INIT_DELAY_S is no longer necessary?

Copy link
Contributor

@sraikund16 sraikund16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check comments above

@staugust
Copy link
Author

This seems reasonable to me. @briancoutinho do you see any issue with initializing the client later in the stub?

Also @staugust do you mind adding a test to make sure KINETO_DAEMON_INIT_DELAY_S is no longer necessary?

OK,I'd like to add a testcase to check pytorch on-demand profiling to ensure it.

@staugust
Copy link
Author

/easycla

@staugust
Copy link
Author

@aaronenyeshi @briancoutinho @sraikund16 would you like to spend a few minutes to review this pr? Thanks.

@staugust
Copy link
Author

This seems reasonable to me. @briancoutinho do you see any issue with initializing the client later in the stub?
Also @staugust do you mind adding a test to make sure KINETO_DAEMON_INIT_DELAY_S is no longer necessary?

OK,I'd like to add a testcase to check pytorch on-demand profiling to ensure it.

Testcase is added #132375

@staugust
Copy link
Author

staugust commented Aug 29, 2024

@pytorchbot label "topic: not user facing"

Copy link

pytorch-bot bot commented Aug 29, 2024

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string```
Try `@pytorchbot --help` for more info.

@staugust
Copy link
Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Aug 29, 2024
@staugust
Copy link
Author

@aaronenyeshi @briancoutinho @sraikund16 , hello, it's been a while since this PR was last updated. Could you please find some time to review this PR when you're available? Thank you in advance.

@briancoutinho
Copy link
Contributor

@staugust changes look good to me :)

@briancoutinho
Copy link
Contributor

@pytorchmergebot merge

Copy link

pytorch-bot bot commented Oct 17, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@briancoutinho
Copy link
Contributor

@pytorchmergebot merge

Copy link

pytorch-bot bot commented Oct 17, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@sraikund16
Copy link
Contributor

@pytorchbot merge

Copy link

pytorch-bot bot commented Oct 17, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@sraikund16
Copy link
Contributor

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased main onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout main && git pull --rebase)

torch/csrc/stub.c Outdated Show resolved Hide resolved
Copy link
Contributor

@sraikund16 sraikund16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are using extern "C" wrong here. Please use the suggestion in this post: https://stackoverflow.com/a/2403522

…ondemand tracing is not supported on Apple or edge platform
@@ -9,7 +9,11 @@ extern "C"
__attribute__((visibility("default"))) PyObject* PyInit__C(void);
#endif

extern void global_kineto_init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to move this prototype to a header for when stub.c is not compiled with kineto_client_interface

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this prototype can be moved to a header file. By moving this prototype to a header, do you mean some macros should be added to determine whether stub.c is compiled with kineto_client_interface? Here's a simple example:

#ifndef TORCH_KINETO_H
#define TORCH_KINETO_H

#ifndef USE_KINETO 
// stub.c is compiled with `kineto_client_interface`, `torch._C` can find symbol `global_kineto_init`. 
void global_kineto_init();
#else
// stub.c is not compiled with `kineto_client_interface`,  `PyMODINIT_FUNC PyInit__C(void)` calls a empty function. 
inline void global_kineto_init(){}
#endif
#endif // TORCH_KINETO_H

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might need to look at the compile paths but I just see this PR failing a lot of builds if you look below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the failures it is failing because of missing prototype in kineto_client_interface so we need to have a header that this file and stub.c includes to make the builds compile

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was out of office last week. I'd check those builds and try to fix it today.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new header file kineto_client_interface.h, and declare global_kineto_init inside this header file. Then, I moved the call to global_kineto_init into torch/csrc/Module.cpp:initModule instead of stub.c as torch/csrc/Module.cpp has already included header files under torch/csrc/profiler, that is #include <torch/csrc/profiler/combined_traceback.h>.

@staugust
Copy link
Author

I've synced my forked branch with the latest main branch, the two failed linux-focal-cuda11.8-py3.10-gcc9/test distributed cases should be solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
7 participants