-
Notifications
You must be signed in to change notification settings - Fork 22.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
init kineto after torch module initialized #131448
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131448
Note: Links to docs will display an error until the docs builds have been completed. ❌ 3 New Failures, 1 Pending, 12 Unrelated FailuresAs of commit 1f56a28 with merge base 1797a20 (): NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This seems reasonable to me. @briancoutinho do you see any issue with initializing the client later in the stub? Also @staugust do you mind adding a test to make sure KINETO_DAEMON_INIT_DELAY_S is no longer necessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check comments above
OK,I'd like to add a testcase to check pytorch on-demand profiling to ensure it. |
824b2af
to
d45c81e
Compare
/easycla |
@aaronenyeshi @briancoutinho @sraikund16 would you like to spend a few minutes to review this pr? Thanks. |
Testcase is added #132375 |
@pytorchbot label "topic: not user facing" |
❌ 🤖 pytorchbot command failed:
|
@pytorchbot label "topic: not user facing" |
@aaronenyeshi @briancoutinho @sraikund16 , hello, it's been a while since this PR was last updated. Could you please find some time to review this PR when you're available? Thank you in advance. |
@staugust changes look good to me :) |
@pytorchmergebot merge |
Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra. |
@pytorchmergebot merge |
Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra. |
@pytorchbot merge |
Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra. |
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Successfully rebased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are using extern "C" wrong here. Please use the suggestion in this post: https://stackoverflow.com/a/2403522
…ondemand tracing is not supported on Apple or edge platform
torch/csrc/stub.c
Outdated
@@ -9,7 +9,11 @@ extern "C" | |||
__attribute__((visibility("default"))) PyObject* PyInit__C(void); | |||
#endif | |||
|
|||
extern void global_kineto_init(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to move this prototype to a header for when stub.c is not compiled with kineto_client_interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this prototype can be moved to a header file. By moving this prototype to a header, do you mean some macros should be added to determine whether stub.c
is compiled with kineto_client_interface
? Here's a simple example:
#ifndef TORCH_KINETO_H
#define TORCH_KINETO_H
#ifndef USE_KINETO
// stub.c is compiled with `kineto_client_interface`, `torch._C` can find symbol `global_kineto_init`.
void global_kineto_init();
#else
// stub.c is not compiled with `kineto_client_interface`, `PyMODINIT_FUNC PyInit__C(void)` calls a empty function.
inline void global_kineto_init(){}
#endif
#endif // TORCH_KINETO_H
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might need to look at the compile paths but I just see this PR failing a lot of builds if you look below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the failures it is failing because of missing prototype in kineto_client_interface so we need to have a header that this file and stub.c includes to make the builds compile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was out of office last week. I'd check those builds and try to fix it today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a new header file kineto_client_interface.h
, and declare global_kineto_init
inside this header file. Then, I moved the call to global_kineto_init
into torch/csrc/Module.cpp:initModule
instead of stub.c
as torch/csrc/Module.cpp
has already included header files under torch/csrc/profiler
, that is #include <torch/csrc/profiler/combined_traceback.h>
.
From the definition of `TORCH_API` in `c10/macros/Export.h`, we can see that `TORCH_API` is designed to decorate functions in `libtorch.so`.
I've synced my forked branch with the latest main branch, the two failed |
Fixes #131020
As discussed in the issue thread, we can use
KINETO_DAEMON_INIT_DELAY_S
to delay the initialization ofkineto
in casekineto
is initialized beforelibtorch_cuda.so
.It's not clear to set a proper value of environmental variable
KINETO_DAEMON_INIT_DELAY_S
, here's a trick to make the initialization ofkineto
after the initialization of moduletorch
. I'm not sure whether this is an acceptable trick, please take a look at this pr, thanks.