Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINOR Add a thread dump on build timeout #17181

Merged
merged 18 commits into from
Sep 13, 2024

Conversation

mumrah
Copy link
Contributor

@mumrah mumrah commented Sep 12, 2024

To help with debugging build timeouts, this patch adds a thread dump script to run in parallel with the JUnit tests. 5 minutes prior to the build timeout of 3 hours, this script will iteratively run jstack to the Gradle worker processes in order to obtain thread dumps.

@mumrah mumrah added the build Gradle build or GitHub Actions label Sep 12, 2024
@mumrah
Copy link
Contributor Author

mumrah commented Sep 12, 2024

@mjsax I think this can help us debug StreamThreadTest

@mumrah
Copy link
Contributor Author

mumrah commented Sep 12, 2024

From the raw logs of https://github.com/apache/kafka/actions/runs/10835121701?pr=17181

We see the thread dump script start and Gradle start:

2024-09-12T16:46:09.6343038Z Dumping threads in 5 minutes
2024-09-12T16:46:10.6064558Z Starting a Gradle Daemon (subsequent builds will be faster)
2024-09-12T16:46:22.9075240Z 
2024-09-12T16:46:22.9076166Z > Configure project :
2024-09-12T16:46:22.9078028Z Starting build with version 4.0.0-SNAPSHOT (commit id xxxxxxxx) using Gradle 8.10, Java 17 and Scala 2.13.14
2024-09-12T16:46:22.9079901Z Build properties: ignoreFailures=false, maxParallelForks=2, maxScalacThreads=4, maxTestRetries=1
2024-09-12T16:46:28.9054435Z 

5 minutes later, we see the threads dumped

2024-09-12T16:51:14.5056001Z Gradle Test Run :core:test > Gradle Test Executor 9 > AuthorizerIntegrationTest > testOffsetFetchAllTopicPartitionsAuthorization(String) > "testOffsetFetchAllTopicPartitionsAuthorization(String).quorum=zk" PASSED
2024-09-12T16:51:14.5057941Z 
2024-09-12T16:51:14.5059650Z Gradle Test Run :core:test > Gradle Test Executor 9 > AuthorizerIntegrationTest > testOffsetFetchAllTopicPartitionsAuthorization(String) > "testOffsetFetchAllTopicPartitionsAuthorization(String).quorum=kraft" STARTED
2024-09-12T16:51:15.0199756Z Thread Dump for GradleWorkerMain pid 2224
2024-09-12T16:51:15.1075100Z 
2024-09-12T16:51:15.1076319Z 2024-09-12 16:51:15
2024-09-12T16:51:15.1077590Z Full thread dump OpenJDK 64-Bit Server VM (17.0.12+7 mixed mode, sharing):
2024-09-12T16:51:15.1078565Z 
2024-09-12T16:51:15.1080327Z Threads class SMR info:
2024-09-12T16:51:15.1081076Z _java_thread_list=0x00007f35b41089c0, length=16, elements={
2024-09-12T16:51:15.1082118Z 0x00007f365c02c160, 0x00007f365c140d40, 0x00007f365c142130, 0x00007f365c14a340,
2024-09-12T16:51:15.1083426Z 0x00007f365c14b700, 0x00007f365c14cb20, 0x00007f365c14e4e0, 0x00007f365c14fa20,
2024-09-12T16:51:15.1084603Z 0x00007f365c150ea0, 0x00007f365c185930, 0x00007f365c188020, 0x00007f365c8fa100,
2024-09-12T16:51:15.1085898Z 0x00007f365c8fd030, 0x00007f365ca3f8b0, 0x00007f365ca46220, 0x00007f35994fae10
2024-09-12T16:51:15.1087095Z }

and 5 more minutes later, we see the job killed by the timeout utility.

Copy link
Contributor

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for this patch!

for GRADLE_WORKER_PID in `jps | grep GradleWorkerMain | awk -F" " '{print $1}'`;
do
echo "Thread Dump for GradleWorkerMain pid $GRADLE_WORKER_PID";
kill -3 $GRADLE_WORKER_PID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we dump the processes only if they have keyword "kafka"? That can avoid unnecessary dump I'd say

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, the JUnit tests are all run on the GradleWorkerMain java processes. Here's a jps during a local test run

66818 GradleWorkerMain
66884 GradleWorkerMain
18725 GradleDaemon
66631 GradleWorkerMain
66886 FindBugs2
66889 FindBugs2
65804 GradleWrapperMain
66924 Jps
66862 GradleWorkerMain
66798 GradleWorkerMain
66864 GradleWorkerMain
66899 Main
66841 GradleWorkerMain
13976 Main
66811 GradleWorkerMain
65754 GradleDaemon
65917 GradleWorkerMain

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for unclear comment :(

My point was we should print the thread dump only if they have kafka-related stack. The thread dump of this PR have two unrelated processes (GradleWorkerMain).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we do that? grep for "kafka" in the thread dump after the fact? I'm not sure we could figure this out prior to generating the dump

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we do that? grep for "kafka" in the thread dump after the fact? I'm not sure we could figure this out prior to generating the dump

"grep" seems to be a acceptable way. we could call jstack to check it before archiving it.

.github/workflows/build.yml Show resolved Hide resolved
Copy link
Contributor

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for this patch

BTW, could you please add compression-level: 9 to "Archive JUnit reports" task? the archive are plaintext so we should compress them to save cost.

@@ -141,3 +144,11 @@ jobs:
with:
name: build-scan-test-${{ matrix.java }}
path: ~/.gradle/build-scan-data
- name: Archive Thread Dumps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the artifact url generated by this task should be passed to "Parse JUnit tests". Than can enable "Parse JUnit tests" to add url to summary (like junit reports)

@mumrah
Copy link
Contributor Author

mumrah commented Sep 13, 2024

@chia7712 here's a run with the latest changes https://github.com/apache/kafka/actions/runs/10850348751?pr=17181

The "Parse JUnit Tests" step produces the following stdout:

Gradle command timed out. These are partial results!
1026 tests cases run in 41s. 1026 PASSED ✅, 0 FAILED ❌, 0 FLAKY ⚠️ , 0 SKIPPED 🙈, and 0 errors.
Failing this step because the tests timed out. Thread dumps were taken and archived here: https://github.com/apache/kafka/actions/runs/10850348751/artifacts/1930070177
Error: Process completed with exit code 1.

The thread dump archive is also in the job summary https://github.com/apache/kafka/actions/runs/10850348751?pr=17181#summary-30111636589

Copy link
Contributor

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for updated patch. overall LGTM

.github/scripts/junit.py Outdated Show resolved Hide resolved
@mumrah mumrah merged commit 420f69a into apache:trunk Sep 13, 2024
6 of 7 checks passed
@mumrah mumrah deleted the thread-dump-on-timeout branch September 13, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Gradle build or GitHub Actions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants