Fix: Decrease pressure via agent disposal / management #6380
base: master
Conversation
Pull Request Overview
This PR addresses HTTP agent resource leaks that cause timeouts after prolonged operation by implementing agent reuse and proper cleanup. Instead of creating new HTTP/HTTPS agents for every request, the code now creates reusable agents per monitor instance with keepAlive disabled to prevent connection pooling issues.
Key Changes:
- Added reusable HTTP/HTTPS agent instance variables to the Monitor class
- Modified HTTP/HTTPS/Steam/Docker monitor types to reuse agents instead of creating new ones per request
- Implemented agent cleanup in the stop() method to prevent resource leaks
I'm in the process of testing this locally. I'm brand new to the project, so it may take a minute. Feedback is welcome!
So what was the result of the testing?
@CommanderStorm I'm doing this as a side-side-quest for my day job and have never touched this repo before, so it's going to take a little while to build, push, and use the Docker container in our existing setup amongst my other priorities. Hopefully a maintainer or others can weigh in on this PR in the interim.
Just read #275 again. I have the same concern @steveiliop56 raised in #275 (comment). I have also checked the Node.js docs: https://nodejs.org/api/http.html#new-agentoptions
And actually I am not so familiar with this behavior.
And since you mentioned Claude AI: did the idea of reusing the agent come from it? If yes, I think we had better stop here.
Great catch. I was taking a belt & suspenders approach. That one's on me; I hadn't read the docs thoroughly enough. I was coming at this with a .NET mindset of disposing everything. But from those same docs, right after the part you quoted:
So I'm thinking there could be something where enough sockets are kept open longer than they should be, causing an issue. This is a strange situation in that I'm attempting to resolve a performance-related symptom whose root cause is difficult to diagnose, so I'm looking for a non-harmful practice that could also plausibly fix things, and this stuck out.

To be clear, I'm also shifting my thinking to agree that this is less about memory (and perhaps less about sockets) and more about pressure from object accumulation over a long time. I'm not seeing memory pressure on our container at all. My new theory is that enough of these agents are being created that we see socket exhaustion faster than they're cleaned up, or that GC pauses due to object accumulation get longer. And since it's efficient and fine to reuse them, nothing seems to be hurt by this change.

I did work with a colleague to create a small test harness (which I don't think made sense to include in the larger codebase, though let me know if you want it). It just shows that we definitely don't accumulate agents anymore with this PR, but I think that much is obvious from the code change, so it probably doesn't make sense to include.

If you'd like, I can work to deploy this PR into our production environment and let it run for some time. That might be the ultimate proof: putting it in a production scenario and seeing if the symptom is resolved.
I've deployed this to an experimental container under my Docker Hub account and we are now using it in our pre-prod and prod environments. I'll report back on findings; this can stay open in the interim. Based on past experience it takes about 18 hours for the symptom to show up. I'll let this run for a few days, and if symptoms don't show up, I think we'll have a strong signal on this.
Observations from production so far:
Additional observations from production:
So I do think we have a strong signal that this is, at a minimum, helping with the symptom. Going to keep monitoring/reporting for a few more days.
I've still not run into any more 48000ms timeout errors, so I'm thinking this PR might be a fix for the majority of the symptoms. The lack of rolling false positives has been delightful.

I just pushed a change to destroy the HTTP agent whenever the monitor fails. This should alleviate the case in which a "stuck" agent causes an issue (such as my long-running agent causing an issue with the one website that sits behind an Azure Application Gateway and "went bad" after a few days). I'll let that change marinate for a few more days in our production environment before asking for another review here.
Could you run another new Uptime Kuma instance without this patch for a side-by-side comparison? Since you mentioned socket exhaustion, did you see this by using a command like …?
I will try to invite other people to test it. If it is eventually true, I think we should transfer the issue to the Node.js repo too, as their docs have never mentioned this behavior.
You mean so that we can see one instance start failing while the other doesn't? I could look into that, but there's a cost associated in our case. Not a bad idea though 👍 We're coming up on a holiday in the US, but I will look into that right after. Previously, every few days we would begin seeing 48000ms timeouts causing monitors to trip, and they would continue to increase in frequency until I restarted the container in Azure. That hasn't been happening with this deployment.
This was a best guess around the number of HTTP agents being created. My thinking was that socket contention could be causing the issue, but I have no proof of that. All I have at this point is a fix for the symptoms, not certainty about the cause, which is unfortunate.
Not a bad idea, though I'm well out of my depth at that point.
My fix to destroy agents on failure didn't resolve the issue this PR introduced/uncovered around monitors failing and staying failed in certain situations (specifically ones that check a site behind an Azure Application Gateway); I'm still seeing that in production. I'll have to look into it a bit more, and will likely revert that commit and try again. It still resolves the 48000ms timeout errors so far, though! I haven't seen them reappear.
ℹ️ NOTE: This fix was generated in concert with Claude AI, but with me in the driver's seat, starting from my own hunch, working through analysis, and then looking at the possible fixes. I do not have deep background in node/TS, but I have 20 years of experience and I also hate AI slop. 😄 Blunt feedback on the PR is welcome.
📷 Screenshots or Visual Changes
N/A