Conversation

@SeanKilleen commented Nov 19, 2025

ℹ️ NOTE: This fix was generated in concert with Claude AI, but with me in the driver's seat, starting from my own hunch, working through analysis, and then looking at the possible fixes. I do not have deep background in node/TS, but I have 20 years of experience and I also hate AI slop. 😄 Blunt feedback on the PR is welcome.

❗ Important Announcements

⚠️ Please Note: We do not accept all types of pull requests, and we want to ensure we don’t waste your time. Before submitting, make sure you have read our pull request guidelines: Pull Request Rules

🚫 Please Avoid Unnecessary Pinging of Maintainers

We kindly ask you to refrain from pinging maintainers unless absolutely necessary. Pings are for critical/urgent pull requests that require immediate attention.

📋 Overview

🛠️ Type of change

  • 🐛 Bugfix (a non-breaking change that resolves an issue)
  • ✨ New feature (a non-breaking change that adds new functionality)
  • ⚠️ Breaking change (a fix or feature that alters existing functionality in a way that could cause issues)
  • 🎨 User Interface (UI) updates
  • 📄 New Documentation (addition of new documentation)
  • 📄 Documentation Update (modification of existing documentation)
  • 📄 Documentation Update Required (the change requires updates to related documentation)
  • 🔧 Other (please specify):
    • Provide additional details here.

📄 Checklist

  • 🔍 My code adheres to the style guidelines of this project.
  • 🦿 I have indicated where (if any) I used an LLM for the contributions
  • ✅ I ran ESLint and other code linters for modified files.
  • 🛠️ I have reviewed and tested my code.
  • 📝 I have commented my code, especially in hard-to-understand areas (e.g., using JSDoc for methods).
  • ⚠️ My changes generate no new warnings.
  • 🤖 Where my code needed automated testing, I have added tests (this is an optional task).
  • 📄 Documentation updates are included (if applicable).
  • 🔒 I have considered potential security impacts and mitigated risks.
  • 🧰 Dependency updates are listed and explained.
  • 📚 I have read and understood the Pull Request guidelines.

📷 Screenshots or Visual Changes

N/A

Copilot AI review requested due to automatic review settings November 19, 2025 20:35
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses HTTP agent resource leaks that cause timeouts after prolonged operation by implementing agent reuse and proper cleanup. Instead of creating new HTTP/HTTPS agents for every request, the code now creates reusable agents per monitor instance with keepAlive disabled to prevent connection pooling issues.

Key Changes:

  • Added reusable HTTP/HTTPS agent instance variables to the Monitor class
  • Modified HTTP/HTTPS/Steam/Docker monitor types to reuse agents instead of creating new ones per request
  • Implemented agent cleanup in the stop() method to prevent resource leaks

@SeanKilleen
Author

I am in the process of looking to test this locally -- I am brand new to the project so it may take a minute. Feedback is welcome!

@SeanKilleen changed the title from "Resolve HTTP agent leak" to "Fix: Resolve HTTP agent leak via agent disposal / management" on Nov 19, 2025
@CommanderStorm
Collaborator

So what was the result of the testing?

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@CommanderStorm I'm doing this as a side-side-quest for my day job and have never touched this repo before, so it's going to take a little while to build, push, and use the Docker container in our existing setup amongst my other priorities.

Hopefully a maintainer or others can weigh in on this PR in the interim.

@louislam
Owner

louislam commented Nov 20, 2025

Just read #275 again. I have the same concern as @steveiliop56 said. #275 (comment)

I have also checked the Node.js Doc. https://nodejs.org/api/http.html#new-agentoptions

agent.destroy()
It is usually not necessary to do this. However, if using an agent with keepAlive enabled, then it is best to explicitly shut down the agent when it is no longer needed.

keepAlive Default: false.

  1. So, since we have never enabled keepAlive, why is destroy() still needed?
  2. I think you have to prove it, maybe by using a minimal JavaScript script to compare behavior both with and without reusing the http agent object.

Actually, I am not so familiar with http.Agent either. Recreating http.Agent looks more stateless to me. While reusing the same http.Agent, I have no idea whether there would be any state changes inside the agent that could affect upcoming requests.

@louislam
Owner

louislam commented Nov 20, 2025

And since you mentioned Claude AI: did the idea of reusing the agent and calling agent.destroy() come purely from Claude, rather than from your own testing?

If yes, I think we had better stop here.

@louislam added the label "pr:please address review comments" (this PR needs a bit more work to be mergable) on Nov 20, 2025
@SeanKilleen changed the title from "Fix: Resolve HTTP agent leak via agent disposal / management" to "Fix: Decrease pressure via agent disposal / management" on Nov 20, 2025
@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@louislam

So we have never enabled keepAlive, why is destroy() still needed?

Great catch. I was taking a belt & suspenders approach. That one's on me, I hadn't read the docs thoroughly enough. I was coming with a .NET mindset of disposing any IDisposable.

But from those same docs, right after you quoted:

Otherwise, sockets might stay open for quite a long time before the server terminates them.

So I'm thinking there could be something where enough sockets are kept open longer than they should be, causing an issue. This is a strange situation in that I'm attempting to resolve a performance-related symptom where it's difficult to diagnose the problem, so I'm looking for a non-harmful practice that could also possibly fix things, and this stuck out.

To be clear, I'm also shifting my thinking to agree that this is less about memory (and perhaps less about sockets) and more about pressure from object accumulation over a long time. I'm not seeing memory pressure on our container at all.

My new theory is that enough of these are being created that we see socket exhaustion faster than they're cleaned up, or that GC pauses due to object accumulation get longer. And that since it's efficient and fine to re-use them, nothing seems to be hurt by this change.

I did work with a colleague to create a small test harness (which I don't think makes sense to include in the larger codebase, though let me know if you want it). It just shows that with this PR we definitely don't accumulate agents anymore. But that much is probably obvious from the code change, so it likely doesn't need to be included.


If you'd like, I can work to deploy this PR into our production environment and let it run for some time. That might be the ultimate proof -- putting it in a production scenario and seeing if the symptom is resolved.

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

I've deployed this to an experimental container under my dockerhub and we are now using it in our pre-prod and prod environments. I'll report back on findings. This can stay open in the interim.

Based on past experience it takes about 18 hours for the symptom to show up. I'll let this run for a few days, and if symptoms don't show up, I think we'll have a strong signal on this.

@SeanKilleen
Author

Observations from production so far:

  • We've received far fewer of the 48000ms timeout errors, and the ones we did receive took an interesting form. Rather than the continuous rolling issue we previously saw, a one-time issue happened at 12:41am where 12 monitors hit it at once and then recovered 1 minute later. I haven't restarted anything yet; I'm watching to see whether it happens again or was a one-off fluke.
    • The end-user experience is definitely preferable to the current continual rolling false positives.
  • I've noticed a side effect of my change where, in one instance, an agent that was being reused "went bad" (I'm looking to diagnose this further). This agent was monitoring a website behind an Azure App Gateway. It received a 403 and was marked as down. The 403 is surprising, which makes me think something about how the agent initially established the HTTP connection with the App Gateway went stale (possibly related to keepAlive = false). When I paused the monitor and then resumed it, the problem cleared (presumably because the HTTP agent was recreated). So I'll want to look into that as part of this PR, even if we proceed.

@SeanKilleen
Author

Additional observations from production:

  • No other 48000ms timeout errors except for that one small batch that resolved itself yesterday
  • No other issues with monitors "going bad" or being unable to recover

So I do think we have a strong signal that this is, at a minimum, helping with the symptom. Going to keep monitoring / reporting for a few more days.

@SeanKilleen
Author

I've still not run into any more 48000ms timeout errors. I'm thinking this PR might be a fix for the majority of the symptoms. The lack of rolling false positives has been delightful.

I just pushed a change to destroy the HTTP agent whenever the monitor fails. This should alleviate the case in which a "stuck" agent causes an issue (such as my long-running agent causing a problem with the one website behind an Azure App Gateway that "went bad" after a few days).

I'll let that change marinate for a few more days in our production environment before asking for another review here.

@louislam
Owner

Could you run another Uptime Kuma instance without this patch for a side-by-side comparison?

Since you mentioned socket exhaustion, did you observe it using a command like netstat?

I will try to invite other people to test it.

If it turns out to be true, I think we should also raise the issue on the Node.js repo, as their docs never mention this behavior.

@SeanKilleen
Author

Could you run another new Uptime Kuma instance that without this patch for side-by-side comparison?

You mean so that we can see one instance start failing while the other doesn't? I could look into that, but there's an associated cost in our case. Not a bad idea though 👍 We're coming up on a holiday in the US, but I will look into it right after.

Previously, every few days we would begin seeing 48000ms timeouts causing monitors to trip, and they would continue to increase in frequency until I restarted the container in Azure. I haven't seen that happen with this deployment.

Since you mentioned socket exhaustion, did you see this by using command like netstat?

This was a best guess around x number of HTTP agents being created. My thinking was that socket contention could be causing the issue, but I have no proof of that. All I have is a fix for the symptoms at this point, not certainty about the cause, which is unfortunate.

If it is eventually true, I think we should transfer the issue to Node.js' repo too, as their doc has never mentioned this behavior.

Not a bad idea, though I'm well out of my depth at that point.

@SeanKilleen
Author

SeanKilleen commented Nov 29, 2025

My fix to destroy agents on failure didn't resolve the issue this PR introduced/uncovered around monitors failing and staying failed in certain situations (specifically monitors that check a site behind an Azure App Gateway). I'm still seeing that in production. I'll have to look into it a bit more, and will likely revert that commit and try again.

Still resolves the 48000ms timeout errors so far though! Haven't seen them reappear.

Labels

pr:please address review comments (this PR needs a bit more work to be mergable)

Development

Successfully merging this pull request may close these issues.

Lots of 'timeout of 48000ms exceeded' during the day, but the site is still up?

3 participants