Conversation

@SeanKilleen commented Nov 19, 2025

ℹ️ NOTE: This fix was generated in concert with Claude AI, but with me in the driver's seat, starting from my own hunch, working through analysis, and then looking at the possible fixes. I do not have deep background in node/TS, but I have 20 years of experience and I also hate AI slop. 😄 Blunt feedback on the PR is welcome.

❗ Important Announcements

⚠️ Please Note: We do not accept all types of pull requests, and we want to ensure we don’t waste your time. Before submitting, make sure you have read our pull request guidelines: Pull Request Rules

🚫 Please Avoid Unnecessary Pinging of Maintainers

We kindly ask you to refrain from pinging maintainers unless absolutely necessary. Pings are for critical/urgent pull requests that require immediate attention.

📋 Overview

🛠️ Type of change

  • 🐛 Bugfix (a non-breaking change that resolves an issue)
  • ✨ New feature (a non-breaking change that adds new functionality)
  • ⚠️ Breaking change (a fix or feature that alters existing functionality in a way that could cause issues)
  • 🎨 User Interface (UI) updates
  • 📄 New Documentation (addition of new documentation)
  • 📄 Documentation Update (modification of existing documentation)
  • 📄 Documentation Update Required (the change requires updates to related documentation)
  • 🔧 Other (please specify):
    • Provide additional details here.

📄 Checklist

  • 🔍 My code adheres to the style guidelines of this project.
  • 🦿 I have indicated where (if any) I used an LLM for the contributions
  • ✅ I ran ESLint and other code linters for modified files.
  • 🛠️ I have reviewed and tested my code.
  • 📝 I have commented my code, especially in hard-to-understand areas (e.g., using JSDoc for methods).
  • ⚠️ My changes generate no new warnings.
  • 🤖 Where my code needed automated testing, I have added tests (this is an optional task).
  • 📄 Documentation updates are included (if applicable).
  • 🔒 I have considered potential security impacts and mitigated risks.
  • 🧰 Dependency updates are listed and explained.
  • 📚 I have read and understood the Pull Request guidelines.

📷 Screenshots or Visual Changes

N/A

Copilot AI review requested due to automatic review settings November 19, 2025 20:35
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses HTTP agent resource leaks that cause timeouts after prolonged operation by implementing agent reuse and proper cleanup. Instead of creating new HTTP/HTTPS agents for every request, the code now creates reusable agents per monitor instance with keepAlive disabled to prevent connection pooling issues.

Key Changes:

  • Added reusable HTTP/HTTPS agent instance variables to the Monitor class
  • Modified HTTP/HTTPS/Steam/Docker monitor types to reuse agents instead of creating new ones per request
  • Implemented agent cleanup in the stop() method to prevent resource leaks

@SeanKilleen
Author

I am in the process of looking to test this locally -- I am brand new to the project so it may take a minute. Feedback is welcome!

@SeanKilleen changed the title from "Resolve HTTP agent leak" to "Fix: Resolve HTTP agent leak via agent disposal / management" on Nov 19, 2025
@CommanderStorm
Collaborator

So what was the result of the testing?

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@CommanderStorm I'm doing this as a side-side-quest for my day job and have never touched this repo before, so it's going to take a little while to build, push, and use the Docker container in our existing setup amongst my other priorities.

Hopefully a maintainer or others can weigh in on this PR in the interim.

@louislam
Owner

louislam commented Nov 20, 2025

Just read #275 again. I have the same concern as @steveiliop56 said. #275 (comment)

I have also checked the Node.js Doc. https://nodejs.org/api/http.html#new-agentoptions

agent.destroy()
It is usually not necessary to do this. However, if using an agent with keepAlive enabled, then it is best to explicitly shut down the agent when it is no longer needed.

keepAlive Default: false.

  1. So, since we have never enabled keepAlive, why is destroy() still needed?
  2. I think you have to prove it, maybe by using a minimal JavaScript script to compare behavior both with and without reusing the http agent object.

Actually, I am not so familiar with http.Agent either. Recreating http.Agent looks more stateless to me. While reusing the same http.Agent, I have no idea whether there would be any state changes inside the agent that could affect upcoming requests.

@louislam
Owner

louislam commented Nov 20, 2025

And since you mentioned Claude AI: did the idea of reusing the agent and calling agent.destroy() come purely from Claude, rather than from your own testing?

If yes, I think we had better stop here.

@louislam added the label "pr:please address review comments" (this PR needs a bit more work to be mergable) on Nov 20, 2025
@SeanKilleen changed the title from "Fix: Resolve HTTP agent leak via agent disposal / management" to "Fix: Decrease pressure via agent disposal / management" on Nov 20, 2025
@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@louislam

So we have never enabled keepAlive, why is destroy() still needed?

Great catch. I was taking a belt & suspenders approach. That one's on me, I hadn't read the docs thoroughly enough. I was coming with a .NET mindset of disposing any IDisposable.

But from those same docs, right after you quoted:

Otherwise, sockets might stay open for quite a long time before the server terminates them.

So I'm thinking there could be something where enough sockets are kept open longer than they should be, causing an issue. This is a strange situation in that I'm attempting to resolve a performance-related symptom where it's difficult to diagnose the problem, so I'm looking for a non-harmful practice that could also possibly fix things, and this stuck out.

To be clear, I'm also shifting my thinking to agree that this is less about memory (and perhaps less about sockets) and more about pressure from object accumulation over a long time. I'm not seeing memory pressure on our container at all.

My new theory is that enough of these are being created that we see socket exhaustion faster than they're cleaned up, or that GC pauses due to object accumulation get longer. And that since it's efficient and fine to re-use them, nothing seems to be hurt by this change.

I did work with a colleague to create a small test harness (which I don't think makes sense to include in the larger codebase, though let me know if you want it). It just shows that with this PR we definitely don't accumulate agents anymore. But that much is probably obvious from the code change, so it likely doesn't need to be included.


If you'd like, I can work to deploy this PR into our production environment and let it run for some time. That might be the ultimate proof -- putting it in a production scenario and seeing if the symptom is resolved.

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

I've deployed this to an experimental container under my dockerhub and we are now using it in our pre-prod and prod environments. I'll report back on findings. This can stay open in the interim.

Based on past experience it takes about 18 hours for the symptom to show up. I'll let this run for a few days, and if symptoms don't show up, I think we'll have a strong signal on this.

@SeanKilleen
Author

Observations from production so far:

  • We've received far fewer of the 48000ms timeout errors, and the ones we did receive took an interesting form. Rather than the continuous rolling issue we previously saw, a one-time issue happened at 12:41am where 12 monitors hit it at once and then recovered 1 minute later. I haven't restarted anything yet; I'm watching to see whether it happens again or was a one-off fluke.
    • The end-user experience is definitely preferable to the current continual rolling false positives.
  • I've noticed a side effect of my change where, in one instance, an agent that was being reused "went bad" (I'm looking to diagnose this further). This agent was monitoring a website behind an Azure App Gateway. It received a 403 and was marked as down. The 403 is surprising, which makes me think something about how the agent initially established the HTTP connection with the App Gateway went stale (possibly related to keepAlive = false). When I paused the monitor and then resumed it, the problem cleared (presumably because the HTTP agent was recreated). So I'll want to look into that as part of this PR, even if we proceed.

@SeanKilleen
Author

Additional observations from production:

  • No other 48000ms timeout errors except for that one small batch that resolved itself yesterday
  • No other issues with monitors "going bad" or being unable to recover

So I do think we have a strong signal that this is, at a minimum, helping with the symptom. Going to keep monitoring / reporting for a few more days.

@SeanKilleen
Author

I've still not run into any more 48000ms timeout errors. I'm thinking this PR might be a fix for the majority of the symptoms. The lack of rolling false positives has been delightful.

I just pushed a change to destroy the HTTP agent whenever the monitor fails. This should alleviate the case in which a "stuck" agent causes an issue (such as my long-running agent causing a problem with the one website behind an Azure App Gateway that "went bad" after a few days).

I'll let that change marinate for a few more days in our production environment before asking for another review here.

@louislam
Owner

Could you run another Uptime Kuma instance without this patch for a side-by-side comparison?

Since you mentioned socket exhaustion, did you observe it using a command like netstat?

I will try to invite other people to test it.

If it turns out to be true, I think we should also raise the issue on the Node.js repo, as their docs never mention this behavior.

@SeanKilleen
Author

Could you run another new Uptime Kuma instance that without this patch for side-by-side comparison?

You mean so that we can see one instance start failing while the other doesn't? I could look into that, but there's an associated cost in our case. Not a bad idea though 👍 We're coming up on a holiday in the US, but I will look into it right after.

Previously, every few days we would begin seeing 48000ms timeouts causing monitors to trip, and they would continue to increase in frequency until I restarted the container in Azure. I haven't seen that happen with this deployment.

Since you mentioned socket exhaustion, did you see this by using command like netstat?

This was a best guess around x number of HTTP agents being created. My thinking was that socket contention could be causing the issue, but I have no proof of that. All I have is a fix for the symptoms at this point, not certainty about the cause, which is unfortunate.

If it is eventually true, I think we should transfer the issue to Node.js' repo too, as their doc has never mentioned this behavior.

Not a bad idea, though I'm well out of my depth at that point.

@SeanKilleen
Author

SeanKilleen commented Nov 29, 2025

My fix to destroy agents on failure didn't resolve the issue this PR introduced/uncovered around monitors failing and staying failed in certain situations (specifically monitors that check a site behind an Azure App Gateway). I'm still seeing that in production. I'll have to look into it a bit more, and will likely revert that commit and try again.

Still resolves the 48000ms timeout errors so far though! Haven't seen them reappear.

Labels

pr:please address review comments (this PR needs a bit more work to be mergable)

Development

Successfully merging this pull request may close these issues.

Lots of 'timeout of 48000ms exceeded' during the day, but the site is still up?

3 participants