
Conversation

@mwasilew2

This exposes a basic health check server. Internally, the health check handler doesn't do anything complex. The idea is just to check whether the process is responding.

Closes #643


Signed-off-by: Michal Wasilewski <michal@mwasilewski.net>
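
For reference, a minimal sketch of this kind of endpoint (a standalone illustration, not the exact code in the diff; the listen address and the use of the default mux are assumptions):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Answer 200 OK on /health to show that the process can still respond to requests.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// ":8080" is an illustrative address; the PR takes the address from configuration.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
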
@mwasilew2
Author

Just trying to help here. Let me know if this needs tweaking.

@msakrejda
Contributor

Thanks for the contribution, but I don't think this health check would have caught the issue fixed by #637: I believe the health check would have kept responding "okay", but the collector still would have been stuck. Do you see it differently?

@mwasilew2
Author

@msakrejda

> I don't think this health check would have caught the issue fixed by #637: I believe the health check would have kept responding "okay", but the collector still would have been stuck. Do you see it differently?

oh, didn't see #637

No, I agree, this healthcheck would most likely respond with "200 OK" even when hitting the bug fixed in #637.

On the other hand, I don't have the diagnostics necessary to know whether our collector was unresponsive due to #637, so it's hard to tell whether a simple healthcheck wouldn't have helped. My intent was not to come up with a check that covers all potential failure scenarios. What I was trying to achieve with this PR was the simplest possible check that tests whether the process is able to respond, that's it. If it's not able to respond even to a healthcheck, automation (the cloud provider) can try restarting it. If it is able to respond, the operator (me or my colleagues) can jump in and try to capture some diagnostics.

I'm also fine with closing this PR if you feel it doesn't bring any value :)

```go
	Addr: address,
}
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
```

Member

I wonder how hard it'd be to determine whether all configured servers have successful recent full snapshots - still not sure if that'd catch #637, but it would catch e.g. password errors or other problems with collecting the data and (some) errors when sending it.

We could potentially use server.PrevState.LastStatementStatsAt for this purpose? (https://github.com/pganalyze/collector/blob/main/state/state.go#L44)

If we do this, we should also make sure to not report failure at startup (where it can take up to 10 minutes to have the first full snapshot that sets server.PrevState)
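
A hedged sketch of that idea, assuming the health check handler can reach the server list. The `prevState`/`server` types, the thresholds, and the handler shape are stand-ins; only the `PrevState.LastStatementStatsAt` field name comes from the linked state.go:

```go
package healthcheck

import (
	"net/http"
	"time"
)

// Hypothetical stand-ins for the collector's per-server state; only the
// PrevState.LastStatementStatsAt field name is taken from state/state.go.
type prevState struct {
	LastStatementStatsAt time.Time
}

type server struct {
	PrevState prevState
}

// healthHandler reports failure when any configured server lacks a recent full
// snapshot, but stays healthy during a startup grace period, since the first
// full snapshot (which populates PrevState) can take up to 10 minutes.
func healthHandler(servers []*server, startedAt time.Time) http.HandlerFunc {
	const startupGrace = 15 * time.Minute // illustrative threshold
	const maxAge = 15 * time.Minute       // illustrative threshold
	return func(w http.ResponseWriter, r *http.Request) {
		if time.Since(startedAt) < startupGrace {
			w.WriteHeader(http.StatusOK)
			return
		}
		for _, s := range servers {
			if time.Since(s.PrevState.LastStatementStatsAt) > maxAge {
				http.Error(w, "no recent full snapshot", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

The servers slice and startedAt would have to be passed in from wherever the collector builds its server list at startup.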

Author

My intention was to configure this healthcheck in the ECS task definition or as a Kubernetes liveness probe. The idea was that if the process is unresponsive, ECS or k8s would restart the process.

In that context, I'm not sure an issue with one of the monitored DBs would be a good enough reason to kick the entire collector. Maybe the statuses of monitored DBs could be reported as a metric on the collector? As well as statuses of uploads? In other words, the collector could expose metrics for the status of targets and the status of uploads. If my understanding is correct, the latter would cover #637. A rough sketch of what that could look like is below.
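
A rough sketch of that metric-based approach, assuming a Prometheus-style exposition (the use of the Prometheus client library, the metric names, the "server" label, and the port are all made up for illustration, not something the collector currently does):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauges: 1 = the last collection/upload for a server succeeded, 0 = it failed.
var (
	collectionOK = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "collector_target_collection_ok", Help: "Last stats collection succeeded"},
		[]string{"server"},
	)
	uploadOK = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "collector_snapshot_upload_ok", Help: "Last snapshot upload succeeded"},
		[]string{"server"},
	)
)

func main() {
	prometheus.MustRegister(collectionOK, uploadOK)

	// The snapshot and upload code paths would set these, e.g.:
	collectionOK.WithLabelValues("mydb").Set(1)
	uploadOK.WithLabelValues("mydb").Set(0)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9187", nil) // illustrative port
}
```

Exposing metrics like this shifts the restart/alert decision to whatever scrapes them, rather than having ECS/k8s act on a probe directly.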

Contributor

Yeah, perhaps I dismissed this idea prematurely. I think what Lukas proposed could work: we'd just need to make sure the health check has access to the server list. We could potentially expose configuration around that to have some flexibility over what constitutes a health check failure.

@mwasilew2 for the collector metric approach, how were you thinking of reporting these metrics? That seems different from a health check endpoint, right?

Member

One additional thought on how we could do this: if the point is to detect unresponsive collectors, we could track when the full snapshot runner was last entered (for all servers), and if that time is too old, have the healthcheck fail.

That would detect e.g. an infinite block in a goroutine inside the full snapshot (and presumably cause a restart of the container in a container-based environment), but would not fail when e.g. a server has the wrong password configured, etc.
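
A minimal sketch of that variant, assuming a timestamp is stamped each time the full snapshot runner is entered (all names and thresholds here are illustrative, not the collector's actual code):

```go
package healthcheck

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastRunnerEntry holds the Unix time at which the full snapshot runner was
// last entered, updated from the runner goroutine.
var lastRunnerEntry atomic.Int64

// markRunnerEntered would be called at the top of the full snapshot runner loop.
func markRunnerEntered() {
	lastRunnerEntry.Store(time.Now().Unix())
}

// healthCheck fails only if the runner has not been entered for too long, so it
// catches a stuck collector but not per-server problems like a wrong password.
func healthCheck(w http.ResponseWriter, r *http.Request) {
	const maxAge = 15 * time.Minute // illustrative; a bit above the 10-minute snapshot interval
	last := lastRunnerEntry.Load()
	if last != 0 && time.Since(time.Unix(last, 0)) > maxAge {
		http.Error(w, "full snapshot runner not entered recently", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```

markRunnerEntered would need to be called from the snapshot scheduling loop so that the timestamp covers all configured servers.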

@lfittl
Member

lfittl commented Mar 19, 2025

We just reviewed this on 3/18/25 in an internal review - we still think this would be a good idea to add, if we can have it check some internal health state (i.e. whether stats were recently collected/submitted).
