Hi.

We have many Concourse instances, and I recently upgraded them from v7.6.0 to v7.9.1 (thanks for the `--disable-srv-lookup` fix!). We use BOSH-deployed VMs, and we upgraded from Postgres 43 to 44 in the process, as well as changing stemcells from Bionic 1.54 to Jammy 1.108.

After upgrading, our worker nodes have been acting flaky. We've gotten a lot of errors of a few varieties on many of our instances. Workers end up stalled (as shown by `fly workers`), and we have to `bosh delete` these VMs and let the director recreate them.
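For concreteness, that recovery cycle looks roughly like the sketch below. The target, deployment, and worker names are placeholders; `bosh recreate` is shown as a one-step equivalent of deleting the VM and letting the director rebuild it, and `fly prune-worker` is an extra step that clears the stalled worker's record.

```sh
# Check worker state; flaky workers show up as "stalled"
fly -t main workers

# Clear the stalled worker's record ("main" and "worker-abc123" are placeholders)
fly -t main prune-worker -w worker-abc123

# Delete and rebuild the VM in one step ("concourse" and "worker/0" are placeholders)
bosh -d concourse recreate worker/0
```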
To my eye, this seems like something to do with garbage collection, but despite furious googling and trial and error, I can't find an obvious culprit. Database usage is also much larger than I'd expect on several of our deployments, which appear to be the more problematic ones. I've scaled up to add workers on several of these deployments, and the issues persist.

We have some failed jobs that leave containers around, but that doesn't appear to be enough to move the needle so far as to lead to these failures.

Anyone got any advice? Or can I provide further details or logs to help provide additional context?

Thanks.
Replies: 2 comments

Different deployment than yours, but similar symptoms.
OK -- an update. This appears to be largely due to artifacts of many old pipeline runs (hundreds of thousands), which left the database with massive numbers of rows and led to slower operations. Why did this happen after the upgrade and not before? Maybe the DB upgrade process exacerbated this behavior; hard to tell. Regardless, deleting the offending pipelines and re-flying them to get rid of this residue seems to make everything run snappier, and garbage collection now seems more effective. The diagnosis & remediation steps here were helpful: https://github.com/concourse/concourse/wiki/Schema-Inspection-Queries#show-the-size-and-number-of-hits-for-each-index
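In the spirit of that wiki page, a query along these lines lists each index's size and scan count; this sketch assumes the database is named `atc` (the Concourse default) and that you can run `psql` against it:

```sh
# Show the size and number of hits for each index, largest first.
# "atc" is the default Concourse database name; adjust for your deployment.
psql -d atc <<'SQL'
SELECT relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS index_hits
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
SQL
```

Huge indexes with near-zero hits on build-related tables can point to the kind of old-run residue described above.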
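The delete-and-re-fly step then amounts to something like this; the target, pipeline name, and config path are placeholders. Note that destroying a pipeline discards its build history and resource versions, so the re-flown pipeline starts fresh.

```sh
# Destroy the bloated pipeline along with its accumulated build history
fly -t main destroy-pipeline -p my-pipeline

# Re-fly it from its config and unpause it
# ("my-pipeline" and "pipeline.yml" are placeholders)
fly -t main set-pipeline -p my-pipeline -c pipeline.yml
fly -t main unpause-pipeline -p my-pipeline
```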