Metabase becomes inoperable, leaves too many hanging DB connections #8679
Comments
It would be good to test this out on the most recent version of Metabase. If you are still seeing this issue after upgrading, what would be most useful is a thread dump of the Metabase process taken when you observe the connection issue. A thread dump captures what each thread in the JVM is doing at that exact moment. I'm not sure how you have deployed Metabase, but the easiest way to get a thread dump is to find the PID of the Metabase process and run `jstack <PID>` against it.
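If jstack isn't available in your container image (a common situation with Alpine-based distributions, as noted further down), the JVM can produce a dump itself: sending SIGQUIT (`kill -3 <PID>`) makes it print one to stdout. As a further fallback, here is a minimal Clojure sketch of the same idea, assuming you can evaluate code inside the running process (e.g. via an nREPL):

```clojure
;; print every live thread's name, state and stack, similar to jstack output
(doseq [[^Thread t frames] (Thread/getAllStackTraces)]
  (println "---" (.getName t) (.getState t))
  (doseq [frame frames]
    (println "   " (str frame))))
```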
@senior I'm seeing this issue even in the 31.0 RC. Isn't it supposed to incorporate all the fixes mentioned from 30.4?
@joaoferrao
@senior That's correct.
@joaoferrao You should have the fixes then. What would help is figuring out what those connections are doing via the thread dump I described above.
@senior I'm trying to follow your instructions, but the Alpine distribution that Metabase ships in apparently doesn't include jstack, and I'm trying to find a way to install it without resorting to restarting the production instance. Suggestions?
We're using release v0.32.9 and it also shows this problem: no active Metabase users, yet it holds 195 open connections to my DB server (PostgreSQL 11). I have Metabase configured to view 350 databases, and the Postgres connection max is 200. It would be nice if unused connections could be closed.
We experience the same issue with a smaller number of databases (around 30). We are thinking of putting PgBouncer in between to handle the connections.
PgBouncer did not work for us. We are now testing with the latest RC (0.33.0-preview2) and the same thing happens: it opens a lot of connections (97 of the 100 allowed on that Postgres) and keeps them open forever. This means other processes can no longer connect to that database, for example the one that inserts the data in the first place; that process then simply fails to connect.

See also the open connections from before (v0.31.2) and then the spike once we upgraded to 0.33.0-preview2; this also happened with 0.32. We go from around 20-40 open connections to "max connections allowed" open, so I guess something changed after 0.31.2 in how connections to the DB are handled.

What I noticed is that each connected database seems to max out at around 15 connections (per database or per user, I'm not sure what determines the limit). That would mean you need at least 15 available connections per configured database. I have now raised the limit to 200 allowed connections, and while using Metabase those also slowly get filled up. In the case of @djbusby that would mean a whopping 15 × 350 = 5,250 connections in the worst case.

Hopefully somebody can come up with a good solution: something like "close connections after x amount of time inactive".
@emetselaar
You might want to look at the c3p0 section about Managing Pool Size and Connection Age, since you might be able to close idle connections.
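For reference, the kind of settings that c3p0 section covers, written as the Clojure property map Metabase uses internally. This is a hedged sketch of values one might experiment with, not Metabase's shipped defaults (those are quoted in the next comment); all property names are standard c3p0 settings:

```clojure
;; sketch: close idle connections sooner than the defaults do
{"maxIdleTime"                  (* 30 60)  ; cull any connection idle for 30 min
 "maxIdleTimeExcessConnections" (* 5 60)   ; shrink back to minPoolSize after 5 min idle
 "idleConnectionTestPeriod"     (* 3 60)}  ; test idle connections every 3 min
```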
Thanks @flamber, I will try it out tomorrow. Now that I know where to look, I see that the …
Update: I could not get it to work, so I raised the number of allowed open connections on the database instead. However, I think the inconsistency is worth a look. For reference, the pool configuration from /metabase/src/metabase/driver/sql_jdbc/connection.clj:

```clojure
(def ^:private data-warehouse-connection-pool-properties
  "c3p0 connection pool properties for connected data warehouse DBs. See
  https://www.mchange.com/projects/c3p0/#configuration_properties for descriptions of properties."
  {"maxIdleTime"                  (* 3 60 60)
   "minPoolSize"                  1
   "initialPoolSize"              1
   "maxPoolSize"                  15
   ;; prevent broken connections closed by dbs by testing them every 3 mins
   "idleConnectionTestPeriod"     (* 3 60)
   ;; prevent overly large pools by condensing them when connections are idle for 15m+
   "maxIdleTimeExcessConnections" (* 15 60)})
```
Facing the same issue with AWS RDS MySQL.
I worked around this by increasing the number of allowed connections on the database itself. Not ideal, but for my purposes it works.
I am also encountering long-running queries/connections; thought I would post them in case it's helpful:
Fixed by #11832
Hi all. I'm still experiencing the above problem. Metabase keeps taking up more and more connection slots until nothing is able to connect. Is this a known regression?
@Mapiarz
Here's the diagnostic info:
Here are the final moments of my Postgres server before it became fully unresponsive:
The Metabase logs only seem to be available for the last hour or so, so I will have to wait until Metabase eats up all my connections again. Is there another way, @flamber?
@Mapiarz You should use the support email when using the Enterprise Edition. Also, the latest release is 1.36.7.
@flamber We are not self-hosting Metabase; we are using the Metabase Cloud Starter plan. As for your other questions:
@Mapiarz Okay, very helpful information. Are you using an SSH tunnel? Either way, the log will show details about threads, connections, queues, etc. It looks like it's running out of connections hourly, which matches the default sync process, and that makes me think the previous sync somehow doesn't complete. But it should only consume 15 connections max, unless you have configured multiple databases in Admin > Databases, since each database configured there can hold up to 15 connections simultaneously.
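To see how those per-database pools add up on the Postgres side, you can count connections by database, user, and state in pg_stat_activity. A minimal sketch using next.jdbc (the connection details are placeholders; point it at the server that is running out of slots):

```clojure
(require '[next.jdbc :as jdbc])

;; placeholder credentials for the Postgres server being exhausted
(def ds (jdbc/get-datasource {:dbtype "postgresql" :dbname "mydb"
                              :user "metabase" :password "secret"}))

;; count connections per database/user/state to see which pool is filling up
(jdbc/execute! ds
  ["select datname, usename, state, count(*) as n
      from pg_stat_activity
     group by datname, usename, state
     order by n desc"])
```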
@flamber Some answers:
I've just opened up Metabase and opened one of our dashboards with a bunch of custom questions, and the number of idle connections to Postgres grew to 33. The more questions/dashboards I open, the more connections open up. Whenever I open something new the connection count grows, but when I navigate away (or even close the Metabase tab) the connections are not closed. Surely that's not the intended behaviour? The good news is that I now know how to reproduce it, so I'm sure we'll get to the bottom of this. Here's a log I just grabbed from the Metabase admin panel: https://dpaste.org/Dc4X/slim (it's long, so I've pasted it elsewhere).
IMPORTANT: if anyone hits this issue, is running on RDS, AND has Performance Insights enabled, please provide us with all the info that Performance Insights provides so we can identify the queries that are being run when connections get out of control. Thanks!
So I think there might be two different issues in this thread. To try to reintroduce some clarity, it would be helpful if people distinguished between Metabase's own app DB and the other databases used as sources for Metabase. The two issues I think I'm seeing:
I think I have solved problem number 2; number 1 remains under active investigation. Here's the issue with number 2, in a simplified version where the connection pool to the source DB has 2 slots (see the sketch below):
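A hypothetical Clojure sketch of that deadlock, modeling the 2-slot pool as a semaphore. The names and timings are illustrative only, not Metabase's actual code:

```clojure
;; model a 2-slot connection pool
(def pool (java.util.concurrent.Semaphore. 2))

(defn run-pivot []
  (.acquire pool)                 ; main query checks out a connection and holds it
  (try
    (.acquire pool)               ; the subquery needs a second connection
    (try
      (Thread/sleep 100)          ; pretend to run the subquery
      (finally (.release pool)))  ; subquery returns its connection
    (finally (.release pool))))   ; main query returns its connection

;; two pivots at once: each grabs one slot for its main query, then both
;; block forever waiting for a second slot, and we have a deadlock
(def f1 (future (run-pivot)))
(def f2 (future (run-pivot)))
```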
Addresses part of #8679.

Pivot tables can have subqueries that run to create tallies. We do not hold the entirety of result sets in memory, so we have a bit of an inversion of control flow: connections are opened, queries run, result sets are transduced, and then the connection is closed. The error here was that subsequent queries for the pivot were run while the first connection was still held open, even though that connection was no longer needed. Enough pivots running at the same time in a dashboard can create a deadlock: the subqueries need a new connection, but the main queries cannot be released until the subqueries have completed.

Also, rf management is critical. Its completion arity must be called once and only once. We also have middleware that need to be composed each time (format, etc.) and others that can only be composed once (limit). We have to save the original reducing function before composition (this is the one that can write to the download writer, etc.) but compose it each time we use it with `(rff metadata)` so we get the format and other middleware. Keeping this distinction in mind will save you lots of time. (The limit query will ignore all subsequent rows if you just grab the output of `(rff metadata)` and not the rf returned from the `:rff` key on the context.)

But this takes the following connection management:

```
tap> "OPENING CONNECTION 0"
tap> "already open: "
tap> "OPENING CONNECTION 1"
tap> "already open: 0"
tap> "CLOSING CONNECTION 1"
tap> "OPENING CONNECTION 2"
tap> "already open: 0"
tap> "CLOSING CONNECTION 2"
tap> "OPENING CONNECTION 3"
tap> "already open: 0"
tap> "CLOSING CONNECTION 3"
tap> "CLOSING CONNECTION 0"
```

and properly sequences it so that connection 0 is closed before opening connection 1. It hijacks the executef to just pass that function into the reducef part, so we can reduce multiple times and therefore control the connections. Otherwise the reducef happens "inside" of the executef, at which point the connection is closed. Care is taken to ensure that:

- the init is only called once (subsequent queries have the init of the rf overridden to just return `init`, the acc passed in, rather than `(rf)`)
- the completion arity is only called once (via `(completing rf)`; the reducing function in the subsequent queries is just `([acc] acc)` and does not call `(rf acc)`)

Remember this is just on the lower reducing function; all of the takes, formats, etc. _above_ it will have the completion arity called because we are using transduce. The completion arity is what takes the volatile rows and row counts and actually nests them in the `{:data {:rows []}}` structure. Without calling that once (and ONLY once) you end up with no actual results; they are just in memory.
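For readers unfamiliar with the reducing-function mechanics described above, here is a standalone Clojure sketch (hypothetical names, not Metabase's code) of folding several result sets into one accumulator while calling the completion arity exactly once, at the very end:

```clojure
;; reduce several result sets with a single completion call at the end
(defn reduce-many [rf init result-sets]
  (rf                               ; completion arity: called once, at the end
   (reduce (fn [acc rows]
             (reduce rf acc rows))  ; step arity only; no per-query completion
           init
           result-sets)))

;; `completing` attaches a completion fn that nests the rows in the final shape
(def rf (completing conj (fn [rows] {:data {:rows rows}})))

(reduce-many rf [] [[1 2] [3 4]])
;; => {:data {:rows [1 2 3 4]}}
```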
Also experiencing this issue. Just to add a few things that I've noticed in our environment:
Environment details:
Connection pool details after restart, with no waiting users:
Connection pool details after a few minutes, when users have to wait 60 seconds:
Diagnostics info:
@flamber @paoliniluis Is there any update on this issue? 🤔
@monoprosito What are you seeing? Which version are you running?
I have exactly this problem, and the bug arises consistently. I guess a transaction stays open for statements that return too much data. My guess is that when Metabase is not able to download the entire query result, it leaves the transaction open until the user pages through the entire result set. I don't know the exact amount in bytes or rows that triggers it. I've added a LIMIT to the query that left a transaction idle, and the bug went away; it does not reproduce anymore.
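One way to confirm (and, as a stopgap, clear) such sessions on the Postgres side is to look for backends stuck in `idle in transaction`. A hedged sketch, reusing the next.jdbc datasource from the earlier example; the second statement is destructive, so double-check that the pids belong to Metabase first:

```clojure
;; list sessions that have sat in an open transaction for over 20 minutes
(jdbc/execute! ds
  ["select pid, usename, xact_start, left(query, 60) as query
      from pg_stat_activity
     where state = 'idle in transaction'
       and xact_start < now() - interval '20 minutes'"])

;; stopgap: terminate those backends
(jdbc/execute! ds
  ["select pg_terminate_backend(pid)
      from pg_stat_activity
     where state = 'idle in transaction'
       and xact_start < now() - interval '20 minutes'"])
```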
Still an issue for me:
```
{
"browser-info": {
"language": "fr-FR",
"platform": "Linux x86_64",
"userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
"vendor": "Google Inc."
},
"system-info": {
"file.encoding": "UTF-8",
"java.runtime.name": "OpenJDK Runtime Environment",
"java.runtime.version": "11.0.20.1+1",
"java.vendor": "Eclipse Adoptium",
"java.vendor.url": "https://adoptium.net/",
"java.version": "11.0.20.1",
"java.vm.name": "OpenJDK 64-Bit Server VM",
"java.vm.version": "11.0.20.1+1",
"os.name": "Linux",
"os.version": "5.4.0-148-generic",
"user.language": "en",
"user.timezone": "UTC"
},
"metabase-info": {
"databases": [
"postgres"
],
"hosting-env": "unknown",
"application-database": "postgres",
"application-database-details": {
"database": {
"name": "PostgreSQL",
"version": "14.8 (Debian 14.8-1.pgdg110+1)"
},
"jdbc-driver": {
"name": "PostgreSQL JDBC Driver",
"version": "42.5.4"
}
},
"run-mode": "prod",
"version": {
"date": "2023-10-03",
"tag": "v0.47.3",
"branch": "?",
"hash": "4202328"
},
"settings": {
"report-timezone": "Europe/Paris"
}
}
}
```
We're launching a feature in 47.4 that will allow killing dangling connections.
NEW: you can now define `MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS` as an environment variable. We suggest setting this env var to 20 minutes (1200 seconds).
The default value is already 20 minutes.
Recategorizing this as P2, since 47.4+ has the unreturned-connection timeout described above.
There's a `debugUnreturnedConnectionStackTraces` option in the c3p0 library that might help us out here: https://www.mchange.com/projects/c3p0/#unreturnedConnectionTimeout
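A sketch of how those c3p0 properties might be combined with the pool map quoted earlier in the thread; both property names are real c3p0 settings, but treating them as a drop-in for Metabase's configuration is an assumption:

```clojure
;; hypothetical: extend the data-warehouse pool map with c3p0's
;; unreturned-connection debugging
(def debug-pool-properties
  (merge data-warehouse-connection-pool-properties
         {"unreturnedConnectionTimeout"          (* 20 60) ; reclaim after 20 min
          "debugUnreturnedConnectionStackTraces" true}))   ; log checkout stack traces
```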
Basically, at some point every day, queries stop returning anything due to the number of hanging connections. The status goes from "doing science" to "took too much time".
Would really appreciate some pointers, or a way to tell Metabase to kill DB connections. It becomes inoperable every day, I suspect due to the number of hanging/idle connections to the DB. As soon as I restart the container, the connections to RDS drop to 0 and queries work again immediately.
Below is what happened; before 9h30 is when I restarted the container.