Metabase becomes inoperable, leaves too many hanging DB connections #8679

Open
joaoferrao opened this issue Oct 12, 2018 · 83 comments
Comments

@joaoferrao

joaoferrao commented Oct 12, 2018

Bugs

  • Your browser and the version: tested in Chrome and Firefox
  • Your operating system: (e.g. OS X 10, Windows XP, etc.)
  • Your databases: AWS RDS PostgreSQL
  • Metabase version: 0.28.0, 0.29, 0.30.0, 0.31.0RC
  • Metabase hosting environment: AWS ECS
  • Metabase internal database: AWS RDS PostgreSQL

Basically, at some point every day, the queries don't return anything due to the number of hanging connections. The status goes from "doing science" to "took too much time".

Would really appreciate some pointers, or a way to tell Metabase to kill DB connections. It becomes inoperable every day, I suspect due to the number of hanging/idle connections to the DB. As soon as I restart the container, the connections to the RDS drop to 0 and the queries work again immediately.

Below is what happened; before 9:30 is when I restarted the container.

image

@senior
Contributor

senior commented Oct 25, 2018

It would be good to test this out on the most recent version of Metabase, 0.30.4, as there have been several fixes recently that I think would help prevent this issue. Although #8312 doesn't specifically say it was leaving database connections open, I think it's pretty likely that it would.

If you are still seeing this issue after upgrading, what would be most useful is a thread dump of the Metabase process when you observe this connection issue. A thread dump will capture what each thread in the JVM is doing at that exact moment. I'm not sure how you have deployed Metabase, but the easiest way to get a thread dump is to find the PID of your Metabase process and run jstack on that PID. You can then redirect the output of that command to a file, e.g. jstack 123 >> thread-dump.txt. If you would like, you can email it to me directly; my address is my first name at metabase.com.
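For anyone following along, a minimal sketch of that workflow, assuming a JDK (for jps/jstack) is available on the host running Metabase; the PID lookup shown is just one way to find it:

```
# Find the PID of the Metabase JVM
jps -l | grep -i metabase        # or: pgrep -f metabase.jar

# Capture a thread dump to a file (replace 12345 with the actual PID)
jstack 12345 > thread-dump.txt
```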

@joaoferrao
Author

@senior I'm seeing this issue happening even in the 0.31.0 RC. Isn't it supposed to incorporate all the mentioned fixes from 0.30.4?

@senior
Contributor

senior commented Oct 29, 2018

@joaoferrao 0.31.0.RC1 hasn't been released yet. Are you building your own off of the release branch?

@joaoferrao
Author

@senior that's correct.

@senior
Contributor

senior commented Oct 29, 2018

@joaoferrao You should have the fixes then. What would be helpful would be figuring out what those connections are doing via the thread dump that I described above.

@joaoferrao
Author

joaoferrao commented Oct 31, 2018

@senior I'm trying to follow your instructions, but the Alpine distribution that Metabase ships in apparently doesn't include jstack, and I'm trying to find a way to install it without restarting the production instance. Suggestions?

@djbusby

djbusby commented Jul 1, 2019

We're using release v0.32.9 and it also shows this problem: no active users of Metabase, yet it has 195 open connections to my DB server (PostgreSQL 11). I have Metabase configured to view 350 databases; the pg connection max is 200. It would be nice if unused connections could be closed.

@emetselaar

We experience the same issue with a lower number of databases (around 30). We are thinking of putting PgBouncer in between to handle the connections.

@emetselaar

emetselaar commented Jul 4, 2019

PgBouncer did not work for us.

We are now testing with the latest RC (0.33.0-preview2). The same thing happens: it opens a lot of connections (97 of the 100 allowed on that Postgres) and keeps them open forever. This means other processes can no longer connect to that database, for example the one that inserts the data in the first place.

That process then runs into this:
psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections

See also the open connections from before (v0.31.2) and then the spike once we upgraded to 0.33-preview2. This also happened with 0.32. We go from around 20-40 open connections to "max connections allowed" open, so I guess something changed after 0.31.2 in terms of how connections to the DB are handled.

Screenshot 2019-07-04 at 20 36 44

What I noticed is that each connected database seems to max out at around 15 connections (per database or per user; I'm not sure what determines the limit). That would mean you need at least (number of databases x 15) connections, plus some extra, to keep the database accessible for other processes.

Screenshot 2019-07-04 at 20 48 53

I have now upgraded to 200 allowed connections, and while using Metabase those also slowly fill up.

In the case of @djbusby that would mean a whopping 15 * 350 = 5,250 connections in the worst case. Hopefully somebody can come up with a good solution: something like "close connections after X amount of time inactive".
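For anyone trying to confirm where the connections pile up on the Postgres side, a hedged monitoring query using the standard pg_stat_activity view (the host, user, and database names here are placeholders):

```
# Count connections per database and state, as Postgres itself sees them
psql -h your-rds-host -U your_admin_user -d postgres -c "
  SELECT datname, state, count(*)
  FROM pg_stat_activity
  GROUP BY datname, state
  ORDER BY count(*) DESC;"
```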

@flamber
Contributor

flamber commented Jul 4, 2019

@emetselaar
I don't know if it will help or cause more problems, so please use with caution.
The 15 connections per database is the current default, but with PR #9788 it looks like you can pass a parameter that changes the maximum pool size of c3p0. For example, 10 connections:

java -Dc3p0.maxPoolSize=10 -jar metabase.jar

You might want to look at the c3p0 section on Managing Pool Size and Connection Age, since you might be able to close idle connections there.
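A hedged sketch combining that override with the idle-time properties from the linked c3p0 section, assuming Metabase's pools pick up c3p0 VM-level system properties the same way the maxPoolSize example does (values are illustrative, not recommendations):

```
java \
  -Dc3p0.maxPoolSize=10 \
  -Dc3p0.maxIdleTime=600 \
  -Dc3p0.maxIdleTimeExcessConnections=300 \
  -jar metabase.jar
```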

@emetselaar

Thanks @flamber, I will try it out tomorrow. Now that I know where to look, I see that maxIdleTime is set at 3 hours, but the idleConnectionTestPeriod of 3 minutes is probably preventing maxIdleTime from ever being hit. So the connections are never discarded?

@emetselaar

emetselaar commented Jul 9, 2019

Update: I could not get it to work, so I increased the number of allowed open connections on the database.

However, I think the inconsistency between maxIdleTime and idleConnectionTestPeriod looks like a bug. I do not understand the intention, so it could be by design, but if that is the case then maxIdleTime does not have to be there at all, right?

From metabase/src/metabase/driver/sql_jdbc/connection.clj:

(def ^:private data-warehouse-connection-pool-properties
  "c3p0 connection pool properties for connected data warehouse DBs. See
  https://www.mchange.com/projects/c3p0/#configuration_properties for descriptions of properties."
  {"maxIdleTime"                  (* 3 60 60)
   "minPoolSize"                  1
   "initialPoolSize"              1
   "maxPoolSize"                  15
   ;; prevent broken connections closed by dbs by testing them every 3 mins
   "idleConnectionTestPeriod"     (* 3 60)
   ;; prevent overly large pools by condensing them when connections are idle for 15m+
   "maxIdleTimeExcessConnections" (* 15 60)})

@rajatgl17

Facing the same issue with AWS RDS MySQL.
@emetselaar Were you able to resolve it anyhow?

@emetselaar

Facing the same issue with AWS RDS MySQL.
@emetselaar Were you able to resolve it anyhow?

I worked around this by increasing the number of allowed connections on the database itself. Not ideal, but for my purposes it works.

@keithbrink

I am also encountering long-running queries/connections; thought I would post them in case it's helpful:

Just now · analytics · 344737.12 ms
SELECT "core_session"."created_at", "core_session"."user_id", "core_user"."is_superuser" FROM "core_session" LEFT JOIN "core_user" ON "core_session"."user_id" = "core_user"."id" WHERE ("core_user"."is_active" = TRUE AND "core_session"."id" = $1) LIMIT 1

Just now · analytics · 344736.97 ms
SHOW TRANSACTION ISOLATION LEVEL

Just now · analytics · 344081.78 ms
SELECT "core_session"."created_at", "core_session"."user_id", "core_user"."is_superuser" FROM "core_session" LEFT JOIN "core_user" ON "core_session"."user_id" = "core_user"."id" WHERE ("core_user"."is_active" = TRUE AND "core_session"."id" = $1) LIMIT 1

Just now · analytics · 12703.728 ms
SELECT "pulse_channel"."id", "pulse_channel"."pulse_id", "pulse_channel"."schedule_type", "pulse_channel"."channel_type" FROM "pulse_channel" WHERE ("enabled" = TRUE AND ("schedule_type" = $1 OR ("schedule_type" = $2 AND "schedule_hour" = 8) OR ("schedule_type" = $3 AND "schedule_hour" = 8 AND "schedule_day" = $4) OR ("schedule_type" = $5 AND "schedule_hour" = 8 AND "schedule_frame" = $6 AND ("schedule_day" = $7 OR "schedule_day" = $8))))

@camsaul
Member

camsaul commented Mar 4, 2020

Fixed by #11832

@camsaul camsaul closed this as completed Mar 4, 2020
@flamber flamber added this to the 0.35.0 milestone Mar 4, 2020
@Mapiarz

Mapiarz commented Oct 14, 2020

Hi all. I'm still experiencing the above problem: Metabase keeps taking up more and more connection slots until nothing is able to connect. Is this a known regression?

@flamber
Contributor

flamber commented Oct 14, 2020

@Mapiarz
Please post the "Diagnostic Info" from Admin > Troubleshooting, and say which database you're seeing this with.
Also post the logs from Admin > Troubleshooting > Logs.
And any process/connection stats from your database that could be helpful.

@Mapiarz

Mapiarz commented Oct 15, 2020

Here's the diagnostic info:

{
  "browser-info": {
    "language": "en-US",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
    "vendor": ""
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.7+10",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.7",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.7+10",
    "os.name": "Linux",
    "os.version": "4.14.186-146.268.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "mongo",
      "googleanalytics",
      "postgres"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "11.7"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.8"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2020-09-21",
      "tag": "v1.36.6.1",
      "branch": "enterprise-release-1.36.x",
      "hash": "5695e2b"
    },
    "settings": {
      "report-timezone": "US/Eastern"
    }
  }
}

Here are the final moments of my Postgres server before it became fully unresponsive:

2020-10-14 02:00:14.508 UTC [31786] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_framework_api_key_apikey"."hashed_key" AS "hashed_key", "public"."rest_framework_api_key_apikey"."name" AS "name", "public"."rest_framework_api_key_apikey"."created" AS "created", "pu
blic"."rest_framework_api_key_apikey"."expiry_date" AS "expiry_date", "public"."rest_framework_api_key_apikey"."revoked" AS "revoked", "public"."rest_framework_api_key_apikey"."prefix" AS "prefix" FROM "public"."r
est_framework_api_key_apikey" LIMIT 10000
2020-10-14 02:00:14.601 UTC [31786] analysis@merged-watchdog ERROR:  permission denied for relation rest_hooks_hook
2020-10-14 02:00:14.601 UTC [31786] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_hooks_hook"."target" AS "target", "public"."rest_hooks_hook"."event" AS "event", "public"."rest_hooks_hook"."created" AS "created", "public"."rest_hooks_hook"."updated" AS "updated", 
"public"."rest_hooks_hook"."user_id" AS "user_id" FROM "public"."rest_hooks_hook" LIMIT 10000
2020-10-14 02:05:03.440 UTC [31785] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 02:05:03.440 UTC [31786] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 03:00:15.030 UTC [16489] analysis@merged-watchdog ERROR:  permission denied for relation rest_framework_api_key_apikey
2020-10-14 03:00:15.030 UTC [16489] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_framework_api_key_apikey"."hashed_key" AS "hashed_key", "public"."rest_framework_api_key_apikey"."name" AS "name", "public"."rest_framework_api_key_apikey"."created" AS "created", "pu
blic"."rest_framework_api_key_apikey"."expiry_date" AS "expiry_date", "public"."rest_framework_api_key_apikey"."revoked" AS "revoked", "public"."rest_framework_api_key_apikey"."prefix" AS "prefix" FROM "public"."r
est_framework_api_key_apikey" LIMIT 10000
2020-10-14 03:00:15.131 UTC [16489] analysis@merged-watchdog ERROR:  permission denied for relation rest_hooks_hook
2020-10-14 03:00:15.131 UTC [16489] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_hooks_hook"."target" AS "target", "public"."rest_hooks_hook"."event" AS "event", "public"."rest_hooks_hook"."created" AS "created", "public"."rest_hooks_hook"."updated" AS "updated", 
"public"."rest_hooks_hook"."user_id" AS "user_id" FROM "public"."rest_hooks_hook" LIMIT 10000
2020-10-14 03:39:50.685 UTC [18135] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 04:00:12.359 UTC [6674] analysis@merged-watchdog ERROR:  permission denied for relation rest_framework_api_key_apikey
2020-10-14 04:00:12.359 UTC [6674] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_framework_api_key_apikey"."hashed_key" AS "hashed_key", "public"."rest_framework_api_key_apikey"."name" AS "name", "public"."rest_framework_api_key_apikey"."created" AS "created", "pu
blic"."rest_framework_api_key_apikey"."expiry_date" AS "expiry_date", "public"."rest_framework_api_key_apikey"."revoked" AS "revoked", "public"."rest_framework_api_key_apikey"."prefix" AS "prefix" FROM "public"."r
est_framework_api_key_apikey" LIMIT 10000
2020-10-14 04:00:12.437 UTC [6674] analysis@merged-watchdog ERROR:  permission denied for relation rest_hooks_hook
2020-10-14 04:00:12.437 UTC [6674] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_hooks_hook"."target" AS "target", "public"."rest_hooks_hook"."event" AS "event", "public"."rest_hooks_hook"."created" AS "created", "public"."rest_hooks_hook"."updated" AS "updated", 
"public"."rest_hooks_hook"."user_id" AS "user_id" FROM "public"."rest_hooks_hook" LIMIT 10000
2020-10-14 04:06:28.072 UTC [6674] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 04:24:09.483 UTC [9994] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.501 UTC [9993] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.501 UTC [9992] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.577 UTC [9997] analysis@merged-watchdog FATAL:  sorry, too many clients already
2020-10-14 04:24:09.581 UTC [9996] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.583 UTC [9995] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.596 UTC [9998] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.608 UTC [9999] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.641 UTC [10007] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.650 UTC [10005] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.656 UTC [10006] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.675 UTC [10004] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.689 UTC [10003] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.691 UTC [10002] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.799 UTC [10009] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.801 UTC [10008] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.707 UTC [10012] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.711 UTC [10011] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.718 UTC [10010] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.750 UTC [10013] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.761 UTC [10015] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.767 UTC [10014] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.855 UTC [10017] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.858 UTC [10016] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.764 UTC [10018] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.772 UTC [10020] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.775 UTC [10019] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.801 UTC [10022] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.811 UTC [10023] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.816 UTC [10024] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.917 UTC [10025] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.919 UTC [10026] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.816 UTC [10027] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.825 UTC [10029] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.826 UTC [10028] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.869 UTC [10031] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.872 UTC [10032] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.878 UTC [10030] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.979 UTC [10033] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.986 UTC [10034] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.889 UTC [10035] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.893 UTC [10036] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.894 UTC [10037] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.927 UTC [10040] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.927 UTC [10038] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.928 UTC [10039] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:15.040 UTC [10043] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:15.042 UTC [10041] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:15.946 UTC [10047] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections

The Metabase logs only seem to be available for the last hour or so, so I will have to wait until Metabase eats up all my connections again. Is there another way, @flamber?

@flamber
Contributor

flamber commented Oct 15, 2020

@Mapiarz You should use the support email when using the Enterprise Edition. Also, the latest release is 1.36.7.
I don't know which database you're showing logs from, but I'm guessing it's Postgres. Are you using pgbouncer or something similar?
How many connections does the database allow? How many connections is Metabase configured to use?
I don't know how you're hosting Metabase; it looks like EC2, but I don't know whether you're using the JAR or Docker.
The Metabase logs should indicate whether connections are open, and whether they are queuing.

@Mapiarz

Mapiarz commented Oct 15, 2020

@flamber We are not self-hosting Metabase; we are using the Metabase Cloud Starter plan. As for the other questions:

  • Yes, sorry, I'm talking about Postgres
  • No pgbouncer or equivalent
  • Postgres configured for 100 connections
  • No idea how the Cloud hosted Metabase is configured

@flamber
Contributor

flamber commented Oct 15, 2020

@Mapiarz Okay, very helpful information. Are you using an SSH tunnel?

Still, the log will show details about threads, connections, queues, etc., like:
2020-10-15T15:58:19+02:00 DEBUG metabase.middleware.log POST /api/card/1471/query 202 [ASYNC: completed] 537.3 ms (26 DB calls) App DB connections: 1/13 Jetty threads: 2/50 (8 idle, 0 queued) (151 total active threads) Queries in flight: 1 (0 queued)

It looks like it's running out of connections hourly, which matches the default sync process, and that makes me think the previous sync somehow doesn't complete.

But it should only consume 15 connections max, unless you have configured multiple databases in Admin > Databases, since each database configured there can use up to 15 connections simultaneously.
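A quick, hedged way to watch those counters over time, assuming the logs are written to a file such as metabase.log (with Docker you would pipe docker logs instead):

```
# Pull the pool/thread counters out of the Metabase log lines
grep "App DB connections" metabase.log | tail -n 20
```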

@Mapiarz

Mapiarz commented Oct 15, 2020

@flamber Some answers:

  • Not using an SSH tunnel
  • Just 1 Postgres DB configured
  • The default sync process does not seem to make Postgres run out of available connection slots: if we are not actively using Metabase, the number of idle connections does not grow
  • We run out of available connection slots after my co-worker does some analysis in Metabase.

I've just opened Metabase and opened one of our dashboards with a bunch of custom questions, and the number of idle connections to Postgres grew to 33. The more questions/dashboards I open, the more connections open up. It seems that whenever I open something new the number of connections grows, but when I navigate away (or even close the Metabase tab) the connections are not closed. Surely that's not the intended behaviour? The good news is that I now know how to reproduce the behaviour, so I'm sure we'll get to the bottom of this.

Here's a log I just grabbed from the Metabase admin panel: https://dpaste.org/Dc4X/slim (it's long so I've pasted it elsewhere).

@paoliniluis
Contributor

IMPORTANT: if anyone hits this issue and is running on RDS AND has Performance Insights enabled, please provide us with all the info that Performance Insights provides, so we can identify the queries that are being run when connections get out of control. Thanks!

@dpsutton
Contributor

So I think there might be two different issues in this thread. To try to reintroduce some clarity, it would be helpful if people distinguished between Metabase's own app db and the other databases used as data sources for Metabase. The two issues I think I'm seeing:

  1. Metabase creates too many connections to its own app db
  2. Metabase locks up when it has lots of idle open connections to a source db (also, pivot tables might play a role in this)

I think I have solved problem number 2. Number 1 remains under active investigation. Here's the issue with number 2:
Metabase performs multiple queries to enable pivot tables. It first runs the "main" query, and then runs subsequent queries to get the pivot information. The subsequent queries are run while the first connection is still open but idle; it is closed only after all subqueries have run. This can create a deadlock scenario.

A simplified version, where the connection pool to the source db has a size of 2:

  • open a dashboard with two pivot tables on it
  • each question runs its original query, saturating the connection pool
  • each question then attempts to acquire another connection to the source db to run its subqueries
  • these connection attempts are queued, waiting on the two connections in the pool, but the two checked-out connections are not closed until the entire pivot results are returned
  • any subsequent questions against the source db are queued after these, so no more queries can go to the source db
  • sometimes the app becomes unresponsive (I'm not sure of the exact mechanism of this, and hope it no longer occurs once the deadlock on the source db connections is resolved)

dpsutton added a commit that referenced this issue May 26, 2022
Addresses part of #8679

Pivot tables can have subqueries that run to create tallies. We do not
hold the entirety of resultsets in memory so we have a bit of an
inversion of control flow: connections are opened, queries run, and
result sets are transduced and then the connection closed.

The error here was that subsequent queries for the pivot were run while
the first connection is still held open. But the connection is no longer
needed. But enough pivots running at the same time in a dashboard can
create a deadlock where the subqueries need a new connection, but the
main queries cannot be released until the subqueries have completed.

Also, rf management is critical. Its completion arity must be called
once and only once. We also have middleware that need to be
composed (format, etc.) and others that can only be composed
once (limit). We have to save the original reducing function before
composition (this is the one that can write to the download writer, etc.)
but compose it each time we use it with `(rff metadata)` so we have the
format and other middleware. Keeping this distinction in mind will save
you lots of time. (The limit query will ignore all subsequent rows if
you just grab the output of `(rff metadata)` and not the rf returned
from the `:rff` key on the context.)

But this takes the following connection management:

```
tap> "OPENING CONNECTION 0"
tap> "already open: "
  tap> "OPENING CONNECTION 1"
  tap> "already open: 0"
  tap> "CLOSING CONNECTION 1"
  tap> "OPENING CONNECTION 2"
  tap> "already open: 0"
  tap> "CLOSING CONNECTION 2"
  tap> "OPENING CONNECTION 3"
  tap> "already open: 0"
  tap> "CLOSING CONNECTION 3"
tap> "CLOSING CONNECTION 0"
```

and properly sequences it so that connection 0 is closed before opening
connection 1.

It hijacks the executef to just pass that function into the reducef part
so we can reduce multiple times and therefore control the
connections. Otherwise the reducef happens "inside" of the executef at
which point the connection is closed.

Care is taken to ensure that:
- the init is only called once (subsequent queries have the init of the
rf overridden to just return `init` (the acc passed in) rather than
`(rf)`)
- the completion arity is only called once (use of `(completing rf)`, and
the reducing function in the subsequent queries is just `([acc] acc)`
and does not call `(rf acc)`). Remember this is just on the lower
reducing function; all of the takes, formats, etc. _above_ it will
have the completion arity called because we are using transduce. The
completion arity is what takes the volatile rows and row counts and
actually nests them in the `{:data {:rows []}}` structure. Without
calling that once (and ONLY once) you end up with no actual
results; they are just in memory.
@rlotun rlotun modified the milestone: 0.43.2 May 27, 2022
diogormendes pushed a commit that referenced this issue May 31, 2022
dpsutton added a commit that referenced this issue May 31, 2022
diogormendes pushed a commit that referenced this issue May 31, 2022
@cavaestoca

cavaestoca commented Oct 20, 2022

Also experiencing this issue. Just to add a few things I've noticed in our environment:

  • The problem itself (apparently) is not related to query times, but to accessing the application (maybe related to connections to the Metabase DB)
  • After restarting the application, access to the app is really fast, no bottlenecks 👍
  • Running show full processlist; on the Metabase DB after a restart shows just a few "sleep" Metabase processes

image

  • Soon after, the number of "sleep" Metabase connections starts to grow until it reaches the limit set in MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE

image

  • Once the limit is reached, every user who was not using the app for a while (not sure if it's a session timeout) has to wait around 60 seconds before the application loads and they can continue (test below in an incognito window)

This wait time is the biggest issue we have ⚠️

image

  • After reaching the limit, the connections never drop; MySQL just stays constantly at max connections in sleep (left running for a couple of weeks and the connections never dropped; in the example below I was using a 50-connection limit. Since then we've changed it to 100 to see if there was any change. No improvement happened)

image

  • Another interesting thing is that every "sleep" Metabase connection restarts exactly every 60 seconds as well
  • I'm also not sure whether NGINX is contributing to the issue

Environment details:

  • Metabase version 0.44.4
  • Using Docker on AWS EBS
  • Around 10-15 users working concurrently every working hour
  • Metabase DB: MySQL 8
  • Connected DBs: Redshift, Snowflake, MySQL (Metabase), Postgres
  • env variables that we've changed:
MAX_SESSION_AGE → 7200
MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE → 100
MB_JDBC_DATA_WAREHOUSE_MAX_CONNECTION_POOL_SIZE → 75
MB_SESSION_COOKIES → true
MB_SESSION_TIMEOUT → 8 hours

Connection pool details after restart, no wait for users:

{"connection-pools":{"metabase-mysql-app-db":{"numConnections":7,"numIdleConnections":7,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":100}}}

Connection pool details after a few minutes, when users have to wait 60 seconds:

{"connection-pools":{"metabase-mysql-app-db":{"numConnections":100,"numIdleConnections":100,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":100},"db-9-snowflake-ACCURATE":{"numConnections":1,"numIdleConnections":1,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":75},"db-2-redshift-dev":{"numConnections":50,"numIdleConnections":50,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":75},"db-8-snowflake-ANALYTICS":{"numConnections":1,"numIdleConnections":1,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":75}}}

Diagnostics info:

{
  "browser-info": {
    "language": "pt",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.16.1+1",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.16.1",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.16.1+1",
    "os.name": "Linux",
    "os.version": "4.14.232-176.381.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "redshift",
      "postgres",
      "snowflake",
      "mysql"
    ],
    "hosting-env": "unknown",
    "application-database": "mysql",
    "application-database-details": {
      "database": {
        "name": "MySQL",
        "version": "8.0.28"
      },
      "jdbc-driver": {
        "name": "MariaDB Connector/J",
        "version": "2.7.6"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-09-29",
      "tag": "v0.44.4",
      "branch": "release-x.44.x",
      "hash": "382d728"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

@monoprosito

@flamber @paoliniluis Is there any update on this issue? 🤔

@paoliniluis
Contributor

@monoprosito What are you seeing? Which version are you running?

@lexrost

lexrost commented Aug 16, 2023

I have exactly this problem, and the bug arises consistently.
I also have a probable solution.

I guess that a transaction is left open for statements that return too much data. My guess is that when Metabase is not able to download the entire query result, it leaves the transaction open until the user pages through the entire result set. I don't know the exact amount in bytes or rows that triggers it.

I added a limit to the query that left a transaction idle, and the bug went away. It does not replicate anymore.
It might be somehow connected to Metabase's limit magic number.

@ziouf

ziouf commented Oct 11, 2023

Still an issue for me

{
  "browser-info": {
    "language": "fr-FR",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.20.1+1",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.20.1",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.20.1+1",
    "os.name": "Linux",
    "os.version": "5.4.0-148-generic",
    "user.language": "en",
    "user.timezone": "UTC"
  },
  "metabase-info": {
    "databases": [
      "postgres"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "14.8 (Debian 14.8-1.pgdg110+1)"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.5.4"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2023-10-03",
      "tag": "v0.47.3",
      "branch": "?",
      "hash": "4202328"
    },
    "settings": {
      "report-timezone": "Europe/Paris"
    }
  }
}

@paoliniluis
Contributor

We’re launching a feature in 47.4 that will allow killing dangling connections.

@paoliniluis
Contributor

NEW: you can now define MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS as an environment variable. We suggest you set this env var to 20 minutes (1200 seconds).
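A minimal sketch of setting it for a Docker deployment (the same variable can simply be exported before launching the JAR):

```
docker run -d -p 3000:3000 \
  -e MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS=1200 \
  --name metabase metabase/metabase
```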

@dpsutton dpsutton removed their assignment Jun 3, 2024
@ranquild ranquild added the .Team/QueryProcessor(deprecated) Use .Team/Querying instead label Jun 20, 2024
@camsaul
Member

camsaul commented Jun 24, 2024

The default value is already 20 minutes.

@camsaul
Member

camsaul commented Jun 24, 2024

Recategorizing this as P2 since 47.4+ has MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS and we haven't heard more about people running into this problem since that change happened

@camsaul camsaul added Priority:P2 Average run of the mill bug and removed Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness labels Jun 24, 2024
@NevRA NevRA added .Team/Querying and removed .Team/QueryProcessor(deprecated) Use .Team/Querying instead labels Aug 16, 2024
@paoliniluis
Contributor

There's a debugUnreturnedConnectionStackTraces option in the c3p0 library that might help us out here: https://www.mchange.com/projects/c3p0/#unreturnedConnectionTimeout
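For anyone who wants to experiment, a hedged sketch of those c3p0 knobs passed as JVM system properties, assuming the pools honor c3p0 VM-level configuration (values are illustrative):

```
java \
  -Dc3p0.unreturnedConnectionTimeout=1200 \
  -Dc3p0.debugUnreturnedConnectionStackTraces=true \
  -jar metabase.jar
```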
