Metabase becomes inoperable, leaves too many hanging DB connections #8679

Open
joaoferrao opened this issue Oct 12, 2018 · 83 comments
Comments

@joaoferrao

joaoferrao commented Oct 12, 2018

Bugs

  • Your browser and the version: tested in Chrome and Firefox
  • Your operating system: (e.g. OS X 10, Windows XP, etc.)
  • Your databases: AWS RDS PostgreSQL
  • Metabase version: 0.28.0, 0.29, 0.30.0, 0.31.0RC
  • Metabase hosting environment: AWS ECS
  • Metabase internal database: AWS RDS PostgreSQL

Basically, at some point every day, the queries don't return anything due to the number of hanging connections. The status goes from "doing science" to "took too much time".

Would really appreciate some pointers, or a way to tell Metabase to kill DB connections. It becomes inoperable every day, I suspect due to the number of hanging/idle connections to the DB. As soon as I restart the container, the connections to the RDS drop to 0 and the queries work again immediately.

Below is what happened; before 9:30 is when I restarted the container.

image

@senior
Contributor

senior commented Oct 25, 2018

It would be good to test this out on the most recent version of Metabase, 0.30.4, as there have been several fixes recently that I think would help prevent this issue. Although #8312 doesn't specifically say it was leaving database connections open, I think it's pretty likely that it would.

If you are still seeing this issue after upgrading, what would be most useful is a thread dump of the Metabase process when you observe this connection issue. A thread dump will capture what each thread in the JVM is doing at that exact moment. I'm not sure how you have deployed Metabase, but the easiest way to get a thread dump is to find the PID of your Metabase process and run jstack on that PID. You can then redirect the output of that command to a file, e.g. jstack 123 >> thread-dump.txt. If you would like, you can email it to me directly; my address is my first name at metabase.com.
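For anyone following along, a minimal sketch of that workflow, assuming a JDK (for jps/jstack) is available on the host running Metabase; the PID lookup shown is just one way to find it:

```
# Find the PID of the Metabase JVM
jps -l | grep -i metabase        # or: pgrep -f metabase.jar

# Capture a thread dump to a file (replace 12345 with the actual PID)
jstack 12345 > thread-dump.txt
```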

@joaoferrao
Author

@senior I'm seeing this issue happening even in the 0.31.0 RC. Isn't it supposed to incorporate all the mentioned fixes from 0.30.4?

@senior
Contributor

senior commented Oct 29, 2018

@joaoferrao 0.31.0.RC1 hasn't been released yet. Are you building your own off of the release branch?

@joaoferrao
Author

@senior that's correct.

@senior
Contributor

senior commented Oct 29, 2018

@joaoferrao You should have the fixes then. What would be helpful would be figuring out what those connections are doing via the thread dump that I described above.

@joaoferrao
Author

joaoferrao commented Oct 31, 2018

@senior I'm trying to follow your instructions, but the Alpine distribution that Metabase ships in apparently doesn't include jstack, and I'm trying to find a way to install it without restarting the production instance. Suggestions?

@djbusby

djbusby commented Jul 1, 2019

We're using release v0.32.9 and it also shows this problem: no active users of Metabase, yet it has 195 open connections to my DB server (PostgreSQL 11). I have Metabase configured to view 350 databases; the pg connection max is 200. It would be nice if unused connections could be closed.

@emetselaar

We experience the same issue with a lower number of databases (around 30). We are thinking of putting PgBouncer in between to handle the connections.

@emetselaar

emetselaar commented Jul 4, 2019

PgBouncer did not work for us.

We are now testing with the latest RC (0.33.0-preview2). The same thing happens: it opens a lot of connections (97 of the 100 allowed on that Postgres) and keeps them open forever. This means other processes can no longer connect to that database, for example the one that inserts the data in the first place.

That process then runs into this:
psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections

See also the open connections from before (v0.31.2) and then the spike once we upgraded to 0.33-preview2. This also happened with 0.32. We go from around 20-40 open connections to "max connections allowed" open, so I guess something changed after 0.31.2 in terms of how connections to the DB are handled.

Screenshot 2019-07-04 at 20 36 44

What I noticed is that each connected database seems to max out at around 15 connections (per database or per user; I'm not sure what determines the limit). That would mean you need at least (number of databases x 15) connections, plus some extra, to keep the database accessible for other processes.

Screenshot 2019-07-04 at 20 48 53

I have now upgraded to 200 allowed connections, and while using Metabase those also slowly fill up.

In the case of @djbusby that would mean a whopping 15 * 350 = 5,250 connections in the worst case. Hopefully somebody can come up with a good solution: something like "close connections after X amount of time inactive".
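For anyone trying to confirm where the connections pile up on the Postgres side, a hedged monitoring query using the standard pg_stat_activity view (the host, user, and database names here are placeholders):

```
# Count connections per database and state, as Postgres itself sees them
psql -h your-rds-host -U your_admin_user -d postgres -c "
  SELECT datname, state, count(*)
  FROM pg_stat_activity
  GROUP BY datname, state
  ORDER BY count(*) DESC;"
```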

@flamber
Contributor

flamber commented Jul 4, 2019

@emetselaar
I don't know if it will help or cause more problems, so please use with caution.
The 15 connections per database is the current default, but with PR #9788 it looks like you can pass a parameter that changes the maximum pool size of c3p0. For example, 10 connections:

java -Dc3p0.maxPoolSize=10 -jar metabase.jar

You might want to look at the c3p0 section on Managing Pool Size and Connection Age, since you might be able to close idle connections there.
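A hedged sketch combining that override with the idle-time properties from the linked c3p0 section, assuming Metabase's pools pick up c3p0 VM-level system properties the same way the maxPoolSize example does (values are illustrative, not recommendations):

```
java \
  -Dc3p0.maxPoolSize=10 \
  -Dc3p0.maxIdleTime=600 \
  -Dc3p0.maxIdleTimeExcessConnections=300 \
  -jar metabase.jar
```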

@emetselaar

Thanks @flamber, I will try it out tomorrow. Now that I know where to look, I see that maxIdleTime is set at 3 hours, but the idleConnectionTestPeriod of 3 minutes is probably preventing maxIdleTime from ever being hit. So the connections are never discarded?

@emetselaar

emetselaar commented Jul 9, 2019

Update: I could not get it to work, so I increased the number of allowed open connections on the database.

However, I think the inconsistency between maxIdleTime and idleConnectionTestPeriod looks like a bug. I do not understand the intention, so it could be by design, but if that is the case then maxIdleTime does not have to be there at all, right?

From metabase/src/metabase/driver/sql_jdbc/connection.clj:

(def ^:private data-warehouse-connection-pool-properties
  "c3p0 connection pool properties for connected data warehouse DBs. See
  https://www.mchange.com/projects/c3p0/#configuration_properties for descriptions of properties."
  {"maxIdleTime"                  (* 3 60 60)
   "minPoolSize"                  1
   "initialPoolSize"              1
   "maxPoolSize"                  15
   ;; prevent broken connections closed by dbs by testing them every 3 mins
   "idleConnectionTestPeriod"     (* 3 60)
   ;; prevent overly large pools by condensing them when connections are idle for 15m+
   "maxIdleTimeExcessConnections" (* 15 60)})

@rajatgl17

Facing the same issue with AWS RDS MySQL.
@emetselaar Were you able to resolve it anyhow?

@emetselaar

Facing the same issue with AWS RDS MySQL.
@emetselaar Were you able to resolve it anyhow?

I worked around this by increasing the number of allowed connections on the database itself. Not ideal, but for my purposes it works.

@keithbrink

I am also encountering long-running queries/connections; thought I would post them in case it's helpful:

Just now · analytics · 344737.12 ms
SELECT "core_session"."created_at", "core_session"."user_id", "core_user"."is_superuser" FROM "core_session" LEFT JOIN "core_user" ON "core_session"."user_id" = "core_user"."id" WHERE ("core_user"."is_active" = TRUE AND "core_session"."id" = $1) LIMIT 1

Just now · analytics · 344736.97 ms
SHOW TRANSACTION ISOLATION LEVEL

Just now · analytics · 344081.78 ms
SELECT "core_session"."created_at", "core_session"."user_id", "core_user"."is_superuser" FROM "core_session" LEFT JOIN "core_user" ON "core_session"."user_id" = "core_user"."id" WHERE ("core_user"."is_active" = TRUE AND "core_session"."id" = $1) LIMIT 1

Just now · analytics · 12703.728 ms
SELECT "pulse_channel"."id", "pulse_channel"."pulse_id", "pulse_channel"."schedule_type", "pulse_channel"."channel_type" FROM "pulse_channel" WHERE ("enabled" = TRUE AND ("schedule_type" = $1 OR ("schedule_type" = $2 AND "schedule_hour" = 8) OR ("schedule_type" = $3 AND "schedule_hour" = 8 AND "schedule_day" = $4) OR ("schedule_type" = $5 AND "schedule_hour" = 8 AND "schedule_frame" = $6 AND ("schedule_day" = $7 OR "schedule_day" = $8))))

@camsaul
Member

camsaul commented Mar 4, 2020

Fixed by #11832

@camsaul camsaul closed this as completed Mar 4, 2020
@flamber flamber added this to the 0.35.0 milestone Mar 4, 2020
@Mapiarz

Mapiarz commented Oct 14, 2020

Hi all. I'm still experiencing the above problem: Metabase keeps taking up more and more connection slots until nothing is able to connect. Is this a known regression?

@flamber
Contributor

flamber commented Oct 14, 2020

@Mapiarz
Please post the "Diagnostic Info" from Admin > Troubleshooting, and say which database you're seeing this with.
Also post the logs from Admin > Troubleshooting > Logs.
And any process/connection stats from your database that could be helpful.

@Mapiarz

Mapiarz commented Oct 15, 2020

Here's the diagnostic info:

{
  "browser-info": {
    "language": "en-US",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
    "vendor": ""
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.7+10",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.7",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.7+10",
    "os.name": "Linux",
    "os.version": "4.14.186-146.268.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "mongo",
      "googleanalytics",
      "postgres"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "11.7"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.2.8"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2020-09-21",
      "tag": "v1.36.6.1",
      "branch": "enterprise-release-1.36.x",
      "hash": "5695e2b"
    },
    "settings": {
      "report-timezone": "US/Eastern"
    }
  }
}

Here are the final moments of my Postgres server before it became fully unresponsive:

2020-10-14 02:00:14.508 UTC [31786] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_framework_api_key_apikey"."hashed_key" AS "hashed_key", "public"."rest_framework_api_key_apikey"."name" AS "name", "public"."rest_framework_api_key_apikey"."created" AS "created", "pu
blic"."rest_framework_api_key_apikey"."expiry_date" AS "expiry_date", "public"."rest_framework_api_key_apikey"."revoked" AS "revoked", "public"."rest_framework_api_key_apikey"."prefix" AS "prefix" FROM "public"."r
est_framework_api_key_apikey" LIMIT 10000
2020-10-14 02:00:14.601 UTC [31786] analysis@merged-watchdog ERROR:  permission denied for relation rest_hooks_hook
2020-10-14 02:00:14.601 UTC [31786] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_hooks_hook"."target" AS "target", "public"."rest_hooks_hook"."event" AS "event", "public"."rest_hooks_hook"."created" AS "created", "public"."rest_hooks_hook"."updated" AS "updated", 
"public"."rest_hooks_hook"."user_id" AS "user_id" FROM "public"."rest_hooks_hook" LIMIT 10000
2020-10-14 02:05:03.440 UTC [31785] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 02:05:03.440 UTC [31786] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 03:00:15.030 UTC [16489] analysis@merged-watchdog ERROR:  permission denied for relation rest_framework_api_key_apikey
2020-10-14 03:00:15.030 UTC [16489] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_framework_api_key_apikey"."hashed_key" AS "hashed_key", "public"."rest_framework_api_key_apikey"."name" AS "name", "public"."rest_framework_api_key_apikey"."created" AS "created", "pu
blic"."rest_framework_api_key_apikey"."expiry_date" AS "expiry_date", "public"."rest_framework_api_key_apikey"."revoked" AS "revoked", "public"."rest_framework_api_key_apikey"."prefix" AS "prefix" FROM "public"."r
est_framework_api_key_apikey" LIMIT 10000
2020-10-14 03:00:15.131 UTC [16489] analysis@merged-watchdog ERROR:  permission denied for relation rest_hooks_hook
2020-10-14 03:00:15.131 UTC [16489] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_hooks_hook"."target" AS "target", "public"."rest_hooks_hook"."event" AS "event", "public"."rest_hooks_hook"."created" AS "created", "public"."rest_hooks_hook"."updated" AS "updated", 
"public"."rest_hooks_hook"."user_id" AS "user_id" FROM "public"."rest_hooks_hook" LIMIT 10000
2020-10-14 03:39:50.685 UTC [18135] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 04:00:12.359 UTC [6674] analysis@merged-watchdog ERROR:  permission denied for relation rest_framework_api_key_apikey
2020-10-14 04:00:12.359 UTC [6674] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_framework_api_key_apikey"."hashed_key" AS "hashed_key", "public"."rest_framework_api_key_apikey"."name" AS "name", "public"."rest_framework_api_key_apikey"."created" AS "created", "pu
blic"."rest_framework_api_key_apikey"."expiry_date" AS "expiry_date", "public"."rest_framework_api_key_apikey"."revoked" AS "revoked", "public"."rest_framework_api_key_apikey"."prefix" AS "prefix" FROM "public"."r
est_framework_api_key_apikey" LIMIT 10000
2020-10-14 04:00:12.437 UTC [6674] analysis@merged-watchdog ERROR:  permission denied for relation rest_hooks_hook
2020-10-14 04:00:12.437 UTC [6674] analysis@merged-watchdog STATEMENT:  -- Metabase
        SELECT "public"."rest_hooks_hook"."target" AS "target", "public"."rest_hooks_hook"."event" AS "event", "public"."rest_hooks_hook"."created" AS "created", "public"."rest_hooks_hook"."updated" AS "updated", 
"public"."rest_hooks_hook"."user_id" AS "user_id" FROM "public"."rest_hooks_hook" LIMIT 10000
2020-10-14 04:06:28.072 UTC [6674] analysis@merged-watchdog LOG:  could not receive data from client: Connection reset by peer
2020-10-14 04:24:09.483 UTC [9994] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.501 UTC [9993] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.501 UTC [9992] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.577 UTC [9997] analysis@merged-watchdog FATAL:  sorry, too many clients already
2020-10-14 04:24:09.581 UTC [9996] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.583 UTC [9995] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.596 UTC [9998] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:09.608 UTC [9999] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.641 UTC [10007] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.650 UTC [10005] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.656 UTC [10006] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.675 UTC [10004] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.689 UTC [10003] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.691 UTC [10002] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.799 UTC [10009] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:10.801 UTC [10008] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.707 UTC [10012] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.711 UTC [10011] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.718 UTC [10010] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.750 UTC [10013] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.761 UTC [10015] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.767 UTC [10014] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.855 UTC [10017] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:11.858 UTC [10016] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.764 UTC [10018] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.772 UTC [10020] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.775 UTC [10019] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.801 UTC [10022] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.811 UTC [10023] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.816 UTC [10024] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.917 UTC [10025] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:12.919 UTC [10026] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.816 UTC [10027] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.825 UTC [10029] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.826 UTC [10028] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.869 UTC [10031] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.872 UTC [10032] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.878 UTC [10030] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.979 UTC [10033] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:13.986 UTC [10034] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.889 UTC [10035] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.893 UTC [10036] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.894 UTC [10037] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.927 UTC [10040] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.927 UTC [10038] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:14.928 UTC [10039] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:15.040 UTC [10043] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:15.042 UTC [10041] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections
2020-10-14 04:24:15.946 UTC [10047] analysis@merged-watchdog FATAL:  remaining connection slots are reserved for non-replication superuser connections

The Metabase logs only seem to be available for the last hour or so, so I will have to wait until Metabase eats up all my connections again. Is there another way, @flamber?

@flamber
Contributor

flamber commented Oct 15, 2020

@Mapiarz You should use the support email when using the Enterprise Edition. Also, the latest release is 1.36.7.
I don't know which database you're showing logs from, but I'm guessing it's Postgres. Are you using pgbouncer or something similar?
How many connections does the database allow? How many connections is Metabase configured to use?
I don't know how you're hosting Metabase; it looks like EC2, but I don't know whether you're using the JAR or Docker.
The Metabase logs should indicate whether connections are open, and whether they are queuing.

@Mapiarz

Mapiarz commented Oct 15, 2020

@flamber We are not self-hosting Metabase; we are using the Metabase Cloud Starter plan. As for the other questions:

  • Yes, sorry, I'm talking about Postgres
  • No pgbouncer or equivalent
  • Postgres configured for 100 connections
  • No idea how the Cloud hosted Metabase is configured

@flamber
Contributor

flamber commented Oct 15, 2020

@Mapiarz Okay, very helpful information. Are you using an SSH tunnel?

Still, the log will show details about threads, connections, queues, etc., like:
2020-10-15T15:58:19+02:00 DEBUG metabase.middleware.log POST /api/card/1471/query 202 [ASYNC: completed] 537.3 ms (26 DB calls) App DB connections: 1/13 Jetty threads: 2/50 (8 idle, 0 queued) (151 total active threads) Queries in flight: 1 (0 queued)

It looks like it's running out of connections hourly, which matches the default sync process, and that makes me think the previous sync somehow doesn't complete.

But it should only consume 15 connections max, unless you have configured multiple databases in Admin > Databases, since each database configured there can use up to 15 connections simultaneously.
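A quick, hedged way to watch those counters over time, assuming the logs are written to a file such as metabase.log (with Docker you would pipe docker logs instead):

```
# Pull the pool/thread counters out of the Metabase log lines
grep "App DB connections" metabase.log | tail -n 20
```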

@Mapiarz

Mapiarz commented Oct 15, 2020

@flamber Some answers:

  • Not using an SSH tunnel
  • Just 1 Postgres DB configured
  • The default sync process does not seem to make Postgres run out of available connection slots: if we are not actively using Metabase, the number of idle connections does not grow
  • We run out of available connection slots after my co-worker does some analysis in Metabase.

I've just opened Metabase and opened one of our dashboards with a bunch of custom questions, and the number of idle connections to Postgres grew to 33. The more questions/dashboards I open, the more connections open up. It seems that whenever I open something new the number of connections grows, but when I navigate away (or even close the Metabase tab) the connections are not closed. Surely that's not the intended behaviour? The good news is that I now know how to reproduce the behaviour, so I'm sure we'll get to the bottom of this.

Here's a log I just grabbed from the Metabase admin panel: https://dpaste.org/Dc4X/slim (it's long so I've pasted it elsewhere).

@paoliniluis
Contributor

IMPORTANT: if anyone hits this issue and is running on RDS AND has Performance Insights enabled, please provide us with all the info that Performance Insights provides, so we can identify the queries that are being run when connections get out of control. Thanks!

@dpsutton
Contributor

So I think there might be two different issues in this thread. To try to reintroduce some clarity, it would be helpful if people distinguished between Metabase's own app db and the other databases used as data sources for Metabase. The two issues I think I'm seeing:

  1. Metabase creates too many connections to its own app db
  2. Metabase locks up when it has lots of idle open connections to a source db (also, pivot tables might play a role in this)

I think I have solved problem number 2. Number 1 remains under active investigation. Here's the issue with number 2:
Metabase performs multiple queries to enable pivot tables. It first runs the "main" query, and then runs subsequent queries to get the pivot information. The subsequent queries are run while the first connection is still open but idle; it is closed only after all subqueries have run. This can create a deadlock scenario.

A simplified version, where the connection pool to the source db has a size of 2:

  • open a dashboard with two pivot tables on it
  • each question runs its original query, saturating the connection pool
  • each question then attempts to acquire another connection to the source db to run its subqueries
  • these connection attempts are queued, waiting on the two connections in the pool, but the two checked-out connections are not closed until the entire pivot results are returned
  • any subsequent questions against the source db are queued after these, so no more queries can go to the source db
  • sometimes the app becomes unresponsive (I'm not sure of the exact mechanism of this, and hope it no longer occurs once the deadlock on the source db connections is resolved)

dpsutton added a commit that referenced this issue May 26, 2022
Addresses part of #8679

Pivot tables can have subqueries that run to create tallies. We do not
hold the entirety of resultsets in memory so we have a bit of an
inversion of control flow: connections are opened, queries run, and
result sets are transduced and then the connection closed.

The error here was that subsequent queries for the pivot were run while
the first connection is still held open. But the connection is no longer
needed. But enough pivots running at the same time in a dashboard can
create a deadlock where the subqueries need a new connection, but the
main queries cannot be released until the subqueries have completed.

Also, rf management is critical. Its completion arity must be called
once and only once. We also have middleware that need to be
composed (format, etc.) and others that can only be composed
once (limit). We have to save the original reducing function before
composition (this is the one that can write to the download writer, etc.)
but compose it each time we use it with `(rff metadata)` so we have the
format and other middleware. Keeping this distinction in mind will save
you lots of time. (The limit query will ignore all subsequent rows if
you just grab the output of `(rff metadata)` and not the rf returned
from the `:rff` key on the context.)

But this takes the following connection management:

```
tap> "OPENING CONNECTION 0"
tap> "already open: "
  tap> "OPENING CONNECTION 1"
  tap> "already open: 0"
  tap> "CLOSING CONNECTION 1"
  tap> "OPENING CONNECTION 2"
  tap> "already open: 0"
  tap> "CLOSING CONNECTION 2"
  tap> "OPENING CONNECTION 3"
  tap> "already open: 0"
  tap> "CLOSING CONNECTION 3"
tap> "CLOSING CONNECTION 0"
```

and properly sequences it so that connection 0 is closed before opening
connection 1.

It hijacks the executef to just pass that function into the reducef part
so we can reduce multiple times and therefore control the
connections. Otherwise the reducef happens "inside" of the executef at
which point the connection is closed.

Care is taken to ensure that:
- the init is only called once (subsequent queries have the init of the
rf overridden to just return `init` (the acc passed in) rather than
`(rf)`)
- the completion arity is only called once (use of `(completing rf)`, and
the reducing function in the subsequent queries is just `([acc] acc)`
and does not call `(rf acc)`). Remember this is just on the lower
reducing function; all of the takes, formats, etc. _above_ it will
have the completion arity called because we are using transduce. The
completion arity is what takes the volatile rows and row counts and
actually nests them in the `{:data {:rows []}}` structure. Without
calling that once (and ONLY once) you end up with no actual
results; they are just in memory.
@rlotun rlotun modified the milestone: 0.43.2 May 27, 2022
diogormendes pushed a commit that referenced this issue May 31, 2022
dpsutton added a commit that referenced this issue May 31, 2022
diogormendes pushed a commit that referenced this issue May 31, 2022
@cavaestoca

cavaestoca commented Oct 20, 2022

Also experiencing this issue. Just to add a few things I've noticed in our environment:

  • The problem itself (apparently) is not related to query times, but to accessing the application (maybe related to connections to the Metabase DB)
  • After restarting the application, access to the app is really fast, no bottlenecks 👍
  • Running show full processlist; on the Metabase DB after a restart shows just a few "sleep" Metabase processes

image

  • Soon after, the number of "sleep" Metabase connections starts to grow until it reaches the limit set in MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE

image

  • Once the limit is reached, every user who was not using the app for a while (not sure if it's a session timeout) has to wait around 60 seconds before the application loads and they can continue (test below in an incognito window)

This wait time is the biggest issue we have ⚠️

image

  • After reaching the limit, the connections never drop; MySQL just stays constantly at max connections in sleep (left running for a couple of weeks and the connections never dropped; in the example below I was using a 50-connection limit. Since then we've changed it to 100 to see if there was any change. No improvement happened)

image

  • Another interesting thing is that every "sleep" Metabase connection restarts exactly every 60 seconds as well
  • I'm also not sure whether NGINX is contributing to the issue

Environment details:

  • Metabase version 0.44.4
  • Using Docker on AWS EBS
  • Around 10-15 users working concurrently every working hour
  • Metabase DB: MySQL 8
  • Connected DBs: Redshift, Snowflake, MySQL (Metabase), Postgres
  • env variables that we've changed:
MAX_SESSION_AGE → 7200
MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE → 100
MB_JDBC_DATA_WAREHOUSE_MAX_CONNECTION_POOL_SIZE → 75
MB_SESSION_COOKIES → true
MB_SESSION_TIMEOUT → 8 hours

Connection pool details after restart, no wait for users:

{"connection-pools":{"metabase-mysql-app-db":{"numConnections":7,"numIdleConnections":7,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":100}}}

Connection pool details after a few minutes, when users have to wait 60 seconds:

{"connection-pools":{"metabase-mysql-app-db":{"numConnections":100,"numIdleConnections":100,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":100},"db-9-snowflake-ACCURATE":{"numConnections":1,"numIdleConnections":1,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":75},"db-2-redshift-dev":{"numConnections":50,"numIdleConnections":50,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":75},"db-8-snowflake-ANALYTICS":{"numConnections":1,"numIdleConnections":1,"numBusyConnections":0,"minPoolSize":1,"maxPoolSize":75}}}

Diagnostics info:

{
  "browser-info": {
    "language": "pt",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.16.1+1",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.16.1",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.16.1+1",
    "os.name": "Linux",
    "os.version": "4.14.232-176.381.amzn2.x86_64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "redshift",
      "postgres",
      "snowflake",
      "mysql"
    ],
    "hosting-env": "unknown",
    "application-database": "mysql",
    "application-database-details": {
      "database": {
        "name": "MySQL",
        "version": "8.0.28"
      },
      "jdbc-driver": {
        "name": "MariaDB Connector/J",
        "version": "2.7.6"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2022-09-29",
      "tag": "v0.44.4",
      "branch": "release-x.44.x",
      "hash": "382d728"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

@monoprosito

@flamber @paoliniluis Is there any update on this issue? 🤔

@paoliniluis
Contributor

@monoprosito What are you seeing? Which version are you running?

@lexrost

lexrost commented Aug 16, 2023

I have exactly this problem, and the bug arises consistently.
I also have a probable solution.

I guess that a transaction is left open for statements that return too much data. My guess is that when Metabase is not able to download the entire query result, it leaves the transaction open until the user pages through the entire result set. I don't know the exact amount in bytes or rows that triggers it.

I added a limit to the query that left a transaction idle, and the bug went away. It does not replicate anymore.
It might be somehow connected to Metabase's limit magic number.

@ziouf

ziouf commented Oct 11, 2023

Still an issue for me

{
  "browser-info": {
    "language": "fr-FR",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.20.1+1",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.20.1",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.20.1+1",
    "os.name": "Linux",
    "os.version": "5.4.0-148-generic",
    "user.language": "en",
    "user.timezone": "UTC"
  },
  "metabase-info": {
    "databases": [
      "postgres"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "application-database-details": {
      "database": {
        "name": "PostgreSQL",
        "version": "14.8 (Debian 14.8-1.pgdg110+1)"
      },
      "jdbc-driver": {
        "name": "PostgreSQL JDBC Driver",
        "version": "42.5.4"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2023-10-03",
      "tag": "v0.47.3",
      "branch": "?",
      "hash": "4202328"
    },
    "settings": {
      "report-timezone": "Europe/Paris"
    }
  }
}

@paoliniluis
Contributor

We’re launching a feature in 47.4 that will allow killing dangling connections.

@paoliniluis
Contributor

NEW: you can now define MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS as an environment variable. We suggest you set this env var to 20 minutes (1200 seconds).
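A minimal sketch of setting it for a Docker deployment (the same variable can simply be exported before launching the JAR):

```
docker run -d -p 3000:3000 \
  -e MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS=1200 \
  --name metabase metabase/metabase
```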

@dpsutton dpsutton removed their assignment Jun 3, 2024
@ranquild ranquild added the .Team/QueryProcessor(deprecated) Use .Team/Querying instead label Jun 20, 2024
@camsaul
Member

camsaul commented Jun 24, 2024

The default value is already 20 minutes.

@camsaul
Member

camsaul commented Jun 24, 2024

Recategorizing this as P2 since 47.4+ has MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS and we haven't heard more about people running into this problem since that change happened

@camsaul camsaul added Priority:P2 Average run of the mill bug and removed Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness labels Jun 24, 2024
@NevRA NevRA added .Team/Querying and removed .Team/QueryProcessor(deprecated) Use .Team/Querying instead labels Aug 16, 2024
@paoliniluis
Contributor

There's a debugUnreturnedConnectionStackTraces option in the c3p0 library that might help us out here: https://www.mchange.com/projects/c3p0/#unreturnedConnectionTimeout
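For anyone who wants to experiment, a hedged sketch of those c3p0 knobs passed as JVM system properties, assuming the pools honor c3p0 VM-level configuration (values are illustrative):

```
java \
  -Dc3p0.unreturnedConnectionTimeout=1200 \
  -Dc3p0.debugUnreturnedConnectionStackTraces=true \
  -jar metabase.jar
```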
