Metabase is leaving open transactions with locks on our Redshift database #11441
Comments
Hi @seanami Do you see any errors in the Metabase log or the frontend when this happens?
I'm not sure. As far as I know, I haven't been the one to run one of these queries that results in open locks on the Redshift server. Let me see if I can investigate the Metabase logs from around that time and see if there are any errors...
The time around that oldest open transaction is no longer available in the Metabase logs within the Troubleshooting section of the Metabase UI, but there is a more recent open transaction where that time range is available in the Metabase logs. Looking at the logs from around that time, I found these errors:
And then shortly after that:
And a bit later:
And a bit later:
I omitted the queries here, but unfortunately the query text in these Metabase log lines doesn't match the query text on the Redshift server for the pid with the stuck transaction around the same time, so I'm not sure that any of these logs are related... 😕
@seanami
@flamber The above logs are from right around the time of one such "locked up" query from Metabase. Were those not sufficient to test your hypothesis? What additional logs do you need?
@seanami It would be nice to know Diagnostic Info.
Here's the full diagnostic info:
As for the 6 points, are you referring to the list beginning with "What is the time zone of the data you think is being displayed improperly?"? I'm not having issues with timezones or wrong numbers, so this doesn't seem relevant / I don't know how to answer these...
@seanami Okay, I was just seeing the timezone error and "divide by zero" and expected that the issue was coming from a bad timezone configuration. A bad timezone can cause really strange problems, like what happened in the forum last week.
OK, thank you very much for your help! My instinct says this isn't related to timezones (although I could be wrong), so I'll wait to see if anyone else has other ideas. In the meantime, I'll experiment with some Redshift settings to see if I can use a timeout to prevent these transactions/locks from sticking around indefinitely.
I looked more into the timezone issue today, and it seems that Redshift is Postgres-compatible and supports the SET TIME ZONE command, so I don't think a timezone issue should be affecting that database. The error message likely came from another database, perhaps our Google Analytics "DB"...
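For reference, a minimal sketch of the Postgres-style command mentioned above ('UTC' here is just an illustrative value):

```sql
-- Set the session time zone; Redshift accepts the Postgres-compatible form.
SET TIME ZONE 'UTC';
```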
@seanami The error should be from Redshift. Notice the
Oh right 🤦♂ I completely forgot about the
We're experiencing the same thing with Redshift. A few times a week, everything on our Redshift cluster locks up and I have to go through and manually terminate a number of Metabase pids that are locking various tables. It's not reproducible at the moment: Metabase will work fine with no Redshift locks for a day or two before something happens.
We are also experiencing this problem with Redshift since v0.33.6 (including the newest version, 0.34.0). For us this is temporarily fixed by using v0.33.4; does this version use a different Redshift driver?
We're having the same issue after upgrading from v0.32.10 to v0.34.0. Every few days our cluster locks up and many tables cannot be read from; queries simply hang indefinitely. Killing queries has not helped; the only thing that's worked is rebooting the cluster as a whole, which we'd love to not have to do.
We've tried reverting to Metabase 0.33.4 as @n0rritt suggested, and so far that appears to have addressed the issue for us as well. We typically would have stuck locks every 1-3 days, and we haven't seen one in the past 3 days. Seems like there may have been an update to the underlying Redshift driver after that version which introduced an issue?
Same here; it took me a while to find the culprit (running on 0.34.1), since some of the tables were locked for days, and I had to automate a lock-removal script. Our issue is with Airflow, when it runs a job whose query requires a lock on the same tables. In case you want to remove these locks without restarting your Redshift cluster, @c0bra:
I remove any PIDs that have exceeded 30 minutes of execution time. It's a stopgap measure, but it works fine for now.
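The script itself isn't shown above; here is a minimal sketch of the idea, assuming the 30-minute threshold just mentioned (the threshold and the choice to target all long transactions, not just Metabase's, are illustrative):

```sql
-- Sketch: generate PG_TERMINATE_BACKEND calls for sessions whose oldest
-- open transaction started more than 30 minutes ago.
SELECT DISTINCT 'SELECT PG_TERMINATE_BACKEND(' || t.pid::varchar || ');' AS kill_statement
FROM svv_transactions t
WHERE t.txn_start < DATEADD(minute, -30, GETDATE());
```

Running the generated statements terminates the offending sessions, which releases their locks.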
Raising this to a P1 since it appears to be affecting quite a few folks who are upgrading to 0.34.
We did upgrade the Redshift driver in Metabase v0.33.5.1. I haven't yet been able to come up with a way to reproduce this locally, even by making a query throw a divide-by-zero error.
I would guess this is an upstream issue with the Redshift JDBC driver; we've run into several resource-leak bugs with it in the past. E.g.
So that might be the issue. I'll look at a way to repro this locally, and then see whether 1.2.37 fixes it or, if not, roll back to 1.2.32.
I'm not able to reproduce this. I've launched 1,000 or so Metabase queries in parallel along with a full-DB sync process, and queried the open transactions every few seconds while those were running. According to the Redshift docs:
So I used the following SQL:

```sql
SELECT t.txn_owner AS owner, t.txn_db AS db, t.lock_mode AS mode, t.granted, t.lockable_object_type AS lock_type,
       t.txn_start, q.starttime AS query_start, q.endtime AS query_end, q.aborted, tbl."table", substring(q.querytxt, 1, 40) AS query,
       t.pid AS pid
FROM svv_transactions t
LEFT JOIN stl_query q
       ON t.pid = q.pid
LEFT JOIN svv_table_info tbl
       ON t.relation = tbl.table_id
WHERE TRUE
  AND q.starttime > (current_timestamp - interval '2 minutes')
  AND q.querytxt IS NOT NULL
  AND q.querytxt LIKE '-- Metabase%'
  AND t.lock_mode LIKE '%ExclusiveLock'
  AND t.lockable_object_type = 'relation'
  AND t.granted = true
ORDER BY t.txn_start DESC;
```

I get no results from running that query, even after removing the … filter. Am I missing something here?
The TIMEZONE error sounds like a separate/unrelated issue.
We can't quite reproduce it either. It seems to get locked up after a few days of use. One other note is that we are embedding Metabase reports in iframes according to https://www.metabase.com/embedding/. When we experience the issue, we don't see any actively running Metabase queries (via …).
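For reference, a sketch of one common way to list actively running queries on Redshift, using the STV_RECENTS system table (this may or may not match the check the commenter used):

```sql
-- Sketch: show queries currently running on the cluster.
SELECT pid, user_name, starttime, duration, TRIM(query) AS query_text
FROM stv_recents
WHERE status = 'Running'
ORDER BY starttime;
```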
@gnilrets Are they some sort of …? For regular Metabase queries, at least, it already looks like they're running with …, which shouldn't require any exclusive locks. I'll have to check the MB sync queries.
@camsaul It just happened. If I run …, here are the results (I've reduced some of the output to focus just on the tables locked by Metabase and the queries waiting for Metabase to release the locks). My query was run by the … user.
This will be fixed by #11832 |
Experiencing this same issue. Restarting the app server via the EB console is what does the trick for me.
Fixed by #11832. Implemented custom low-level JDBC logic that aggressively frees resources and avoids creating unnecessary transactions in the first place.
I could not find it explicitly in the code, but is the Metabase Redshift connection configured with autocommit disabled? We were facing the same issue: we're currently running v0.34.3, and last week we hit the lock contention problem when our ETL was running concurrently with many Metabase queries. I see that the issue has been resolved in the 0.35 release; I will upgrade today and report here if we still see the problems, but I hope the fix does resolve them 🤞.

We contacted Amazon support to help us debug the issue, and for the record I wanted to share what we learned; it may be useful if someone else is debugging the problem.

Message from Amazon support:

When a statement is executed from a client with autocommit disabled, a "silent" BEGIN statement is automatically (implicitly) executed to initiate the transaction, but the transaction is not automatically committed. Keep in mind that even SELECT queries are executed within a transaction and will hold an ACCESS SHARE lock on a table, which conflicts with ACCESS EXCLUSIVE locks (acquired by statements such as ALTER TABLE). For more information on the conflicting lock modes, see the documentation.

In your case, we can see this behavior for transaction (XID) 81653141, whereby the transaction was still holding AccessShareLocks on tables even though the statement inside it completed (confirmed with the endtime). What seems to have happened here is that the user (with userid = 100) connected to the cluster with autocommit disabled and executed a SELECT query. The SELECT completed, but the session remained open. Later on, any DDL commands on the same table will end up blocked until that session is closed. You can run the following query to see all the statements (including BEGIN/END utilities) executed within that transaction:
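The exact query isn't shown above; a sketch using Redshift's SVL_STATEMENTTEXT system view and the XID cited by support would be:

```sql
-- Sketch: list every statement (including the implicit BEGIN and any
-- COMMIT/END/ROLLBACK) executed within transaction 81653141, in order.
SELECT xid, pid, starttime, sequence, type, TRIM(text) AS statement_text
FROM svl_statementtext
WHERE xid = 81653141
ORDER BY starttime, sequence;
```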
The above should show the BEGIN statement that was implicitly opened and the SELECT query, but no COMMIT/END/ROLLBACK. Oftentimes, when users connect to a database with the intention of only running SELECT queries, they do not realize that there could be any impact.

Scenario: User A connects to the database from SQL Workbench/J (with autocommit disabled) and runs a SELECT query.
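The original query text isn't shown; any plain SELECT stands in for it here (table and column names below are purely illustrative):

```sql
-- Hypothetical example of user A's query. Any SELECT acquires an
-- AccessShareLock on the tables it reads, held until the transaction ends.
SELECT COUNT(*)
FROM sales
WHERE sale_date >= '2020-01-01';
```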
The query completes in a few seconds, but user A does not close their session. An ETL user connects a couple of minutes later, and runs the following DDL statement:
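The actual DDL isn't shown either; a typical example that would need an AccessExclusiveLock on the same table (names again illustrative):

```sql
-- Hypothetical ETL DDL. ALTER TABLE requires an AccessExclusiveLock, which
-- conflicts with the AccessShareLock still held by user A's open transaction.
ALTER TABLE sales ADD COLUMN discount NUMERIC(8, 2);
```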
The ALTER TABLE statement will be blocked in the above scenario. We execute the following query to find the conflict:
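The support engineer's query isn't shown; a sketch that pairs waiting lock requests with the granted locks blocking them, using SVV_TRANSACTIONS, would be:

```sql
-- Sketch: find ungranted lock requests and the sessions holding
-- conflicting granted locks on the same relation.
SELECT w.pid AS waiting_pid, w.txn_owner AS waiting_user, w.lock_mode AS waiting_mode,
       h.pid AS holding_pid, h.txn_owner AS holding_user, h.lock_mode AS holding_mode,
       h.relation, h.txn_start AS holding_txn_start
FROM svv_transactions w
JOIN svv_transactions h
  ON w.relation = h.relation
 AND h.granted = true
WHERE w.granted = false;
```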
We then check the statements executed as part of that XID and verify that the SELECT completed.
So now, how to solve this issue and prevent it from occurring again:
@botchniaque I think you're looking for this, which doesn't exist on 0.35+: |
Thanks @flamber. So do I understand correctly that 0.35+ does not disable autocommit?
@botchniaque From what I remember, no. It caused lock-ups of big installations. But don't quote me ;-)
@flamber ok. I'll just upgrade to v0.35 - I'll be reaching out again if the problem persists. Thanks
@flamber I'm on v0.35.2 and the problem is still happening.
All locks disappear some time after I log out of Metabase.
@pedroperezTB The latest release is 0.35.4. Please post "Diagnostic Info" and any logs that can help figure out what is happening. Locks on a data source should not have anything to do with whether you're logged into Metabase, since that's handled by the application database.
Describe the bug
We're seeing that queries from Metabase result in locks being acquired and held after the query is complete, which then leads to deadlocks from other ETL processes that try to write to the database.
Logs
Here's an example of the most recent deadlock situation we encountered that was kicked off by a Metabase query.
Looking at the open transactions on the Redshift server, we see that the oldest transaction is from pid 30037, which eventually leads to a deadlock for a different query (pid 19803) trying to get an exclusive lock to update one of those tables.
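For context, a sketch of one way to inspect open transactions on Redshift, using the SVV_TRANSACTIONS system view (illustrative; it may differ from the query used in the original report):

```sql
-- Sketch: show open transactions and the locks they hold, oldest first.
SELECT xid, pid, txn_owner, txn_start, lock_mode, lockable_object_type, relation, granted
FROM svv_transactions
ORDER BY txn_start;
```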
Looking at the logged info about the query for that pid, we see that it was a Metabase query that took just over a minute to run and was not aborted.
I can't figure out how Metabase would be running a query that results in a perpetually open transaction holding locks. Any ideas? Has anybody experienced a similar issue using Metabase with Redshift and other processes that occasionally change/update the underlying tables?
Expected behavior
My expectation would be that Metabase always releases any locks that it acquires in the process of running a query once that query is complete.
Information about your Metabase Installation:
Severity
We're experiencing deadlocks around once every day or every other day, which is preventing our company from relying on Metabase for data analysis. When the deadlocks occur, it starts to block most queries and dashboards that normal users at our company need.