Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metabase is leaving open transactions with locks on our Redshift database #11441

Closed
seanami opened this issue Dec 4, 2019 · 36 comments
Closed
Assignees
Milestone

Comments

@seanami
Copy link

seanami commented Dec 4, 2019

Describe the bug

We're seeing that queries from Metabase result in locks being acquired and held after the query is complete, which then leads to deadlocks from other ETL processes that try to write to the database.

Logs

Here's an example of the most recent deadlock situation we encountered that was kicked off by a Metabase query.

Looking at the open transactions on the Redshift server, we see that the oldest transaction is from pid 30037, which eventually leads to a deadlock for a different query (pid 19803) trying to get an exclusive lock to update one of those tables.

redshift=# select txn_owner,txn_db,pid,txn_start,lock_mode,lockable_object_type,svv_table_info.table,granted from svv_transa
ctions left join svv_table_info on svv_transactions.relation = svv_table_info.table_id where pid <> pg_backend_pid() order b
y txn_start asc;
 txn_owner |  txn_db  |  pid  |         txn_start          |       lock_mode       | lockable_object_type |          table          | granted 
-----------+----------+-------+----------------------------+-----------------------+----------------------+-------------------------+---------
 admin     | redshift | 30037 | 2019-12-03 22:06:24.773403 | AccessShareLock       | relation             | table1                 | t
 admin     | redshift | 30037 | 2019-12-03 22:06:24.773403 | AccessShareLock       | relation             | table2             | t
 admin     | redshift | 30037 | 2019-12-03 22:06:24.773403 | AccessShareLock       | relation             | table3                    | t
 admin     | redshift | 30037 | 2019-12-03 22:06:24.773403 | AccessShareLock       | relation             | table4                   | t
 admin     | redshift | 30037 | 2019-12-03 22:06:24.773403 | AccessShareLock       | relation             | table5                | t
 admin     | redshift | 30037 | 2019-12-03 22:06:24.773403 | ExclusiveLock         | transactionid        |                         | t
... (Other similar transactions for different PIDs opened by Metabase)
 stitch    | redshift | 19803 | 2019-12-04 01:28:28.919984 | AccessExclusiveLock   | relation             | table1                    | f

Looking at the logged info about the query for that pid, we see that it was a Metabase query that took just over a minute to run and was not aborted.

redshift=# select pid, starttime, endtime, aborted, substring (querytxt,1,40) as querystart from stl_query where pid = 30037;
  pid  |         starttime          |          endtime           | aborted |                querystart                
-------+----------------------------+----------------------------+---------+------------------------------------------
 30037 | 2019-12-03 22:06:24.771824 | 2019-12-03 22:07:29.523499 |       0 | -- Metabase:: userID: 3 queryType: nativ

I can't figure out how Metabase would be running a query that results in an perpetually open transaction holding locks. Any ideas? Has anybody experienced a similar issue using Metabase with Redshift and other processes that occasionally change/update the underlying tables?

Expected behavior

My expectation would be that Metabase always releases any locks that it acquires in the process of running a query once that query is complete.

Screenshots

If applicable, add screenshots to help explain your problem.

Information about your Metabase Installation:

  • Your databases: Amazon Redshift
  • Metabase version: 0.33.5.1
  • Metabase hosting environment: Docker container on AWS with Amazon's Metabase image
  • Metabase internal database: Postgres

Severity

We're experiencing deadlocks around once every day or every other day, which is preventing our company from relying on Metabase for data analysis. When the deadlocks occur, it starts to block most queries and dashboards that normal users at our company need.

@seanami seanami added .Needs Triage Type:Bug Product defects labels Dec 4, 2019
@flamber
Copy link
Contributor

flamber commented Dec 4, 2019

Hi @seanami Do you see any errors in the Metabase log or the frontend, when this happens?

@seanami
Copy link
Author

seanami commented Dec 4, 2019

I'm not sure. As far as I know, I haven't been the one to run one of these queries that results in open locks on the Redshift server.

Let me see if I can investigate the Metabase logs from around that time and see if there are any errors...

@seanami
Copy link
Author

seanami commented Dec 4, 2019

The time around that oldest open transaction is no longer available in the Metabase logs within the Troubleshooting section of the Metabase UI, but there is a more recent open transaction where that time range is available in the Metabase logs.

Looking at around the time of 2019-12-04 00:52:23, which I'm pretty sure is the same as 2019-12-03T16:52:23-08:00 in the Metabase logs...

I found these errors:

2019-12-03T16:52:25-08:00 ERROR metabase.driver.sql-jdbc.execute Failed to set timezone:
SQLException:
Message: [Amazon](500310) Invalid operation: Divide by zero;
SQLState: 57014
Error Code: 500310

And then shortly after that:

2019-12-03T16:52:26-08:00 WARN metabase.query-processor.middleware.process-userland-query Query failure {:status :failed,
:class java.lang.Exception,
:error "[Amazon](500310) Invalid operation: Divide by zero",
:stacktrace
("--> driver.sql_jdbc.execute$do_with_try_catch.invokeStatic(execute.clj:227)"
"driver.sql_jdbc.execute$do_with_try_catch.invoke(execute.clj:217)"
"driver.sql_jdbc.execute$execute_query.invokeStatic(execute.clj:291)"
"driver.sql_jdbc.execute$execute_query.invoke(execute.clj:285)"
"driver.sql_jdbc$fn__67318.invokeStatic(sql_jdbc.clj:44)"
"driver.sql_jdbc$fn__67318.invoke(sql_jdbc.clj:43)"
"query_processor$fn__44023$execute_query__44028$fn__44029.invoke(query_processor.clj:71)"
"query_processor$fn__44023$execute_query__44028.invoke(query_processor.clj:65)"
"query_processor.middleware.mbql_to_native$mbql__GT_native$fn__34284.invoke(mbql_to_native.clj:38)"
"query_processor.middleware.annotate$result_rows_maps__GT_vectors$fn__36550.invoke(annotate.clj:541)"
"query_processor.middleware.annotate$add_column_info$fn__36456.invoke(annotate.clj:485)"
"query_processor.middleware.cumulative_aggregations$handle_cumulative_aggregations$fn__37491.invoke(cumulative_aggregations.clj:57)"
"query_processor.middleware.resolve_joins$resolve_joins$fn__41131.invoke(resolve_joins.clj:184)"
"query_processor.middleware.limit$limit$fn__38092.invoke(limit.clj:19)"
"query_processor.middleware.results_metadata$record_and_return_metadata_BANG_$fn__43890.invoke(results_metadata.clj:87)"
"query_processor.middleware.format_rows$format_rows$fn__38080.invoke(format_rows.clj:26)"
"query_processor.middleware.add_dimension_projections$add_remapping$fn__35043.invoke(add_dimension_projections.clj:232)"
"query_processor.middleware.add_source_metadata$add_source_metadata_for_source_queries$fn__35684.invoke(add_source_metadata.clj:107)"
"query_processor.middleware.resolve_source_table$resolve_source_tables$fn__41181.invoke(resolve_source_table.clj:46)"
"query_processor.middleware.add_row_count_and_status$add_row_count_and_status$fn__35531.invoke(add_row_count_and_status.clj:16)"
"query_processor.middleware.driver_specific$process_query_in_context$fn__37567.invoke(driver_specific.clj:12)"
"query_processor.middleware.add_settings$add_settings$fn__35554.invoke(add_settings.clj:45)"
"query_processor.middleware.resolve_driver$resolve_driver$fn__40795.invoke(resolve_driver.clj:22)"
"query_processor.middleware.bind_effective_timezone$bind_effective_timezone$fn__36881$fn__36882.invoke(bind_effective_timezone.clj:9)"
"util.date$call_with_effective_timezone.invokeStatic(date.clj:88)"
"util.date$call_with_effective_timezone.invoke(date.clj:77)"
"query_processor.middleware.bind_effective_timezone$bind_effective_timezone$fn__36881.invoke(bind_effective_timezone.clj:8)"
"query_processor.middleware.store$initialize_store$fn__43915$fn__43916.invoke(store.clj:11)"
"query_processor.store$do_with_store.invokeStatic(store.clj:46)"
"query_processor.store$do_with_store.invoke(store.clj:40)"
"query_processor.middleware.store$initialize_store$fn__43915.invoke(store.clj:10)"
"query_processor.middleware.async$async__GT_sync$fn__34195.invoke(async.clj:23)"
"query_processor.middleware.async_wait$runnable$fn__36607.invoke(async_wait.clj:89)"),
:query
{:constraints {:max-results 10000, :max-results-bare-rows 2000},
:type :native,
:middleware {:userland-query? true},
:native
{:query
"...",
:template-tags {...},
:info
{:executed-by 1,
:context :question,
:card-id 264,
:dashboard-id nil,
:query-hash [-113, 5, -23, 50, 64, 49, 3, 116, 77, -30, 108, -2, -73, 53, 46, 60, -68, 24, 116, -43, 69, -112, 38, 11, -104, 125, 44, -54, -74, -86, 96, 90]},
:parameters [...],
:async? true,
:cache-ttl 42},
:cause
{:class java.sql.SQLException,
:error "[Amazon](500310) Invalid operation: Divide by zero;",
:cause {:class com.amazon.support.exceptions.ErrorException, :error "[Amazon](500310) Invalid operation: Divide by zero;"}}}

2019-12-03T16:52:26-08:00 INFO metabase.query-processor.middleware.cache Query took 2266 ms to run; miminum for cache eligibility is 15000 ms

And a bit later:

2019-12-03T16:52:31-08:00 ERROR metabase.driver.sql-jdbc.execute Failed to set timezone:
SQLException:
Message: [Amazon](500051) ERROR processing query/statement. Error: Execution of prepared query failed , Query: -- Metabase:: userID: 1 queryType: native queryHash: 267c601be94399816d4058d442e84e15633dcb4764f65cb3184560d23936d2fa
...
SQLState: HY000
Error Code: 500051

And a bit later:

2019-12-03T16:52:48-08:00 ERROR metabase.driver.sql-jdbc.execute Failed to set timezone:
SQLException:
Message: [Amazon](500051) ERROR processing query/statement. Error: Execution of prepared query failed , Query: -- Metabase:: userID: 1 queryType: native queryHash: 6df7b4f54d7da4642135d61faa10f0c52b82b84324f92471698d4831edec04a3
...
SQLState: HY000
Error Code: 500051

I omitted the queries here, but unfortunately the query text for any of these Metabase log lines doesn't match the query text on the Redshift server for the pid around the same time with the stuck transaction, so I'm not sure that any of these logs are related... 😕

@flamber
Copy link
Contributor

flamber commented Dec 4, 2019

@seanami
I don't think the timezone errors are related to the transactions left open. But I have a feeling that you might be seeing an existing issue, but without the logs, it's hard for me to know. So can you catch the log, when everything locks up and post that (redact where needed)?
And also post "Diagnostic Info" from Admin > Troubleshooting, at the same time.

@seanami
Copy link
Author

seanami commented Dec 4, 2019

@flamber The above logs are from right around the time of one such "locked up" query from Metabase. Were those not sufficient to test your hypothesis? What additional logs do you need?

@flamber
Copy link
Contributor

flamber commented Dec 4, 2019

@seanami It would be nice to know Diagnostic Info.
What's the answer to the 6 points in the troubleshooting guide?
https://www.metabase.com/docs/latest/troubleshooting-guide/timezones.html

@seanami
Copy link
Author

seanami commented Dec 4, 2019

Here's the full diagnostic info:

{
  "browser-info": {
    "language": "en-US",
    "platform": "MacIntel",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.3+7",
    "java.vendor": "AdoptOpenJDK",
    "java.vendor.url": "https://adoptopenjdk.net/",
    "java.version": "11.0.3",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.3+7",
    "os.name": "Linux",
    "os.version": "4.14.154-99.181.amzn1.x86_64",
    "user.language": "en",
    "user.timezone": "US/Pacific"
  },
  "metabase-info": {
    "databases": [
      "googleanalytics",
      "mysql",
      "redshift"
    ],
    "hosting-env": "unknown",
    "application-database": "postgres",
    "run-mode": "prod",
    "version": {
      "date": "2019-11-13",
      "tag": "v0.33.5.1",
      "branch": "release-0.33.x",
      "hash": "26159d2"
    },
    "settings": {
      "report-timezone": "US/Pacific"
    }
  }
}

As for the 6 points, are you referring to the list beginning with "What is the time zone of the data you think is being displayed improperly?"? I'm not having issues with timezones or wrong numbers, so this doesn't seem relevant / I don't know how to answer these...

@flamber
Copy link
Contributor

flamber commented Dec 4, 2019

@seanami Okay, I was just seeing the timezone error and "divide by zero" and expected that the issue is coming from a bad timezone configuration. Bad timezone can cause really strange problems, lik what happened in the forum last week.
Let's see what the developers with more knowledge about Redshift says. Might be related to #8679

@seanami
Copy link
Author

seanami commented Dec 4, 2019

OK, thank you very much for your help! My instinct says this isn't related to timezones (although I could be wrong), so I'll wait to hear if see if anyone else has other ideas.

In the meantime, I'll experiment with some Redshift settings to see if I can use a timeout to prevent these transactions/locks from sticking around indefinitely.

@seanami
Copy link
Author

seanami commented Dec 6, 2019

I looked more into the timezone issue today, and it seems that Redshift is Postgres-compatible and supports the SET TIME ZONE command, so I don't think a timezone issue should be affecting that database. The error message likely came from another database, perhaps our Google Analytics "DB"...

@flamber
Copy link
Contributor

flamber commented Dec 6, 2019

@seanami The error should be from the Redshift. Notice the [Amazon](500310) Invalid operation. But someone on the forum reported a Snowflake query being sent to Redshift, which would point towards the same for you, if for some obscure reason a GA query is being run on Redshift.
But I think you're right, the hanging connection is probably not related to the timezone error.

@seanami
Copy link
Author

seanami commented Dec 6, 2019

Oh right 🤦‍♂ I completely forgot about the [Amazon] tag there. I can't find any of these in the current available logs, but will check again later. I agree with you that I don't think the locks issue is related though.

@gnilrets
Copy link

We're experiencing the same thing with Redshift. A few times a week everything on our Redshift cluster locks up and I have to go through and manually terminate a number of metabase pids that are locking various tables. It's not reproducible at the moment. Metabase will work fine with no Redshift locks for a day or two before something happens.

@t0hai
Copy link

t0hai commented Jan 6, 2020

We are also experiencing this problem with Redshift since v0.33.6 (including the newest version 0.34.0). For us this is temporarily fixed by using v0.33.4; does this version use a different Redshift driver?

@c0bra
Copy link

c0bra commented Jan 9, 2020

We're having the same issue after upgrading from v0.32.10 to v0.34.0. Every few days our cluster locks up and many tables cannot be read from; queries simply hang indefinitely. Killing queries has not helped; the only thing that's worked is rebooting the cluster as a whole which we'd love to not have to do.

@seanami
Copy link
Author

seanami commented Jan 10, 2020

We've tried reverting to Metabase 0.33.4 as @n0rritt suggested, and so far that appears to have addressed the issue for us as well. We typically would have stuck locks every 1-3 days, and we haven't seen one in the past 3 days.

Seems like there may have been an update to the underlying Redshift drivers after that version number which introduced an issue?

@gvisniuc
Copy link

gvisniuc commented Jan 16, 2020

Same here, took me a while to find the culprit (running on 0.34.1), since some of the tables where locked for days and had to automate a lock removal script.

Our issue is with Airflow when it runs a job which query requires the AccessExclusiveLock as described here https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/.
In our case during the DROP operation of certain tables.

In case you want to remove these locks without restarting your Redshift cluster @c0bra
I am running this query

select a.txn_owner, a.txn_db, a.xid, a.pid, a.txn_start, datediff(m,a.txn_start,getdate()) as duration_minutes, a.lock_mode, a.relation as table_id,d.relname as tablename, a.granted,b.pid as blocking_pid ,datediff(s,a.txn_start,getdate())/86400||' days '||datediff(s,a.txn_start,getdate())%86400/3600||' hrs '||datediff(s,a.txn_start,getdate())%3600/60||' mins '||datediff(s,a.txn_start,getdate())%60||' secs' as txn_duration
from svv_transactions a
left join (select pid,relation,granted from pg_locks group by 1,2,3) b
on a.relation=b.relation and a.granted='f' and b.granted='t'
left join pg_class d on a.relation=d.oid
where  a.relation is not null;

Where I remove any PIDs that have taken more than 30 minutes execution time. Stop gap measure but works fine for now.

@mazameli mazameli added Database/Redshift .Performance Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness and removed .Needs Triage labels Jan 16, 2020
@mazameli
Copy link
Contributor

Raising this to a P1 since it appears to be affected quite a few folks who are upgrading to 0.34.

@dacort
Copy link
Contributor

dacort commented Jan 20, 2020

We did upgrade the Redshift driver in Metabase v0.33.5.1 from 1.2.32.1056 to 1.2.36.1060. I didn't see anything in the release notes that caught my eye too much...

I haven't yet been able to come up with a way to reproduce this locally, even making a query throw a divide by zero error.

@camsaul
Copy link
Member

camsaul commented Jan 21, 2020

I would guess this is an upstream issue with the Redshift JDBC driver, we've ran into several resource-leak bugs with it in the past. E.g.

So that might be the issue. I'll look at a way to repro this locally and then see if 1.2.37 fixes it or if not roll back to 1.2.32.

@camsaul camsaul self-assigned this Jan 21, 2020
@camsaul camsaul added this to the 0.34.2 milestone Jan 21, 2020
@camsaul
Copy link
Member

camsaul commented Jan 22, 2020

I'm not able to produce this. I've launched 1000 or so Metabase queries in parallel and a full-DB sync process at the same time and queried the open transactions every few seconds while those were running. According to the Redshift dox:

If lock_mode is ExclusiveLock and granted is true (t), then this transaction ID is an open transaction

So I used the following SQL:

SELECT t.txn_owner AS owner, t.txn_db AS db, t.lock_mode AS mode, t.granted, t.lockable_object_type AS lock_type, 
       t.txn_start, q.starttime AS query_start, q.endtime AS query_end, q.aborted, tbl.table, substring(q.querytxt,1,40) AS query, 
       t.pid AS pid 
FROM svv_transactions t 
LEFT JOIN stl_query q
       ON t.pid = q.pid 
LEFT JOIN svv_table_info tbl
       ON t.relation = tbl.table_id 
WHERE TRUE 
  AND q.starttime > (current_timestamp - interval '2 minutes')
  AND q.querytxt IS NOT NULL 
  AND q.querytxt LIKE '-- Metabase%' 
  AND t.lock_mode LIKE '%ExclusiveLock' 
  AND t.lockable_object_type = 'relation'
  AND t.granted = true 
ORDER BY t.txn_start DESC;

I get no results from running that query. Removing the lock_mode or lockable_object_type WHERE clauses gives me a few rows, all of which are eitherExclusiveLocks on transactionids, or well as some AccessShareLocks on relations. Since I'm assuming each transactionid is unique, having ExclusiveLocks on them doesn't seem problematic.

Am I missing something here?

@camsaul camsaul removed this from the 0.34.2 milestone Jan 22, 2020
@camsaul camsaul added .Need More Info .Unable to Reproduce Priority:P2 Average run of the mill bug and removed Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness labels Jan 22, 2020
@camsaul camsaul removed their assignment Jan 22, 2020
@camsaul
Copy link
Member

camsaul commented Jan 22, 2020

The TIMEZONE error, e.g.

2019-12-03T16:52:25-08:00 ERROR metabase.driver.sql-jdbc.execute Failed to set timezone:
SQLException:
Message: [Amazon](500310) Invalid operation: Divide by zero;
SQLState: 57014
Error Code: 500310

sounds like a separate/unrelated issue

@gnilrets
Copy link

We can't quite reproduce it either. It seems to get locked up after a few days of use. One other note is that we are embedding metabase reports in iframes according to https://www.metabase.com/embedding/.

When we experience the issue, we don't see any actively running metabase queries (via select * from stv_recents where status = 'Running'). However, we do see locks on tables, and the pids holding the locks are owned by metabase.

@camsaul
Copy link
Member

camsaul commented Jan 22, 2020

@gnilrets Are they some sort of ExclusiveLocks? Could you join against stl_query and post querytxt for one of the queries (redacting identifiers as needed)? It would be useful for debugging whether it's a normal user-facing query or a background sync query.

For regular Metabase queries at least it already looks like they're running with

TRANSACTION ISOLATION: :transaction-read-uncommitted
TYPE:                  :type-forward-only
HOLDABILITY:           :close-cursors-at-commit
CONCURRENCY:           :concur-read-only

which shouldn't require any exclusive locks. I'll have to check the MB sync queries.

@gnilrets
Copy link

@camsaul . It just happened. If I run select * from stv_recents where status='Running', I get no queries that are being run by metabase. However, if I run the following query, I see that metabase is holding locks on some tables:

select
  svv_transactions.txn_owner,
  svv_transactions.txn_db,
  svv_transactions.xid,
  svv_transactions.pid,
  svv_transactions.txn_start,
  svv_transactions.lock_mode,
  svv_transactions.relation as table_id,
  pg_class.nspname as "schema",
  nvl(trim(stv_tbl_perm."name"),pg_class.relname) as tablename,
  svv_transactions.granted,pg_locks.pid as blocking_pid ,
  datediff(s,svv_transactions.txn_start,getdate())/86400||' days '||datediff(s,svv_transactions.txn_start,getdate())%86400/3600||' hrs '||datediff(s,svv_transactions.txn_start,getdate())%3600/60||' mins '||datediff(s,svv_transactions.txn_start,getdate())%60||' secs' as txn_duration,
  stl_query.querytxt
from
  svv_transactions
left join (
     select pid,relation,granted from pg_locks group by 1,2,3
) pg_locks
on
  svv_transactions.relation=pg_locks.relation
  and svv_transactions.granted='f'
  and pg_locks.granted='t'
left join (
     select * from stv_tbl_perm where slice=0
) stv_tbl_perm
on
  svv_transactions.relation=stv_tbl_perm.id
left join (
  select pg_class.oid, pg_class.relname, pg_namespace.nspname from pg_class inner join pg_namespace on pg_class.relnamespace = pg_namespace.oid
)
  pg_class
on
  svv_transactions.relation=pg_class.oid
left join
  stl_query
on
  stl_query.pid = svv_transactions.pid
  and stl_query.xid = svv_transactions.xid
where
  svv_transactions.relation is not null
order by
  svv_transactions.txn_start
;

And here's the results (I've reduced some of the output to focus just on the tables locked by metabase and the queries waiting for metabase to release the locks). My query (run by the de_jobs user) is waiting for locks on tables held by the metabase user, which have been locked for over 7 hours even though no queries are actively running.

txn_owner|txn_db|xid      |pid  |txn_start          |lock_mode          |table_id|schema     |tablename            |granted|blocking_pid|txn_duration               |querytxt
metabase |itkdw |151931017|9983 |2020-01-22 16:21:38|AccessShareLock    |1249    |pg_catalog |pg_attribute         |true   |            |0 days 7 hrs 37 mins 3 secs|"-- Metabase select coach_name as coach , count(*) as ""Cohort Students"" , sum(term_outgoing_communications) as ""Total Outgoing"" , sum(term_incoming_communications) as ""Total Incoming"" , case when sum(term_incoming_communications) = 0 then '' else cast(sum(term_outgoing_communications) * 1.0 / sum(term_incoming_communications) * 1.0 as decimal(3, 1)) || ' : 1' end as ""Outgoing to Incoming Ratio"" from wh_insights.rep_student_terms where termperiod = 'Current Term' and term_registered = 1 and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 order by coach"
metabase |itkdw |151931017|9983 |2020-01-22 16:21:38|AccessShareLock    |1249    |pg_catalog |pg_attribute         |true   |            |0 days 7 hrs 37 mins 3 secs|"-- Metabase select coach_name as coach , count(*) as ""Cohort Students"" , sum(term_outgoing_communications) as ""Total Outgoing"" , sum(term_incoming_communications) as ""Total Incoming"" , case when sum(term_incoming_communications) = 0 then '' else cast(sum(term_outgoing_communications) * 1.0 / sum(term_incoming_communications) * 1.0 as decimal(3, 1)) || ' : 1' end as ""Outgoing to Incoming Ratio"" from wh_insights.rep_student_terms where termperiod = 'Current Term' and term_registered = 1 and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 order by coach"
metabase |itkdw |151931017|9983 |2020-01-22 16:21:38|AccessShareLock    |1247    |pg_catalog |pg_type              |true   |            |0 days 7 hrs 37 mins 3 secs|"-- Metabase select coach_name as coach , count(*) as ""Cohort Students"" , sum(term_outgoing_communications) as ""Total Outgoing"" , sum(term_incoming_communications) as ""Total Incoming"" , case when sum(term_incoming_communications) = 0 then '' else cast(sum(term_outgoing_communications) * 1.0 / sum(term_incoming_communications) * 1.0 as decimal(3, 1)) || ' : 1' end as ""Outgoing to Incoming Ratio"" from wh_insights.rep_student_terms where termperiod = 'Current Term' and term_registered = 1 and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 order by coach"
metabase |itkdw |151931017|9983 |2020-01-22 16:21:38|AccessShareLock    |1247    |pg_catalog |pg_type              |true   |            |0 days 7 hrs 37 mins 3 secs|"-- Metabase select coach_name as coach , count(*) as ""Cohort Students"" , sum(term_outgoing_communications) as ""Total Outgoing"" , sum(term_incoming_communications) as ""Total Incoming"" , case when sum(term_incoming_communications) = 0 then '' else cast(sum(term_outgoing_communications) * 1.0 / sum(term_incoming_communications) * 1.0 as decimal(3, 1)) || ' : 1' end as ""Outgoing to Incoming Ratio"" from wh_insights.rep_student_terms where termperiod = 'Current Term' and term_registered = 1 and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 order by coach"
metabase |itkdw |151931017|9983 |2020-01-22 16:21:38|AccessShareLock    |39953854|wh_insights|rep_student_terms    |true   |            |0 days 7 hrs 37 mins 3 secs|"-- Metabase select coach_name as coach , count(*) as ""Cohort Students"" , sum(term_outgoing_communications) as ""Total Outgoing"" , sum(term_incoming_communications) as ""Total Incoming"" , case when sum(term_incoming_communications) = 0 then '' else cast(sum(term_outgoing_communications) * 1.0 / sum(term_incoming_communications) * 1.0 as decimal(3, 1)) || ' : 1' end as ""Outgoing to Incoming Ratio"" from wh_insights.rep_student_terms where termperiod = 'Current Term' and term_registered = 1 and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 order by coach"
metabase |itkdw |151931017|9983 |2020-01-22 16:21:38|AccessShareLock    |39953854|wh_insights|rep_student_terms    |true   |            |0 days 7 hrs 37 mins 3 secs|"-- Metabase select coach_name as coach , count(*) as ""Cohort Students"" , sum(term_outgoing_communications) as ""Total Outgoing"" , sum(term_incoming_communications) as ""Total Incoming"" , case when sum(term_incoming_communications) = 0 then '' else cast(sum(term_outgoing_communications) * 1.0 / sum(term_incoming_communications) * 1.0 as decimal(3, 1)) || ' : 1' end as ""Outgoing to Incoming Ratio"" from wh_insights.rep_student_terms where termperiod = 'Current Term' and term_registered = 1 and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 order by coach"
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |100076  |           |                     |true   |            |0 days 7 hrs 37 mins 3 secs|padb_fetch_sample: select count(*) from volt_tt_59cbceb42f325
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |100076  |           |                     |true   |            |0 days 7 hrs 37 mins 3 secs|CREATE TEMP TABLE volt_tt_59cbceb42f325(term_name) AS SELECT wh_insights.rep_student_terms.term_name FROM wh_insights.rep_student_terms WHERE wh_insights.rep_student_terms.cohort = $2 AND wh_insights.rep_student_terms.organization = $1 AND wh_insights.rep_student_terms.termperiod = 'Current Term' GROUP BY 1;
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |100076  |           |                     |true   |            |0 days 7 hrs 37 mins 3 secs|-- Metabase with base as ( select term_name from wh_insights.rep_student_terms where termperiod = 'Current Term' and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 ) , row_count as ( select count(base2.*) total_rows from base ) , combo as ( select base.* , row_count.* from base cross join row_count ) , error as ( select cast('Select one organization' as varchar) as error_msg ) select case when total_rows > 1 then error.error_msg else combo.term_name end as term_name from combo cross join error group by 1
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessExclusiveLock|39955753|           |volt_tt_59cbceb42f325|true   |            |0 days 7 hrs 37 mins 3 secs|padb_fetch_sample: select count(*) from volt_tt_59cbceb42f325
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessExclusiveLock|39955753|           |volt_tt_59cbceb42f325|true   |            |0 days 7 hrs 37 mins 3 secs|CREATE TEMP TABLE volt_tt_59cbceb42f325(term_name) AS SELECT wh_insights.rep_student_terms.term_name FROM wh_insights.rep_student_terms WHERE wh_insights.rep_student_terms.cohort = $2 AND wh_insights.rep_student_terms.organization = $1 AND wh_insights.rep_student_terms.termperiod = 'Current Term' GROUP BY 1;
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessExclusiveLock|39955753|           |volt_tt_59cbceb42f325|true   |            |0 days 7 hrs 37 mins 3 secs|-- Metabase with base as ( select term_name from wh_insights.rep_student_terms where termperiod = 'Current Term' and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 ) , row_count as ( select count(base2.*) total_rows from base ) , combo as ( select base.* , row_count.* from base cross join row_count ) , error as ( select cast('Select one organization' as varchar) as error_msg ) select case when total_rows > 1 then error.error_msg else combo.term_name end as term_name from combo cross join error group by 1
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |39955753|           |volt_tt_59cbceb42f325|true   |            |0 days 7 hrs 37 mins 3 secs|padb_fetch_sample: select count(*) from volt_tt_59cbceb42f325
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |39955753|           |volt_tt_59cbceb42f325|true   |            |0 days 7 hrs 37 mins 3 secs|CREATE TEMP TABLE volt_tt_59cbceb42f325(term_name) AS SELECT wh_insights.rep_student_terms.term_name FROM wh_insights.rep_student_terms WHERE wh_insights.rep_student_terms.cohort = $2 AND wh_insights.rep_student_terms.organization = $1 AND wh_insights.rep_student_terms.termperiod = 'Current Term' GROUP BY 1;
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |39955753|           |volt_tt_59cbceb42f325|true   |            |0 days 7 hrs 37 mins 3 secs|-- Metabase with base as ( select term_name from wh_insights.rep_student_terms where termperiod = 'Current Term' and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 ) , row_count as ( select count(base2.*) total_rows from base ) , combo as ( select base.* , row_count.* from base cross join row_count ) , error as ( select cast('Select one organization' as varchar) as error_msg ) select case when total_rows > 1 then error.error_msg else combo.term_name end as term_name from combo cross join error group by 1
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |39953854|wh_insights|rep_student_terms    |true   |            |0 days 7 hrs 37 mins 3 secs|padb_fetch_sample: select count(*) from volt_tt_59cbceb42f325
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |39953854|wh_insights|rep_student_terms    |true   |            |0 days 7 hrs 37 mins 3 secs|CREATE TEMP TABLE volt_tt_59cbceb42f325(term_name) AS SELECT wh_insights.rep_student_terms.term_name FROM wh_insights.rep_student_terms WHERE wh_insights.rep_student_terms.cohort = $2 AND wh_insights.rep_student_terms.organization = $1 AND wh_insights.rep_student_terms.termperiod = 'Current Term' GROUP BY 1;
metabase |itkdw |151931023|10067|2020-01-22 16:21:38|AccessShareLock    |39953854|wh_insights|rep_student_terms    |true   |            |0 days 7 hrs 37 mins 3 secs|-- Metabase with base as ( select term_name from wh_insights.rep_student_terms where termperiod = 'Current Term' and ""wh_insights"".""rep_student_terms"".""organization"" = $1 and ""wh_insights"".""rep_student_terms"".""cohort"" = $2 group by 1 ) , row_count as ( select count(base2.*) total_rows from base ) , combo as ( select base.* , row_count.* from base cross join row_count ) , error as ( select cast('Select one organization' as varchar) as error_msg ) select case when total_rows > 1 then error.error_msg else combo.term_name end as term_name from combo cross join error group by 1
de_jobs  |itkdw |152010160|494  |2020-01-22 23:52:51|AccessExclusiveLock|39953854|wh_insights|rep_student_terms    |false  |10067       |0 days 0 hrs 5 mins 50 secs|create table ""itkdw"".""wh_insights"".""insights_rep_student_terms__dbt_tmp"" diststyle key distkey (student_uuid) as ( SELECT rep_term.*, convert_timezone('US/Pacific', term_end_date) term_end_date_us_pacific, convert_timezone('US/Pacific', term_census_date) term_census_date_us_pacific, convert_timezone('US/Pacific', term_start_date) term_start_date_us_pacific, org_uuid, rep_term.org_name as organization FROM ""itkdw"".""wh_analytics"".""rep_student_terms"" rep_term JOIN ""itkdw"".""wh_insights"".""ucoach_students"" ucoach_students ON ucoach_students.stu_uuid = rep_term.student_uuid );
de_jobs  |itkdw |152010160|494  |2020-01-22 23:52:51|AccessExclusiveLock|39953854|wh_insights|rep_student_terms    |false  |10067       |0 days 0 hrs 5 mins 50 secs|padb_fetch_sample: select * from insights_rep_student_terms__dbt_tmp
de_jobs  |itkdw |152010160|494  |2020-01-22 23:52:51|AccessExclusiveLock|39953854|wh_insights|rep_student_terms    |false  |10067       |0 days 0 hrs 5 mins 50 secs|padb_fetch_sample: select * from insights_rep_student_terms__dbt_tmp
de_jobs  |itkdw |152010160|494  |2020-01-22 23:52:51|AccessExclusiveLock|39953854|wh_insights|rep_student_terms    |false  |9983        |0 days 0 hrs 5 mins 50 secs|create table ""itkdw"".""wh_insights"".""insights_rep_student_terms__dbt_tmp"" diststyle key distkey (student_uuid) as ( SELECT rep_term.*, convert_timezone('US/Pacific', term_end_date) term_end_date_us_pacific, convert_timezone('US/Pacific', term_census_date) term_census_date_us_pacific, convert_timezone('US/Pacific', term_start_date) term_start_date_us_pacific, org_uuid, rep_term.org_name as organization FROM ""itkdw"".""wh_analytics"".""rep_student_terms"" rep_term JOIN ""itkdw"".""wh_insights"".""ucoach_students"" ucoach_students ON ucoach_students.stu_uuid = rep_term.student_uuid );
de_jobs  |itkdw |152010160|494  |2020-01-22 23:52:51|AccessExclusiveLock|39953854|wh_insights|rep_student_terms    |false  |9983        |0 days 0 hrs 5 mins 50 secs|padb_fetch_sample: select * from insights_rep_student_terms__dbt_tmp
de_jobs  |itkdw |152010160|494  |2020-01-22 23:52:51|AccessExclusiveLock|39953854|wh_insights|rep_student_terms    |false  |9983        |0 days 0 hrs 5 mins 50 secs|padb_fetch_sample: select * from insights_rep_student_terms__dbt_tmp

@camsaul
Copy link
Member

camsaul commented Feb 3, 2020

This will be fixed by #11832

@patricktrainer
Copy link

Experiencing this same issue.
Running Metabase on ElasticBeanstalk on Redshift as well.

Restarting the app server via the EB console is what does the trick for me.

@camsaul camsaul added this to the 0.35.0 milestone Feb 19, 2020
@camsaul
Copy link
Member

camsaul commented Feb 19, 2020

Fixed by #11832. Implemented custom low-level JDBC logic that aggressively frees resources and avoids creating unnecessary transactions in the first place

@camsaul camsaul closed this as completed Feb 19, 2020
@botchniaque
Copy link
Contributor

I could not find it explicitly in the code, but is Metabase Redshift connection configured with autocommit=True?
The lock contention problem is caused by an explicit BEGIN and a missing explicit COMMIT around the SELECT statement from Metabase.

We were facing the same issue - currently we're running v0.34.3 and last week we were facing the lock contention problem when out ETL was running concurrently with many Metabase queries.

I see that the issue has been resolved in 0.35 release - I will upgrade today, and report here if we still see the problems - but I hope the fix does resolve them 🤞.

We have contacted Amazon support to help us debug the issue, and for the record I wanted to share what we got to know. It may be useful if someone is debugging the problem. I find looking into svl_statementtext especially useful as it explicitly shows the problem with missing COMMIT:


Message from Amazon support:

When a statement is executed from a client with autocommit disabled, a "silent" BEGIN statement is automatically (implicitly) executed to initiate the transaction, but the transaction is not automatically commited. Keep in mind that even SELECT queries are executed within a transaction and will hold an ACCESS SHARE lock on a table, which conflicts with ACCESS EXCLUSIVE locks (acquired by statements such as ALTER TABLE). For more information on the conflicting lock modes, see page. In your case, we can see this behavior for transaction (XID) 81653141, whereby the transaction was still holding "AccessShareLock"s on tables even though the statement inside it completed (confirmed with the endtime). What seems to have happened here, is that user (with userid = 100), connected to the cluster with autocommit disabled, and executed a SELECT query. The select query completed, but the session remained open. Later on, any DDL commands on the same table will end up blocked until that session is closed.

You can run the following query to see all the statements (including BEGIN/END utilities) executed within that transaction:

select userid, pid, xid, starttime, endtime, substring(text,1,75) as text 
from svl_statementtext 
where sequence = 0 
and xid = 81653141 
order by starttime;

the above should show the BEGIN statement that was implicitly opened, the select query, but no COMMIT/END/ROLLBACK.

Often times , when users connect to a database with the intention to only run SELECT queries, they do not realize that there could be any impact.
I have added an example below to demonstrate how problems can be caused down the line, when a client submits queries (SELECT's) with autocommit disabled, and without closing the sessions.

Scenario

User A connects to the database from SQL workbench J (autocommit = false) and runs the following query :

select * from t1;

The query completes in a few seconds, but user A does not close their session.

An ETL user connects a couple of minutes later, and runs the following DDL statement:

alter table t1 rename to t1_old;

The alter table statement will be blocked in the above scenario.

We execute the following query to find the conflict:

rsdb=# select a.txn_owner, a.txn_db, a.xid, a.pid, a.txn_start, a.lock_mode, a.relation as table_id,nvl(trim(c."name"),d.relname) as tablename, a.granted,b.pid as blocking_pid ,datediff(s,a.txn_start,getdate())/86400||' days '||datediff(s,a.txn_start,getdate())%86400/3600||' hrs '||datediff(s,a.txn_start,getdate())%3600/60||' mins '||datediff(s,a.txn_start,getdate())%60||' secs' as txn_duration
from svv_transactions a
left join (select pid,relation,granted from pg_locks group by 1,2,3) b
on a.relation=b.relation and a.granted='f' and b.granted='t'
left join (select * from stv_tbl_perm where slice=0) c
on a.relation=c.id
left join pg_class d on a.relation=d.oid
where  a.relation is not null and a.relation = 't1'::regclass::oid;
 txn_owner | txn_db |  xid  |  pid  |         txn_start          |    lock_mode    | table_id | tablename | granted | blocking_pid |         txn_duration
-----------+--------+-------+-------+----------------------------+-----------------+----------+-----------+---------+--------------+------------------------------
 master    | rsdb   | 75117 | 12134 | 2020-06-06 12:25:45.230886 | AccessShareLock |   108590 | t1        | t       |              | 0 days 0 hrs 38 mins 45 secs
(1 row)

We then check the statements executed as part of that XID and verify that the select completed

rsdb=# select pid, xid, userid, starttime, endtime, substring(text,1,100) as text from svl_statementtext where xid = 75117 order by starttime;
  pid  |  xid  | userid |         starttime          |          endtime           |       text
-------+-------+--------+----------------------------+----------------------------+------------------
 12134 | 75117 |    100 | 2020-06-06 12:25:41.836158 | 2020-06-06 12:25:41.836159 | BEGIN;
 12134 | 75117 |    100 | 2020-06-06 12:25:45.233851 | 2020-06-06 12:25:45.313076 | select * from t1
(2 rows)

So now, how to solve this issue and prevent it from occurring again :

  1. Database users needs to do their due diligence when connecting the cluster by making sure that their clients are configured to use autocommit, or that they always close off their sessions when their work is done
  2. Statements executed and grouped intentionally as part of an explicit transaction (user specified BEGIN) always needs to be closed off with an COMMIT/END

@flamber
Copy link
Contributor

flamber commented Jun 8, 2020

@botchniaque I think you're looking for this, which doesn't exist on 0.35+:
https://github.com/metabase/metabase/blob/v0.34.3/src/metabase/driver/sql_jdbc/execute.clj#L236-L277

@botchniaque
Copy link
Contributor

@botchniaque I think you're looking for this, which doesn't exist on 0.35+:
https://github.com/metabase/metabase/blob/v0.34.3/src/metabase/driver/sql_jdbc/execute.clj#L236-L277

Thanks @flamber. So, do I understand correctly that 0.35.+ does not disable autocommit anymore?

@flamber
Copy link
Contributor

flamber commented Jun 8, 2020

@botchniaque From what I remember, no. It caused lock-up of big installations. But don't quote me ;-)

@botchniaque
Copy link
Contributor

@flamber ok. I'll just upgrade to v0.35 - I'll be reaching out again if the problem persists. Thanks

@pedroperezTB
Copy link

pedroperezTB commented Jun 9, 2020

@flamber I have v0.35.2 and still happening the problem

@pedroperezTB
Copy link

Al locks dessapear any time after when I logged out from metabase

@flamber
Copy link
Contributor

flamber commented Jun 9, 2020

@pedroperezTB Latest release is 0.35.4. Please post "Diagnostic Info" and any logs that can help figuring out what is happening. Locks on a datasource should not have anything to do with you being logged into Metabase or not, since that's handled on the application database.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests