- compare weights and groups (API/vslow/dumps/...) and pooling for core hosts
- compare weights and groups (API/vslow/dumps/...) and pooling for es hosts
- ensure that 1 master and 1 candidate master are declared per section and the usual topology constraints are OK. On core sections: https://phabricator.wikimedia.org/T375186#10161055
- check section capacities (count, qps, ...)
- read only verifications + prometheus migration plan check
- monitoring notifications enabled on all relevant hosts - T375186#10167284
- validate that the event/query killer and pt-heartbeat are properly set up everywhere
- clean orchestrator output P69307 (db1246 skipped due to T374215)
- check status of buffer pool usage
- create a task for buffer pool usage monitoring T375589
- check mariadb services on all codfw hosts for logging oddities: https://phabricator.wikimedia.org/P69410 for replicas and https://phabricator.wikimedia.org/P69411 for primaries
- Write a DNS patch for the new masters and verify that DNS, puppet, dbctl and the real replica topology (e.g. orchestrator/dbtree) agree
- Run table comparisons between eqiad and codfw masters to ensure they all have the same data (see missing checks at T375507)
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Scott_French | T370962 Southward Datacenter Switchover (September 2024)
Resolved | | ABran-WMF | T375186 databases preswitchover checks
Event Timeline
@Ladsgroup I've been informed that subscribing at creation time on phabricator will spare you the notification email, so here is one for you!
@jcrespo
I've quickly drafted a script to read from dbconfig, here is a summary:
EQIAD - s8
- Number of instances in subgroup 'api': 6
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 11
- Total weight: 2676
- Average weight per instance: 243.27
- Instances with weight 0: ['db1193']
- Instances with weight less than 10: ['db1193', 'db1167']
CODFW - s8
- Number of instances in subgroup 'api': 6
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 11
- Total weight: 2701
- Average weight per instance: 245.55
- Instances with weight 0: ['db2165']
- Instances with weight less than 10: ['db2165', 'db2152']
EQIAD - s2
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 0
- Number of instances in subgroup 'dump': 0
- Total number of servers: 8
- Total weight: 2600
- Average weight per instance: 325.00
- Instances with weight 0: ['db1222']
- Instances with weight less than 10: ['db1222']
CODFW - s2
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 8
- Total weight: 2600
- Average weight per instance: 325.00
- Instances with weight 0: ['db2204']
- Instances with weight less than 10: ['db2204']
EQIAD - s1
- Number of instances in subgroup 'api': 10
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 13
- Total weight: 2326
- Average weight per instance: 178.92
- Instances with weight 0: ['db1163']
- Instances with weight less than 10: ['db1163', 'db1206']
CODFW - s1
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 13
- Total weight: 3250
- Average weight per instance: 250.00
- Instances with weight 0: ['db2203']
- Instances with weight less than 10: ['db2203']
EQIAD - DEFAULT
- Number of instances in subgroup 'api': 0
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 7
- Total weight: 2800
- Average weight per instance: 400.00
- Instances with weight 0: ['db1189']
- Instances with weight less than 10: ['db1189']
CODFW - DEFAULT
- Number of instances in subgroup 'api': 0
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 8
- Total weight: 2700
- Average weight per instance: 337.50
- Instances with weight 0: ['db2205']
- Instances with weight less than 10: ['db2205']
EQIAD - s5
- Number of instances in subgroup 'api': 2
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 7
- Total weight: 2050
- Average weight per instance: 292.86
- Instances with weight 0: ['db1230']
- Instances with weight less than 10: ['db1230']
CODFW - s5
- Number of instances in subgroup 'api': 3
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 8
- Total weight: 2300
- Average weight per instance: 287.50
- Instances with weight 0: ['db2123']
- Instances with weight less than 10: ['db2123']
EQIAD - s4
- Number of instances in subgroup 'api': 6
- Number of instances in subgroup 'vslow': 2
- Number of instances in subgroup 'dump': 2
- Total number of servers: 12
- Total weight: 2402
- Average weight per instance: 200.17
- Instances with weight 0: ['db1160']
- Instances with weight less than 10: ['db1160', 'db1248', 'db1221']
CODFW - s4
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 2
- Number of instances in subgroup 'dump': 2
- Total number of servers: 10
- Total weight: 2300
- Average weight per instance: 230.00
- Instances with weight 0: ['db2179']
- Instances with weight less than 10: ['db2179']
EQIAD - s6
- Number of instances in subgroup 'api': 3
- Number of instances in subgroup 'vslow': 0
- Number of instances in subgroup 'dump': 0
- Total number of servers: 7
- Total weight: 2100
- Average weight per instance: 300.00
- Instances with weight 0: ['db1201']
- Instances with weight less than 10: ['db1201']
CODFW - s6
- Number of instances in subgroup 'api': 4
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 10
- Total weight: 2600
- Average weight per instance: 260.00
- Instances with weight 0: ['db2214']
- Instances with weight less than 10: ['db2214']
EQIAD - s7
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 9
- Total weight: 2300
- Average weight per instance: 255.56
- Instances with weight 0: ['db1236']
- Instances with weight less than 10: ['db1236']
CODFW - s7
- Number of instances in subgroup 'api': 3
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 9
- Total weight: 3400
- Average weight per instance: 377.78
- Instances with weight 0: ['db2218']
- Instances with weight less than 10: ['db2218']
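For reference, the summary logic above can be sketched in a few lines of Python. This is a minimal sketch only: the field names and the sample hosts/weights (mirroring the "EQIAD - s2" numbers) are illustrative and do not reflect the real dbctl schema.

```python
# Hypothetical sketch of the per-section dbconfig summary above.
# hosts: {instance: weight}; groups: {group_name: [instances]}.
def summarize(section_name, hosts, groups):
    total = sum(hosts.values())
    return {
        "section": section_name,
        "api_count": len(groups.get("api", [])),
        "vslow_count": len(groups.get("vslow", [])),
        "dump_count": len(groups.get("dump", [])),
        "servers": len(hosts),
        "total_weight": total,
        "avg_weight": round(total / len(hosts), 2) if hosts else 0.0,
        "zero_weight": [h for h, w in hosts.items() if w == 0],
        "under_10": [h for h, w in hosts.items() if w < 10],
    }

# Illustrative hosts/weights reproducing the EQIAD - s2 totals above
# (db1222 at weight 0 is real; the other names are made up).
s2 = summarize(
    "s2",
    {"db1222": 0, **{f"db1{i}": 400 for i in (155, 162, 166, 170, 174, 188)},
     "db1129": 200},
    {"api": ["db1155", "db1162", "db1166", "db1170", "db1174"]},
)
```

The point being: flagging weight-0 and under-weighted instances per section is a couple of dict comprehensions once the config is loaded, which supports the "this should be fully automated" note below.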
Change #1074184 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/software@master] mariadb: compare weight and instance distribution across dcs
Change #1074188 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: remove db2129
db2129 from T374623 was pooled and was a duplicate candidate master w/ db2229. This patch fixes the instance duplication and prepares db2129 for removal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074188
Change #1074188 merged by Arnaudb:
[operations/puppet@production] mariadb: remove db2129
I forgot to mention: I think orchestrator has a similar tool, but in the past I found a tool like db-replication-tree useful for this kind of preparation work and for later tuning after the switchover:
In case you don't remember it, it was created to produce a table-like view of the most important parameters (read only, connection latency, etc.), and it shows in red anything unexpected or worth being aware of (high latencies, read only off). Just mentioning it in case it is useful for this ticket.
I've spotted fewer oddities on codfw than on eqiad:
- codfw underweighted: db2173 (sanitarium master), db2152
- eqiad underweighted: db1160, db1248, db1221, db1167, db1206
Will continue digging into potential contention issues.
Note: this should be fully automated.
@Scott_French wrote the patch:
https://gerrit.wikimedia.org/r/c/operations/dns/+/1073897
I compared DNS, puppet and the replication topology (orchestrator) and found only two discrepancies: pc2007 is not marked as a master in puppet (which may not be an issue, as it may not be fully set up yet) CC Amir; and x2 may not need to be switched over, as it is active-active (but it won't hurt). I checked all other hosts, on puppet, dbctl and orchestrator.
Mentioned in SAL (#wikimedia-operations) [2024-09-23T10:51:52Z] <jynus> starting db master table checks on s1 (db1163, db2203) T375186
@ABran-WMF I see that db2229's productionization has in theory finished (T373579) and it is pooled as a candidate master, but it has notifications disabled. Is that expected (e.g. hardware crash)?
Change #1074986 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: toggle notifications for db2229
Change #1074986 merged by Arnaudb:
[operations/puppet@production] mariadb: toggle notifications for db2229
Check the other hosts on puppet/icinga with notifications disabled; I think I saw others, but maybe those are being set up or decommissioned: db2185/6/7.
will do!
I've spotted no oddities on the es/pc/x sections:
EQIAD - es7
- Total number of servers: 3
- Total weight: 210
- Average weight per instance: 70.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es7
- Total number of servers: 3
- Total weight: 200
- Average weight per instance: 66.67
- Instances with weight 0: ['es2038']
- Instances with weight less than 10: ['es2038']
EQIAD - pc5
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1017']
CODFW - pc5
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2017']
EQIAD - x1
- Total number of servers: 4
- Total weight: 300
- Average weight per instance: 75.00
- Instances with weight 0: ['db1220']
- Instances with weight less than 10: ['db1220']
CODFW - x1
- Total number of servers: 5
- Total weight: 400
- Average weight per instance: 80.00
- Instances with weight 0: ['db2196']
- Instances with weight less than 10: ['db2196']
EQIAD - pc2
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1012']
CODFW - pc2
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2012']
EQIAD - pc4
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1016']
CODFW - pc4
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2015']
EQIAD - es4
- Total number of servers: 2
- Total weight: 200
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es4
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es3
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es3
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es6
- Total number of servers: 3
- Total weight: 210
- Average weight per instance: 70.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es6
- Total number of servers: 3
- Total weight: 210
- Average weight per instance: 70.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - pc3
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1013']
CODFW - pc3
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2013']
EQIAD - x2
- Total number of servers: 1
- Total weight: 0
- Average weight per instance: 0.00
- Instances with weight 0: ['db1152']
- Instances with weight less than 10: ['db1152']
CODFW - x2
- Total number of servers: 1
- Total weight: 0
- Average weight per instance: 0.00
- Instances with weight 0: ['db2142']
- Instances with weight less than 10: ['db2142']
EQIAD - pc1
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1011']
CODFW - pc1
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2011']
EQIAD - es5
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es5
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es1
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es1
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es2
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es2
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
@jcrespo to be super verbose and explicit about notifications:
X instances
hieradata/hosts/db1151.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/db1153.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
    # Testing MariaDB 10.11 T365805 - do not promote to master
hieradata/hosts/db2143.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/db2144.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
all those are OK
zarcillo
hieradata/hosts/db2185.yaml:
    profile::monitoring::is_critical: false
    profile::monitoring::notifications_enabled: false
despite it being the replica, I'm sure we would want it to alert us
Sanitarium
hieradata/hosts/db2186.yaml:
    s8: 48G
    profile::monitoring::notifications_enabled: false
hieradata/hosts/db2187.yaml:
    s7: 38G
    profile::monitoring::notifications_enabled: false
those are OK, and so are db222+
dbproxies
Those are OK as well
ParserCaches
hieradata/hosts/pc1014.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/pc1015.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/pc2014.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/pc2016.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
I've spotted a few parsercache instances that have notifications disabled; maybe these are worth reconsidering?
Here is the full output of the research for future reference: P69393
@jcrespo wdyt about:
SELECT ROUND((innodb_buffer_pool_pages_data / innodb_buffer_pool_pages_total) * 100, 2) AS Buffer_Pool_Usage_Percentage
FROM (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_data
      FROM information_schema.GLOBAL_STATUS
      WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_data') AS A,
     (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_total
      FROM information_schema.GLOBAL_STATUS
      WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') AS B;
for the subtask:
check status of buffer pool usage + create a task for monitoring
on db2230 it returns a proper result:
*************************** 1. row ***************************
Buffer_Pool_Usage_Percentage: 0.01
It's harmless to run on the whole fleet and would give a human-readable output that is easy to add to this CR.
check section capacities (count, qps, ..)
- s4 is a bit undersized on codfw (-2 instances)
- s6 is a bit oversized (+2 instances)
- default section and s5 are a bit oversized (+1)
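The raw instance-count deltas behind that assessment can be sketched as below, using the per-section server counts from the dbconfig summary earlier in this task. Note this only compares raw counts; the qualitative sizing above likely also factors in weights and qps, so the numbers will not match exactly for every section.

```python
# Server counts per section, copied from the dbconfig summary in this task.
eqiad = {"s1": 13, "s2": 8, "s4": 12, "s5": 7, "s6": 7, "s8": 11, "DEFAULT": 7}
codfw = {"s1": 13, "s2": 8, "s4": 10, "s5": 8, "s6": 10, "s8": 11, "DEFAULT": 8}

# Positive delta: codfw has more instances than eqiad; negative: fewer.
deltas = {section: codfw[section] - eqiad[section] for section in eqiad}
undersized_codfw = {s: d for s, d in deltas.items() if d < 0}
```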
for
read only verifications + prometheus migration plan check
I've reused expressions from T367282 https://gerrit.wikimedia.org/r/c/operations/alerts/+/1050498/12/team-data-persistence/mysql-read_only.yaml
as visible here
- https://grafana.wikimedia.org/goto/j2xTNfRNg?orgId=1
- https://grafana.wikimedia.org/goto/4b6oHBgNg?orgId=1
Both primary and secondary pools have misconfigured read_only states → let's check it in our 1:1 today
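The invariant being verified can be stated as a tiny check. This is a sketch only: the role names and sample values are illustrative (the real verification runs via the prometheus expressions linked above).

```python
# Sketch of the read_only invariant: only the active-DC primary should be
# writable; standby-DC primaries and all replicas must have read_only = 1.
def misconfigured(hosts):
    """hosts maps name -> (role, @@read_only value); returns offending hosts."""
    def expected(role):
        return 0 if role == "active-primary" else 1

    return sorted(h for h, (role, ro) in hosts.items() if ro != expected(role))

# Illustrative sample (hosts from the s1 discussion above, values made up):
sample = {
    "db1163": ("active-primary", 0),   # eqiad s1 primary, writable: OK
    "db2203": ("standby-primary", 0),  # standby primary left writable: flagged
    "db2204": ("replica", 1),          # read-only replica: OK
}
```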
Mentioned in SAL (#wikimedia-operations) [2024-09-24T12:12:28Z] <jynus> running db-compare on s2, s3 T375186
db-compare abwiki change_tag ct_id db1189 db2205 --step 10
Starting comparison between id 1 and 63456
2024-09-24T12:25:17.345270: row id 991/63456, ETA: 10m26s, 0 chunk(s) found different
DIFFERENCE on db2205.codfw.wmnet:3306: WHERE ct_id BETWEEN 1901 AND 1910
The query looked sane but was not:
arnaudb@db2158:~ $ sudo mysql -e "SELECT ROUND((innodb_buffer_pool_pages_data / innodb_buffer_pool_pages_total) * 100, 2) AS Buffer_Pool_Usage_Percentage FROM (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_data FROM information_schema.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_data') AS A, (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_total FROM information_schema.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') AS B;"
+------------------------------+
| Buffer_Pool_Usage_Percentage |
+------------------------------+
|                       122.73 |
+------------------------------+
so, improving on the query:
arnaudb@db2158:~ $ sudo mysql -e "SELECT pool_id, pool_size, free_buffers, database_pages, (pool_size - free_buffers) / pool_size * 100 AS Buffer_Pool_Usage_Percentage FROM information_schema.INNODB_BUFFER_POOL_STATS;"
+---------+-----------+--------------+----------------+------------------------------+
| pool_id | pool_size | free_buffers | database_pages | Buffer_Pool_Usage_Percentage |
+---------+-----------+--------------+----------------+------------------------------+
|       0 |  24465792 |            0 |       30026932 |                     100.0000 |
+---------+-----------+--------------+----------------+------------------------------+
This one looks a bit better; I'll check whether it's properly representative and compare across the fleet.
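As a sanity check, the arithmetic the improved query relies on is just the following (a minimal sketch; the first query can apparently exceed 100% on some hosts, presumably because Innodb_buffer_pool_pages_data is not bounded by the current pool size):

```python
# Usage formula from the INNODB_BUFFER_POOL_STATS query above:
# (pool_size - free_buffers) / pool_size * 100
def buffer_pool_usage(pool_size, free_buffers):
    return round((pool_size - free_buffers) / pool_size * 100, 4)

# Values taken from the db2158 output above:
usage = buffer_pool_usage(pool_size=24465792, free_buffers=0)
```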
https://grafana.wikimedia.org/goto/Ss01jBgHg?orgId=1 → this is way faster and more visual
db2142,db2143,db2144,db1151,db1152,db1153 have low buffer pools
The event scheduler is OK everywhere:
arnaudb@cumin1002:~ $ sudo cumin db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet "mysql -e 'select @@event_scheduler\G'" -p 0
215 hosts will be targeted: db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet
OK to proceed on 215 hosts? Enter the number of affected hosts to confirm or "q" to quit: 215
===== NODE GROUP =====
(11) db[2223,2226,2228,2231-2236,2239-2240].codfw.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
bash: line 1: mysql: command not found
===== NODE GROUP =====
(182) db[2115-2116,2123,2126,2128,2130-2136,2140,2142-2159,2161-2185,2188-2196,2202-2222,2224-2225,2227,2229-2230,2237-2238].codfw.wmnet,db[1125,1151-1153,1156-1170,1172-1207,1209-1215,1218-1224,1226-1238,1241-1244,1247-1249].eqiad.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
*************************** 1. row ***************************
@@event_scheduler: ON
===== NODE GROUP =====
(8) db[2160,2186-2187].codfw.wmnet,db[1154-1155,1208,1217,1246].eqiad.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
===== NODE GROUP =====
(14) db[2139,2141,2197-2201].codfw.wmnet,db[1150,1171,1216,1225,1239-1240,1245].eqiad.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
================
PASS: 85% (182/215)  FAIL: 15% (33/215)
15.3% (33/215) of nodes failed to execute command 'mysql -e 'select...ent_scheduler\G'': db[2139,2141,2160,2186-2187,2197-2201,2223,2226,2228,2231-2236,2239-2240].codfw.wmnet,db[1150,1154-1155,1171,1208,1216-1217,1225,1239-1240,1245-1246].eqiad.wmnet
84.7% (182/215) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.
All the failed instances are expected to fail.
events in database:
arnaudb@cumin1002:~ $ sudo cumin db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
215 hosts will be targeted: db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet
OK to proceed on 215 hosts? Enter the number of affected hosts to confirm or "q" to quit: 215
===== NODE GROUP =====
(11) db[2223,2226,2228,2231-2236,2239-2240].codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
bash: line 1: mysql: command not found
===== NODE GROUP =====
(14) db[2132-2135,2183-2185].codfw.wmnet,db[1159,1164,1176,1204-1205,1215,1228].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(20) db[2123,2142,2165,2179,2196,2203-2205,2214,2218].codfw.wmnet,db[1152,1160,1163,1189,1193,1201,1220,1222,1230,1236].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
===== NODE GROUP =====
(146) db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1206-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1247-1249].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_slave_overload
wmf_slave_purge
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
===== NODE GROUP =====
(8) db[2160,2186-2187].codfw.wmnet,db[1154-1155,1208,1217,1246].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
===== NODE GROUP =====
(14) db[2139,2141,2197-2201].codfw.wmnet,db[1150,1171,1216,1225,1239-1240,1245].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
================
PASS: 100% (215/215)
100.0% (215/215) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.
Trimming down by cumin alias A:db-role-master:
arnaudb@cumin1002:~ $ sudo cumin A:db-role-master "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
47 hosts will be targeted: db[2123,2132-2135,2142,2165,2179,2183,2185,2196,2203-2205,2214,2218,2230].codfw.wmnet,db[1125,1152,1159-1160,1163-1164,1176,1189,1193,1201,1204,1215,1220,1222,1228,1230,1236].eqiad.wmnet,es[2035,2038].codfw.wmnet,es[1038-1039].eqiad.wmnet,pc[2011-2013,2015].codfw.wmnet,pc[1011-1012,1015-1017].eqiad.wmnet
OK to proceed on 47 hosts? Enter the number of affected hosts to confirm or "q" to quit: 47
===== NODE GROUP =====
(1) es2035.codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_slave_overload
wmf_slave_purge
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
===== NODE GROUP =====
(13) db[2132-2135,2183,2185].codfw.wmnet,db[1159,1164,1176,1204,1215,1228].eqiad.wmnet,pc1017.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(28) db[2123,2142,2165,2179,2196,2203-2205,2214,2218].codfw.wmnet,db[1152,1160,1163,1189,1193,1201,1220,1222,1230,1236].eqiad.wmnet,es2038.codfw.wmnet,es[1038-1039].eqiad.wmnet,pc[2011-2013].codfw.wmnet,pc[1011-1012].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
and running the opposite filter, on hosts not in A:db-role-master:
arnaudb@cumin1002:~ $ sudo cumin 'A:db-all and not A:db-multiinstance and not A:db-role-master' "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
201 hosts will be targeted: an-redacteddb1001.eqiad.wmnet,clouddb[1013-1020].eqiad.wmnet,db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2184,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1205-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1246-1249].eqiad.wmnet,es[2020-2034,2036-2037,2039-2040].codfw.wmnet,es[1020-1037,1040].eqiad.wmnet,pc[2014,2016-2017].codfw.wmnet,pc[1013-1014].eqiad.wmnet
OK to proceed on 201 hosts? Enter the number of affected hosts to confirm or "q" to quit: 201
===== NODE GROUP =====
(9) an-redacteddb1001.eqiad.wmnet,clouddb[1013-1020].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
===== NODE GROUP =====
(1) db1246.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
===== NODE GROUP =====
(3) db2184.codfw.wmnet,db1205.eqiad.wmnet,pc2017.codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(5) es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet,pc1013.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
===== NODE GROUP =====
(182) db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1206-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1247-1249].eqiad.wmnet,es[2020,2022-2023,2025-2034,2036-2037,2039-2040].codfw.wmnet,es[1020,1022-1023,1025-1037,1040].eqiad.wmnet,pc2014.codfw.wmnet,pc1014.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_slave_overload
wmf_slave_purge
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
================
PASS: 100% (201/201)
100.0% (201/201) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.
For events, we only have consistency issues on non-core hosts.
Primary nodes have a proper config except for es2035.codfw.wmnet. Suggested fix:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2035
and on the other side:
(4) es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
only those have consistency issues, with the following suggested fix:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do
  echo doing ${host}
  time curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql ${host%%.*}
done
wdyt @jcrespo ?
edit: as a sanity check, here is a one-liner proving the ${host%%.*} expansion:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do echo "${host%%.*}"; done
es2021
es2024
es1021
es1024
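For completeness, the bracket expansion nodeset performs here can be approximated in a few lines. This is a simplified sketch that handles only a single comma-list bracket group per pattern; ClusterShell's real nodeset also folds and expands numeric ranges.

```python
import re

def expand(pattern):
    """Expand 'es[2021,2024].codfw.wmnet' -> ['es2021.codfw.wmnet', ...]."""
    m = re.match(r"^(.*)\[([^\]]+)\](.*)$", pattern)
    if not m:
        return [pattern]  # no bracket group: already a single host
    head, items, tail = m.groups()
    return [f"{head}{item}{tail}" for item in items.split(",")]

hosts = expand("es[2021,2024].codfw.wmnet") + expand("es[1021,1024].eqiad.wmnet")
```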
As updated in the task description: P69410 for secondaries and P69411 for primaries contain the last 10 log lines for each mariadb instance.
Secondaries look correct from what I saw, but primaries may be a bit more concerning. I'll dig into db1159, for instance (and try to scrape the few other snowflakes).
after a quick sanity check with @jcrespo:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do
  echo doing ${host}
  time sudo db-mysql ${host%%.*} < /home/arnaudb/software/dbtools/events_coredb_master.sql
done
this is the logic that emerged:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do
  echo doing ${host}
  time sudo db-mysql ${host%%.*} "select @@read_only;"
done
but it is not as trivial as it looks: this command simply fails, so I'll stick to the original plan to avoid unforeseen issues.
Everything went through with no trouble:
arnaudb@cumin1002:~ $ for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do
echo doing ${host}
time curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql ${host%%.*}
done
doing es2021.codfw.wmnet
real 0m0.804s
user 0m0.095s
sys 0m0.021s
doing es2024.codfw.wmnet
real 0m0.797s
user 0m0.092s
sys 0m0.045s
doing es1021.eqiad.wmnet
real 0m0.139s
user 0m0.075s
sys 0m0.056s
doing es1024.eqiad.wmnet
real 0m0.123s
user 0m0.094s
sys 0m0.032s
arnaudb@cumin1002:~ $ curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2035
I will give a last look at the ones that failed to report an ops db:
===== NODE GROUP =====
(13) db[2132-2135,2183,2185].codfw.wmnet,db[1159,1164,1176,1204,1215,1228].eqiad.wmnet,pc1017.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(3) db2184.codfw.wmnet,db1205.eqiad.wmnet,pc2017.codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
Mentioned in SAL (#wikimedia-operations) [2024-09-25T11:30:24Z] <jynus> running DELETE + REPLACE on kowiki db2218 (s7) T375186
Mentioned in SAL (#wikimedia-operations) [2024-09-25T11:40:00Z] <jynus> running DELETE + REPLACE on metawiki db2218 (s7) T375186
The only thing I saw missing was pc1017/pc2017, which I am guessing are the new hosts for pc5?
in any case, not a big issue
Great work, @ABran-WMF .