Page MenuHomePhabricator

databases preswitchover checks
Closed, ResolvedPublic

Description

  • compare weights and groups (API/vslow/dumps/...) and pooling for core hosts
  • compare weights and groups (API/vslow/dumps/...) and pooling for es hosts
  • ensure that 1 master 1 candidate master per section declared and usual topology constraints are OK. On core sections: https://phabricator.wikimedia.org/T375186#10161055
  • check section capacities (count, qps, ..)
  • read only verifications + prometheus migration plan check
  • monitoring notifications enabled on all relevant hosts - T375186#10167284
  • validate event/query killer and pt-heartbeat is properly setup everywhere
  • clean orchestrator output P69307 (db1246 skipped due to T374215)
  • check status of buffer pool usage +
  • create a task for buffer pool usage monitoring T375589
  • check mariadb services on all codfw for logging oddities https://phabricator.wikimedia.org/P69410 for replicas and https://phabricator.wikimedia.org/P69411 for primaries
  • Write a dns patch for the new masters and compare DNS, puppet, dbctl and real replica topology (e.g. orchestrator/dbtree) agrees
  • Run compare table among eqiad and codfw masters to ensure they all have the same data (see missing checks at T375507)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABran-WMF changed the task status from Open to In Progress.Sep 19 2024, 12:47 PM
ABran-WMF triaged this task as High priority.
ABran-WMF moved this task from Triage to In progress on the DBA board.

@Ladsgroup I've been informed that subscribing at creation time on phabricator will spare you the notification email, so here is one for you!

@jcrespo
I've quickly drafted a script to read from dbconfig, here is a summary:

EQIAD - s8

  • Number of instances in subgroup 'api': 6
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 11
  • Total weight: 2676
  • Average weight per instance: 243.27
  • Instances with weight 0: ['db1193']
  • Instances with weight less than 10: ['db1193', 'db1167']

CODFW - s8

  • Number of instances in subgroup 'api': 6
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 11
  • Total weight: 2701
  • Average weight per instance: 245.55
  • Instances with weight 0: ['db2165']
  • Instances with weight less than 10: ['db2165', 'db2152']

EQIAD - s2

  • Number of instances in subgroup 'api': 5
  • Number of instances in subgroup 'vslow': 0
  • Number of instances in subgroup 'dump': 0
  • Total number of servers: 8
  • Total weight: 2600
  • Average weight per instance: 325.00
  • Instances with weight 0: ['db1222']
  • Instances with weight less than 10: ['db1222']

CODFW - s2

  • Number of instances in subgroup 'api': 5
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 8
  • Total weight: 2600
  • Average weight per instance: 325.00
  • Instances with weight 0: ['db2204']
  • Instances with weight less than 10: ['db2204']

EQIAD - s1

  • Number of instances in subgroup 'api': 10
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 13
  • Total weight: 2326
  • Average weight per instance: 178.92
  • Instances with weight 0: ['db1163']
  • Instances with weight less than 10: ['db1163', 'db1206']

CODFW - s1

  • Number of instances in subgroup 'api': 5
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 13
  • Total weight: 3250
  • Average weight per instance: 250.00
  • Instances with weight 0: ['db2203']
  • Instances with weight less than 10: ['db2203']

EQIAD - DEFAULT

  • Number of instances in subgroup 'api': 0
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 7
  • Total weight: 2800
  • Average weight per instance: 400.00
  • Instances with weight 0: ['db1189']
  • Instances with weight less than 10: ['db1189']

CODFW - DEFAULT

  • Number of instances in subgroup 'api': 0
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 8
  • Total weight: 2700
  • Average weight per instance: 337.50
  • Instances with weight 0: ['db2205']
  • Instances with weight less than 10: ['db2205']

EQIAD - s5

  • Number of instances in subgroup 'api': 2
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 7
  • Total weight: 2050
  • Average weight per instance: 292.86
  • Instances with weight 0: ['db1230']
  • Instances with weight less than 10: ['db1230']

CODFW - s5

  • Number of instances in subgroup 'api': 3
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 8
  • Total weight: 2300
  • Average weight per instance: 287.50
  • Instances with weight 0: ['db2123']
  • Instances with weight less than 10: ['db2123']

EQIAD - s4

  • Number of instances in subgroup 'api': 6
  • Number of instances in subgroup 'vslow': 2
  • Number of instances in subgroup 'dump': 2
  • Total number of servers: 12
  • Total weight: 2402
  • Average weight per instance: 200.17
  • Instances with weight 0: ['db1160']
  • Instances with weight less than 10: ['db1160', 'db1248', 'db1221']

CODFW - s4

  • Number of instances in subgroup 'api': 5
  • Number of instances in subgroup 'vslow': 2
  • Number of instances in subgroup 'dump': 2
  • Total number of servers: 10
  • Total weight: 2300
  • Average weight per instance: 230.00
  • Instances with weight 0: ['db2179']
  • Instances with weight less than 10: ['db2179']

EQIAD - s6

  • Number of instances in subgroup 'api': 3
  • Number of instances in subgroup 'vslow': 0
  • Number of instances in subgroup 'dump': 0
  • Total number of servers: 7
  • Total weight: 2100
  • Average weight per instance: 300.00
  • Instances with weight 0: ['db1201']
  • Instances with weight less than 10: ['db1201']

CODFW - s6

  • Number of instances in subgroup 'api': 4
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 10
  • Total weight: 2600
  • Average weight per instance: 260.00
  • Instances with weight 0: ['db2214']
  • Instances with weight less than 10: ['db2214']

EQIAD - s7

  • Number of instances in subgroup 'api': 5
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 9
  • Total weight: 2300
  • Average weight per instance: 255.56
  • Instances with weight 0: ['db1236']
  • Instances with weight less than 10: ['db1236']

CODFW - s7

  • Number of instances in subgroup 'api': 3
  • Number of instances in subgroup 'vslow': 1
  • Number of instances in subgroup 'dump': 1
  • Total number of servers: 9
  • Total weight: 3400
  • Average weight per instance: 377.78
  • Instances with weight 0: ['db2218']
  • Instances with weight less than 10: ['db2218']

Change #1074184 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/software@master] mariadb: compare weight and instance distribution across dcs

https://gerrit.wikimedia.org/r/1074184

Change #1074188 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: remove db2129

https://gerrit.wikimedia.org/r/1074188

db2129 from T374623 was pooled and was a duplicate candidate master w/ db2229. This patch fixes the instance duplication and prepares db2129 for removal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074188

Change #1074188 merged by Arnaudb:

[operations/puppet@production] mariadb: remove db2129

https://gerrit.wikimedia.org/r/1074188

I forgot to mention, I think orchestrator has a similar tool, but I found in the past a tool like db-replication-tree useful for this kind of work (preparation) and later tuning after switchover:

image.png (1×1 px, 640 KB)

In case you didn't remember it, it was created to facilitate a table-like with the most important parameters (read only, connection latency, etc). And shows in red things that are unexpected or to be aware (high latencies, read only off). Just commenting it in case it is useful for this ticket.

interesting indeed, I'll check!

ABran-WMF updated the task description. (Show Details)

I've also forked a dashboard (WIP) to be able to compare between sections/dcs

I've spotted fewer oddities on codfw vs eqiad:

  • codfw underweighed: db2173 (sanitarium master), db2152
  • eqiad underweighted: db1160, db1248, db1221, db1167, db1206, db1248, db1221

Will continue on digging potential contention issues.
Note: this should be fully automated.

@Scott_French wrote the patch:

https://gerrit.wikimedia.org/r/c/operations/dns/+/1073897

I compared dns, puppet and replication topology (orchestrator) and found only 2 discrepancies: pc2007 is not marked as a master on puppet (may not be an issue as it may not be yet fully setup) CC Amir. And x2 may not be needed to be switchover-ed, as it is active-active (but it won't hurt). I checked all other hosts, both on puppet, dbctl and orchestrator.

Mentioned in SAL (#wikimedia-operations) [2024-09-23T10:51:52Z] <jynus> starting db master table checks on s1 (db1163, db2203) T375186

@ABran-WMF I see that db2229 in theory its productionization has finished (T373579) and it is pooled as a candidate master, but it has notifications disabled. Is that expected (e.g. hardware crash)?

Change #1074986 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: toggle notifications for db2229

https://gerrit.wikimedia.org/r/1074986

Change #1074986 merged by Arnaudb:

[operations/puppet@production] mariadb: toggle notifications for db2229

https://gerrit.wikimedia.org/r/1074986

@ABran-WMF I see that db2229 in theory its productionization has finished (T373579) and it is pooled as a candidate master, but it has notifications disabled. Is that expected (e.g. hardware crash)?

its an overlook! good catch! its been fixed

Check the other host on puppet/icinga with notifications disabled, I think I saw others, but maybe those are being setup/decom: db2185/6/7.

Check the other host on puppet/icinga with notifications disabled, I think I saw others, but maybe those are being setup/decom: db2185/6/7.

will do!

I've spotted no oddity on es/pc/x sections:

EQIAD - es7

  • Total number of servers: 3
  • Total weight: 210
  • Average weight per instance: 70.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es7

  • Total number of servers: 3
  • Total weight: 200
  • Average weight per instance: 66.67
  • Instances with weight 0: ['es2038']
  • Instances with weight less than 10: ['es2038']

EQIAD - pc5

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc1017']

CODFW - pc5

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc2017']

EQIAD - x1

  • Total number of servers: 4
  • Total weight: 300
  • Average weight per instance: 75.00
  • Instances with weight 0: ['db1220']
  • Instances with weight less than 10: ['db1220']

CODFW - x1

  • Total number of servers: 5
  • Total weight: 400
  • Average weight per instance: 80.00
  • Instances with weight 0: ['db2196']
  • Instances with weight less than 10: ['db2196']

EQIAD - pc2

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc1012']

CODFW - pc2

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc2012']

EQIAD - pc4

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc1016']

CODFW - pc4

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc2015']

EQIAD - es4

  • Total number of servers: 2
  • Total weight: 200
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es4

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

EQIAD - es3

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es3

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

EQIAD - es6

  • Total number of servers: 3
  • Total weight: 210
  • Average weight per instance: 70.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es6

  • Total number of servers: 3
  • Total weight: 210
  • Average weight per instance: 70.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

EQIAD - pc3

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc1013']

CODFW - pc3

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc2013']

EQIAD - x2

  • Total number of servers: 1
  • Total weight: 0
  • Average weight per instance: 0.00
  • Instances with weight 0: ['db1152']
  • Instances with weight less than 10: ['db1152']

CODFW - x2

  • Total number of servers: 1
  • Total weight: 0
  • Average weight per instance: 0.00
  • Instances with weight 0: ['db2142']
  • Instances with weight less than 10: ['db2142']

EQIAD - pc1

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc1011']

CODFW - pc1

  • Total number of servers: 1
  • Total weight: 1
  • Average weight per instance: 1.00
  • Instances with weight 0: []
  • Instances with weight less than 10: ['pc2011']

EQIAD - es5

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es5

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

EQIAD - es1

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es1

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

EQIAD - es2

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

CODFW - es2

  • Total number of servers: 3
  • Total weight: 300
  • Average weight per instance: 100.00
  • Instances with weight 0: []
  • Instances with weight less than 10: []

@jcrespo to be super verbose and explicit about notifications:

X instances

hieradata/hosts/db1151.yaml:
  2  mariadb::shard: 'x2'
  3: profile::monitoring::notifications_enabled: false
  4  

hieradata/hosts/db1153.yaml:
  2  mariadb::shard: 'x2'
  3: profile::monitoring::notifications_enabled: false
  4  # Testing MariaDB 10.11 T365805 - do not promote to master

hieradata/hosts/db2143.yaml:
  2  mariadb::shard: 'x2'
  3: profile::monitoring::notifications_enabled: false

hieradata/hosts/db2144.yaml:
  2  mariadb::shard: 'x2'
  3: profile::monitoring::notifications_enabled: false

all those are OK

zarcillo

hieradata/hosts/db2185.yaml:
  4  profile::monitoring::is_critical: false
  5: profile::monitoring::notifications_enabled: false
  6

despite it being the replica: I'm sure we would want it to alert us

Sanitarium

hieradata/hosts/db2186.yaml:
  7    s8: 48G
  8: profile::monitoring::notifications_enabled: false
  9  

hieradata/hosts/db2187.yaml:
  7    s7: 38G
  8: profile::monitoring::notifications_enabled: false
  9

those are ok, so are db222+

dbproxies

Those are OK as well

ParserCaches

hieradata/hosts/pc1014.yaml:
  4  profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
  5: profile::monitoring::notifications_enabled: false

hieradata/hosts/pc1015.yaml:
  5  profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
  6: profile::monitoring::notifications_enabled: false

hieradata/hosts/pc2014.yaml:
  4  profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
  5: profile::monitoring::notifications_enabled: false

hieradata/hosts/pc2016.yaml:
  4  profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
  5: profile::monitoring::notifications_enabled: false

I've spotted few parsercache instances that have no notification enabled, maybe this are worth reconsidering?

Here is the full output of the research for future reference: P69393

@jcrespo wdyt about:

SELECT 
    ROUND((innodb_buffer_pool_pages_data / innodb_buffer_pool_pages_total) * 100, 2) AS Buffer_Pool_Usage_Percentage
FROM 
    (SELECT 
         VARIABLE_VALUE AS innodb_buffer_pool_pages_data
     FROM 
         information_schema.GLOBAL_STATUS 
     WHERE 
         VARIABLE_NAME = 'Innodb_buffer_pool_pages_data') AS A,
    (SELECT 
         VARIABLE_VALUE AS innodb_buffer_pool_pages_total
     FROM 
         information_schema.GLOBAL_STATUS 
     WHERE 
         VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') AS B;

for the subtask:

check status of buffer pool usage + create a task for monitoring

on db2230 it returns a proper result:

*************************** 1. row ***************************
Buffer_Pool_Usage_Percentage: 0.01

its harmless to run on the whole fleet and would give a human readable output that is easy to add to this CR

check section capacities (count, qps, ..)

  • s4 is a bit undersized on codfw (-2 instances)
  • s6 is a bit oversized (+2 instances)
  • default section and s5 are a bit oversized (+1)

for

read only verifications + prometheus migration plan check

I've reused expressions from T367282 https://gerrit.wikimedia.org/r/c/operations/alerts/+/1050498/12/team-data-persistence/mysql-read_only.yaml

as visible here

Both primary and secondary pools have misconfigured read_only states → lets check it on our 1:1 today

!

db-compare abwiki change_tag ct_id db1189 db2205 --step 10
Starting comparison between id 1 and 63456
2024-09-24T12:25:17.345270: row id 991/63456, ETA: 10m26s, 0 chunk(s) found different
DIFFERENCE on db2205.codfw.wmnet:3306: WHERE ct_id BETWEEN 1901 AND 1910

abwiki change_tag ct_id = 1908 is missing from db2205 (or not deleted from db1189).

jcrespo raised the priority of this task from High to Unbreak Now!.Sep 24 2024, 12:31 PM

I created a dedicated task to track this as it is critical: T375507

ABran-WMF lowered the priority of this task from Unbreak Now! to High.Sep 24 2024, 12:51 PM

lowering priority as T375507 tracks this issue in a dedicated manner

So in

@jcrespo wdyt about:

The query looked sane but was not:

arnaudb@db2158:~ $ sudo mysql -e "SELECT 
    ROUND((innodb_buffer_pool_pages_data / innodb_buffer_pool_pages_total) * 100, 2) AS Buffer_Pool_Usage_Percentage 
FROM 
    (SELECT 
         VARIABLE_VALUE AS innodb_buffer_pool_pages_data
     FROM 
         information_schema.GLOBAL_STATUS 
     WHERE 
         VARIABLE_NAME = 'Innodb_buffer_pool_pages_data') AS A,
    (SELECT 
         VARIABLE_VALUE AS innodb_buffer_pool_pages_total
     FROM 
         information_schema.GLOBAL_STATUS 
     WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') AS B;"
+------------------------------+
| Buffer_Pool_Usage_Percentage |
+------------------------------+
|                       122.73 |
+------------------------------+

so, improving on the query:

arnaudb@db2158:~ $ sudo mysql -e "SELECT 
    pool_id,
    pool_size,
    free_buffers,
    database_pages,
    (pool_size - free_buffers) / pool_size * 100 AS Buffer_Pool_Usage_Percentage
FROM 
    information_schema.INNODB_BUFFER_POOL_STATS;
"
+---------+-----------+--------------+----------------+------------------------------+
| pool_id | pool_size | free_buffers | database_pages | Buffer_Pool_Usage_Percentage |
+---------+-----------+--------------+----------------+------------------------------+
|       0 |  24465792 |            0 |       30026932 |                     100.0000 |
+---------+-----------+--------------+----------------+------------------------------+

This one looks a bit better, I'll check if it's properly representative and will compare across fleet

https://grafana.wikimedia.org/goto/Ss01jBgHg?orgId=1 → this is way faster and more visual

db2142,db2143,db2144,db1151,db1152,db1153 have low buffer pools

event is OK everywhere:

arnaudb@cumin1002:~ $ sudo cumin db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet "mysql -e 'select @@event_scheduler\G'" -p 0
215 hosts will be targeted:
db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet
OK to proceed on 215 hosts? Enter the number of affected hosts to confirm or "q" to quit: 215
===== NODE GROUP =====                                                                                                                                                                                                                  
(11) db[2223,2226,2228,2231-2236,2239-2240].codfw.wmnet                                                                                                                                                                                 
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----                                                                                                                                                                             
bash: line 1: mysql: command not found                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                                                  
(182) db[2115-2116,2123,2126,2128,2130-2136,2140,2142-2159,2161-2185,2188-2196,2202-2222,2224-2225,2227,2229-2230,2237-2238].codfw.wmnet,db[1125,1151-1153,1156-1170,1172-1207,1209-1215,1218-1224,1226-1238,1241-1244,1247-1249].eqiad.wmnet                                                                                                                                                                                                                                   
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----                                                                                                                                                                             
*************************** 1. row ***************************                                                                                                                                                                          
@@event_scheduler: ON                                                                                                                                                                                                                   
===== NODE GROUP =====                                                                                                                                                                                                                  
(8) db[2160,2186-2187].codfw.wmnet,db[1154-1155,1208,1217,1246].eqiad.wmnet                                                                                                                                                             
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----                                                                                                                                                                             
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)                                                                                                                                          
===== NODE GROUP =====                                                                                                                                                                                                                  
(14) db[2139,2141,2197-2201].codfw.wmnet,db[1150,1171,1216,1225,1239-1240,1245].eqiad.wmnet                                                                                                                                             
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----                                                                                                                                                                             
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)                                                                                                                                                  
================                                                                                                                                                                                                                        
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                            |  85% (182/215) [00:04<00:00, 42.26hosts/s]
FAIL |████████████████████████████                                                                                                                                                           |  15% (33/215) [00:04<00:23,  7.66hosts/s]
15.3% (33/215) of nodes failed to execute command 'mysql -e 'select...ent_scheduler\G'': db[2139,2141,2160,2186-2187,2197-2201,2223,2226,2228,2231-2236,2239-2240].codfw.wmnet,db[1150,1154-1155,1171,1208,1216-1217,1225,1239-1240,1245-1246].eqiad.wmnet
84.7% (182/215) success ratio (>= 0.0% threshold) for command: 'mysql -e 'select...ent_scheduler\G''.: db[2115-2116,2123,2126,2128,2130-2136,2140,2142-2159,2161-2185,2188-2196,2202-2222,2224-2225,2227,2229-2230,2237-2238].codfw.wmnet,db[1125,1151-1153,1156-1170,1172-1207,1209-1215,1218-1224,1226-1238,1241-1244,1247-1249].eqiad.wmnet
84.7% (182/215) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.: db[2115-2116,2123,2126,2128,2130-2136,2140,2142-2159,2161-2185,2188-2196,2202-2222,2224-2225,2227,2229-2230,2237-2238].codfw.wmnet,db[1125,1151-1153,1156-1170,1172-1207,1209-1215,1218-1224,1226-1238,1241-1244,1247-1249].eqiad.wmnet

failed instances are all legit to fail.

events in database:

arnaudb@cumin1002:~ $ sudo cumin db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
215 hosts will be targeted:
db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet
OK to proceed on 215 hosts? Enter the number of affected hosts to confirm or "q" to quit: 215
===== NODE GROUP =====                                                                                                                                                                                                                  
(11) db[2223,2226,2228,2231-2236,2239-2240].codfw.wmnet                                                                                                                                                                                 
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
bash: line 1: mysql: command not found                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                                                  
(14) db[2132-2135,2183-2185].codfw.wmnet,db[1159,1164,1176,1204-1205,1215,1228].eqiad.wmnet                                                                                                                                             
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
ERROR 1049 (42000): Unknown database 'ops'                                                                                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                                                  
(20) db[2123,2142,2165,2179,2196,2203-2205,2214,2218].codfw.wmnet,db[1152,1160,1163,1189,1193,1201,1220,1222,1230,1236].eqiad.wmnet                                                                                                     
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
wmf_master_purge                                                                                                                                                                                                                        
wmf_master_wikiuser_sleep                                                                                                                                                                                                               
===== NODE GROUP =====                                                                                                                                                                                                                  
(146) db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1206-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1247-1249].eqiad.wmnet                                                                                  
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
wmf_slave_overload                                                                                                                                                                                                                      
wmf_slave_purge                                                                                                                                                                                                                         
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
===== NODE GROUP =====                                                                                                                                                                                                                  
(8) db[2160,2186-2187].codfw.wmnet,db[1154-1155,1208,1217,1246].eqiad.wmnet                                                                                                                                                             
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)                                                                                                                                          
===== NODE GROUP =====                                                                                                                                                                                                                  
(14) db[2139,2141,2197-2201].codfw.wmnet,db[1150,1171,1216,1225,1239-1240,1245].eqiad.wmnet                                                                                                                                             
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)                                                                                                                                                  
================                                                                                                                                                                                                                        
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (215/215) [00:03<00:00, 58.30hosts/s]
FAIL |                                                                                                                                                                                                |   0% (0/215) [00:03<?, ?hosts/s]
100.0% (215/215) success ratio (>= 0.0% threshold) for command: 'mysql ops -e 'SH...awk '{print $2}''.
100.0% (215/215) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.

Trimming down by cumin alias A:db-role-master:

arnaudb@cumin1002:~ $ sudo cumin A:db-role-master "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
47 hosts will be targeted:
db[2123,2132-2135,2142,2165,2179,2183,2185,2196,2203-2205,2214,2218,2230].codfw.wmnet,db[1125,1152,1159-1160,1163-1164,1176,1189,1193,1201,1204,1215,1220,1222,1228,1230,1236].eqiad.wmnet,es[2035,2038].codfw.wmnet,es[1038-1039].eqiad.wmnet,pc[2011-2013,2015].codfw.wmnet,pc[1011-1012,1015-1017].eqiad.wmnet
OK to proceed on 47 hosts? Enter the number of affected hosts to confirm or "q" to quit: 47
===== NODE GROUP =====                                                                                              
(1) es2035.codfw.wmnet                                                                                              
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                         
wmf_slave_overload                                                                                                  
wmf_slave_purge                                                                                                     
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
===== NODE GROUP =====                                                                                              
(13) db[2132-2135,2183,2185].codfw.wmnet,db[1159,1164,1176,1204,1215,1228].eqiad.wmnet,pc1017.eqiad.wmnet           
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                         
ERROR 1049 (42000): Unknown database 'ops'                                                                          
===== NODE GROUP =====                                                                                              
(28) db[2123,2142,2165,2179,2196,2203-2205,2214,2218].codfw.wmnet,db[1152,1160,1163,1189,1193,1201,1220,1222,1230,1236].eqiad.wmnet,es2038.codfw.wmnet,es[1038-1039].eqiad.wmnet,pc[2011-2013].codfw.wmnet,pc[1011-1012].eqiad.wmnet    
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                         
wmf_master_purge                                                                                                    
wmf_master_wikiuser_sleep

and running the opposite sieving on non A:db-role-master

arnaudb@cumin1002:~ $ sudo cumin 'A:db-all and not A:db-multiinstance and not A:db-role-master' "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
201 hosts will be targeted:
an-redacteddb1001.eqiad.wmnet,clouddb[1013-1020].eqiad.wmnet,db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2184,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1205-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1246-1249].eqiad.wmnet,es[2020-2034,2036-2037,2039-2040].codfw.wmnet,es[1020-1037,1040].eqiad.wmnet,pc[2014,2016-2017].codfw.wmnet,pc[1013-1014].eqiad.wmnet
OK to proceed on 201 hosts? Enter the number of affected hosts to confirm or "q" to quit: 201
===== NODE GROUP =====                                                                                                                                                                                                                  
(9) an-redacteddb1001.eqiad.wmnet,clouddb[1013-1020].eqiad.wmnet                                                                                                                                                                        
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                                                  
(1) db1246.eqiad.wmnet                                                                                                                                                                                                                  
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)                                                                                                                                          
===== NODE GROUP =====                                                                                                                                                                                                                  
(3) db2184.codfw.wmnet,db1205.eqiad.wmnet,pc2017.codfw.wmnet                                                                                                                                                                            
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
ERROR 1049 (42000): Unknown database 'ops'                                                                                                                                                                                              
===== NODE GROUP =====                                                                                                                                                                                                                  
(5) es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet,pc1013.eqiad.wmnet                                                                                                                                                              
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
wmf_master_purge                                                                                                                                                                                                                        
wmf_master_wikiuser_sleep                                                                                                                                                                                                               
===== NODE GROUP =====                                                                                                                                                                                                                  
(182) db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1206-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1247-1249].eqiad.wmnet,es[2020,2022-2023,2025-2034,2036-2037,2039-2040].codfw.wmnet,es[1020,1022-1023,1025-1037,1040].eqiad.wmnet,pc2014.codfw.wmnet,pc1014.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
wmf_slave_overload                                                                                                                                                                                                                      
wmf_slave_purge                                                                                                                                                                                                                         
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
================                                                                                                                                                                                                                        
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (201/201) [00:03<00:00, 51.66hosts/s]
FAIL |                                                                                                                                                                                                |   0% (0/201) [00:03<?, ?hosts/s]
100.0% (201/201) success ratio (>= 0.0% threshold) for command: 'mysql ops -e 'SH...awk '{print $2}''.
100.0% (201/201) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.

We only have consistency issue on non core hosts on events.

primary nodes have a proper config except for es2035.codfw.wmnet. suggested fix:

curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2035

and on the other side:

(4) es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet                                                                                                                                                                                 
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                                                                                                             
wmf_master_purge                                                                                                                                                                                                                        
wmf_master_wikiuser_sleep

only those have consistency issues, with the following suggested fix:

for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do
echo doing ${host}
time  curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql ${host%%.*}

done

wdyt @jcrespo ?

edit: for sanity check, here is a oneliner proofing $host print:

for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do echo "${host%%.*}" ; done
es2021
es2024
es1021
es1024

as updated in task description: P69410 for secondaries and P69411 for primaries contain the last 10 loglines for each mariadb instance

Secondaries are looking correct from what I saw, but primaries may be a bit more concerning. I'll dig in db1159 for instance (and try to scrape the few other snowflakes)

after a quick sanity check with @jcrespo:

for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do
echo doing ${host}
time sudo db-mysql ${host%%.*} < /home/arnaudb/software/dbtools/events_coredb_master.sql
done

this is the logic that emerged:

for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do
echo doing ${host}
time sudo db-mysql ${host%%.*} "select @@read_only;"
done

but it does not seem as trivial as this command simply fails, so I'll stick to the original plan to avoid unforeseen issues

everything went through no trouble:

arnaudb@cumin1002:~ $ for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do
echo doing ${host}
time curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql ${host%%.*}

done
doing es2021.codfw.wmnet

real 0m0.804s
user 0m0.095s
sys 0m0.021s
doing es2024.codfw.wmnet

real 0m0.797s
user 0m0.092s
sys 0m0.045s
doing es1021.eqiad.wmnet

real 0m0.139s
user 0m0.075s
sys 0m0.056s
doing es1024.eqiad.wmnet

real 0m0.123s
user 0m0.094s
sys 0m0.032s
arnaudb@cumin1002:~ $ curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2035

Please rerun the original check when finished so we can confirm all hosts are ready!

I will give a last look at the ones that failed to report an ops db:

===== NODE GROUP =====                                                                                                                                
(13) db[2132-2135,2183,2185].codfw.wmnet,db[1159,1164,1176,1204,1215,1228].eqiad.wmnet,pc1017.eqiad.wmnet                                             
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                           
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====                                                                                                                                
(3) db2184.codfw.wmnet,db1205.eqiad.wmnet,pc2017.codfw.wmnet                                                                                          
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----                                                                                           
ERROR 1049 (42000): Unknown database 'ops'

Mentioned in SAL (#wikimedia-operations) [2024-09-25T11:30:24Z] <jynus> running DELETE + REPLACE on kowiki db2218 (s7) T375186

Mentioned in SAL (#wikimedia-operations) [2024-09-25T11:40:00Z] <jynus> running DELETE + REPLACE on metawiki db2218 (s7) T375186

the only think I saw missing was pc1017 pc2017 which I am guessing they are the new hosts for pc5 ?
in any case, not a big issue

Great work, @ABran-WMF .