- compare weights and groups (API/vslow/dumps/...) and pooling for core hosts
- compare weights and groups (API/vslow/dumps/...) and pooling for es hosts
- ensure that 1 master and 1 candidate master are declared per section and the usual topology constraints are OK. On core sections: https://phabricator.wikimedia.org/T375186#10161055
- check section capacities (count, qps, ...)
- read only verifications + prometheus migration plan check
- monitoring notifications enabled on all relevant hosts - T375186#10167284
- validate that the event/query killer and pt-heartbeat are properly set up everywhere
- clean orchestrator output P69307 (db1246 skipped due to T374215)
- check status of buffer pool usage
- create a task for buffer pool usage monitoring T375589
- check mariadb services on all codfw hosts for logging oddities: https://phabricator.wikimedia.org/P69410 for replicas and https://phabricator.wikimedia.org/P69411 for primaries
- Write a DNS patch for the new masters and verify that DNS, puppet, dbctl and the real replica topology (e.g. orchestrator/dbtree) agree
- Run table comparisons between eqiad and codfw masters to ensure they all have the same data (see missing checks at T375507)
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Scott_French | T370962 Southward Datacenter Switchover (September 2024)
Resolved | | ABran-WMF | T375186 databases preswitchover checks
Event Timeline
@Ladsgroup I've been informed that subscribing at creation time on phabricator will spare you the notification email, so here is one for you!
@jcrespo
I've quickly drafted a script to read from dbconfig, here is a summary:
EQIAD - s8
- Number of instances in subgroup 'api': 6
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 11
- Total weight: 2676
- Average weight per instance: 243.27
- Instances with weight 0: ['db1193']
- Instances with weight less than 10: ['db1193', 'db1167']
CODFW - s8
- Number of instances in subgroup 'api': 6
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 11
- Total weight: 2701
- Average weight per instance: 245.55
- Instances with weight 0: ['db2165']
- Instances with weight less than 10: ['db2165', 'db2152']
EQIAD - s2
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 0
- Number of instances in subgroup 'dump': 0
- Total number of servers: 8
- Total weight: 2600
- Average weight per instance: 325.00
- Instances with weight 0: ['db1222']
- Instances with weight less than 10: ['db1222']
CODFW - s2
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 8
- Total weight: 2600
- Average weight per instance: 325.00
- Instances with weight 0: ['db2204']
- Instances with weight less than 10: ['db2204']
EQIAD - s1
- Number of instances in subgroup 'api': 10
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 13
- Total weight: 2326
- Average weight per instance: 178.92
- Instances with weight 0: ['db1163']
- Instances with weight less than 10: ['db1163', 'db1206']
CODFW - s1
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 13
- Total weight: 3250
- Average weight per instance: 250.00
- Instances with weight 0: ['db2203']
- Instances with weight less than 10: ['db2203']
EQIAD - DEFAULT
- Number of instances in subgroup 'api': 0
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 7
- Total weight: 2800
- Average weight per instance: 400.00
- Instances with weight 0: ['db1189']
- Instances with weight less than 10: ['db1189']
CODFW - DEFAULT
- Number of instances in subgroup 'api': 0
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 8
- Total weight: 2700
- Average weight per instance: 337.50
- Instances with weight 0: ['db2205']
- Instances with weight less than 10: ['db2205']
EQIAD - s5
- Number of instances in subgroup 'api': 2
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 7
- Total weight: 2050
- Average weight per instance: 292.86
- Instances with weight 0: ['db1230']
- Instances with weight less than 10: ['db1230']
CODFW - s5
- Number of instances in subgroup 'api': 3
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 8
- Total weight: 2300
- Average weight per instance: 287.50
- Instances with weight 0: ['db2123']
- Instances with weight less than 10: ['db2123']
EQIAD - s4
- Number of instances in subgroup 'api': 6
- Number of instances in subgroup 'vslow': 2
- Number of instances in subgroup 'dump': 2
- Total number of servers: 12
- Total weight: 2402
- Average weight per instance: 200.17
- Instances with weight 0: ['db1160']
- Instances with weight less than 10: ['db1160', 'db1248', 'db1221']
CODFW - s4
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 2
- Number of instances in subgroup 'dump': 2
- Total number of servers: 10
- Total weight: 2300
- Average weight per instance: 230.00
- Instances with weight 0: ['db2179']
- Instances with weight less than 10: ['db2179']
EQIAD - s6
- Number of instances in subgroup 'api': 3
- Number of instances in subgroup 'vslow': 0
- Number of instances in subgroup 'dump': 0
- Total number of servers: 7
- Total weight: 2100
- Average weight per instance: 300.00
- Instances with weight 0: ['db1201']
- Instances with weight less than 10: ['db1201']
CODFW - s6
- Number of instances in subgroup 'api': 4
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 10
- Total weight: 2600
- Average weight per instance: 260.00
- Instances with weight 0: ['db2214']
- Instances with weight less than 10: ['db2214']
EQIAD - s7
- Number of instances in subgroup 'api': 5
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 9
- Total weight: 2300
- Average weight per instance: 255.56
- Instances with weight 0: ['db1236']
- Instances with weight less than 10: ['db1236']
CODFW - s7
- Number of instances in subgroup 'api': 3
- Number of instances in subgroup 'vslow': 1
- Number of instances in subgroup 'dump': 1
- Total number of servers: 9
- Total weight: 3400
- Average weight per instance: 377.78
- Instances with weight 0: ['db2218']
- Instances with weight less than 10: ['db2218']
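For reference, the summary logic above can be sketched in a few lines of Python. This is a minimal sketch only: the field names and the sample hosts/weights (mirroring the "EQIAD - s2" numbers) are illustrative and do not reflect the real dbctl schema.

```python
# Hypothetical sketch of the per-section dbconfig summary above.
# hosts: {instance: weight}; groups: {group_name: [instances]}.
def summarize(section_name, hosts, groups):
    total = sum(hosts.values())
    return {
        "section": section_name,
        "api_count": len(groups.get("api", [])),
        "vslow_count": len(groups.get("vslow", [])),
        "dump_count": len(groups.get("dump", [])),
        "servers": len(hosts),
        "total_weight": total,
        "avg_weight": round(total / len(hosts), 2) if hosts else 0.0,
        "zero_weight": [h for h, w in hosts.items() if w == 0],
        "under_10": [h for h, w in hosts.items() if w < 10],
    }

# Illustrative hosts/weights reproducing the EQIAD - s2 totals above
# (db1222 at weight 0 is real; the other names are made up).
s2 = summarize(
    "s2",
    {"db1222": 0, **{f"db1{i}": 400 for i in (155, 162, 166, 170, 174, 188)},
     "db1129": 200},
    {"api": ["db1155", "db1162", "db1166", "db1170", "db1174"]},
)
```

The point being: flagging weight-0 and under-weighted instances per section is a couple of dict comprehensions once the config is loaded, which supports the "this should be fully automated" note below.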
Change #1074184 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/software@master] mariadb: compare weight and instance distribution across dcs
Change #1074188 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: remove db2129
db2129 from T374623 was pooled and was a duplicate candidate master w/ db2229. This patch fixes the instance duplication and prepares db2129 for removal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074188
Change #1074188 merged by Arnaudb:
[operations/puppet@production] mariadb: remove db2129
I forgot to mention: I think orchestrator has a similar tool, but in the past I found a tool like db-replication-tree useful for this kind of preparation work and for later tuning after the switchover:
In case you don't remember it, it was created to produce a table-like view of the most important parameters (read only, connection latency, etc.), and it shows in red anything unexpected or worth being aware of (high latencies, read only off). Just mentioning it in case it is useful for this ticket.
I've spotted fewer oddities on codfw than on eqiad:
- codfw underweighted: db2173 (sanitarium master), db2152
- eqiad underweighted: db1160, db1248, db1221, db1167, db1206
Will continue digging into potential contention issues.
Note: this should be fully automated.
@Scott_French wrote the patch:
https://gerrit.wikimedia.org/r/c/operations/dns/+/1073897
I compared DNS, puppet and the replication topology (orchestrator) and found only two discrepancies: pc2007 is not marked as a master in puppet (which may not be an issue, as it may not be fully set up yet) CC Amir; and x2 may not need to be switched over, as it is active-active (but it won't hurt). I checked all other hosts, on puppet, dbctl and orchestrator.
Mentioned in SAL (#wikimedia-operations) [2024-09-23T10:51:52Z] <jynus> starting db master table checks on s1 (db1163, db2203) T375186
@ABran-WMF I see that db2229's productionization has in theory finished (T373579) and it is pooled as a candidate master, but it has notifications disabled. Is that expected (e.g. hardware crash)?
Change #1074986 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] mariadb: toggle notifications for db2229
Change #1074986 merged by Arnaudb:
[operations/puppet@production] mariadb: toggle notifications for db2229
Check the other hosts on puppet/icinga with notifications disabled; I think I saw others, but maybe those are being set up or decommissioned: db2185/6/7.
will do!
I've spotted no oddities on the es/pc/x sections:
EQIAD - es7
- Total number of servers: 3
- Total weight: 210
- Average weight per instance: 70.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es7
- Total number of servers: 3
- Total weight: 200
- Average weight per instance: 66.67
- Instances with weight 0: ['es2038']
- Instances with weight less than 10: ['es2038']
EQIAD - pc5
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1017']
CODFW - pc5
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2017']
EQIAD - x1
- Total number of servers: 4
- Total weight: 300
- Average weight per instance: 75.00
- Instances with weight 0: ['db1220']
- Instances with weight less than 10: ['db1220']
CODFW - x1
- Total number of servers: 5
- Total weight: 400
- Average weight per instance: 80.00
- Instances with weight 0: ['db2196']
- Instances with weight less than 10: ['db2196']
EQIAD - pc2
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1012']
CODFW - pc2
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2012']
EQIAD - pc4
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1016']
CODFW - pc4
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2015']
EQIAD - es4
- Total number of servers: 2
- Total weight: 200
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es4
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es3
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es3
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es6
- Total number of servers: 3
- Total weight: 210
- Average weight per instance: 70.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es6
- Total number of servers: 3
- Total weight: 210
- Average weight per instance: 70.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - pc3
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1013']
CODFW - pc3
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2013']
EQIAD - x2
- Total number of servers: 1
- Total weight: 0
- Average weight per instance: 0.00
- Instances with weight 0: ['db1152']
- Instances with weight less than 10: ['db1152']
CODFW - x2
- Total number of servers: 1
- Total weight: 0
- Average weight per instance: 0.00
- Instances with weight 0: ['db2142']
- Instances with weight less than 10: ['db2142']
EQIAD - pc1
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc1011']
CODFW - pc1
- Total number of servers: 1
- Total weight: 1
- Average weight per instance: 1.00
- Instances with weight 0: []
- Instances with weight less than 10: ['pc2011']
EQIAD - es5
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es5
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es1
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es1
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
EQIAD - es2
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
CODFW - es2
- Total number of servers: 3
- Total weight: 300
- Average weight per instance: 100.00
- Instances with weight 0: []
- Instances with weight less than 10: []
@jcrespo to be super verbose and explicit about notifications:
X instances
hieradata/hosts/db1151.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/db1153.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
    # Testing MariaDB 10.11 T365805 - do not promote to master
hieradata/hosts/db2143.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/db2144.yaml:
    mariadb::shard: 'x2'
    profile::monitoring::notifications_enabled: false
all those are OK
zarcillo
hieradata/hosts/db2185.yaml:
    profile::monitoring::is_critical: false
    profile::monitoring::notifications_enabled: false
despite it being the replica, I'm sure we would want it to alert us
Sanitarium
hieradata/hosts/db2186.yaml:
    s8: 48G
    profile::monitoring::notifications_enabled: false
hieradata/hosts/db2187.yaml:
    s7: 38G
    profile::monitoring::notifications_enabled: false
those are OK, and so are db222+
dbproxies
Those are OK as well
ParserCaches
hieradata/hosts/pc1014.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/pc1015.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/pc2014.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
hieradata/hosts/pc2016.yaml:
    profile::mariadb::config::innodb_flush_log_at_trx_commit: '0'
    profile::monitoring::notifications_enabled: false
I've spotted a few parsercache instances that have notifications disabled; maybe these are worth reconsidering?
Here is the full output of the research for future reference: P69393
@jcrespo wdyt about:
SELECT ROUND((innodb_buffer_pool_pages_data / innodb_buffer_pool_pages_total) * 100, 2) AS Buffer_Pool_Usage_Percentage
FROM (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_data
      FROM information_schema.GLOBAL_STATUS
      WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_data') AS A,
     (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_total
      FROM information_schema.GLOBAL_STATUS
      WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') AS B;
for the subtask:
check status of buffer pool usage + create a task for monitoring
on db2230 it returns a proper result:
*************************** 1. row ***************************
Buffer_Pool_Usage_Percentage: 0.01
It's harmless to run on the whole fleet and would give a human-readable output that is easy to add to this CR.
check section capacities (count, qps, ..)
- s4 is a bit undersized on codfw (-2 instances)
- s6 is a bit oversized (+2 instances)
- default section and s5 are a bit oversized (+1)
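The raw instance-count deltas behind that assessment can be sketched as below, using the per-section server counts from the dbconfig summary earlier in this task. Note this only compares raw counts; the qualitative sizing above likely also factors in weights and qps, so the numbers will not match exactly for every section.

```python
# Server counts per section, copied from the dbconfig summary in this task.
eqiad = {"s1": 13, "s2": 8, "s4": 12, "s5": 7, "s6": 7, "s8": 11, "DEFAULT": 7}
codfw = {"s1": 13, "s2": 8, "s4": 10, "s5": 8, "s6": 10, "s8": 11, "DEFAULT": 8}

# Positive delta: codfw has more instances than eqiad; negative: fewer.
deltas = {section: codfw[section] - eqiad[section] for section in eqiad}
undersized_codfw = {s: d for s, d in deltas.items() if d < 0}
```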
for
read only verifications + prometheus migration plan check
I've reused expressions from T367282 https://gerrit.wikimedia.org/r/c/operations/alerts/+/1050498/12/team-data-persistence/mysql-read_only.yaml
as visible here
- https://grafana.wikimedia.org/goto/j2xTNfRNg?orgId=1
- https://grafana.wikimedia.org/goto/4b6oHBgNg?orgId=1
Both primary and secondary pools have misconfigured read_only states → let's check it in our 1:1 today
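The invariant being verified can be stated as a tiny check. This is a sketch only: the role names and sample values are illustrative (the real verification runs via the prometheus expressions linked above).

```python
# Sketch of the read_only invariant: only the active-DC primary should be
# writable; standby-DC primaries and all replicas must have read_only = 1.
def misconfigured(hosts):
    """hosts maps name -> (role, @@read_only value); returns offending hosts."""
    def expected(role):
        return 0 if role == "active-primary" else 1

    return sorted(h for h, (role, ro) in hosts.items() if ro != expected(role))

# Illustrative sample (hosts from the s1 discussion above, values made up):
sample = {
    "db1163": ("active-primary", 0),   # eqiad s1 primary, writable: OK
    "db2203": ("standby-primary", 0),  # standby primary left writable: flagged
    "db2204": ("replica", 1),          # read-only replica: OK
}
```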
Mentioned in SAL (#wikimedia-operations) [2024-09-24T12:12:28Z] <jynus> running db-compare on s2, s3 T375186
db-compare abwiki change_tag ct_id db1189 db2205 --step 10
Starting comparison between id 1 and 63456
2024-09-24T12:25:17.345270: row id 991/63456, ETA: 10m26s, 0 chunk(s) found different
DIFFERENCE on db2205.codfw.wmnet:3306: WHERE ct_id BETWEEN 1901 AND 1910
The query looked sane but was not:
arnaudb@db2158:~ $ sudo mysql -e "SELECT ROUND((innodb_buffer_pool_pages_data / innodb_buffer_pool_pages_total) * 100, 2) AS Buffer_Pool_Usage_Percentage FROM (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_data FROM information_schema.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_data') AS A, (SELECT VARIABLE_VALUE AS innodb_buffer_pool_pages_total FROM information_schema.GLOBAL_STATUS WHERE VARIABLE_NAME = 'Innodb_buffer_pool_pages_total') AS B;"
+------------------------------+
| Buffer_Pool_Usage_Percentage |
+------------------------------+
|                       122.73 |
+------------------------------+
so, improving on the query:
arnaudb@db2158:~ $ sudo mysql -e "SELECT pool_id, pool_size, free_buffers, database_pages, (pool_size - free_buffers) / pool_size * 100 AS Buffer_Pool_Usage_Percentage FROM information_schema.INNODB_BUFFER_POOL_STATS;"
+---------+-----------+--------------+----------------+------------------------------+
| pool_id | pool_size | free_buffers | database_pages | Buffer_Pool_Usage_Percentage |
+---------+-----------+--------------+----------------+------------------------------+
|       0 |  24465792 |            0 |       30026932 |                     100.0000 |
+---------+-----------+--------------+----------------+------------------------------+
This one looks a bit better; I'll check whether it's properly representative and compare across the fleet.
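As a sanity check, the arithmetic the improved query relies on is just the following (a minimal sketch; the first query can apparently exceed 100% on some hosts, presumably because Innodb_buffer_pool_pages_data is not bounded by the current pool size):

```python
# Usage formula from the INNODB_BUFFER_POOL_STATS query above:
# (pool_size - free_buffers) / pool_size * 100
def buffer_pool_usage(pool_size, free_buffers):
    return round((pool_size - free_buffers) / pool_size * 100, 4)

# Values taken from the db2158 output above:
usage = buffer_pool_usage(pool_size=24465792, free_buffers=0)
```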
https://grafana.wikimedia.org/goto/Ss01jBgHg?orgId=1 → this is way faster and more visual
db2142,db2143,db2144,db1151,db1152,db1153 have low buffer pools
The event scheduler is OK everywhere:
arnaudb@cumin1002:~ $ sudo cumin db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet "mysql -e 'select @@event_scheduler\G'" -p 0
215 hosts will be targeted: db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet
OK to proceed on 215 hosts? Enter the number of affected hosts to confirm or "q" to quit: 215
===== NODE GROUP =====
(11) db[2223,2226,2228,2231-2236,2239-2240].codfw.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
bash: line 1: mysql: command not found
===== NODE GROUP =====
(182) db[2115-2116,2123,2126,2128,2130-2136,2140,2142-2159,2161-2185,2188-2196,2202-2222,2224-2225,2227,2229-2230,2237-2238].codfw.wmnet,db[1125,1151-1153,1156-1170,1172-1207,1209-1215,1218-1224,1226-1238,1241-1244,1247-1249].eqiad.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
*************************** 1. row ***************************
@@event_scheduler: ON
===== NODE GROUP =====
(8) db[2160,2186-2187].codfw.wmnet,db[1154-1155,1208,1217,1246].eqiad.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
===== NODE GROUP =====
(14) db[2139,2141,2197-2201].codfw.wmnet,db[1150,1171,1216,1225,1239-1240,1245].eqiad.wmnet
----- OUTPUT of 'mysql -e 'select...ent_scheduler\G'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
================
PASS: 85% (182/215)  FAIL: 15% (33/215)
15.3% (33/215) of nodes failed to execute command 'mysql -e 'select...ent_scheduler\G'': db[2139,2141,2160,2186-2187,2197-2201,2223,2226,2228,2231-2236,2239-2240].codfw.wmnet,db[1150,1154-1155,1171,1208,1216-1217,1225,1239-1240,1245-1246].eqiad.wmnet
84.7% (182/215) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.
All the failed instances are expected to fail.
events in database:
arnaudb@cumin1002:~ $ sudo cumin db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
215 hosts will be targeted: db[2115-2116,2123,2126,2128,2130-2136,2139-2240].codfw.wmnet,db[1125,1150-1249].eqiad.wmnet
OK to proceed on 215 hosts? Enter the number of affected hosts to confirm or "q" to quit: 215
===== NODE GROUP =====
(11) db[2223,2226,2228,2231-2236,2239-2240].codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
bash: line 1: mysql: command not found
===== NODE GROUP =====
(14) db[2132-2135,2183-2185].codfw.wmnet,db[1159,1164,1176,1204-1205,1215,1228].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(20) db[2123,2142,2165,2179,2196,2203-2205,2214,2218].codfw.wmnet,db[1152,1160,1163,1189,1193,1201,1220,1222,1230,1236].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
===== NODE GROUP =====
(146) db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1206-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1247-1249].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_slave_overload
wmf_slave_purge
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
===== NODE GROUP =====
(8) db[2160,2186-2187].codfw.wmnet,db[1154-1155,1208,1217,1246].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
===== NODE GROUP =====
(14) db[2139,2141,2197-2201].codfw.wmnet,db[1150,1171,1216,1225,1239-1240,1245].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
================
PASS: 100% (215/215)
100.0% (215/215) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.
Trimming down by cumin alias A:db-role-master:
arnaudb@cumin1002:~ $ sudo cumin A:db-role-master "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
47 hosts will be targeted: db[2123,2132-2135,2142,2165,2179,2183,2185,2196,2203-2205,2214,2218,2230].codfw.wmnet,db[1125,1152,1159-1160,1163-1164,1176,1189,1193,1201,1204,1215,1220,1222,1228,1230,1236].eqiad.wmnet,es[2035,2038].codfw.wmnet,es[1038-1039].eqiad.wmnet,pc[2011-2013,2015].codfw.wmnet,pc[1011-1012,1015-1017].eqiad.wmnet
OK to proceed on 47 hosts? Enter the number of affected hosts to confirm or "q" to quit: 47
===== NODE GROUP =====
(1) es2035.codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_slave_overload
wmf_slave_purge
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
===== NODE GROUP =====
(13) db[2132-2135,2183,2185].codfw.wmnet,db[1159,1164,1176,1204,1215,1228].eqiad.wmnet,pc1017.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(28) db[2123,2142,2165,2179,2196,2203-2205,2214,2218].codfw.wmnet,db[1152,1160,1163,1189,1193,1201,1220,1222,1230,1236].eqiad.wmnet,es2038.codfw.wmnet,es[1038-1039].eqiad.wmnet,pc[2011-2013].codfw.wmnet,pc[1011-1012].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
and running the opposite filter, on hosts not in A:db-role-master:
arnaudb@cumin1002:~ $ sudo cumin 'A:db-all and not A:db-multiinstance and not A:db-role-master' "mysql ops -e 'SHOW EVENTS FROM ops;' -s -N | awk '{print \$2}'" -p 0
201 hosts will be targeted: an-redacteddb1001.eqiad.wmnet,clouddb[1013-1020].eqiad.wmnet,db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2184,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1205-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1246-1249].eqiad.wmnet,es[2020-2034,2036-2037,2039-2040].codfw.wmnet,es[1020-1037,1040].eqiad.wmnet,pc[2014,2016-2017].codfw.wmnet,pc[1013-1014].eqiad.wmnet
OK to proceed on 201 hosts? Enter the number of affected hosts to confirm or "q" to quit: 201
===== NODE GROUP =====
(9) an-redacteddb1001.eqiad.wmnet,clouddb[1013-1020].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
===== NODE GROUP =====
(1) db1246.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
===== NODE GROUP =====
(3) db2184.codfw.wmnet,db1205.eqiad.wmnet,pc2017.codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(5) es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet,pc1013.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
===== NODE GROUP =====
(182) db[2115-2116,2126,2128,2130-2131,2136,2140,2143-2159,2161-2164,2166-2178,2180-2182,2188-2195,2202,2206-2213,2215-2217,2219-2222,2224-2225,2227,2229,2237-2238].codfw.wmnet,db[1151,1153,1156-1158,1161-1162,1165-1170,1172-1175,1177-1188,1190-1192,1194-1200,1202-1203,1206-1207,1209-1214,1218-1219,1221,1223-1224,1226-1227,1229,1231-1235,1237-1238,1241-1244,1247-1249].eqiad.wmnet,es[2020,2022-2023,2025-2034,2036-2037,2039-2040].codfw.wmnet,es[1020,1022-1023,1025-1037,1040].eqiad.wmnet,pc2014.codfw.wmnet,pc1014.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_slave_overload
wmf_slave_purge
wmf_slave_wikiuser_sleep
wmf_slave_wikiuser_slow
================
PASS: 100% (201/201)
100.0% (201/201) success ratio (>= 0.0% threshold) of nodes successfully executed all commands.
For events, we only have consistency issues on non-core hosts.
Primary nodes have a proper config except for es2035.codfw.wmnet. Suggested fix:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2035
and on the other side:
(4) es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
wmf_master_purge
wmf_master_wikiuser_sleep
only those have consistency issues, with the following suggested fix:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do
  echo doing ${host}
  time curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql ${host%%.*}
done
wdyt @jcrespo ?
edit: as a sanity check, here is a one-liner proving the ${host%%.*} expansion:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do echo "${host%%.*}"; done
es2021
es2024
es1021
es1024
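For completeness, the bracket expansion nodeset performs here can be approximated in a few lines. This is a simplified sketch that handles only a single comma-list bracket group per pattern; ClusterShell's real nodeset also folds and expands numeric ranges.

```python
import re

def expand(pattern):
    """Expand 'es[2021,2024].codfw.wmnet' -> ['es2021.codfw.wmnet', ...]."""
    m = re.match(r"^(.*)\[([^\]]+)\](.*)$", pattern)
    if not m:
        return [pattern]  # no bracket group: already a single host
    head, items, tail = m.groups()
    return [f"{head}{item}{tail}" for item in items.split(",")]

hosts = expand("es[2021,2024].codfw.wmnet") + expand("es[1021,1024].eqiad.wmnet")
```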
As updated in the task description: P69410 for secondaries and P69411 for primaries contain the last 10 log lines for each mariadb instance.
Secondaries look correct from what I saw, but primaries may be a bit more concerning. I'll dig into db1159, for instance (and try to scrape the few other snowflakes).
after a quick sanity check with @jcrespo:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do
  echo doing ${host}
  time sudo db-mysql ${host%%.*} < /home/arnaudb/software/dbtools/events_coredb_master.sql
done
this is the logic that emerged:
for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" | nodeset -e); do
  echo doing ${host}
  time sudo db-mysql ${host%%.*} "select @@read_only;"
done
but it is not as trivial as it looks: this command simply fails, so I'll stick to the original plan to avoid unforeseen issues.
Everything went through with no trouble:
arnaudb@cumin1002:~ $ for host in $(echo "es[2021,2024].codfw.wmnet,es[1021,1024].eqiad.wmnet" |nodeset -e); do
echo doing ${host}
time curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql ${host%%.*}
done
doing es2021.codfw.wmnet
real 0m0.804s
user 0m0.095s
sys 0m0.021s
doing es2024.codfw.wmnet
real 0m0.797s
user 0m0.092s
sys 0m0.045s
doing es1021.eqiad.wmnet
real 0m0.139s
user 0m0.075s
sys 0m0.056s
doing es1024.eqiad.wmnet
real 0m0.123s
user 0m0.094s
sys 0m0.032s
arnaudb@cumin1002:~ $ curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2035
I will give a last look at the ones that failed to report an ops db:
===== NODE GROUP =====
(13) db[2132-2135,2183,2185].codfw.wmnet,db[1159,1164,1176,1204,1215,1228].eqiad.wmnet,pc1017.eqiad.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
===== NODE GROUP =====
(3) db2184.codfw.wmnet,db1205.eqiad.wmnet,pc2017.codfw.wmnet
----- OUTPUT of 'mysql ops -e 'SH...awk '{print $2}'' -----
ERROR 1049 (42000): Unknown database 'ops'
Mentioned in SAL (#wikimedia-operations) [2024-09-25T11:30:24Z] <jynus> running DELETE + REPLACE on kowiki db2218 (s7) T375186
Mentioned in SAL (#wikimedia-operations) [2024-09-25T11:40:00Z] <jynus> running DELETE + REPLACE on metawiki db2218 (s7) T375186
The only thing I saw missing was pc1017/pc2017, which I am guessing are the new hosts for pc5?
in any case, not a big issue
Great work, @ABran-WMF .