
Investigate stale dashboards after logstash.discovery.wmnet switch to codfw
Closed, Resolved, Public

Description

At 15:46 UTC, the logstash.discovery.wmnet active/passive (A/P) service was switched from eqiad to codfw as part of switchover day 1 (T370962). About 45 minutes later, there were reports in -operations of stale or missing dashboards in the UI:

16:33:45 <Dreamy_Jazz> logstash.wikimedia.org seems broken to me. Our team's dashboard has disappeared and the home page seems to be an outdated version.
16:34:14 <Dreamy_Jazz> When I view logstash.wikimedia.org I see "AHT Team" which was removed a while ago
16:34:23 <Dreamy_Jazz> And no "Trust and Safety Product"
16:35:06 <Dreamy_Jazz> Plus when I open our team's dashboard from a link we have saved in a google doc, there is an error saying the dashboard does not exist. The URL we have saved is https://logstash.wikimedia.org/app/dashboards#/view/bc0caa20-92d5-11ee-b8fa-893e52d5cd7d?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-1w%2Cto%3Anow))

This was resolved by switching back to eqiad around 16:39.

Opening a follow-up task to investigate what might have led to the staleness, and whether any additional pre-switchover actions are required in the future.

If it turns out that this is an expected behavior and logstash should not be in scope for the "elective" twice-a-year switchover, we also have the option of adding it to the service exclusion list.

Thanks!

Event Timeline

colewhite subscribed.

This is known and expected behavior for our deployment of OpenSearch Dashboards.

This problem happened because the process to restore dashboards data from backup was not followed prior to the switchover. The eqiad and codfw clusters are completely separate from one another for independent redundancy.

Switchovers are possible, but they must be coordinated with Observability so that the backup and restore steps run close together, which minimizes any delta between the data that was backed up and the data that is restored.

Given this and other concerns not detailed here, I would elect to exclude this service from the list unless there is a hard requirement.
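To illustrate the kind of delta described above, here is a minimal sketch that compares two dashboard inventories, one per site. It is not part of the actual backup and restore process: the function name `dashboard_delta`, the inventory shape (dashboard ID mapped to last-updated time), and the example IDs and dates are all hypothetical.

```python
from datetime import datetime, timezone


def dashboard_delta(active, standby):
    """Compare two {dashboard_id: last_updated} inventories (hypothetical shape).

    Returns three sorted lists of IDs:
      missing  - present on the active site but absent from the standby site
      extra    - present only on the standby site (e.g. removed on active
                 since the last restore)
      outdated - present on both, but the standby copy is older
    """
    active_ids = set(active)
    standby_ids = set(standby)
    missing = sorted(active_ids - standby_ids)
    extra = sorted(standby_ids - active_ids)
    outdated = sorted(
        dash_id
        for dash_id in active_ids & standby_ids
        if standby[dash_id] < active[dash_id]
    )
    return {"missing": missing, "extra": extra, "outdated": outdated}


# Hypothetical inventories; IDs and dates are made up for illustration.
eqiad = {
    "team-a": datetime(2024, 9, 1, tzinfo=timezone.utc),
    "team-b": datetime(2024, 9, 20, tzinfo=timezone.utc),
    "team-c": datetime(2024, 8, 15, tzinfo=timezone.utc),
}
codfw = {
    "team-a": datetime(2024, 9, 1, tzinfo=timezone.utc),
    "team-b": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "team-old": datetime(2023, 12, 1, tzinfo=timezone.utc),
}

print(dashboard_delta(eqiad, codfw))
```

An empty result in all three lists would suggest the standby site is in sync; anything else corresponds to the symptoms reported here (a dashboard that "does not exist", or one that was "removed a while ago" still showing up).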

Thanks for the quick turnaround @colewhite!

Alright, given that this requires additional manual / coordinated action, I think it seems reasonable to add logstash to the exclusion list. Of course, it probably is useful to test the switchover procedure at some point, but your team can drive that independently of the twice-a-year switchover cadence.

Also, I now see that the kibana7 service didn't have an associated discovery record until quite recently (June) as part of the work in T356386. That explains why this was not an issue during the last switchover in March 2024.

I'll post a patch for the exclusion list later today.
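As a rough sketch of what an exclusion list does, the snippet below filters a set of candidate services before a switchover. It is not the code of the sre.discovery.datacenter cookbook: the names `EXCLUDED_SERVICES` and `services_to_switch`, and the service names other than kibana7, are hypothetical.

```python
# Hypothetical exclusion list: service name -> reason it is skipped.
EXCLUDED_SERVICES = {
    "kibana7": "dashboards are not replicated between sites; "
               "needs a coordinated backup and restore",
}


def services_to_switch(candidates, excluded=EXCLUDED_SERVICES):
    """Return the candidate services that are not on the exclusion list,
    preserving the input order."""
    return [name for name in candidates if name not in excluded]


# Hypothetical candidate list for one switchover run.
candidates = ["service-a", "kibana7", "service-b"]
for name in candidates:
    if name in EXCLUDED_SERVICES:
        print(f"skipping {name}: {EXCLUDED_SERVICES[name]}")

print(services_to_switch(candidates))
```

Keeping the reason next to each excluded service makes it easier to revisit the exclusion later, for example once the switchover procedure has been tested independently.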

Change #1075314 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/cookbooks@master] sre.discovery.datacenter: exclude kibana7

https://gerrit.wikimedia.org/r/1075314

Change #1075314 merged by jenkins-bot:

[operations/cookbooks@master] sre.discovery.datacenter: exclude kibana7

https://gerrit.wikimedia.org/r/1075314

akosiaris claimed this task.
akosiaris subscribed.

Exclusion merged. It doesn't look like there is a need for another follow-up, so I'll close this as resolved.