
DC-Ops (Group)
Active, Public


Description

Tasks handled by the Wikimedia Foundation's datacenter operations team, which is a sub-team of the SRE department.

This project includes the sub-projects procurement and decommission-hardware, plus every datacenter site-specific project: ops-codfw, ops-drmrs, ops-eqdfw, ops-eqiad, ops-eqord, ops-esams, ops-eqsin, ops-ulsfo, and ops-magru.
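As a rough illustration of how a host maps onto one of these site-specific projects, the sketch below derives the project tag from a host's FQDN. It assumes host names follow the `<host>.<site>.wmnet` pattern seen in the activity entries on this page; the function name `site_project` is hypothetical and not part of any DC-Ops tooling.

```python
from typing import Optional

# Site codes taken from the site-specific project list above.
SITES = {
    "codfw", "drmrs", "eqdfw", "eqiad", "eqord",
    "esams", "eqsin", "ulsfo", "magru",
}


def site_project(fqdn: str) -> Optional[str]:
    """Guess the site-specific project tag for a host.

    Assumption: the second DNS label is the site code
    (e.g. cloudvirt1061.eqiad.wmnet -> eqiad). Hosts that do not
    follow this pattern return None and need a manual lookup.
    """
    labels = fqdn.lower().split(".")
    if len(labels) >= 3 and labels[1] in SITES:
        return f"ops-{labels[1]}"
    return None


print(site_project("cloudvirt1061.eqiad.wmnet"))
print(site_project("dns7002.wikimedia.org"))
```

Hosts whose FQDN does not carry a site label, such as those under wikimedia.org, fall through to `None`, so the site has to be looked up some other way.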

This can be linked to via: https://phabricator.wikimedia.org/tag/dc-ops/

Please note that any Wikitech documentation handled by DC-Ops is linked from https://wikitech.wikimedia.org/wiki/Dc-operations

SLAs

DC-Ops makes every attempt to resolve all tasks and requests in a timely manner. We've implemented the following SLA targets.

Please note that no SLA clock starts until the task has reached its defined start point and carries the proper project tags. See the section below for each type of task request, and please use the templates listed below.

| Project | Days to Resolve | SLA start | Template |
| --- | --- | --- | --- |
| procurement | 90 | Date of Task filing | Procurement Template |
| Racking/Installation | 30 | Arrival of Hardware to DC site | |
| Hardware Failure / Repair | 10 | Date of Task filing | Hardware Failure Template |
| Decommission | 45 | When all sub-team steps are complete and task is assigned to on-site | Server Decommission Template |
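To make the targets concrete, here is a minimal sketch that turns an SLA start date into a target resolution date. It assumes the "Days to Resolve" values are calendar days, which this page does not state; the dictionary keys and function names are illustrative only.

```python
from datetime import date, timedelta

# Targets copied from the SLA table above (days to resolve).
SLA_DAYS = {
    "procurement": 90,
    "racking": 30,
    "hardware-failure": 10,
    "decommission": 45,
}


def sla_due_date(kind: str, sla_start: date) -> date:
    """Return the target resolution date.

    Assumption: days are counted as calendar days from the SLA start.
    """
    return sla_start + timedelta(days=SLA_DAYS[kind])


def days_remaining(kind: str, sla_start: date, today: date) -> int:
    """Days left before the target date (negative once it has passed)."""
    return (sla_due_date(kind, sla_start) - today).days


# A hardware failure task filed on 2024-11-27 has a 10-day target.
print(sla_due_date("hardware-failure", date(2024, 11, 27)))
```

The start date passed in must be the one the table defines for that task type, for example hardware arrival for racking rather than the task filing date.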

Hardware Repair

If you need to file a task requesting hardware troubleshooting, please use the File Hardware Failure Task link here or in the navbar on the left.

Troubleshooting includes hardware failures, RAID reconfiguration, and similar issues.

A full runbook on how to troubleshoot hardware failures can be viewed here: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook

Requesting Hardware

If you have a budget line item and want to request pricing, please file your procurement request via this link. If you do not yet have a budget line for the request in this fiscal year, you can still file via that link; simply note in the relevant section of the task that there is no budget allocation.

Once hardware has been ordered, a racking task must be entered using the form. This form may also be used if a system has to be moved and re-imaged.

Decommissioning Hardware

This covers all hardware being returned to DC-Ops, whether for processing into spares or for decommissioning and removal from the rack.

Any hardware no longer required for use should have a task filed for decommission via the pre-defined server decommission request form.

Netbox Reporting

The template for Netbox report errors is here: https://phabricator.wikimedia.org/maniphest/task/edit/form/133/

Recent Activity

Today

fnegri changed the status of T380673: Kernel error Server cloudvirt1061 may have kernel errors from Open to In Progress.
Wed, Nov 27, 2:29 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Stashbot added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

Mentioned in SAL (#wikimedia-operations) [2024-11-27T14:25:13Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1061.eqiad.wmnet with reason: cloudvirt1061 needs maintenance T380673

Wed, Nov 27, 2:25 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Stashbot added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

Mentioned in SAL (#wikimedia-operations) [2024-11-27T14:25:00Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1061.eqiad.wmnet with reason: cloudvirt1061 needs maintenance T380673

Wed, Nov 27, 2:25 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Jclark-ctr moved T380673: Kernel error Server cloudvirt1061 may have kernel errors from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Wed, Nov 27, 2:25 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Jclark-ctr added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

Finished with the BIOS update; waiting on Dell for a response on the new ticket.

Wed, Nov 27, 2:25 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye

Wed, Nov 27, 2:21 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye

Wed, Nov 27, 2:21 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

Dell rejected the parts request; opening a new ticket with them: 201666996

Wed, Nov 27, 2:11 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
aborrero added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

the server has been drained and is ready for a reboot when you need it.

Wed, Nov 27, 1:41 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
MoritzMuehlenhoff updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Wed, Nov 27, 1:37 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye

Wed, Nov 27, 1:21 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye

Wed, Nov 27, 1:21 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye

Wed, Nov 27, 1:20 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T378358: ganeti2042 seems to have a broken CPU? (new Supermicro node).

The server restarted itself again earlier today.

Wed, Nov 27, 12:20 PM · SRE, ops-codfw, DC-Ops
jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).

Wed, Nov 27, 12:04 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

One other option is to try https://github.com/namiltd/megactl with this controller. (The underlying chipset is usually the same and on ms-be2081 I can also see /dev/megaraid_sas_ioctl_node).

Wed, Nov 27, 9:36 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey updated subscribers of T371389: Q1:rack/setup/install ms-be10{83-91}.

@VRiley-WMF @Jclark-ctr Hi! We are ready to start provisioning these nodes, but the procedure is a little more convoluted than usual, since we need to force UEFI and there are still some Supermicro bugs that upstream is working on.

Wed, Nov 27, 9:30 AM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

There are debs available in the Thomas Krenn repo (German server vendor):
https://www.thomas-krenn.com/de/wiki/StorCLI_unter_Ubuntu_installieren

Wed, Nov 27, 9:30 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

There are debs available in the Thomas Krenn repo (German server vendor):
https://www.thomas-krenn.com/de/wiki/StorCLI_unter_Ubuntu_installieren

Wed, Nov 27, 9:29 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey reopened T371400: Q1:rack/setup/install ms-be208[1-8] as "Open".

@Jhancock.wm hi! We have done a lot of weird tests with these nodes, I think that we should re-run provision for all of them to check that nothing weird that was tested is still in place, and possibly reimage all of them too.

Wed, Nov 27, 9:22 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).

Wed, Nov 27, 9:08 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I tried to download and install perccli == 007.2616.0000.0000 on ms-be2081 but no luck, same issue.

Wed, Nov 27, 8:59 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations

Yesterday

BCornwall updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 11:36 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
bking moved T378034: Q2:rack/setup/install elastic211[0-5] from Blocked/Waiting to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Tue, Nov 26, 10:49 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), SRE, Discovery-Search, ops-codfw, DC-Ops
BCornwall updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 10:40 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
BCornwall updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 9:59 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated subscribers of T380307: installation tracking for hosts affected by magru re-shuffle.

@MoritzMuehlenhoff : ganeti700[12] are ready for reimage but I've just run out of steam for today. If you don't get to their reimage on Wednesday I'll do so on my Wednesday AM.

Tue, Nov 26, 9:39 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
BCornwall updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 9:37 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 9:35 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
Maintenance_bot removed a project from T378030: Q2:rack/setup/install wdqs102[567]: Patch-For-Review.
Tue, Nov 26, 9:32 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: lvs7001.magru.wmnet

  • lvs7001.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Tue, Nov 26, 9:22 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
gerritbot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Change #1098133 merged by Ryan Kemper:

[operations/puppet@production] wdqs102[567]: move back to insetup role

https://gerrit.wikimedia.org/r/1098133

Tue, Nov 26, 9:21 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7010.magru.wmnet

  • cp7010.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Tue, Nov 26, 9:20 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 9:06 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 9:03 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7002.magru.wmnet

  • cp7002.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Tue, Nov 26, 8:50 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: dns7002.wikimedia.org

  • dns7002.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Tue, Nov 26, 8:47 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 8:34 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 8:34 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1027 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1027.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 8:32 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1026 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1026.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 8:32 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1025 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1025.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 8:32 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 8:27 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
BCornwall updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 8:25 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7004.magru.wmnet

  • cp7004.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: No non-mgmt connected interfaces found for cp7004. Please check Netbox.
Tue, Nov 26, 8:16 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7002.magru.wmnet

  • ganeti7002.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: No non-mgmt connected interfaces found for ganeti7002. Please check Netbox.
Tue, Nov 26, 8:13 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
gerritbot added a project to T378030: Q2:rack/setup/install wdqs102[567]: Patch-For-Review.
Tue, Nov 26, 8:04 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
gerritbot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Change #1098133 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs102[567]: move back to insetup role

https://gerrit.wikimedia.org/r/1098133

Tue, Nov 26, 8:04 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 8:01 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
RobH updated the task description for T380307: installation tracking for hosts affected by magru re-shuffle.
Tue, Nov 26, 8:00 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops