
Q3:rack/setup/install db218[567]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of db218[567]

Hostname / Racking / Installation Details

Hostnames: db218[567]
Racking Proposal: Any rack as long as they are in different rows.
Networking Setup: # of Connections: 1, Speed: 1G, Vlan: Private, AAAA records: N
Partitioning/Raid: HW Raid: Y, Partman recipe and/or desired Raid Level: RAID10 (partman recipe already done in puppet by @Marostegui)
OS Distro: Bullseye
Sub-team Technical Contact: @Marostegui

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

db2185: Rack B8 - U29 - Port 28
  • - receive in system on procurement task T325210 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db2186: Rack C3 - U14 - Port 13
  • - receive in system on procurement task T325210 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
db2187: Rack D6 - U8 - Port 7
  • - receive in system on procurement task T325210 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
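
The racking proposal asks that the hosts land in different rows. A minimal Python sketch of that check follows, using the rack assignments from the checklist above. The helper names `expand_hosts` and `rows_are_distinct` are hypothetical, and the sketch assumes that the first character of a rack name (B8, C3, D6) identifies the row.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed character set, e.g. 'db218[567]' -> three hostnames."""
    match = re.fullmatch(r"(.*)\[(\w+)\](.*)", pattern)
    if match is None:
        # No bracket expression: the pattern is already a single hostname.
        return [pattern]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{char}{suffix}" for char in chars]


# Rack assignments copied from the per-host checklist.
RACKS = {"db2185": "B8", "db2186": "C3", "db2187": "D6"}


def rows_are_distinct(racks: dict[str, str]) -> bool:
    """Return True if no two hosts share a row.

    Assumption: the row is the first character of the rack name.
    """
    rows = [rack[0] for rack in racks.values()]
    return len(rows) == len(set(rows))


hosts = expand_hosts("db218[567]")
# Every host named in the task should have a rack assignment.
assert sorted(RACKS) == hosts
print(rows_are_distinct(RACKS))
```

With the assignments listed in this task the three hosts sit in rows B, C, and D, so the check passes.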

Event Timeline

RobH mentioned this in Unknown Object (Task). Jan 5 2023, 6:33 PM
RobH added a parent task: Unknown Object (Task). Jan 5 2023, 6:38 PM
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.

Change 890004 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] ADd db218[567] to site.pp

https://gerrit.wikimedia.org/r/890004

Change 890004 merged by Papaul:

[operations/puppet@production] ADd db218[567] to site.pp

https://gerrit.wikimedia.org/r/890004

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye executed with errors:

  • db2185 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2187.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye completed:

  • db2185 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302171641_pt1979_3287176_db2185.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

@Volans FYI, the 3 db nodes above are R650xs, which we just received. We already worked on one R650 in the past. On the R650xs the provision cookbook is not setting the serial communication; it leaves it at the default (on without console redirection, COM1).

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2187.codfw.wmnet with OS bullseye completed:

  • db2187 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302171710_pt1979_3295487_db2187.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Change 890438 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.provision: add PowerEdge R650xs

https://gerrit.wikimedia.org/r/890438

@Papaul if that's the same issue as with the other models, I guess this should fix it: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/890438

I've merged the above patch. @Papaul could you please re-run the provision cookbook on those hosts to see if it works, and check on the new hosts whether it fixes the issue? Thanks.

Change 890438 merged by Volans:

[operations/cookbooks@master] sre.hosts.provision: add PowerEdge R650xs

https://gerrit.wikimedia.org/r/890438

@Volans thank you. I already made the changes manually on 2 hosts, but I will run it on the one that I haven't set up yet and let you know. Also, it looks like we have received another model in eqiad, the R750, so maybe we should add that model too. I don't know how the BIOS is set up on those; @Jclark-ctr will have more info.

Thanks

@Papaul ideally we should find the characteristic that determines the change (iDRAC version? BIOS version? Dell generation?) and detect it automatically instead of keeping a hardcoded list. Let me know if you happen to know what the change depends on.

@Volans thanks. One thing I know for sure is that the R650 and R750 are Dell 15th generation, while others like the R440 and R740 are 14th generation. I will check and see if I can find more information.
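
As an illustration of detecting the generation instead of keeping a hardcoded model list, here is a minimal Python sketch. It is not the code in sre.hosts.provision; the function name `dell_generation` is hypothetical. It assumes the PowerEdge naming convention in which the second digit of the model number encodes the generation (4 for 14th, 5 for 15th), which matches the models named in this thread. Whether the generation is actually what determines the serial settings is still an open question here.

```python
import re
from typing import Optional


def dell_generation(model: str) -> Optional[int]:
    """Guess the PowerEdge generation from a model string such as 'PowerEdge R650xs'.

    Assumption: the second digit of the model number is the generation
    offset from 10 (R440 -> 14, R650 -> 15). Returns None if no model
    number can be found in the string.
    """
    match = re.search(r"[A-Z](\d{3,4})", model)
    if match is None:
        return None
    digits = match.group(1)
    return 10 + int(digits[1])


for name in ("PowerEdge R440", "PowerEdge R740", "PowerEdge R650xs", "PowerEdge R750"):
    print(name, dell_generation(name))
```

A check such as `dell_generation(model) >= 15` could then replace a per-model list, provided the guess about the naming convention holds for every model in the fleet.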

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye executed with errors:

  • db2186 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye executed with errors:

  • db2186 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302212034_pt1979_172451_db2186.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye completed:

  • db2186 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302220131_pt1979_227128_db2186.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
Papaul updated the task description.

complete