Storage Runbook

From Internet Computer Wiki

Revision as of 14:13, 13 July 2022

These instructions are for Dell R6525 servers and are based on the following assumptions:

  • Participation in the Internet Computer program
  • A node provider registered with the NNS
  • Nodes that are currently active or inactive and need hard drives, and possibly power supplies, installed

Hardware Requirements

  1. Depending on the Internet Computer Rack (ICR) configuration for your data center location(s), you will receive the following hardware per node/server:
    • 9 × 3.2 TB 2.5-inch hard drives in Dell trays
    • 1 × 800 W power supply
    • 1 × 6 ft C13/C14 power cord
  2. Confirm that you received the correct quantity of equipment for your location(s), which will be in either a 14-node or a 28-node configuration. If you are missing any equipment, please contact [email protected].
    • Depending on how much time and space you have available, stage the hardware near your cabinets (either on a table or in a safe, accessible area).
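If you track deliveries in a spreadsheet or script, the quantity check above is simple arithmetic: the per-node quantities multiplied by 14 or 28. The Python sketch below is a minimal example, assuming you count the received items by hand; the item keys (such as `hard_drive_3_2tb`) and the functions `expected_totals` and `shortages` are hypothetical names for illustration, not part of any official tooling.

```python
from typing import Dict

# Per-node/server quantities listed under "Hardware Requirements".
# The key names are hypothetical labels chosen for this sketch.
PER_NODE: Dict[str, int] = {
    "hard_drive_3_2tb": 9,   # 3.2 TB 2.5-inch drives in Dell trays
    "psu_800w": 1,           # 800 W power supply
    "c13_c14_cord_6ft": 1,   # 6 ft C13/C14 power cord
}
SUPPORTED_NODE_COUNTS = (14, 28)


def expected_totals(node_count: int) -> Dict[str, int]:
    """Total quantity of each item expected for a 14- or 28-node delivery."""
    if node_count not in SUPPORTED_NODE_COUNTS:
        raise ValueError(f"expected 14 or 28 nodes, got {node_count}")
    return {item: qty * node_count for item, qty in PER_NODE.items()}


def shortages(node_count: int, received: Dict[str, int]) -> Dict[str, int]:
    """Return {item: quantity missing} for every item that came up short."""
    missing: Dict[str, int] = {}
    for item, expected in expected_totals(node_count).items():
        short = expected - received.get(item, 0)
        if short > 0:
            missing[item] = short
    return missing


if __name__ == "__main__":
    counted = {"hard_drive_3_2tb": 250, "psu_800w": 28, "c13_c14_cord_6ft": 27}
    print(expected_totals(28))
    print(shortages(28, counted))  # anything listed here should be reported
```

For a 28-node delivery this expects 252 drives, 28 power supplies, and 28 power cords, so the example counts report 2 missing drives and 1 missing cord.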

Prepare the cabinet(s) for hardware installation

  1. At the rear of the cabinet(s), verify that you can reach the rear of the nodes/servers to install the 2nd power supply.
  2. Check whether existing networking or power cables are blocking access to the open power supply slot.
  3. At the front of the cabinet(s), if bezels are installed, verify that they are unlocked; if they are locked, make sure a key is available to remove them.

Shutting down the node/server

  1. In preparation for installing the 2nd power supply, locate a “crash cart” or KVM setup (usually available in your data center) and connect its VGA and USB keyboard cables to the 1st node/server at the rear of the cabinet.
  2. To gracefully shut down the node/server, go to the front of the cabinet and press and hold the green LED power button for exactly 2 seconds. The picture below shows where to turn off the server:
    screenshot
  3. Return to the rear of the cabinet, check the monitor of the crash cart connected to the server, and watch for the server's shutdown messages, as shown below:
    screenshot
  4. Possible errors:
    • If the monitor does not show the shutdown messages but still has a signal and displays a login prompt, you did not hold the button down for 2 seconds. Please try again.
    • If the screen is dark with no signal from the node/server, you held the button down for longer than 2 seconds, causing a forced power-down. This can corrupt the operating system, so avoid it whenever possible.
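The two error cases above amount to a small decision table based on what the crash cart monitor shows. The Python sketch below encodes that table, assuming you record three yes/no observations by hand; `classify_shutdown` and `ShutdownResult` are hypothetical names for illustration, not part of any Dell or Internet Computer tooling.

```python
from enum import Enum


class ShutdownResult(Enum):
    GRACEFUL = "shutdown messages seen; continue with the power supply"
    RETRY = "button not held for 2 seconds; press and hold again"
    FORCED = "forced power-down; the operating system may be corrupted"
    UNKNOWN = "no clear result yet; keep watching the crash cart monitor"


def classify_shutdown(saw_shutdown_messages: bool,
                      has_signal: bool,
                      login_prompt_visible: bool) -> ShutdownResult:
    """Map what the crash cart monitor shows to the cases in this runbook."""
    if saw_shutdown_messages:
        return ShutdownResult.GRACEFUL
    if has_signal and login_prompt_visible:
        # Still running: the press was too short to trigger a shutdown.
        return ShutdownResult.RETRY
    if not has_signal:
        # Dark screen without shutdown messages: the press was held too long.
        return ShutdownResult.FORCED
    return ShutdownResult.UNKNOWN


if __name__ == "__main__":
    print(classify_shutdown(False, True, True).value)
```

The shutdown messages are checked first because a graceful shutdown also ends with a dark screen; only a dark screen without those messages points to a forced power-down.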

Installing the 2nd power supply

  1. At the rear of the cabinet, locate the open power supply slot for the first node/server:
    screenshot
  2. Remove the blank:
    screenshot
    screenshot
  3. Take one of the power supplies, orient it so the orange release tab is on the right, and prepare to insert it into the open slot:
    screenshot
  4. Insert the power supply into the open slot and push it in until you hear the “click” of the orange release tab locking the power supply into place:
    screenshot
    screenshot
  5. If the power supply doesn’t seat properly, check the rear of the power supply for any damage or warping of the power contacts at the bottom. If they appear bent, gently straighten or flatten the contacts and try inserting it again. If the power supply still fails to seat properly, set it aside, mark it as damaged, and try another power supply; then contact Dell support and open a ticket to repair the node/server.
  6. With the power supply installed, locate a C13/C14 power cord and prepare to connect it to the new power supply. All ICR cabinets should be installed with A/B (redundant) power, meaning that two Power Distribution Units (PDUs) in the cabinet supply power to the nodes/servers.
  7. When connecting the 2nd power supply, make sure to connect its power cable to the PDU that is not used by the 1st power supply of the node/server. Verify this by tracing the 1st power supply’s cable to the PDU it is connected to.
  8. Insert the “male” end of the power cable into an available outlet on that PDU:
    screenshot
    screenshot
  9. Insert the “female” end of the power cable into the 2nd power supply you just installed:
    screenshot
    screenshot
  10. The green LED on the power supply should light up, indicating a good power supply and a good connection to the PDU. If the LED does not turn green or does not light up at all, try the following troubleshooting steps in this order:
    • Try another outlet on the same PDU.
    • Replace the power cable with a different one and reconnect power.
    • Remove the power cable, then remove the power supply (press the orange release tab to the left and pull on the plastic LED loop, pulling straight back until the power supply is clear of the node/server). Insert another available power supply and reattach the power cable, repeating the steps above.
    If none of these steps activate the 2nd power supply, remove the power supply and power cable from the node/server, mark this hardware as “Questionable” (Q), and save it for use with other nodes/servers.
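When cabling a full cabinet, it can help to record which PDU each power supply traces back to and check the A/B rule from step 7 for every node/server in one pass. The Python sketch below assumes a hand-entered mapping from node name to the PDUs feeding its 1st and 2nd power supplies; the node names, PDU labels, and `check_ab_power` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record: node name -> (PDU feeding the 1st PSU, PDU feeding the 2nd PSU).
# Use None for the 2nd PSU if it could not be brought online.
Cabling = Dict[str, Tuple[str, Optional[str]]]


def check_ab_power(cabling: Cabling) -> Dict[str, List[str]]:
    """Return nodes that violate the A/B power rule, grouped by problem."""
    problems: Dict[str, List[str]] = {"same_pdu": [], "no_second_psu": []}
    for node in sorted(cabling):
        pdu_1, pdu_2 = cabling[node]
        if pdu_2 is None:
            problems["no_second_psu"].append(node)
        elif pdu_1 == pdu_2:
            # Both supplies on one PDU: losing that PDU takes the node down.
            problems["same_pdu"].append(node)
    return problems


if __name__ == "__main__":
    cabling = {
        "node-01": ("PDU-A", "PDU-B"),
        "node-02": ("PDU-A", "PDU-A"),
        "node-03": ("PDU-B", None),
    }
    print(check_ab_power(cabling))
```

In this example, node-02 needs its 2nd power cable moved to the other PDU, and node-03 is one whose 2nd power supply could not be activated.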

Installing the hard drives

  1. Go to the front of the cabinet and locate the server with the crash cart connected. If a bezel is installed, continue with the next step. If not, skip to step 4.
    screenshot
  2. Locate the bezel release button on the left side and press it:
    screenshot
  3. After pressing the button, grab the bezel and pull it slightly forward and to the left to release it from the front/right side of the node/server:
    screenshot
  4. The drive bays should contain 1 installed 3.2 TB hard drive and 9 open slots with “blanks” installed:
    screenshot
  5. Starting on the left with slot 1, press the two release tabs on the front of the blank with your thumb and forefinger and pull out the blank. Repeat this for slots 2-9:
    screenshot
    screenshot
  6. Locate your hard drives, and pick out a total of 9 for this server:
    screenshot
  7. Remove each hard drive from its static packaging and prepare it for insertion:
    screenshot
  8. Take the first drive and press the orange drive release button:
    screenshot
    screenshot
  9. Gently align the left and right tabs of the hard drive tray with the open slot and insert the drive halfway:
    screenshot
  10. Gently push the hard drive the rest of the way, just under the locking tab, until you feel it seat on the backplane; the locking tab will begin to close slightly:
    screenshot
  11. Now press on the locking tab to fully seat the hard drive into the node/server:
    screenshot
    screenshot
  12. Repeat the last four steps for the remaining 8 hard drives. If any hard drive does not seat properly in its slot, try the following troubleshooting steps:
    • Make sure the hard drive is installed properly in its tray. If it is not, remove the hard drive and reinstall it correctly in the tray.
    • Make sure there is no damage to the backplane for that slot. If there is damage, please contact Dell support and open a ticket to repair the node/server.
    • The slot and hard drive tray can warp. As you reinsert the hard drive, check for resistance and adjust the angle of insertion very slightly up and down to see whether the hard drive seats on the backplane properly.
    • Try a different hard drive in the problem slot. If the new hard drive works properly, set aside the previous faulty drive and contact Dell support to open a ticket regarding the failed drive.
    screenshot

Validating that the hard drives are visible in the node/server BIOS

  1. Power up the server with the newly installed hard drives:
    screenshot
  2. Return to the rear of the cabinet and watch the monitor for the node/server boot messages:
    screenshot
    screenshot
  3. When the menu options appear, press “F2” on the keyboard to enter the node/server BIOS:
    screenshot
  4. From the main BIOS menu, press the down arrow twice to select “Device Settings” and press Enter:
    screenshot
  5. This menu lists all the devices associated with the node/server. Press the down arrow until you reach the bottom of the screen. You should see an NVMe SSD installed in each of slots 0-9 (slots 1-9 are the hard drives you just installed). If you do not see hard drives in slots 1-9, try the following troubleshooting steps:
    • Go to the front of the node/server and hold down the green power button until the node/server shuts down completely. Press the orange release button, pull the locking tab of the hard drive in question, and remove the hard drive. Then reinsert the hard drive following the prior instructions. Power the node/server back on, re-enter the BIOS, and check whether the hard drive appears.
    • If the hard drive still doesn’t appear in the BIOS, power down the node/server again, remove the faulty hard drive, and insert a different hard drive in its place. Power the node/server back on, re-enter the BIOS, and check whether the replacement appears.
    • If the replacement hard drive also fails to appear in that slot, open a ticket with Dell Support and note the issue with this server.
    screenshot
    screenshot
  6. Using the up arrow, highlight the hard drive in slot 1:
    screenshot
  7. Press enter:
    screenshot
  8. Press Enter again, then press the down arrow twice. Make sure that the PCIe Maximum Link Speed and PCIe Negotiated Link Speed both show 16.0 GT/s. If they do not, please contact [email protected] for troubleshooting instructions:
    screenshot
  9. Once you have verified that all of the hard drives are reporting properly, or have identified any faulty drives for follow-up, repeat the procedure above for each remaining node/server. Make sure to reinstall the bezel on each server if applicable. If you have any questions about this process, please reach out to [email protected] and they will direct you to the appropriate engineering teams.
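If you record what the BIOS Device Settings screen shows for each server, the checks in steps 5 through 8 reduce to two rules: every slot from 0 to 9 must list an NVMe SSD, and each drive must report 16.0 GT/s for both the maximum and negotiated link speed. The Python sketch below is a minimal example, assuming you read the link speeds for every slot and enter them by hand keyed by slot number; `validate_drives`, `DriveReport`, and the data layout are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

# Slot 0 holds the drive that was already installed; slots 1-9 hold the new drives.
EXPECTED_SLOTS = range(10)
EXPECTED_GTS = 16.0  # PCIe link speed, in GT/s, expected for both readings

# Hypothetical record: slot -> (PCIe Maximum Link Speed, PCIe Negotiated Link Speed)
Readings = Dict[int, Tuple[float, float]]


class DriveReport(NamedTuple):
    missing_slots: List[int]  # slots with no NVMe SSD listed in the BIOS
    slow_slots: List[int]     # slots where either link speed is not 16.0 GT/s


def validate_drives(readings: Readings) -> DriveReport:
    """Compare hand-entered BIOS readings against the expected slots and speed."""
    missing = [slot for slot in EXPECTED_SLOTS if slot not in readings]
    slow = [
        slot
        for slot, (max_gts, negotiated_gts) in sorted(readings.items())
        if max_gts != EXPECTED_GTS or negotiated_gts != EXPECTED_GTS
    ]
    return DriveReport(missing, slow)


if __name__ == "__main__":
    readings = {slot: (16.0, 16.0) for slot in EXPECTED_SLOTS}
    del readings[7]            # drive in slot 7 not listed in the BIOS
    readings[3] = (16.0, 8.0)  # drive in slot 3 negotiated a lower speed
    print(validate_drives(readings))
```

A missing slot maps to the reseat/replace troubleshooting in step 5, while a slow slot is the case to raise with the contact in step 8.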