Difference between revisions of "Troubleshooting Unhealthy Nodes"
From Internet Computer Wiki
Katie.peters (Adding redeployment steps)
Katie.peters m (Adding a bit more explanation)
Revision as of 14:35, 11 May 2023
Steps to take when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:
- Ensure that the server is powered on.
- Ensure that all link lights for active network interfaces are on.
  - If any link lights are off, check for failed cables by swapping them out for known good cables as needed.
- Hook up a crash cart, check the screen for errors, and troubleshoot as needed.
- Contact Dell if hardware issues are found or suspected.
  - If Dell requires a TSR log, see IDRAC access and TSR logs.
  - Updating the firmware might also resolve the issue.
- If no known error is found, please redeploy the node with a fresh IC-OS image.
  - The deployment process identifies and fixes many software issues.
  - If an old IC-OS image is used, the node will appear healthy at first, but it will not be able to catch up to the blockchain, so it will fall behind and become unhealthy again. A current IC-OS image must therefore be used.
  - At the end, obtain the new principal ID for the node from the crash cart screen. Then search for the node's principal on the IC dashboard (https://dashboard.internetcomputer.org/nodes) to verify that the node is healthy.
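
The checklist above can also be read as a simple decision flow. The following is a minimal sketch of that flow in Python, not an official tool: the `NodeObservations` class, its field names, and the `next_steps` function are all hypothetical and only encode the steps listed on this page. It assumes an operator records the on-site observations by hand and wants the resulting list of actions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeObservations:
    """On-site observations for one server (hypothetical field names)."""
    powered_on: bool
    link_lights_on: bool            # all active network interfaces show link
    crash_cart_errors: bool         # errors visible on the crash cart screen
    hardware_issue_suspected: bool
    image_is_current: bool          # the IC-OS image at hand is a current one


def next_steps(obs: NodeObservations) -> List[str]:
    """Return the actions suggested by the checklist, in order."""
    steps: List[str] = []
    if not obs.powered_on:
        steps.append("Power on the server.")
    if not obs.link_lights_on:
        steps.append("Swap suspect cables for known good cables.")
    if obs.crash_cart_errors:
        steps.append("Troubleshoot the errors shown on the crash cart screen.")
    if obs.hardware_issue_suspected:
        steps.append("Contact Dell; provide a TSR log if requested and "
                     "consider updating the firmware.")
    if not steps:
        # No known error was found, so the checklist falls through to redeployment.
        if not obs.image_is_current:
            steps.append("Obtain a current IC-OS image before redeploying.")
        steps.append("Redeploy the node with a fresh IC-OS image.")
        steps.append("Read the new principal ID from the crash cart screen "
                     "and look it up on the IC dashboard.")
    return steps
```

For example, `next_steps(NodeObservations(True, True, False, False, False))` describes a server with no visible fault and an outdated image, so the returned list starts with obtaining a current image and then covers redeployment and verification on the dashboard.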