Unhealthy Nodes

From Internet Computer Wiki

Revision as of 20:39, 12 July 2023

Steps to take when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:

  • Ensure that the server is powered on.
  • Ensure that all link lights for active network interfaces are on.
    • If any link lights are off, check for failed cables by swapping them out for known good cables as needed.
  • Hook up a crash cart, check for errors on the screen, and troubleshoot as needed.
  • Contact your hardware vendor if hardware issues are found or suspected.
  • Consider updating the firmware if it has not been updated in a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
  • If no known error is found, please redeploy the node with a fresh IC-OS image.
    • The deployment process identifies/fixes many software issues.
    • Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. Thus, a current IC-OS image must be used.
    • At the end, obtain the new principal ID for the node from the crash cart screen.
    • Then search for the node's principal on the IC dashboard to verify that the node is healthy.

Use the dashboard to verify that the node is healthy

  • The node count shown for your data center should match the number of nodes you have deployed in that data center.
  • Look for the principal ID of the node you are servicing and make sure that its status is "Awaiting Subnet."
  • If the node isn't listed at all, then it needs to be redeployed with a fresh IC-OS image.


If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then the original issue still exists. Troubleshoot the hardware, upgrade the firmware, etc. to resolve the issue.
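The dashboard checks above can be summarized as a small decision function. This is only an illustrative sketch: the function name `next_action` and its return strings are hypothetical, the status values are simply the ones quoted on this page, and the status is assumed to be read off the dashboard by hand (no dashboard API is used or implied).

```python
from typing import Optional


def next_action(previous_status: Optional[str], current_status: Optional[str]) -> str:
    """Map dashboard observations to the next step described on this page.

    A status of None means the node is not listed on the dashboard.
    Both values are assumed to be read manually from the dashboard.
    """
    if current_status is None:
        # Node is not listed at all: it needs to be redeployed.
        return "redeploy with a fresh IC-OS image"
    if current_status == "Awaiting Subnet":
        return "healthy: no action needed"
    if current_status == "Offline" and previous_status == "Awaiting Subnet":
        # Node was healthy for a while, then dropped out again.
        return "original issue still exists: troubleshoot hardware, upgrade firmware"
    # Assumption: any other combination is outside what this page covers.
    return "not covered here: see the status explanations"


print(next_action("Awaiting Subnet", "Offline"))
```

Keeping the previous status as an input makes the "healthy, then Offline" case explicit, since that transition is what signals an unresolved underlying problem.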

Orchestrator Started message on screen

This message is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:

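  • Check the dashboard (https://dashboard.internetcomputer.org/) for the status of the node. (Status explanations are at https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard.) Use the principal ID that was assigned to the node when it was onboarded to identify it.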
  • If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
    • If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or rebooting to see if the node joins. This works if a recent IC-OS installation was successful and only the registration and joining were interrupted.
    • If you have not recently installed a current IC-OS image, then do not insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, consider upgrading the firmware if it is running old versions, and then redeploy the node with a fresh, current IC-OS image. Redeployment assigns a new principal to the node, which you can use to identify it on the dashboard.
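For a node that is not visible on the dashboard, the two cases above could be written down as follows. This is a hedged sketch for illustration only: `unregistered_node_steps` and its parameter are hypothetical names, and whether a "current" image was "recently" installed is a judgment the operator makes, not something the code can detect.

```python
from typing import List


def unregistered_node_steps(recently_installed_current_image: bool) -> List[str]:
    """Return the steps this page suggests for a node missing from the dashboard."""
    if recently_installed_current_image:
        # Installation likely succeeded; only registration/joining was interrupted.
        return [
            "insert the HSM and/or reboot",
            "check the dashboard to see whether the node joins",
        ]
    # An old image would rejoin and then fail again, so do not register it.
    return [
        "do not insert the HSM",
        "consider upgrading the firmware if it is running old versions",
        "redeploy the node with a fresh, current IC-OS image",
        "use the new principal ID to find the node on the dashboard",
    ]


for step in unregistered_node_steps(recently_installed_current_image=False):
    print("-", step)
```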


All Node Provider Troubleshooting links