Difference between revisions of "Unhealthy Nodes"

From Internet Computer Wiki
(Updated the whole page, added more info from Radek.)
Line 1: Line 1:
Steps to take when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:
+
=== '''Use [https://dashboard.internetcomputer.org/centers the dashboard] to verify that the node is healthy''' ===
* Ensure that the server is powered on.
+
*The node count for your data center should match the number of nodes in that data center.
* Ensure that all link lights for active network interfaces are on.
+
* Look for the principal ID for the node which you are servicing. Status explanations are [https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard here].
** If any link lights are off, check for failed cables by swapping them out for known good cables as needed.
+
* If the node isn't listed at all, then it needs to be [[IC-OS Installation Runbook|redeployed with a fresh IC-OS image]].
*Hook up a crash cart and check for errors on the screen, troubleshoot as needed.
 
* Contact your hardware vendor if hardware issues are found or suspected.
 
** If your vendor requires a TSR log, see [[IDRAC access and TSR logs]] for an example of how to retrieve one from a Dell server.
 
** [[Updating_Firmware|Updating the firmware]] might also resolve the issue.
 
* Consider [[Updating Firmware|updating the firmware]] if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
 
* If no known error is found, please [[IC-OS Installation Runbook|redeploy the node with a fresh IC-OS image]].
 
** The deployment process identifies/fixes many software issues.
 
** Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. Thus, a current IC-OS image ''must'' be used.
 
** At the end, obtain the new principal ID for the node from the crash cart screen.
 
** Then search for the node's principal on the [https://dashboard.internetcomputer.org/nodes IC dashboard] to verify that the node is healthy.
 
  
==== '''Use [https://dashboard.internetcomputer.org/centers the dashboard] to verify that the node is healthy''' ====
+
=== '''"Orchestrator Started" message''' ===
 
+
This message is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:
* The node count for your data center should match the number of nodes in that data center.
 
* Look for the principal ID for the node which you are servicing and make sure that it is "Awaiting Subnet" status
 
* If the node isn't listed at all, then it needs to be [[IC-OS Installation Runbook|redeployed with a fresh IC-OS image]].
 
  
 +
*'''Check [https://dashboard.internetcomputer.org/ the dashboard]''' for the status of the node. (Status explanations are [https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard here].) Use the principal ID that was assigned to the node when it was onboarded to identify it.
 +
*If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
 +
**If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or rebooting to see if it joins. This would work if a recent IC-OS installation was successful and only the registration and joining were interrupted.
 +
**If you have ''not'' recently installed a current IC-OS image, then do ''not'' insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, you should consider [[Updating Firmware|upgrading the firmware]] if the node is running old versions, and then redeploy the node with [[IC-OS Installation Runbook|a fresh/current IC-OS image]] (which will assign a new principal to the node so that you can identify it in the dashboard).
  
'''If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then the original issue still exists.''' Troubleshoot hardware, upgrade firmware, etc., to resolve the issue.
+
=== '''Troubleshooting Steps''' ===
 
+
These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:
==== '''Orchestrator Started message on screen''' ====
 
This message is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:
 
  
* '''Check [https://dashboard.internetcomputer.org/ the dashboard]''' for the status of the node. (Status explanations are [https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard here].) Use the principal ID that was assigned to the node when it was onboarded to identify it.
+
#Verify that the server is up and running:
* If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
+
#*Check the power status of the server.
** If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or rebooting to see if it joins. This would work if a recent IC-OS installation was successful and only the registration and joining were interrupted.
+
#*Check if the server is displaying any error messages or indicators.
** If you have ''not'' recently installed a current IC-OS image, then do ''not'' insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, you should consider [[Updating Firmware|upgrading the firmware]] if the node is running old versions, and then redeploy the node with [[IC-OS Installation Runbook|a fresh/current IC-OS image]] (which will assign a new principal to the node so that you can identify it in the dashboard).
+
#*If possible, access the server remotely or physically to ensure it is functioning properly.
 +
#Verify the cabling and port status on the switch:
 +
#*Check the physical connection of the network cable between the server and the switch.
 +
#*Ensure that the cable is securely plugged into the correct port on both ends.
 +
#*Look for any signs of damage or loose connections.
 +
#*Test the connectivity by trying a different network cable or using the same cable on a different port.
 +
#Hook up a crash cart, check for errors on the screen, and troubleshoot as needed.
 +
#Check for recent port flaps, link failures, or any other activity that might have caused the problem:
 +
#*Check the logs or monitoring systems for any indications of port flapping or link failures.
 +
#*Investigate any recent changes or activities that could have affected the network connection.
 +
#*Consider any software updates, configuration changes, or physical alterations made recently.
 +
#Re-seat the cable, breakout, SFP, or QSFP leading to the affected machine:
 +
#*Disconnect and reconnect the network cable at both ends (server and switch).
 +
#*If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
 +
#*Ensure a secure and proper connection is established.
 +
#Check with the switch vendor:
 +
#*If the issue persists, contact the switch vendor's support team for further assistance.
 +
#*Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
 +
#*Follow vendor guidance to troubleshoot and resolve the issue.
 +
#**If your vendor requires a TSR log, see [[IDRAC access and TSR logs]] for an example of how to retrieve one from a Dell server.
 +
#**[[Updating_Firmware|Updating the firmware]] might also resolve the issue.
 +
# Consider [[Updating Firmware|updating the firmware]] if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
 +
# If no known error is found, please [[IC-OS Installation Runbook|redeploy the node with a fresh IC-OS image]].
 +
#* The deployment process identifies/fixes many software issues.
 +
#* Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. '''Thus, a current IC-OS image ''must'' be used.'''
 +
#* At the end, obtain the new principal ID for the node from the crash cart screen so you can check the dashboard status.
 +
#* '''If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then the original issue still exists.''' Troubleshoot hardware, upgrade firmware, etc., to resolve the issue.
  
 
  
[[Node Provider Troubleshooting|All Node Provider Troubleshooting links]]
+
Back to [[Node Provider Troubleshooting]]
 +
 
 +
Back to [[Node Provider Documentation]]

Revision as of 18:37, 14 September 2023

Use the dashboard to verify that the node is healthy

  • The node count shown on the dashboard for your data center should match the number of nodes you have deployed in that data center.
  • Look for the principal ID of the node you are servicing. Status explanations are listed under "Node Status on the Dashboard" on the Node Provider Troubleshooting page.
  • If the node isn't listed at all, then it needs to be redeployed with a fresh IC-OS image.
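
The three checks above can be sketched in Python. This is only an illustration: the `DashboardNode` record, the `check_data_center` function, and the idea of having the dashboard's node list available as Python objects are all assumptions made for the example, not part of any dashboard API. In practice the values would be read off the dashboard by hand.

```python
from dataclasses import dataclass


@dataclass
class DashboardNode:
    """Hypothetical record for one node as listed on the dashboard."""
    principal_id: str
    status: str  # status text as displayed, e.g. "Awaiting Subnet"


def check_data_center(listed, expected_count, principal_id):
    """Return human-readable findings for one data center.

    listed: DashboardNode records shown for the data center.
    expected_count: number of nodes you have deployed there.
    principal_id: principal ID of the node being serviced.
    """
    findings = []
    # Check 1: the listed node count should match the deployed count.
    if len(listed) != expected_count:
        findings.append(
            f"node count mismatch: dashboard lists {len(listed)}, "
            f"expected {expected_count}"
        )
    # Check 2: find the node being serviced by its principal ID.
    match = next((n for n in listed if n.principal_id == principal_id), None)
    if match is None:
        # Check 3: a node that is not listed needs to be redeployed.
        findings.append("node not listed: redeploy with a fresh IC-OS image")
    else:
        findings.append(f"node status: {match.status}")
    return findings
```

For example, calling `check_data_center(nodes, 3, "zzz")` with only two listed nodes, neither of them `"zzz"`, returns both a count mismatch and a redeploy finding.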

"Orchestrator Started" message

This message is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:

  • Check the dashboard for the status of the node. (Status explanations are listed under "Node Status on the Dashboard" on the Node Provider Troubleshooting page.) Use the principal ID that was assigned to the node when it was onboarded to identify it.
  • If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
    • If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or rebooting to see if it joins. This would work if a recent IC-OS installation was successful and only the registration and joining were interrupted.
    • If you have not recently installed a current IC-OS image, then do not insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, you should consider upgrading the firmware if the node is running old versions, and then redeploy the node with a fresh/current IC-OS image (which will assign a new principal to the node so that you can identify it in the dashboard).
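
The branching in the list above can be written out as a small decision function. This is a sketch to make the logic explicit; the function name `next_action` and its two boolean inputs are hypothetical, and the returned strings only paraphrase the guidance above.

```python
def next_action(visible_on_dashboard: bool, recent_current_image: bool) -> str:
    """Suggest the next step after seeing "Orchestrator Started".

    visible_on_dashboard: the node's principal ID appears on the dashboard.
    recent_current_image: a current IC-OS image was installed recently.
    """
    if visible_on_dashboard:
        # The node has registered; its dashboard status tells you the rest.
        return "check the node status on the dashboard"
    if recent_current_image:
        # Installation likely succeeded; only registration was interrupted.
        return "insert the HSM and/or reboot to see if the node joins"
    # An old image would only fail again, so do not let the node rejoin.
    return (
        "do not insert the HSM; consider a firmware upgrade, "
        "then redeploy with a current IC-OS image"
    )
```

The order of the checks matters: dashboard visibility is tested first because a registered node never needs the HSM step.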

Troubleshooting Steps

These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:

  1. Verify that the server is up and running:
    • Check the power status of the server.
    • Check if the server is displaying any error messages or indicators.
    • If possible, access the server remotely or physically to ensure it is functioning properly.
  2. Verify the cabling and port status on the switch:
    • Check the physical connection of the network cable between the server and the switch.
    • Ensure that the cable is securely plugged into the correct port on both ends.
    • Look for any signs of damage or loose connections.
    • Test the connectivity by trying a different network cable or using the same cable on a different port.
  3. Hook up a crash cart, check for errors on the screen, and troubleshoot as needed.
  4. Check for recent port flaps, link failures, or any other activity that might have caused the problem:
    • Check the logs or monitoring systems for any indications of port flapping or link failures.
    • Investigate any recent changes or activities that could have affected the network connection.
    • Consider any software updates, configuration changes, or physical alterations made recently.
  5. Re-seat the cable, breakout, SFP, or QSFP leading to the affected machine:
    • Disconnect and reconnect the network cable at both ends (server and switch).
    • If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
    • Ensure a secure and proper connection is established.
  6. Check with the switch vendor:
    • If the issue persists, contact the switch vendor's support team for further assistance.
    • Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
    • Follow vendor guidance to troubleshoot and resolve the issue.
  7. Consider updating the firmware if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
  8. If no known error is found, please redeploy the node with a fresh IC-OS image.
    • The deployment process identifies/fixes many software issues.
    • Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. Thus, a current IC-OS image must be used.
    • At the end, obtain the new principal ID for the node from the crash cart screen so you can check the dashboard status.
    • If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then the original issue still exists. Troubleshoot hardware, upgrade firmware, etc., to resolve the issue.
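
The last point, a node that looks healthy and later goes "Offline", can be spotted from a series of status observations. The sketch below assumes you have noted the dashboard status by hand at intervals, oldest first; the function name `regressed_to_offline` is hypothetical, and only the two status strings come from the text above.

```python
def regressed_to_offline(history):
    """Return True if "Awaiting Subnet" is later followed by "Offline".

    history: dashboard status strings for one node, oldest first.
    """
    seen_healthy = False
    for status in history:
        if status == "Awaiting Subnet":
            seen_healthy = True
        elif status == "Offline" and seen_healthy:
            # The node was healthy and then dropped: the original
            # problem (hardware, firmware, old image) is still present.
            return True
    return False
```

A result of `True` means that redeploying alone did not fix the root cause, so hardware and firmware should be examined again.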

Back to Node Provider Troubleshooting

Back to Node Provider Documentation