Troubleshooting Unhealthy Nodes
From Internet Computer Wiki
Revision as of 18:37, 14 September 2023 by Katie.peters (talk | contribs) (Updated the whole page, added more info from Radek.)
Use the dashboard to verify that the node is healthy
- The node count shown for your data center should match the number of nodes you have deployed in that data center.
- Look for the principal ID for the node which you are servicing. Status explanations are here.
- If the node isn't listed at all, then it needs to be redeployed with a fresh IC-OS image.
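The checks above can be sketched as a small helper. This is only an illustration: the function name, its parameters, and the idea of copying the principal IDs from the dashboard into a list by hand are assumptions, not part of any IC tooling.

```python
def check_dashboard_listing(listed_principals, expected_node_count, principal_id):
    """Return findings for one node, based on data copied by hand from the dashboard.

    All names here are hypothetical; nothing in this sketch queries the dashboard.
    """
    findings = []
    # The number of nodes listed for the data center should match what is deployed.
    if len(listed_principals) != expected_node_count:
        findings.append(
            f"node count mismatch: dashboard lists {len(listed_principals)}, "
            f"expected {expected_node_count}"
        )
    # A node that is not listed at all needs a redeployment.
    if principal_id not in listed_principals:
        findings.append("node not listed: redeploy with a fresh IC-OS image")
    return findings
```

An empty result means both checks passed; the node's status on the dashboard still has to be read separately.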
"Orchestrator Started" message
This message is not an error, nor is it confirmation that the node is running properly. Whether the node is healthy must be determined in other ways:
- Check the status of the node on the dashboard. (Status explanations are here.) To identify the node, use the principal ID that was assigned to it when it was onboarded.
- If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
- If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or rebooting to see if the node joins. This can work if a recent IC-OS installation was successful and only the registration and joining were interrupted.
- If you have not recently installed a current IC-OS image, then do not insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, consider upgrading the firmware if the node is running old versions, and then redeploy the node with a fresh, current IC-OS image. Redeployment assigns a new principal to the node, which you can use to identify it in the dashboard.
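The decision described above can be summarized as a short sketch. The enum, the function, and the two boolean inputs are hypothetical names chosen for illustration; they only encode the guidance in this section and do not correspond to any existing tool.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the next step suggested by this section.
    CHECK_DASHBOARD_STATUS = "check the node status on the dashboard"
    INSERT_HSM_OR_REBOOT = "insert the HSM and/or reboot to see if the node joins"
    REDEPLOY_CURRENT_IMAGE = "do not insert the HSM; redeploy with a current IC-OS image"


def next_action(visible_on_dashboard: bool, recently_installed_current_image: bool) -> Action:
    """Map the operator's two observations to the suggested next step."""
    if visible_on_dashboard:
        # The node has registered, so its dashboard status is the thing to read.
        return Action.CHECK_DASHBOARD_STATUS
    if recently_installed_current_image:
        # Only registration and joining may have been interrupted.
        return Action.INSERT_HSM_OR_REBOOT
    # Rejoining with an old image would only fail again.
    return Action.REDEPLOY_CURRENT_IMAGE
```

For example, `next_action(False, False)` returns `Action.REDEPLOY_CURRENT_IMAGE`, matching the advice not to insert the HSM when the installed image is old.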
Troubleshooting Steps
These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:
- Verify that the server is up and running:
- Check the power status of the server.
- Check if the server is displaying any error messages or indicators.
- If possible, access the server remotely or physically to ensure it is functioning properly.
- Verify the cabling and port status on the switch:
- Check the physical connection of the network cable between the server and the switch.
- Ensure that the cable is securely plugged into the correct port on both ends.
- Look for any signs of damage or loose connections.
- Test the connectivity by trying a different network cable or using the same cable on a different port.
- Hook up a crash cart, check for errors on the screen, and troubleshoot as needed.
- Check for recent port flaps, link failures, or any other activity that might have caused the problem:
- Check the logs or monitoring systems for any indications of port flapping or link failures.
- Investigate any recent changes or activities that could have affected the network connection.
- Consider any software updates, configuration changes, or physical alterations made recently.
- Try to perform a re-seat of cable/breakout/SFP/QSFP toward the affected machine:
- Disconnect and reconnect the network cable at both ends (server and switch).
- If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
- Ensure a secure and proper connection is established.
- Check with the switch vendor:
- If the issue persists, contact the switch vendor's support team for further assistance.
- Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
- Follow vendor guidance to troubleshoot and resolve the issue.
- If your vendor requires a TSR log, see IDRAC access and TSR logs for an example of how to retrieve one from a Dell server.
- Updating the firmware might also resolve the issue.
- Consider updating the firmware if it has been a long time since the last update, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
- If no known error is found, please redeploy the node with a fresh IC-OS image.
- The deployment process identifies/fixes many software issues.
- Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. Thus, a current IC-OS image must be used.
- At the end, obtain the new principal ID for the node from the crash cart screen so you can check the dashboard status.
- If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then the original issue still exists. Troubleshoot the hardware, upgrade the firmware, etc. to resolve it.
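The last point can be illustrated with a minimal sketch that looks at a sequence of statuses an operator has noted down over time. The function name and the idea of keeping such a list are assumptions for illustration; only the "Awaiting Subnet" and "Offline" status names come from this page.

```python
def regressed_to_offline(status_history):
    """Return True if the node was once "Awaiting Subnet" and is now "Offline".

    status_history is a hypothetical, chronologically ordered list of dashboard
    statuses recorded by hand (oldest first).
    """
    if not status_history or status_history[-1] != "Offline":
        return False
    # Any earlier healthy observation means the original issue has resurfaced.
    return "Awaiting Subnet" in status_history[:-1]
```

A `True` result suggests that redeployment alone did not fix the node and that hardware or firmware should be investigated next.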