Difference between revisions of "Troubleshooting Unhealthy Nodes"

Revision as of 17:29, 17 January 2024

Use the dashboard to verify that the node is healthy

The node count shown on the dashboard for your data center should match the number of nodes you operate in that data center.
Look for the principal ID of the node that you are servicing. Status explanations are here.
If the node isn't listed at all, then the node needs to be redeployed with a fresh IC-OS image.

Degraded nodes statuses

>> IC_OrchestratorFlapping

Explanation: Orchestrator coordinates the execution of many IC GuestOS processes, including the IC Replica. If the orchestrator is repeatedly restarting, then the GuestOS and the Replica process likely do not operate as expected.

Possible causes:

Networking issues
Hardware issues
Software problems

Troubleshooting and remediation:

Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
Check whether there are any bandwidth limitations on these nodes, and ensure that any traffic shaping, QoS, DoS protection, etc., are disabled on the ISP side
Perform other network diagnostics checks
Inspect node logs and metrics, if possible
Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running

>> IC_Replica_Behind

Explanation: IC Replica is the main process that runs canisters (smart contracts). If the Replica process cannot catch up, then the replica (node) cannot be a productive member of the IC subnet.

Possible causes:

Networking issues
Hardware issues
Software problems

Troubleshooting and remediation:

Check if there are any hardware issues reported by the machine's BMC (Baseboard Management Controller)
Perform a firmware upgrade
Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
Check whether there are any bandwidth limitations on these nodes, and ensure that any traffic shaping, QoS, DoS protection, etc., are disabled on the ISP side
Perform other network diagnostics checks
Inspect node logs and metrics, if possible
Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running

"Orchestrator Started" message on console screen

This message shown on console screens is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:

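Check the dashboard (https://dashboard.internetcomputer.org/) for the status of the node. (Status explanations are at https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard.) Use the principal ID that was assigned to the node when it was onboarded to identify it.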
If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
- If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or a reboot to see if it joins. This would work if a recent IC-OS installation was successful and only the registration and joining was interrupted.
- If you have not recently installed a current IC-OS image, then do not insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, consider upgrading the firmware if the node is running old firmware versions, and then redeploy the node with a fresh/current IC-OS image (which will assign a new principal to the node so that you can identify it in the dashboard).
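
The decision steps above can be sketched as a small function. This is only an illustration of the logic described in this section; the function name `next_step` and its two boolean inputs are hypothetical and not part of any IC tooling.

```python
def next_step(visible_on_dashboard: bool, recent_current_image: bool) -> str:
    """Suggest the next action for a node whose console shows "Orchestrator Started".

    Both inputs are answers the operator provides by hand (hypothetical names):
    - visible_on_dashboard: the node's principal ID is listed on the dashboard
    - recent_current_image: a current IC-OS image was installed recently
    """
    if visible_on_dashboard:
        # The node has registered; its dashboard status is the source of truth.
        return "check the node status on the dashboard"
    if recent_current_image:
        # Only registration/joining may have been interrupted.
        return "insert the HSM and/or reboot, then check whether the node joins"
    # An old image would rejoin and fail again, so do not insert the HSM.
    return "do not insert the HSM; consider a firmware upgrade, then redeploy with a current IC-OS image"
```

For example, `next_step(False, False)` returns the redeploy recommendation, matching the last bullet above.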

Offline nodes

Network issues are the main reason why nodes are in an "OFFLINE" state.

Your node may not be reachable from the IC, or
you may not be able to reach other nodes of the IC from your node, or
you may not be able to reach the monitoring servers from your node.

Please refer to the Networking Troubleshooting Steps below.

Another possible reason for an OFFLINE node may be that your GuestOS failed to start due to a RAM failure.
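
When checking reachability by hand, it can help to distinguish a refused connection from a timeout, since a timeout more often points at filtering or routing while a refusal means the host answered but nothing is listening. The following is a minimal sketch using only the Python standard library; the function name `classify_tcp` is hypothetical, and the host and port to test are up to you (this page does not publish node ports).

```python
import socket


def classify_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome.

    Returns one of: "open", "refused", "timeout", "unreachable".
    Works for IPv4 and IPv6 targets (the resolver picks the address family).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no service accepts connections on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: often filtering or a routing problem.
        return "timeout"
    except OSError:
        # Any other network error, e.g. no route to host or a DNS failure.
        return "unreachable"
```

Running the same check from a machine in the rack and from a machine outside the data center helps narrow down which of the three directions listed above is failing.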

Server Troubleshooting Steps

These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:

Verify if the server is up and running:
- Check the power status of the server.
- Check if the server is displaying any error messages or indicators.
- If possible, access the server remotely or physically to ensure it is functioning properly.
Hook up a crash cart and check for errors on the screen. Troubleshoot as needed.
Consider updating the firmware if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
If no known error is found, please redeploy the node with a fresh IC-OS image.
- The deployment process identifies/fixes many software issues.
- Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. Thus, a current IC-OS image must be used.
- At the end, obtain the new principal ID for the node from the crash cart screen so you can check the dashboard status.
- If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then the original issue still exists. Troubleshoot hardware, upgrade firmware, etc., to resolve the issue.

Networking Troubleshooting Steps

Inspect network hardware in the rack for any visible signs of malfunction (e.g. red lights) or incorrect setup
Verify the cabling and port status on the switch:
- Check the physical connection of the network cable between the server and the switch.
- Ensure that the cable is securely plugged into the correct port on both ends.
- Look for any signs of damage or loose connections.
- Test the connectivity by trying a different network cable or using the same cable on a different port.
Check for recent port flaps/link failures or any other activities which might cause it:
- Check the logs or monitoring systems for any indications of port flapping or link failures.
- Investigate any recent changes or activities that could have affected the network connection.
- Consider any software updates, configuration changes, or physical alterations made recently.
Try to perform a re-seat of cable/breakout/SFP/QSFP toward the affected machine:
- Disconnect and reconnect the network cable at both ends (server and switch).
- If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
- Ensure a secure and proper connection is established.
Check with the switch vendor:
- If the issue persists, contact the switch vendor's support team for further assistance.
- Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
- Follow vendor guidance to troubleshoot and resolve the issue.
  - If your vendor requires a TSR log, see IDRAC access and TSR logs for an example of how to retrieve one from a Dell server.
  - Updating the firmware might also resolve the issue.
Utilize an auxiliary machine within the same rack with full network access to run diagnostics tools like ping, traceroute, and nmap
Work with the ISP to troubleshoot and resolve any network routing issues identified during diagnostics
Prepare for future incidents by establishing network redundancy and failover mechanisms
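
The diagnostics run from the auxiliary machine can be scripted so that every incident produces the same set of outputs. The sketch below assumes a Linux machine with the common `ping` (iputils), `traceroute`, and `nmap` packages installed; the function names are hypothetical, and the target address and port are placeholders you must supply yourself.

```python
import subprocess


def diagnostic_commands(target: str, port: int) -> list[list[str]]:
    """Build IPv6 diagnostic command lines for one target (nothing is executed here)."""
    return [
        ["ping", "-6", "-c", "4", target],               # 4 ICMPv6 echo requests
        ["traceroute", "-6", target],                    # hop-by-hop path
        ["nmap", "-6", "-Pn", "-p", str(port), target],  # state of a single TCP port
    ]


def run_diagnostics(target: str, port: int, timeout: float = 60.0) -> dict[str, str]:
    """Run each command and collect its combined output, keyed by tool name."""
    results = {}
    for cmd in diagnostic_commands(target, port):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            results[cmd[0]] = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool not installed, not permitted, or it ran longer than the timeout.
            results[cmd[0]] = f"not run: {exc}"
    return results
```

Saving the returned outputs with a timestamp gives you something concrete to share with the ISP or the switch vendor.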

Best practices

Keep a separate machine in the same rack with appropriate tools for network diagnostics and troubleshooting
Engage with the node provider community for support and to share effective troubleshooting techniques
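
A quick way to confirm that the separate machine really has the tools on hand is to look them up on the `PATH`. This is a minimal sketch; the tool list matches the tools named in this guide, but the executable names are an assumption (for example, some distributions ship `iperf3` rather than `iperf`).

```python
import shutil

# Tool names as mentioned in this guide; adjust to the executables on your system.
TOOLS = ("ping", "traceroute", "nmap", "tcpdump", "iperf")


def missing_tools(tools=TOOLS) -> list[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means every listed tool was found.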

Setting Up an Auxiliary Machine for Network Diagnostics

Robust Internet connectivity is essential. Without access to internal node logs and metrics, troubleshooting requires alternative strategies, including the use of an auxiliary machine within the same rack. Here's a brief outline for setting up an auxiliary machine in the same rack, while following best security practices:

Hardware Setup:
- Choose a server with sufficient resources to run diagnostic tools without impacting its performance. There is no need to follow the gen1/gen2 hardware requirements for this server (since this node would not be joining the IC network) but make sure the server is performant enough to run network tests.
- Ensure physical security measures are in place to prevent unauthorized access.
Operating System and Software:
- Install a secure operating system, like a minimal installation of Linux (we prefer Ubuntu 22.04), which reduces the attack surface.
- Keep the system updated with the latest security patches and firmware updates.
Network Configuration:
- Configure the machine with an IPv6 address in the same range as the IC nodes, for accurate testing.
- Set up a restrictive firewall on the machine to allow only the necessary inbound and outbound traffic. Consider allowing Internet access for this machine only during troubleshooting sessions, and keeping the machine behind a VPN at other times.
Diagnostic Tools:
- Install network diagnostic tools such as ping, traceroute, nmap, tcpdump, and iperf.
- Configure monitoring tools to simulate node activities and track responsiveness.
Security Measures:
- Use strong, unique passwords for all accounts and change them regularly. Or, preferably, do not use passwords at all, and use key-based access instead.
- Implement key-based SSH authentication and disable root login over SSH.
- Regularly review logs for any unusual activities that might indicate a security breach.
Maintenance and Updates:
- Regularly update all software to the latest versions.
- Periodically test your network diagnostic tools to ensure they are functioning as expected.
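
To check the network configuration step above, you can verify that the auxiliary machine's IPv6 address falls in the same prefix as a node's address. The sketch below uses Python's standard `ipaddress` module; the function name is hypothetical, and the default prefix length of 64 is an assumption that you should replace with the prefix actually assigned to your rack.

```python
import ipaddress


def in_same_prefix(aux_addr: str, node_addr: str, prefix_len: int = 64) -> bool:
    """Return True if aux_addr lies in the prefix that contains node_addr."""
    # strict=False lets us derive the network from a host address.
    network = ipaddress.ip_network(f"{node_addr}/{prefix_len}", strict=False)
    return ipaddress.ip_address(aux_addr) in network
```

With addresses from the documentation range, `in_same_prefix("2001:db8:1:2::10", "2001:db8:1:2::20")` is true, while an address in `2001:db8:1:3::/64` is not in the same prefix.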

Frequently Asked Questions

Q: Is the monitoring system open-sourced? How does it communicate with the nodes?

A: The monitoring system configuration is not currently open-sourced. However, the node configuration that is required for proper node operation is fully open source. For more information about node-to-node and node-to-monitoring communication, refer to the nftables configuration, which is the definitive guide to the required open ports on the Host OS, and to the similar nftables configuration for the Guest OS. You can find the configuration for DFINITY-owned DCs and Gen1 node providers here: nftables configuration. We use Victoria Metrics for metrics scraping (documentation) and vector.dev for log scraping.

Q: What are the destination IPs and Ports for Frankfurt, Chicago, and San Francisco for connectivity troubleshooting?

A: At present, detailed node/port information is not publicly accessible, as disclosing this information is considered a security risk. To effectively troubleshoot connectivity issues with your nodes, we recommend setting up a "spot instance" or a temporary virtual machine (VM) with a cloud provider in each of the geographical regions. This approach allows you to test both connectivity and connection stability to your nodes, providing a practical solution for identifying and resolving network-related issues.
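
From such a temporary VM, connection stability can be estimated by repeating a TCP connection attempt and recording how often it succeeds. The following is a minimal sketch with the Python standard library; the function name and result keys are hypothetical, and you must supply the host and port of your own node, since they are not published here.

```python
import socket
import time


def probe_stability(host: str, port: int, attempts: int = 10,
                    interval: float = 1.0, timeout: float = 3.0) -> dict:
    """Attempt several TCP connections and summarize successes and latency."""
    latencies = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies.append(time.monotonic() - start)
        except OSError:
            pass  # refused, timed out, or unreachable: counted as a failure
        if i < attempts - 1:
            time.sleep(interval)
    return {
        "attempts": attempts,
        "successes": len(latencies),
        "success_rate": len(latencies) / attempts if attempts else 0.0,
        "max_latency_s": max(latencies) if latencies else None,
    }
```

A success rate that drops only from some regions, or only at certain times of day, is a useful detail to bring to your ISP.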

Q: The dashboard shows offline or degraded status for <DC>, but everything seems fine. What should we do?

A: Check whether any of the following are enabled on the ISP side: firewall restrictions, traffic shaping, DoS protection, or QoS features. These should all be disabled for optimal node operation.

Q: Under what circumstances is a node removed from the IC network?

A: A node is removed from the IC network when it's deemed unhealthy. The determination of a node's health is made using tooling from https://github.com/dfinity/dre. This tooling assesses nodes based on various metrics and submits a proposal for their removal to maintain the highest level of decentralization possible. However, there are exceptions. For example, an unhealthy node might be temporarily retained if there are ongoing efforts to recover and restore it.

Q: How long can a node be down before it's excluded from the IC network?

A: There's no set time limit for how long a node can be down before exclusion. The decision is more qualitative and depends on the overall health of the network. Currently, the IC network can tolerate fewer than 1/3 of the nodes in a subnet being down or unhealthy. For a 13-node subnet, this means the subnet can function with up to 4 unhealthy nodes. If the number of unhealthy nodes does not exceed this threshold, a node might be left in the subnet for a longer period, especially if there are efforts underway to make it healthy again.
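
The figure of 4 unhealthy nodes in a 13-node subnet follows from the one-third rule. As a sketch, assuming the usual Byzantine fault tolerance bound that a subnet of n nodes needs n >= 3f + 1 to tolerate f faulty nodes (the bound is an assumption here; the answer above states only the one-third rule), the largest tolerated f can be computed as:

```python
def max_unhealthy(n: int) -> int:
    """Largest f with 3*f + 1 <= n, i.e. strictly fewer than one third of n nodes."""
    if n < 1:
        raise ValueError("a subnet needs at least one node")
    return (n - 1) // 3


print(max_unhealthy(13))  # 4, since 3*4 + 1 = 13
```

A fifth unhealthy node would need 3*5 + 1 = 16 nodes, which is more than 13, so 4 is the limit for a 13-node subnet.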

Q: When are nodes typically removed or replaced?

A: Node removals or replacements are currently conducted semi-manually and are typically scheduled for Mondays or Fridays. This timing allows Foundation voters to participate in the decision-making process at the start or end of the workweek. However, as this process is manually conducted, exceptions may occur based on specific circumstances.

Q: What are the future plans for node management in the IC network?

A: In the medium term, there are plans to automate node replacements. This means node swaps might occur more frequently and systematically, reducing the manual overhead and potentially enhancing the network's resilience and performance.

Note: As with all network operations, these practices are subject to change based on technological advancements and the evolving needs of the IC network. It's always good to refer to https://github.com/dfinity/dre for the most current information and tooling.


@@ Line 6: / Line 6: @@
 === '''Degraded nodes statuses''' ===
-==== IC_OrchestratorFlapping ====
+==== '''>> IC_OrchestratorFlapping''' ====
 '''Explanation:''' Orchestrator coordinates the execution of many IC GuestOS processes, including the IC Replica. If the orchestrator is repeatedly restarting, then the GuestOS and the Replica process likely do not operate as expected.
@@ Line 23: / Line 23: @@
 * Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running
-==== IC_Replica_Behind ====
+==== '''>> IC_Replica_Behind''' ====
 '''Explanation:''' IC Replica is the main process, that runs canisters (smart contracts). If the Replica process cannot catch up, then the replica (node) cannot be a productive member of the IC subnet.
@@ Line 42: / Line 42: @@
 * Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running
-=== '''"Orchestrator Started" message''' ===
+=== '''"Orchestrator Started" message on console screen''' ===
-This message is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:
+This message shown on console screens is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:
 *'''Check [https://dashboard.internetcomputer.org/ the dashboard]''' to check the status of the node. (Status explanations are [https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard here].) Use the principal ID that was assigned to the node when it was onboarded to identify it.