Difference between revisions of "Troubleshooting Unhealthy Nodes"


Revision as of 17:01, 28 June 2024

Troubleshooting node deployment

This guide is designed for troubleshooting unhealthy nodes: nodes that successfully installed IC-OS and registered with the IC, but show a dashboard status other than “Awaiting Subnet” or “Active in Subnet”.

If your node did NOT successfully install IC-OS or if it failed to register with the IC, consult the Troubleshooting Node Deployment Errors guide.

Verify and understand node health status

Background

The dashboard provides real-time status of each node in the network. Nodes are identified by the principal of the currently deployed operating system (the "Node ID"), so the Node ID will change upon node redeployment. Node Providers are expected to maintain a private record correlating each server with its Node ID. This record is crucial for tracking, especially when nodes are redeployed with new Node IDs.
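The private record described above can be as simple as a small data structure that keeps every Node ID a server has ever had. The following is a minimal sketch, not a prescribed format: the class name, the field names, and the use of a chassis serial number as the key are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ServerRecord:
    """One physical server and every Node ID it has been deployed under."""
    serial: str    # hypothetical key: chassis serial number or asset tag
    location: str  # hypothetical: data center and rack position
    node_ids: List[Tuple[date, str]] = field(default_factory=list)

    def redeploy(self, when: date, node_id: str) -> None:
        """Record the new Node ID assigned after a redeployment."""
        self.node_ids.append((when, node_id))

    @property
    def current_node_id(self) -> Optional[str]:
        """Most recently recorded Node ID, or None if never deployed."""
        return self.node_ids[-1][1] if self.node_ids else None


def find_server(records: List[ServerRecord], node_id: str) -> Optional[ServerRecord]:
    """Find the server that runs, or previously ran, under the given Node ID."""
    for record in records:
        if any(known_id == node_id for _, known_id in record.node_ids):
            return record
    return None
```

Keeping the full history rather than only the latest Node ID makes it possible to map an old alert or proposal, which still refers to a previous Node ID, back to the physical machine.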

Metrics and Monitoring

Metrics are collected from nodes situated in three key geographical locations: Frankfurt (FR1), Chicago (CH1), and San Francisco (SF1). Each location is equipped with an independent monitoring and observability system. These systems apply specific rules to identify normal and abnormal node behaviors.

Alerts and Troubleshooting

When a node exhibits abnormal behavior, an ALERT is triggered by the monitoring system. The nature of the alert is indicated on the dashboard under the node's status. This is an example of a "Degraded" node:

Dashboard-degraded-node.png

Verifying node status

  • Use the dashboard to verify the status of your node
    • The dashboard can be searched by your Node Provider principal or Node ID. If you search by Node Provider principal, you should see the Node ID of your node and click through to the Node Machine page.

Understanding node status

If the status of your node is neither “Awaiting Subnet” nor “Active in Subnet”, your node is unhealthy.

The dashboard indicates four possible statuses for each node:

  • Active in Subnet - The node is healthy and actively functioning within a subnet.
  • Awaiting Subnet - The node is operational and prepared to join a subnet when necessary. Node providers still get full rewards for nodes awaiting subnet.
  • Degraded - Metrics can be scraped from the node, indicating it is online, but an ALERT has been raised. This status suggests the node may be struggling to keep up with network demands. For specific troubleshooting steps, identify your degraded node status in the list below.
  • Offline - The monitoring system is unable to scrape metrics, possibly due to node failure or data center outage. Prioritize verifying network connectivity and hardware functionality. For specific troubleshooting steps, see below.

If your node is not listed at all: A missing node from the list may indicate significant issues, requiring immediate attention and troubleshooting. If the node was functioning previously and is now not listed at all, this generally means that it started encountering issues and was removed from the registry.
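The status rules above can be summarized as a small lookup. This is only a sketch: the status strings are the ones used in this guide, the status is assumed to be read manually from the dashboard, and the function name and returned hints are hypothetical.

```python
from typing import Optional

# Statuses that this guide treats as healthy.
HEALTHY_STATUSES = {"Active in Subnet", "Awaiting Subnet"}


def next_step(status: Optional[str]) -> str:
    """Map a dashboard status to a short hint. None means the node is not listed."""
    if status is None:
        return "not listed: likely removed from the registry, troubleshoot immediately"
    if status in HEALTHY_STATUSES:
        return "healthy: no action needed"
    if status == "Degraded":
        return "degraded: look up the specific alert under the degraded node statuses"
    if status == "Offline":
        return "offline: verify network connectivity and hardware first"
    raise ValueError(f"unknown status: {status!r}")
```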

Degraded node statuses

>> IC_OrchestratorFlapping

Explanation: Orchestrator coordinates the execution of many IC GuestOS processes, including the IC Replica. If the orchestrator is repeatedly restarting, then the GuestOS and the Replica process likely do not operate as expected.

Possible causes:

  • Networking issues
  • Hardware issues
  • Software problems

Troubleshooting and remediation:

  • Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
  • Check if there are any bandwidth limitations on these nodes, and ensure that traffic shaping, QoS, DoS protection, etc., are disabled on the ISP side
  • Perform other network diagnostics checks
  • Inspect node logs and metrics, if possible
  • Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running

>> IC_Replica_Behind

Explanation: IC Replica is the main process. If the Replica process cannot catch up, then the replica (node) cannot be a productive member of the IC subnet.

Possible causes:

  • Networking issues
  • Hardware issues
  • Software problems

Troubleshooting and remediation:

  • Check if there are any hardware issues reported by the machine's BMC (Baseboard Management Controller)
  • Perform firmware upgrade
  • Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
  • Check if there are any bandwidth limitations on these nodes, and ensure that traffic shaping, QoS, DoS protection, etc., are disabled on the ISP side
  • Perform other network diagnostics checks
  • Inspect node logs and metrics, if possible
  • Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running

Offline nodes

Network issues are the main reason why nodes are in an "OFFLINE" state.

  • Your node may not be reachable from the IC or
  • You may not be able to reach other nodes of the IC from your node or
  • You may not be able to reach the monitoring servers from your node.

Please refer to the Troubleshooting Networking Issues guide.
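As a first check of the three reachability cases above, a plain TCP connection attempt can be run from a machine that shares the node's network path (for outbound reachability) or from an outside vantage point (for inbound reachability). The sketch below uses only the Python standard library. Which hosts and ports to test is not specified here, so any address or port passed to it is an assumption on your part.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and works for both IPv4 and IPv6.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks, and DNS failures.
        return False
```

For example, `can_connect("2001:db8::10", 8080)` would test a placeholder IPv6 address from the documentation range on a placeholder port. A `False` result only shows that the TCP handshake failed from this vantage point, and it does not say where along the path the failure occurred.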

Another possible reason for an OFFLINE node may be that your GuestOS failed to start due to a RAM failure.

Server Troubleshooting Steps

These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:

  1. Verify if the server is up and running:
    • Check the power status of the server.
    • Check if the server is displaying any error messages or indicators.
    • If possible, access the server remotely or physically to ensure it is functioning properly.
  2. Hook up a crash cart and check for errors on the screen, troubleshoot as needed.
  3. Consider updating the firmware if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
  4. If no known error is found, please redeploy the node with a fresh IC-OS image.

General best practices

  1. Keep a separate machine in the same rack with appropriate tools for network diagnostics and troubleshooting
  2. Engage with the node provider community for support and to share effective troubleshooting techniques

Setting Up an Auxiliary Machine for Network Diagnostics

Robust Internet connectivity is essential. Without access to internal node logs and metrics, troubleshooting requires alternative strategies, including the use of an auxiliary machine within the same rack. Here's a brief outline for setting up an auxiliary machine in the same rack, while following best security practices:

  1. Hardware Setup:
    • Choose a server with sufficient resources to run diagnostic tools without impacting its performance. There is no need to follow the gen1/gen2 hardware requirements for this server (since this node would not be joining the IC network) but make sure the server is performant enough to run network tests.
    • Ensure physical security measures are in place to prevent unauthorized access.
  2. Operating System and Software:
    • Install a secure operating system, like a minimal installation of Linux (we prefer Ubuntu 22.04), which reduces the attack surface.
    • Keep the system updated with the latest security patches and firmware updates.
  3. Network Configuration:
    • Configure the machine with an IPv6 address in the same range as the IC nodes, for accurate testing.
    • Set up a restrictive firewall on the machine to allow only the necessary inbound and outbound traffic. Consider allowing Internet access for this machine only during troubleshooting sessions, and keeping the machine behind a VPN at other times.
  4. Diagnostic Tools:
    • Install network diagnostic tools such as ping, traceroute, nmap, tcpdump, and iperf.
    • Configure monitoring tools to simulate node activities and track responsiveness.
  5. Security Measures:
    • Preferably, do not use passwords at all and use key-based access instead. If passwords cannot be avoided, use strong, unique passwords for all accounts and change them regularly.
    • Implement key-based SSH authentication and disable root login over SSH.
    • Regularly review logs for any unusual activities that might indicate a security breach.
  6. Maintenance and Updates:
    • Regularly update all software to the latest versions.
    • Periodically test your network diagnostic tools to ensure they are functioning as expected.
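Two points from the outline above lend themselves to a quick automated check: whether the diagnostic tools are installed, and whether the auxiliary machine's IPv6 address lies in the same range as the nodes. The following sketch uses only the Python standard library. The function names are hypothetical, and the prefix you pass in is whatever range your data center has assigned to the IC nodes.

```python
import ipaddress
import shutil
from typing import Iterable, List

# Tool names taken from the list above; executable names may differ per distribution.
REQUIRED_TOOLS = ["ping", "traceroute", "nmap", "tcpdump", "iperf"]


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def in_same_prefix(address: str, prefix: str) -> bool:
    """Return True if the address lies inside the given network prefix."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(prefix)
```

For instance, `in_same_prefix("2001:db8:1::20", "2001:db8:1::/64")` checks a placeholder address against a placeholder /64 from the documentation range.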

Frequently Asked Questions

Q: Is the monitoring system open-sourced? How does it communicate with the nodes?

A: The monitoring system configuration is not currently open-sourced. However, the node configuration that is required for proper node operation is fully open source. For more information about the node-to-node and node-to-monitoring communication, refer to the nftables configuration, which is the definitive guide for required open ports on Host OS, and similar nftables configuration for the Guest OS. You can find the configuration for DFINITY-owned DCs and Gen1 node providers here: nftables configuration. We use Victoria Metrics for metrics scraping (documentation) and vector.dev for log scraping.

Q: What are the destination IPs and Ports for Frankfurt, Chicago, and San Francisco for connectivity troubleshooting?

A: At present, detailed node/port information is not publicly accessible, as disclosing this information is considered a security risk. To effectively troubleshoot connectivity issues with your nodes, we recommend setting up a "spot instance" or a temporary virtual machine (VM) with a cloud provider in each of the geographical regions. This approach allows you to test both connectivity and connection stability to your nodes, providing a practical solution for identifying and resolving network-related issues.
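To measure connection stability rather than a single success or failure, a probe can be repeated at a fixed interval from each temporary VM. The sketch below is one possible approach and not an official tool. `probe` is any function you supply that returns `True` on success (for example, a TCP connection attempt to your node), and the default attempt count and interval are arbitrary assumptions.

```python
import time
from typing import Callable


def probe_stability(probe: Callable[[], bool], attempts: int = 20,
                    interval: float = 1.0) -> float:
    """Run the probe repeatedly and return the fraction of successful attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for i in range(attempts):
        if probe():
            successes += 1
        # Wait between attempts, but not after the last one.
        if i < attempts - 1:
            time.sleep(interval)
    return successes / attempts
```

A result noticeably below 1.0 from one region but not from the others suggests a path-specific problem worth raising with the ISP.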

Q: The dashboard shows offline or degraded status for <DC>, but everything seems fine. What should we do?

A: Check if any of these issues are present: firewall restrictions, traffic shaping, DoS protection, or QoS features enabled on the ISP side. These should all be disabled for optimal node operation.

Q: Under what circumstances is a node removed from the IC network?

A: A node is removed from the IC network when it's deemed unhealthy. The determination of a node's health is made using tooling from https://github.com/dfinity/dre. This tooling assesses nodes based on various metrics and submits a proposal for their removal to maintain the highest level of decentralization possible. However, there are exceptions. For example, an unhealthy node might be temporarily retained if there are ongoing efforts to recover and restore it.

Q: How long can a node be down before it's excluded from the IC network?

A: There's no set time limit for how long a node can be down before exclusion. The decision is more qualitative and depends on the overall health of the network. Currently, the IC network can tolerate up to 1/3 of nodes in a 13-node subnet being down or unhealthy. This means a subnet can function with up to 4 unhealthy nodes. If the unhealthy nodes do not exceed this threshold, a node might be left in the subnet for a longer period, especially if there are efforts underway to make it healthy again.
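The figure of 4 nodes for a 13-node subnet can be reproduced with a short calculation. The sketch below assumes the usual Byzantine fault tolerance bound, under which a subnet of n nodes tolerates f faulty nodes as long as n ≥ 3f + 1. That bound is an assumption introduced here for illustration, and it is consistent with the numbers given above.

```python
def max_faulty_nodes(subnet_size: int) -> int:
    """Largest f such that subnet_size >= 3 * f + 1 (assumed fault tolerance bound)."""
    if subnet_size < 1:
        raise ValueError("subnet_size must be at least 1")
    return (subnet_size - 1) // 3
```

For a 13-node subnet this gives (13 - 1) // 3 = 4, which is strictly less than one third of 13.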

Q: When are nodes typically removed or replaced?

A: Node removals or replacements are currently conducted semi-manually and are typically scheduled for Mondays or Fridays. This timing allows Foundation voters to participate in the decision-making process at the start or end of the workweek. However, as this process is manually conducted, exceptions may occur based on specific circumstances.

Q: What are the future plans for node management in the IC network?

A: In the medium term, there are plans to automate node replacements. This means node swaps might occur more frequently and systematically, reducing the manual overhead and potentially enhancing the network's resilience and performance.

Note: As with all network operations, these practices are subject to change based on technological advancements and the evolving needs of the IC network. It's always good to refer to https://github.com/dfinity/dre for the most current information and tooling.

Q: What is the standard procedure if a faulty component occurs and we have to take the server down for maintenance?

A: At the moment, the process is as follows:

  1. The node provider should do their best to bring the server back up as soon as possible.
  2. The DFINITY DRE team will monitor the situation and submit any proposals to replace faulty nodes, if necessary, or reach out to individual node providers if node replacements wouldn't be effective enough.

So feel free to do maintenance whenever you need to. If the node is not in a subnet, it can be taken down for as long as necessary without any problem. If the node is in a subnet, node replication should handle the outage without problems. Please be aware that we are actively working on reward adjustments based on the number of active and productive nodes, so please keep the downtime no longer than absolutely necessary to avoid reward reductions.

One thing that would be really helpful from your side is:

  1. Find the subnet id in which the node is located, and check how many nodes in the subnet are currently unhealthy
  2. If more than 2 nodes in the subnet are unhealthy (e.g. 3, 4, ...), please consider postponing the maintenance work until the number of unhealthy nodes in the subnet drops below 2 again.
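The two steps above amount to a simple rule, sketched here. The function name is hypothetical, the count is assumed to be the number of unhealthy nodes currently shown for the subnet, and `None` stands for a node that is not assigned to any subnet.

```python
from typing import Optional


def should_postpone_maintenance(unhealthy_in_subnet: Optional[int],
                                threshold: int = 2) -> bool:
    """Return True if maintenance should wait, following the guidance above.

    unhealthy_in_subnet is None when the node is not in a subnet, in which
    case it can be taken down at any time.
    """
    if unhealthy_in_subnet is None:
        return False
    return unhealthy_in_subnet > threshold
```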

In the future, any NP will be able to run the DRE tooling and there will be a financial incentive for the node providers to both a) keep all nodes in the subnet healthy, and b) submit proposals to replace unhealthy nodes or to improve decentralization.