Difference between revisions of "Troubleshooting Unhealthy Nodes"

From Internet Computer Wiki
(IC UDP port block issue diagnosis)
 
(17 intermediate revisions by 4 users not shown)
Line 1: Line 1:
=== '''Use [https://dashboard.internetcomputer.org/centers the dashboard] to verify that the node is healthy''' ===
+
== Troubleshooting node deployment ==
*The node count for your data center should match the number of nodes in that data center.
+
This guide is designed for troubleshooting ''unhealthy'' nodes—those are nodes that successfully ''installed'' IC-OS and ''registered'' with the IC, but on the [[Node Deployment Guide (with an HSM)|dashboard]], show a status that is NOT either “Awaiting Subnet” or “Active in Subnet”.
* Look for the principal ID for the node which you are servicing. Status explanations are [https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard here].
 
* If the node isn't listed at all, then it needs to be [[IC-OS Installation Runbook|redeployed with a fresh IC-OS image]].
 
  
=== '''"Orchestrator Started" message''' ===
+
If your node did NOT successfully install IC-OS or if it failed to register with the IC, consult the [[Troubleshooting Node Deployment Errors]] guide.
This message is not an error, nor is it confirmation that the node is running properly. This must be determined in other ways:
 
  
*'''Check [https://dashboard.internetcomputer.org/ the dashboard]''' to check the status of the node. (Status explanations are [https://wiki.internetcomputer.org/wiki/Node_Provider_Troubleshooting#Node_Status_on_the_Dashboard here].) Use the principal ID that was assigned to the node when it was onboarded to identify it.
+
== Verify and understand node health status==
*If the node is not visible on the dashboard, then it has not registered with the Internet Computer.
 
**If you have recently installed a current IC-OS image, then you can try inserting the HSM and/or a reboot to see if it joins. This would work if a recent IC-OS installation was successful and only the registration and joining was interrupted.
 
**If you have ''not'' recently installed a current IC-OS image, then do ''not'' insert the HSM. You do not want the node to rejoin with an old IC-OS image, as it will only fail again. Instead, you should consider [[Updating Firmware|upgrading the firmware]] if it is running on old versions, and then redeploy the node with [[IC-OS Installation Runbook|a fresh/current IC-OS image]] (which will assign a new principal to the node so that you can identify it in the dashboard.)
 
  
=== '''Server Troubleshooting Steps''' ===
+
=== Background ===
 +
The dashboard provides real-time status of each node in the network. Nodes are identified by the principal of the currently deployed operating system (the "Node ID"), so the Node ID will change upon node redeployment. Node Providers are expected to maintain a private record correlating each server with its Node ID. This record is crucial for tracking, especially when nodes are redeployed with new Node IDs.
 +
 
 +
====Metrics and Monitoring ====
 +
Metrics are collected from nodes situated in three key geographical locations: Frankfurt (FR1), Chicago (CH1), and San Francisco (SF1). Each location is equipped with an independent monitoring and observability system. These systems apply specific rules to identify normal and abnormal node behaviors.
 +
 
 +
==== Alerts and Troubleshooting ====
 +
When a node exhibits abnormal behavior, an ALERT is triggered by the monitoring system. The nature of the alert is indicated on the dashboard under the node's status. This is an example of a "Degraded" node:[[File:Dashboard-degraded-node.png|center|frameless|499x499px]]
 +
 
 +
=== Verifying node status ===
 +
*Use [https://dashboard.internetcomputer.org/centers the dashboard] to verify the status of your node
 +
**The dashboard can be searched by your '''Node Provider principal''' or '''Node ID'''. If you search by '''Node Provider principal''', you should see the Node ID of your node and click through to the Node Machine page.
 +
 
 +
=== Understanding node status ===
 +
If the status of your node is NOT either “Awaiting Subnet” or “Active in Subnet”, your node is unhealthy.
 +
 
 +
The dashboard indicates four possible statuses for each node:
 +
 
 +
*'''Active in Subnet''' - The node is healthy and actively functioning within a subnet.
 +
*'''Awaiting Subnet''' - The node is operational and prepared to join a subnet when necessary. Node providers still get full rewards for  nodes awaiting subnet.
 +
*'''Degraded''' - Metrics can be scraped from the node, indicating it is online, but an ALERT has been raised. This status suggests the node may be struggling to keep up with network demands. For specific troubleshooting steps, identify your degraded node status from the below list.
 +
*'''Offline''' - The monitoring system is unable to scrape metrics, possibly due to node failure or data center outage. Prioritize verifying network connectivity and hardware functionality. For specific troubleshooting steps, see below.
 +
 
 +
If your node is '''not listed at all''': A missing node from the list may indicate significant issues, requiring immediate attention and troubleshooting. If the node was functioning previously and is now not listed at all, this generally means that it started encountering issues and was removed from the registry. <br />
 +
==Degraded node statuses==
 +
 
 +
===>> IC_OrchestratorFlapping===
 +
 
 +
'''Explanation:''' Orchestrator coordinates the execution of many IC GuestOS processes, including the IC Replica. If the orchestrator is repeatedly restarting, then the GuestOS and the Replica process likely do not operate as expected.
 +
 
 +
'''Possible causes:'''
 +
*Networking issues
 +
*Hardware issues
 +
*Software problems
 +
 
 +
'''Troubleshooting and remediation:'''
 +
*Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
 +
* Check if there are some bandwidth limitations on these nodes, and ensure any traffic shaping, QoS, DOS protection, etc, are ''disabled'' on the ISP side
 +
* Perform other [https://wiki.internetcomputer.org/w/index.php?title=Unhealthy_Nodes&veaction=edit#Setting_Up_an_Auxiliary_Machine_for_Network_Diagnostics network diagnostics checks]
 +
* Inspect node logs and metrics, if possible
 +
* Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running
 +
 
 +
===>> IC_Replica_Behind ===
 +
 
 +
'''Explanation:''' IC Replica is the main process. If the Replica process cannot catch up, then the replica (node) cannot be a productive member of the IC subnet.
 +
 
 +
'''Possible causes:'''
 +
* Networking issues
 +
* Hardware issues
 +
* Software problems
 +
 
 +
'''Troubleshooting and remediation:'''
 +
* Check if there are any hardware issues reported by the machine's BMC (Baseboard Management Controller)
 +
* Perform [https://wiki.internetcomputer.org/wiki/Updating_Firmware firmware upgrade]
 +
* Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
 +
* Check if there are some bandwidth limitations on these nodes, and ensure any traffic shaping, QoS, DOS protection, etc, are ''disabled'' on the ISP side
 +
* Perform other [https://wiki.internetcomputer.org/w/index.php?title=Unhealthy_Nodes&veaction=edit#Setting_Up_an_Auxiliary_Machine_for_Network_Diagnostics network diagnostics checks]
 +
* Inspect node logs and metrics, if possible
 +
* Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running
 +
 
 +
===>> IC_Replica_Behind_CLOCK ===
 +
 
 +
==== '''Overview of the Problem''' ====
 +
 
 +
The clock on the Internet Computer (IC) replica can desync when UDP port 123 (NTP traffic) is blocked unexpectedly. The node then shows a degraded status on the IC dashboard.
 +
To diagnose this, set up a test environment that mirrors the node's setup: a bare-metal server or VM running Ubuntu 20.04, connected via your IPv6 subnet. Then test whether chrony syncs the clock correctly.
 +
If clock sync fails, perform a traceroute to identify where packets are being dropped, and consult the ISP if necessary.
 +
==== 2. '''Setting Up the Test Server''' ====
 +
 
 +
Connect a VM or bare-metal server running '''Ubuntu 20.04''' to your network. Ensure that your '''IPv6 address''' and '''gateway''' are configured correctly.
 +
===== Configure the IPv6 Address and Gateway =====
 +
 
 +
Open the network configuration file using the following command: <syntaxhighlight lang="shell">
 +
sudo nano /etc/netplan/00-installer-config.yaml </syntaxhighlight>
 +
 
 +
Add or edit the IPv6 settings as follows: <syntaxhighlight lang="shell">
 +
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: no
      dhcp6: no
      addresses:
        - YOUR_IPV6_ADDRESS/64
      gateway6: YOUR_IPV6_GATEWAY
      nameservers:
        addresses:
          - 2001:4860:4860::8888
          - 2001:4860:4860::8844
</syntaxhighlight>
 +
 
 +
Apply the configuration: <syntaxhighlight lang="shell">
 +
sudo netplan apply </syntaxhighlight>
 +
 
 +
==== 3. '''Install and Configure Chrony''' ====
 +
 
 +
Chrony is used to synchronize the system clock with NTP servers. Install it by running: <syntaxhighlight lang="shell">
 +
sudo apt update
sudo apt install chrony -y </syntaxhighlight>
 +
 
 +
Edit the Chrony configuration file to use NTP pools: <syntaxhighlight lang="shell">
 +
sudo nano /etc/chrony/chrony.conf </syntaxhighlight>
 +
 
 +
Add the following lines to ensure NTP synchronization: <syntaxhighlight lang="shell">
 +
pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst
pool 2.pool.ntp.org iburst
pool 3.pool.ntp.org iburst </syntaxhighlight>
 +
 
 +
Save the file and restart Chrony: <syntaxhighlight lang="shell">
 +
sudo systemctl restart chrony </syntaxhighlight>
 +
 
 +
==== 4. '''Check Chrony Synchronization Status''' ====
 +
 
 +
Verify if Chrony is syncing with the NTP servers: <syntaxhighlight lang="shell">
 +
chronyc tracking </syntaxhighlight>
 +
 
 +
Check the sources that Chrony is using to sync: <syntaxhighlight lang="shell">
 +
chronyc sources </syntaxhighlight>
 +
 
 +
If the synchronization is successful, the chronyc tracking command should show the server is synchronized. If not, proceed to the next step.
 +
==== 5. '''Perform a Traceroute to Diagnose Packet Drop''' ====
 +
 
 +
If Chrony is unable to sync, the issue might be caused by packet drops on UDP port 123. To identify where the issue is, perform a traceroute to one of the NTP servers (for example, <code>0.pool.ntp.org</code>) on port 123 (UDP): <syntaxhighlight lang="shell">
 +
sudo traceroute -U -p 123 0.pool.ntp.org </syntaxhighlight>
 +
 
 +
This command will trace the path of UDP packets to the NTP server, allowing you to see where the packet drops occur.
 +
==== 6. '''Interpret the Results''' ====
 +
 
 +
If the traceroute completes successfully, then the issue might be with the server's firewall or NTP configuration.
 +
If packets are dropped at a specific hop:
* If it’s on your network, check the router, firewall, or switch for any port 123 (UDP) blocks.
* If the issue is with your ISP, the packets might be blocked at their hardware level by IDS/IPS systems. Provide the traceroute results to your ISP's support team for faster resolution.
 +
==== 7. '''Consult ISP''' ====
 +
 
 +
If packet drops are identified on the ISP side, contact your ISP with the traceroute data. Explain that port 123 (UDP) is blocked and this is causing your IC clock to desync. Even though they might not block the port intentionally, security systems like IDS/IPS might kick in to block this traffic.
 +
----
 +
 
 +
=== Additional Commands ===
 +
 
 +
* '''Check UDP open ports on the system:'''
 +
* '''Verify if any firewall rules are blocking UDP port 123:'''
 +
* '''Open UDP port 123 if it's blocked by UFW:'''
 +
 
 +
==Offline nodes==
 +
Network issues are the main reason why nodes are in an "OFFLINE" state. 
 +
 
 +
* Your node may not be reachable from the IC or
 +
* You may not be able to reach other nodes in the IC from your node or
 +
* You may not be able to reach the monitoring servers from your node.
 +
 
 +
Please refer to the [[Troubleshooting Networking Issues]] guide.
 +
 
 +
Another possible reason for an OFFLINE node may be that your GuestOS failed to start due to a RAM failure. 
 +
==Server Troubleshooting Steps==
 
These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:
 
  
#Verify if the server is up and running:
+
# Verify if the server is up and running:
#*Check the power status of the server.
+
#* Check the power status of the server.
#*Check if the server is displaying any error messages or indicators.
+
#* Check if the server is displaying any error messages or indicators.
#*If possible, access the server remotely or physically to ensure it is functioning properly.
+
#* If possible, access the server remotely or physically to ensure it is functioning properly.
#Hook up a crash cart and check for errors on the screen, troubleshoot as needed.
+
# Hook up a crash cart and check for errors on the screen, troubleshoot as needed.
#Consider [[Updating Firmware|updating the firmware]] if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
+
# Consider [[Updating Firmware|updating the firmware]] if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
# If no known error is found, please [[IC-OS Installation Runbook|redeploy the node with a fresh IC-OS image]].
+
# If no known error is found, please [[Node Provider Roadmap#Milestone Five: Node Machine Onboarding|redeploy the node with a fresh IC-OS image]]
#* The deployment process identifies/fixes many software issues.
+
 
#* Note that if an old IC-OS image is used, the node will "appear" to be healthy at first, but it will not be able to catch up to the blockchain and will therefore fall behind and become unhealthy again. '''Thus, a current IC-OS image must be used.'''
+
==Frequently Asked Questions==
#* At the end, obtain the new principal ID for the node from the crash cart screen so you can check the dashboard status.
+
'''Q: Is the monitoring system open-sourced? How does it communicate with the nodes?'''
#*'''If a node is healthy ("Awaiting Subnet" status) for a while and then changes to "Offline," then whatever the issue was originally still exists.''' Troubleshoot hardware, upgrade firmware, etc to resolve the issue.
+
 
 +
A: The monitoring system configuration is not currently open-sourced. However, the node configuration that is required for proper node operation is fully open source. For more information about node-to-node and node-to-monitoring communication, refer to the [https://sourcegraph.com/github.com/dfinity/ic/-/blob/ic-os/hostos/rootfs/etc/nftables.conf?L54-112 nftables configuration], which is the definitive guide for required open ports on the Host OS, and the [https://sourcegraph.com/github.com/dfinity/ic/-/blob/ic-os/guestos/rootfs/opt/ic/share/ic.json5.template?L322 similar nftables configuration] for the Guest OS. You can find the configuration for DFINITY-owned DCs and Gen1 node providers here: nftables configuration. We use [https://docs.victoriametrics.com/ VictoriaMetrics] for metrics scraping and [https://vector.dev vector.dev] for log scraping.
 +
 
 +
'''Q: What are the destination IPs and Ports for Frankfurt, Chicago, and San Francisco for connectivity troubleshooting?'''
 +
 
 +
A: At present, detailed node/port information is not publicly accessible, as disclosing this information is considered a security risk. To effectively troubleshoot connectivity issues with your nodes, we recommend setting up a "spot instance" or a temporary virtual machine (VM) with a cloud provider in each of the geographical regions. This approach allows you to test both connectivity and connection stability to your nodes, providing a practical solution for identifying and resolving network-related issues.
 +
 
 +
'''Q: The dashboard shows offline or degraded status for <DC>, but everything seems fine. What should we do?'''
 +
 
 +
A: Check whether any of the following are enabled on the ISP side: firewall restrictions, traffic shaping, DoS protection, or QoS features. These should all be disabled for optimal node operation.
 +
 
 +
'''Q: Under what circumstances is a node removed from the IC network?'''
 +
 
 +
A: A node is removed from the IC network when it's deemed unhealthy. The determination of a node's health is made using tooling from https://github.com/dfinity/dre. This tooling assesses nodes based on various metrics and submits a proposal for their removal to maintain the highest level of decentralization possible. However, there are exceptions. For example, an unhealthy node might be temporarily retained if there are ongoing efforts to recover and restore it.
 +
 
 +
'''Q: How long can a node be down before it's excluded from the IC network?'''
 +
 
 +
A: There's no set time limit for how long a node can be down before exclusion. The decision is more qualitative and depends on the overall health of the network. Currently, the IC network can tolerate fewer than 1/3 of the nodes in a subnet being down or unhealthy. For a 13-node subnet, this means the subnet can function with up to 4 unhealthy nodes. If the unhealthy nodes do not exceed this threshold, a node might be left in the subnet for a longer period, especially if there are efforts underway to make it healthy again.
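
The 4-out-of-13 figure can be reproduced with a one-line calculation. This is a minimal sketch that assumes the usual Byzantine fault tolerance bound n >= 3f + 1 (n nodes, f faulty); the answer above only states the one-third limit, and the function name <code>max_faulty_nodes</code> is hypothetical.

```shell
# Largest number of faulty nodes f tolerated by a subnet of n nodes,
# assuming the bound n >= 3f + 1 (an assumption, not taken from IC tooling).
# Shell integer division rounds down, which is what the bound requires.
max_faulty_nodes() {
    echo $(( ($1 - 1) / 3 ))
}
```

For example, <code>max_faulty_nodes 13</code> prints 4, matching the figure quoted above.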
 +
 
 +
'''Q: When are nodes typically removed or replaced?'''
 +
 
 +
A: Node removals or replacements are currently conducted semi-manually and are typically scheduled for Mondays or Fridays. This timing allows Foundation voters to participate in the decision-making process at the start or end of the workweek. However, as this process is manually conducted, exceptions may occur based on specific circumstances.
  
=== '''Networking Troubleshooting Steps''' ===
+
'''Q: What are the future plans for node management in the IC network?'''
  
# Inspect network hardware in the rack for any visible signs of malfunction (e.g. red lights) or incorrect setup
+
A: In the medium term, there are plans to automate node replacements. This means node swaps might occur more frequently and systematically, reducing the manual overhead and potentially enhancing the network's resilience and performance.
# Verify the cabling and port status on the switch:
 
#*Check the physical connection of the network cable between the server and the switch.
 
#*Ensure that the cable is securely plugged into the correct port on both ends.
 
#*Look for any signs of damage or loose connections.
 
#*Test the connectivity by trying a different network cable or using the same cable on a different port.
 
#Check for recent port flaps/link failures or any other activities which might cause it:
 
#*Check the logs or monitoring systems for any indications of port flapping or link failures.
 
#*Investigate any recent changes or activities that could have affected the network connection.
 
#*Consider any software updates, configuration changes, or physical alterations made recently.
 
#Try to perform a re-seat of cable/breakout/SFP/QSFP toward the affected machine:
 
#*Disconnect and reconnect the network cable at both ends (server and switch).
 
#*If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
 
#*Ensure a secure and proper connection is established.
 
#Check with the switch vendor:
 
#*If the issue persists, contact the switch vendor's support team for further assistance.
 
#*Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
 
#*Follow vendor guidance to troubleshoot and resolve the issue.
 
#**If your vendor requires a TSR log, see [[IDRAC access and TSR logs]] for an example of how to retrieve one from a Dell server.
 
#**[[Updating_Firmware|Updating the firmware]] might also resolve the issue.
 
# Utilize an auxiliary machine within the same rack with full network access to run diagnostics tools like <code>ping</code>, <code>traceroute</code>, and <code>nmap</code>
 
# Work with the ISP to troubleshoot and resolve any network routing issues identified during diagnostics
 
# Prepare for future incidents by establishing network redundancy and failover mechanisms
 
  
=== '''Best practices''' ===
+
Note: As with all network operations, these practices are subject to change based on technological advancements and the evolving needs of the IC network. It's always good to refer to https://github.com/dfinity/dre for the most current information and tooling.
  
# Keep a separate machine in the same rack with appropriate tools for network diagnostics and troubleshooting
+
'''Q: What is the standard procedure if a faulty component occurs and we have to take the server down for maintenance?'''
# Engage with the node provider community for support and to share effective troubleshooting techniques
 
  
 +
At the moment the process is as follows:
  
 +
#The node provider should do their best to bring the server back up as soon as possible.
 +
#The DFINITY DRE team will monitor the situation and submit any proposals to replace faulty nodes, if necessary, or reach out to individual node providers if node replacements wouldn't be effective enough.
  
=== '''Setting Up an Auxiliary Machine for Network Diagnostics''' ===
+
So feel free to do maintenance whenever you need to. If the node is not in a subnet, taking it down for as long as necessary is not a problem. If the node is in a subnet, node replication should handle it without problems. Please be aware that we are actively working on reward adjustments based on the number of active and productive nodes, so try to keep the downtime no longer than absolutely necessary to avoid reward reductions.
Robust Internet connectivity is essential. Without access to internal node logs and metrics, troubleshooting requires alternative strategies, including the use of an auxiliary machine within the same rack. Here's a brief outline for setting up an auxiliary machine in the same rack, while following best security practices:
 
# Hardware Setup:
 
#* Choose a server with sufficient resources to run diagnostic tools without impacting its performance. There is no need to follow the gen1/gen2 hardware requirements for this server (since this node would not be joining the IC network) but make sure the server is performant enough to run network tests.
 
#* Ensure physical security measures are in place to prevent unauthorized access.
 
# Operating System and Software:
 
#* Install a secure operating system, like a minimal installation of Linux (we prefer Ubuntu 22.04), which reduces the attack surface.
 
#* Keep the system updated with the latest security patches and firmware updates.
 
# Network Configuration:
 
#* Configure the machine with an IPv6 address in the same range as the IC nodes, for accurate testing.
 
#* Set up a restrictive firewall on the machine to allow ''only the necessary'' inbound and outbound traffic. Consider allowing Internet access for this machine only during troubleshooting sessions, and keeping the machine behind a VPN at other times.
 
# Diagnostic Tools:
 
#* Install network diagnostic tools such as <code>ping</code>, <code>traceroute</code>, <code>nmap</code>, <code>tcpdump</code>, and <code>iperf</code>.
 
#* Configure monitoring tools to simulate node activities and track responsiveness.
 
# Security Measures:
 
#* Use strong, unique passwords for all accounts and change them regularly. Or, preferably, do not use passwords at all, and use key-based access instead.
 
#* Implement key-based SSH authentication and disable root login over SSH.
 
#* Regularly review logs for any unusual activities that might indicate a security breach.
 
# Maintenance and Updates:
 
#* Regularly update all software to the latest versions.
 
#* Periodically test your network diagnostic tools to ensure they are functioning as expected.
 
  
+
One thing that would be really helpful from your side is:
  
Back to [[Node Provider Troubleshooting]]
+
# Find the subnet id in which the node is located, and check how many nodes in the subnet are currently unhealthy
 +
# If more than 2 nodes in the subnet are unhealthy (that is, 3 or more), please consider postponing the maintenance work until the number of unhealthy nodes in the subnet drops back to 2 or fewer.
  
Back to [[Node Provider Documentation]]
+
In the future, any NP will be able to run the [https://github.com/dfinity/dre DRE tooling] and there will be a financial incentive for the node providers to both a) keep all nodes in the subnet healthy, and b) submit proposals to replace unhealthy nodes or to improve decentralization.

Latest revision as of 11:58, 16 October 2024

Troubleshooting node deployment

This guide is designed for troubleshooting unhealthy nodes—those are nodes that successfully installed IC-OS and registered with the IC, but on the dashboard, show a status that is NOT either “Awaiting Subnet” or “Active in Subnet”.

If your node did NOT successfully install IC-OS or if it failed to register with the IC, consult the Troubleshooting Node Deployment Errors guide.

Verify and understand node health status

Background

The dashboard provides real-time status of each node in the network. Nodes are identified by the principal of the currently deployed operating system (the "Node ID"), so the Node ID will change upon node redeployment. Node Providers are expected to maintain a private record correlating each server with its Node ID. This record is crucial for tracking, especially when nodes are redeployed with new Node IDs.

Metrics and Monitoring

Metrics are collected from nodes situated in three key geographical locations: Frankfurt (FR1), Chicago (CH1), and San Francisco (SF1). Each location is equipped with an independent monitoring and observability system. These systems apply specific rules to identify normal and abnormal node behaviors.

Alerts and Troubleshooting

When a node exhibits abnormal behavior, an ALERT is triggered by the monitoring system. The nature of the alert is indicated on the dashboard under the node's status. This is an example of a "Degraded" node:

Dashboard-degraded-node.png

Verifying node status

  • Use the dashboard to verify the status of your node
    • The dashboard can be searched by your Node Provider principal or Node ID. If you search by Node Provider principal, you should see the Node ID of your node and click through to the Node Machine page.

Understanding node status

If the status of your node is NOT either “Awaiting Subnet” or “Active in Subnet”, your node is unhealthy.

The dashboard indicates four possible statuses for each node:

  • Active in Subnet - The node is healthy and actively functioning within a subnet.
  • Awaiting Subnet - The node is operational and prepared to join a subnet when necessary. Node providers still get full rewards for nodes awaiting subnet.
  • Degraded - Metrics can be scraped from the node, indicating it is online, but an ALERT has been raised. This status suggests the node may be struggling to keep up with network demands. For specific troubleshooting steps, identify your degraded node status from the below list.
  • Offline - The monitoring system is unable to scrape metrics, possibly due to node failure or data center outage. Prioritize verifying network connectivity and hardware functionality. For specific troubleshooting steps, see below.
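
When checking many nodes against a private inventory, the healthy/unhealthy rule above can be written as a small helper. This is only a sketch: the function name is_healthy_status is hypothetical, and the status strings are assumed to be spelled exactly as in the list above.

```shell
# Succeeds (exit 0) for the two healthy statuses, fails (exit 1) for anything else.
# is_healthy_status is a hypothetical helper, not part of any IC tooling.
is_healthy_status() {
    case "$1" in
        "Active in Subnet"|"Awaiting Subnet") return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_healthy_status "Degraded" || echo "needs troubleshooting" prints the message, because "Degraded" is not one of the two healthy statuses.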

If your node is not listed at all: A missing node from the list may indicate significant issues, requiring immediate attention and troubleshooting. If the node was functioning previously and is now not listed at all, this generally means that it started encountering issues and was removed from the registry.

Degraded node statuses

>> IC_OrchestratorFlapping

Explanation: Orchestrator coordinates the execution of many IC GuestOS processes, including the IC Replica. If the orchestrator is repeatedly restarting, then the GuestOS and the Replica process likely do not operate as expected.

Possible causes:

  • Networking issues
  • Hardware issues
  • Software problems

Troubleshooting and remediation:

  • Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
  • Check if there are some bandwidth limitations on these nodes, and ensure any traffic shaping, QoS, DOS protection, etc, are disabled on the ISP side
  • Perform other network diagnostics checks
  • Inspect node logs and metrics, if possible
  • Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running

>> IC_Replica_Behind

Explanation: IC Replica is the main process. If the Replica process cannot catch up, then the replica (node) cannot be a productive member of the IC subnet.

Possible causes:

  • Networking issues
  • Hardware issues
  • Software problems

Troubleshooting and remediation:

  • Check if there are any hardware issues reported by the machine's BMC (Baseboard Management Controller)
  • Perform firmware upgrade
  • Check if any NNS proposals were recently executed for the nodes in question https://dashboard.internetcomputer.org/governance
  • Check if there are some bandwidth limitations on these nodes, and ensure any traffic shaping, QoS, DOS protection, etc, are disabled on the ISP side
  • Perform other network diagnostics checks
  • Inspect node logs and metrics, if possible
  • Consult other node providers and DFINITY if there are any known software problems with the latest revision that the node(s) are running

>> IC_Replica_Behind_CLOCK

Overview of the Problem

The clock on the Internet Computer (IC) replica can desync when UDP port 123 (NTP traffic) is blocked unexpectedly. The node then shows a degraded status on the IC dashboard. To diagnose this, set up a test environment that mirrors the node's setup: a bare-metal server or VM running Ubuntu 20.04, connected via your IPv6 subnet. Then test whether chrony syncs the clock correctly. If clock sync fails, perform a traceroute to identify where packets are being dropped, and consult the ISP if necessary.

2. Setting Up the Test Server

Connect a VM or bare-metal server running Ubuntu 20.04 to your network. Ensure that your IPv6 address and gateway are configured correctly.

Configure the IPv6 Address and Gateway

Open the network configuration file using the following command:

sudo nano /etc/netplan/00-installer-config.yaml

Add or edit the IPv6 settings as follows:

network:
  version: 2
  ethernets:
    eth0:
      dhcp4: no
      dhcp6: no
      addresses:
        - YOUR_IPV6_ADDRESS/64
      gateway6: YOUR_IPV6_GATEWAY
      nameservers:
        addresses:
          - 2001:4860:4860::8888
          - 2001:4860:4860::8844

Apply the configuration:

sudo netplan apply
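
Before moving on, it can help to confirm that the address and default route were actually applied. The helpers below are a sketch, not part of the original procedure: the function names are hypothetical, eth0 is taken from the example configuration, and the route check assumes that "ip -6 route show default" prints a line starting with "default via" when a gateway is configured.

```shell
# Show global-scope IPv6 addresses on the interface (defaults to eth0,
# the interface name used in the example configuration).
show_ipv6_addresses() {
    ip -6 addr show dev "${1:-eth0}" scope global
}

# Reads `ip -6 route show default` output on stdin;
# succeeds if it contains a default route through a gateway.
has_default_v6_route() {
    grep -q '^default via '
}

# Send three IPv6 echo requests to the gateway passed as $1.
ping_gateway() {
    ping -6 -c 3 "$1"
}
```

A typical check would be: ip -6 route show default | has_default_v6_route && echo "default route present", followed by ping_gateway YOUR_IPV6_GATEWAY.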

3. Install and Configure Chrony

Chrony is used to synchronize the system clock with NTP servers. Install it by running:

sudo apt update
sudo apt install chrony -y

Edit the Chrony configuration file to use NTP pools:

sudo nano /etc/chrony/chrony.conf

Add the following lines to ensure NTP synchronization:

pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst
pool 2.pool.ntp.org iburst
pool 3.pool.ntp.org iburst

Save the file and restart Chrony:

sudo systemctl restart chrony

4. Check Chrony Synchronization Status

Verify if Chrony is syncing with the NTP servers:

chronyc tracking

Check the sources that Chrony is using to sync:

chronyc sources

If the synchronization is successful, the chronyc tracking command should show the server is synchronized. If not, proceed to the next step.
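As a rough aid for reading the output of these two commands, the sketch below defines two helper functions. The function names are hypothetical and not part of chrony. They assume the usual chronyc output layout: `chronyc tracking` prints a "Leap status" line, and `chronyc sources` prints one line per NTP server starting with `^`, with the Reach value in the fifth column.

```shell
# Hypothetical helpers: they only parse text on stdin and never call chronyc.

# Reads `chronyc tracking` output.
# Succeeds (exit 0) if a "Leap status" line exists and is not "Not synchronised".
chrony_is_synced() {
    awk '/^Leap status/ { seen = 1; if ($0 ~ /Not synchronised/) bad = 1 }
         END { exit ((seen && !bad) ? 0 : 1) }'
}

# Reads `chronyc sources` output.
# Prints how many NTP servers have a non-zero Reach value (fifth column),
# i.e. servers that answered at least one recent poll.
chrony_reachable_sources() {
    awk 'substr($0, 1, 1) == "^" && $5 != "0" { n++ } END { print n + 0 }'
}

# Usage:
#   chronyc tracking | chrony_is_synced && echo "clock is synchronised"
#   chronyc sources  | chrony_reachable_sources
```

If every source reports a Reach of 0, no NTP replies are arriving at all. That is consistent with a UDP port 123 block, but does not prove it.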

5. Perform a Traceroute to Diagnose Packet Drop

If Chrony is unable to sync, the issue might be caused by packet drops on UDP port 123. To identify where the issue is, perform a traceroute to one of the NTP servers (for example, 0.pool.ntp.org) on port 123 (UDP):

sudo traceroute -U -p 123 0.pool.ntp.org

This command will trace the path of UDP packets to the NTP server, allowing you to see where the packet drops occur.

6. Interpret the Results

If the traceroute completes successfully, then the issue might be with the server's firewall or NTP configuration. If packets are dropped at a specific hop:

  • If it’s on your network, check the router, firewall, or switch for any port 123 (UDP) blocks.
  • If the issue is with your ISP, the packets might be blocked at their hardware level by IDS/IPS systems. Provide the traceroute results to your ISP's support team for faster resolution.
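To locate the hop after which replies stop, a small helper such as the following can be applied to saved traceroute output. This is a minimal sketch, not an official tool: the function name is hypothetical, and it assumes the standard traceroute layout of one line per hop, starting with the hop number, with `*` printed for each unanswered probe.

```shell
# Hypothetical helper: reads traceroute output on stdin and prints the number
# of the last hop from which at least one probe received a reply.
# A hop whose probes are all "*" counts as silent.
last_responding_hop() {
    awk '$1 ~ /^[0-9]+$/ {
             silent = 1
             for (i = 2; i <= NF; i++) if ($i != "*") silent = 0
             if (!silent) last = $1
         }
         END { if (last != "") print last }'
}

# Usage:
#   sudo traceroute -U -p 123 0.pool.ntp.org | last_responding_hop
```

Treat the result as a hint only. Some routers never answer traceroute probes, and the destination itself may stay silent even when NTP works, so a silent hop in the middle of the path is not necessarily where packets are dropped. The hop after the last responding one is usually the first place to investigate.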

7. Consult ISP

If packet drops are identified on the ISP side, contact your ISP with the traceroute data. Explain that port 123 (UDP) is blocked and this is causing your IC clock to desync. Even though they might not block the port intentionally, security systems like IDS/IPS might kick in to block this traffic.


Additional Commands

  • Check UDP open ports on the system:
  • Verify if any firewall rules are blocking UDP port 123:
  • Open UDP port 123 if it's blocked by UFW:
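
The commands for the three items above are not shown in this guide. The sketch below uses commonly available equivalents and is an assumption, not necessarily what the original author ran. The function names are hypothetical; `ss` comes from iproute2, and the `ufw` commands apply only if UFW is the firewall in use on the test server.

```shell
# Assumed commands, wrapped in hypothetical functions so they can be run one at a time.

# Check open UDP ports: list the UDP sockets the system is listening on.
list_udp_listeners() {
    ss -uln
}

# Verify firewall rules: show the active UFW rules.
# Look for DENY or REJECT entries that mention port 123/udp.
show_ufw_rules() {
    sudo ufw status verbose
}

# Open UDP port 123 in UFW.
# If outbound traffic is restricted, "sudo ufw allow out 123/udp" may be needed instead.
allow_ntp_in_ufw() {
    sudo ufw allow 123/udp
}
```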

Offline nodes

Network issues are the main reason why nodes are in an "OFFLINE" state.

  • Your node may not be reachable from the IC or
  • You may not be able to reach other nodes in the IC from your node or
  • You may not be able to reach the monitoring servers from your node.

Please refer to the Troubleshooting Networking Issues guide.

Another possible reason for an OFFLINE node may be that your GuestOS failed to start due to a RAM failure.

Server Troubleshooting Steps

These steps may help when a server is unhealthy or has been removed from the network, but the connectivity in the data center is functioning correctly:

  1. Verify if the server is up and running:
    • Check the power status of the server.
    • Check if the server is displaying any error messages or indicators.
    • If possible, access the server remotely or physically to ensure it is functioning properly.
  2. Hook up a crash cart and check for errors on the screen, troubleshoot as needed.
  3. Consider updating the firmware if it has been a long time, and/or if you have recently had other nodes that needed firmware upgrades to become healthy again.
  4. If no known error is found, redeploy the node with a fresh IC-OS image.

Frequently Asked Questions

Q: Is the monitoring system open-sourced? How does it communicate with the nodes?

A: The monitoring system configuration is not currently open-sourced. However, the node configuration that is required for proper node operation is fully open source. For more information about the node-to-node and node-to-monitoring communication, refer to the nftables configuration, which is the definitive guide for required open ports on Host OS, and similar nftables configuration for the Guest OS. You can find the configuration for DFINITY-owned DCs and Gen1 node providers here: nftables configuration. We use Victoria Metrics for metrics scraping (documentation) and vector.dev for log scraping.

Q: What are the destination IPs and Ports for Frankfurt, Chicago, and San Francisco for connectivity troubleshooting?

A: At present, detailed node/port information is not publicly accessible, as disclosing this information is considered a security risk. To effectively troubleshoot connectivity issues with your nodes, we recommend setting up a "spot instance" or a temporary virtual machine (VM) with a cloud provider in each of the geographical regions. This approach allows you to test both connectivity and connection stability to your nodes, providing a practical solution for identifying and resolving network-related issues.

Q: The dashboard shows offline or degraded status for <DC>, but everything seems fine. What should we do?

A: Check whether any of the following are enabled on the ISP side: firewall restrictions, traffic shaping, DoS protection, or QoS features. These should all be disabled for optimal node operation.

Q: Under what circumstances is a node removed from the IC network?

A: A node is removed from the IC network when it's deemed unhealthy. The determination of a node's health is made using tooling from https://github.com/dfinity/dre. This tooling assesses nodes based on various metrics and submits a proposal for their removal to maintain the highest level of decentralization possible. However, there are exceptions. For example, an unhealthy node might be temporarily retained if there are ongoing efforts to recover and restore it.

Q: How long can a node be down before it's excluded from the IC network?

A: There's no set time limit for how long a node can be down before exclusion. The decision is more qualitative and depends on the overall health of the network. Currently, the IC network can tolerate up to 1/3 of nodes in a 13-node subnet being down or unhealthy. This means a subnet can function with up to 4 unhealthy nodes. If the unhealthy nodes do not exceed this threshold, a node might be left in the subnet for a longer period, especially if there are efforts underway to make it healthy again.
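
The figure of 4 for a 13-node subnet can be reproduced with a one-line calculation. This sketch assumes the usual Byzantine fault tolerance bound, where a subnet of n nodes tolerates f faulty nodes as long as n >= 3f + 1. The function name is hypothetical.

```shell
# Assumption: the largest tolerated number of faulty nodes is floor((n - 1) / 3).
# $1 is the subnet size n.
max_unhealthy_nodes() {
    echo $(( ($1 - 1) / 3 ))
}

# Example: max_unhealthy_nodes 13 prints 4
```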

Q: When are nodes typically removed or replaced?

A: Node removals or replacements are currently conducted semi-manually and are typically scheduled for Mondays or Fridays. This timing allows Foundation voters to participate in the decision-making process at the start or end of the workweek. However, as this process is manually conducted, exceptions may occur based on specific circumstances.

Q: What are the future plans for node management in the IC network?

A: In the medium term, there are plans to automate node replacements. This means node swaps might occur more frequently and systematically, reducing the manual overhead and potentially enhancing the network's resilience and performance.

Note: As with all network operations, these practices are subject to change based on technological advancements and the evolving needs of the IC network. It's always good to refer to https://github.com/dfinity/dre for the most current information and tooling.

Q: What is the standard procedure if a faulty component occurs and we have to take the server down for maintenance?

A: At the moment the process is as follows:

  1. The node provider should do their best to bring the server back up as soon as possible.
  2. The DFINITY DRE team will monitor the situation and submit any proposals to replace faulty nodes, if necessary, or reach out to individual node providers if node replacements wouldn't be effective enough.

So feel free to do maintenance whenever you need to. If the node is not in a subnet, it can be taken down for as long as necessary. If the node is in a subnet, node replication should handle the outage without problems. Please be aware that we are actively working on reward adjustments based on the number of active and productive nodes, so try to keep downtime no longer than absolutely necessary to avoid reward reductions.

One thing that would be really helpful from your side is:

  1. Find the ID of the subnet in which the node is located, and check how many nodes in that subnet are currently unhealthy.
  2. If more than 2 nodes in the subnet are unhealthy (e.g. 3, 4, or more), please consider postponing the maintenance work until the number of unhealthy nodes in the subnet drops back to under 2.

In the future, any node provider (NP) will be able to run the DRE tooling, and there will be a financial incentive for node providers to both a) keep all nodes in the subnet healthy, and b) submit proposals to replace unhealthy nodes or to improve decentralization.