Difference between revisions of "Node Provider Troubleshooting"

Latest revision as of 17:28, 15 April 2024

Troubleshooting individual Nodes

Troubleshooting outage of all nodes/entire DC

(This section under construction)

If a whole data center goes down at once, the cause is usually an internet connectivity issue.

Verify connectivity on the data center's 10G circuit.
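A first connectivity check can be sketched as below. This is only an illustration: the function name check_reachability is hypothetical, and the addresses to probe (for example your gateway and a host beyond the circuit) are placeholders you supply yourself.

```shell
# Ping each given address a few times and report which ones do not answer.
# The addresses are supplied by the caller; nothing here is specific to the IC.
check_reachability() {
    failed=0
    for host in "$@"; do
        if ping -c 3 "$host" >/dev/null 2>&1; then
            echo "reachable: $host"
        else
            echo "UNREACHABLE: $host"
            failed=1
        fi
    done
    # Non-zero exit status if at least one address did not answer.
    return "$failed"
}
```

Run it from a machine inside the data center, e.g. check_reachability followed by your gateway address and an external address. If the gateway answers but the external address does not, the problem is more likely upstream of your rack.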

Getting the node ID from a node

Hook up a console to the node
The node ID will print to the screen upon a fresh boot and every 15 minutes thereafter.
If a node does not show its principal, the node needs to be redeployed with the current IC-OS image to gain a new principal.

Node Status on the Dashboard

The dashboard provides real-time status of each node in the network. Nodes are identified by the principal of the currently deployed operating system, so the principal of the node will change upon node redeployment. Node Providers are expected to maintain a private record correlating each server with its principal. This record is crucial for tracking, especially when nodes are redeployed with new principals.
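One possible way to keep such a private record is a plain CSV file with one row per deployment. The sketch below is an assumption, not a prescribed format: the column layout (serial, rack position, principal, date) and the function names lookup_principal and record_redeploy are hypothetical.

```shell
# Hypothetical record format (CSV): serial,rack_position,principal,deployed_on
# Adapt the columns to your own inventory; none of them come from the dashboard.

# Print the most recently recorded principal for a server serial number.
lookup_principal() {
    # $1 = record file, $2 = server serial number
    awk -F, -v serial="$2" '
        $1 == serial { principal = $3 }   # later rows override earlier ones
        END {
            if (principal == "") exit 1
            print principal
        }' "$1"
}

# Append a new row after a redeployment, keeping older rows as history.
record_redeploy() {
    # $1 = record file, $2 = serial, $3 = rack position, $4 = new principal
    printf '%s,%s,%s,%s\n' "$2" "$3" "$4" "$(date -u +%Y-%m-%d)" >> "$1"
}
```

Because rows are appended rather than overwritten, the file also shows which principals a server held before each redeployment.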

Metrics and Monitoring

Metrics are collected from nodes situated in three key geographical locations: Frankfurt (FR1), Chicago (CH1), and San Francisco (SF1). Each location is equipped with an independent monitoring and observability system. These systems apply specific rules to identify normal and abnormal node behaviors.

Alerts and Troubleshooting

When a node exhibits abnormal behavior, an ALERT is triggered by the monitoring system. The nature of the alert is indicated on the dashboard under the node's status.

Screenshot of a degraded node status page

In the event of an ALERT, follow the provided troubleshooting steps. If your issue and solution are not listed, please contribute by adding them to the page.

The dashboard indicates four possible statuses for each node; a node can also be missing from the list altogether:

Active in Subnet - The node is healthy and actively functioning within a subnet.
Awaiting Subnet - The node is operational and prepared to join a subnet when necessary. Node providers still get full rewards for the node.
Degraded - Metrics can be scraped from the node, indicating it is online, but an ALERT has been raised. This status suggests the node may be struggling to keep up with network demands. The node provider should intervene and follow the troubleshooting steps to resolve the issue. If you need to remove a node from the registry to service it, see Removing a Node From the Registry.
Offline - The monitoring system is unable to scrape metrics, possibly due to node failure or data center outage. Prioritize verifying network connectivity and hardware functionality. Follow the troubleshooting steps to resolve the issue.
Not listed at all - A node missing from the list may indicate significant issues that require immediate attention. If the node was functioning previously and is now not listed at all, this generally means that it started encountering issues and was removed from the registry. Follow the troubleshooting steps to resolve the issue.
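The triage described in the list above can be summarized as a small lookup. This is a sketch only: the function name next_action is hypothetical, the status strings are typed in by hand from the dashboard (no dashboard API is assumed), and the suggested actions simply restate the text above.

```shell
# Map a dashboard status to the first action suggested on this page.
next_action() {
    case "$1" in
        "Active in Subnet") echo "none: node is healthy" ;;
        "Awaiting Subnet")  echo "none: node is operational and still fully rewarded" ;;
        "Degraded")         echo "follow the Unhealthy Nodes troubleshooting steps" ;;
        "Offline")          echo "check network connectivity and hardware, then follow the troubleshooting steps" ;;
        "Not listed")       echo "assume removal from the registry and troubleshoot immediately" ;;
        *)                  echo "unknown status: $1" >&2; return 1 ;;
    esac
}
```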

Checking Node CPU and memory speed

Some server machines run slower than they should, and they may also become slower after certain events (such as power loss). Possible causes include firmware bugs, a faulty power supply, or insufficient power supply redundancy. If you suspect that this is the case, you can run the following test on the machine. Prepare a live Ubuntu USB stick and boot the server from it. Make sure you do not install Ubuntu on the machine or wipe the disks, since you would have to redeploy your node if you did. You only want to try Ubuntu.

Once you boot from the live Ubuntu image, you can install packages on it. They live in memory only and are gone once you reboot the machine. The test we found particularly valuable for determining whether the problem is present is sysbench. Install it with sudo apt install sysbench and then run

sysbench --test=memory run

on the machine (HostOS) and look at the memory transfer speed. Memory speed should be at least 5.6 GB/s. If you get less than that, please consult your vendor about how to increase the speed to the appropriate level. For instance, with some Dell servers we were seeing 2.6 GB/s memory speed and had to upgrade the CPLD firmware to resolve the performance issue. For some SuperMicro servers we have seen improvements from power cycling the server and changing the BIOS setting Advanced > ACPI Settings > ACPI SRAT L3 Cache As NUMA Domain to Disabled.
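To make the comparison less error-prone, the transfer rate can be extracted from the sysbench output and checked against the threshold. The sketch below rests on two assumptions: sysbench prints a summary line of the form "... transferred (N MiB/sec)" or "... transferred (N MB/sec)", and 5.6 GB/s is treated as 5600 in that same unit. The function name check_memory_speed is hypothetical; pass a different minimum as the first argument if you read the threshold differently.

```shell
# Read sysbench memory output on stdin and compare the reported transfer rate
# with a minimum, given in the unit sysbench prints (MiB/sec or MB/sec).
# The default of 5600 is an assumption: it reads "5.6 GB/s" as 5600 MB/s.
check_memory_speed() {
    awk -v min="${1:-5600}" '
        /transferred \(/ {
            # The rate is the number right after the opening parenthesis.
            rate = $0
            sub(/.*\(/, "", rate)
            sub(/ .*/, "", rate)
            found = 1
        }
        END {
            if (!found) { print "no transfer rate found in input"; exit 2 }
            if (rate + 0 < min + 0) { printf "SLOW: %s < %s\n", rate, min; exit 1 }
            printf "OK: %s >= %s\n", rate, min
        }'
}
```

Usage: sysbench --test=memory run | check_memory_speed. On sysbench 1.0 and later the --test option is deprecated and prints a warning; sysbench memory run is the equivalent newer form.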

Changing your Node Provider principal in the NNS

Please refer to Node Provider NNS proposals.

Changing a DC principal

Please refer to Node Provider NNS proposals.

Node Provider Matrix channel

Discuss your issue with other Node Providers in the Node Provider Matrix channel.

Back to Node Provider Documentation

@@ Line 1: / Line 1: @@
 ==Troubleshooting individual Nodes==
-* [[Possible Node Onboarding Errors]]
+* [[Troubleshooting Node Deployment Errors]]
+* [[Troubleshooting Failed NNS proposals]]
 * [[Unhealthy Nodes|Troubleshooting Unhealthy Nodes]]
+* [[Troubleshooting Switches]]
+* [[Troubleshooting Packet Loss Issues]]
+* [[Troubleshooting Interface issues]]
 * [[Updating Firmware]]
 * [[iDRAC access and TSR logs]]
-==Changing your Node Provider principal in the NNS==
+==Troubleshooting outage of all nodes/entire DC==
-* [[Changing Your Node Provider Principal]]
+(This section under construction)
-== IC Node Providers Matrix/Element channel ==
+If a whole data center goes down at once, the cause is usually an internet connectivity issue.
-There is an open Matrix channel that's intended to bring together all existing, future, and potential future Node Providers: https://app.element.io/#/room/#ic-node-providers:matrix.org
-The channel runs on the open and decentralized Matrix network. Among other ways the channel is also accessible from element.io and from the Element desktop app. The Element desktop app is similar in functionality to Slack, and they offer a web UI, a desktop client, and a mobile app.
+* Verify connectivity on the data center's 10G circuit.
-We recommend that you add [https://ems-docs.element.io/books/element-cloud-documentation/page/element-settings an email address in the Element Profile settings] and to [https://element.io/help#settings3 enable notifications for missed messages].
+== Getting the node ID from a node ==
-  🔗 How do I set up email notifications?
- You can set Element up to email you when you have missed some activity (new messages, new invites…). You can do this in the Notification section of your Settings and turn on the toggle labelled as ‘Enable email notifications’.
+* Hook up a console to the node
+* The node ID will print to the screen upon a fresh boot and every 15 minutes thereafter.
+* If a node does not show its principal, the node needs to be [[Node Provider Documentation|redeployed]] with the current IC-OS image to gain a new principal.
+==Node Status on the Dashboard==
+The dashboard provides real-time status of each node in the network. Nodes are identified by the principal of the currently deployed operating system, so the principal of the node will change upon node redeployment. Node Providers are expected to maintain a private record correlating each server with its principal. This record is crucial for tracking, especially when nodes are redeployed with new principals.
+===== Metrics and Monitoring =====
+Metrics are collected from nodes situated in three key geographical locations: Frankfurt (FR1), Chicago (CH1), and San Francisco (SF1). Each location is equipped with an independent monitoring and observability system. These systems apply specific rules to identify normal and abnormal node behaviors.
+===== Alerts and Troubleshooting =====
+When a node exhibits abnormal behavior, an ALERT is triggered by the monitoring system. The nature of the alert is indicated on the dashboard under the node's status.[[File:Dashboard-degraded-node.png|center|frameless|499x499px|Screenshot of a degraded node status page]]
+In the event of an ALERT, follow the provided [[Unhealthy Nodes|troubleshooting steps]]. If your issue and solution are not listed, please contribute by adding them to the page.
+The dashboard indicates four possible statuses for each node; a node can also be missing from the list altogether:
+*'''Active in Subnet''' - The node is healthy and actively functioning within a subnet.
+*'''Awaiting Subnet''' - The node is operational and prepared to join a subnet when necessary. Node providers still get full rewards for the node.
+*'''Degraded''' - Metrics can be scraped from the node, indicating it is online, but an ALERT has been raised. This status suggests the node may be struggling to keep up with network demands. The node provider should intervene and follow the [[Unhealthy Nodes|troubleshooting steps]] to resolve the issue. If you need to remove a node from the registry to service it, see [[Removing a Node From the Registry]].
+*'''Offline''' - The monitoring system is unable to scrape metrics, possibly due to node failure or data center outage. Prioritize verifying network connectivity and hardware functionality. Follow the [[Unhealthy Nodes|troubleshooting steps]] to resolve the issue.
+*'''Not listed at all''' - A node missing from the list may indicate significant issues that require immediate attention. If the node was functioning previously and is now not listed at all, this generally means that it started encountering issues and was removed from the registry. Follow the [[Unhealthy Nodes|troubleshooting steps]] to resolve the issue.
+==Checking Node CPU and memory speed==
+Some server machines run slower than they should, and they may also become slower after certain events (such as power loss). Possible causes include firmware bugs, a faulty power supply, or insufficient power supply redundancy. If you suspect that this is the case, you can run the following test on the machine. [https://ubuntu.com/tutorials/try-ubuntu-before-you-install#1-getting-started Prepare a live Ubuntu USB stick] and boot the server from it. Make sure you do not install Ubuntu on the machine or wipe the disks, since you would have to redeploy your node if you did. You only want to ''try'' Ubuntu.
+Once you boot from the live Ubuntu image, you can install packages on it. They live in memory only and are gone once you reboot the machine. The test we found particularly valuable for determining whether the problem is present is [https://manpages.ubuntu.com/manpages/jammy/man1/sysbench.1.html sysbench]. Install it with ''sudo apt install sysbench'' and then run
+  sysbench --test=memory run
+on the machine (HostOS) and look at the memory transfer speed. Memory speed should be at least 5.6 GB/s.
+If you get less than that, please consult your vendor about how to increase the speed to the appropriate level. For instance, with some Dell servers we were seeing 2.6 GB/s memory speed and had to upgrade the CPLD firmware to resolve the performance issue. For some SuperMicro servers we have seen improvements from power cycling the server and changing the BIOS setting '''Advanced''' > '''ACPI Settings''' > '''ACPI SRAT L3 Cache As NUMA Domain''' to '''Disabled'''.
+== Changing your Node Provider principal in the NNS==
+Please refer to [[Node Provider NNS proposals]]
+==Changing a DC principal==
+Please refer to [[Node Provider NNS proposals]]
+==Node Provider Matrix channel ==
+Discuss your issue with other Node Providers in the [[Node Provider Matrix channel]].
+Back to [[Node Provider Documentation]]