Difference between revisions of "Node Provider Troubleshooting"

Latest revision as of 16:10, 12 July 2024

Specific troubleshooting guides

Getting the node ID from a node

Hook up a console to the node.
The node ID will print to the screen upon a fresh boot and every 10 minutes thereafter.
If a node does not show its principal, consult the Troubleshooting Node Deployment Errors page.

Node Provider Matrix channel

After first consulting relevant documentation, discuss your issue with other Node Providers in the Node Provider Matrix channel.

@@ Line 1: / Line 1: @@
-==Troubleshooting individual Nodes==
+==Specific troubleshooting guides==
-* [[Possible Node Onboarding Errors]]
+* [[Troubleshooting Node Deployment Errors]]
-* [[Unhealthy Nodes|Troubleshooting Unhealthy Nodes]]
+* [[Troubleshooting Unhealthy Nodes]]
-* [[Troubleshooting Switches]]
+* [[Troubleshooting Networking Issues]]
-* [[Troubleshooting Packet Loss Issues]]
+* [[Troubleshooting Failed NNS proposals]]
-* [[Troubleshooting Interface issues]]
-* [[Updating Firmware]]
-* [[iDRAC access and TSR logs]]
-* Getting a shell during Node (SetupOS) installation, to troubleshoot a failure:
-** Hit enter until you see a login prompt
-** Log in with user <code>root</code> and empty password
-** Type <code>systemctl stop setupos</code> - it will auto-reboot if you don't do this
-** Now you have root access for diagnostics, etc
-==Node Status on the Dashboard==
+==Getting the node ID from a node==
-The dashboard displays each node, identified by the principal of the operating system currently in use. Node Providers privately maintain records of the server corresponding to each principal, including updates when a node is redeployed with a new principal.
-===== Metrics and Monitoring =====
+# Hook up a console to the node.
-Metrics are collected from nodes across three geographical locations: Frankfurt, Chicago, and San Francisco. Each location operates an independent monitoring and observability system, which applies a set of rules to identify normal and abnormal behaviors. An ALERT is triggered on a node if abnormal behavior is detected, based on these rules. The specific alert name is displayed in the node's status on the dashboard.
+# The node ID will print to the screen upon a fresh boot and every 10 minutes thereafter.
-[[File:Dashboard-degraded-node.png|center|frameless|499x499px|Screenshot of a degraded node status page]]
+# If a node does not show its principal, consult the [[Troubleshooting Node Deployment Errors]] page.
-In the event of an ALERT, follow the provided [[Unhealthy Nodes|troubleshooting steps]]. If your issue and solution are not listed, please contribute by adding them to the page.
-The dashboard indicates four possible statuses for each node:
-*'''Active in Subnet''' - The node is healthy and actively functioning within a subnet.
-*'''Awaiting Subnet''' - The node is operational and prepared to join a subnet when necessary. Node providers still get full rewards for the node.
-*'''Degraded''' - Metrics can be scraped from the node, indicating it is online, but an ALERT has been raised. This status suggests the node may be struggling to keep up with network demands. The node provider should intervene and follow the [[Unhealthy Nodes|troubleshooting steps]] to resolve the issue. If you need to remove a node from the registry to service it, see [[Removing a Node From the Registry]].
-*'''Offline''' - The monitoring system is unable to scrape metrics, possibly due to node failure or data center outage. Prioritize verifying network connectivity and hardware functionality. [[Unhealthy Nodes|Troubleshooting steps]] should be followed to resolve the issue.
-*'''Not listed at all''' - A node missing from the list may indicate significant issues, requiring immediate attention and troubleshooting. [[Unhealthy Nodes|Troubleshooting steps]] should be followed to resolve the issue.
-==Checking Node CPU and memory speed==
-Some server machines run slower than they should, and they may become slower still after certain events (such as power loss) because of firmware bugs, a faulty power supply, insufficient power supply redundancy, and so on. If you suspect that this is the case, you can run the following test on the machine. You can [https://ubuntu.com/tutorials/try-ubuntu-before-you-install#1-getting-started prepare a live Ubuntu USB stick] and boot the server from it. Do not install Ubuntu on the machine or wipe the disks, since you would then have to redeploy your node. You only want to ''try'' Ubuntu.
-Once you boot from the live Ubuntu image, you can install packages on it. They live in memory only and are gone once you reboot the machine. The test that we found particularly valuable for determining whether the problem is present is [https://manpages.ubuntu.com/manpages/jammy/man1/sysbench.1.html sysbench]. Install it with ''sudo apt install sysbench'' and then run
- sysbench memory run
-on the machine (HostOS) and look at the memory transfer speed. Memory speed should be at least 5.6 GB/s.
-If you get less than that, consult your vendor on how to increase the speed to the appropriate level. For instance, with some Dell servers we were seeing 2.6 GB/s memory speed and had to upgrade the CPLD firmware to resolve the performance issue. For some SuperMicro servers we have seen improvements from power cycling the server and changing the BIOS setting '''Advanced''' > '''ACPI Settings''' > '''ACPI SRAT L3 Cache As NUMA Domain''' to '''Disabled'''.
-== Changing your Node Provider principal in the NNS==
-*[[Changing Your Node Provider Principal]]
-==Changing a DC principal==
-*[[Changing Your Data Center Principal|Changing Your Data Center Principal (Creating a new Node Operator Record)]]
 ==Node Provider Matrix channel ==
-Discuss your issue with other Node Providers in the [[Node Provider Matrix channel]].
+'''<u>After first consulting relevant documentation</u>''', discuss your issue with other Node Providers in the [[Node Provider Matrix channel]].
-Back to [[Node Provider Documentation]]