Difference between revisions of "Node Provider Troubleshooting"

(Added two more troubleshooting pages)
Line 3:
  * [[Possible Node Onboarding Errors]]
  * [[Unhealthy Nodes|Troubleshooting Unhealthy Nodes]]
+ * [[Troubleshooting Switches]]
+ * [[Troubleshooting Packet Loss Issues]]
  * [[Updating Firmware]]
  * [[iDRAC access and TSR logs]]
Revision as of 18:19, 14 September 2023

Troubleshooting individual Nodes

Node Status on the Dashboard

The dashboard lists each node by the principal of the currently-running OS. Node Providers track privately which server corresponds to each principal. This includes updating their records when a node is redeployed and gets a new principal.

A node can have one of four statuses on the dashboard, or it may not be listed at all:

  • Active in Subnet - This is a node which is healthy and is currently running a subnet.
  • Awaiting Subnet - This is a node which is healthy and is currently a spare node. It is not running a subnet, but it keeps itself updated so that it is ready at a moment's notice to take part in a subnet.
  • Offline - This is a node which has completely failed. The failure is recent enough that it hasn't been removed from the registry yet.
    • If there is an outage of some sort at the data center, then the node should come back online and be healthy once the outage is resolved, as long as that doesn't take too long. Make sure the node has working connectivity before doing anything else.
    • If there are no issues with connectivity, then troubleshooting steps should be taken. Note that the node will have to be removed from the registry before it can be redeployed, if redeployment is needed.
  • Degraded - This node is struggling to keep up with the blockchain. If it's a temporary issue then it should catch back up and become healthy again. If it's a permanent issue, then it will eventually fail and go offline. If it's removed from the registry before it fails completely then it will disappear from the dashboard.
  • Not listed at all - The node had an issue and has already been removed from the registry. Troubleshooting steps should be taken.
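
As a rough summary of the statuses above, the following Python sketch maps each dashboard state to the first step the text suggests. The `NodeStatus` enum and the `first_action` helper are hypothetical names introduced only for illustration; they are not part of any dashboard or registry API, and the returned strings simply paraphrase the list above.

```python
from enum import Enum
from typing import Optional


class NodeStatus(Enum):
    # Hypothetical names mirroring the dashboard labels described above.
    ACTIVE_IN_SUBNET = "Active in Subnet"
    AWAITING_SUBNET = "Awaiting Subnet"
    OFFLINE = "Offline"
    DEGRADED = "Degraded"


def first_action(status: Optional[NodeStatus], connectivity_ok: bool = True) -> str:
    """Return the first step suggested for a given dashboard state.

    `status` is None when the node is not listed on the dashboard at all.
    """
    if status is None:
        # Not listed: the node was already removed from the registry.
        return "troubleshoot"
    if status in (NodeStatus.ACTIVE_IN_SUBNET, NodeStatus.AWAITING_SUBNET):
        # Healthy node, either in a subnet or waiting as a spare.
        return "no action needed"
    if status is NodeStatus.OFFLINE:
        # Rule out a data center or connectivity outage before anything else.
        return "troubleshoot" if connectivity_ok else "restore connectivity first"
    # Degraded: may recover on its own (assumption: keep watching it meanwhile).
    return "monitor; troubleshoot if it does not recover"
```

For example, `first_action(NodeStatus.OFFLINE, connectivity_ok=False)` returns `"restore connectivity first"`, reflecting the advice to check connectivity before any other step.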

Checking Node CPU and memory speed

Some server machines run slower than they should. They may become slower after certain events (such as power loss) due to firmware bugs, or they may have a faulty power supply, insufficient power supply redundancy, etc. If you suspect that this is the case, you can run the following test on the machine. Prepare a live Ubuntu USB stick and boot the server from it. Do not install Ubuntu on the machine: installing wipes the disks, and you would then have to redeploy your node. You only want to try Ubuntu.

Once you boot from the live Ubuntu image, you can install packages on it. They live in memory only and will be gone once you reboot the machine. The test that we found particularly valuable for detecting the problem is sysbench. Install it with sudo apt install sysbench and then run

sysbench --test=memory run

on the machine (HostOS) and look at the memory transfer speed. Memory speed should be at least 5.6 GB/s. If you get less than that, please consult your vendor on how to increase the speed to the appropriate level. For instance, with some Dell servers we were seeing 2.6 GB/s memory speed and had to upgrade the CPLD firmware to resolve the performance issue. For some SuperMicro servers we have seen improvements from power cycling the server and changing the BIOS setting Advanced > ACPI Settings > ACPI SRAT L3 Cache As NUMA Domain to Disabled.
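
If you want to check the result automatically, a minimal Python sketch along these lines could parse the saved sysbench output and compare it with the 5.6 GB/s threshold. It assumes the summary line format printed by sysbench 1.0.x, for example `102400.00 MiB transferred (7520.14 MiB/sec)`; other versions may print a different format. It also assumes the threshold is meant in decimal gigabytes. The function names are hypothetical.

```python
import re

MIN_SPEED_GB_PER_S = 5.6  # threshold quoted in the text above

# Assumed summary line format (sysbench 1.0.x), e.g.:
#   102400.00 MiB transferred (7520.14 MiB/sec)
_RATE_RE = re.compile(r"\(([\d.]+)\s*MiB/sec\)")


def memory_speed_gb_per_s(sysbench_output: str) -> float:
    """Extract the transfer rate and convert it from MiB/sec to GB/s."""
    match = _RATE_RE.search(sysbench_output)
    if match is None:
        raise ValueError("no 'MiB/sec' transfer rate found in sysbench output")
    mib_per_s = float(match.group(1))
    # 1 MiB = 1024 * 1024 bytes; 1 GB is taken as 10**9 bytes (assumption).
    return mib_per_s * 1024 * 1024 / 1e9


def is_fast_enough(sysbench_output: str) -> bool:
    """Return True if the measured speed meets the 5.6 GB/s threshold."""
    return memory_speed_gb_per_s(sysbench_output) >= MIN_SPEED_GB_PER_S
```

To use it, redirect the sysbench output to a file, read the file contents into a string, and pass that string to `is_fast_enough`.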

Changing your Node Provider principal in the NNS

Changing a DC principal

Node Provider Matrix channel

Discuss your issue with other Node Providers in the Node Provider Matrix channel.