Difference between revisions of "Node Provider Troubleshooting"
Katie.peters (talk | contribs) (Added "back' link at the bottom) |
Revision as of 10:01, 8 November 2023
Troubleshooting individual Nodes
- Possible Node Onboarding Errors
- Troubleshooting Unhealthy Nodes
- Troubleshooting Switches
- Troubleshooting Packet Loss Issues
- Troubleshooting Interface issues
- Updating Firmware
- iDRAC access and TSR logs
- Getting a shell during Node (SetupOS) installation, to troubleshoot a failure:
  - Hit enter until you see a login prompt
  - Log in with user root and an empty password
  - Type systemctl stop setupos (the machine will auto-reboot if you don't do this)
  - Now you have root access for diagnostics, etc.
Node Status on the Dashboard
The dashboard lists each node by the principal of the currently-running OS. Node Providers track privately which server corresponds to each principal. This includes updating their records when a node is redeployed and gets a new principal. Robust connectivity is essential. Without access to internal node logs and metrics, troubleshooting requires alternative strategies, including the use of an auxiliary machine within the same rack.
There are four statuses of node, plus the case where a node is not listed at all:
- Active in Subnet - Indicates a healthy and active node within a subnet.
- Awaiting Subnet - The node is operational and ready to join a subnet as needed.
- Offline - Represents a node failure or data center outage. Focus on verifying network connectivity and hardware functionality. Troubleshooting steps should be followed to resolve the issue.
- Degraded - The node is having difficulty keeping up with the network and may require intervention to prevent a complete failure. Troubleshooting steps should be followed to resolve the issue.
- Not listed at all. A node that is not present may have been removed due to significant issues and requires immediate attention. Troubleshooting steps should be followed to resolve the issue.
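The status list above can be summarized as a small lookup. The following is only a sketch: the function name next_step_for_status is hypothetical, and the suggested actions are paraphrased from the descriptions above rather than taken from any official tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a suggested next step for a dashboard status.
# The status strings are the ones listed above; the actions paraphrase the text.
next_step_for_status() {
    case "$1" in
        "Active in Subnet")
            echo "none: node is healthy and active in a subnet" ;;
        "Awaiting Subnet")
            echo "none: node is operational and ready to join a subnet" ;;
        "Offline")
            echo "verify network connectivity and hardware, then follow the troubleshooting steps" ;;
        "Degraded")
            echo "follow the troubleshooting steps before the node fails completely" ;;
        "Not listed at all")
            echo "treat as urgent and follow the troubleshooting steps" ;;
        *)
            # Anything else is not a status described on this page.
            echo "unknown status: $1" >&2
            return 1 ;;
    esac
}
```

For example, next_step_for_status "Degraded" prints the suggested action for a degraded node, and an unrecognized status returns a non-zero exit code.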
Checking Node CPU and memory speed
Some server machines run slower than they should, and they may also become slower after certain events (such as power loss). Possible causes include firmware bugs, a faulty power supply, or insufficient power supply redundancy. If you suspect that this is the case, you can run the following test on the machine. Prepare a live Ubuntu USB stick and boot the server from it. Make sure you do not install Ubuntu on the machine or wipe the disks, since you would then have to redeploy your node. You only want to try Ubuntu.
Once you boot from the live Ubuntu image, you can install packages on it. They live in memory only and will be gone once you reboot the machine. The test we found particularly valuable for detecting the problem is sysbench. Install it with sudo apt install sysbench and then run
sysbench --test=memory run
on the machine (HostOS) and look at the memory transfer speed. Memory speed should be at least 5.6GB/s. If you get less than that, consult your vendor on how to increase the speed to the appropriate level. For instance, with some Dell servers we were seeing 2.6GB/s memory speed and had to upgrade the CPLD firmware to resolve the performance issue. For some SuperMicro servers we have seen improvements by power cycling the server and changing the BIOS setting Advanced > ACPI Settings > ACPI SRAT L3 Cache As NUMA Domain to Disabled.
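The threshold check described above can be scripted. This is a minimal sketch, not an official tool: it assumes the sysbench summary contains a line of the form "102400.00 MiB transferred (9876.54 MiB/sec)", and it assumes 1 GB/s can be treated as 1024 MiB/s when comparing against the 5.6GB/s figure. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Minimum acceptable memory speed from the text above, in GB/s.
MIN_GB_PER_SEC="5.6"

# Read sysbench memory output on stdin and print the MiB/sec figure.
# Assumption: the summary line looks like "... MiB transferred (9876.54 MiB/sec)".
extract_mib_per_sec() {
    sed -n 's/.*(\([0-9.][0-9.]*\) MiB\/sec).*/\1/p' | head -n 1
}

# Succeed (exit 0) when the given MiB/sec value meets the threshold.
# Assumption: 1 GB/s is treated as 1024 MiB/s.
memory_speed_ok() {
    awk -v mib="$1" -v min="$MIN_GB_PER_SEC" \
        'BEGIN { exit !((mib / 1024) >= min) }'
}

# Hypothetical wrapper: run the benchmark command from the text and report.
check_memory_speed() {
    local speed
    speed="$(sysbench --test=memory run | extract_mib_per_sec)"
    if [ -z "$speed" ]; then
        echo "could not find a MiB/sec figure in the sysbench output" >&2
        return 2
    fi
    if memory_speed_ok "$speed"; then
        echo "OK: ${speed} MiB/sec"
    else
        echo "SLOW: ${speed} MiB/sec, below ${MIN_GB_PER_SEC} GB/s"
        return 1
    fi
}
```

After sourcing the script in the live Ubuntu session, calling check_memory_speed runs the benchmark once and prints either an OK or a SLOW line. If your sysbench version prints a different summary format, adjust the pattern in extract_mib_per_sec.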
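Changing your Node Provider principal in the NNS
- Changing Your Node Provider Principal
Changing a DC principal
- Changing Your Data Center Principal (Creating a new Node Operator Record)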
Node Provider Matrix channel
Discuss your issue with other Node Providers in the Node Provider Matrix channel.
Back to Node Provider Documentation