Difference between revisions of "Troubleshooting Node Deployment Errors"
Latest revision as of 10:01, 30 August 2024
This page has some error codes that may display as you are onboarding your nodes. Please review this guide in its entirety before reaching out on the IC Node Provider Matrix channel.
If you need Dell to service your machine, then these links will assist in retrieving a Dell TSR Log and in resetting the iDRAC password.
General troubleshooting steps
These troubleshooting steps have been refined over hundreds of deployments. They catch the vast majority of deployment issues. Please complete ALL these steps before messaging in the Matrix channel.
- Re-attempt to verify your node onboarding. Reread all the directions in your node deployment guide's verification step:
  - Verify node onboarding: Legacy (Gen-1) Node Deployment Guide (with an HSM)
  - Verify node onboarding: Current (Gen-2) Node Deployment Guide (without an HSM)
- Make sure you are using the latest IC-OS release. If you are not sure whether you are using the latest release, download it and retry your node deployment.
- Make sure you are using the proper node deployment guide:
  - Legacy (Gen-1) Node Deployment Guide (with an HSM)
  - Current (Gen-2) Node Deployment Guide (without an HSM)
- Reread all the directions in your node deployment guide to make sure you aren’t missing something. The directions are precise, and they do change slightly over time.
- Confirm that the networking information entered in your config.ini file is correct.
- Reread the Node Provider Networking Guide. Make sure you aren’t violating anything in the networking “What NOT to do” section.
- Restart everything you can between the node machine and the internet (router, firewall, etc.)
- Restart the node deployment process from the very beginning. Try to reproduce the error you are encountering.
- Try to deploy to a different node machine (one that is not being used in a subnet). Try to reproduce the error on multiple node machines.
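One part of confirming the networking information can be checked mechanically: whether the IPv6 gateway actually lies inside the IPv6 prefix. The following is a minimal sketch, not an official tool. The function name is hypothetical, and it assumes the prefix is written either as a full network (such as `2001:db8:0:1::/64`) or as the leading hextets of a /64 (such as `2001:db8:0:1`); adapt it to however the values appear in your own `config.ini`.

```python
import ipaddress


def gateway_in_prefix(prefix: str, gateway: str, prefix_length: int = 64) -> bool:
    """Return True if the gateway address lies inside the given IPv6 prefix.

    Assumption: `prefix` is a full network ("2001:db8:0:1::/64") or only the
    leading hextets of one ("2001:db8:0:1"), in which case it is padded.
    """
    if "/" not in prefix:
        # Pad bare leading hextets into a network address, e.g.
        # "2001:db8:0:1" -> "2001:db8:0:1::/64".
        if "::" not in prefix and prefix.count(":") < 7:
            prefix = prefix.rstrip(":") + "::"
        prefix = f"{prefix}/{prefix_length}"
    network = ipaddress.IPv6Network(prefix, strict=False)
    return ipaddress.IPv6Address(gateway) in network


if __name__ == "__main__":
    # Documentation-range addresses, used purely as an illustration.
    print(gateway_in_prefix("2001:db8:0:1", "2001:db8:0:1::1"))
    print(gateway_in_prefix("2001:db8:0:1", "2001:db8:0:2::1"))
```

A result of `False` does not prove the configuration is wrong, but it is a strong hint that the prefix or gateway was mistyped.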
Support request information requirements
If you are still encountering deployment issues, read the rest of this guide. If you still can't successfully deploy your nodes, post a support request message in the IC Node Provider Matrix channel containing ALL the following information:
- A confirmation that you ran through ALL the above general troubleshooting steps
- A screenshot of your failure / screenshots of any relevant logs.
- The stage of the deployment in which you are failing
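- What deployment method you are using:
  - Legacy (Gen-1) Node Deployment Guide (with an HSM)
  - Current (Gen-2) Node Deployment Guide (without an HSM)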
- Is this your first time performing an IC-OS installation?
- Is this your first time performing an IC-OS installation in this data center?
- Is this your first time performing an IC-OS installation with this Node Operator Key?
- Can you reproduce this issue?
- Machine hardware details (Gen1 / Gen2, server brand)
- The datacenter out of which you are attempting to deploy
- Are you onboarding with an IPv4 address?
- Any relevant networking details
- Any other details you see as relevant
If you post a support request message that doesn't include ALL the above information, you will be asked to provide the missing details before anyone can help.
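If you draft support requests often, a small checklist can catch omissions before you post. The sketch below is only an illustration: the field names are hypothetical labels for the items in the list above, not part of any official template.

```python
# Hypothetical field names, one per required item in the list above.
REQUIRED_FIELDS = [
    "confirmed_general_troubleshooting",
    "failure_screenshots",
    "deployment_stage",
    "deployment_method",
    "first_icos_installation",
    "first_installation_in_data_center",
    "first_installation_with_node_operator_key",
    "reproducible",
    "hardware_details",
    "data_center",
    "onboarding_with_ipv4",
    "networking_details",
]


def missing_fields(request: dict) -> list:
    """Return the required fields that are absent or left empty.

    A value of False (for example, "not my first installation") counts as
    answered; only None and the empty string count as missing.
    """
    return [
        field
        for field in REQUIRED_FIELDS
        if request.get(field) is None or request.get(field) == ""
    ]


if __name__ == "__main__":
    draft = {"deployment_stage": "IC-OS installation", "reproducible": True}
    for field in missing_fields(draft):
        print("Still missing:", field)
```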
Unhealthy nodes
Example Error
You successfully installed your node without errors and your node registered with the IC, but on the dashboard, the status of your node is neither “Awaiting Subnet” nor “Active in Subnet”.
To troubleshoot, consult the Troubleshooting Unhealthy Nodes guide.
Node registration failures
Example Error
You successfully installed your node without errors. The machine rebooted and HostOS came up, but your node is now failing to register with the IC (your node is not listed on the dashboard). You've run through all the above general troubleshooting steps, but they haven't solved your issue.
Common Causes
- Networking issues: as the general troubleshooting steps direct, reread the Node Provider Networking Guide, make sure you aren’t violating anything in the networking “What NOT to do” section, and please restart everything you can between the node machine and the internet (router, firewall, etc.)
- Node Allowance exhausted: If you see an error "Node allowance for this node operator is exhausted," this indicates that the node allowance in your node operator record created during the Node Provider Onboarding has been maxed out. Check the dashboard to see how many nodes are currently registered under your Node Provider principal ID. If there are more nodes listed than expected or old nodes that should be removed, see this guide on Removing a Node From the Registry.
- "blank" console: If after the machine reboots and HostOS comes up, you then see a "blank" console, wait at least 10 minutes. You should see the node-ID logged to the console and then you'll be able to search for the node-ID on the dashboard to verify the node onboarding.
- Redeploying node in a subnet: If you see an error "Cannot remove a node that is a member of a subnet," this indicates that you are attempting to redeploy a node that is currently assigned to a subnet. To redeploy a node in a subnet, the node must first be removed from the subnet. Note that if you are performing a firmware upgrade, you do NOT need to redeploy the node.
IC-OS installation failures
Here are some common IC-OS installation failures:
Missing Drives
Example Error
--------------------------------------------------------------------------------
INTERNET COMPUTER - SETUP - FAILED
--------------------------------------------------------------------------------

Please consult the wiki guide: Troubleshooting Node Deployment Errors.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------

Not enough drives found. Are all drives correctly installed?

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------
A variant of this error reads "Aggregate Disk size does not meet requirements".
Common Causes
This error means that the IC-OS installation medium could not detect all required drives. This is a common issue, even if you believe that all drives are installed correctly. Some of them may not be functioning properly, or may not be fully seated into the chassis.
Suggested Solutions
Check that all drives are fully seated and installed correctly, or install the required number of drives. You may be able to check the drives for indication LEDs to see which may not be installed or functioning correctly.
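To see which drives the operating system actually detects, you can count them from a Linux environment where a Python interpreter is available (for example, a live USB distribution). The following is a sketch under stated assumptions: it assumes the drives are NVMe devices, as in the `/dev/nvme8n1` example elsewhere on this page, and it reads the standard Linux `/sys/block/<device>/size` attribute, which is expressed in 512-byte sectors. The function names are hypothetical, and the required drive count and aggregate size are deliberately not hard-coded; take them from your deployment guide.

```python
from pathlib import Path

SECTOR_BYTES = 512  # the sysfs "size" attribute counts 512-byte sectors


def list_nvme_drives(sys_block: str = "/sys/block") -> dict:
    """Map each detected NVMe device name (e.g. "nvme0n1") to its size in bytes."""
    drives = {}
    for entry in sorted(Path(sys_block).glob("nvme*")):
        size_file = entry / "size"
        if size_file.is_file():
            sectors = int(size_file.read_text().strip())
            drives[entry.name] = sectors * SECTOR_BYTES
    return drives


def summarize(drives: dict) -> str:
    """Return a one-line summary of drive count and aggregate capacity."""
    total_tb = sum(drives.values()) / 1e12
    return f"{len(drives)} NVMe drive(s), {total_tb:.2f} TB aggregate"


if __name__ == "__main__":
    detected = list_nvme_drives()
    for name, size in detected.items():
        print(f"{name}: {size / 1e12:.2f} TB")
    print(summarize(detected))
```

If fewer drives appear here than are physically installed, the missing ones are the first candidates to reseat or replace.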
Invalid CPU Configuration
Example Error
--------------------------------------------------------------------------------
INTERNET COMPUTER - SETUP - FAILED
--------------------------------------------------------------------------------

Please consult the wiki guide: Troubleshooting Node Deployment Errors.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------

Number of threads (16/32) does NOT meet system requirements.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------
Common Causes
Issues related to CPU capability usually mean that the CPUs are not configured correctly in the system BIOS.
Suggested Solutions
Please check that BIOS settings are configured correctly. It may be helpful to reset all settings to factory defaults, and go through the BIOS configuration again.
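After changing BIOS settings, you can confirm how many hardware threads the operating system sees from any Linux environment with Python available. This is a minimal sketch: the function name is hypothetical, and the required thread count must come from your deployment guide rather than from this example. It reads the `(16/32)` in the example error as "detected/required", which is an interpretation rather than something this page states.

```python
import os


def has_enough_threads(required_threads: int, detected_threads=None) -> bool:
    """Return True if the detected logical CPU count meets the requirement.

    When `detected_threads` is omitted, the count reported by the operating
    system is used (os.cpu_count() may return None, treated here as 0).
    """
    if detected_threads is None:
        detected_threads = os.cpu_count() or 0
    return detected_threads >= required_threads


if __name__ == "__main__":
    print("Logical CPUs detected:", os.cpu_count())
    # 32 is taken from the example error above, purely as an illustration.
    print("Meets requirement:", has_enough_threads(32))
```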
Unable to Reach Internet
Example Error
--------------------------------------------------------------------------------
INTERNET COMPUTER - SETUP - FAILED
--------------------------------------------------------------------------------

Please consult the wiki guide: Troubleshooting Node Deployment Errors.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------

Unable to ping IPv6 gateway.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------
Common Causes
This error means that the node is not able to communicate with the network properly. This can be due to a misconfigured network configuration, or due to issues somewhere between the node and the rest of the internet.
Suggested Solutions
Please try to capture any output that is displayed before this error shows. For example:
* Printing user defined network settings...
  IPv6 Prefix : XXX
  IPv6 Subnet : XXX
  IPv6 Gateway: XXX

* Printing system's network settings...
  IPv6 Prefix : XXX
  IPv6 Subnet : XXX
  IPv6 Gateway: XXX

* Printing IPv6 addresses...
  SetupOS: XXX
  HostOS : XXX
  GuestOS: XXX
Please compare this, and the initial configuration, to what you expect. If this configuration does not match, please update the initial configuration, and try again.
If this does match the expected configuration, please attempt to diagnose any machines between this node and the rest of the internet. This could be due to improper firewall configuration, or an issue with the data center’s network. If all configuration looks correct, please attempt to reboot any machines between this node and the rest of the Internet. In most cases, this would be a firewall. Rebooting the firewall - even if it seems to be operating correctly - has resolved this issue many times.
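If you have captured the output as text, the comparison can be done mechanically. The sketch below parses output in the format shown in the example above and reports settings whose values differ between the "user defined" and "system's" sections. It is an illustration only: the function names are hypothetical, and treating those two sections as directly comparable is an assumption, so a reported difference is a hint to investigate rather than proof of a misconfiguration.

```python
def parse_network_output(text: str) -> dict:
    """Split captured output into sections keyed by their "* Printing ..." title."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("* Printing"):
            # "* Printing user defined network settings..." -> "user defined network settings"
            current = line[len("* Printing"):].strip().rstrip(".").strip()
            sections[current] = {}
        elif ":" in line and current is not None:
            # Split on the first colon only, because IPv6 values contain colons.
            key, _, value = line.partition(":")
            sections[current][key.strip()] = value.strip()
    return sections


def mismatched_settings(sections: dict) -> dict:
    """Return {setting: (user value, system value)} for every differing setting."""
    user = sections.get("user defined network settings", {})
    system = sections.get("system's network settings", {})
    return {
        key: (value, system.get(key))
        for key, value in user.items()
        if value != system.get(key)
    }


if __name__ == "__main__":
    # Documentation-range addresses, used purely as an illustration.
    captured = """
    * Printing user defined network settings...
      IPv6 Prefix : 2001:db8:0:1
      IPv6 Gateway: 2001:db8:0:1::1
    * Printing system's network settings...
      IPv6 Prefix : 2001:db8:0:1
      IPv6 Gateway: 2001:db8:0:2::1
    """
    print(mismatched_settings(parse_network_output(captured)))
```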
Unable to setup PV
Example Error
--------------------------------------------------------------------------------
INTERNET COMPUTER - SETUP - FAILED
--------------------------------------------------------------------------------

Please consult the wiki guide: Troubleshooting Node Deployment Errors.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------

Unable to setup PV on drive '/dev/nvme8n1'.

--------------------------------------------------------------------------------
ERROR
--------------------------------------------------------------------------------
Common Causes
This error means that the node is able to recognize that a drive is installed, but is unable to write to it. This could indicate that there is a hardware issue with the drive.
Suggested Solutions
Please try to remove and re-install all drives, before attempting to install the node again. It may be helpful to independently verify that each drive is functioning correctly.
Warning: Gen2 hardware detected but no Node Operator Private Key found
This is not an error. This is a warning that Gen2 hardware should be onboarded using the Gen2 deployment method (i.e. without an HSM).
If you already onboarded this node machine with an HSM and are just redeploying the node, you may continue (just wait 5 minutes for the installation to resume). If you are onboarding a new node machine, you should not use an HSM.
Troubleshooting IC-OS installation failures: Getting a shell in SetupOS
- During the IC-OS installation, you may hit enter to obtain console access to troubleshoot any issues you are encountering. You can also hit enter at the error page in order to access the console. Hit enter until you see a login prompt.
- Log in with user root and an empty password.
- Now you have root access for diagnostics, etc.
Note that after the installation is finished and the machine reboots into HostOS, the same troubleshooting console isn't available anymore. It is only available in SetupOS, i.e., during IC-OS installation. After the OS is installed you may boot the machine from a live USB distribution such as Ubuntu Live USB, and troubleshoot that way.