Node Provider Maintenance Guide

From Internet Computer Wiki

Latest revision as of 07:20, 19 July 2024

Troubleshooting

See the Node Provider Troubleshooting guide for info on troubleshooting failed onboardings, unhealthy nodes, networking, and more.

Submitting NNS proposals

As a part of being a Node Provider, you will likely have to submit some NNS proposals. The page at the following link describes some of these proposals: Node Provider NNS proposals

Monitoring

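You are expected to regularly monitor the health of your nodes. Node health status is available on the public dashboard. Example: node status (https://dashboard.internetcomputer.org/node/235hh-hmjhq-dejel-3q5oi-pdz66-dygbp-yi2sy-zmuiq-rj7r7-65hue-wae).

You can also view your node's public health metrics (https://internetcomputer.org/docs/current/references/node-providers/node-metrics#manually-obtaining-metrics) and monitor it with the IC observability stack (https://internetcomputer.org/docs/current/references/node-providers/node-metrics).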

Community Tools and Resources

Several node providers have generously shared tools to facilitate monitoring node health. These tools can provide notifications in case of node issues.

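Aviate Labs Node Monitor

  • Turnkey Solution: Receive email alerts for unhealthy nodes.
  • Link: AviateLabs Node Monitor (https://www.aviatelabs.co/node-monitor)

DIY Node Monitoring

  • GitHub Repository: Run your own node monitoring system.
  • Link: Aviate Labs GitHub (https://github.com/aviate-labs/node-monitor)

Prometheus Exporter for Node Status

  • GitHub Repository: A tool for exporting node status to a Prometheus-compatible format.
  • Link: IC Node Status Prometheus Exporter (https://github.com/virtualhive/ic-node-status-prometheus-exporter)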

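Common maintenance tasks

  • Removing a Node From the Registry
  • Adding additional node machines to existing Node Allowance
  • Updating your node's IPv4 and domain name
  • Changing IPv6 addresses of nodes
  • Moving a node from one DC to another
  • iDRAC access and TSR logs
  • Checking node CPU and memory speed
  • For changing your Node Provider or DC principal, please refer to Node Provider NNS proposals
  • Updating Firmware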

Permitted tools

For security and confidentiality reasons, other tools are not allowed to run on the same machine in parallel with the replica. If you need to troubleshoot an issue, either boot the machine from a USB drive that has a live Linux distribution (e.g. Ubuntu) or debug from an auxiliary machine in the same rack over which you have complete control, as described in Troubleshooting Unhealthy Nodes, section "Setting Up an Auxiliary Machine for Network Diagnostics".

Scheduled data center outages

When your data center notifies you of a scheduled outage, you must:

  • Notify DFINITY on the Node Provider Matrix channel
  • Make sure your nodes return to one of the healthy statuses when the outage is resolved:
    • Active in Subnet - The node is healthy and actively functioning within a subnet.
    • Awaiting Subnet - The node is operational and prepared to join a subnet when necessary.
  • If a node is degraded at first, give it a little bit of time in case it needs to catch up, but make sure that it does return to one of the two healthy statuses.
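
As a rough illustration of the check described above, the following Python sketch flags every node whose status is not one of the two healthy statuses. It assumes you have already collected each node's status label from the public dashboard yourself. The function name, the node IDs, and the dictionary layout are hypothetical and not part of any official tool.

```python
# The two healthy status labels named in the list above.
HEALTHY_STATUSES = {"Active in Subnet", "Awaiting Subnet"}


def nodes_needing_attention(statuses):
    """Return the sorted IDs of nodes whose status is not a healthy one.

    `statuses` maps a node ID to the status label read off the dashboard.
    """
    return sorted(
        node_id
        for node_id, status in statuses.items()
        if status not in HEALTHY_STATUSES
    )


if __name__ == "__main__":
    # Hypothetical node IDs and statuses, for illustration only.
    observed = {
        "node-a": "Active in Subnet",
        "node-b": "Degraded",
        "node-c": "Awaiting Subnet",
    }
    print(nodes_needing_attention(observed))
```

A node that shows up in this list right after the outage may only need time to catch up, so it is worth repeating the check a little later before escalating.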

Node rewards based on useful work

The Internet Computer protocol can tolerate up to 1/3 of nodes misbehaving. Work is under way to automatically issue node rewards based on useful work and to automatically reduce node remuneration when nodes misbehave. This will provide a financial incentive for honest behavior. Please follow the forum and the Matrix channel to stay informed about these activities.

In the meantime, prepare for this by making sure that your nodes are online and healthy at all times; otherwise you risk penalties even before the automatic node rewards based on useful work become active.

Subnet recovery

If subnet recovery is needed, we may have to reach out to you for assistance. Please follow activity in the Matrix channel closely and enable notifications for new messages, especially direct mentions.

General best practices

  1. Keep a separate machine in the same rack with appropriate tools for network diagnostics and troubleshooting
  2. Engage with the node provider community for support and to share effective troubleshooting techniques

Setting Up an Auxiliary Machine for Network Diagnostics

Robust Internet connectivity is essential. Without access to internal node logs and metrics, troubleshooting requires alternative strategies, including the use of an auxiliary machine within the same rack. Here's a brief outline for setting up an auxiliary machine in the same rack while following best security practices:

  1. Hardware Setup:
    • Choose a server with sufficient resources to run diagnostic tools without degrading its own performance. There is no need to follow the gen1/gen2 hardware requirements for this server (since this machine will not join the IC network), but make sure it is performant enough to run network tests.
    • Ensure that physical security measures are in place to prevent unauthorized access.
  2. Operating System and Software:
    • Install a secure operating system, like a minimal installation of Linux (we prefer Ubuntu 22.04), which reduces the attack surface.
    • Keep the system updated with the latest security patches and firmware updates.
  3. Network Configuration:
    • Configure the machine with an IPv6 address in the same range as the IC nodes for accurate testing.
    • Set up a restrictive firewall on the machine to allow only the necessary inbound and outbound traffic. Consider allowing Internet access for this machine only during troubleshooting sessions and keeping the machine behind a VPN at other times.
  4. Diagnostic Tools:
    • Install network diagnostic tools such as ping, traceroute, nmap, tcpdump, and iperf.
    • Configure monitoring tools to simulate node activities and track responsiveness.
  5. Security Measures:
    • Prefer key-based access over passwords. If you must use passwords, make them strong and unique for every account, and change them regularly.
    • Implement key-based SSH authentication and disable root login over SSH.
    • Regularly review logs for any unusual activities that might indicate a security breach.
  6. Maintenance and Updates:
    • Regularly update all software to the latest versions.
    • Periodically test your network diagnostic tools to ensure they are functioning as expected.
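
One item in the outline above, giving the auxiliary machine an IPv6 address in the same range as the IC nodes, can be sanity-checked with a short Python sketch. The /64 prefix length is an assumption (use whatever prefix your data center actually assigned), the function name is hypothetical, and the addresses come from the 2001:db8::/32 documentation range rather than from any real deployment.

```python
import ipaddress


def in_same_range(aux_address, node_addresses, prefix_length=64):
    """Return True if every node address falls inside the auxiliary
    machine's IPv6 prefix.

    prefix_length=64 is an assumed default; adjust it to your allocation.
    """
    # strict=False lets us derive the network from a host address.
    aux_network = ipaddress.ip_network(
        f"{aux_address}/{prefix_length}", strict=False
    )
    return all(
        ipaddress.ip_address(address) in aux_network
        for address in node_addresses
    )


if __name__ == "__main__":
    # Documentation-range addresses, for illustration only.
    nodes = ["2001:db8:1:2::20", "2001:db8:1:2::30"]
    print(in_same_range("2001:db8:1:2::10", nodes))
```

If the check returns False, network tests run from the auxiliary machine may take a different path than traffic to and from the nodes, which makes the results less representative.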

Peer-support and bug reports / resolution: Node Provider Matrix Channel

Node Providers are encouraged to join the dedicated Node Provider Matrix channel. This platform can be used to discuss maintenance-related queries, share insights, report issues, and search for previous resolutions of operational problems.

Please consult the Matrix channel for troubleshooting issues only after consulting the Node Provider Troubleshooting guide.

Communication Guidelines on the Matrix Channel

As a Node Provider, ensure your notifications are enabled to receive new messages promptly. Your input or intervention might be crucial, especially in urgent situations.

It is recommended to add the node provider name to your alias (handle) on the communication platform, to facilitate communication and enable others to quickly and easily mention you.