Difference between revisions of "Node Provider Maintenance Guide"

From Internet Computer Wiki
(Deleting the runbook best practices section on second thought.)
 
(16 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 +
== Submitting NNS proposals ==
 +
As a part of being a Node Provider, you will likely have to submit some NNS proposals. The page at the following link describes some of these proposals: [[Node Provider NNS proposals]]
  
WORK IN PROGRESS
+
== Monitoring ==
 +
You are expected to regularly monitor the health of your nodes. Node health status is available on the public dashboard. Example: [https://dashboard.internetcomputer.org/node/b5d56-nm7ae-p24jg-t25gp-5bmhb-rjbnt-3dmoq-goqby-5tf6c-ygnnu-aqe node status].
  
Join the [[Node Provider Matrix channel]]. Here, you can submit questions or comments related to Node Provider node maintenance.
+
Also, check out the Tools and Resources section below for tools that can help you with monitoring and alerting activities.
  
to-do: Incorporate from: [[Unhealthy Nodes]]
+
== Permitted tools ==
 +
For security and confidentiality reasons, no other tools are allowed to run on the same machine in parallel with the replica. If you need to troubleshoot an issue, either boot the machine from a USB drive that has a live Linux distribution (e.g. [https://ubuntu.com/tutorials/try-ubuntu-before-you-install#3-boot-from-usb-flash-drive Ubuntu]) or debug from an auxiliary machine in the same rack over which you have complete control, as described in [[Unhealthy Nodes#Setting Up an Auxiliary Machine for Network Diagnostics]].
  
 +
== Scheduled DC outages ==
 +
When your DC notifies you of a scheduled outage, you must:
  
 +
* Notify DFINITY on the [[Node Provider Matrix channel]]
 +
* Make sure your nodes return to one of the healthy statuses when the DC outage is resolved:
 +
** Active in Subnet - The node is healthy and actively functioning within a subnet.
 +
** Awaiting Subnet - The node is operational and prepared to join a subnet when necessary.
 +
* If a node is degraded at first, give it a little bit of time in case it needs to catch up, but make sure that it does return to one of the two healthy statuses.
  
 +
== Handling degraded nodes ==
 +
Please take a look at [[Node Provider Troubleshooting]]
  
==Node Status on the Dashboard==
+
== Handling dead nodes ==
The dashboard lists each node by the principal of the currently-running OS. Node Providers track privately which server corresponds to each principal. This includes updating their records when a node is redeployed and gets a new principal.
+
Please take a look at [[Node Provider Troubleshooting]]
  
There are four statuses of node:
+
== Node rewards based on useful work ==
 +
The Internet Computer protocol tolerates misbehavior by fewer than 1/3 of the nodes in a subnet. Work is ongoing to issue node rewards automatically based on useful work, and to automatically reduce remuneration for misbehaving nodes. This will provide a financial incentive for honest behavior. Please follow the forum and the Matrix channel to stay informed about these activities.
  
* '''Active in Subnet''' - This is a node which is healthy and is currently running a subnet.
+
In the meantime, prepare for this by keeping your nodes online and healthy at all times; otherwise you risk penalties even before the automatic node rewards based on useful work become active.
* '''Awaiting Subnet''' - This is a node which is healthy and is currently a spare node. It is not running a subnet but it is keeping itself updated so that it is ready at a moment's notice to take part in a subnet.
+
 
* '''Offline''' - This is a node which has completely failed. The failure is recent enough that it hasn't been removed from the registry yet. If there is an outage of some sort at the data center, then the node should come back online and be healthy once it's resolved, as long as it doesn't take too long. Make sure that connectivity to the node is properly supplied before doing anything else. If there are no issues with connectivity, then [[Unhealthy Nodes|troubleshooting steps]] should be taken. Note that the node will have to be removed from the registry before it can be redeployed, if redeployment is needed.
+
== Subnet recovery ==
* '''Degraded''' - This node is struggling to keep up with the blockchain. If it's a temporary issue then it should catch back up and become healthy again. If it's a permanent issue, then it will eventually fail and go offline. If it's removed from the registry before it fails completely then it will disappear from the dashboard.
+
In case subnet recovery is needed, we may have to reach out to you for assistance. Please make sure you closely follow activities in the Matrix Channel, and enable notifications on new messages -- especially direct mentions.
* '''Not listed at all'''. If a node is not listed at all, then it had an issue and it was already removed from the registry. [[Unhealthy Nodes|Troubleshooting steps]] should be taken.
+
 
 +
== Peer-support and bug reports / resolution: Node Provider Matrix Channel ==
 +
 
 +
Node Providers are encouraged to join the dedicated [[Node Provider Matrix channel]]. This platform can be used to discuss maintenance-related queries, share insights, report issues, and search for earlier resolutions of operational issues.
 +
 
 +
'''Communication Guidelines on the Matrix Channel'''
 +
 
 +
As a Node Provider, ensure your notifications are enabled to receive new messages promptly. Your input or intervention might be crucial, especially in urgent situations.
 +
 
 +
It is recommended to add the node provider name to your alias (handle) on the communication platform, to facilitate communication and enable others to quickly and easily mention you.
 +
 
 +
== Tools and Resources ==
 +
 
 +
Several node providers have generously shared tools to facilitate monitoring node health. These tools can provide notifications in case of node issues.
 +
 
 +
=== Aviate Labs Node Monitor ===
 +
 
 +
* '''Turnkey Solution''': Receive email alerts for unhealthy nodes.
 +
* '''Link''': [https://www.aviatelabs.co/node-monitor AviateLabs Node Monitor]
 +
 
 +
=== DIY Node Monitoring ===
 +
 
 +
* '''GitHub Repository''': Run your own node monitoring system.
 +
* '''Link''': [https://github.com/aviate-labs/node-monitor Aviate Labs GitHub]
 +
 
 +
=== Prometheus Exporter for Node Status ===
 +
 
 +
* '''GitHub Repository''': A tool for exporting node status to a Prometheus-compatible format.
 +
* '''Link''': [https://github.com/virtualhive/ic-node-status-prometheus-exporter IC Node Status Prometheus Exporter]
 +
 
 +
== Additional Notes ==
 +
 
 +
* '''Screenshots''': Include screenshots of the node status from the public dashboard for reference and troubleshooting.
 +
In case you observe issues, follow: [[Unhealthy Nodes]] and [[Node Provider Troubleshooting]]

Latest revision as of 19:00, 1 March 2024

Submitting NNS proposals

As a part of being a Node Provider, you will likely have to submit some NNS proposals. The page at the following link describes some of these proposals: Node Provider NNS proposals

Monitoring

You are expected to regularly monitor the health of your nodes. Node health status is available on the public dashboard. Example: node status.

Also, check out the Tools and Resources section below for tools that can help you with monitoring and alerting activities.

Permitted tools

For security and confidentiality reasons, no other tools are allowed to run on the same machine in parallel with the replica. If you need to troubleshoot an issue, either boot the machine from a USB drive that has a live Linux distribution (e.g. Ubuntu) or debug from an auxiliary machine in the same rack over which you have complete control, as described in Unhealthy Nodes#Setting Up an Auxiliary Machine for Network Diagnostics.

Scheduled DC outages

When your DC notifies you of a scheduled outage, you must:

  • Notify DFINITY on the Node Provider Matrix channel
  • Make sure your nodes return to one of the healthy statuses when the DC outage is resolved:
    • Active in Subnet - The node is healthy and actively functioning within a subnet.
    • Awaiting Subnet - The node is operational and prepared to join a subnet when necessary.
  • If a node is degraded at first, give it a little bit of time in case it needs to catch up, but make sure that it does return to one of the two healthy statuses.
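The post-outage check described above can be sketched in a few lines of Python. This is only an illustration: the status strings mirror the dashboard labels named in this guide, while the function name, the node IDs, and the idea of recording statuses in a dictionary by hand are assumptions made for the example. It does not query the dashboard.

```python
# Illustrative sketch only: the status strings mirror the dashboard labels
# named in this guide; the function and variable names are hypothetical.
HEALTHY_STATUSES = {"Active in Subnet", "Awaiting Subnet"}


def nodes_needing_attention(statuses):
    """Return the node IDs whose status is not one of the two healthy ones."""
    return sorted(
        node_id
        for node_id, status in statuses.items()
        if status not in HEALTHY_STATUSES
    )


if __name__ == "__main__":
    # Hypothetical node IDs and statuses, noted by hand after an outage.
    after_outage = {
        "node-a": "Active in Subnet",
        "node-b": "Degraded",
        "node-c": "Awaiting Subnet",
    }
    print(nodes_needing_attention(after_outage))
```

A node that shows up in the returned list right after the outage may simply need time to catch up; it only warrants troubleshooting if it stays in the list on later checks.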

Handling degraded nodes

Please take a look at Node Provider Troubleshooting

Handling dead nodes

Please take a look at Node Provider Troubleshooting

Node rewards based on useful work

The Internet Computer protocol tolerates misbehavior by fewer than 1/3 of the nodes in a subnet. Work is ongoing to issue node rewards automatically based on useful work, and to automatically reduce remuneration for misbehaving nodes. This will provide a financial incentive for honest behavior. Please follow the forum and the Matrix channel to stay informed about these activities.

In the meantime, prepare for this by keeping your nodes online and healthy at all times; otherwise you risk penalties even before the automatic node rewards based on useful work become active.
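To make the one-third tolerance mentioned above concrete, here is a minimal sketch that assumes the standard Byzantine fault tolerance bound n ≥ 3f + 1, where n is the number of nodes in a subnet and f is the number of misbehaving nodes. The function name and the subnet sizes are examples chosen for illustration, not values taken from this guide.

```python
def max_faulty_nodes(subnet_size):
    """Largest f that satisfies subnet_size >= 3 * f + 1 (assumed bound)."""
    if subnet_size < 1:
        raise ValueError("subnet_size must be at least 1")
    return (subnet_size - 1) // 3


if __name__ == "__main__":
    # Example subnet sizes, chosen only to illustrate the arithmetic.
    for size in (4, 13, 40):
        print(size, max_faulty_nodes(size))
```

Under this assumption, every unhealthy node uses up part of a small safety margin, which is one reason to keep each node online and healthy.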

Subnet recovery

In case subnet recovery is needed, we may have to reach out to you for assistance. Please make sure you closely follow activities in the Matrix Channel, and enable notifications on new messages -- especially direct mentions.

Peer-support and bug reports / resolution: Node Provider Matrix Channel

Node Providers are encouraged to join the dedicated Node Provider Matrix channel. This platform can be used to discuss maintenance-related queries, share insights, report issues, and search for earlier resolutions of operational issues.

Communication Guidelines on the Matrix Channel

As a Node Provider, ensure your notifications are enabled to receive new messages promptly. Your input or intervention might be crucial, especially in urgent situations.

It is recommended to add the node provider name to your alias (handle) on the communication platform, to facilitate communication and enable others to quickly and easily mention you.

Tools and Resources

Several node providers have generously shared tools to facilitate monitoring node health. These tools can provide notifications in case of node issues.

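Aviate Labs Node Monitor

  • Turnkey Solution: Receive email alerts for unhealthy nodes.
  • Link: AviateLabs Node Monitor (https://www.aviatelabs.co/node-monitor)

DIY Node Monitoring

  • GitHub Repository: Run your own node monitoring system.
  • Link: Aviate Labs GitHub (https://github.com/aviate-labs/node-monitor)

Prometheus Exporter for Node Status

  • GitHub Repository: A tool for exporting node status to a Prometheus-compatible format.
  • Link: IC Node Status Prometheus Exporter (https://github.com/virtualhive/ic-node-status-prometheus-exporter)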

Additional Notes

  • Screenshots: Include screenshots of the node status from the public dashboard for reference and troubleshooting.

In case you observe issues, follow: Unhealthy Nodes and Node Provider Troubleshooting