Difference between revisions of "Node Provider Maintenance Guide"

Revision as of 21:10, 21 November 2023

Submitting NNS proposals

Here are some NNS proposals you may have to submit after onboarding nodes.

Adjusting the node allowance in a Data Center

To adjust the node allowance for an existing node operator record, use the propose-to-update-node-operator-config subcommand of the ic-admin tool. If you just want to add more nodes to an existing DC, you should typically not add a new node operator record.

Here's a step-by-step guide on how to do this:

1. Gather Necessary Information: Ensure you have the following details:

NODE_PROVIDER_ID: The principal ID of the node provider under which the node operator record is registered.
NODE_OPERATOR_ID: The principal ID of the node operator whose allowance you want to change.
NEURON_ID: The ID of the neuron that will propose this change.
CURRENTLY_REMAINING_NODE_ALLOWANCE: The number of nodes that the node operator is allowed to add to the network without submitting a proposal.
NEW_NODE_ALLOWANCE: The new number of nodes that the node operator is allowed to add.

The first three items should be in your records, and should be the same principals (IDs) used to onboard nodes in the given DC. The fourth item can be obtained from the registry with ic-admin:

$ ic-admin --nns-url https://ic0.app get-node-operator $NODE_OPERATOR_ID

For example:

$ ic-admin --nns-url https://ic0.app get-node-operator yl63e-n74ks-fnefm-einyj-kwqot-7nkim-g5rq4-ctn3h-3ee6h-24fe4-uqe
Fetching the most recent value for key: node_operator_record_yl63e-n74ks-fnefm-einyj-kwqot-7nkim-g5rq4-ctn3h-3ee6h-24fe4-uqe
Most recent version is 35791. Value:
NodeOperator { node_operator_principal_id: yl63e-n74ks-fnefm-einyj-kwqot-7nkim-g5rq4-ctn3h-3ee6h-24fe4-uqe, node_allowance: 0, node_provider_principal_id: niw4y-easue-l3qvz-sozsi-tfkvb-cxcx6-pzslg-5dqld-ooudp-hsuui-xae, dc_id: "mu1", rewardable_nodes: {"type0": 0, "type1": 28}, ipv6: None }

In the above example, the CURRENTLY_REMAINING_NODE_ALLOWANCE is 0. So if you want to add 5 more nodes with the same node operator (i.e. in the same DC), you should use NEW_NODE_ALLOWANCE=5. However, if the CURRENTLY_REMAINING_NODE_ALLOWANCE were 2, you would only need 3 more nodes on top of your currently remaining allowance (2+3=5), so you should use NEW_NODE_ALLOWANCE=3 in the proposal.
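Reading the remaining allowance out of the record by eye is error-prone, so the lookup can be scripted. The following is a minimal sketch that assumes the output format shown in the example above; remaining_node_allowance is a hypothetical helper name, not part of ic-admin.

```shell
# Print the node_allowance value found in get-node-operator output read from stdin.
# remaining_node_allowance is a hypothetical helper name, not part of ic-admin.
remaining_node_allowance() {
    sed -n 's/.*node_allowance: \([0-9][0-9]*\),.*/\1/p'
}

# Sample record, shortened from the example output above.
sample='NodeOperator { node_operator_principal_id: yl63e-..., node_allowance: 0, dc_id: "mu1" }'
printf '%s\n' "$sample" | remaining_node_allowance   # prints 0
```

With ic-admin installed, the same filter can be fed from the live query: ic-admin --nns-url https://ic0.app get-node-operator "$NODE_OPERATOR_ID" | remaining_node_allowance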

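2. Prepare the Command: Construct the ic-admin command using the gathered information. Here's an example template, in which NODE_PROVIDER_PRINCIPAL and NODE_OPERATOR_PRINCIPAL hold the NODE_PROVIDER_ID and NODE_OPERATOR_ID values from step 1: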

$ NEURON_ID=XXXXXXXXXXXXXXXXXXXX
$ NODE_PROVIDER_PRINCIPAL=xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxx
$ NODE_OPERATOR_PRINCIPAL=xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx-xxx
$ NODE_PROVIDER_NAME="My Company"
$ NEW_NODE_ALLOWANCE=5
$ DC_ID=xx
$ FORUM_POST_URL=https://forum.dfinity.org/...

$ ./ic-admin \
        --nns-url https://ic0.app \
        -s ~/.config/dfx/identity/node-provider-hotkey/identity.pem \
    propose-to-update-node-operator-config \
        --node-provider-id $NODE_PROVIDER_PRINCIPAL \
        --node-operator-id $NODE_OPERATOR_PRINCIPAL \
        --summary "Node provider '$NODE_PROVIDER_NAME' is adjusting the node allowance to $NEW_NODE_ALLOWANCE for nodes in the $DC_ID data center. Link to the forum post: $FORUM_POST_URL" \
        --proposer $NEURON_ID \
        --node-allowance $NEW_NODE_ALLOWANCE

Replace all placeholder variables above with the actual values before submitting the proposal.
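A quick guard can catch a template value that was left unreplaced. The sketch below is a heuristic only: has_placeholder is a hypothetical helper, and the patterns it looks for (empty values, xx, runs of x or X, a trailing ...) are simply the placeholder shapes used in the template above.

```shell
# Return success (0) if any given value still looks like a template placeholder.
# has_placeholder is a hypothetical helper; the patterns mirror the template above.
has_placeholder() {
    for value in "$@"; do
        case "$value" in
            ""|xx|*xxxxx*|*XXXXX*|*...) return 0 ;;
        esac
    done
    return 1
}
```

For example, running has_placeholder "$NEURON_ID" "$NODE_PROVIDER_PRINCIPAL" "$NODE_OPERATOR_PRINCIPAL" "$DC_ID" "$FORUM_POST_URL" && echo "placeholders remain" before building the command prints a warning while any of these variables is unset or still holds a template value.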

3. Dry Run (strongly recommended): To preview the proposal without actually submitting it, you can add the --dry-run flag to the above command. This is useful for checking the proposal payload and ensuring everything is correct before the actual submission.
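One way to make the dry run hard to skip is to wrap the submission in a small helper that always previews first. This is a sketch under assumptions: propose_with_preview is a hypothetical name, and it assumes --dry-run is accepted when appended at the end of the full command (check the --help output if in doubt).

```shell
# Run the given proposal command with --dry-run first, then ask before
# submitting it for real. propose_with_preview is a hypothetical helper.
propose_with_preview() {
    echo "Dry run:" >&2
    # Preview the proposal; stop if the dry run itself fails.
    "$@" --dry-run || return 1
    printf 'Submit this proposal for real? [y/N] ' >&2
    read -r answer
    if [ "$answer" = "y" ]; then
        # Same arguments as the preview, without --dry-run.
        "$@"
    else
        echo "Aborted; nothing was submitted." >&2
        return 1
    fi
}
```

To use it, prefix the full command from step 2 with the helper name, for example propose_with_preview ./ic-admin --nns-url https://ic0.app followed by the remaining arguments unchanged.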

4. Execute the Command: Once you are sure about the command and the details, execute it in your terminal. This will submit a proposal to update the node allowance in the node operator's configuration.

5. Monitoring and Voting: After you submit the proposal, it goes through a voting process in the governance system. Monitor the proposal to see whether it is accepted or rejected.

6. Verification (Post-Approval): If the proposal is approved, you may want to verify that the node allowance has been updated as expected. To do so, query the node operator's record with get-node-operator as described above.
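The verification step can be reduced to a pass/fail check. The sketch below assumes the record format shown in the earlier get-node-operator example; allowance_is is a hypothetical helper, and the expected value is whatever you expect the record to show after the proposal executes.

```shell
# Succeed if the get-node-operator output on stdin shows the expected node_allowance.
# allowance_is is a hypothetical helper; $1 is the value you expect to see.
allowance_is() {
    # The trailing comma prevents 5 from matching 50.
    grep -q "node_allowance: $1,"
}
```

For example: ic-admin --nns-url https://ic0.app get-node-operator "$NODE_OPERATOR_ID" | allowance_is 5 && echo "allowance updated"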

Note that the exact command and options will vary based on your specific configuration and requirements.

To see all available options, you can run:

$ ic-admin --nns-url https://ic0.app propose-to-update-node-operator-config --help

Monitoring

You are expected to regularly monitor the health of your nodes. Node health status is available on the public dashboard. Example: node status.

Also, check out the Tools and Resources section below for some useful tools that can help you with monitoring and alerting activities.

Permitted tools

For security and confidentiality reasons, other tools are not allowed to run on the same machine in parallel with the replica. If you need to troubleshoot an issue, it is recommended to either boot the machine from a USB drive that has a live Linux distribution (e.g. Ubuntu) or to debug from an auxiliary machine in the same rack over which you have complete control, as described in Node Provider Troubleshooting.

Handling degraded nodes

Please take a look at Node Provider Troubleshooting.

Handling dead nodes

Please take a look at Node Provider Troubleshooting.

Node rewards based on useful work

The Internet Computer protocol can tolerate up to 1/3 of nodes misbehaving. There is an ongoing activity to automatically issue node rewards based on useful work, and also to automatically reduce node remuneration in case nodes are misbehaving. This will provide a financial incentive for honest behavior. Please follow the forum and the Matrix channel to stay informed about these activities.

In the meantime, the recommendation is to prepare for this by making sure that your nodes are online and healthy at all times. Otherwise, you risk penalties even before the automatic node rewards based on useful work become active.

Subnet recovery

In case subnet recovery is needed, we may have to reach out to you for assistance. Please make sure you closely follow activities in the Matrix Channel, and enable notifications on new messages -- especially direct mentions.

Peer-support and bug reports / resolution: Node Provider Matrix Channel

Node Providers are encouraged to join the dedicated Node Provider Matrix channel. This platform can be used for discussing maintenance-related queries, sharing insights, reporting issues, and searching for previous resolutions of operational problems.

Communication Guidelines on the Matrix Channel

As a Node Provider, ensure your notifications are enabled to receive new messages promptly. Your input or intervention might be crucial, especially in urgent situations.

It is recommended to add the node provider name to your alias (handle) on the communication platform, to facilitate communication and enable others to quickly and easily mention you.

Tools and Resources

Several node providers have generously shared tools to facilitate monitoring node health. These tools can provide notifications in case of node issues.

Aviate Labs Node Monitor

Turnkey Solution: Receive email alerts for unhealthy nodes.
Link: AviateLabs Node Monitor

DIY Node Monitoring

GitHub Repository: Run your own node monitoring system.
Link: Aviate Labs GitHub

Prometheus Exporter for Node Status

GitHub Repository: A tool for exporting node status to a Prometheus-compatible format.
Link: IC Node Status Prometheus Exporter

Additional Notes

Screenshots: Include screenshots of the node status from the public dashboard for reference and troubleshooting.

In case you observe issues, follow: Unhealthy Nodes and Node Provider Troubleshooting.

@@ Line 56: / Line 56: @@
   $ ic-admin --nns-url <nowiki>https://ic0.app</nowiki> propose-to-update-node-operator-config --help
-== Joining the Node Provider Matrix Channel ==
+== Monitoring ==
+You are expected to regularly monitor the health of your nodes. Node health status is available on the public dashboard. Example: [https://dashboard.internetcomputer.org/node/b5d56-nm7ae-p24jg-t25gp-5bmhb-rjbnt-3dmoq-goqby-5tf6c-ygnnu-aqe node status].
-Node Providers are encouraged to join the dedicated [[Node Provider Matrix channel]]. This platform is essential for discussing maintenance-related queries and sharing insights about node operations.
+Also, check out the Tools and Resources section below for some useful tools that can help you with monitoring and alerting activities.
-=== Communication Guidelines ===
+== Permitted tools ==
+For security and confidentiality reasons, other tools are not allowed to run on the same machine in parallel with the replica. In case you need to troubleshoot an issue, it is recommended to either boot the machine from a USB drive that has a live Linux distribution (e.g. [https://ubuntu.com/tutorials/try-ubuntu-before-you-install#3-boot-from-usb-flash-drive Ubuntu]) or to debug from an auxiliary machine in the same rack on which you have complete control, as described in [[Node Provider Troubleshooting]]
-* '''Active Participation''': Ensure your notifications are enabled to receive new messages promptly. Your input or intervention might be crucial, especially in urgent situations.
+== Handling degraded nodes ==
-* '''Regular Operations''': Regularly monitor the health of your node. Node health status is available on the public dashboard, which. Example: [https://dashboard.internetcomputer.org/node/b5d56-nm7ae-p24jg-t25gp-5bmhb-rjbnt-3dmoq-goqby-5tf6c-ygnnu-aqe node status].
+Please take a look at [[Node Provider Troubleshooting]]
+== Handling dead nodes ==
+Please take a look at [[Node Provider Troubleshooting]]
+== Node rewards based on useful work ==
+The Internet Computer protocol can tolerate up to 1/3 of nodes misbehaving. There is an ongoing activity to automatically issue node rewards based on useful work, and also to automatically reduce node remuneration in case nodes are misbehaving. This will provide a financial incentive for honest behavior. Please follow the forum and the Matrix channel to stay informed about these activities.
+In the meantime, the recommendation is to prepare for this by making sure that your nodes are online and healthy at all times, otherwise you risk penalties even before the automatic node rewards based on useful work become active.
+== Subnet recovery ==
+In case subnet recovery is needed, we may have to reach out to you for assistance. Please make sure you closely follow activities in the Matrix Channel, and enable notifications on new messages -- especially direct mentions.
+== Peer-support and bug reports / resolution: Node Provider Matrix Channel ==
+Node Providers are encouraged to join the dedicated [[Node Provider Matrix channel]]. This platform can be used for discussing maintenance-related queries, sharing insights, reporting issues, and searching for previous resolutions of operational problems.
+'''Communication Guidelines on the Matrix Channel'''
+As a Node Provider, ensure your notifications are enabled to receive new messages promptly. Your input or intervention might be crucial, especially in urgent situations.
+It is recommended to add the node provider name to your alias (handle) on the communication platform, to facilitate communication and enable others to quickly and easily mention you.
 == Tools and Resources ==