Difference between revisions of "Manual Node Recovery Guide"

From Internet Computer Wiki
m
(Update TUI screenshots)
 
(10 intermediate revisions by the same user not shown)
Line 1: Line 1:
🚧🚧🚧 UNDER CONSTRUCTION 🚧🚧🚧
+
This runbook describes what steps node providers need to take during a manual node recovery.
  
'''⚠️⚠️⚠️ Disclaimer:''' Do not attempt this unless you are certain it is appropriate. Manual node recovery should only occur in the event of a critical IC failure. In such cases, the process will be coordinated by reputable IC community leaders and discussed publicly with node providers.
+
=== Security warning ===
 +
⚠️⚠️⚠️ Don’t get tricked into compromising your nodes. Only complete a manual node recovery if all of the following conditions are met:
  
and it is being actively discussed or recommended in trusted community forums.
+
* A subnet recovery is announced on the Internet Computer Status Page
 +
* The DFINITY team reached out on the dedicated Matrix channel #ic-node-providers-incident-response:matrix.org.
 +
** Only the DFINITY team is able to send messages on this channel. In case of an incident, permissions are adapted so that everyone can send messages.  
  
===1. Receive recovery version and short hash===
+
=== Prerequisites ===
A recovery coordinator will notify all subnet Node Providers of the recovery image <code>version</code> and associated short (6-character) <code>hash</code> that Node Providers will use to apply the upgrade
 
  
===2. Complete Recovery===
+
* The recovery coordinator should have communicated with you the following:
Note that the recovery can be completed from the node's remote BMC console view or from the physical console
+
** The recovery input parameters (used in the "Input recovery parameters" step):
 +
*** The <code>VERSION</code>: the 40-character commit ID of the recovery-GuestOS update image
 +
***The <code>RECOVERY-HASH-PREFIX</code>: the 6-character hash-prefix of the recovery artifacts
 +
** The recovery full-hashes (used in the "Confirm calculated full-hashes" step)
 +
*** The <code>VERSION-HASH</code>: 64-character hash of the recovery-GuestOS update image
 +
*** The <code>RECOVERY-HASH</code>: 64-character hash of the recovery artifacts
 +
** The node(s): which specific nodes managed by the NP/NO are part of the target subnet.
  
#Reboot the machine
+
*Obtain console access to all nodes you run that are part of the target subnet.
#During reboot, at the grub menu, press 'e' to edit the boot parameters (you must press 'e' before the 15-second timeout)
+
**Note that the recovery can be completed from a physical console '''or from the node's remote BMC virtual console view.'''
#:''If you miss the timeout, don't worry. Just reboot the machine and try again.''
 
#:[[File:Host grub boot menu.png|580px|screenshot]]
 
#From the GRUB edit mode screen, add the boot parameters: <code>recovery=1 version=ABC hash=XYZ</code> ''(replace ABC and XYZ with the version and hash provided by the recovery coordinator. Note that the hash should only be six-characters long.)''
 
#:[[File:Grub boot edit 1.png|480px|screenshot]]
 
#:→→→
 
#:[[File:Grub boot menu 2.png|480px|screenshot]]
 
#:🚧 Do not follow the screenshot's version and hash values! Use the <code>version</code> and <code>hash</code> values provided by the recovery coordinator! 🚧
 
#Press ctrl-X or F10 to boot
 
  
===3. Wait for recovery confirmation===
+
== Recovery Steps==
...
+
For each node to recover, you should perform the following process.
 +
 
 +
===Obtain console access===
 +
Again, note that the recovery can be completed from a physical console '''or from the node's remote BMC virtual console view.'''
 +
 
 +
[[File:Manual recovery .png|680px|screenshot]]
 +
 
 +
You should see the <code>limited-console></code> prompt. Type <code>help</code> to see the full list of limited-console commands.
 +
 
 +
===Initiate manual recovery TUI===
 +
 
 +
Type <code>manual-recovery</code> to initiate the manual recovery TUI.
 +
 
 +
[[File:Manual recovery 1.png|680px|screenshot]]
 +
 
 +
You should then be taken to the manual recovery text-user-interface:
 +
 
 +
[[File:New-manual-recovery-console-1.png|680px|screenshot]]
 +
 
 +
If you fail to enter the Manual Recovery TUI, see the [[Manual Node Recovery Guide#.E2.9A.A0.EF.B8.8F_Manual_Recovery_Fallback_.E2.9A.A0.EF.B8.8F|Manual Recovery Fallback]].
 +
 
 +
===Input recovery parameters===
 +
 
 +
[[File:Recovery-TUI-1.png|680px|screenshot]]
 +
 
 +
 
 +
Input the <code>VERSION</code> and <code>RECOVERY-HASH-PREFIX</code> provided by the recovery coordinator.
 +
 
 +
Please take great care to type in the characters precisely. If a single character is wrong, the recovery will not succeed and you will have to restart.
 +
 
 +
Note: certain BMCs offer a Virtual Clipboard within the Console Controls to paste text to the console, which you may find useful.
 +
 
 +
[[File:Recovery-TUI-2.png|680px|screenshot]]
 +
 
 +
===Confirm calculated full-hashes===
 +
 
 +
[[File:New-recovery-console-3.png|680px|screenshot]]
 +
 
 +
 
 +
⚠️⚠️⚠️ The Manual Recovery TUI will then calculate and display the <code>VERSION-HASH</code> and <code>RECOVERY-HASH</code> full-hashes from the downloaded artifacts. Please verify that these calculated full-hashes '''exactly''' match those provided by the recovery coordinator.
 +
 
 +
===Monitor the recovery process===
 +
 
 +
Once you have initiated the recovery process, monitor the recovery logs.
 +
 
 +
[[File:Manual recovery 6.png|680px|screenshot]]
 +
 
 +
After ~30 seconds, you should see the log:
 +
 
 +
========================================================================
 +
SUCCESS: Recovery completed successfully!
 +
========================================================================
 +
 
 +
[[File:Manual recovery 7.png|680px|screenshot]]
 +
 
 +
The system should then output standard boot logs:
 +
 
 +
[[File:Manual recovery 8.png|680px|screenshot]]
 +
 
 +
Congratulations! You have successfully completed the manual node recovery!
 +
 
 +
Note that if you reach the following recovery error page, this is almost certainly a result of incorrectly inputting the recovery parameters:
 +
 
 +
[[File:Recovery-TUI-3.png|680px|screenshot]]
 +
 
 +
If you reach the recovery error page, do not worry. Hit <code>enter</code>, return to the “Initiate manual recovery TUI” step, and try again. If errors persist, contact the recovery coordinator in the Matrix channel and post a screenshot of your recovery error page.
 +
 
 +
===Notify of a successful recovery===
 +
 
 +
Send a message in the Matrix channel confirming that you have successfully completed recovery.
 +
 
 +
==Wait for recovery confirmation==
 +
 
 +
Once the recovery process on your node is complete and you have notified the Matrix channel, continue to monitor the Matrix channel until the subnet is back online and the recovery is complete.
 +
 
 +
==⚠️ Manual Recovery Fallback ⚠️==
 +
A manual recovery fallback is available if the manual recovery TUI fails to render.
 +
 
 +
=== Enter the rbash-console ===
 +
Type <code>rbash-console</code> to enter the rbash-console.
 +
[[File:Rbash-console.png|680px|screenshot]]
 +
 
 +
===Run the manual recovery fallback command:===
 +
<code>sudo /opt/ic/bin/guestos-recovery-launcher.sh mode=run version=<VERSION> recovery-hash-prefix=<RECOVERY-HASH-PREFIX></code>
 +
 
 +
You may then resume the recovery instructions from the [[Manual Node Recovery Guide#Monitor_the_recovery_process|Monitor the recovery process]] step.

Latest revision as of 00:51, 19 December 2025

This runbook describes what steps node providers need to take during a manual node recovery.

Security warning

⚠️⚠️⚠️ Don’t get tricked into compromising your nodes. Only complete a manual node recovery if all of the following conditions are met:

  • A subnet recovery is announced on the Internet Computer Status Page.
  • The DFINITY team reached out on the dedicated Matrix channel #ic-node-providers-incident-response:matrix.org.
    • Only the DFINITY team is able to send messages on this channel. In case of an incident, permissions are adapted so that everyone can send messages.

Prerequisites

  • The recovery coordinator should have communicated with you the following:
    • The recovery input parameters (used in the "Input recovery parameters" step):
      • The VERSION: the 40-character commit ID of the recovery-GuestOS update image
      • The RECOVERY-HASH-PREFIX: the 6-character hash-prefix of the recovery artifacts
    • The recovery full-hashes (used in the "Confirm calculated full-hashes" step)
      • The VERSION-HASH: 64-character hash of the recovery-GuestOS update image
      • The RECOVERY-HASH: 64-character hash of the recovery artifacts
    • The node(s): which specific nodes managed by the node provider or node operator (NP/NO) are part of the target subnet.
  • Obtain console access to all nodes you run that are part of the target subnet.
    • Note that the recovery can be completed from a physical console or from the node's remote BMC virtual console view.
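
Before going to the console, it can help to check that the values you received have the lengths listed above. The following is a minimal sketch, not part of the official tooling: the function name `check_parameters` is hypothetical, and the check that every value is hexadecimal is an assumption (the guide only states the lengths).

```python
import re

# Lengths as stated in the prerequisites list.
EXPECTED_LENGTHS = {
    "VERSION": 40,
    "RECOVERY-HASH-PREFIX": 6,
    "VERSION-HASH": 64,
    "RECOVERY-HASH": 64,
}


def check_parameters(params):
    """Return a list of problems; an empty list means lengths and characters look right."""
    problems = []
    for name, length in EXPECTED_LENGTHS.items():
        value = params.get(name, "").strip()
        if len(value) != length:
            problems.append(f"{name}: expected {length} characters, got {len(value)}")
        elif not re.fullmatch(r"[0-9a-fA-F]+", value):
            # Assumption: the values are hexadecimal strings.
            problems.append(f"{name}: contains non-hexadecimal characters")
    return problems
```

A passing check only confirms the shape of the values. It does not confirm that they are the values the recovery coordinator actually announced.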

Recovery Steps

For each node to recover, you should perform the following process.

Obtain console access

Again, note that the recovery can be completed from a physical console or from the node's remote BMC virtual console view.

screenshot

You should see the limited-console> prompt. Type help to see the full list of limited-console commands.

Initiate manual recovery TUI

Type manual-recovery to initiate the manual recovery TUI.

screenshot

You should then be taken to the manual recovery text-user-interface:

screenshot

If you fail to enter the Manual Recovery TUI, see the Manual Recovery Fallback.

Input recovery parameters

screenshot


Input the VERSION and RECOVERY-HASH-PREFIX provided by the recovery coordinator.

Please take great care to type in the characters precisely. If a single character is wrong, the recovery will not succeed and you will have to restart.

Note: certain BMCs offer a Virtual Clipboard within the Console Controls to paste text to the console, which you may find useful.

screenshot

Confirm calculated full-hashes

screenshot


⚠️⚠️⚠️ The Manual Recovery TUI will then calculate and display the VERSION-HASH and RECOVERY-HASH full-hashes from the downloaded artifacts. Please verify that these calculated full-hashes exactly match those provided by the recovery coordinator.
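
Comparing two 64-character strings by eye is error-prone. If you transcribe the displayed hashes onto a separate workstation, a comparison along the following lines may help. This is a sketch under assumptions: the helper names are hypothetical, and ignoring letter case and whitespace is only safe if the hashes are hexadecimal, which the guide does not state explicitly.

```python
def normalize(value):
    """Drop whitespace and lowercase (assumes a hexadecimal hash)."""
    return "".join(value.split()).lower()


def first_difference(displayed, expected):
    """Return the 0-based index of the first differing character, or None if identical."""
    a, b = normalize(displayed), normalize(expected)
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a strict prefix of the other.
    return min(len(a), len(b))


def grouped(value, size=8):
    """Split a hash into space-separated groups to make visual comparison easier."""
    v = normalize(value)
    return " ".join(v[i:i + size] for i in range(0, len(v), size))
```

A result of `None` from `first_difference` means the two strings match. Any integer result means they differ and the recovery should not be confirmed.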

Monitor the recovery process

Once you have initiated the recovery process, monitor the recovery logs.

screenshot

After ~30 seconds, you should see the log:

========================================================================
SUCCESS: Recovery completed successfully!
========================================================================
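
If your BMC or serial console lets you save the console output as text (an assumption; not every setup does), the success line shown above can be checked for mechanically. The function name below is hypothetical, and the marker string is copied from the log excerpt above.

```python
SUCCESS_MARKER = "SUCCESS: Recovery completed successfully!"


def recovery_succeeded(console_text):
    """Return True if any line of the captured console output contains the success marker."""
    return any(SUCCESS_MARKER in line for line in console_text.splitlines())
```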

screenshot

The system should then output standard boot logs:

screenshot

Congratulations! You have successfully completed the manual node recovery!

Note that if you reach the following recovery error page, this is almost certainly a result of incorrectly inputting the recovery parameters:

screenshot

If you reach the recovery error page, do not worry. Hit enter, return to the “Initiate manual recovery TUI” step, and try again. If errors persist, contact the recovery coordinator in the Matrix channel and post a screenshot of your recovery error page.

Notify of a successful recovery

Send a message in the Matrix channel confirming that you have successfully completed recovery.

Wait for recovery confirmation

Once the recovery process on your node is complete and you have notified the Matrix channel, continue to monitor the Matrix channel until the subnet is back online and the recovery is complete.

⚠️ Manual Recovery Fallback ⚠️

A manual recovery fallback is available if the manual recovery TUI fails to render.

Enter the rbash-console

Type rbash-console to enter the rbash-console.

screenshot

Run the manual recovery fallback command:

sudo /opt/ic/bin/guestos-recovery-launcher.sh mode=run version=<VERSION> recovery-hash-prefix=<RECOVERY-HASH-PREFIX>

You may then resume the recovery instructions from the Monitor the recovery process step.
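
Because the fallback command has to be typed by hand, it can be useful to assemble the exact string beforehand and read it off, or paste it through a BMC virtual clipboard. The sketch below only builds the command text shown above. The function name is hypothetical, and the length checks mirror the lengths given in the prerequisites.

```python
def fallback_command(version, recovery_hash_prefix):
    """Build the fallback command line from the coordinator-provided parameters."""
    version = version.strip()
    prefix = recovery_hash_prefix.strip()
    if len(version) != 40:
        raise ValueError(f"VERSION must be 40 characters, got {len(version)}")
    if len(prefix) != 6:
        raise ValueError(f"RECOVERY-HASH-PREFIX must be 6 characters, got {len(prefix)}")
    return (
        "sudo /opt/ic/bin/guestos-recovery-launcher.sh mode=run "
        f"version={version} recovery-hash-prefix={prefix}"
    )
```

The function does not run anything. It returns the string so that you can compare it character by character with what you type at the rbash-console.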