Node Provider Networking Guide

From Internet Computer Wiki

Latest revision as of 15:37, 23 February 2024

This guide is designed to provide an overview of the networking requirements and guide Node Providers through setting up their servers into a rack with functioning networking.

Configuring networks is not trivial. You should be familiar with IP networking, network equipment and network cabling.

Resources to learn about networking:

  • CCNA Study Materials
  • Kevin Wallace YouTube Training Videos

DFINITY does not provide support for network configuration.

If you hire technical assistance, keep decentralization and security in mind. Use a local technician you personally know and carefully monitor their work.

Requirements

To join your servers to the Internet Computer (IC) you will need:

  • 10G Network equipment
    • SFP+ or Ethernet
    • Switch(es)
    • Cabling
    • Quantity determined by number of nodes deployed
  • Rackspace in a data center
  • Internet connection
    • Bandwidth
      • ~300Mbps per node
      • Ingress/egress ratio is currently 1:1. We expect egress (serving responses to client queries) to increase faster than ingress in the future.
      • This should guide how many servers to deploy and the appropriate ISP connection speed
      • For example, a 1Gbps connection will support up to 3 IC nodes (3 × 300Mbps = 900Mbps).
    • One IPv6 /64 subnet - each node gets multiple IPv6 addresses
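    • Two IPv4 addresses per data center - All Node Providers are requested to deploy two nodes with IPv4 for every data center they operate in. Node Providers should deploy IPv4 to the first two nodes in their first rack.
      • Additionally, one domain name for each node configured with an IPv4 address. See Node Provider Domain Name Guide for details.
    • All IP addresses are assigned statically and automatically by IC-OS
      • This is configured in the IC-OS Installation Runbook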
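The bandwidth and IPv4 rules above reduce to simple arithmetic. The sketch below is a hypothetical planning helper: the names `DataCenterPlan`, `max_nodes_for_uplink` and so on are illustrative and not part of any DFINITY tool. It assumes that the ~300Mbps figure applies in each direction because of the current 1:1 ingress/egress ratio.

```python
from dataclasses import dataclass

# Planning figures from the requirements list above. They are guidance
# that may change over time, not hard protocol limits.
PER_NODE_MBPS = 300     # ~300 Mbps per node
IPV4_NODES_PER_DC = 2   # two IPv4-enabled nodes per data center


@dataclass
class DataCenterPlan:
    name: str
    nodes: int
    uplink_mbps: int  # ISP connection speed in each direction


def max_nodes_for_uplink(uplink_mbps: int, per_node_mbps: int = PER_NODE_MBPS) -> int:
    """Largest node count whose combined bandwidth fits within the uplink."""
    return uplink_mbps // per_node_mbps


def required_uplink_mbps(nodes: int, per_node_mbps: int = PER_NODE_MBPS) -> int:
    """Bandwidth needed per direction, assuming the current 1:1 ingress/egress ratio."""
    return nodes * per_node_mbps


def plan_problems(dc: DataCenterPlan) -> list[str]:
    """Human-readable problems with a data center plan (empty list if none)."""
    problems = []
    if dc.nodes > max_nodes_for_uplink(dc.uplink_mbps):
        problems.append(
            f"{dc.name}: {dc.nodes} nodes need ~{required_uplink_mbps(dc.nodes)} Mbps, "
            f"but the uplink is {dc.uplink_mbps} Mbps"
        )
    return problems


def addressing_needs(dc: DataCenterPlan) -> dict[str, int]:
    """IPv4 addresses and domain names to request for one data center."""
    # Assumption: a data center with a single node can only host one IPv4 node.
    ipv4_nodes = min(dc.nodes, IPV4_NODES_PER_DC)
    # One domain name is needed for each IPv4-configured node.
    return {"ipv4_addresses": ipv4_nodes, "domain_names": ipv4_nodes}
```

`max_nodes_for_uplink(1000)` returns 3, matching the 1Gbps example above.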

Network Cabling

When racking and stacking your servers, ensure that at least one 10G network port on each server is connected to the 10G switch. SFP+ and Ethernet are supported.

[Image: Supermicro 1124US-TNRP 1U server, rear view diagram]

For example, on a Supermicro 1U server, the 10G ports are grouped together as shown above. Port layouts differ between vendors, so check your server documentation.

Connect the 10G switch to the ISP endpoint - this could be the Top Of Rack (TOR) switch or other box.
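Before installing IC-OS, you can confirm the cabling from a Linux live environment booted on the server. This assumes such a shell is available; IC-OS itself is not covered here. Linux reports the negotiated link speed in Mb/s in `/sys/class/net/<iface>/speed`. The function names below are hypothetical, and this is a minimal sketch rather than a vetted tool.

```python
from pathlib import Path
from typing import List, Optional

SYSFS_NET = Path("/sys/class/net")
TEN_GIG_MBPS = 10_000


def link_speed_mbps(iface: str, sysfs_root: Path = SYSFS_NET) -> Optional[int]:
    """Negotiated link speed in Mb/s, or None if there is no link or it is unknown."""
    try:
        value = int((sysfs_root / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        # Many drivers fail the read (EINVAL) when the cable is unplugged,
        # and virtual interfaces such as "lo" may not report a speed at all.
        return None
    return value if value > 0 else None  # some drivers report -1 without link


def ten_gig_links(sysfs_root: Path = SYSFS_NET) -> List[str]:
    """Names of interfaces currently linked at 10 Gb/s or faster."""
    linked = []
    for entry in sorted(sysfs_root.iterdir()):
        speed = link_speed_mbps(entry.name, sysfs_root)
        if speed is not None and speed >= TEN_GIG_MBPS:
            linked.append(entry.name)
    return linked
```

An empty result from `ten_gig_links()` suggests that no 10G port has link. In that case, check the cable, the transceiver and the switch port.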

Network Configuration

Node machines require:

  • The ability to acquire a public static IPv6 address on a /64 subnet
  • An IPv6 gateway to communicate with other nodes on the broad internet
  • Unfiltered internet access

Two nodes per data center require:

  • The ability to acquire a public static IPv4 address
  • An IPv4 gateway to communicate with other nodes on the broad internet
  • Unfiltered internet access

Note: IPv4 should be deployed to the first two nodes in the first rack.
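Typos in addresses are a common cause of nodes that never come online. The sketch below uses Python's standard `ipaddress` module to sanity-check the values before you enter them in the IC-OS configuration. The function names are hypothetical. The gateway rule for IPv6 (inside the /64 or link-local) is an assumption about typical ISP setups, not an IC-OS requirement.

```python
import ipaddress
from typing import List


def ipv6_problems(prefix: str, gateway: str) -> List[str]:
    """Sanity-check an IPv6 /64 prefix and its gateway."""
    problems = []
    # strict=True raises ValueError if host bits are set (e.g. "...::1/64").
    net = ipaddress.IPv6Network(prefix, strict=True)
    if net.prefixlen != 64:
        problems.append(f"{prefix} is a /{net.prefixlen}, expected a /64")
    if not net.is_global:
        problems.append(f"{prefix} is not a public (globally routable) prefix")
    gw = ipaddress.IPv6Address(gateway)
    # Assumption: the ISP gateway is either inside the /64 or link-local.
    if gw not in net and not gw.is_link_local:
        problems.append(f"gateway {gateway} is neither inside {prefix} nor link-local")
    return problems


def ipv4_problems(address: str, gateway: str) -> List[str]:
    """Check one IPv4-enabled node; `address` is in CIDR form, e.g. 'a.b.c.d/29'."""
    problems = []
    iface = ipaddress.IPv4Interface(address)
    if not iface.ip.is_global:
        problems.append(f"{iface.ip} is not a public IPv4 address")
    gw = ipaddress.IPv4Address(gateway)
    if gw not in iface.network:
        problems.append(f"gateway {gateway} is outside {iface.network}")
    elif gw == iface.ip:
        problems.append("gateway and node address must differ")
    return problems
```

For example, `ipv6_problems("2001:db8:1:2::/64", "fe80::1")` flags the documentation prefix 2001:db8::/32 as non-public. This catches placeholder values copied from examples.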


There are many ways to configure the network, and some details depend on the ISP and data center. Here are some Example Network Configuration Scenarios.

See the Node Provider Networking Troubleshooting Guide for help.

BMC Setup Recommendations

What’s a BMC?

The Baseboard Management Controller (BMC) grants control of the underlying server hardware.

BMCs have notoriously poor security. Vendors may name their implementation differently (Dell: iDRAC, HPE: iLO, etc.).

Recommendations

Change the password

BMCs usually ship with a well-known default password. Log in via crash cart, KVM or the web interface and change it to something strong.
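If your password manager cannot generate the password for you, Python's standard `secrets` module is suitable for producing a strong one. This is a minimal sketch. The 16-character default, the 12-character floor and the letters-and-digits alphabet are assumptions, not vendor requirements. Some BMCs restrict length or special characters.

```python
import secrets
import string

# Letters and digits only: some BMC firmware restricts special characters or
# password length, so check your vendor's documented limits first.
ALPHABET = string.ascii_letters + string.digits


def generate_bmc_password(length: int = 16) -> str:
    """Random password from a cryptographically secure source (roughly 95 bits at 16 chars)."""
    if length < 12:
        raise ValueError("use at least 12 characters")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # Common complexity rule: at least one lowercase, uppercase and digit.
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)):
            return candidate
```

Store the result in a password manager rather than in the rack or in a shared document.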

No broad internet access

It is highly recommended that you do not expose your BMC to the broad internet. This is a safety precaution against attackers.

Options:

  • Don’t connect the BMC to the internet.
    • Maintenance or node recovery will require physical access in this case.
    • BMC tasks must then be done via SSH on the host (unreliable with many mainboard vendors) or via crash cart.
  • Connect the BMC to a separate dumb switch, which connects to a Rack Mounted Unit (RMU).
  • Connect the BMC to a managed switch and place it on a separate VLAN.

This can get complicated. It’s outside the scope of this document to explain how to do this.
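A quick self-check can catch obvious exposure, although it does not replace the isolation options above. The hypothetical helpers below flag a BMC configured with a globally routable address. They also list which common BMC TCP ports (SSH, HTTP, HTTPS) accept connections from wherever the script runs. Run `reachable_tcp_ports` from a host outside the data center network, for example a cloud VM (an assumption about your setup), against the public addresses you control. An empty result is consistent with isolation but does not prove it. Only probe equipment you own.

```python
import ipaddress
import socket
from typing import Iterable, List

# TCP ports a BMC commonly serves: SSH, HTTP and HTTPS web UI.
# IPMI-over-LAN (RMCP/RMCP+) uses UDP 623, which a TCP probe cannot test.
DEFAULT_BMC_TCP_PORTS = (22, 80, 443)


def bmc_address_warnings(bmc_ip: str) -> List[str]:
    """Flag BMC addresses that are globally routable."""
    addr = ipaddress.ip_address(bmc_ip)
    if addr.is_global:
        return [f"{addr} is a public address; the BMC may be reachable from the internet"]
    return []


def reachable_tcp_ports(host: str, ports: Iterable[int] = DEFAULT_BMC_TCP_PORTS,
                        timeout: float = 2.0) -> List[int]:
    """Ports on `host` that accept a TCP connection from where this runs."""
    reachable = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:  # refused, timed out or unreachable
            pass
    return reachable
```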

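Resources:

  • StackExchange - Best practice for accessing management port of firewall
  • Supermicro Guidance
  • Unicom Guidance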

Network monitoring

SNMP-based Network Monitoring:

  • Device Compatibility: Make sure your network devices support SNMP and have their SNMP agents enabled. Choose the SNMP version that matches your security needs, as versions differ in security and functionality.
  • Secure Configuration: Implement SNMPv3 to enhance security through authentication and encryption, protecting against unauthorized access and data interception.
  • Monitoring Points: Select the network parameters most critical to performance (such as bandwidth utilization, CPU usage and memory usage) and set up SNMP polling for them.
  • Thresholds and Alerts: Predefine alerts that fire when monitored parameters exceed their limits, so you can identify issues proactively and take corrective action.
  • Data Retention: Establish data retention policies for storing SNMP data for trend and capacity analysis.
  • Regular Review: Regularly review SNMP monitoring configurations and thresholds to keep them up to date and aligned with the changing network environment.
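Bandwidth utilization is usually derived from two readings of an interface's octet counters rather than read directly. This is a minimal sketch. It assumes you poll the IF-MIB 64-bit counters (ifHCInOctets, ifHCOutOctets) and ifHighSpeed with the SNMP tool of your choice; the polling itself is not shown. The 70%/90% alert levels are illustrative only.

```python
from dataclasses import dataclass

# IF-MIB objects to poll with your SNMP tool (the polling itself is not shown):
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"    # ifHCInOctets (Counter64)
IF_HC_OUT_OCTETS = "1.3.6.1.2.1.31.1.1.1.10"  # ifHCOutOctets (Counter64)
IF_HIGH_SPEED = "1.3.6.1.2.1.31.1.1.1.15"     # ifHighSpeed (Mb/s)

COUNTER64_MODULO = 2 ** 64


@dataclass
class Sample:
    timestamp_s: float  # when the counter was read
    octets: int         # raw ifHCInOctets or ifHCOutOctets value


def counter_delta(previous: int, current: int, modulo: int = COUNTER64_MODULO) -> int:
    """Octets transferred between two readings, tolerating one counter wrap.

    A device reboot also resets the counter and yields a bogus delta,
    so discard the first interval after a reboot (e.g. by watching sysUpTime).
    """
    return (current - previous) % modulo


def utilization_percent(older: Sample, newer: Sample, speed_mbps: int) -> float:
    """Average link utilization between two samples, in percent."""
    interval = newer.timestamp_s - older.timestamp_s
    if interval <= 0 or speed_mbps <= 0:
        raise ValueError("need increasing timestamps and a positive link speed")
    bits = counter_delta(older.octets, newer.octets) * 8
    return 100.0 * bits / (interval * speed_mbps * 1_000_000)


def alert_level(percent: float, warning: float = 70.0, critical: float = 90.0) -> str:
    """Map a utilization value to an alert level (example thresholds only)."""
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"
```

For example, 1,250,000,000 octets over 10 seconds on a 10,000 Mb/s link is 10% utilization.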

gNMI/gRPC-based Network Monitoring:

  • Protocol Familiarity: Get familiar with the gNMI data models for your network devices and with how gNMI uses gRPC (a Remote Procedure Call framework) for network management.
  • Device Support: Verify that your network devices support gNMI, which is more common on modern networking equipment that supports programmability.
  • Authentication and Encryption: Implement TLS for gRPC security to protect communication between the monitoring system and devices.
  • Model Definitions: Make sure you either have access to, or create, data models for the devices you're monitoring. These models define the structure and hierarchy of the data accessible through gNMI.
  • Data Subscription: gNMI supports real-time updates through subscriptions. Subscribe to the relevant data points to receive continuous updates without frequent polling.
  • Streaming Mode: Use gNMI's streaming mode for efficient real-time data transfer.
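A streaming subscription is expressed as a gNMI SubscribeRequest. The sketch below builds a plain dict whose keys follow the field names in the gNMI protocol definition: `subscribe`, `subscription`, `path.elem`, `mode`, `sample_interval` (in nanoseconds) and `encoding`. The OpenConfig-style interface path is only an example, and whether your device accepts `JSON_IETF` encoding is an assumption. Sending the request needs a gNMI client, which is not shown. With generated protobuf stubs, for instance, `google.protobuf.json_format.ParseDict` can populate a message from a dict like this.

```python
import re
from typing import Dict, List

_KEY = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def split_path(path: str) -> List[str]:
    """Split a gNMI path string on '/', ignoring '/' inside [key=value] brackets."""
    parts, current, in_key = [], [], False
    for ch in path.strip("/"):
        if ch == "[":
            in_key = True
        elif ch == "]":
            in_key = False
        if ch == "/" and not in_key:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_path(path: str) -> Dict:
    """Turn '/interfaces/interface[name=Ethernet1/1]/state' into Path.elem form."""
    elems = []
    for part in split_path(path):
        elem = {"name": part.split("[", 1)[0]}
        keys = dict(_KEY.findall(part))
        if keys:
            elem["key"] = keys
        elems.append(elem)
    return {"elem": elems}


def stream_subscribe_request(paths: List[str], sample_interval_s: float = 10,
                             encoding: str = "JSON_IETF") -> Dict:
    """SubscribeRequest-shaped dict: STREAM mode, SAMPLE each path periodically."""
    return {
        "subscribe": {
            "mode": "STREAM",
            "encoding": encoding,
            "subscription": [
                {
                    "path": parse_path(p),
                    "mode": "SAMPLE",
                    # gNMI expresses sample_interval in nanoseconds.
                    "sample_interval": int(sample_interval_s * 1_000_000_000),
                }
                for p in paths
            ],
        }
    }
```

The bracket-aware split matters because interface names such as `Ethernet1/1` contain slashes.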

Server monitoring

SNMP-based Server Hardware Monitoring:

  • Determine SNMP Compatibility: Before configuring SNMP monitoring, make sure you enable SNMP agents on your servers. Also, verify the compatibility of the SNMP version with your monitoring system.
  • Secure Configuration: Implement SNMPv3 to enhance security through authentication and encryption, protecting against unauthorized access and data interception.
  • Monitoring Parameters: Monitor CPU utilization to keep performance optimal and spot bottlenecks, memory usage to prevent resource exhaustion, network interface traffic to find bandwidth bottlenecks, and server temperatures and hardware health indicators to detect hardware issues.
  • SNMP Polling: Set up regular SNMP polling to collect data on critical parameters.
  • Thresholds and Alerts: Define appropriate thresholds for each monitored parameter. These thresholds determine when alerts are triggered.
  • Data Retention and Trend Analysis: Retain historical SNMP data for trend analysis, capacity planning and identifying performance problems.
  • Regular Review: Regularly review SNMP monitoring configurations and thresholds to make sure they remain appropriate for the environment and the monitoring stays accurate.
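The thresholds-and-alerts step for server hardware can be sketched as follows. This sketch raises an alert only when several consecutive samples breach a limit. That requirement, an assumption and not part of the guide, avoids alerting on momentary spikes. The metric names and limits are illustrative. Take real temperature limits from your hardware vendor.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    consecutive: int = 3  # samples that must breach before alerting


# Illustrative values only; tune them for your hardware and vendor limits.
THRESHOLDS: Dict[str, Threshold] = {
    "cpu_percent": Threshold(warning=85, critical=95),
    "memory_percent": Threshold(warning=85, critical=95),
    "cpu_temp_c": Threshold(warning=75, critical=85),
}


def evaluate(metric: str, history: Sequence[float]) -> str:
    """Return 'ok', 'warning' or 'critical' for the most recent samples of a metric."""
    t = THRESHOLDS[metric]
    recent = list(history)[-t.consecutive:]
    if len(recent) < t.consecutive:
        return "ok"  # not enough data yet to judge a sustained breach
    if all(v >= t.critical for v in recent):
        return "critical"
    if all(v >= t.warning for v in recent):
        return "warning"
    return "ok"
```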

What NOT to do

Don’t use external firewalls, packet filters, rate limiters

Don’t block or interfere with any traffic to the node machines. This can disrupt node machine functionality. When new versions of the node software are deployed, additional ports are occasionally opened for incoming (and outgoing) connections, so a filter that works today may break nodes later.

What about network security?

IC-OS manages its own software firewalls and rate limiters strictly and is designed with security as a primary principle.

Don't configure the switch to use LACP bonding

This feature is on the roadmap for investigation, but IC nodes do not currently support LACP bonding. Configuring it on the switch may cause problems with nodes.

How DFINITY manages its servers

See the DFINITY reference data center runbook.

Final Checklist

  • Did you deploy a 10G switch?
  • Is at least one 10G port on each server plugged into the 10G switch?
  • Do you have one IPv6 /64 prefix allocated from your ISP?
  • Do you have two IPv4 addresses allocated for each data center you plan to operate in?
  • Do you have one domain name for each node you plan to configure with an IPv4 address?
  • Does each node have ~300Mbps bandwidth?
  • Is your BMC inaccessible from the broad internet?

References

  • Gen2 Network Requirements - more detailed, possibly out of date.