Troubleshooting Networking Issues

From Internet Computer Wiki
Revision as of 16:27, 26 June 2024 by Andrew.battat (talk | contribs)

Troubleshooting outage of all nodes/entire DC

(This section under construction)

If a whole data center goes down at once, the cause is usually the data center's internet connection rather than the individual nodes.

  • Verify connectivity on the data center's 10G circuit.

Troubleshooting Switches

Here are the first steps to try when all of your nodes are down.

  1. Verify if the ISP gateway is pingable:
    • Check the connectivity to the Internet Service Provider (ISP) gateway.
    • Ping the ISP gateway IP address to determine if it responds.
    • If there is no response, contact the ISP to inquire about any connectivity issues or outages in your area.
  2. Check if the /64 IPv6 subnet default gateway is pingable from outside/inside:
    • Ping the default gateway IP address of the IPv6 subnet from both inside and outside the network.
    • No response from either side points to a potential issue with the default gateway itself. A response from only one side suggests a routing or filtering problem between the two vantage points.
    • Verify the configuration of the default gateway and make sure it is properly set up.
  3. Check for recent port flaps, link failures, or other activity that might have caused the outage:
    • Examine logs or monitoring systems for any signs of port flapping, link failures, or abnormal network activities.
    • Investigate recent changes, such as software updates, configuration modifications, or physical changes.
    • Identify any potential factors that might have caused the network disruption.
  4. Verify the cabling and port status on the switch:
    • Check the physical connections between the affected nodes and the switch.
    • Ensure that the cables are securely plugged into the correct ports on both ends.
    • Inspect the cables for any damage or loose connections.
    • Test the connectivity by using different network cables or ports.
  5. Re-seat the cable/breakout/SFP/QSFP toward the affected machines:
    • Disconnect and reconnect the network cables at both the switch and the affected nodes.
    • If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connections.
    • Ensure all connections are properly seated and secured.
  6. Try to reboot the switch:
    • Save the switch configuration before proceeding.
    • Reboot the switch to clear any transient fault.
    • Follow the vendor's reboot procedure to limit further disruption to the network.
    • Monitor the switch during and after the reboot to check if the issue is resolved.
  7. Check with the switch vendor:
    • If the problem persists or if you are unable to identify the cause, contact the switch vendor's support team.
    • Provide them with detailed information about the issue, including any troubleshooting steps already taken.
    • Follow vendor guidance to further investigate and resolve the problem.
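
Steps 1 and 2 above can be scripted so that the same checks run in the same order each time. The following is a minimal Python sketch, not an official tool. It assumes a Linux host with the iputils `ping` command. The names `ping_once`, `first_unreachable`, and `TARGETS` are hypothetical, and the addresses come from the documentation-only ranges 192.0.2.0/24 and 2001:db8::/32, so replace them with your own ISP gateway and IPv6 default gateway.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Placeholder addresses (documentation ranges); substitute your real gateways.
TARGETS: List[Tuple[str, str]] = [
    ("ISP gateway", "192.0.2.1"),
    ("IPv6 default gateway", "2001:db8::1"),
]


def ping_once(address: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if `address` answers at least one echo request.

    Assumes iputils `ping`: -c sets the packet count, -W the reply timeout
    in seconds. Recent iputils versions accept IPv6 addresses directly;
    older systems may need `ping6` instead.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # iputils ping exits with 0 only when at least one reply was received.
    return result.returncode == 0


def first_unreachable(
    targets: List[Tuple[str, str]],
    ping: Callable[[str], bool] = ping_once,
) -> Optional[str]:
    """Check targets in order and return the label of the first one that
    does not respond, or None if every target responds."""
    for label, address in targets:
        if not ping(address):
            return label
    return None
```

Calling `first_unreachable(TARGETS)` returns the label of the first gateway that does not answer, or `None` if both respond. Run it once from inside the network and once from an outside host to cover both vantage points in step 2. The `ping` parameter exists so the ordering logic can be exercised without network access.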

Troubleshooting Packet Loss Issues

If you experience intermittent packet loss toward nodes:

  1. Perform a ping toward ISP gateway:
    • Ping the IP address of the ISP gateway to check for packet loss and response times.
    • Analyze the results to determine if there is any intermittent packet loss.
    • If the results show packet loss, contact the ISP to inquire about any connectivity issues in your area.
  2. Perform a ping toward the server default gateway (/64 IPv6 subnet):
    • Ping the IP address of the server's default gateway within the IPv6 subnet.
    • Monitor the packet loss and response times to identify any irregularities.
  3. Check if another server with the same IPv6 subnet has the same issue:
    • Run the same ping tests from another server in the same IPv6 subnet.
    • Compare the results to determine if the intermittent packet loss is specific to a particular server or affects multiple nodes.
  4. Verify the cabling and port status on the switch:
    • Inspect the physical connections between the affected nodes and the switch.
    • Ensure that the network cables are securely plugged into the correct ports on both ends.
    • Check for any signs of damage or loose connections in the cables.
    • Test the connectivity by using different network cables or ports.
  5. Check for recent port flaps/link failures inside the switch logs:
    • Access the switch logs or monitoring systems to identify any recent port flapping or link failures.
    • Investigate the logs for any abnormalities or patterns related to the intermittent packet loss.
    • Analyze any recorded events or error messages that might provide insights into the issue.
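
The comparison in steps 1 to 3 can be made less subjective by extracting the loss percentage from each ping run and comparing it across servers. The sketch below is illustrative only. It assumes the summary line format printed by iputils `ping` (for example `5 packets transmitted, 4 received, 20% packet loss, time 4005ms`). The function names, the server names in the usage note, and the 1% threshold are hypothetical choices, not values from this guide.

```python
import re
from typing import Dict, List, Optional

# Matches the loss figure in an iputils-style ping summary line.
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet loss percentage found in `ping_output`,
    or None if no summary line is present."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def lossy_servers(loss_by_server: Dict[str, float],
                  threshold_pct: float = 1.0) -> List[str]:
    """Return the sorted names of servers whose loss meets the threshold."""
    return sorted(
        name for name, loss in loss_by_server.items()
        if loss >= threshold_pct
    )


def scope_of_loss(loss_by_server: Dict[str, float],
                  threshold_pct: float = 1.0) -> str:
    """Classify the measurements as 'none', 'single', or 'multiple'
    according to how many servers show loss at or above the threshold."""
    affected = lossy_servers(loss_by_server, threshold_pct)
    if not affected:
        return "none"
    return "single" if len(affected) == 1 else "multiple"
```

For example, feeding the ping output from two servers through `parse_packet_loss` and passing `{"node-a": 20.0, "node-b": 0.0}` to `scope_of_loss` yields `"single"`, which points at that server's cabling, port, or NIC. A result of `"multiple"` points toward the switch, the default gateway, or the ISP.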

Troubleshooting Interface Issues

If you are seeing interface drops, errors, or flaps:

  1. Verify the switch port statistics:
    • Access the switch interface statistics and examine the counters for drops, errors, or flapping.
    • Pay close attention to any significant increases in discards, CRC errors, or port flaps.
  2. Localize the affected ports (look for a sharp increase in discards, CRC errors, or port flaps):
    • Identify the specific ports experiencing drops, errors, or flapping by comparing the statistics across different interfaces.
    • Look for ports with a significant increase in discards, CRC errors, or frequent port flapping.
  3. Replace the cabling toward the affected ports:
    • Replace the network cables connected to the ports experiencing drops, errors, or flapping.
    • Use high-quality, properly shielded cables to ensure a stable and reliable connection.
  4. Verify the NIC on the server:
    • Check the network interface card (NIC) on the affected server.
    • Ensure that the NIC is functioning correctly.
    • Consider updating the NIC firmware or replacing it if necessary.
  5. Check with the switch/server vendor:
    • If the issue persists or if you are unable to determine the cause, contact the switch or server vendor's support team.
    • Provide the vendor with detailed information about the issue, including the troubleshooting steps taken and any relevant error messages or statistics.
    • Seek the vendor's assistance in diagnosing and resolving the interface-related problems.
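
Steps 1 and 2 amount to comparing counter values over time. One way to do this, sketched below, is to take two snapshots of the per-port counters a few minutes apart and list the ports whose counters rose sharply. This is an assumption-laden illustration: how the counters are collected depends on the switch vendor and is not shown. The port names, the counter names (`crc_errors`, `discards`), and the threshold of 100 are hypothetical.

```python
from typing import Dict, List, Tuple

# Snapshot shape: port name -> counter name -> cumulative value.
Counters = Dict[str, Dict[str, int]]


def suspect_ports(before: Counters, after: Counters,
                  min_increase: int = 100) -> List[Tuple[str, str, int]]:
    """Return (port, counter, increase) for every counter that grew by at
    least `min_increase` between the two snapshots, largest increase first.

    Counters missing from `before` are treated as starting at 0. A counter
    that went down (for example after a counter reset) yields a negative
    difference and is therefore not reported.
    """
    suspects: List[Tuple[str, str, int]] = []
    for port, counters in after.items():
        previous = before.get(port, {})
        for name, value in counters.items():
            delta = value - previous.get(name, 0)
            if delta >= min_increase:
                suspects.append((port, name, delta))
    return sorted(suspects, key=lambda item: item[2], reverse=True)
```

With a first snapshot of `{"Ethernet1": {"crc_errors": 10}}` and a second of `{"Ethernet1": {"crc_errors": 510}}`, the function reports `Ethernet1` with a `crc_errors` increase of 500, which makes that port the first candidate for the cable replacement and NIC checks in steps 3 and 4.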