Troubleshooting Networking Issues
From Internet Computer Wiki
Revision as of 16:27, 26 June 2024 by Andrew.battat (talk | contribs)
Troubleshooting outage of all nodes/entire DC
(This section is under construction)
If a whole data center goes down at once, the cause is usually the data center's internet connection rather than the individual nodes.
- Verify connectivity on the data center's 10G circuit.
Troubleshooting Switches
Here are the first steps to try when all of your nodes are down.
- Verify if the ISP gateway is pingable:
- Check the connectivity to the Internet Service Provider (ISP) gateway.
- Ping the ISP gateway IP address to determine if it responds.
- If there is no response, contact the ISP to inquire about any connectivity issues or outages in your area.
- Check if the /64 IPv6 subnet default gateway is pingable from outside/inside:
- Ping the default gateway IP address of the IPv6 subnet from both inside and outside the network.
- If the gateway responds from neither inside nor outside, the default gateway itself is the likely problem.
- Verify the configuration of the default gateway and make sure it is properly set up.
- Check for recent port flaps, link failures, or other events that might have caused the outage:
- Examine logs or monitoring systems for any signs of port flapping, link failures, or abnormal network activities.
- Investigate recent changes, such as software updates, configuration modifications, or physical changes.
- Identify any potential factors that might have caused the network disruption.
- Verify the cabling and port status on the switch:
- Check the physical connections between the affected nodes and the switch.
- Ensure that the cables are securely plugged into the correct ports on both ends.
- Inspect the cables for any damage or loose connections.
- Test the connectivity by using different network cables or ports.
- Try re-seating the cable, breakout cable, SFP, or QSFP toward the affected machines:
- Disconnect and reconnect the network cables at both the switch and the affected nodes.
- If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connections.
- Ensure all connections are properly seated and secured.
- Try to reboot the switch:
- Save switch configuration before proceeding.
- Reboot the switch to ensure it is functioning correctly.
- Follow the vendor's reboot procedure to limit disruption to the network.
- Monitor the switch during and after the reboot to check if the issue is resolved.
- Check with the switch vendor:
- If the problem persists or if you are unable to identify the cause, contact the switch vendor's support team.
- Provide them with detailed information about the issue, including any troubleshooting steps already taken.
- Follow vendor guidance to further investigate and resolve the problem.
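The first two checks in the list above can be turned into a small decision helper. The following is a minimal sketch, not an official tool: the `Reachability` structure and `next_step` function are hypothetical names introduced here, and the ping results are assumed to have been collected by hand or by your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Reachability:
    """Ping results collected beforehand (hypothetical structure)."""
    isp_gateway: bool            # ISP gateway answered ping
    ipv6_gateway_inside: bool    # /64 default gateway answered from inside the network
    ipv6_gateway_outside: bool   # /64 default gateway answered from outside the network


def next_step(r: Reachability) -> str:
    """Map ping results to the next check suggested by the steps above."""
    if not r.isp_gateway:
        return "Contact the ISP about connectivity issues or outages."
    if not r.ipv6_gateway_inside and not r.ipv6_gateway_outside:
        return "Verify the IPv6 default gateway configuration."
    # The steps above do not cover a gateway that answers from only one side,
    # so that case is assumed to continue with the switch checks as well.
    return "Check switch logs, cabling, and ports; re-seat, then reboot the switch."


print(next_step(Reachability(isp_gateway=True,
                             ipv6_gateway_inside=False,
                             ipv6_gateway_outside=False)))
```

The helper only orders the checks; it does not replace reading the switch logs or contacting the vendor when the earlier steps do not find the cause.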
Troubleshooting Packet Loss Issues
If you experience intermittent packet loss toward nodes:
- Perform a ping toward the ISP gateway:
- Ping the IP address of the ISP gateway to check for packet loss and response times.
- Analyze the results to determine if there is any intermittent packet loss.
- If you see packet loss or high response times, contact the ISP to inquire about any connectivity issues in your area.
- Perform a ping toward the server default gateway (/64 IPv6 subnet):
- Ping the IP address of the server's default gateway within the IPv6 subnet.
- Monitor the packet loss and response times to identify any irregularities.
- Check if another server with the same IPv6 subnet has the same issue:
- From another server in the same IPv6 subnet, ping the same default gateway.
- Compare the results to determine if the intermittent packet loss is specific to a particular server or affects multiple nodes.
- Verify the cabling and port status on the switch:
- Inspect the physical connections between the affected nodes and the switch.
- Ensure that the network cables are securely plugged into the correct ports on both ends.
- Check for any signs of damage or loose connections in the cables.
- Test the connectivity by using different network cables or ports.
- Check for recent port flaps/link failures inside the switch logs:
- Access the switch logs or monitoring systems to identify any recent port flapping or link failures.
- Investigate the logs for any abnormalities or patterns related to the intermittent packet loss.
- Analyze any recorded events or error messages that might provide insights into the issue.
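Intermittent loss is easier to compare across gateways and servers when the ping summary is reduced to numbers. The sketch below assumes a Linux-style `ping` that accepts `-c` and prints a summary containing "% packet loss" and a min/avg/max line; the function names are hypothetical, and on some systems IPv6 targets may need `ping -6` or `ping6` instead.

```python
import re
import subprocess


def parse_ping_summary(output: str) -> dict:
    """Extract packet loss (%) and average RTT (ms) from ping summary text."""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    # Matches "... min/avg/max/mdev = 0.512/0.640/1.201/0.150 ms" and captures avg.
    rtt = re.search(r"= [\d.]+/([\d.]+)/", output)
    return {
        "loss_percent": float(loss.group(1)) if loss else None,
        "avg_rtt_ms": float(rtt.group(1)) if rtt else None,
    }


def ping(address: str, count: int = 20) -> dict:
    """Run ping against an address and return the parsed summary."""
    result = subprocess.run(
        ["ping", "-c", str(count), address],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(result.stdout)
```

Running `ping()` against the ISP gateway and the server default gateway from two servers in the same subnet gives four results to compare, which helps show whether the loss is specific to one server or shared.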
Troubleshooting Interface Issues
If you are seeing interface drops, errors, or flaps:
- Verify the switch port statistics:
- Access the switch interface statistics and examine the counters for drops, errors, or flapping.
- Pay close attention to any significant increases in discards, CRC errors, or port flaps.
- Try to localize the affected ports (look for a sharp increase in discards, CRC errors, or port flaps):
- Identify the specific ports experiencing drops, errors, or flapping by comparing the statistics across different interfaces.
- Look for ports with a significant increase in discards, CRC errors, or frequent port flapping.
- Try to replace the cabling toward affected ports:
- Replace the network cables connected to the ports experiencing drops, errors, or flapping.
- Use high-quality, properly shielded cables to ensure a stable and reliable connection.
- Verify the NIC adapter on the server:
- Check the network interface card (NIC) on the affected server.
- Ensure that the NIC is functioning correctly.
- Consider updating the NIC firmware or replacing it if necessary.
- Check with the switch/server vendor:
- If the issue persists or if you are unable to determine the cause, contact the switch or server vendor's support team.
- Provide the vendor with detailed information about the issue, including the troubleshooting steps taken and any relevant error messages or statistics.
- Seek the vendor's assistance in diagnosing and resolving the interface-related problems.
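Localizing affected ports comes down to comparing two snapshots of the switch counters and ranking ports by how much they grew. The following is a rough sketch under stated assumptions: the counter names (`discards`, `crc_errors`, `flaps`) and port names are hypothetical, and the snapshots are assumed to have been exported from the switch by whatever means your vendor provides.

```python
def rank_ports_by_error_growth(before, after,
                               counters=("discards", "crc_errors", "flaps")):
    """Return (port, total_increase) pairs, largest increase first.

    `before` and `after` map port name -> {counter name -> cumulative value}.
    """
    growth = []
    for port, now in after.items():
        earlier = before.get(port, {})
        # Ignore negative deltas, which can appear after a counter reset.
        increase = sum(max(now.get(c, 0) - earlier.get(c, 0), 0) for c in counters)
        if increase > 0:
            growth.append((port, increase))
    return sorted(growth, key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken a few minutes apart.
before = {"port1": {"discards": 10, "crc_errors": 0, "flaps": 1},
          "port2": {"discards": 5, "crc_errors": 2, "flaps": 0}}
after = {"port1": {"discards": 10, "crc_errors": 0, "flaps": 1},
         "port2": {"discards": 50, "crc_errors": 7, "flaps": 2}}
print(rank_ports_by_error_growth(before, after))
```

Ports at the top of the ranking are the first candidates for cable replacement and NIC checks.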