Difference between revisions of "Troubleshooting Networking Issues"

Latest revision as of 16:58, 28 June 2024

General Networking Troubleshooting Steps

Inspect network hardware in the rack for any visible signs of malfunction (e.g. red lights) or incorrect setup
Verify the cabling and port status on the switch:
- Check the physical connection of the network cable between the server and the switch.
- Ensure that the cable is securely plugged into the correct port on both ends.
- Look for any signs of damage or loose connections.
- Test the connectivity by trying a different network cable or using the same cable on a different port.
Check for recent port flaps, link failures, or other activity that might have caused the problem:
- Check the logs or monitoring systems for any indications of port flapping or link failures.
- Investigate any recent changes or activities that could have affected the network connection.
- Consider any software updates, configuration changes, or physical alterations made recently.
Re-seat the cable, breakout cable, or SFP/QSFP module on the link to the affected machine:
- Disconnect and reconnect the network cable at both ends (server and switch).
- If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
- Ensure a secure and proper connection is established.
Check with the switch vendor:
- If the issue persists, contact the switch vendor's support team for further assistance.
- Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
- Follow vendor guidance to troubleshoot and resolve the issue.
  - If your vendor requires a TSR log, see IDRAC access and TSR logs for an example of how to retrieve one from a Dell server.
  - Updating the firmware might also resolve the issue.
Use an auxiliary machine within the same rack with full network access to run diagnostic tools such as ping, traceroute, and nmap
Work with the ISP to troubleshoot and resolve any network routing issues identified during diagnostics
Prepare for future incidents by establishing network redundancy and failover mechanisms
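
The diagnostics step above (ping, traceroute, and nmap from an auxiliary machine) could be scripted roughly as follows. This is a minimal sketch, not a prescribed procedure: it assumes a Linux auxiliary machine with iputils `ping`, `traceroute`, and `nmap` installed, and `build_diagnostics` and `run_diagnostics` are hypothetical helper names introduced only for illustration.

```python
import ipaddress
import shutil
import subprocess


def build_diagnostics(target: str, count: int = 4) -> list[list[str]]:
    """Return ping/traceroute/nmap command lines for an IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(target)  # raises ValueError on a malformed address
    family = ["-6"] if addr.version == 6 else []
    return [
        ["ping", *family, "-c", str(count), target],
        ["traceroute", *family, target],
        ["nmap", *family, "-sn", target],  # -sn: host discovery only, no port scan
    ]


def run_diagnostics(target: str) -> dict[str, str]:
    """Run each tool that is installed and collect its output, keyed by tool name."""
    results = {}
    for cmd in build_diagnostics(target):
        if shutil.which(cmd[0]) is None:
            results[cmd[0]] = "not installed"
            continue
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
            results[cmd[0]] = proc.stdout or proc.stderr
        except subprocess.TimeoutExpired:
            results[cmd[0]] = "timed out"
    return results
```

Calling `run_diagnostics` with the address of the affected node or its gateway returns the raw output of each tool, which can be attached to an ISP or vendor ticket. Building the command lines separately from running them keeps the commands easy to review before anything touches the network.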

Troubleshooting outage of all nodes/entire DC

(This section is under construction.)

If a whole data center goes down at once, the cause is usually the internet connection rather than the individual nodes.

Verify connectivity on the data center's 10G circuit.

Troubleshooting Switches

Here are the first steps to try when all of your nodes are down.

Verify whether the ISP gateway is pingable:
- Check the connectivity to the Internet Service Provider (ISP) gateway.
- Ping the ISP gateway IP address to determine if it responds.
- If there is no response, contact the ISP to inquire about any connectivity issues or outages in your area.
Check whether the default gateway of the /64 IPv6 subnet is pingable from both inside and outside the network:
- Ping the default gateway IP address of the IPv6 subnet from both inside and outside the network.
- If it responds from neither side, the default gateway itself may be at fault.
- Verify the configuration of the default gateway and make sure it is properly set up.
Check for recent port flaps, link failures, or other activity that might have caused the outage:
- Examine logs or monitoring systems for any signs of port flapping, link failures, or abnormal network activities.
- Investigate recent changes, such as software updates, configuration modifications, or physical changes.
- Identify any potential factors that might have caused the network disruption.
Verify the cabling and port status on the switch:
- Check the physical connections between the affected nodes and the switch.
- Ensure that the cables are securely plugged into the correct ports on both ends.
- Inspect the cables for any damage or loose connections.
- Test the connectivity by using different network cables or ports.
Re-seat the cables, breakout cables, or SFP/QSFP modules on the links to the affected machines:
- Disconnect and reconnect the network cables at both the switch and the affected nodes.
- If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connections.
- Ensure all connections are properly seated and secured.
Reboot the switch:
- Save the switch configuration before proceeding.
- Reboot the switch to ensure it is functioning correctly.
- Follow proper procedures to avoid any disruption to the network.
- Monitor the switch during and after the reboot to check if the issue is resolved.
Check with the switch vendor:
- If the problem persists or if you are unable to identify the cause, contact the switch vendor's support team.
- Provide them with detailed information about the issue, including any troubleshooting steps already taken.
- Follow vendor guidance to further investigate and resolve the problem.
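
The first two checks above (ISP gateway, then the IPv6 default gateway from inside and outside) amount to a small decision procedure, sketched below. The sketch assumes Linux iputils `ping`, which exits with status 0 when at least one reply arrives. `is_pingable` and `triage` are hypothetical names, and the returned strings only paraphrase the steps in this section.

```python
import subprocess


def is_pingable(address: str, ipv6: bool = False, count: int = 3) -> bool:
    """Return True if the address answers at least one echo request."""
    cmd = ["ping", "-6" if ipv6 else "-4", "-c", str(count), address]
    try:
        return subprocess.run(cmd, capture_output=True, timeout=30).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False  # ping missing or hung: treat as unreachable


def triage(isp_gateway_ok: bool, gw_inside_ok: bool, gw_outside_ok: bool) -> str:
    """Map the reachability results to the next step described in this section."""
    if not isp_gateway_ok:
        return "contact the ISP about connectivity issues or outages"
    if not gw_inside_ok and not gw_outside_ok:
        return "verify the default gateway configuration"
    return "continue with the switch log, cabling, and re-seat checks"
```

The "outside" result has to come from a machine beyond your own network, so in practice `is_pingable` would be run in two places and the results passed to `triage` by hand.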

Troubleshooting Packet Loss Issues

If you experience intermittent packet loss toward nodes:

Ping the ISP gateway:
- Ping the IP address of the ISP gateway to check for packet loss and response times.
- Analyze the results to determine if there is any intermittent packet loss.
- If the loss appears on this hop, contact the ISP to inquire about any connectivity issues in your area.
Ping the server's default gateway (/64 IPv6 subnet):
- Ping the IP address of the server's default gateway within the IPv6 subnet.
- Monitor the packet loss and response times to identify any irregularities.
Check whether another server in the same IPv6 subnet has the same issue:
- Repeat the ping tests from another server in the same IPv6 subnet.
- Compare the results to determine if the intermittent packet loss is specific to a particular server or affects multiple nodes.
Verify the cabling and port status on the switch:
- Inspect the physical connections between the affected nodes and the switch.
- Ensure that the network cables are securely plugged into the correct ports on both ends.
- Check for any signs of damage or loose connections in the cables.
- Test the connectivity by using different network cables or ports.
Check for recent port flaps/link failures inside the switch logs:
- Access the switch logs or monitoring systems to identify any recent port flapping or link failures.
- Investigate the logs for any abnormalities or patterns related to the intermittent packet loss.
- Analyze any recorded events or error messages that might provide insights into the issue.
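
Comparing loss figures across servers is easier when the ping summaries are parsed rather than read by eye. The sketch below assumes the summary format printed by Linux iputils `ping`. The `SAMPLE` text is an illustrative example, not a real measurement, and `parse_ping_summary` and `loss_scope` are hypothetical helper names.

```python
import re

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")

# Illustrative summary in iputils format (made-up values).
SAMPLE = """\
--- 2001:db8::1 ping statistics ---
20 packets transmitted, 18 received, 10% packet loss, time 19028ms
rtt min/avg/max/mdev = 0.412/0.538/1.207/0.171 ms
"""


def parse_ping_summary(output: str) -> dict:
    """Extract the loss percentage and average RTT from ping's summary lines."""
    loss = LOSS_RE.search(output)
    rtt = RTT_RE.search(output)
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "avg_rtt_ms": float(rtt.group(2)) if rtt else None,
    }


def loss_scope(loss_by_server: dict[str, float]) -> str:
    """Say whether loss is confined to one server or seen on several."""
    affected = [name for name, pct in loss_by_server.items() if pct > 0]
    if not affected:
        return "no loss observed"
    if len(affected) == 1:
        return f"specific to {affected[0]}"
    return "affects multiple nodes"
```

Loss confined to one server points toward that server's cable, port, or NIC, while loss seen from several servers in the same subnet points toward the switch, gateway, or ISP.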

Troubleshooting Interface Issues

If you are seeing interface drops, errors, or flaps:

Verify the switch port statistics:
- Access the switch interface statistics and examine the counters for drops, errors, or flapping.
- Pay close attention to any significant increases in discards, CRC errors, or port flaps.
Localize the affected ports (look for a sharp increase in discards, CRC errors, or port flaps):
- Identify the specific ports experiencing drops, errors, or flapping by comparing the statistics across different interfaces.
- Look for ports with a significant increase in discards, CRC errors, or frequent port flapping.
Replace the cabling on the affected ports:
- Replace the network cables connected to the ports experiencing drops, errors, or flapping.
- Use high-quality, properly shielded cables to ensure a stable and reliable connection.
Verify the NIC adapter on the server:
- Check the network interface card (NIC) on the affected server.
- Ensure that the NIC is functioning correctly.
- Consider updating the NIC firmware or replacing it if necessary.
Check with the switch/server vendor:
- If the issue persists or if you are unable to determine the cause, contact the switch or server vendor's support team.
- Provide the vendor with detailed information about the issue, including the troubleshooting steps taken and any relevant error messages or statistics.
- Seek the vendor's assistance in diagnosing and resolving the interface-related problems.
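
Localizing affected ports by their counter increase can be sketched as a comparison of two snapshots taken some minutes apart. How the counters are collected (switch CLI, SNMP, or a monitoring system) is left open here. The counter names, the threshold of 100, and the function name `suspect_ports` are assumptions made only for illustration.

```python
# A snapshot maps port name -> counter name -> cumulative value.
Counters = dict[str, dict[str, int]]


def suspect_ports(before: Counters, after: Counters, threshold: int = 100) -> list[tuple[str, int]]:
    """Return (port, total increase) pairs at or above the threshold, worst first."""
    suspects = []
    for port, now in after.items():
        earlier = before.get(port, {})
        # Sum the growth of every counter (discards, CRC errors, flaps, ...).
        increase = sum(value - earlier.get(name, 0) for name, value in now.items())
        if increase >= threshold:
            suspects.append((port, increase))
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

Ranking by increase rather than by absolute value matters because the counters are cumulative: a port with a large but static error count from an old incident is less interesting than one whose count is climbing now.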

@@ Line 1: / Line 1: @@
+==General Networking Troubleshooting Steps==
+#Inspect network hardware in the rack for any visible signs of malfunction (e.g. red lights) or incorrect setup
+#Verify the cabling and port status on the switch:
+#*Check the physical connection of the network cable between the server and the switch.
+#*Ensure that the cable is securely plugged into the correct port on both ends.
+#*Look for any signs of damage or loose connections.
+#*Test the connectivity by trying a different network cable or using the same cable on a different port.
+#Check for recent port flaps, link failures, or other activity that might have caused the problem:
+#*Check the logs or monitoring systems for any indications of port flapping or link failures.
+#*Investigate any recent changes or activities that could have affected the network connection.
+#*Consider any software updates, configuration changes, or physical alterations made recently.
+#Re-seat the cable, breakout cable, or SFP/QSFP module on the link to the affected machine:
+#*Disconnect and reconnect the network cable at both ends (server and switch).
+#*If applicable, re-seat any breakout cables, SFP modules, or QSFP modules used in the connection.
+#*Ensure a secure and proper connection is established.
+#Check with the switch vendor:
+#*If the issue persists, contact the switch vendor's support team for further assistance.
+#*Provide them with detailed information about the problem and any troubleshooting steps you have already taken.
+#*Follow vendor guidance to troubleshoot and resolve the issue.
+#**If your vendor requires a TSR log, see [[IDRAC access and TSR logs]] for an example of how to retrieve one from a Dell server.
+#**[[Updating_Firmware|Updating the firmware]] might also resolve the issue.
+#Use an auxiliary machine within the same rack with full network access to run diagnostic tools such as <code>ping</code>, <code>traceroute</code>, and <code>nmap</code>
+#Work with the ISP to troubleshoot and resolve any network routing issues identified during diagnostics
+#Prepare for future incidents by establishing network redundancy and failover mechanisms
+==Troubleshooting outage of all nodes/entire DC ==
+(This section is under construction.)
+If a whole data center goes down at once, the cause is usually the internet connection rather than the individual nodes.
+*Verify connectivity on the data center's 10G circuit.
 == Troubleshooting Switches ==
 Here are the first steps to try when all of your nodes are down.