Overview and What You Will Learn
A Swiggy backend service suddenly cannot reach the PostgreSQL database. Is it a DNS failure? A routing issue? A firewall rule? A service that stopped listening? Each of these has a different diagnostic tool and a different fix. Reaching for the right tool immediately is the difference between a 2-minute diagnosis and a 20-minute guessing session.
By the end of this lab you will:
- Configure and inspect network interfaces with `ip`
- Find listening services and active connections with `ss`
- Diagnose DNS failures with `dig` and `nslookup`
- Test HTTP endpoints and measure latency with `curl`
- Check connectivity along the network path with `ping`, `traceroute`, and `mtr`
- Capture and analyse actual network packets with `tcpdump`
- Use `nc` for quick port testing and simple TCP debugging
Why This Matters in Production
Network issues are among the most stressful production incidents because the symptoms (service unreachable, timeouts, intermittent failures) have many possible causes at different layers. An engineer who knows which tool to use at each layer diagnoses the problem in minutes. An engineer who only knows ping spends an hour escalating unnecessarily.
Core Principles
Network diagnostic tool selection by problem type:
```text
+------------------------------------------+
|       Problem: Service unreachable       |
+------------------------------------------+
        |                      |
  Is DNS working?       Is the port open?
        |                      |
        v                      v
+-------------+         +--------------+
| dig / host  |         | ss / nc      |
| nslookup    |         | curl -v      |
+-------------+         +--------------+
        |                      |
  DNS ok, still          Port closed:
  unreachable?           firewall or
        |                service down
        v
+------------------------------------------+
| Is routing correct? Is the host alive?   |
| ping / traceroute / mtr                  |
+------------------------------------------+
        |
  Packets arriving but wrong response?
        v
+------------------------------------------+
| Capture actual packets: tcpdump          |
+------------------------------------------+
```
Detailed Step-by-Step Practical Lab
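As a warm-up, the decision tree from Core Principles can be sketched as a small shell function. This is only an illustration: the function name `suggest_next_tool` and its "yes"/"no" arguments are assumptions made for this example, and a failed ping does not prove a host is down, because ICMP is often filtered.

```shell
# Hypothetical helper that mirrors the decision tree.
# Arguments are "yes" or "no":
#   $1 did the hostname resolve?
#   $2 did the TCP port accept a connection?
#   $3 did ping get a reply?
suggest_next_tool() {
  dns_ok=$1; port_ok=$2; ping_ok=$3
  if [ "$dns_ok" != yes ]; then
    echo "dig: check name resolution first"
  elif [ "$port_ok" = yes ]; then
    echo "curl -v or tcpdump: the port answers, inspect the response"
  elif [ "$ping_ok" = yes ]; then
    echo "ss on the target and firewall rules: host is up but the port is closed"
  else
    echo "traceroute or mtr: check the network path"
  fi
}
```

For example, `suggest_next_tool yes no yes` points at the service or a firewall rule rather than at DNS or routing.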
Milestone 1 — Configure and inspect network interfaces with ip
```shell
# Show all interfaces and their IP addresses
ip addr show
# or shorter:
ip a

# Output for eth0:
# 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
#     link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
#     inet 10.0.1.50/24 brd 10.0.1.255 scope global eth0

# Show only a specific interface
ip addr show eth0

# Show routing table
ip route show
# default via 10.0.1.1 dev eth0
# 10.0.1.0/24 dev eth0 proto kernel scope link src 10.0.1.50

# Show default gateway
ip route | grep default
# default via 10.0.1.1 dev eth0 metric 100

# Add a temporary IP address (lost on reboot)
sudo ip addr add 10.0.1.100/24 dev eth0

# Remove an IP address
sudo ip addr del 10.0.1.100/24 dev eth0

# Bring interface up or down
sudo ip link set eth0 down
sudo ip link set eth0 up

# Add a static route
sudo ip route add 192.168.10.0/24 via 10.0.1.1

# Show network interface statistics (errors, dropped packets)
ip -s link show eth0
# Look for: errors and dropped -- non-zero means hardware/driver issue
```
Milestone 2 — Inspect connections and ports with ss
```shell
# Show all listening TCP and UDP ports with process names
# (run with sudo to see processes owned by other users)
ss -tulpn
# Netid  State   Recv-Q  Send-Q  Local Address:Port  Process
# tcp    LISTEN  0       128     0.0.0.0:22          sshd
# tcp    LISTEN  0       511     0.0.0.0:443         nginx
# tcp    LISTEN  0       128     127.0.0.1:5432      postgres

# Show only listening TCP
ss -tlpn

# Show all established connections
ss -t state established

# Show connections to or from a specific port
ss -t '( dport = :5432 or sport = :5432 )'

# Count connections by state
# (-a includes TIME-WAIT sockets, NR>1 skips the header line)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
#   2450 ESTAB
#    342 CLOSE-WAIT
#     12 TIME-WAIT

# Show connections whose peer address is in a given network
# (dst matches the remote address, src matches the local one)
ss -t dst 10.0.0.0/8

# Check if a specific service is listening
ss -tlpn | grep :4000
# If empty: nothing is listening on port 4000

# Watch connections in real time
watch -n1 'ss -t state established | wc -l'
```
Milestone 3 — Diagnose DNS with dig
```shell
# Basic DNS lookup (A record)
dig api.razorpay.com
# ;; ANSWER SECTION:
# api.razorpay.com.  300  IN  A  52.66.100.200

# Get just the IP (short output)
dig +short api.razorpay.com
# 52.66.100.200

# Query specific record types
dig MX gmail.com            # mail server records
dig NS razorpay.com         # nameserver records
dig CNAME www.razorpay.com  # canonical name (alias)
dig TXT razorpay.com        # text records (SPF, DKIM)
dig AAAA api.razorpay.com   # IPv6 address

# Query a specific DNS server (bypass local resolver)
dig @8.8.8.8 api.razorpay.com
dig @1.1.1.1 api.razorpay.com

# Trace the full DNS resolution path
dig +trace api.razorpay.com

# Reverse DNS lookup (IP to hostname)
dig -x 52.66.100.200

# Check /etc/resolv.conf for configured nameservers
cat /etc/resolv.conf
# nameserver 169.254.169.253   (AWS internal DNS)
# search ap-south-1.compute.internal

# Quick DNS check (simpler alternative to dig)
host api.razorpay.com
nslookup api.razorpay.com
```
**Remember:** When a service cannot resolve a hostname, check two things: `cat /etc/resolv.conf` (is a nameserver configured?) and `dig @8.8.8.8 hostname` (does external DNS work?). If `dig @8.8.8.8` works but the service fails, the issue is the local resolver configuration, not the hostname itself.
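The two-step check in the note above can be sketched as a script. This is a minimal sketch under assumptions: `dns_verdict` and `check_dns` are hypothetical names, the name being tested is public (internal-only names will not resolve through 8.8.8.8), and lines starting with `;` are filtered out because `dig` prints its error messages on standard output.

```shell
# Decide where the problem lies from the two answers (empty string = no answer).
dns_verdict() {
  local_answer=$1; public_answer=$2
  if [ -n "$local_answer" ]; then
    echo "local resolver ok"
  elif [ -n "$public_answer" ]; then
    echo "local resolver problem: check /etc/resolv.conf"
  else
    echo "not resolvable by either resolver: check the hostname"
  fi
}

# Query the configured resolver and a public one, then compare.
check_dns() {
  name=$1
  local_answer=$(dig +short +time=2 +tries=1 "$name" | grep -v '^;')
  public_answer=$(dig +short +time=2 +tries=1 @8.8.8.8 "$name" | grep -v '^;')
  dns_verdict "$local_answer" "$public_answer"
}
```

Running `check_dns api.razorpay.com` prints one of the three verdicts. Keeping the decision in `dns_verdict` makes the logic easy to check without network access.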
Milestone 4 — Test HTTP endpoints with curl
```shell
# Basic GET request
curl https://api.razorpay.com/health

# Verbose output showing headers and TLS handshake
curl -v https://api.razorpay.com/health

# Show only response headers (sends a HEAD request)
curl -I https://api.razorpay.com/health
# HTTP/2 200
# content-type: application/json
# x-request-id: abc123

# POST with JSON body
curl -X POST https://api.internal/payment \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer eyJ..." \
  -d '{"amount": 1000, "currency": "INR"}'

# Measure response time breakdown
curl -o /dev/null -s -w "\n\
  DNS:       %{time_namelookup}s\n\
  Connect:   %{time_connect}s\n\
  TLS:       %{time_appconnect}s\n\
  TTFB:      %{time_starttransfer}s\n\
  Total:     %{time_total}s\n\
  HTTP Code: %{http_code}\n" \
  https://api.razorpay.com/health

# Follow redirects
curl -L http://razorpay.com

# Test before DNS propagates (resolve manually)
curl --resolve api.razorpay.com:443:52.66.100.200 https://api.razorpay.com/health

# Test with a specific TLS version
# (--tlsv1.2 sets the minimum, --tls-max 1.2 sets the maximum)
curl --tlsv1.2 --tls-max 1.2 https://api.razorpay.com/health

# Retry on failure (useful in health check scripts)
curl --retry 3 --retry-delay 2 https://api.razorpay.com/health
```
Milestone 5 — Check connectivity with ping, traceroute, and mtr
```shell
# Basic ping (Ctrl+C to stop)
ping 10.0.2.100

# Ping with count
ping -c 5 10.0.2.100

# Ping with interval (0.2 seconds between pings)
ping -i 0.2 10.0.2.100

# Traceroute -- show each hop between you and destination
traceroute 10.0.2.100
#  1  10.0.1.1 (10.0.1.1)  0.456 ms
#  2  10.0.0.1 (10.0.0.1)  1.234 ms
#  3  10.0.2.100 (10.0.2.100)  2.345 ms

# mtr -- live traceroute with statistics (best tool for network path diagnosis)
sudo apt install mtr
mtr 10.0.2.100
# Shows each hop with live packet loss % and latency
# Non-interactive mode for scripting:
mtr --report --report-cycles 10 10.0.2.100

# Quick port connectivity check
nc -zv 10.0.2.100 5432
# Connection to 10.0.2.100 5432 port [tcp/postgresql] succeeded!

# Check multiple ports quickly
for port in 80 443 5432 6379; do
  result=$(nc -zv -w2 10.0.2.100 "$port" 2>&1)
  echo "Port $port: $(echo "$result" | grep -o 'succeeded\|refused\|timed out')"
done
```
Milestone 6 — Capture packets with tcpdump
```shell
# Install if needed
sudo apt install tcpdump

# Capture all traffic on eth0 (Ctrl+C to stop)
sudo tcpdump -i eth0

# Capture without resolving addresses to hostnames (faster, no extra DNS traffic)
sudo tcpdump -i eth0 -n

# Capture only traffic on a specific port
sudo tcpdump -i eth0 port 5432

# Capture traffic to/from a specific host
sudo tcpdump -i eth0 host 10.0.2.100

# Combine filters
sudo tcpdump -i eth0 -n 'host 10.0.2.100 and port 5432'

# Capture and save to file for analysis in Wireshark
sudo tcpdump -i eth0 -w /tmp/capture.pcap

# Read a saved capture
sudo tcpdump -r /tmp/capture.pcap

# Show packet contents (ASCII); -l line-buffers output so grep sees it promptly
sudo tcpdump -i eth0 -l -A port 80 | grep -A 5 'GET\|POST'

# Capture DNS queries to see what hostname resolutions are happening
sudo tcpdump -i eth0 -n port 53

# Count packets per second on a busy interface (strftime is a gawk extension)
sudo tcpdump -i eth0 -l -tnn -q 2>/dev/null | awk '{print strftime("%H:%M:%S"); fflush()}' | uniq -c
```
Production Best Practices and Common Pitfalls
| Problem | Wrong Approach | Correct Tool |
|---|---|---|
| DNS not resolving | ping hostname | dig hostname then dig @8.8.8.8 hostname |
| Port not reachable | Restart service blindly | ss -tulpn to check if listening |
| HTTP returning errors | Check logs only | curl -v to see full request/response |
| Intermittent connectivity | ping once | mtr for sustained path monitoring |
| Unknown traffic source | Guess | tcpdump -i eth0 port X -n |
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| Show interfaces | ip addr show |
| Show routing | ip route show |
| Listening ports | ss -tulpn |
| DNS lookup | dig +short hostname |
| DNS debug | dig @8.8.8.8 hostname |
| HTTP test | curl -v https://host/endpoint |
| HTTP timing | curl -o /dev/null -s -w "%{time_total}" https://host |
| Port check | nc -zv host port |
| Path diagnosis | mtr --report host |
| Packet capture | sudo tcpdump -i eth0 -n port PORT |
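The timing variables that `curl -w` reports are cumulative from the start of the request, so the time spent in each phase comes from subtracting adjacent values. The sketch below shows that arithmetic. The function names `phase_breakdown` and `curl_phases` are assumptions for this example, and for plain HTTP it treats the TLS phase as zero because `time_appconnect` is reported as 0.

```shell
# Convert curl's cumulative timestamps (seconds) into per-phase durations.
# Arguments: time_namelookup time_connect time_appconnect time_starttransfer time_total
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    # time_appconnect is 0 when no TLS handshake took place
    tls_end = (tls > 0) ? tls : conn
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f transfer=%.3f\n",
      dns, conn - dns, tls_end - conn, ttfb - tls_end, total - ttfb
  }'
}

# Hypothetical wrapper: fetch a URL and print the breakdown.
curl_phases() {
  # Word splitting of curl's output into five arguments is intentional here.
  phase_breakdown $(curl -o /dev/null -s -w \
    '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' "$1")
}
```

With cumulative values of 0.010, 0.030, 0.090, 0.150 and 0.200 seconds, the TLS handshake and the server wait each account for 0.060 s, which is more informative than the 0.200 s total alone.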
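**Tip:** `mtr` is often the most useful network diagnostic tool during production incidents. Unlike `traceroute`, which sends three probes per hop by default, `mtr` keeps sending probes and shows live packet loss and latency at each hop. Read the loss column from the destination backwards: loss that first appears at a hop and persists through every later hop points to a problem at or just before that hop. A hop showing 30% loss while the following hops show 0% is usually a router rate-limiting its own ICMP replies, not dropping your traffic.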
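A rough way to scan an `mtr --report` for loss that actually reaches the destination is sketched below. It assumes the usual report layout (hop number, host, then `Loss%` as the third column) and uses a hypothetical function name, `first_persistent_loss_hop`. Loss at a single mid-path hop that disappears at later hops is treated as harmless, on the assumption that it is ICMP rate limiting.

```shell
# Reads `mtr --report` output on stdin and prints the first hop from which
# packet loss continues all the way to the final hop.
first_persistent_loss_hop() {
  awk '
    /^ *[0-9]+\.\|--/ {
      loss = $3
      sub(/%/, "", loss)
      if (loss + 0 > 0) {
        # Remember where the current run of lossy hops began
        if (start == "") { start = $1; host = $2 }
      } else {
        # A clean hop means earlier loss did not affect forwarded traffic
        start = ""; host = ""
      }
    }
    END {
      if (start != "") {
        sub(/\.\|--/, "", start)
        printf "loss persists from hop %s (%s)\n", start, host
      } else {
        print "no loss reaches the destination"
      }
    }
  '
}
```

Typical use would be `mtr --report --report-cycles 10 10.0.2.100 | first_persistent_loss_hop`.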
**Common Mistake:** Using `ping hostname` to diagnose "service unreachable" errors. ICMP (ping) and TCP are handled independently, so a service can be unreachable on TCP port 443 while ping succeeds, for example when a firewall blocks the TCP port but allows ICMP. The reverse also happens: ping can fail while the service works. Always test the service port itself with `nc -zv host port` or `curl -v`, not ping.
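Where `nc` is not installed, a TCP check can be sketched with bash's built-in `/dev/tcp` redirection. The function name `tcp_check` and the 3-second limit are assumptions for this example. A timeout often (not always) indicates a firewall silently dropping packets, while an immediate failure usually means the connection was refused or the host could not be reached.

```shell
# Try to open a TCP connection to host $1 on port $2 and report the outcome.
# Runs bash explicitly because /dev/tcp is a bash feature, not POSIX sh.
tcp_check() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
  case $? in
    0)   echo "open" ;;
    124) echo "timed out" ;;               # exit status used by timeout(1)
    *)   echo "refused or unreachable" ;;
  esac
}
```

For example, `tcp_check 10.0.2.100 5432` tests the PostgreSQL port directly, regardless of whether ping gets through.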
**Security:** `tcpdump` captures all packets including credentials in unencrypted protocols. Never run tcpdump on production traffic unless absolutely necessary for diagnosis, never save captures to shared locations, and delete capture files immediately after analysis. In regulated environments (PCI-DSS, SOC 2), packet captures may require explicit change management approval.
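When a capture is unavoidable, one way to limit exposure is to bound it and avoid writing a file at all. The sketch below is an illustration, not a policy: the function name `bounded_capture` and the limits (200 packets, 30 seconds, 96 bytes per packet) are assumptions. Truncating each packet with `-s` keeps mostly headers, although a few payload bytes can still appear.

```shell
# Print a short, bounded capture to the terminal instead of saving it.
# Usage: bounded_capture INTERFACE FILTER...
bounded_capture() {
  iface=$1; shift
  # -c 200: stop after 200 packets   -s 96: keep only the first 96 bytes of each
  # timeout 30: stop after 30 seconds even if fewer packets were seen
  sudo timeout 30 tcpdump -i "$iface" -n -c 200 -s 96 "$@"
}
```

For example, `bounded_capture eth0 host 10.0.2.100 and port 5432` shows who is talking to the database without leaving a capture file behind.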