Overview and What You Will Learn
A production server is slow. Users are complaining. The on-call engineer needs to identify the bottleneck in under 5 minutes. Is it CPU saturation? Memory pressure causing swapping? Disk I/O wait? Network throughput exhausted? Each bottleneck has a different fix — but you cannot fix what you cannot identify.
By the end of this lab you will:
- Read load average correctly and know what it means per CPU core
- Use `free -h` and interpret the `available` column (not `free`)
- Diagnose I/O bottlenecks with `iostat -x` and `iotop`
- Find which process consumes disk I/O and network bandwidth
- Run a 5-minute production bottleneck diagnosis using a decision tree
Why This Matters in Production
During a Hotstar IPL live stream, millions of concurrent users create enormous I/O load. An on-call engineer who mistakes I/O wait for CPU saturation will add CPU capacity that does nothing — while the real disk bottleneck worsens. The engineers who diagnose correctly in 2 minutes are the ones who have practiced these tools until they are instinctive.
Core Principles
Bottleneck diagnosis decision tree:

    +------------------------------------------+
    | Server is slow -- where is the problem?  |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Check load average (uptime or top)       |
    | > number of CPU cores? -> CPU issue      |
    +------------------------------------------+
              |                      |
          CPU high                CPU ok
              |                      |
              v                      v
    +------------------+    +------------------+
    | Check vmstat     |    | Check free -h    |
    | us+sy vs wa      |    | available memory |
    +------------------+    +------------------+
        |          |                 |
      us+sy        wa            available
      high        high              low
        |          |                 |
        v          v                 v
       CPU      Disk I/O       Memory pressure
      bound    bottleneck      -> check swap

Memory fields explained:

    $ free -h
                   total        used        free      shared  buff/cache   available
    Mem:            15Gi       4.2Gi       1.1Gi       312Mi       9.8Gi      10.6Gi
    Swap:          2.0Gi       128Mi       1.9Gi

    total      = total RAM installed
    used       = RAM in use by processes
    free       = completely unused RAM
    buff/cache = used by the kernel for disk cache (mostly reclaimable)
    available  = kernel's estimate of memory usable without swapping
                 (roughly free + reclaimable cache)

    CRITICAL: Monitor "available", not "free"
    A server with 1GB free but 10GB buff/cache has close to 11GB available.
    Memory is a problem only when "available" runs low, not when "free" does.

Detailed Step-by-Step Practical Lab
Milestone 1: Understand and check load average
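The first branch of the decision tree compares the load average with the core count. A minimal sketch of that comparison follows. `load_per_core` and `classify_load` are hypothetical helper names, and the 0.7 "busy" threshold is an assumed rule of thumb, not something this lab defines.

```shell
# Sketch: normalize a load average by core count and classify it.
# Helper names and the 0.7 "busy" threshold are illustrative assumptions.

# Print load average divided by core count, with two decimals.
load_per_core() {
    awk -v avg="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", avg / cores }'
}

# Print ok / busy / overloaded for a given load average and core count.
classify_load() {
    awk -v avg="$1" -v cores="$2" 'BEGIN {
        ratio = avg / cores
        if (ratio > 1.0)       print "overloaded"
        else if (ratio >= 0.7) print "busy"
        else                   print "ok"
    }'
}

# Values from the 4-core illustration:
load_per_core 8.00 4    # prints 2.00
classify_load 8.00 4    # prints overloaded
classify_load 1.00 4    # prints ok
```

On a live Linux host the inputs could come from `cut -d' ' -f1 /proc/loadavg` and `nproc`. An "overloaded" result only says that tasks are queuing; it does not say whether they wait for CPU or for disk.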

    # Quick load average check
    uptime
    # 10:23:45 up 15 days, load average: 1.23, 0.87, 0.65
    #                                    ^1min ^5min ^15min

    # Load average meaning on a 4-core server:
    #   1.00 = 1 core fully busy, 3 idle (25% utilized)
    #   4.00 = all 4 cores fully busy (100% utilized)
    #   8.00 = 4 cores busy + 4 jobs waiting (200% -- overloaded)

    # How many CPUs does this server have?
    nproc
    # 4
    grep -c ^processor /proc/cpuinfo
    # 4

    # Sustained load > nproc means tasks are queuing. On Linux the load
    # average also counts tasks blocked on disk I/O, so confirm with vmstat.

    # vmstat for CPU breakdown (1-second intervals, 5 samples)
    # The first sample is an average since boot; read the later ones.
    vmstat 1 5
    # procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    #  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
    #  2  0      0   1.1G    82M   9.8G    0    0     0   124  892 1823 12  3 84  1  0
    #
    # r  = runnable processes (running or waiting for CPU)
    # b  = processes in uninterruptible sleep (usually waiting on I/O)
    # us = user CPU %     sy = system CPU %
    # id = idle CPU %     wa = I/O wait %
    # wa > 10% = I/O bottleneck
    # id < 5%  = CPU saturated

Milestone 2: Monitor memory usage
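The "monitor available, not free" rule from the memory fields section can be expressed as a percentage of total RAM. The sketch below parses text in `/proc/meminfo` format. `available_pct` is a hypothetical helper, and the sample values are made up to resemble the 15Gi server shown earlier.

```shell
# Sketch: percentage of RAM the kernel reports as available.
# available_pct is a hypothetical helper; it reads /proc/meminfo-style text on stdin.

available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%d\n", avail * 100 / total
            else           exit 1
        }'
}

# Made-up sample (values in kB), not output captured from a real host.
sample='MemTotal:       15728640 kB
MemFree:         1153434 kB
MemAvailable:   11114906 kB'

printf '%s\n' "$sample" | available_pct    # prints 70
```

On a real host the same function could be fed with `available_pct < /proc/meminfo`, assuming a kernel recent enough to expose `MemAvailable`.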

    # Human-readable memory summary
    free -h

    # Watch memory change every 2 seconds
    watch -n2 free -h

    # Detailed memory breakdown from the kernel
    # (/proc/meminfo has SwapTotal and SwapFree; there is no SwapUsed field)
    grep -E 'MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree' /proc/meminfo

    # Check for OOM killer activity (memory exhaustion events)
    # May need sudo if the kernel restricts dmesg to root
    dmesg | grep -iE 'oom|killed process' | tail -10
    # Out of memory: Kill process 2341 (node) score 450 or sacrifice child
    # Killed process 2341 (node) total-vm:1048576kB, anon-rss:512000kB

    # If OOM killer has been active, check journal
    journalctl -k | grep -i oom | tail -10

    # Find which processes use the most RAM
    ps aux --sort=-%mem | head -10

    # Resident memory (RSS) per process, here for all "node" processes
    # Note: RSS counts shared pages in full for every process that maps them;
    # on recent kernels the Pss line in /proc/<pid>/smaps_rollup splits them
    # proportionally.
    for pid in $(pgrep node); do
        echo -n "PID $pid: "
        grep VmRSS "/proc/$pid/status"
    done

Milestone 3: Diagnose disk I/O bottlenecks
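When `wa` is high, the next question is which device is saturated. The sketch below scans text shaped like `iostat -x` output and prints devices whose `%util` exceeds a threshold. `busy_devices` is a hypothetical helper and the default of 90 is an assumption. It finds the `%util` column by header name because the column order differs between sysstat versions.

```shell
# Sketch: list devices whose %util is above a limit (default 90, an assumption).
# busy_devices is a hypothetical helper; it reads iostat -x style text on stdin.

busy_devices() {
    awk -v limit="${1:-90}" '
        # Header row: remember which column holds %util.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Data rows: print device and utilization when above the limit.
        col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
    '
}

# Illustrative sample in the same shape as the iostat example in this milestone.
sample='Device  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s  await  %util
xvda      0.00    1.23   2.34   15.67   18.7K    45.2K   4.23   12.5
xvdb      0.00    0.01   0.12  124.56    0.9K   456.2K  45.67   98.1'

printf '%s\n' "$sample" | busy_devices 90    # prints: xvdb 98.1
```

With sysstat installed, the function could be fed from `iostat -x 1 2`. Treat the result as a hint: devices that serve many requests in parallel can show a high `%util` without being saturated.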

    # iostat -- disk I/O statistics
    # Install if missing: apt install sysstat
    iostat -x 1 5

    # Key columns explained:
    # Device  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s  await  %util
    # xvda      0.00    1.23   2.34   15.67   18.7K    45.2K   4.23   12.5
    # xvdb      0.00    0.01   0.12  124.56    0.9K   456.2K  45.67   98.1  <- problem!
    #
    # %util   = % of time the disk was busy (near 100% = saturated)
    # await   = average I/O wait time in ms (high = slow or overloaded disk);
    #           newer sysstat versions report r_await and w_await instead
    # r/s w/s = reads and writes per second
    # The first report covers the time since boot; read the later ones.

    # Find which PROCESS is doing the I/O
    # iotop shows per-process disk I/O live
    sudo apt install iotop
    sudo iotop

    # Non-interactive iotop (show top I/O consumers once)
    sudo iotop -b -n1 | head -20

    # Check disk space (separate from I/O)
    df -h

    # Find what is consuming disk space
    du -sh /var/* | sort -rh | head -10
    du -sh /var/log/* | sort -rh | head -10

Milestone 4: Monitor network usage
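A pile-up of connections in a single TCP state is easier to spot as counts than as a raw listing. The sketch below tallies the first column of `ss -tan` style text and skips the header row. `count_states` is a hypothetical helper, and the sample uses made-up documentation addresses.

```shell
# Sketch: count TCP connections per state from ss -tan style text on stdin.
# count_states is a hypothetical helper name.

count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}

# Made-up sample, not captured from a real host.
sample='State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    0.0.0.0:22           0.0.0.0:*
ESTAB      0      0      192.0.2.10:22        198.51.100.7:51234
ESTAB      0      0      192.0.2.10:5432      198.51.100.8:40112
TIME-WAIT  0      0      192.0.2.10:443       198.51.100.9:38000'

printf '%s\n' "$sample" | count_states
# prints:
# 2 ESTAB
# 1 LISTEN
# 1 TIME-WAIT
```

On a live host the input would be `ss -tan | count_states`.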

    # Show all listening ports and which process owns them
    # (run with sudo to see processes owned by other users)
    ss -tulpn

    # Show established connections
    ss -t state established

    # Count TCP connections by state (-a shows sockets in every state;
    # NR > 1 skips the header line)
    ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

    # Show connections to a specific port
    ss -t state established '( dport = :5432 or sport = :5432 )'

    # Network interface statistics (replace eth0 with your interface name)
    ip -s link show eth0
    # Shows: bytes transmitted, packets, errors, drops

    # Real-time bandwidth by interface
    # Install: apt install nload
    nload eth0

    # Real-time bandwidth by process
    # Install: apt install nethogs
    sudo nethogs eth0

    # Real-time bandwidth by connection
    # Install: apt install iftop
    sudo iftop -i eth0

Milestone 5: All-in-one monitoring with dstat

    # Install dstat
    sudo apt install dstat

    # Show CPU, disk, network, paging, and system stats together
    dstat

    # Also write every sample to a CSV file for later analysis
    dstat --output /tmp/stats-$(date +%Y%m%d-%H%M%S).csv

    # Custom columns for a payment service investigation
    dstat -cdnmgsy 1
    # c = CPU     d = disk    n = network    m = memory
    # g = paging  s = swap    y = system (interrupts, context switches)

    # Example output during an I/O spike:
    # ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
    # usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
    #  12   3  20  65   0   0| 450M  120M|8.23k  123k|   0     0 |4523  9821
    #              ^65% wait = severe I/O bottleneck

Milestone 6: 5-minute production diagnosis runbook
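The decision tree from Core Principles can be sketched as a small function that takes numbers already read from `vmstat` and `free`. `triage` is a hypothetical helper. The `wa > 10` and `id < 5` thresholds come from the Milestone 1 notes; the 10% floor for available memory and the order of the checks are assumptions made for this sketch.

```shell
# Sketch of the decision tree as a function.
# Usage: triage <wa%> <id%> <available memory as % of total RAM>
# triage is a hypothetical helper; inputs are whole numbers, as vmstat prints them.

triage() {
    if [ "$1" -gt 10 ]; then
        echo "disk I/O bottleneck: check iostat -x and iotop"
    elif [ "$2" -lt 5 ]; then
        echo "CPU bound: check top CPU processes"
    elif [ "$3" -lt 10 ]; then
        echo "memory pressure: check swap and OOM events"
    else
        echo "no obvious local bottleneck: check network and dependencies"
    fi
}

triage 65 20 70   # prints: disk I/O bottleneck: check iostat -x and iotop
triage 1 2 70     # prints: CPU bound: check top CPU processes
triage 1 84 4     # prints: memory pressure: check swap and OOM events
```

Assuming the `vmstat` column layout shown in Milestone 1, `vmstat 1 2 | tail -1 | awk '{print $16, $15}'` would yield the `wa` and `id` inputs.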

    #!/usr/bin/env bash
    # Quick production diagnosis -- run when a server is slow

    echo "=== $(date) ==="
    echo ""
    echo "--- Load Average and CPU ---"
    uptime
    echo ""
    vmstat 1 3 | tail -1
    echo ""

    echo "--- Memory ---"
    free -h
    echo ""

    echo "--- Top CPU Processes ---"
    ps aux --sort=-%cpu | head -6
    echo ""

    echo "--- Top Memory Processes ---"
    ps aux --sort=-%mem | head -6
    echo ""

    echo "--- Disk I/O ---"
    # -y skips the since-boot report so the single sample shows current activity
    iostat -xy 1 1 | grep -v '^$' | tail -10
    echo ""

    echo "--- Disk Space ---"
    df -h | grep -v tmpfs
    echo ""

    echo "--- Network Connections ---"
    # NR > 1 skips the ss header line
    ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
    echo ""

    echo "--- Recent OOM Events ---"
    dmesg | grep -i oom | tail -5
    echo ""

    echo "--- Failed Services ---"
    systemctl list-units --state=failed --no-legend

Production Best Practices and Common Pitfalls
| Symptom | First Check | What to Look For |
|---|---|---|
| Server slow, CPU high | `vmstat 1 5` | High `us` or `sy`: CPU bound |
| Server slow, CPU idle | `vmstat 1 5` | High `wa`: disk I/O bottleneck |
| Out of memory errors | `free -h` | `available` near zero |
| App cannot write files | `df -h` | Filesystem at 100% |
| App cannot create files | `df -i` | Inodes at 100% |
| High network latency | `ss -t \| wc -l` | |
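
The `df -h` and `df -i` rows in the table can also be checked mechanically. The sketch below reads `df -P` style output and prints mount points at or above a usage threshold. `full_filesystems` is a hypothetical helper, the default of 90 is an assumption, and the sample is made up.

```shell
# Sketch: print mount points whose usage is at or above a limit (default 90).
# full_filesystems is a hypothetical helper; it reads df -P style text on stdin.
# Mount points containing spaces are not handled.

full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)          # "94%" -> "94"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Made-up sample, not captured from a real host.
sample='Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/xvda1        51474912 48391234   3083678      94% /
/dev/xvdb1       103081248 20616250  82464998      20% /data
tmpfs              8123456        0   8123456       0% /run'

printf '%s\n' "$sample" | full_filesystems 90    # prints: / 94%
```

On a live host the input would be `df -P | full_filesystems 90`. With GNU `df`, `df -Pi` should produce the same column layout for inode usage.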
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| Load average | `uptime` |
| CPU breakdown | `vmstat 1 5` |
| Memory summary | `free -h` |
| Disk I/O stats | `iostat -x 1 3` |
| Per-process I/O | `sudo iotop -b -n1` |
| Disk space | `df -h` |
| Largest dirs | `du -sh /var/* \| sort -rh \| head -10` |
| Open ports | `ss -tulpn` |
| OOM events | `dmesg \| grep -i oom \| tail -5` |
| All-in-one | `dstat -cdnm 1` |
**Tip:** Save the 5-minute diagnosis script as `/usr/local/bin/diagnose` with `chmod +x`. When a server is slow, run `diagnose` and the output immediately shows you where to look next.
**Remember:** `free -h` shows `available` memory, which includes reclaimable buffer/cache. A server with only 500MB `free` but 8GB `available` is not under memory pressure. Only when `available` drops below 10-15% of total RAM should you be concerned.