At 2am an alert fires. Your payment API is timing out. Users are seeing 503 errors. You SSH into the server and open top.
Most engineers stare at it, pick the process with the highest CPU, and guess. That approach takes 20 minutes. This one takes 4.
Work through these steps in order. Every check narrows the problem. Stop at the step where you find the culprit — you rarely need to reach step 7.
Step 1: System Snapshot (30 seconds)
Step 2: CPU (check if saturated)
Step 3: Memory (check for OOM or swap)
Step 4: Disk (space, inodes, I/O wait)
Step 5: Network (connections, port state)
Step 6: Logs (find the exact error)
Step 7: Recent Changes (what changed in the last 2 hours)
Step 1: System Snapshot
Run these four commands immediately. They give you the full picture in 30 seconds.
uptime
free -h
df -h
ss -tulpn | wc -l
uptime shows the load average — three numbers representing 1, 5, and 15-minute windows.
10:01:23 up 14 days, load average: 8.42, 4.11, 2.05
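A load average above your CPU core count means work is queueing. On a 4-core server, 8.42 means roughly twice as many tasks are runnable or blocked in uninterruptible I/O as the cores can service. Linux counts both, so a high load does not always mean the CPU itself is the bottleneck. Run nproc to confirm your core count.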
free -h reveals memory pressure. Watch available, not free: free ignores page cache that the kernel can reclaim. As a rule of thumb, when available drops below about 10% of total RAM, the kernel is reclaiming hard and may start swapping to disk. That is catastrophic for databases and APIs.
df -h catches a full disk before you waste 10 minutes elsewhere. A full disk causes processes to crash silently with no obvious log trail.
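The two thresholds above (load relative to core count, and available memory as a share of total RAM) can be turned into numbers with a small sketch. The function names load_per_core and mem_available_pct are hypothetical helpers, not standard tools, and the 10% figure is only the rule of thumb from the text.

```shell
# load_per_core LOAD CORES
# Prints the load average divided by the core count, to one decimal place.
# Anything above 1.0 means more tasks are queued than there are cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load / cores }'
}

# mem_available_pct
# Reads /proc/meminfo-formatted text on stdin and prints MemAvailable as an
# integer percentage of MemTotal (rounded down).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) print int(avail * 100 / total) }'
}
```

On a live host you might call them as `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` and `mem_available_pct < /proc/meminfo`.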
Step 2: CPU
If load average is above core count, find what is consuming it.
ps aux --sort=-%cpu | head -15
top -bn1 | grep "Cpu(s)" ## iowait percentage
The %CPU column shows the culprit immediately. At Swiggy, a misconfigured background job consuming 97% CPU caused the order API to queue requests for 8 seconds during a dinner spike — ps aux found it in 15 seconds.
Reading the CPU output pattern:
| Pattern | What it means |
|---|---|
| One process at 90%+ | Code regression — infinite loop or O(n^2) algorithm |
| Many processes at 10-30% | Traffic spike — scale horizontally |
| kswapd high | Memory pressure forcing page swaps |
| iowait above 20% | Disk bottleneck, not CPU — skip to Step 4 |
The iowait percentage from top is the most commonly missed signal. If %wa is above 20, the CPU is waiting on disk. Tuning the application will not help — you need to address disk I/O first.
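If you want that check as a yes/no answer rather than a number to eyeball, a sketch like the following can help. It assumes the procps-style summary line that top -bn1 prints (comma-separated fields such as "23.1 wa") and a locale that uses a dot as the decimal separator. Both function names are hypothetical.

```shell
# iowait_from_top
# Reads top's batch output on stdin and prints the %wa value from the Cpu(s) line.
iowait_from_top() {
    awk -F',' '/Cpu\(s\)/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /wa[ \t]*$/) { split($i, part, " "); print part[1]; exit }
    }'
}

# iowait_verdict WA
# Applies the 20% threshold from the text to a %wa value.
iowait_verdict() {
    awk -v wa="$1" 'BEGIN {
        if (wa + 0 > 20) print "disk-bound: go to Step 4"
        else             print "not disk-bound: keep looking at CPU"
    }'
}
```

Typical use would be `iowait_verdict "$(top -bn1 | iowait_from_top)"`.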
Step 3: Memory
free -h
dmesg | grep -i "oom\|killed" | tail -20
grep -E "MemAvailable|SwapTotal|SwapFree" /proc/meminfo ## no SwapUsed field exists; used = SwapTotal - SwapFree
When the kernel runs out of memory it invokes the OOM killer, which terminates the process with the highest badness score (exposed as /proc/<pid>/oom_score) to reclaim RAM. The victim is almost never the process you would expect.
An OOM entry in dmesg looks similar to this (the exact wording varies by kernel version, and sizes are reported in kB):
Out of memory: Kill process 1234 (node) score 821 or sacrifice child
Killed process 1234 (node) total-vm:2097152kB, anon-rss:1878016kB
The score 821 is the badness score the kernel assigned to that process; the highest-scoring process is killed first. At Zerodha, for example, when you see this it is almost always the analytics aggregation job, not the trading engine, consuming memory unnoticed until it starves everything else.
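To see which processes the kernel would pick next, before it happens, you can read each process's current score from /proc/<pid>/oom_score. The sketch below is one way to rank them. The name top_oom_candidates is hypothetical, and the second argument exists only so the function can be pointed at a test directory instead of /proc.

```shell
# top_oom_candidates [COUNT] [PROC_ROOT]
# Prints "score pid name" for the COUNT processes with the highest oom_score.
top_oom_candidates() {
    local count="${1:-5}"
    local root="${2:-/proc}"
    local dir pid score name
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue      # process may have exited
        pid="${dir##*/}"
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null)
        printf '%s %s %s\n' "$score" "$pid" "${name:-?}"
    done | sort -rn | head -n "$count"
}
```

Running `top_oom_candidates 5` on a host under memory pressure shows the likely victims in order, which tells you whether the service you care about is at risk.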
To find which processes are currently consuming swap:
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    swap=$(grep VmSwap /proc/$pid/status 2>/dev/null \
        | awk '{print $2}')
    [ "$swap" -gt "0" ] 2>/dev/null \
        && echo "PID $pid: ${swap}kB"
done | sort -t: -k2 -rn | head -10
Step 4: Disk
A full disk is the sneakiest failure mode. Processes write nothing, log nothing, and silently return errors that look like application bugs.
df -h ## Space usage per mount
df -ih ## Inode usage per mount
iostat -xz 1 3 ## I/O wait and utilisation, 3 samples
The inode trap is frequently missed. You can have 40GB of free disk space and zero inodes remaining. Every call that creates a file or directory, such as open() with O_CREAT or mkdir(), will fail with ENOSPC, the same error as a full disk, while opening existing files still works. The difference only appears in df -ih.
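Because inode exhaustion hides behind healthy-looking space numbers, it can be worth flagging it explicitly. The following sketch filters df inode output for mounts at or above a threshold. The name inode_hotspots and the 90% default are assumptions for illustration, and mount points containing spaces are not handled.

```shell
# inode_hotspots [THRESHOLD_PCT]
# Reads `df -iP` output on stdin and prints "mount use%" for every filesystem
# whose inode usage is at or above the threshold. Filesystems that report "-"
# for IUse% (no fixed inode table) are skipped.
inode_hotspots() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)
        if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

On a live host: `df -iP | inode_hotspots 90`.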
To locate the directory generating the most files:
find / -xdev -printf '%h\n' 2>/dev/null \
| sort | uniq -c | sort -rn | head -10
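For iostat, watch the %util column: sustained values above 80% suggest the disk is close to saturation, although on SSDs and RAID arrays that serve requests in parallel %util can overstate how busy the device really is. The await column (split into r_await and w_await on newer sysstat versions) shows average I/O wait time in milliseconds. Above 20ms for SSDs or 100ms for spinning disks indicates queueing that will slow every process touching that device.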
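iostat column positions differ between sysstat versions, so a filter that finds %util by its header name is safer than a fixed column number. This is a minimal sketch under that assumption. The name saturated_devices is hypothetical and 80 is the threshold from the text.

```shell
# saturated_devices [THRESHOLD_PCT]
# Reads `iostat -x` output on stdin and prints "device util" for each device
# row whose %util is at or above the threshold.
saturated_devices() {
    awk -v limit="${1:-80}" '
        $1 == "avg-cpu:" { col = 0; next }            # CPU block: ignore its rows
        $1 == "Device" || $1 == "Device:" {           # header: locate the %util column
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}
```

On a live host: `iostat -xz 1 3 | saturated_devices 80`. The first report covers the time since boot, so a device that appears only in the later samples is the one that is busy right now.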
Step 5: Network
Networking failures masquerade as application failures. What looks like a hanging API is often a connection that cannot be established.
ss -tulpn ## All listeners with process names
ss -ant | awk 'NR>1 {print $1}' \
| sort | uniq -c | sort -rn ## Connection counts by state
ss -ant | grep -c ESTAB ## Total established connections (ss prints ESTAB, not ESTABLISHED)
Connection state guide (ss prints these names with hyphens, for example TIME-WAIT):
| State | What it means |
|---|---|
| TIME_WAIT high | Normal after traffic spike — connections closing gracefully |
| CLOSE_WAIT high | Application bug — not closing connections after use |
| SYN_RECV high | TCP SYN flood or upstream connection leak |
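The guide above can be applied mechanically to ss output. The sketch below counts connections per state and prints a line for each state that crosses a limit. The name flag_connection_states and the default limit of 1000 are assumptions; pick a limit that matches your normal traffic. It matches the hyphenated state names that ss prints.

```shell
# flag_connection_states [LIMIT]
# Reads `ss -ant` output on stdin and prints "STATE count meaning" for each
# state in the guide whose count is at or above LIMIT.
flag_connection_states() {
    awk -v limit="${1:-1000}" '
        NR > 1 { count[$1]++ }                      # skip the header row
        END {
            if (count["CLOSE-WAIT"] + 0 >= limit + 0)
                print "CLOSE-WAIT", count["CLOSE-WAIT"], "application is not closing connections"
            if (count["SYN-RECV"] + 0 >= limit + 0)
                print "SYN-RECV", count["SYN-RECV"], "possible SYN flood or upstream connection leak"
            if (count["TIME-WAIT"] + 0 >= limit + 0)
                print "TIME-WAIT", count["TIME-WAIT"], "usually normal after a traffic spike"
        }'
}
```

On a live host: `ss -ant | flag_connection_states 1000`.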
Test connectivity to a specific dependency directly from the server:
timeout 3 bash -c \
'cat < /dev/null > /dev/tcp/db.internal.razorpay.net/5432' \
&& echo "DB port open" || echo "DB unreachable"
dig payments-db.internal.razorpay.net +short
This isolates whether the problem is application-level or infrastructure-level in one command. If the TCP connection fails, the application cannot fix it — you have a network, firewall, or service issue.
Step 6: Logs
By step 6 you know which component is under pressure. Now find the specific error that triggered the incident.
journalctl -u payment-api -n 100 --no-pager
journalctl -u payment-api --since "30 minutes ago"
journalctl -p err -b --no-pager | tail -30
grep -i "error\|exception\|fatal" \
/var/log/app/payment-api.log | tail -50
dmesg | tail -30
The most important question is: when did the first error appear?
grep -n "ERROR" /var/log/app/payment-api.log | head -3
The timestamp of the first error tells you where to look for the cause. What changed in the 2-5 minutes before that line?
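To read what the application was doing just before that first error, you can print the lines that lead up to it. This is a minimal sketch that assumes the log is in chronological order and that errors contain the literal string ERROR, as in the grep above. The function name and the default of 20 lines are hypothetical.

```shell
# context_before_first_error FILE [LINES_BEFORE]
# Prints the first line containing ERROR plus up to LINES_BEFORE lines before it.
# Returns 1 if the file contains no ERROR line.
context_before_first_error() {
    local file="$1"
    local before="${2:-20}"
    local first start
    first=$(grep -n -m 1 "ERROR" "$file" | cut -d: -f1)
    [ -n "$first" ] || return 1
    start=$((first - before))
    [ "$start" -ge 1 ] || start=1      # clamp when the error is near the top
    sed -n "${start},${first}p" "$file"
}
```

For example, `context_before_first_error /var/log/app/payment-api.log 50` shows the 50 lines before the first failure, which is often where the slow query, failed connection, or deploy marker sits.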
Step 7: Recent Changes
Eight out of ten production incidents trace back to a change in the last 2 hours. Always end your investigation by confirming what changed.
find /etc /opt /app -mmin -120 -type f 2>/dev/null
journalctl --since "2 hours ago" \
| grep -E "Started|Stopped|Failed" | head -20
grep " install \| upgrade " \
/var/log/dpkg.log | tail -10
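The find command above lists changed files in directory order, which makes it hard to line them up against the time of the first error. A variant that prints modification times, newest first, could look like this. It relies on GNU find's -printf, and the name recent_changes is hypothetical.

```shell
# recent_changes MINUTES DIR...
# Lists regular files under the given directories modified in the last MINUTES
# minutes as "YYYY-MM-DD HH:MM path", newest first.
recent_changes() {
    local minutes="$1"
    shift
    find "$@" -xdev -type f -mmin "-$minutes" \
        -printf '%T@ %TY-%Tm-%Td %TH:%TM %p\n' 2>/dev/null \
        | sort -rn | cut -d' ' -f2-      # sort on epoch seconds, then drop them
}
```

For example: `recent_changes 120 /etc /opt /app`.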
If the server runs a cron job, check whether it fired recently:
grep "CRON" /var/log/syslog \
| grep "$(date --date='1 hour ago' '+%b %e %H')" \
| head -10
Symptom: Checkout API p99 latency spiked from 180ms to 4200ms at 01:47. Payment failure rate rose to 12%.
## Step 1
uptime
## load average: 12.45, 8.31, 4.22 on a 4-core host
## Step 2
ps aux --sort=-%cpu | head -5
## postgres 3421 97.3 12.1 autovacuum worker
## Step 6
journalctl -u postgresql --since "1 hour ago" \
| grep -i "autovacuum\|lock"
## LOG: autovacuum: processing table "payments" (48 GB)
## LOG: process 3421 acquired lock on relation "payments"
Root cause: PostgreSQL autovacuum acquired a lock on the 48GB payments table during peak checkout traffic. Every query to that table queued behind the lock for 3 minutes 40 seconds.
Total investigation time: 4 minutes. The fix — cancelling the autovacuum and scheduling it for off-peak hours — took 2 minutes.
Copy this block when an alert fires. Run it top to bottom.
uptime && free -h && df -h ## snapshot
ps aux --sort=-%cpu | head -10 ## cpu
dmesg | grep -i oom | tail -5 ## memory
df -ih && iostat -xz 1 2 ## disk
ss -tulpn | head -20 ## network
journalctl -p err -b --no-pager \
| tail -30 ## logs
find /etc /opt -mmin -120 \
-type f 2>/dev/null ## changes
Run this sequence on every P1 alert before escalating. Systematic diagnosis reduces MTTR from 20-30 minutes to under 5 minutes — the difference between one missed SLO and ten.
For teams running on AWS in ap-south-1 (Mumbai), iostat is especially relevant on gp2 EBS volumes. Smaller gp2 volumes depend on a burst credit bucket for IOPS, and sustained write loads can drain it, causing iowait spikes that are invisible to application-level monitoring.
Configure a motd or alias for this sequence on every production bastion host so any engineer can run it without looking it up during an incident.
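One way to package the sequence is a shell function dropped into a profile script on the bastion host. The sketch below is an assumption about how you might do that: the names section and triage, the section labels, and a location such as /etc/profile.d/triage.sh are all illustrative. Missing tools are skipped rather than aborting the run.

```shell
# section LABEL COMMAND [ARGS...]
# Prints a header, then runs the command if it is installed. Failures are
# reported inline so one broken tool never stops the rest of the sequence.
section() {
    echo "== $1 =="
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "($1 exited non-zero)"
    else
        echo "(skipped: $1 not installed)"
    fi
}

# triage: the cheat-sheet sequence, top to bottom.
triage() {
    section "load"    uptime
    section "memory"  free -h
    section "disk"    df -h
    section "inodes"  df -ih
    section "cpu"     sh -c 'ps aux --sort=-%cpu | head -10'
    section "oom"     sh -c 'dmesg | grep -i oom | tail -5'
    section "io"      iostat -xz 1 2
    section "network" sh -c 'ss -tulpn | head -20'
    section "logs"    sh -c 'journalctl -p err -b --no-pager | tail -30'
    section "changes" find /etc /opt -mmin -120 -type f
}
```

With the file sourced at login, any engineer can type `triage` and paste the output straight into the incident channel.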