How to Troubleshoot a Linux Production Server: A Systematic Approach

At 2am an alert fires. Your payment API is timing out. Users are seeing 503 errors. You SSH into the server and open top.

Most engineers stare at it, pick the process with the highest CPU, and guess. That approach takes 20 minutes. This one takes 4.

The 7-Step Diagnostic Sequence

Work through these steps in order. Every check narrows the problem. Stop at the step where you find the culprit — you rarely need to reach step 7.

TEXT

Step 1: System Snapshot   (30 seconds)
Step 2: CPU               (check if saturated)
Step 3: Memory            (check for OOM or swap)
Step 4: Disk              (space, inodes, I/O wait)
Step 5: Network           (connections, port state)
Step 6: Logs              (find the exact error)
Step 7: Recent Changes    (what changed in the last 2 hours)

Step 1 — Get the System Snapshot

Run these four commands immediately. They give you the full picture in 30 seconds.

BASH

uptime
free -h
df -h
ss -tulpn | wc -l

uptime shows the load average — three numbers representing 1, 5, and 15-minute windows.

TEXT

10:01:23 up 14 days, load average: 8.42, 4.11, 2.05

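A load average above your CPU core count suggests the system is overloaded. On a 4-core server, 8.42 means roughly twice as many tasks are runnable or blocked as the CPUs can service at once. On Linux the figure also counts tasks in uninterruptible sleep (usually waiting on disk I/O), so a high load does not always mean the CPU is the bottleneck; Steps 2 and 4 separate the two cases. Run nproc to confirm your core count.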
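
To turn that comparison into a single number, divide the 1-minute load by the core count. The sketch below does this; load_per_core is a hypothetical helper name used for illustration, not a standard tool.

```bash
# load_per_core LOAD CORES: print LOAD divided by CORES, to two decimals.
# Hypothetical helper for illustration. Values near or above 1.00 mean
# tasks are queueing for CPU (or, on Linux, blocked on I/O).
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Typical use on a live Linux host (reads the 1-minute load average):
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

For the uptime sample above, 8.42 on 4 cores gives a ratio of roughly 2.1.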

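free -h reveals memory pressure. Watch available, not free: free excludes page cache that the kernel can reclaim on demand, while available estimates how much memory new work can use without swapping. When available falls to a small fraction of total RAM (10% is a reasonable alert threshold, not a kernel setting), expect heavier reclaim and, if swap is configured, swapping to disk. That is catastrophic for databases and APIs.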
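
A minimal sketch of checking this programmatically, assuming a Linux host where /proc/meminfo exposes MemTotal and MemAvailable. The function name mem_available_pct is hypothetical.

```bash
# mem_available_pct [FILE]: MemAvailable as a whole-number percentage
# of MemTotal. FILE defaults to /proc/meminfo; the parameter exists so the
# parsing can be tried against a saved copy. Hypothetical helper.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' \
    "${1:-/proc/meminfo}"
}

# Usage: mem_available_pct
```

The result is truncated rather than rounded, which errs on the side of reporting less available memory.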

df -h catches a full disk before you waste 10 minutes elsewhere. A full disk causes processes to crash silently with no obvious log trail.

ss -tulpn | wc -l gives a quick count of listening sockets (plus one header line). A count lower than usual for this host suggests a service is no longer listening.

Step 2 — Diagnose CPU Saturation

If load average is above core count, find what is consuming it.

BASH

ps aux --sort=-%cpu | head -15
top -bn1 | grep "Cpu(s)"       ## iowait percentage

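The %CPU column usually shows the culprit immediately. Note that ps reports CPU time divided by the process's total running time, so a long-lived process that only recently started spinning can look modest; cross-check with top when the numbers do not match the load. At Swiggy, a misconfigured background job consuming 97% CPU caused the order API to queue requests for 8 seconds during a dinner spike, and ps aux found it in 15 seconds.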

Reading the CPU output pattern:

Pattern	What it means
One process at 90%+	Code regression — infinite loop or O(n^2) algorithm
Many processes at 10-30%	Traffic spike — scale horizontally
`kswapd` high	Memory pressure forcing page swaps
iowait above 20%	Disk bottleneck, not CPU — skip to Step 4

The iowait percentage from top is the most commonly missed signal. If %wa is above 20, the CPU is waiting on disk. Tuning the application will not help — you need to address disk I/O first.
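
To use that value in a script or alert, the wa figure can be pulled out of top's summary line. This is a sketch under the assumption that top prints the comma-separated procps layout in the C locale; iowait_from_top is a hypothetical helper name.

```bash
# iowait_from_top: read top's "Cpu(s)" summary line on stdin and print
# the wa (iowait) value. Hypothetical helper; assumes the comma-separated
# layout procps top prints, e.g. "... 70.2 id, 25.4 wa, ...".
iowait_from_top() {
  awk -F',' '/Cpu\(s\)/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /wa/) { gsub(/[^0-9.]/, "", $i); print $i; exit }
  }'
}

# Usage: LC_ALL=C top -bn1 | iowait_from_top
```

Forcing LC_ALL=C avoids locales that print a decimal comma, which would otherwise split the field.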

Step 3 — Diagnose Memory and OOM Events

BASH

free -h
dmesg | grep -i "oom\|killed" | tail -20
grep -E "MemAvailable|SwapTotal|SwapFree" /proc/meminfo   ## swap used = SwapTotal - SwapFree

When the kernel runs out of memory it invokes the OOM killer, which terminates the process with the highest oom_score (a badness score driven mainly by memory use) to reclaim RAM. The victim is often not the process you would expect.

An OOM entry in dmesg looks roughly like this (simplified; real kernel output reports sizes in kB):

TEXT

Out of memory: Kill process 1234 (node) score 821 or sacrifice child
Killed process 1234 (node) total-vm:2048MB, anon-rss:1834MB

The score 821 is the process's badness score; the kernel kills the highest-scoring process first. At Zerodha, for example, the culprit in these cases is almost always the analytics aggregation job, not the trading engine, consuming memory unnoticed until it starves everything else.
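
The kernel exposes each process's current score in /proc/<pid>/oom_score, so you can see the likely next victim before the OOM killer runs. The sketch below lists the highest scorers; top_oom_scores is a hypothetical helper name, and the optional first argument exists only so the function can be pointed at a test directory.

```bash
# top_oom_scores [PROC_ROOT] [N]: list the N processes with the highest
# oom_score as "score pid command". PROC_ROOT defaults to /proc.
# Hypothetical helper for illustration.
top_oom_scores() (
  root="${1:-/proc}"
  count="${2:-10}"
  for dir in "$root"/[0-9]*; do
    # Processes can exit mid-scan, so skip entries that vanish.
    score=$(cat "$dir/oom_score" 2>/dev/null) || continue
    comm=$(cat "$dir/comm" 2>/dev/null) || continue
    printf '%s %s %s\n' "$score" "${dir##*/}" "$comm"
  done | sort -rn | head -n "$count"
)

# Usage: top_oom_scores
```

If the process at the top of this list is one you cannot afford to lose, that is a signal to cap or move the real memory consumer before the kernel makes the choice for you.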

To find which processes are currently consuming swap:

BASH

for pid in $(ls /proc | grep -E '^[0-9]+$'); do
  swap=$(grep VmSwap /proc/$pid/status 2>/dev/null \
    | awk '{print $2}')
  [ "$swap" -gt "0" ] 2>/dev/null \
    && echo "PID $pid: ${swap}kB"
done | sort -t: -k2 -rn | head -10

Step 4 — Diagnose Disk Issues

A full disk is the sneakiest failure mode. Processes write nothing, log nothing, and silently return errors that look like application bugs.

BASH

df -h          ## Space usage per mount
df -ih         ## Inode usage per mount
iostat -xz 1 3 ## I/O wait and utilisation, 3 samples

The inode trap is frequently missed. You can have 40GB of free disk space and zero inodes remaining. Every attempt to create a file or directory then fails with ENOSPC, the same error as a full disk, while reads and writes to existing files keep working. The difference only appears in df -ih.
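
A minimal sketch for flagging mounts that are close to inode exhaustion, assuming the POSIX column layout that df -iP prints on Linux. The name inode_hogs and the default threshold of 90 are illustrative choices.

```bash
# inode_hogs [THRESHOLD]: read `df -iP` output on stdin and print
# "mountpoint IUse%" for mounts at or above THRESHOLD percent (default 90).
# Hypothetical helper; mount points containing spaces are not handled.
inode_hogs() {
  awk -v max="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)
    # Filesystems without inode accounting report "-" and are skipped.
    if (use ~ /^[0-9]+$/ && use + 0 >= max + 0) print $6, use "%"
  }'
}

# Usage: df -iP | inode_hogs 90
```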

To locate the directory generating the most files:

BASH

find / -xdev -printf '%h\n' 2>/dev/null \
  | sort | uniq -c | sort -rn | head -10

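For iostat, watch the %util column: sustained values above 80% suggest the device is saturated, although on SSDs, NVMe drives, and RAID arrays that serve requests in parallel, %util can read high well before the device reaches its limit. The await columns (r_await and w_await in current sysstat releases) show average I/O time in milliseconds, including time spent queued. Above 20ms for SSDs or 100ms for spinning disks indicates queueing that will slow every process touching that device.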
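
Column positions in iostat -x output differ between sysstat versions, so a filter is safer when it finds %util by header name. The following is a sketch along those lines; busy_devices is a hypothetical helper name.

```bash
# busy_devices [THRESHOLD]: read `iostat -x` output on stdin and print
# "device %util" for devices at or above THRESHOLD (default 80).
# Hypothetical helper. The %util column is located from the "Device"
# header line because its position varies between sysstat versions.
busy_devices() {
  awk -v max="${1:-80}" '
    $1 ~ /^Device/ {
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 >= max + 0 { print $1, $col }
  '
}

# Usage: iostat -xz 1 3 | busy_devices 80
```

The first report iostat prints covers the period since boot, so a device that only appears in the later samples is the one that is busy right now.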

Step 5 — Diagnose Network and Connections

Networking failures masquerade as application failures. What looks like a hanging API is often a connection that cannot be established.

BASH

ss -tulpn                             ## All listeners with process names
ss -ant | awk 'NR>1 {print $1}' \
  | sort | uniq -c | sort -rn         ## Connection counts by state
ss -Htn state established | wc -l     ## Total established connections (ss prints the state as ESTAB)

Connection state guide (ss prints these names with hyphens; netstat uses underscores):

State	What it means
`TIME-WAIT` high	Normal after a traffic spike; connections are closing gracefully
`CLOSE-WAIT` high	Application bug; the peer closed but the application has not closed its socket
`SYN-RECV` high	Possible SYN flood, or clients that never complete the handshake
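
When connections pile up in the close-wait state, grouping them by remote address shows which dependency's connections the application is leaking. This is a sketch that assumes the default column layout of ss -tan, where the state is the first field (printed as CLOSE-WAIT) and the peer address is the fifth. The name close_wait_by_peer is hypothetical.

```bash
# close_wait_by_peer: read `ss -tan` output on stdin and print
# "count peer-address" for sockets in CLOSE-WAIT, busiest peer first.
# Hypothetical helper for illustration.
close_wait_by_peer() {
  awk '$1 == "CLOSE-WAIT" {
    peer = $5
    sub(/:[^:]*$/, "", peer)   # strip the trailing :port
    count[peer]++
  }
  END { for (p in count) print count[p], p }' | sort -rn
}

# Usage: ss -tan | close_wait_by_peer | head
```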

Test connectivity to a specific dependency directly from the server:

BASH

timeout 3 bash -c \
  'cat < /dev/null > /dev/tcp/db.internal.razorpay.net/5432' \
  && echo "DB port open" || echo "DB unreachable"

dig +short db.internal.razorpay.net

Together these checks show whether the problem is application-level or infrastructure-level. If dig returns nothing, name resolution is failing. If the TCP connection fails, no application change will fix it: you have a network, firewall, or service issue.

Step 6 — Read the Logs

By step 6 you know which component is under pressure. Now find the specific error that triggered the incident.

BASH

journalctl -u payment-api -n 100 --no-pager
journalctl -u payment-api --since "30 minutes ago"
journalctl -p err -b --no-pager | tail -30
grep -i "error\|exception\|fatal" \
  /var/log/app/payment-api.log | tail -50
dmesg | tail -30

The most important question is: when did the first error appear?

BASH

grep -n "ERROR" /var/log/app/payment-api.log | head -3

The timestamp of the first error tells you where to look for the cause. What changed in the 2-5 minutes before that line?
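
On a long-running service the first ERROR in the file may be days old, so it helps to restrict the search to the incident window. The sketch below assumes every log line starts with a sortable ISO-8601 timestamp; the source does not specify the format of payment-api.log, so treat that as an assumption. first_error_since is a hypothetical helper name.

```bash
# first_error_since TIMESTAMP FILE: print the first line containing ERROR
# whose leading timestamp is at or after TIMESTAMP.
# Assumes each line starts with a sortable ISO-8601 timestamp
# (for example 2024-05-01T01:45:10); adjust for other log formats.
first_error_since() {
  awk -v since="$1" '
    /ERROR/ && (substr($0, 1, length(since)) "") >= (since "") { print; exit }
  ' "$2"
}

# Usage: first_error_since 2024-05-01T01:40 /var/log/app/payment-api.log
```

The comparison is a plain string comparison on the timestamp prefix, which is why the format must sort lexically in time order.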

Step 7 — Find What Changed

A large share of production incidents trace back to a change made in the previous couple of hours. Always end your investigation by confirming what changed.

BASH

find /etc /opt /app -mmin -120 -type f 2>/dev/null
journalctl --since "2 hours ago" \
  | grep -E "Started|Stopped|Failed" | head -20
grep " install \| upgrade " \
  /var/log/dpkg.log | tail -10          ## Debian/Ubuntu
## On RHEL-family hosts, try: rpm -qa --last | head -10
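
The bare find output gives paths but not times. A small variation, sketched here on the assumption that GNU find is available, prints the modification time alongside each path so the changes can be lined up against the first error. recent_changes is a hypothetical helper name.

```bash
# recent_changes MINUTES DIR...: list regular files modified within the
# last MINUTES minutes, newest first, as "YYYY-MM-DD HH:MM path".
# Hypothetical helper; relies on GNU find's -printf.
recent_changes() (
  minutes="$1"
  shift
  find "$@" -type f -mmin "-$minutes" \
    -printf '%TY-%Tm-%Td %TH:%TM %p\n' 2>/dev/null | sort -r
)

# Usage: recent_changes 120 /etc /opt /app
```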

If the server runs cron jobs, check which ones fired during the previous clock hour (on RHEL-family hosts the log is /var/log/cron rather than /var/log/syslog):

BASH

grep "CRON" /var/log/syslog \
  | grep "$(date --date='1 hour ago' '+%b %e %H')" \
  | head -10

A Real Incident: Razorpay Checkout Latency

Symptom: Checkout API p99 latency spiked from 180ms to 4200ms at 01:47. Payment failure rate rose to 12%.

BASH

## Step 1
uptime
## load average: 12.45, 8.31, 4.22 on a 4-core host

## Step 2
ps aux --sort=-%cpu | head -5
## postgres  3421  97.3  12.1  autovacuum worker

## Step 6
journalctl -u postgresql --since "1 hour ago" \
  | grep -i "autovacuum\|lock"
## LOG: autovacuum: processing table "payments" (48 GB)
## LOG: process 3421 acquired lock on relation "payments"

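Root cause: PostgreSQL autovacuum acquired a lock on the 48GB payments table during peak checkout traffic. Every query to that table queued behind the lock for 3 minutes 40 seconds. A routine autovacuum takes a SHARE UPDATE EXCLUSIVE lock, which blocks schema changes but not ordinary reads and writes, so when you see this pattern, confirm in pg_stat_activity and pg_locks which sessions are actually waiting and on what.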

Total investigation time: 4 minutes. The fix — cancelling the autovacuum and scheduling it for off-peak hours — took 2 minutes.

The Full Diagnostic One-Liner Sequence

Copy this block when an alert fires. Run it top to bottom.

BASH

uptime && free -h && df -h               ## snapshot
ps aux --sort=-%cpu | head -10           ## cpu
dmesg | grep -i oom | tail -5            ## memory
df -ih && iostat -xz 1 2                 ## disk
ss -tulpn | head -20                     ## network
journalctl -p err -b --no-pager \
  | tail -30                             ## logs
find /etc /opt -mmin -120 \
  -type f 2>/dev/null                    ## changes

Production Implementation Guidelines

Run this sequence on every P1 alert before escalating. Systematic diagnosis can cut time-to-diagnosis from 20-30 minutes to under 5, which is often the difference between one missed SLO and ten.

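For teams running on Indian cloud infrastructure such as AWS ap-south-1 (Mumbai), iostat is especially relevant on gp2 EBS volumes. gp2 volumes smaller than 1 TiB depend on an I/O credit bucket to burst above their baseline of 3 IOPS per GiB; under sustained write loads the bucket drains and the volume falls back to baseline, causing iowait spikes that are invisible to application-level monitoring. This is a property of gp2 in every region, not of Mumbai specifically.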

Configure a motd or alias for this sequence on every production bastion host so any engineer can run it without looking it up during an incident.
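
One way to package the sequence is a shell function in the bastion's profile. The sketch below is an assumption about how such an alias might look, not a prescribed layout. The names triage and _triage_run are hypothetical, and it runs only a read-only subset of the commands, skipping any tool that is not installed.

```bash
# Hypothetical profile snippet: `triage` runs a read-only subset of the
# diagnostic sequence, skipping tools that are not installed on the host.
_triage_run() {
  # Print the command as a heading, then run it and cap the output.
  printf '\n== %s ==\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 | head -n 20
  else
    echo "(skipped: $1 not found)"
  fi
}

triage() {
  _triage_run uptime
  _triage_run free -h
  _triage_run df -h
  _triage_run df -ih
  _triage_run ps aux --sort=-%cpu
  _triage_run ss -tulpn
  _triage_run journalctl -p err -b --no-pager -n 30
}

# Usage: triage | less
```

Capping each command at 20 lines keeps the whole report to a few screens, which matters more during an incident than completeness.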
