Monitoring Linux Resources — CPU, Memory, Disk I/O, and Network Performance | DevOps Network

Overview and What You Will Learn

A production server is slow. Users are complaining. The on-call engineer needs to identify the bottleneck in under 5 minutes. Is it CPU saturation? Memory pressure causing swapping? Disk I/O wait? Network throughput exhausted? Each bottleneck has a different fix — but you cannot fix what you cannot identify.

By the end of this lab you will:

Read load average correctly and know what it means per CPU core
Use free -h and interpret the available column (not free)
Diagnose I/O bottlenecks with iostat -x and iotop
Find which process consumes disk I/O and network bandwidth
Run a 5-minute production bottleneck diagnosis using a decision tree

Why This Matters in Production

During a Hotstar IPL live stream, millions of concurrent users create enormous I/O load. An on-call engineer who mistakes I/O wait for CPU saturation will add CPU capacity that does nothing — while the real disk bottleneck worsens. The engineers who diagnose correctly in 2 minutes are the ones who have practiced these tools until they are instinctive.

Core Principles

Bottleneck diagnosis decision tree:

+------------------------------------------+
| Server is slow -- where is the problem?  |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Check load average (uptime or top)       |
| > number of CPU cores? -> overloaded     |
+------------------------------------------+
          |                     |
      load high             load ok
          |                     |
          v                     v
+------------------+  +------------------+
| Check vmstat     |  | Check free -h    |
| us+sy vs wa      |  | available memory |
+------------------+  +------------------+
  |          |              |
 us+sy       wa         available
 high       high           low
  |          |              |
  v          v              v
CPU      Disk I/O      Memory pressure
bound    bottleneck    -> check swap
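
The tree above can be sketched as a small shell function. This is only an illustration, not a tuned alerting rule: the name `classify_bottleneck`, its argument order, and the 10% available-memory cutoff are assumptions made for this example. It compares `us+sy` against `wa` exactly as the diagram does.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a bottleneck from numbers you already collected.
# Usage: classify_bottleneck <load1> <cores> <us> <sy> <wa> <available_pct>
classify_bottleneck() {
  awk -v la="$1" -v cores="$2" -v us="$3" -v sy="$4" -v wa="$5" -v avail="$6" 'BEGIN {
    if (la > cores) {
      # Load is above the core count: decide between CPU work and I/O wait
      if (wa > us + sy) print "disk-io"
      else              print "cpu"
    } else if (avail < 10) {
      # Assumed cutoff: less than 10% of RAM still available
      print "memory"
    } else {
      print "ok"
    }
  }'
}

classify_bottleneck 8.00 4 12 3 65 70   # prints: disk-io
classify_bottleneck 1.23 4 12 3 1 5     # prints: memory
```

Feed it the 1-minute load from `uptime`, the core count from `nproc`, and the `us`, `sy`, `wa` values from a recent `vmstat` row.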

Memory fields explained:

Bash

$ free -h
              total    used    free   shared  buff/cache  available
Mem:           15Gi    4.2Gi   1.1Gi   312Mi       9.8Gi      10.6Gi
Swap:          2.0Gi   128Mi   1.9Gi
 
total      = total RAM installed
used       = RAM in use by processes and the kernel (not counting cache)
free       = completely unused RAM
shared     = shared memory, mostly tmpfs
buff/cache = used by kernel for disk cache (mostly reclaimable)
available  = kernel's estimate of memory usable by new processes without swapping
             (roughly free + reclaimable cache)
 
CRITICAL: Monitor "available", not "free"
A server with 1GB free but 10GB buff/cache has close to 11GB available.
Memory is a problem only when "available" runs low, not when "free" does.
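
To watch "available" as a single number, a minimal sketch like the following reads the `MemTotal` and `MemAvailable` fields that the kernel exposes in `/proc/meminfo`. The function name `mem_available_pct` is hypothetical, and the sample values are made up for the demonstration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print MemAvailable as a whole-number percentage of MemTotal.
# Reads a file in /proc/meminfo format; defaults to the live /proc/meminfo.
mem_available_pct() {
  local file="${1:-/proc/meminfo}"
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' "$file"
}

# Demonstration with canned values (kB): 10 GiB available out of 16 GiB
sample=$(mktemp)
printf 'MemTotal:       16777216 kB\nMemAvailable:   10485760 kB\n' > "$sample"
mem_available_pct "$sample"   # prints 62
rm -f "$sample"
```

On a live Linux host, call `mem_available_pct` with no argument.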

Detailed Step-by-Step Practical Lab

Milestone 1 — Understand and check load average

Bash

## Quick load average check
uptime
## 10:23:45 up 15 days, load average: 1.23, 0.87, 0.65
##                                     ^1min  ^5min  ^15min
 
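## Load average meaning:
## On Linux it counts tasks running or waiting for a CPU, plus tasks
## blocked in uninterruptible sleep (usually waiting on disk I/O).
## On a 4-core server, if the load is all CPU work:
##   1.00 = 1 core fully busy, 3 idle       (25% utilized)
##   4.00 = all 4 cores fully busy          (100% utilized)
##   8.00 = 4 cores busy + 4 jobs waiting   (200% -- overloaded)
 
## How many CPUs does this server have?
nproc
## 4
 
grep -c '^processor' /proc/cpuinfo
## 4
 
## If load > nproc value, the server is overloaded.
## Use vmstat (below) to tell CPU saturation from I/O wait.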
 
## vmstat for CPU breakdown (1-second intervals, 5 samples)
## The first row is the average since boot; read the rows after it.
vmstat 1 5
## procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
## r  b   swpd   free  buff  cache   si   so   bi   bo   in   cs us sy id wa st
## 2  0      0 1.1G   82M   9.8G    0    0    0   124  892 1823 12  3 84  1  0
##
## r  = runnable processes (running or waiting for CPU)
## b  = processes in uninterruptible sleep (I/O wait)
## us = user CPU %     sy = system CPU %
## id = idle CPU %     wa = I/O wait %
## st = CPU % stolen by the hypervisor (virtual machines)
## Rules of thumb:
## wa > 10% = I/O bottleneck
## id < 5%  = CPU saturated
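
The per-core arithmetic above can be sketched in one line of awk. The function name `load_per_core` is an assumption made for this example. A result above 1.00 means more runnable or I/O-blocked tasks than cores.

```bash
#!/usr/bin/env bash
# Hypothetical helper: divide a load average by the core count.
# Usage: load_per_core <load> <cores>
load_per_core() {
  awk -v la="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", la / cores }'
}

load_per_core 1.00 4   # prints 0.25
load_per_core 8.00 4   # prints 2.00

# Against the live system (Linux): the 1-minute value is the first field of /proc/loadavg
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```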

Milestone 2 — Monitor memory usage

Bash

## Human-readable memory summary
free -h
 
## Watch memory change every 2 seconds
watch -n2 free -h
 
## Detailed memory breakdown from kernel
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
 
## Check for OOM killer activity (memory exhaustion events)
dmesg | grep -i 'oom\|killed process' | tail -10
## Out of memory: Kill process 2341 (node) score 450 or sacrifice child
## Killed process 2341 (node) total-vm:1048576kB, anon-rss:512000kB
 
## If OOM killer has been active, check journal
journalctl -k | grep -i oom | tail -10
 
## Find which processes use the most RAM
ps aux --sort=-%mem | head -10
 
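## Resident memory (RSS) per process
## Note: RSS counts shared pages in full for every process that maps them.
## The Pss line in /proc/PID/smaps_rollup (kernel 4.14+) splits them proportionally.
for pid in $(pgrep node); do
  echo -n "PID $pid: "
  grep VmRSS "/proc/$pid/status"
done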
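
When `available` is low, the next question is whether the server is actively swapping. A minimal sketch, assuming the standard `vmstat` layout shown in Milestone 1: the hypothetical function `swap_activity` finds the `si` (swap-in) and `so` (swap-out) columns by name, skips the first data row because it is the since-boot average, and reports the highest values seen.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `vmstat` output on stdin and report swap activity.
swap_activity() {
  awk '
    $1 == "r" && $2 == "b" {                 # column-name header row
      for (i = 1; i <= NF; i++) { if ($i == "si") si = i; if ($i == "so") so = i }
      next
    }
    si && $1 ~ /^[0-9]+$/ {                  # data rows after the header
      if (++rows == 1) next                  # first row is the since-boot average
      if ($si + 0 > max_si) max_si = $si + 0
      if ($so + 0 > max_so) max_so = $so + 0
    }
    END {
      if (max_si + max_so > 0) printf "swapping: si=%d so=%d\n", max_si, max_so
      else print "no swap activity"
    }'
}

# Demonstration with canned vmstat output
swap_activity <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 131072  81234   8200 912345    3    5     0   124  892 1823 12  3 84  1  0
 1  1 131072  80100   8200 912001  120  340    10   980  901 1900 10  4 60 26  0
EOF
# prints: swapping: si=120 so=340
```

On a live host, pipe real samples into it: `vmstat 1 5 | swap_activity`.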

Milestone 3 — Diagnose disk I/O bottlenecks

Bash

## iostat -- disk I/O statistics
## Install if missing: sudo apt install sysstat
## The first report is the average since boot; read the later ones.
iostat -x 1 5
 
## Key columns explained:
## Device  rrqm/s wrqm/s  r/s    w/s    rkB/s  wkB/s  await  %util
## xvda      0.00   1.23  2.34  15.67  18.7K  45.2K   4.23   12.5
## xvdb      0.00   0.01  0.12 124.56   0.9K 456.2K  45.67   98.1 <- problem!
##
## %util   = % time disk was busy (near 100% = saturated on a single disk;
##           SSDs and RAID arrays serve requests in parallel and can do more)
## await   = average I/O wait time in ms (high = slow disk or overloaded);
##           newer sysstat versions split it into r_await and w_await
## r/s w/s = reads and writes per second
 
## Find which PROCESS is doing the I/O
## iotop shows per-process disk I/O live
sudo apt install iotop
sudo iotop
 
## Non-interactive iotop (show top I/O consumers once)
sudo iotop -b -n1 | head -20
 
## Check disk space (separate from I/O)
df -h
 
## Find what is consuming disk space (sudo so unreadable directories are counted)
sudo du -sh /var/* 2>/dev/null | sort -rh | head -10
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -10
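
Picking the saturated device out of `iostat -x` output can be automated with a short filter. This is a sketch under stated assumptions: the function name `busy_disks` and the default threshold of 90 are illustrative, and the `%util` column is located by its header name because the column order differs between sysstat versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `iostat -x` output on stdin and print devices whose
# %util is at or above a threshold (default 90).
busy_disks() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NF == 0 { col = 0; next }                # blank line ends a device table
    $1 ~ /^Device/ {                         # header row: find the %util column
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}

# Demonstration with canned iostat output
busy_disks 90 <<'EOF'
Device  rrqm/s wrqm/s  r/s    w/s    rkB/s  wkB/s  await  %util
xvda      0.00   1.23  2.34  15.67   18.70  45.20   4.23   12.5
xvdb      0.00   0.01  0.12 124.56    0.90 456.20  45.67   98.1
EOF
# prints: xvdb 98.1
```

On a live host: `iostat -x 1 3 | busy_disks 90`. A device is listed once per report in which it crosses the threshold.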

Milestone 4 — Monitor network usage

Bash

## Show all listening ports and which process owns them
## (sudo is needed to see processes owned by other users)
sudo ss -tulpn
 
## Show active connections and their state
ss -t state established
 
## Count connections by state
## (-a includes every state, -n skips DNS lookups, NR>1 drops the header row)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
 
## Show connections to a specific port
ss -t state established '( dport = :5432 or sport = :5432 )'
 
## Network interface statistics
## (eth0 is an example name; list the real ones with: ip -br link)
ip -s link show eth0
## Shows: RX and TX bytes, packets, errors, drops
 
## Real-time bandwidth by interface
## Install: apt install nload
nload eth0
 
## Real-time bandwidth by process
## Install: apt install nethogs
sudo nethogs eth0
 
## Real-time bandwidth by connection
## Install: apt install iftop
sudo iftop -i eth0
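
If none of these tools are installed, throughput can be estimated from the kernel's per-interface byte counters under `/sys/class/net/<interface>/statistics/`. The sketch below is a rough substitute, not a replacement for `nload`. The function names `rate_mbit` and `iface_rate` are hypothetical, and `eth0` is only an example interface name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert two byte-counter readings into megabits per second.
# Usage: rate_mbit <bytes_before> <bytes_after> <seconds>
rate_mbit() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) * 8 / s / 1000000 }'
}

# Hypothetical wrapper: sample the counters twice and print receive/transmit rates.
# Usage: iface_rate <interface> [seconds]
iface_rate() {
  local dev="$1" secs="${2:-1}" stats="/sys/class/net/$1/statistics"
  local rx1 tx1 rx2 tx2
  rx1=$(<"$stats/rx_bytes"); tx1=$(<"$stats/tx_bytes")
  sleep "$secs"
  rx2=$(<"$stats/rx_bytes"); tx2=$(<"$stats/tx_bytes")
  echo "$dev rx $(rate_mbit "$rx1" "$rx2" "$secs") Mbit/s, tx $(rate_mbit "$tx1" "$tx2" "$secs") Mbit/s"
}

rate_mbit 0 125000000 1   # prints 1000.00 (125 MB in one second is 1 Gbit/s)
# iface_rate eth0 2       # live sample; replace eth0 with your interface name
```

Compare the result with the link speed to judge whether network throughput is exhausted.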

Milestone 5 — All-in-one monitoring with dstat

Bash

## Install dstat
sudo apt install dstat
 
## Show CPU, disk, network, memory, system stats together
dstat
 
## Also write every sample to a CSV file for later analysis
dstat --output /tmp/stats-$(date +%Y%m%d-%H%M%S).csv
 
## Custom columns for a payment service investigation
dstat -cdnmgsy 1
## c = CPU   d = disk   n = network   m = memory
## g = paging  s = swap  y = system (interrupts and context switches)
 
## Example output of plain dstat during an I/O spike:
## ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
## usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
##  12   3  20  65   0   0| 450M  120M|8.23k  123k|   0     0 |4523  9821
##               ^^ 65% wait = severe I/O bottleneck

Milestone 6 — 5-minute production diagnosis runbook

Bash

#!/usr/bin/env bash
## Quick production diagnosis -- run when a server is slow
 
echo "=== $(date) ==="
echo ""
echo "--- Load Average and CPU ---"
uptime
echo ""
vmstat 1 3 | sed -n '2p;$p'   # column names plus the latest sample
echo ""
 
echo "--- Memory ---"
free -h
echo ""
 
echo "--- Top CPU Processes ---"
ps aux --sort=-%cpu | head -6
echo ""
 
echo "--- Top Memory Processes ---"
ps aux --sort=-%mem | head -6
echo ""
 
echo "--- Disk I/O ---"
## -y skips the since-boot report, so the numbers cover the last second only
iostat -xy 1 1 | grep -v '^$' | tail -10
echo ""
 
echo "--- Disk Space ---"
df -h | grep -v tmpfs
echo ""
 
echo "--- Network Connections ---"
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
echo ""
 
echo "--- Recent OOM Events ---"
dmesg | grep -i oom | tail -5
echo ""
 
echo "--- Failed Services ---"
systemctl list-units --state=failed --no-legend

Production Best Practices and Common Pitfalls

Symptom	First Check	What to Look For
Server slow, CPU high	`vmstat 1 5`	High `us` or `sy` — CPU bound
Server slow, CPU idle	`vmstat 1 5`	High `wa` — disk I/O bottleneck
Out of memory errors	`free -h`	`available` near zero
App cannot write files	`df -h`	Filesystem at 100%
App cannot create files	`df -i`	Inodes at 100%
High network latency	`ss -t | wc -l`	Unusually high connection count
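
The two filesystem rows in the table can be checked with one filter. This is a sketch, assuming POSIX-format `df -P` output, where column 5 is the use percentage and column 6 the mount point. The function name `full_filesystems` and the default threshold of 90 are illustrative. Mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` (or `df -Pi`) output on stdin and print
# mount points whose use percentage is at or above a threshold (default 90).
full_filesystems() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    pct = $5; sub(/%$/, "", pct)             # strip the trailing percent sign
    if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, pct "%"
  }'
}

# Demonstration with canned df output
full_filesystems 90 <<'EOF'
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/xvda1        51474912  20589964  28247124      43% /
/dev/xvdb1       103081248 103081248         0     100% /var/lib/data
EOF
# prints: /var/lib/data 100%
```

On a live host, run it once for space and once for inodes: `df -P | full_filesystems 90` and `df -Pi | full_filesystems 90`.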

Quick Reference and Troubleshooting Commands

Task	Command
Load average	`uptime`
CPU breakdown	`vmstat 1 5`
Memory summary	`free -h`
Disk I/O stats	`iostat -x 1 3`
Per-process I/O	`sudo iotop -b -n1`
Disk space	`df -h`
Largest dirs	`du -sh /var/* | sort -rh | head -10`
Open ports	`ss -tulpn`
OOM events	`dmesg | grep -i oom | tail -5`
All-in-one	`dstat -cdnm 1`

**Tip:** Save the 5-minute diagnosis script as `/usr/local/bin/diagnose` with `chmod +x`. When a server is slow, run `diagnose` and the output immediately shows you where to look next.

**Remember:** `free -h` shows `available` memory which includes reclaimable buffer/cache. A server with only 500MB `free` but 8GB `available` is not under memory pressure. Only when `available` drops below 10-15% of total RAM should you be concerned.
