Overview and What You Will Learn
Every running program on a Linux server is a process. At any moment, a production server at Hotstar is running hundreds of processes simultaneously: nginx workers, application threads, log shippers, health checks, kernel threads. When something goes wrong (a runaway process consuming 100% CPU, zombies piling up in the process table, a service that will not respond), your first tools are the process inspection and control utilities covered in this topic.
By the end of this lab you will be able to:
- Read and interpret `ps aux` output confidently
- Use `top` and `htop` to identify resource hogs in real time
- Understand all five process states and what each means for diagnosis
- Send the correct signal (SIGTERM vs SIGKILL) for each situation
- Manage background jobs with `&`, `fg`, `bg`, `nohup`, and `disown`
- Read live process data from the `/proc` filesystem
Why This Matters in Production
A Zerodha trading system experiences a memory spike during market open. Without process inspection skills, the on-call engineer spends 20 minutes guessing. With them, a 30-second `ps aux --sort=-%mem | head -10` identifies the offending process immediately.
Process management failures cause two categories of production incidents: runaway processes that consume resources until the system becomes unresponsive, and zombie processes that accumulate until the PID table fills up and no new processes can be created. Both are diagnosable and fixable with the tools in this topic.
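As a minimal sketch of how the zombie case can be spotted, the filter below reads process rows and prints only those whose state starts with `Z`, together with the parent PID that has not yet called `wait()`. The function name `list_zombies` and the output format are illustrative assumptions, not part of any standard tool.

```bash
# list_zombies (hypothetical helper): read "PID PPID STAT COMMAND" rows on
# stdin and print the rows whose STAT starts with Z, with the parent PID.
list_zombies() {
    awk '$3 ~ /^Z/ { printf "zombie pid=%s parent=%s cmd=%s\n", $1, $2, $4 }'
}

# Live usage. Each "-o name=" selects one column and suppresses its header.
ps -e -o pid= -o ppid= -o stat= -o comm= | list_zombies
```

On a healthy system the live command prints nothing. When it does print rows, the `parent=` value is the process to investigate, because a zombie disappears only when its parent reaps it or exits.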
Core Principles
The process state machine — what each state means:

    +------------------------------------------+
    | R - Running / Runnable                   |
    | On CPU or waiting for CPU time           |
    +------------------------------------------+
                         |
                 blocks, e.g. sleep()
                         |
                         v
    +------------------------------------------+
    | S - Sleeping (interruptible)             |
    | Waiting for event, can receive signals   |
    +------------------------------------------+
             |                       |
        disk/net I/O              SIGSTOP
             |                       |
             v                       v
    +------------------+    +------------------+
    | D - Uninterrupt. |    | T - Stopped      |
    | Cannot be killed |    | SIGSTOP received |
    +------------------+    +------------------+
             |
        I/O completes
             |
             v
    +------------------------------------------+
    | Z - Zombie                               |
    | Exited but parent has not called wait()  |
    +------------------------------------------+

The arrows trace one typical path, not every transition: a D-state process returns to R once its I/O completes, SIGCONT moves a stopped (T) process back to R, and a process becomes a zombie (Z) when it exits, remaining one until its parent calls wait().

Reading `ps aux` output — every column explained:

    USER       PID  %CPU %MEM    VSZ    RSS STAT START  TIME COMMAND
    root         1   0.0  0.1  168MB   13MB Ss   Jan01  0:05 /sbin/init
    www-data   892   1.2  0.8  512MB   80MB S    10:00  0:12 nginx: worker
    payment   1350  45.3  2.1  1.2GB  210MB R    10:05  1:23 node index.js
    root      1401   0.0  0.0      0      0 Z    10:10  0:00 [cleanup] <defunct>

    VSZ  = Virtual memory size (total address space claimed)
    RSS  = Resident set size (actual RAM in use, the real number)
    STAT = State + flags (S=sleeping, R=running, Z=zombie, s=session leader)
    TIME = Total CPU time used (not wall clock time)

The VSZ and RSS values above are rounded to MB/GB for readability; real `ps aux` output prints both columns in KiB.

**Remember:** RSS (Resident Set Size) is the accurate memory figure. VSZ includes memory that is mapped but not loaded into RAM. A process showing 4GB VSZ but 200MB RSS is using only 200MB of physical memory.
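Because `ps` reports VSZ and RSS in KiB, comparing the two columns is easier after converting them. The sketch below is one way to do that; `kib_to_mib` is a hypothetical helper name, and `--sort=-rss` assumes the procps version of `ps` found on most Linux distributions.

```bash
# kib_to_mib (hypothetical helper): read "PID VSZ_KiB RSS_KiB COMMAND" rows
# on stdin and print VSZ and RSS converted to MiB with one decimal place.
kib_to_mib() {
    awk '{ printf "%s vsz=%.1fMiB rss=%.1fMiB %s\n", $1, $2 / 1024, $3 / 1024, $4 }'
}

# Live usage: the ten largest processes by resident memory.
ps -e -o pid= -o vsz= -o rss= -o comm= --sort=-rss | head -10 | kib_to_mib
```

A large gap between `vsz` and `rss` on a row is normal; it is the `rss` figure that counts against physical memory.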
Detailed Step-by-Step Practical Lab
Milestone 1 — Snapshot all running processes

    # Full process list, all users, BSD format
    ps aux

    # Full process list, UNIX format (shows PPID)
    ps -ef

    # Show process tree with PIDs
    ps aux --forest

    # Sort by CPU usage (descending)
    ps aux --sort=-%cpu | head -15

    # Sort by memory usage (descending)
    ps aux --sort=-%mem | head -15

    # Show specific process
    ps aux | grep nginx
    ps -p 1350 -o pid,ppid,user,%cpu,%mem,stat,cmd

    # Show all threads of a process
    ps -eLf | grep payment-service

Milestone 2 — Monitor processes in real time with top

    # Launch top
    top

    # Key bindings inside top:
    # P -- sort by CPU (default)
    # M -- sort by memory
    # k -- kill a process (prompts for PID then signal)
    # r -- renice a process (change priority)
    # 1 -- toggle per-CPU display
    # q -- quit

    # Top header explained:
    # top - 10:23:45 up 15 days,  load average: 1.23, 0.87, 0.65
    #                ^uptime                    ^1min ^5min ^15min
    #
    # Tasks: 245 total,   2 running, 243 sleeping,   0 stopped,   0 zombie
    #        ^total       ^on CPU    ^waiting        ^SIGSTOP     ^defunct
    #
    # %Cpu: 12.3 us,  2.1 sy,  0.0 ni, 84.1 id,  1.2 wa,  0.0 hi,  0.3 si
    #       ^user     ^sys     ^nice   ^idle     ^I/O wait
    # Sustained wa (I/O wait) above 5% usually points to a disk bottleneck

    # Non-interactive top (useful in scripts)
    top -bn1 | head -20

    # htop -- more user-friendly (install if not present)
    sudo apt install htop
    htop

    # htop advantages over top:
    # * Mouse support
    # * Colour-coded CPU/memory bars
    # * F6 to sort by any column
    # * F9 to kill with signal selection menu
    # * Space to tag multiple processes for batch action

Milestone 3 — Understand and send process signals

    # List all signal names and numbers
    kill -l

    # Send SIGTERM (15) -- polite shutdown request, can be caught
    kill 1350
    kill -15 1350      # same
    kill -TERM 1350    # same

    # Send SIGKILL (9) -- force kill, cannot be caught or ignored
    kill -9 1350
    kill -KILL 1350    # same

    # Send SIGHUP (1) -- by convention many daemons reload their config on it;
    # a process with no SIGHUP handler is terminated instead
    kill -HUP 1350
    kill -1 1350       # same

    # Send SIGSTOP (19) -- pause the process
    kill -STOP 1350

    # Send SIGCONT (18) -- resume a stopped process
    kill -CONT 1350

    # Kill by process name
    killall nginx              # kills all processes named exactly nginx
    pkill nginx                # similar, but treats nginx as a pattern matched against the name
    pkill -f 'node index.js'   # match against full command line

    # Kill by pattern, see what would be killed first
    pgrep -fa 'node worker'    # same -f matching as the pkill below
    pkill -f 'node worker'

    # Check if a process is alive without sending a signal
    kill -0 1350
    # exit code 0 = alive, 1 = does not exist or no permission

**Common Mistake:** Reaching for SIGKILL immediately. Always send SIGTERM first and wait 10-30 seconds. SIGTERM allows the process to flush buffers, close database connections, and remove lock files. SIGKILL gives no opportunity for cleanup and can leave corrupted state.
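The SIGTERM-first rule can be wrapped in a small helper so the escalation is not left to memory during an incident. This is a sketch under stated assumptions: `graceful_stop` is a hypothetical name, the 30-second default mirrors the waiting period suggested above, and liveness is polled with `kill -0` once per second.

```bash
# graceful_stop PID [TIMEOUT_SECONDS]   (hypothetical helper)
# Sends SIGTERM, polls with `kill -0` once per second, and escalates to
# SIGKILL only if the process is still alive when the timeout expires.
graceful_stop() {
    pid=$1
    timeout=${2:-30}

    # Fails if the PID does not exist or we lack permission to signal it.
    kill -TERM "$pid" 2>/dev/null || return 1

    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "PID $pid still alive after ${timeout}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example: graceful_stop 1350 30
```

Polling keeps the common case fast: a well-behaved service that exits within a second or two of SIGTERM never sees SIGKILL at all.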
Milestone 4 — Manage process priority with nice and renice

    # nice range: -20 (highest priority) to 19 (lowest priority)
    # Default nice value: 0
    # Only root can set negative nice (higher priority)

    # Start a process with lower priority
    nice -n 10 /opt/backup/run-backup.sh

    # Start a backup job at lowest priority to not impact the main service
    nice -n 19 pg_dump -U postgres mydb > /backup/mydb.sql

    # Change priority of a running process
    renice -n 5 -p 1350         # lower priority of PID 1350
    sudo renice -n -5 -p 892    # raise priority (requires root)

    # Check current nice value
    ps -o pid,ni,cmd -p 1350

Milestone 5 — Manage background jobs

    # Run a command in the background
    ./long-running-script.sh &
    # [1] 2847   <- job number and PID

    # List background jobs in current shell
    jobs
    # [1]+  Running    ./long-running-script.sh &

    # Bring job to foreground
    fg %1

    # Send foreground job to background
    # (first press Ctrl+Z to stop it)
    # ^Z
    bg %1

    # Run a command immune to SIGHUP (survives terminal close)
    nohup ./long-running-script.sh > /tmp/script.log 2>&1 &

    # Disown a running job (remove from shell's job table)
    # Even if shell closes, process continues
    ./server.sh &
    disown %1    # use the job number the shell printed for server.sh

    # Wait for all background jobs to finish
    wait

    # Wait for specific PID
    wait 2847

Milestone 6 — Read process data from /proc

    # Every running process has a directory: /proc/PID/
    ls /proc/1350/

    # Read the command line that started the process
    tr '\0' ' ' < /proc/1350/cmdline
    # node /opt/payment-service/index.js

    # Read the environment variables the process was started with
    tr '\0' '\n' < /proc/1350/environ | grep -E 'NODE_ENV|PORT|DB'

    # Count open file descriptors
    ls /proc/1350/fd | wc -l
    # Compare to the limit:
    grep 'open files' /proc/1350/limits

    # Check memory usage and thread count
    grep -E 'VmRSS|VmSize|Threads' /proc/1350/status
    # VmRSS:   210MB  <- actual RAM used
    # VmSize:  1.2GB  <- virtual memory claimed
    # Threads: 8      <- number of threads
    # (the kernel prints these sizes in kB; rounded here for readability)

    # Watch a value change in real time
    watch -n1 'grep VmRSS /proc/1350/status'

Production Best Practices and Common Pitfalls
| Scenario | Wrong Approach | Correct Approach |
|---|---|---|
| Service not responding | kill -9 immediately | kill -TERM, wait 30s, then kill -9 |
| High CPU process | Restart the server | ps aux --sort=-%cpu, identify, investigate root cause |
| Zombie processes | Ignore them | Find the parent, fix its wait() implementation |
| D-state process | kill -9 it | Cannot kill D-state — fix the underlying I/O issue |
| Runaway memory | Kill the process | Check for memory leak with /proc/PID/status over time |
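
The "check for memory leak over time" advice in the last row can be sketched as a sampling loop over `/proc/PID/status`. The helper names `rss_kib` and `sample_rss` and the output format are assumptions for illustration; the only kernel interface used is the `VmRSS` line, which the kernel reports in kB.

```bash
# rss_kib PID (hypothetical helper): print the VmRSS value in kB.
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# sample_rss PID COUNT INTERVAL (hypothetical helper):
# print COUNT timestamped RSS samples, INTERVAL seconds apart.
# Stops early if the process disappears.
sample_rss() {
    pid=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ] && [ -r "/proc/$pid/status" ]; do
        printf '%s pid=%s rss_kib=%s\n' "$(date +%H:%M:%S)" "$pid" "$(rss_kib "$pid")"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage: sample the current shell three times, one second apart.
sample_rss $$ 3 1
```

A figure that climbs steadily across samples under constant load is the pattern worth investigating; a value that rises and then plateaus is more often normal cache or heap warm-up.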
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| Show all processes | ps aux |
| Sort by CPU | `ps aux --sort=-%cpu \| head -15` |
| Sort by memory | `ps aux --sort=-%mem \| head -15` |
| Find a process | pgrep -la processname |
| Kill gracefully | kill -TERM PID |
| Force kill | kill -KILL PID |
| Reload config | kill -HUP PID |
| Check if alive | kill -0 PID |
| Count open files | `ls /proc/PID/fd \| wc -l` |
| Memory usage | `cat /proc/PID/status \| grep VmRSS` |
| Background + immune | nohup command > log.txt 2>&1 & |
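
The open-file count is most useful next to the limit it is approaching. The sketch below prints both for one process; `fd_usage` is a hypothetical helper name, and it assumes the Linux `/proc/PID/limits` layout, where the soft limit is the fourth whitespace-separated field of the `Max open files` line.

```bash
# fd_usage PID (hypothetical helper): print the number of open file
# descriptors and the soft "Max open files" limit for one process.
fd_usage() {
    pid=$1
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}

# Usage: inspect the current shell.
fd_usage $$
```

Reading another user's `/proc/PID/fd` directory generally requires root, so run the function with `sudo` when inspecting a service account's process.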
**Tip:** `pgrep -la processname` is the fastest way to find a process and confirm you have the right one before sending a signal. It prints both the PID and the full command line, so you can verify you are not about to kill the wrong process.
**Security:** A process can only send signals to processes owned by the same user. Root can signal any process. If you need to stop a service running as a different user (like `www-data`), you must use `sudo kill PID` or `sudo systemctl stop service`.
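The ownership rule can be checked before a signal is sent. The sketch below compares the current user's UID with the target's real UID from `/proc/PID/status`. It is a simplification: the kernel's actual check also considers saved set-user-IDs and capabilities such as `CAP_KILL`. Both `can_signal` and `stop_cmd` are hypothetical helper names.

```bash
# can_signal PID (hypothetical helper)
# Returns 0 if the current user is root or owns the process,
# 1 if the process belongs to someone else, 2 if it does not exist.
can_signal() {
    # Second field of the "Uid:" line is the real UID of the process.
    target_uid=$(awk '/^Uid:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$target_uid" ] || return 2
    my_uid=$(id -u)
    [ "$my_uid" -eq 0 ] || [ "$my_uid" -eq "$target_uid" ]
}

# stop_cmd PID (hypothetical helper)
# Prints the command to run, adding sudo only when it is needed.
stop_cmd() {
    if can_signal "$1"; then
        echo "kill -TERM $1"
    else
        rc=$?
        if [ "$rc" -eq 1 ]; then
            echo "sudo kill -TERM $1"
        else
            echo "no such process: $1" >&2
            return 1
        fi
    fi
}

# Usage: the current shell is always owned by the current user.
stop_cmd $$
```

Printing the command instead of running it keeps the helper safe to use while exploring; for services managed by systemd, `sudo systemctl stop service` remains the cleaner route.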