Managing Linux Processes - ps, top, Signals, and the Process Lifecycle | DevOps Network

Overview and What You Will Learn

Every running program on a Linux server is a process. At any moment a production server at Hotstar is running hundreds of processes simultaneously: nginx workers, application services, log shippers, health checks, kernel threads. When something goes wrong (a runaway process consuming 100% CPU, zombies piling up in the process table, a service that will not respond), your first tools are the process inspection and control utilities covered in this topic.

By the end of this lab you will be able to:

Read and interpret ps aux output confidently
Use top and htop to identify resource hogs in real time
Understand all five process states and what each means for diagnosis
Send the correct signal (SIGTERM vs SIGKILL) for each situation
Manage background jobs with &, fg, bg, nohup, and disown
Read live process data from the /proc filesystem

Why This Matters in Production

A Zerodha trading system experiences a memory spike during market open. Without process inspection skills, the on-call engineer spends 20 minutes guessing. With them, a 30-second ps aux --sort=-%mem | head -10 identifies the offending process immediately.

Process management failures cause two categories of production incidents: runaway processes that consume resources until the system becomes unresponsive, and zombie processes that accumulate until the PID table fills up and no new processes can be created. Both are diagnosable and fixable with the tools in this topic.

Core Principles

The process state machine — what each state means:

◈ DIAGRAM

+------------------------------------------------------------+
| R - Running / Runnable                                     |
| On CPU or waiting for CPU time                             |
+------------------------------------------------------------+
         |                    |                    |
  waits for event       blocking I/O       SIGSTOP / Ctrl+Z
         |                    |                    |
         v                    v                    v
+------------------+ +------------------+ +------------------+
| S - Sleeping     | | D - Uninterrupt. | | T - Stopped      |
| Interruptible;   | | Blocked on I/O;  | | Paused by        |
| wakes on event   | | signals deferred | | SIGSTOP/SIGTSTP; |
| or signal        | | until I/O ends   | | SIGCONT resumes  |
+------------------+ +------------------+ +------------------+

(S, D, and T each return to R once the wait ends)

         R: exit() or fatal signal
                    |
                    v
+------------------------------------------------------------+
| Z - Zombie                                                 |
| Exited but parent has not called wait()                    |
+------------------------------------------------------------+
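
The state letters in the diagram are the first character of the STAT column that ps prints. As a minimal sketch, the helper below maps that letter to a short description and applies it to the current shell. The function name `describe_state` is hypothetical (it is not a standard tool), and letters outside the five states covered here fall through to "other".

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate the leading letter of a ps STAT value.
describe_state() {
    case "$1" in
        R*) echo "running or runnable" ;;
        S*) echo "interruptible sleep" ;;
        D*) echo "uninterruptible sleep" ;;
        T*) echo "stopped" ;;
        Z*) echo "zombie" ;;
        *)  echo "other" ;;
    esac
}

# STAT of the current shell; flags such as 's' or '+' may follow the letter.
stat=$(ps -o stat= -p $$ | tr -d ' ')
echo "PID $$ has STAT '${stat}': $(describe_state "$stat")"
```

Only the first letter is matched, so values like `Ss` or `R+` resolve to the same description as `S` or `R`.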

Reading ps aux output — every column explained:

TEXT

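USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START  TIME COMMAND
root         1  0.0  0.1  172032  13312 ?     Ss   Jan01  0:05 /sbin/init
www-data   892  1.2  0.8  524288  81920 ?     S    10:00  0:12 nginx: worker
payment   1350 45.3  2.1 1258291 215040 ?     R    10:05  1:23 node index.js
root      1401  0.0  0.0       0      0 ?     Z    10:10  0:00 [cleanup] <defunct>
 
USER    = Account that owns the process
PID     = Process ID
%CPU    = CPU time used divided by the time the process has been running
%MEM    = RSS as a percentage of physical RAM
VSZ     = Virtual memory size in KiB (total address space claimed)
RSS     = Resident set size in KiB   (physical RAM in use, shared pages included)
TTY     = Controlling terminal       (? = none, typical for daemons)
STAT    = State + flags              (S=sleeping, R=running, Z=zombie, s=session leader)
START   = Time or date the process started
TIME    = Total CPU time used        (not wall clock time)
COMMAND = Command line that started the process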

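**Remember:** RSS (Resident Set Size) is the figure to watch for physical memory. VSZ includes memory that is mapped but not loaded into RAM, so a process showing 4GB VSZ but 200MB RSS is only occupying about 200MB of physical memory. RSS does count shared library pages in every process that maps them, so summing RSS across processes overstates total usage.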
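
To see the gap between the two figures on a live process, the sketch below reads VSZ and RSS for the current shell. It assumes a ps that accepts `-o NAME=` to suppress headers (procps on Linux does); the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# ps reports both columns in KiB; the trailing '=' suppresses the header line.
read -r vsz_kib rss_kib <<EOF
$(ps -o vsz= -o rss= -p $$)
EOF

echo "VSZ: ${vsz_kib} KiB  RSS: ${rss_kib} KiB"
echo "Resident share of the address space: $(( rss_kib * 100 / vsz_kib ))%"
```

For most processes the resident share is a small fraction, which is why sorting by VSZ tends to point at the wrong culprit.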

Detailed Step-by-Step Practical Lab

Milestone 1 — Snapshot all running processes

Bash

## Full process list, all users, BSD format
ps aux
 
## Full process list, UNIX format (shows PPID)
ps -ef
 
## Show process tree with PIDs
ps --forest aux
 
## Sort by CPU usage (descending)
ps aux --sort=-%cpu | head -15
 
## Sort by memory usage (descending)
ps aux --sort=-%mem | head -15
 
## Show specific process
ps aux | grep '[n]ginx'    ## brackets stop grep from matching its own command line
ps -p 1350 -o pid,ppid,user,%cpu,%mem,stat,cmd
 
## Show all threads of a process
ps -eLf | grep '[p]ayment-service'
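
When the default ps aux columns are more than you need, `-o` selects columns and `--sort` orders them. The sketch below wraps that in a small function that lists the largest processes by resident memory and converts RSS from KiB to MiB. The name `top_by_rss` is hypothetical, `--sort` is a procps (Linux) option, and only the first word of the executable name is printed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the N largest processes by RSS (default 5).
top_by_rss() {
    ps -eo pid= -o user= -o rss= -o comm= --sort=-rss |
        head -n "${1:-5}" |
        awk '{ printf "%-8s %-12s %8.1f MiB  %s\n", $1, $2, $3 / 1024, $4 }'
}

top_by_rss 5
```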

Milestone 2 — Monitor processes in real time with top

Bash

## Launch top
top
 
## Key bindings inside top:
## P   -- sort by CPU (default)
## M   -- sort by memory
## k   -- kill a process (prompts for PID then signal)
## r   -- renice a process (change priority)
## 1   -- toggle per-CPU display
## q   -- quit
 
## Top header explained:
## top - 10:23:45 up 15 days, load average: 1.23, 0.87, 0.65
##         ^uptime              ^1min  ^5min  ^15min
##
## Tasks: 245 total, 2 running, 243 sleeping, 0 stopped, 0 zombie
##         ^total     ^on CPU   ^waiting      ^SIGSTOP   ^defunct
##
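## %Cpu(s): 12.3 us, 2.1 sy, 0.0 ni, 84.1 id, 1.2 wa, 0.0 hi, 0.3 si, 0.0 st
##          ^user    ^sys    ^nice   ^idle    ^I/O wait
## hi/si = hardware/software interrupts, st = time stolen by the hypervisor
## Sustained wa (I/O wait) above roughly 5% often points to a disk bottleneck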
 
## Non-interactive top (useful in scripts)
top -bn1 | head -20
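
The load averages in the top header are the same three numbers the kernel exposes in /proc/loadavg, and they only mean something relative to the CPU count. As a rough sketch (Linux only, assuming `nproc` from coreutils is available), the snippet below compares the 1-minute figure with the number of online CPUs.

```shell
#!/usr/bin/env bash
# /proc/loadavg starts with the 1, 5, and 15 minute load averages.
read -r load1 load5 load15 rest < /proc/loadavg
cpus=$(nproc)

echo "load averages: ${load1} ${load5} ${load15} on ${cpus} CPU(s)"

# awk does the floating-point comparison that shell arithmetic cannot.
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l > c) }'; then
    echo "1-minute load exceeds the CPU count: tasks are waiting for CPU or I/O"
else
    echo "1-minute load is within CPU capacity"
fi
```

On Linux the load average counts tasks in D state as well as runnable ones, so a high value with idle CPUs usually means blocked I/O rather than CPU pressure.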

Bash

## htop -- more user-friendly (install if not present)
sudo apt install htop      ## Debian/Ubuntu; use your distribution's package manager elsewhere
htop
 
## htop advantages over top:
## * Mouse support
## * Colour-coded CPU/memory bars
## * F6 to sort by any column
## * F9 to kill with signal selection menu
## * Space to tag multiple processes for batch action

Milestone 3 — Understand and send process signals

Bash

## List all signal names and numbers
kill -l
 
## Send SIGTERM (15) -- polite shutdown request, can be caught
kill 1350
kill -15 1350       ## same
kill -TERM 1350     ## same
 
## Send SIGKILL (9) -- force kill, cannot be caught or ignored
kill -9 1350
kill -KILL 1350     ## same
 
## Send SIGHUP (1) -- many daemons (nginx, sshd) reload their config on it;
## a process with no SIGHUP handler is terminated instead
kill -HUP 1350
kill -1 1350        ## same
 
## Send SIGSTOP (19) -- pause the process
kill -STOP 1350
 
## Send SIGCONT (18) -- resume a stopped process
kill -CONT 1350
 
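## Signal by process name (SIGTERM unless another signal is given)
killall nginx         ## every process whose name is exactly nginx
pkill nginx           ## every process whose name matches the pattern nginx
pkill -f 'node index.js'  ## match against full command line
 
## Kill by pattern, see what would be killed first
pgrep -fa 'node worker'   ## preview using the same -f pattern
pkill -f 'node worker'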
 
## Check if a process is alive without sending a signal
kill -0 1350
## exit code 0 = alive, 1 = does not exist or no permission

**Common Mistake:** Reaching for SIGKILL immediately. Always send SIGTERM first and wait 10-30 seconds. SIGTERM allows the process to flush buffers, close database connections, and remove lock files. SIGKILL gives no opportunity for cleanup and can leave corrupted state.
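
The SIGTERM-then-wait-then-SIGKILL sequence is easy to script. The function below is a minimal sketch of that pattern, not a replacement for a service manager: the name `stop_gracefully` and the 30-second default are assumptions, and the demo target is a throwaway `sleep` started only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: SIGTERM first, SIGKILL only if the timeout expires.
stop_gracefully() {
    local pid=$1
    local seconds_left=${2:-30}

    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "cannot signal PID $pid (already gone or not permitted)" >&2
        return 1
    fi

    while [ "$seconds_left" -gt 0 ]; do
        # kill -0 sends nothing; it only checks that the PID can be signalled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        seconds_left=$((seconds_left - 1))
    done

    echo "PID $pid still running after SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
}

# Demo against a disposable background process.
sleep 300 &
demo_pid=$!
stop_gracefully "$demo_pid" 5
```

`sleep` exits on SIGTERM immediately, so the demo never reaches the SIGKILL branch. A process that traps and ignores SIGTERM would be killed once the timeout runs out.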

Milestone 4 — Manage process priority with nice and renice

Bash

## nice range: -20 (highest priority) to 19 (lowest priority)
## Default nice value: 0
## Only root can set negative nice (higher priority)
 
## Start a process with lower priority
nice -n 10 /opt/backup/run-backup.sh
 
## Start a backup job at lowest priority to not impact the main service
nice -n 19 pg_dump -U postgres mydb > /backup/mydb.sql
 
## Change priority of a running process
renice -n 5 -p 1350            ## lower priority of PID 1350
sudo renice -n -5 -p 892       ## raise priority (requires root)
 
## Check current nice value
ps -o pid,ni,cmd -p 1350
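
A quick way to confirm that an adjustment took effect without hunting for a PID: GNU coreutils `nice` (and BusyBox `nice`) prints the niceness it is running with when called with no command. The sketch below relies on that behaviour, which POSIX does not require.

```shell
#!/usr/bin/env bash
# 'nice' with no arguments prints the current niceness.
base=$(nice)

# The inner 'nice' runs with the adjustment applied by the outer one.
lowered=$(nice -n 10 nice)

echo "shell niceness: ${base}, child started with 'nice -n 10': ${lowered}"
```

The adjustment is relative to the caller, so a shell that is already at niceness 5 would start the child at 15, and the result is capped at 19.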

Milestone 5 — Manage background jobs

Bash

## Run a command in the background
./long-running-script.sh &
## [1] 2847  <- job number and PID
 
## List background jobs in current shell
jobs
## [1]+  Running    ./long-running-script.sh &
 
## Bring job to foreground
fg %1
 
## Send foreground job to background
## (first press Ctrl+Z to stop it)
## ^Z
bg %1
 
## Run a command immune to SIGHUP (survives terminal close)
nohup ./long-running-script.sh > /tmp/script.log 2>&1 &
 
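## Disown a running job (remove from shell's job table)
## The shell will not send it SIGHUP on exit, so the process continues
./server.sh &
disown %+             ## %+ is the job just started; %1 is only correct if no other jobs exist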
 
## Wait for all background jobs to finish
wait
 
## Wait for specific PID
wait 2847
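
`wait PID` also returns the exit status of that job, which is how a script can run work in parallel and still notice failures. The sketch below uses two throwaway subshells as stand-ins for real jobs; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Start two background jobs and record each PID from $!.
( sleep 1; exit 0 ) &
ok_pid=$!
( sleep 1; exit 3 ) &
fail_pid=$!

# 'wait PID' returns that job's exit status, so a failure is not lost.
ok_status=0
wait "$ok_pid" || ok_status=$?
fail_status=0
wait "$fail_pid" || fail_status=$?

echo "job $ok_pid exited with status $ok_status"
echo "job $fail_pid exited with status $fail_status"
```

`wait` only works for children of the current shell. For an unrelated PID, poll with `kill -0` instead.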

Milestone 6 — Read process data from /proc

Bash

## Every running process has a directory: /proc/PID/
ls /proc/1350/
 
## Read the command line that started the process
tr '\0' ' ' < /proc/1350/cmdline
## node /opt/payment-service/index.js
 
## Read the environment the process was started with
## (later changes are not reflected; requires the same user or root)
tr '\0' '\n' < /proc/1350/environ | grep -E 'NODE_ENV|PORT|DB'
 
## Count open file descriptors
ls /proc/1350/fd | wc -l
## Compare to the limit:
cat /proc/1350/limits | grep 'open files'
 
## Check the memory and thread summary
grep -E 'VmRSS|VmSize|Threads' /proc/1350/status
## VmSize:  1258291 kB  <- virtual memory claimed
## VmRSS:    215040 kB  <- actual RAM used
## Threads:       8     <- number of threads
 
## Watch a value change in real time
watch -n1 'grep VmRSS /proc/1350/status'
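
The /proc reads above combine naturally into small reusable helpers. The sketch below (Linux only) defines three and runs them against the current shell. The function names are hypothetical, and the field positions assume the usual layout of /proc/PID/status and /proc/PID/limits.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that read /proc directly.
rss_kb() {
    # VmRSS is reported in kB; kernel threads have no VmRSS line.
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

open_fds() {
    # Each entry in /proc/PID/fd is one open file descriptor.
    ls "/proc/$1/fd" | wc -l
}

fd_soft_limit() {
    # In the "Max open files" row, the fourth field is the soft limit.
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

pid=$$
echo "PID $pid: RSS $(rss_kb "$pid") kB, $(open_fds "$pid") of $(fd_soft_limit "$pid") file descriptors in use"
```

A descriptor count that climbs steadily toward the soft limit is a common sign of a file or socket leak.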

Production Best Practices and Common Pitfalls

Scenario	Wrong Approach	Correct Approach
Service not responding	kill -9 immediately	kill -TERM, wait 30s, then kill -9
High CPU process	Restart the server	ps aux --sort=-%cpu, identify, investigate root cause
Zombie processes	Ignore them	Find the parent, fix its wait() implementation
D-state process	kill -9 it	Cannot kill D-state — fix the underlying I/O issue
Runaway memory	Kill the process	Check for memory leak with /proc/PID/status over time
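
For the zombie row, the first step is finding which parent is failing to reap. The sketch below lists zombies alongside their parent PID and includes a small demo that manufactures one on purpose. The function name `list_zombies` is hypothetical, and the demo assumes `sh` and `sleep` behave as on a typical Linux system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list zombies with the parent that has not reaped them.
list_zombies() {
    ps -eo pid= -o ppid= -o stat= -o comm= |
        awk '$3 ~ /^Z/ { print "zombie", $1, "parent", $2, $4 }'
}

# Demo: the inner shell starts a short-lived child, then replaces itself with
# 'sleep 3', which never calls wait(). The child remains a zombie until that
# parent exits, at which point it is re-parented and reaped.
sh -c 'sleep 0 & exec sleep 3' &
parent_pid=$!
sleep 1

zombies=$(list_zombies)
echo "$zombies"

wait "$parent_pid"
```

Sending signals to the zombie itself does nothing because it has already exited. The fix is always on the parent side: correct its wait() handling or restart it.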

Quick Reference and Troubleshooting Commands

Task	Command
Show all processes	`ps aux`
Sort by CPU	`ps aux --sort=-%cpu | head -15`
Sort by memory	`ps aux --sort=-%mem | head -15`
Find a process	`pgrep -la processname`
Kill gracefully	`kill -TERM PID`
Force kill	`kill -KILL PID`
Reload config	`kill -HUP PID`
Check if alive	`kill -0 PID`
Count open files	`ls /proc/PID/fd | wc -l`
Memory usage	`cat /proc/PID/status`
Background + immune	`nohup command > log.txt 2>&1 &`

**Tip:** `pgrep -la processname` is the fastest way to find a process and confirm you have the right one before sending a signal. It prints both the PID and the full command line, so you can verify you are not about to kill the wrong process.

**Security:** A process can only send signals to processes owned by the same user. Root can signal any process. If you need to stop a service running as a different user (like `www-data`), you must use `sudo kill PID` or `sudo systemctl stop service`.
