What is awk? | DevOps Dictionary

Understanding awk

What Is awk in Simple Terms

While grep finds lines, awk processes fields within those lines. If grep is a filter, awk is a transformation engine. Every line of input is automatically split into fields ($1, $2, $3...) separated by whitespace (or a custom delimiter), and you write rules that process those fields.

ps aux | awk '{print $2, $11}' — prints just the PID and command name columns from ps output, discarding everything else.

How It Works

TEXT

Input line: "rahul    1234  45.3  2.1  node index.js"
 
awk splits by whitespace:
  $1 = rahul       (user)
  $2 = 1234        (PID)
  $3 = 45.3        (CPU%)
  $4 = 2.1         (MEM%)
  $5 = node        (command)
  $6 = index.js    (argument)
  $NF = index.js   (last field)
  NR = line number
  NF = 6 (number of fields)

◈ DIAGRAM

+------------------------------------------+
| Input line:                              |
| rahul  1234  45.3  node index.js         |
+------------------------------------------+
                    |
     awk splits by whitespace (default)
                    |
                    v
+------------------------------------------+
| $1=rahul  $2=1234  $3=45.3  $NF=index.js|
| NR=line number    NF=field count         |
+------------------------------------------+
                    |
   awk '{print $1, $2}' (print fields 1,2)
                    |
                    v
+------------------------------------------+
| Output: rahul 1234                       |
+------------------------------------------+

Practical Commands

Bash

## Print specific fields
ps aux | awk '{print $1, $2, $11}'     ## user, PID, command
df -h | awk '{print $5, $6}'           ## usage%, mount point
 
## Custom field separator
awk -F: '{print $1, $3}' /etc/passwd   ## username and UID
awk -F, '{print $1, $3}' data.csv      ## first and third CSV columns
 
## Conditional processing
ps aux | awk '$3 > 10 {print $2, $11}'  ## PIDs using >10% CPU
df -h | awk '$5+0 > 80 {print $6, $5}' ## filesystems >80% full
 
## BEGIN and END blocks
awk 'BEGIN {print "=== Report ==="} {sum += $3} END {print "Total CPU:", sum}' <(ps aux)
 
## Pattern matching
awk '/ERROR/ {print NR, $0}' /var/log/app.log     ## numbered error lines
awk '!/DEBUG/ {print}' /var/log/app.log            ## skip debug lines
 
## Formatted output
df -h | awk '{printf "%-20s %5s\n", $6, $5}'
## /                    62%
## /var                 45%
 
## Multi-field operations
## Count errors by service from structured logs:
## 2024-01-15 10:23:45 payment-api ERROR Connection failed
awk '/ERROR/ {errors[$3]++} END {for (svc in errors) print svc, errors[svc]}' /var/log/app.log
 
## Sum a column
awk '{sum += $2} END {print "Total:", sum}' numbers.txt
 
## Process only specific line range
awk 'NR==10,NR==20 {print}' /var/log/app.log

Troubleshooting

Symptom	Command	What to Check
Wrong field selected	`echo 'a b c'	awk '{print $2}'`
Numeric comparison fails	`awk '$3+0 > 80'`	Force numeric with +0
Multi-character separator	`awk -F '::' '{print $1}'`	-F accepts regex

PLACEMENT PRO TIP
**Tip:** `awk 'NR==1{print; next} !seen[$0]++{print}'` is the classic awk one-liner for removing duplicate lines while preserving the first occurrence and keeping the header row. Useful for deduplicating log entries.

REMEMBER THIS
**Remember:** awk field numbering starts at $1. $0 is the entire line. $NF is always the last field regardless of how many fields there are. This makes `awk '{print $NF}'` reliable for extracting the last word/column even when the number of fields varies.