Processing Text in Linux — grep, awk, sed, cut, and jq | DevOps Network

Overview and What You Will Learn

Every production incident involves reading logs. Every deployment script processes command output. Every monitoring system parses structured data. Text processing tools are the lenses through which Linux engineers see what is happening in their systems.

By the end of this lab you will:

Filter log files with grep using regex patterns and context flags
Extract and transform fields with awk one-liners
Perform find-and-replace operations with sed
Cut columns from delimited files with cut
Sort, deduplicate, and count with sort and uniq
Parse and query JSON from APIs with jq
Build multi-stage log analysis pipelines combining all tools

Why This Matters in Production

A Hotstar on-call engineer gets paged at 2 AM. The service is returning errors, and the log file has 500,000 lines. Without text processing skills, finding the root cause means around 20 minutes of scrolling. With them, one pipeline ranks the five most frequent error types in seconds: grep -E 'ERROR|FATAL' /var/log/app.log | awk '{print $5}' | sort | uniq -c | sort -rn | head -5 (this assumes the error type is the fifth whitespace-separated field in your log format).

Core Principles

Text processing pipeline — raw log to insight:

+------------------------------------------+
| Raw log: 500,000 lines                   |
| access.log (nginx access log)            |
+------------------------------------------+
                    |
             grep '500'
                    |
                    v
+------------------------------------------+
| Filtered: 1,247 error lines              |
+------------------------------------------+
                    |
         awk '{print $7}'
                    |
                    v
+------------------------------------------+
| Extracted: URL paths only                |
+------------------------------------------+
                    |
          sort | uniq -c | sort -rn
                    |
                    v
+------------------------------------------+
| Ranked: top URLs generating 500 errors   |
| 342 /api/payment/process                 |
| 198 /api/cart/checkout                   |
+------------------------------------------+
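
The stages in the diagram can be tried end to end on a handful of lines. The following is a minimal sketch, assuming the default nginx "combined" layout in which the path is field 7 and the status code is field 9; the sample lines and the `top_500_paths` helper are made up for illustration. It compares the status field exactly with awk instead of using grep '500', because a plain substring match also catches lines where 500 appears elsewhere, such as in the byte count.

```bash
#!/usr/bin/env bash
# Hypothetical sample data in nginx "combined" layout; not taken from a real server.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
203.0.113.45 - - [15/Jan/2024:10:23:45 +0000] "POST /api/payment/process HTTP/1.1" 500 312
203.0.113.46 - - [15/Jan/2024:10:23:46 +0000] "POST /api/cart/checkout HTTP/1.1" 500 298
203.0.113.47 - - [15/Jan/2024:10:23:47 +0000] "GET /api/items HTTP/1.1" 200 500
203.0.113.48 - - [15/Jan/2024:10:23:48 +0000] "POST /api/payment/process HTTP/1.1" 500 312
EOF

# Keep lines whose status field ($9) is exactly 500 and print the path ($7),
# then count identical paths and rank them, highest count first.
top_500_paths() {
  awk '$9 == 500 {print $7}' "$1" | sort | uniq -c | sort -rn
}

ranked="$(top_500_paths "$sample_log")"
echo "$ranked"
rm -f "$sample_log"
```

The third sample line has status 200 and a 500-byte response, so grep '500' would include it while the exact field comparison leaves it out.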

Detailed Step-by-Step Practical Lab

Milestone 1 — grep for filtering

Bash

## Basic search
grep 'ERROR' /var/log/app.log
 
## Case-insensitive
grep -i 'error' /var/log/app.log
 
## Show line numbers
grep -n 'ERROR' /var/log/app.log
 
## Count matches
grep -c 'ERROR' /var/log/app.log
 
## Invert (lines NOT matching)
grep -v 'DEBUG' /var/log/app.log
 
## Show context: 2 lines before, 3 lines after each match
grep -B 2 -A 3 'CRITICAL' /var/log/app.log
 
## Extended regex (alternation, +, ?)
grep -E 'ERROR|FATAL|CRITICAL' /var/log/app.log
 
## Multiple files -- list only the names of files that contain a match
grep -l 'ERROR' /var/log/*.log
 
## Recursive search in directories
grep -r 'DATABASE_URL' /etc/ --include='*.conf'
grep -r 'API_KEY' /opt/apps/ --include='*.py' --include='*.js'
 
## Highlight matches (useful for piping to less)
grep --color=always 'ERROR' /var/log/app.log | less -R
 
## Search for whole words only
grep -w 'error' /var/log/app.log
## Matches 'error' but not 'errors' or 'error_code'
 
## Perl regex (lookahead, lookbehind); -o prints only the matched text
grep -oP '(?<=user_id=)\d+' /var/log/app.log
## Extracts digits that follow 'user_id='
 
## Quiet mode for scripts
if grep -q 'FATAL' /var/log/app.log; then
  echo "Fatal errors detected -- alerting on-call"
fi
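
One detail of the count flag above is easy to miss: grep -c counts matching lines, not individual matches. The following is a minimal sketch with made-up sample lines; it uses -o, which prints each match on its own line, to count occurrences instead.

```bash
#!/usr/bin/env bash
# Hypothetical sample log: the first line contains ERROR twice.
sample_log="$(mktemp)"
printf '%s\n' \
  'ERROR retry failed after ERROR timeout' \
  'INFO request served' \
  'ERROR disk full' > "$sample_log"

# -c counts lines that contain at least one match.
matching_lines="$(grep -c 'ERROR' "$sample_log")"

# -o prints every match on its own line, so counting those lines counts occurrences.
occurrences="$(grep -o 'ERROR' "$sample_log" | wc -l | tr -d ' ')"

echo "lines=$matching_lines occurrences=$occurrences"
rm -f "$sample_log"
```

On this sample the two numbers differ, which is worth remembering when an alert threshold is based on the count.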

Milestone 2 — awk for field extraction

Bash

## Print specific fields (whitespace-delimited by default)
## nginx access log: IP - - [date] "METHOD /path HTTP" status bytes
awk '{print $1, $7, $9}' /var/log/nginx/access.log
## 203.0.113.45 /api/payment 200
 
## Custom field separator
awk -F: '{print $1, $3}' /etc/passwd       ## username and UID
awk -F, '{print $1, $4}' /tmp/report.csv   ## CSV columns 1 and 4 (simple CSV only; quoted commas are not handled)
 
## Filter with condition then extract
## Show PIDs of processes using more than 10% CPU
ps aux | awk '$3 > 10 {print $2, $11, $3"%"}'
 
## Count matching lines (count+0 prints 0 instead of an empty value when nothing matches)
awk '/ERROR/ {count++} END {print "Errors:", count+0}' /var/log/app.log
 
## Sum a column
awk '{sum += $10} END {printf "Total bytes: %.2fMB\n", sum/1024/1024}' /var/log/nginx/access.log
 
## Group and count
## Count requests per status code
awk '{codes[$9]++} END {for (code in codes) print code, codes[code]}' /var/log/nginx/access.log | sort -rn -k2
 
## Formatted table output
df -h | awk 'NR==1 || $5+0 > 70 {printf "%-20s %5s %5s\n", $6, $5, $4}'
## Shows header row plus filesystems over 70% full
 
## Process specific line range
awk 'NR>=100 && NR<=200 {print NR": "$0}' /var/log/app.log
 
## Multi-field log analysis
## Format: timestamp service level message
## 2024-01-15T10:23:45 payment-api ERROR Database connection failed
awk '/ERROR/ {services[$2]++} END {for (s in services) print s, services[s]}' /var/log/app.log
## Shows error count per service

Milestone 3 — sed for transformation

Bash

## Basic substitution (first occurrence per line)
sed 's/localhost/prod-db.razorpay.internal/' config.yaml
 
## Global substitution (all occurrences)
sed 's/localhost/prod-db.razorpay.internal/g' config.yaml
 
## In-place edit (always backup first)
sed -i.bak 's/debug: true/debug: false/' /etc/app/config.yaml
## Creates config.yaml.bak with original content
 
## Delete lines
sed '/^#/d' config.yaml          ## remove comments
sed '/^$/d' config.yaml          ## remove blank lines
sed '/DEBUG/d' /var/log/app.log  ## remove debug lines
 
## Extract lines between markers
sed -n '/BEGIN CERT/,/END CERT/p' certificate.pem
 
## Substitute only on lines matching a pattern
## Only change port on lines containing 'payment-api'
sed '/payment-api/s/8080/4000/' docker-compose.yaml
 
## Multiple expressions
sed -e 's/localhost/10.0.2.100/g' -e 's/5432/5433/g' config.yaml
 
## Use different delimiter (useful when pattern contains /)
sed 's|/etc/nginx|/etc/nginx-prod|g' config.conf
 
## Add line before pattern
## (inserting above ExecStart= keeps User= inside the [Service] section)
sed '/^ExecStart=/i User=payment-svc' myapp.service
 
## Add line after pattern
sed '/^ExecStart=/a Restart=on-failure' myapp.service
 
## Comment out a line (anchor with ^ so already-commented lines are left alone; keep a backup)
sed -i.bak '/^PasswordAuthentication yes/s/^/#/' /etc/ssh/sshd_config
 
## Remove trailing whitespace from all lines
sed -i 's/[[:space:]]*$//' file.txt

Milestone 4 — cut, sort, and uniq for structured data

Bash

## cut: extract columns from delimited text
## -d sets delimiter, -f selects field(s)
cut -d: -f1 /etc/passwd           ## extract usernames
cut -d: -f1,3 /etc/passwd         ## fields 1 and 3
cut -d, -f2,4 report.csv          ## CSV columns 2 and 4
cut -c1-10 /var/log/app.log       ## first 10 characters of each line
 
## sort: sort lines
sort /etc/hosts                    ## alphabetical
sort -n numbers.txt                ## numeric
sort -rn numbers.txt               ## reverse numeric
sort -t: -k3,3n /etc/passwd       ## sort by UID (field 3 only), numerically, colon-delimited
sort -u /var/log/ips.txt           ## sort and deduplicate
 
## uniq: work with consecutive duplicate lines
## (always sort first for full deduplication)
sort /var/log/ips.txt | uniq       ## deduplicate
sort /var/log/ips.txt | uniq -c   ## count occurrences
sort /var/log/ips.txt | uniq -d   ## show only duplicates
 
## Combined: frequency analysis
## Top 10 IPs from nginx access log
awk '{print $1}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -10
 
## Top error messages
grep 'ERROR' /var/log/app.log | \
  sed 's/.*ERROR //' | \
  sort | uniq -c | sort -rn | head -10

Milestone 5 — jq for JSON processing

Bash

## Install jq if not present (Debian/Ubuntu; use your distribution's package manager elsewhere)
sudo apt install jq
 
## Pretty-print JSON
curl -s https://api.internal/status | jq .
 
## Extract a field
curl -s https://api.internal/health | jq '.status'
## "healthy"
 
## Extract nested field
curl -s https://api.internal/metrics | jq '.services.payment.latency_ms'
 
## Extract from array
curl -s https://api.internal/servers | jq '.[0].hostname'
curl -s https://api.internal/servers | jq '.[].hostname'  ## all hostnames
 
## Filter array by condition
curl -s https://api.internal/services | jq '.[] | select(.status == "down")'
curl -s https://api.internal/services | jq '[.[] | select(.healthy == false)]'
 
## Build new JSON (one object per array element; .name on the array itself would fail)
curl -s https://api.internal/services | \
  jq '.[] | {name: .name, status: .status, uptime: .uptime_seconds}'
 
## Extract multiple fields as TSV
curl -s https://api.internal/services | \
  jq -r '.[] | [.name, .status, .region] | @tsv'
 
## Count elements
curl -s https://api.internal/services | jq 'length'
curl -s https://api.internal/services | jq '[.[] | select(.status == "down")] | length'
 
## Process docker inspect output
docker inspect payment-api | jq '.[0].NetworkSettings.IPAddress'
docker inspect payment-api | jq '.[0].State.Status'
 
## Process kubectl output
kubectl get pods -o json | jq '.items[] | {name: .metadata.name, status: .status.phase}'
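
The filters above can be practised without a live endpoint. The following is a minimal sketch, assuming jq is installed; the JSON document, its field values, and the variable names are invented sample data standing in for an API response shaped like the array that the /services examples assume.

```bash
#!/usr/bin/env bash
# Hypothetical response body: an array of service objects.
services='[
  {"name": "payment", "status": "up",   "region": "ap-south-1", "uptime_seconds": 86400},
  {"name": "cart",    "status": "down", "region": "ap-south-1", "uptime_seconds": 0}
]'

# Names of services that are down (-r prints raw strings without JSON quotes).
down_names="$(printf '%s' "$services" | jq -r '.[] | select(.status == "down") | .name')"

# Reshape every element; map(f) is shorthand for [.[] | f].
summary="$(printf '%s' "$services" | jq -c 'map({name, status, uptime: .uptime_seconds})')"

# Number of elements in the top-level array.
total="$(printf '%s' "$services" | jq 'length')"

echo "down: $down_names"
echo "total: $total"
echo "$summary"
```

Keeping a saved response in a variable or file like this also makes it easier to iterate on a filter without repeatedly calling the API.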

Milestone 6 — Production log analysis pipeline

Bash

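#!/usr/bin/env bash
## Production incident log analyser
## Usage: ./analyse-logs.sh /var/log/app/app.log "1 hour ago"
## Assumed line format (as in Milestone 2):
##   2024-01-15T10:23:45 payment-api ERROR Database connection failed

LOG_FILE="${1:-/var/log/app/app.log}"
SINCE="${2:-1 hour ago}"

if [[ ! -r "$LOG_FILE" ]]; then
  echo "Cannot read log file: $LOG_FILE" >&2
  exit 1
fi

## Convert SINCE to an ISO timestamp (requires GNU date; assumes the log uses local time)
CUTOFF="$(date -d "$SINCE" '+%Y-%m-%dT%H:%M:%S')" || exit 1

## Print only lines at or after the cutoff (ISO timestamps compare correctly as strings)
recent() { awk -v cutoff="$CUTOFF" '$1 >= cutoff' "$LOG_FILE"; }

echo "=== Log Analysis: $LOG_FILE ==="
echo "=== Period: since $SINCE ($CUTOFF) ==="
echo ""

echo "--- Error Summary ---"
## grep -c already prints 0 when nothing matches, so no fallback echo is needed
echo "ERROR: $(recent | grep -c 'ERROR')"
echo "FATAL: $(recent | grep -c 'FATAL')"
echo ""

echo "--- Top 5 Error Types ---"
## Blank out timestamp, service, and level so only the message is grouped
recent | grep 'ERROR' | \
  awk '{$1=$2=$3=""; print $0}' | \
  sed 's/^ *//' | \
  sort | uniq -c | sort -rn | head -5
echo ""

echo "--- Error Rate by Minute (last 10 minutes with errors) ---"
## The first 16 characters of the timestamp are YYYY-MM-DDTHH:MM
recent | grep 'ERROR' | \
  awk '{print substr($1,1,16)}' | \
  sort | uniq -c | tail -10
echo ""

echo "--- Services with Most Errors ---"
recent | grep 'ERROR' | \
  awk '{print $2}' | \
  sort | uniq -c | sort -rn | head -5
echo ""

echo "--- Slowest API Calls (>1000ms) ---"
## Assumes lines containing duration_ms are JSON objects, one per line
grep 'duration_ms' "$LOG_FILE" | \
  jq -r 'select(.duration_ms > 1000) | "\(.duration_ms)ms \(.path)"' 2>/dev/null | \
  sort -rn | head -10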

Production Best Practices and Common Pitfalls

Task	Slow Approach	Fast Approach
Find error lines	`cat log | grep ERROR`	`grep ERROR log`
Count unique IPs	Loop and count manually	`sort ips.txt | uniq -c`
Find large files	Browse directories	`du -sh /* | …`
Extract JSON field	String splitting	`jq '.fieldname'`
Remove duplicate lines	Manual comparison	`sort file | uniq`

Quick Reference and Troubleshooting Commands

Task	Command
Filter log	`grep -E 'ERROR|FATAL' logfile`
Extract field	`awk '{print $3}' logfile`
Custom delimiter	`awk -F: '{print $1}' /etc/passwd`
Replace text	`sed 's/old/new/g' file`
In-place edit	`sed -i.bak 's/old/new/g' file`
Frequency count	`sort file | uniq -c | sort -rn`
JSON field	`jq '.fieldname' response.json`
Filter JSON array	`jq '.[] | select(.status == "down")' response.json`

PLACEMENT PRO TIP
**Tip:** Avoid the `cat file | grep pattern` antipattern (Useless Use of Cat). `grep pattern file` is faster and more direct. The extra `cat` process is unnecessary. Similarly, `grep pattern file | awk '{...}'` can often be `awk '/pattern/{...}' file` in a single process.
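
The single-process form in the tip can be checked against the two-process form on sample data. The following is a minimal sketch, assuming the "timestamp service level message" layout used in the awk milestone; the log lines are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sample log in "timestamp service level message" layout.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2024-01-15T10:23:45 payment-api ERROR Database connection failed
2024-01-15T10:23:46 cart-api INFO Request served
2024-01-15T10:23:47 payment-api ERROR Database connection failed
2024-01-15T10:23:48 cart-api ERROR Upstream timeout
EOF

# Two processes: grep filters, awk extracts the service name.
two_process="$(grep 'ERROR' "$sample_log" | awk '{print $2}' | sort | uniq -c)"

# One process: awk filters and extracts in a single pass.
one_process="$(awk '/ERROR/ {print $2}' "$sample_log" | sort | uniq -c)"

echo "$one_process"
rm -f "$sample_log"
```

Both variables hold the same per-service counts, so the grep stage can be dropped whenever awk is already in the pipeline.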

REMEMBER THIS
**Remember:** `uniq` only removes consecutive duplicate lines. If duplicates are scattered throughout a file, `uniq` alone will not catch them. Always sort first: `sort file | uniq -c | sort -rn` gives frequency analysis of all duplicates regardless of position.
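
The difference is easy to see on three lines. The following is a minimal sketch with made-up IP addresses in which the duplicate entries are not adjacent.

```bash
#!/usr/bin/env bash
# Hypothetical input: 10.0.0.1 appears twice, but not on consecutive lines.
ips="$(printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1)"

# uniq alone compares only neighbouring lines, so nothing is removed here.
without_sort="$(printf '%s\n' "$ips" | uniq | awk 'END {print NR}')"

# Sorting first makes the duplicates adjacent, so uniq can collapse them.
with_sort="$(printf '%s\n' "$ips" | sort | uniq | awk 'END {print NR}')"

echo "uniq only: $without_sort lines, sort | uniq: $with_sort lines"
```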

COMMON MISTAKE / WARNING
**Security:** Avoid parsing `/etc/passwd` or security-sensitive files with regex tools in scripts that process user input. An injection via a username containing special characters could manipulate your awk or sed pattern. Use dedicated tools (`getent passwd username`) for user lookups in security-sensitive contexts.
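
Where a script does have to match a user-supplied name against passwd-style data, one way to avoid the injection is to keep the input out of the pattern entirely and compare it as a plain string. The following is a minimal sketch, assuming a passwd-format sample file; the file contents, the `lookup_uid` function, and the `LOOKUP_USER` variable are hypothetical names chosen for illustration. On a live system, `getent passwd "$user"` remains the simpler choice.

```bash
#!/usr/bin/env bash
# Hypothetical passwd-format sample; not read from the real /etc/passwd.
passwd_sample="$(mktemp)"
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
deploy:x:1001:1001::/home/deploy:/bin/bash
deployer:x:1002:1002::/home/deployer:/bin/bash
EOF

lookup_uid() {
  # $1 = username (treated as data, never as a regex), $2 = passwd-format file.
  # The name reaches awk through the environment, and appending "" forces a
  # string comparison, so characters like . or * have no special meaning.
  LOOKUP_USER="$1" awk -F: '($1 "") == (ENVIRON["LOOKUP_USER"] "") {print $3}' "$2"
}

uid_exact="$(lookup_uid 'deploy' "$passwd_sample")"
uid_regex="$(lookup_uid '.*' "$passwd_sample")"

echo "deploy -> ${uid_exact:-no match}"
echo ".*     -> ${uid_regex:-no match}"
rm -f "$passwd_sample"
```

An exact comparison also avoids the quieter bug where a pattern for deploy matches deployer as well.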

COMMON MISTAKE / WARNING
**Common Mistake:** Using `sed -i` without first testing the expression. Always run `sed 's/old/new/g' file` without `-i` first to preview the output. One wrong regex on a production config file can break a service until the backup is restored.
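
One way to make the preview step routine is to diff the sed output against the original before applying the change. The following is a minimal sketch on a temporary file; the file contents and variable names are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical config file created only for this demonstration.
config="$(mktemp)"
printf '%s\n' 'debug: true' 'port: 8080' > "$config"

expression='s/debug: true/debug: false/'

# Preview: show what would change without touching the file.
# diff exits with status 1 when the inputs differ, so that is not treated as a failure.
preview="$(sed "$expression" "$config" | diff "$config" - || true)"
echo "$preview"

# Apply only after reviewing the preview, keeping the original as a .bak copy.
sed -i.bak "$expression" "$config"

after="$(cat "$config")"
backup="$(cat "$config.bak")"
rm -f "$config" "$config.bak"
```

An empty preview means the expression matched nothing, which is often the first sign of a wrong regex.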
