Overview and What You Will Learn
A container that is not working is not the same as a container that is broken. The difference is what you do next. Engineers who know Docker's debugging tools can usually narrow a container failure down to its root cause within minutes. Engineers who do not know these tools restart the container and hope for the best.
In this guide you will learn a systematic debugging workflow — starting with logs, moving to inspect when logs are not enough, using exec to investigate a running container from the inside, and using exit codes to understand exactly why a container died. You will also learn how to debug a container that has already stopped, which is often the hardest case.
Why This Matters in Production
At Hotstar during an IPL match, a container serving video streams starts failing. Thousands of users are seeing errors. You have minutes to diagnose and fix it. The engineer who can read docker logs with the right filters, spot a memory limit being hit in docker stats, and verify a config file inside the container with docker exec resolves it quickly. The engineer who only knows to restart the container makes it worse by losing the state needed for diagnosis.
Core Principles
The debugging process follows a consistent order. Each step gives you more information than the last.
    +------------------------------------------+
    | Step 1: docker ps -a                     |
    | Is the container running or stopped?     |
    | What is the exit code?                   |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Step 2: docker logs                      |
    | What did the application print?          |
    | Is there an obvious error message?       |
    +------------------------------------------+
                         |  If logs not enough
                         v
    +------------------------------------------+
    | Step 3: docker inspect                   |
    | What config was the container started    |
    | with? What env vars? What mounts?        |
    +------------------------------------------+
                         |  If container is running
                         v
    +------------------------------------------+
    | Step 4: docker exec -it container bash   |
    | Investigate from inside the container    |
    | Check files, network, processes          |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Step 5: docker stats                     |
    | Is the container hitting resource limits?|
    | Memory near limit? CPU throttled?        |
    +------------------------------------------+

Detailed Step-by-Step Practical Lab
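As a warm-up, the five-step order in the diagram can be sketched as one small shell function. This is a minimal sketch, not a definitive tool: the function name `triage` is hypothetical, and it assumes the `docker` CLI is on the PATH and that the container name is passed as the first argument.

```shell
# Hypothetical helper: walks the debugging steps in order for one container.
# Usage: triage <container-name>
triage() (
    name="$1"

    echo "== Step 1: status and exit code =="
    docker ps -a --filter "name=$name" --format '{{.Names}} {{.Status}}'

    echo "== Step 2: last 50 log lines =="
    docker logs --tail 50 "$name" 2>&1

    echo "== Step 3: state and limits =="
    docker inspect --format \
        'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Memory={{.HostConfig.Memory}}' \
        "$name"

    # Steps 4 and 5 only make sense while the container is running.
    running="$(docker inspect --format '{{.State.Running}}' "$name")"
    if [ "$running" = "true" ]; then
        echo "== Step 5: resource snapshot =="
        docker stats --no-stream "$name"
        echo "Step 4 is interactive: docker exec -it $name sh"
    else
        echo "Container is not running; skipping exec and stats."
    fi
)
```

Calling `triage order-worker` prints the non-interactive steps in sequence, so you can decide whether an interactive exec session is needed at all.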
Milestone 1: Reading Exit Codes
Every stopped container has an exit code. The exit code is the first clue to what happened.
    # Check exit codes for stopped containers
    docker ps -a --format "table {{.Names}}\t{{.Status}}"
    # NAMES          STATUS
    # payment-api    Exited (1) 5 minutes ago
    # db-migrator    Exited (0) 10 minutes ago
    # order-worker   Exited (137) 2 minutes ago
    # cache-warmer   Exited (139) 1 minute ago

    # Exit code meanings:
    # 0   = Clean exit: process finished successfully (batch jobs, migrations)
    # 1   = Generic application error: check logs for details
    # 2   = Misuse of shell builtin or invalid argument
    # 126 = Command found but not executable (permission denied)
    # 127 = Command not found (wrong PATH or typo in CMD)
    # 128 = Invalid argument to exit; codes above 128 mean 128 + signal number
    # 137 = Killed by signal 9 (SIGKILL): OOM kill, docker kill, or docker stop after the grace period
    # 139 = Segmentation fault (signal 11)
    # 143 = Terminated by signal 15 (SIGTERM): docker stop or clean shutdown

    # Confirm OOMKill specifically
    docker inspect --format '{{.State.OOMKilled}}' order-worker
    # true: this container was killed because it exceeded its memory limit

    # Get the full state object
    docker inspect --format '{{json .State}}' order-worker
    # {"Status":"exited","Running":false,"Paused":false,"Restarting":false,
    # "OOMKilled":true,"Dead":false,"Pid":0,"ExitCode":137,...}

Milestone 2: Advanced Log Reading
    # Basic log reading
    docker logs payment-api

    # Real-world scenario: the container crashes under a restart policy
    # and restarts 3 times before you notice.
    # docker logs shows output from ALL runs concatenated together.
    # Use --since to isolate the most recent failure.
    docker logs --since 5m payment-api
    # Shows only output from the last 5 minutes

    # Get the last 50 lines, then follow new output
    docker logs --tail 50 -f payment-api

    # Search for errors in logs (pipe to grep)
    docker logs payment-api 2>&1 | grep -i "error\|fatal\|exception"
    # 2>&1 redirects stderr to stdout so grep sees both

    # Save logs to a file for sharing with your team
    docker logs payment-api > /tmp/payment-api-logs.txt 2>&1

    # Count occurrences of an error
    docker logs payment-api 2>&1 | grep -c "connection refused"
    # 47: the database connection was refused 47 times

    # See the exact moment the container started having issues
    docker logs --timestamps payment-api 2>&1 | grep "ERROR"
    # 2024-01-15T14:23:01.123456789Z ERROR: DB connection failed
    # 2024-01-15T14:23:02.234567890Z ERROR: DB connection failed
    # Pattern: errors started at 14:23, so check what changed at that time

    # For a container that has crashed and been restarted:
    # all runs share one log, so split it at the current run's start time
    # to see what happened before the last restart.
    docker logs --until "$(docker inspect --format '{{.State.StartedAt}}' payment-api)" payment-api 2>&1 | tail -n 100
    # Shows the end of the previous run, including why it crashed
    # Critical for diagnosing containers that restart immediately

Milestone 3: Deep Inspection with docker inspect
    # Full inspection: everything Docker knows about the container
    docker inspect payment-api

    # This returns a 200+ line JSON object. Learn to extract exactly what you need:

    # Check what environment variables the container was started with
    docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' payment-api
    # NODE_ENV=production
    # DB_HOST=10.0.1.50
    # DB_PORT=5432
    # DB_PASSWORD=<redacted here; docker inspect prints the real value in plain text>

    # Check what image the container is running
    docker inspect --format '{{.Config.Image}}' payment-api
    # registry.razorpay.in/payment-api:v3.1.0

    # Check what ports are published
    docker inspect --format '{{json .NetworkSettings.Ports}}' payment-api
    # {"8080/tcp":[{"HostIp":"0.0.0.0","HostPort":"8080"}]}

    # Check what volumes are mounted
    docker inspect --format '{{range .Mounts}}{{.Type}} {{.Source}} -> {{.Destination}}{{println}}{{end}}' payment-api
    # volume /var/lib/docker/volumes/payment-data/_data -> /app/data

    # Check the restart policy
    docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' payment-api
    # unless-stopped

    # Check memory and CPU limits
    docker inspect --format 'Memory: {{.HostConfig.Memory}} CPU: {{.HostConfig.NanoCpus}}' payment-api
    # Memory: 536870912 CPU: 1000000000
    # 536870912 bytes = 512 MiB, 1000000000 nanocpus = 1 CPU

    # Check the container network and IP
    docker inspect --format '{{range $net, $config := .NetworkSettings.Networks}}{{$net}}: {{$config.IPAddress}}{{println}}{{end}}' payment-api
    # bridge: 172.17.0.3
    # payment-network: 10.0.1.25

    # Check when the container was created and last started
    docker inspect --format 'Created: {{.Created}} Started: {{.State.StartedAt}}' payment-api

Milestone 4: Investigating from Inside with docker exec
    # Get an interactive shell inside a running container
    docker exec -it payment-api bash
    # If bash is not available (minimal images):
    docker exec -it payment-api sh

    # Once inside the container, you can investigate:

    # Check what processes are running
    ps aux

    # Check network connectivity: try to reach the database by service name
    curl http://postgres:5432
    # Postgres does not speak HTTP, so curl reports an error either way.
    # An empty or invalid reply still means the port is reachable;
    # "connection refused" or a timeout means it is not.

    # Check DNS resolution
    nslookup postgres
    # Does the container resolve service names correctly?

    # Check which ports are listening
    ss -tulpn
    # or
    netstat -tulpn

    # Read a config file
    cat /app/config/database.yml

    # Check disk space inside the container
    df -h

    # Check environment variables as seen by the process
    env | sort

    # Check file permissions that might be causing issues
    ls -la /app/data/

    # Exit the container shell
    exit

When the container is running but behaving incorrectly, exec is your most powerful tool. You are seeing the exact environment the application sees.
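Minimal images often lack `curl`, `nslookup`, and `ss`. As a fallback for the connectivity check, here is a minimal sketch that opens a raw TCP connection using bash's `/dev/tcp` pseudo-device. It assumes the image ships `bash` and a `timeout` command; the function name `check_tcp` is hypothetical.

```shell
# Hypothetical helper, meant to be pasted into a shell inside the container.
# Returns 0 if a TCP connection to host:port opens within 3 seconds.
# Assumes bash and a `timeout` command are available in the image.
check_tcp() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example, `check_tcp postgres 5432 && echo reachable || echo unreachable` tests the same database endpoint as the curl command, without depending on an HTTP client.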
Milestone 5: Debugging a Stopped Container
Exec only works on running containers. When a container crashes and stays stopped, you need a different approach.
    # Method 1: Read the logs from the stopped container
    # Logs are preserved until docker rm is run
    docker logs --tail 100 stopped-payment-api

    # Method 2: Create a new container from the same image with a shell override
    # This starts the same image but runs bash instead of the normal command
    # Lets you inspect the filesystem and config without the app crash happening
    docker run -it --rm \
      --entrypoint bash \
      registry.razorpay.in/payment-api:v3.1.0
    # Now you are inside the image environment
    # Check config files, verify binaries exist, check permissions

    # Method 3: Commit the stopped container to a new image and inspect it
    docker commit stopped-payment-api debug-payment-api
    docker run -it --rm --entrypoint bash debug-payment-api
    # --entrypoint makes sure bash runs even if the image defines an ENTRYPOINT
    # This preserves any files that were written during the container's run
    # Useful when the crash modified files (created a lockfile, corrupted a db, etc.)

    # Method 4: Use docker diff to see what files changed during the container run
    docker diff stopped-payment-api
    # A /app/logs/error.log   <- A = Added
    # C /app/config/db.yml    <- C = Changed
    # D /app/tmp/lock.pid     <- D = Deleted
    # Shows every file the container modified from its image

Milestone 6: Diagnosing Resource Problems with docker stats
    # Live monitoring of all running containers
    docker stats

    # What each column means:
    # CONTAINER ID : short container ID
    # NAME         : container name
    # CPU %        : CPU usage as a percentage of one host core; can exceed 100% on multi-core hosts
    # MEM USAGE    : current memory used vs the container's memory limit
    # MEM %        : memory usage as a percentage of the limit
    # NET I/O      : network bytes received / transmitted
    # BLOCK I/O    : disk bytes read / written
    # PIDS         : number of processes inside the container

    # Warning signs:
    # CPU % pinned near the container's CPU limit: app is CPU-bound or stuck in a loop
    # MEM % above 85%: approaching the limit, OOMKill risk
    # PIDS growing over time: process leak, not cleaning up child processes

    # One-time snapshot for all containers
    docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
    # NAME          CPU %    MEM USAGE / LIMIT    MEM %
    # payment-api   45.2%    420MiB / 512MiB      82.0%   <- getting close to limit!
    # order-api     2.1%     128MiB / 512MiB      25.0%
    # postgres      0.8%     256MiB / 1GiB        25.0%

    # If memory is above 85%:
    # Either increase the container memory limit:
    docker update --memory 1g payment-api
    # (if Docker rejects this because of the existing swap limit, pass --memory-swap as well)
    # Or investigate the memory leak in the application

Common Mistakes
| Mistake | What Goes Wrong | Fix |
|---|---|---|
| Reading logs without --since on a restarting container | Sees thousands of lines from many restarts | Always use --since 5m or --tail 100 to scope it |
| Not checking OOMKilled before investigating logs | Misses that the container was killed by the kernel | Check docker inspect --format '{{.State.OOMKilled}}' first |
| Running docker exec on the wrong container | Investigating the healthy replica instead of the crashed one | Always use docker ps -a to find the exact container ID first |
| Removing the crashed container with docker rm before saving logs | Evidence gone permanently | Save logs to a file before cleaning up: docker logs name > /tmp/logs.txt 2>&1 |
| Using docker exec to make permanent fixes | Changes are lost when the container is recreated from its image | Fix the Dockerfile or environment config, rebuild the image |
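The "save logs before cleaning up" fix can be sketched as a small function that captures both the logs and the inspect output in one step. This is a minimal sketch under stated assumptions: the function name `save_evidence` and the default directory `/tmp/docker-evidence` are hypothetical, and the `docker` CLI is assumed to be on the PATH.

```shell
# Hypothetical helper: capture logs and inspect output before `docker rm`.
# Usage: save_evidence <container-name> [output-dir]
save_evidence() (
    name="$1"
    dir="${2:-/tmp/docker-evidence}"
    stamp="$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$dir" || exit 1

    # Capture stdout and stderr, since applications log to either stream.
    docker logs --timestamps "$name" > "$dir/$name-$stamp.log" 2>&1
    docker inspect "$name" > "$dir/$name-$stamp.inspect.json"

    # Print the common path prefix of the two files that were written.
    echo "$dir/$name-$stamp"
)
```

Run `save_evidence payment-api` before `docker rm payment-api`. The inspect file includes the container's environment variables in plain text, so treat it as sensitive.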
Troubleshooting Reference
| Exit Code | Meaning | First Step |
|---|---|---|
| 0 | Clean exit | Check if this is a batch job that should exit, not a server |
| 1 | Application error | docker logs --tail 50 name |
| 127 | Command not found | Check CMD/ENTRYPOINT in Dockerfile — binary might not exist in image |
| 137 | OOMKilled or docker kill | docker inspect --format '{{.State.OOMKilled}}' — if true, increase memory limit |
| 139 | Segfault | Application crash — check logs and report to app developer |
| 143 | SIGTERM received | Normal stop via docker stop — not an error |
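The table can also be encoded as a small lookup, which is handy in scripts that triage many containers. A minimal sketch, assuming the common convention that codes above 128 mean 128 plus the signal number; the function name `explain_exit_code` is hypothetical.

```shell
# Hypothetical helper: turn a container exit code into a first hint.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() (
    code="$1"
    case "$code" in
        0)   echo "clean exit" ;;
        1)   echo "application error" ;;
        126) echo "command not executable" ;;
        127) echo "command not found" ;;
        *)
            if [ "$code" -gt 128 ] 2>/dev/null; then
                echo "killed by signal $((code - 128))"
            else
                echo "unrecognized exit code $code"
            fi
            ;;
    esac
)
```

For example, `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' order-worker)"` decodes the exit code of a stopped container; for 137 it prints `killed by signal 9`.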
**Tip:** When a container is crashing in a restart loop, `docker logs` returns every run concatenated into one stream. Use the container's `State.StartedAt` timestamp with `--until` to see only the output from before the current run. The current run's logs may only show the startup sequence; the error that caused the crash is in the output of the previous run.
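One way to separate the current run from earlier runs is to split the log at the container's start time. The sketch below assumes a restart policy that reuses the same container, so that all runs share one log; the function names are hypothetical.

```shell
# Hypothetical helpers: split a restarting container's logs at the moment
# the current run began. Both read State.StartedAt from docker inspect.
started_at() {
    docker inspect --format '{{.State.StartedAt}}' "$1"
}

# Output from the current run only.
logs_current_run() {
    docker logs --since "$(started_at "$1")" "$1" 2>&1
}

# Last N lines (default 100) from before the current run started.
# tail runs in a pipe rather than via --tail, so the line limit applies
# only to the lines that pass the --until filter.
logs_before_current_run() {
    docker logs --until "$(started_at "$1")" "$1" 2>&1 | tail -n "${2:-100}"
}
```

`logs_before_current_run payment-api 50` prints the last 50 lines written before the most recent restart, which is usually where the crash message sits.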
**Remember:** `docker inspect` is the single most comprehensive source of information about a container. Before Googling a problem, try `docker inspect container-name` and read the State, HostConfig, and NetworkSettings sections. The answer to most configuration problems is in there.
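HostConfig reports limits in raw units: `Memory` in bytes and `NanoCpus` in billionths of a CPU, with 0 meaning no limit. A minimal sketch of two converters that make those values readable; the function names are hypothetical.

```shell
# Hypothetical helpers: convert the raw units docker inspect reports.
# HostConfig.Memory is in bytes; HostConfig.NanoCpus is in billionths of a CPU.
# A value of 0 means no limit was set.
bytes_to_mib() {
    if [ "$1" -eq 0 ]; then
        echo "unlimited"
    else
        echo "$(( $1 / 1048576 )) MiB"
    fi
}

nanocpus_to_cpus() {
    if [ "$1" -eq 0 ]; then
        echo "unlimited"
    else
        awk -v n="$1" 'BEGIN { printf "%.2f CPU\n", n / 1e9 }'
    fi
}
```

For example, `bytes_to_mib "$(docker inspect --format '{{.HostConfig.Memory}}' payment-api)"` prints `512 MiB` for a container started with a 512 MiB limit.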
**Common Mistake:** Running `docker exec` to fix a problem inside a running container. Any change you make inside a container via exec is temporary: it disappears as soon as the container is replaced by a new one created from the image, which is what every redeploy does. The correct fix is always to change the Dockerfile, rebuild the image, and redeploy. Exec is for investigation, never for making production fixes.
**Security:** On production hosts, all `docker exec` sessions should be logged for audit purposes. An engineer who can exec into a production container can read secrets from environment variables, read files, and exfiltrate data. Consider restricting exec access to debugging sessions only and requiring a written justification in your incident management system before granting access to production containers.