Overview and What You Will Learn
A Compose stack with depends_on alone solves only half the problem. Docker starts the containers in the right order, but "started" and "ready to accept connections" are not the same thing. PostgreSQL's container process can be running in under a second, while the database itself takes another 3-4 seconds to finish initialising and accept connections. An API container that starts the instant Postgres' container starts will crash on its first connection attempt.
This lab fixes that gap permanently using health checks combined with condition: service_healthy.
By the end of this lab you will:
- Understand why plain depends_on is not enough to guarantee readiness
- Write HEALTHCHECK instructions for HTTP, TCP, database, and Redis-based services
- Configure depends_on with condition: service_healthy in Compose
- Read and interpret container health states (starting, healthy, unhealthy)
- Debug a failing health check step by step
- Know when wait-for-it.sh style scripts are still useful
Why This Matters in Production
At PhonePe, a payments API container that starts before its Redis-backed rate limiter is ready will either crash on boot or silently skip rate limiting for its first few seconds of traffic — both are unacceptable outcomes for a payments system. The fix is not a sleep statement in the entrypoint script. It is a properly defined health check that Compose (or Kubernetes, later) can rely on as a contract: "this service does not get traffic until it reports healthy."
This same health check definition becomes the foundation for Kubernetes readiness probes later, so getting it right in Compose pays off twice.
Core Principles
Why depends_on alone is insufficient:
```
+------------------------+          +------------------------------+
| postgres container     |          | api container                |
|                        |          |                              |
| process starts (0.3s)  | -------> | starts immediately (0.3s)    |
| still initialising     |          | tries to connect -> FAILS    |
+------------------------+          +------------------------------+
```
With a health check gating the dependency:
```
+------------------------------------------+
| postgres container starts                |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| status: starting (health check not yet   |  <- pg_isready fails,
| passed, retry in 5s)                     |     container stays 'starting'
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| pg_isready succeeds, status: healthy     |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| api container is released to start       |  <- depends_on:
+------------------------------------------+       condition: service_healthy
```
Health check states:
```
+------------------------+   +------------------------+   +------------------------+
| starting               |   | healthy                |   | unhealthy              |
|                        |   |                        |   |                        |
| within start_period    |   | one passing check is   |   | check failed retries   |
| failures do not count  |   | enough                 |   | times in a row         |
+------------------------+   +------------------------+   +------------------------+
```
Detailed Step-by-Step Practical Lab
Milestone 1 — Write a HEALTHCHECK in the Dockerfile
```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# Install production dependencies only
RUN npm ci --omit=dev
COPY . .

# Health check hits the app's own /health endpoint
# interval: how often to check
# timeout: how long to wait for a response
# start_period: grace time before failures count against the container
# retries: consecutive failures needed to mark unhealthy
HEALTHCHECK --interval=10s --timeout=3s --start-period=15s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:4000/health || exit 1

CMD ["node", "index.js"]
```

```bash
# Build and confirm the health check is registered
docker build -t phonepe-rate-limiter-api .
docker run -d --name rate-limiter-api -p 4000:4000 phonepe-rate-limiter-api

# Watch the health status transition from starting to healthy
watch -n 1 'docker inspect --format "{{.State.Health.Status}}" rate-limiter-api'
```
Milestone 2 — Define health checks for common dependency types in Compose
```yaml
services:
  postgres:
    image: postgres:16-alpine
    environment:
      # The postgres image refuses to start without a password.
      # These values match the DATABASE_URL used in Milestone 3.
      POSTGRES_USER: phonepe_user
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: ratelimits
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U phonepe_user -d ratelimits"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  rabbitmq:
    image: rabbitmq:3-management-alpine
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_port_connectivity"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

  internal-api:
    image: phonepe-rate-limiter-api:latest
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:4000/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 15s
```
Milestone 3 — Gate startup with condition: service_healthy
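It may help to see the rule this milestone configures before reading the Compose file. A service is released only once every dependency listed with `condition: service_healthy` reports `healthy`. The sketch below models that rule in plain JavaScript. It is not how Compose is implemented, and `canStart` and the object shapes are hypothetical names chosen for illustration.

```javascript
// dependsOn: { serviceName: condition }
// statuses:  { serviceName: current health status, or undefined if not started }
function canStart(dependsOn, statuses) {
  return Object.entries(dependsOn).every(([name, condition]) => {
    if (condition === "service_healthy") return statuses[name] === "healthy";
    // Without a health condition, a started container is enough.
    return statuses[name] !== undefined;
  });
}

const apiDependsOn = {
  postgres: "service_healthy",
  redis: "service_healthy",
  rabbitmq: "service_healthy",
};

// rabbitmq is still starting, so the api is held back: prints false
console.log(canStart(apiDependsOn, { postgres: "healthy", redis: "healthy", rabbitmq: "starting" }));
// All three report healthy, so the api is released: prints true
console.log(canStart(apiDependsOn, { postgres: "healthy", redis: "healthy", rabbitmq: "healthy" }));
```

The last branch also shows why plain `depends_on` is weaker: a dependency that has merely started satisfies it, whatever its health status.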
```yaml
services:
  api:
    build: .
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    environment:
      DATABASE_URL: postgres://phonepe_user:secret@postgres:5432/ratelimits
      REDIS_URL: redis://redis:6379

  postgres:
    image: postgres:16-alpine
    environment:
      # Must match the credentials and database name in DATABASE_URL
      POSTGRES_USER: phonepe_user
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: ratelimits
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U phonepe_user -d ratelimits"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      retries: 5

  rabbitmq:
    image: rabbitmq:3-management-alpine
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_port_connectivity"]
      interval: 10s
      retries: 5
```

```bash
# Start the stack and watch dependency ordering in action
docker compose up

# api container will NOT start until postgres, redis, and
# rabbitmq all report status: healthy
```
Milestone 4 — Read and debug health check status
```bash
# See health status for all services
# (-a also lists containers that are created but not yet running)
docker compose ps -a
NAME                 STATUS
phonepe-postgres-1   Up 12 seconds (healthy)
phonepe-redis-1      Up 12 seconds (healthy)
phonepe-rabbitmq-1   Up 12 seconds (health: starting)
phonepe-api-1        Created

# Inspect detailed health check history for one container
docker inspect phonepe-rabbitmq-1 --format '{{json .State.Health}}' | python3 -m json.tool

# See the last few health check attempts and their output
docker inspect phonepe-rabbitmq-1 --format '{{json .State.Health.Log}}' | python3 -m json.tool

# Manually run the exact health check command for debugging
docker exec phonepe-rabbitmq-1 rabbitmq-diagnostics check_port_connectivity
```
Milestone 5 — Diagnose a stuck or failing health check
```bash
# If a container stays 'starting' past start_period, check:

# 1. Is the command available inside the container at all?
docker exec phonepe-api-1 which wget

# 2. Is the app actually listening on the expected port yet?
docker exec phonepe-api-1 netstat -tlnp

# 3. Run the health check command manually with verbose output
docker exec phonepe-api-1 wget --no-verbose --tries=1 --spider http://localhost:4000/health

# 4. Check application logs for startup errors
docker compose logs api --tail=50
```
Milestone 6 — When wait-for-it.sh is still useful
Health checks solve container-to-container ordering inside Compose. But sometimes you need to wait for a dependency from outside Compose entirely, for example in a CI script before running tests.
```bash
# Download the script once into your repo
curl -o wait-for-it.sh https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh
chmod +x wait-for-it.sh

# Use it in a CI step before running integration tests.
# --strict skips the command if the port never opens; without it,
# the tests run even after the timeout expires.
./wait-for-it.sh localhost:5432 --timeout=30 --strict -- npm run test:integration
```
**Remember:** `wait-for-it.sh` checks only whether a TCP port is open. A `HEALTHCHECK` checks whether the application is actually functional. A database port can be open while the database itself is still recovering from a crash, so prefer real health checks over plain port checks whenever possible.
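To make the limits of a port check concrete, here is a minimal sketch of one, written with Node's built-in `net` module. `portIsOpen` is a hypothetical helper name, and the host and port in the example are the ones used in this lab. It reports `true` as soon as a TCP connection is accepted and says nothing about whether the service behind the port can answer a query.

```javascript
const net = require("node:net");

// Resolves true if a TCP connection to host:port succeeds within timeoutMs.
function portIsOpen(host, port, timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (result) => {
      socket.destroy(); // always release the socket before reporting
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish(true));
    socket.once("timeout", () => finish(false));
    socket.once("error", () => finish(false));
  });
}

// Example: probe the Postgres port published on localhost.
portIsOpen("localhost", 5432).then((open) => {
  console.log(open ? "port open (not necessarily ready)" : "port closed");
});
```

A real health check goes one step further and asks the service a question, as `pg_isready` and `redis-cli ping` do in the earlier milestones.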
Production Best Practices and Common Pitfalls
| Scenario | Wrong | Correct |
|---|---|---|
| API needs DB ready | depends_on with no condition | depends_on with condition: service_healthy |
| Slow-starting service | retries set too low, marked unhealthy too early | Use start_period to give grace time before counting failures |
| Health check tool missing | CMD curl in an alpine image without curl installed | Use wget (preinstalled) or install curl explicitly |
| Health check too expensive | Running a full DB query every 2 seconds | Use a lightweight check like pg_isready or redis-cli ping |
| Debugging a stuck check | Restarting the whole stack repeatedly | docker inspect State.Health.Log to see actual failure output |
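The JSON that `docker inspect --format '{{json .State.Health}}'` returns holds `Status`, `FailingStreak`, and a `Log` array whose entries carry `ExitCode` and `Output`. As a rough sketch, a few lines of Node can condense that into one line per failed attempt. The sample object below is invented for illustration, and `summariseHealth` is a hypothetical helper name.

```javascript
// Condenses a .State.Health object into one summary line
// plus one line per failed check attempt.
function summariseHealth(health) {
  const lines = [`status=${health.Status} failingStreak=${health.FailingStreak}`];
  for (const entry of health.Log || []) {
    if (entry.ExitCode !== 0) {
      const output = (entry.Output || "").trim();
      lines.push(`exit ${entry.ExitCode} at ${entry.Start}: ${output}`);
    }
  }
  return lines;
}

// Invented sample data; real data comes from docker inspect.
const sample = {
  Status: "starting",
  FailingStreak: 2,
  Log: [
    { Start: "2024-01-01T10:00:10Z", End: "2024-01-01T10:00:11Z", ExitCode: 1,
      Output: "wget: can't connect to remote host (127.0.0.1): Connection refused\n" },
    { Start: "2024-01-01T10:00:20Z", End: "2024-01-01T10:00:21Z", ExitCode: 1,
      Output: "wget: can't connect to remote host (127.0.0.1): Connection refused\n" },
  ],
};

console.log(summariseHealth(sample).join("\n"));
```

To use it on a real container, pass the result of `JSON.parse` on the command's output instead of `sample`.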
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| View health status of all services | docker compose ps |
| Inspect full health detail | docker inspect name --format '{{json .State.Health}}' |
| See health check failure history | docker inspect name --format '{{json .State.Health.Log}}' |
| Run health check command manually | docker exec name <healthcheck command> |
| View startup logs | docker compose logs -f service_name |
| Wait for a TCP port before running a command | ./wait-for-it.sh host:port -- command |
**Tip:** Set `start_period` generously for databases (10-20 seconds) and tightly for simple HTTP services (3-5 seconds). A start_period that is too short causes false unhealthy marks during normal slow startup; one too long delays detection of a genuinely broken container.
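To see how `start_period` and `retries` interact, the sketch below replays a list of check results through a simplified model of the status rules described in this lab. It is an illustration, not Docker's implementation. `healthStatus` and the input shape are hypothetical, and timing is reduced to the number of seconds since container start at which each check ran.

```javascript
// Simplified model of how a container's health status evolves.
// Assumption: each check is { at: secondsSinceStart, ok: boolean }.
function healthStatus(checks, { startPeriod, retries }) {
  let status = "starting";
  let failingStreak = 0;
  let hasPassedOnce = false;

  for (const check of checks) {
    if (check.ok) {
      // One passing check is enough to report healthy.
      status = "healthy";
      failingStreak = 0;
      hasPassedOnce = true;
      continue;
    }
    // Failures inside start_period are ignored until a check has passed once.
    if (!hasPassedOnce && check.at < startPeriod) continue;

    failingStreak += 1;
    if (failingStreak >= retries) status = "unhealthy";
  }
  return status;
}

// A service that needs about 20 seconds before it can answer.
const slowStart = [
  { at: 5, ok: false },
  { at: 10, ok: false },
  { at: 15, ok: false },
  { at: 20, ok: true },
];

// A generous start_period absorbs the slow startup: prints "healthy"
console.log(healthStatus(slowStart, { startPeriod: 20, retries: 3 }));
// With no grace period, the first three failures already count: prints "unhealthy"
console.log(healthStatus(slowStart.slice(0, 3), { startPeriod: 0, retries: 3 }));
```

In the second call the container is marked unhealthy at the 15-second mark, before it ever had a chance to pass, which is the false alarm the tip warns about.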
**Common Mistake:** Using `curl` in a health check inside an alpine-based image that does not include curl by default, causing the health check itself to fail with "command not found" rather than reporting the actual application status. Use `wget --spider`, which ships in alpine, or install curl explicitly in the Dockerfile.
**Security:** Health check endpoints like `/health` are sometimes left unauthenticated and exposed on the same port as the main application, accidentally leaking internal status details (DB connection strings, version numbers) to anyone who can reach the container. Keep health endpoints minimal: return only `200 OK` or `503`, nothing else.
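A minimal endpoint along those lines could look like the sketch below, which uses only Node's built-in `http` module. The `isReady` probe and the `dependenciesReady` flag are hypothetical stand-ins for whatever dependency checks the real application performs. Port 4000 is the one the `HEALTHCHECK` in this lab targets.

```javascript
const http = require("node:http");

// Hypothetical readiness probe. A real app would check its dependencies
// here (for example a Redis PING); this sketch only reads a flag.
let dependenciesReady = false;
async function isReady() {
  return dependenciesReady;
}

// Answers /health with a bare status code and an empty body.
async function healthHandler(req, res, probe = isReady) {
  if (req.url !== "/health") {
    res.writeHead(404);
    return res.end();
  }
  let ok = false;
  try {
    ok = await probe();
  } catch {
    ok = false; // swallow the error so its message never reaches the caller
  }
  res.writeHead(ok ? 200 : 503);
  res.end();
}

const server = http.createServer((req, res) => healthHandler(req, res));
// Uncomment to serve on the port the HEALTHCHECK in this lab expects:
// server.listen(4000);
```

Because the handler writes no body and catches probe errors, a failing dependency shows up as a plain `503` rather than as a stack trace or connection string.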