Writing Production-Grade Dockerfiles — Layers, Caching, and Best Practices | DevOps Network

Overview and What You Will Learn

Most engineers write their first Dockerfile by copying an example from the internet and adjusting it until it works. The result is often a 2GB image that takes 8 minutes to build, runs as root, and copies the host's entire node_modules directory into the image. It works, but it is slow, large, and insecure.

In this guide you will learn how to write Dockerfiles that build in under 60 seconds, produce images under 200MB, and run securely as a non-root user. You will understand the exact mechanics of layer caching — so you know why moving one instruction changes your build from 45 seconds to 8 minutes — and how to structure every Dockerfile for maximum cache reuse.

Why This Matters in Production

At Hotstar, 50+ Docker images are built on every code push. A poorly cached Dockerfile that takes 8 minutes to build adds 400+ minutes of developer wait time per day across the team. An image that is 2GB instead of 150MB adds 1850MB of data to pull before every deployment. These are not minor inconveniences; they are engineering productivity costs that compound daily.
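
The arithmetic behind those figures can be checked in a few lines of shell. This is a rough sketch: the image count, build time, and image sizes are the numbers quoted above, while the 100 MB/s pull throughput is a hypothetical value chosen only to turn megabytes into seconds.

```shell
#!/bin/sh
# Figures quoted in the text
image_count=50          # images built on every push
slow_build_min=8        # minutes for one poorly cached build
large_image_mb=2000     # 2GB image
small_image_mb=150      # 150MB image

# Assumption for illustration only: sustained pull throughput in MB/s
pull_mb_per_sec=100

total_build_min=$((image_count * slow_build_min))
extra_mb=$((large_image_mb - small_image_mb))
# Integer division, rounded up
extra_pull_sec=$(( (extra_mb + pull_mb_per_sec - 1) / pull_mb_per_sec ))

echo "Build minutes for ${image_count} slow builds: ${total_build_min}"
echo "Extra data per pull: ${extra_mb} MB"
echo "Extra pull time at ${pull_mb_per_sec} MB/s: about ${extra_pull_sec} s"
```

Changing `pull_mb_per_sec` to match your own registry link gives a more realistic per-deployment cost.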

Core Principles

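Each Dockerfile instruction is a build step whose result Docker can cache. RUN, COPY, and ADD add filesystem layers to the image, while instructions such as ENV, EXPOSE, and CMD only record metadata; for simplicity this guide numbers every step as a layer. Understanding when that cache is invalidated is the most important skill for writing fast Dockerfiles.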

TEXT

+------------------------------------------+
| FROM node:20-alpine                      | <- Layer 1: base image
+------------------------------------------+
| WORKDIR /app                             | <- Layer 2: set working dir
+------------------------------------------+
| COPY package.json package-lock.json ./   | <- Layer 3: just dependency files
+------------------------------------------+
| RUN npm install                          | <- Layer 4: install deps (CACHED)
+------------------------------------------+
| COPY . .                                 | <- Layer 5: your source code
+------------------------------------------+
| RUN npm run build                        | <- Layer 6: build your app
+------------------------------------------+
 
Cache invalidation rule:
If a layer changes, ALL layers after it (below it in this diagram) are invalidated.
 
If you change your source code (Layer 5 changes):
  * Layer 1: cache HIT (FROM unchanged)
  * Layer 2: cache HIT (WORKDIR unchanged)
  * Layer 3: cache HIT (package.json unchanged)
  * Layer 4: cache HIT (instruction and inputs unchanged, so the slow npm install step is skipped)
  * Layer 5: cache MISS (source changed)
  * Layer 6: cache MISS (must rebuild)
 
If you put COPY . . before npm install:
  * Every source file change invalidates npm install
  * Every build takes 3-5 minutes instead of 45 seconds
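
The cascade can be modelled in a few lines of shell. This is a simplified sketch, not Docker's implementation: `cache_report` is a hypothetical helper that takes the number of build steps and the first step whose inputs changed, then marks that step and every later one as a miss.

```shell
#!/bin/sh
# Print HIT or MISS for each of N build steps, given the first step whose
# inputs changed. Steps before it reuse the cache; that step and every later
# step rebuild. Pass 0 as the second argument when nothing changed.
cache_report() {
  total_steps=$1
  first_changed=$2
  step=1
  while [ "$step" -le "$total_steps" ]; do
    if [ "$first_changed" -gt 0 ] && [ "$step" -ge "$first_changed" ]; then
      echo "Layer $step: MISS"
    else
      echo "Layer $step: HIT"
    fi
    step=$((step + 1))
  done
}

# Source code changed: COPY . . is step 5 of 6
cache_report 6 5
```

Calling `cache_report 6 3` instead models a change to package.json: only the first two steps stay cached, which is why dependency files are copied as early and as narrowly as possible.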

Detailed Step-by-Step Practical Lab

Milestone 1: Dockerfile Instruction Reference

Every Dockerfile instruction and when to use it:

Dockerfile

# FROM — The base image. Always pin to a specific version, never use :latest
FROM node:20-alpine
# Good: node:20-alpine, node:20-alpine3.18, node:20.10.0-alpine3.18
# Bad:  node:latest, node:alpine (alpine tag changes over time)
 
# WORKDIR — Set the working directory inside the container
WORKDIR /app
# Creates the directory if it does not exist
# All subsequent RUN, COPY, CMD use this as the base path
# Always use an absolute path
 
# COPY — Copy files from build context into the image
COPY package.json package-lock.json ./
# Copies specific files — cache-friendly
COPY src/ ./src/
# Copies a directory
COPY . .
# Copies everything not in .dockerignore — put this LAST
 
# ADD — Like COPY but also extracts tar files and allows URLs
ADD https://example.com/file.tar.gz /tmp/
# Avoid ADD unless you specifically need tar extraction or URL fetching
# COPY is more explicit and predictable
 
# RUN — Execute a command during the build
RUN npm install
# Each RUN creates a new layer
# Chain commands to avoid extra layers:
RUN apt-get update && \
    apt-get install -y curl git && \
    rm -rf /var/lib/apt/lists/*
# Critical: clean apt cache in the same RUN instruction
# If you clean in a separate RUN, the cache is still in the previous layer
 
# ENV — Set environment variables baked into the image
ENV NODE_ENV=production
ENV PORT=8080
# These are available at both build time and runtime
# Do not use ENV for secrets — they are visible in docker history
 
# ARG — Build-time variable (not available at runtime)
ARG BUILD_VERSION=unknown
# Pass at build time: docker build --build-arg BUILD_VERSION=v1.0.0 .
# Good for: version numbers, build metadata
# Not for: secrets (they appear in docker history)
 
# EXPOSE — Documents which ports the container listens on
EXPOSE 8080
# Does NOT actually publish the port
# Is purely documentation — helpful for engineers reading the Dockerfile
# You still need -p 8080:8080 in docker run to publish it
 
# CMD — The default command when the container starts
CMD ["node", "dist/server.js"]
# Use exec form (JSON array) — NOT shell form
# Shell form: CMD node dist/server.js
# Exec form: CMD ["node", "dist/server.js"]
# Why exec form: receives signals directly, SIGTERM works for graceful shutdown
 
# ENTRYPOINT — The fixed executable the container always runs
ENTRYPOINT ["node"]
CMD ["dist/server.js"]
# With both: ENTRYPOINT is the binary, CMD is the default argument
# docker run myimage -> runs: node dist/server.js
# docker run myimage dist/other.js -> runs: node dist/other.js
# Entrypoint is overridable with --entrypoint flag
 
# USER — Run the container as a non-root user
USER node
# Set this after installing dependencies (which often need root)
# Before CMD/ENTRYPOINT — the application runs as this user
 
# HEALTHCHECK — Tell Docker how to check if the container is healthy
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
# curl is not included in node:20-alpine; install it or use BusyBox wget instead
 
# LABEL — Add metadata to the image
LABEL maintainer="platform@razorpay.com"
LABEL version="v3.1.0"
LABEL description="Payment API service"

Milestone 2: Layer Caching — The Most Important Concept

Dockerfile

# BAD Dockerfile — every code change rebuilds npm install
FROM node:20-alpine
WORKDIR /app
# Copies everything, including source code
COPY . .
# Cache busted every time ANY file changes
RUN npm install
RUN npm run build
EXPOSE 8080
CMD ["node", "dist/server.js"]
 
# GOOD Dockerfile — npm install is cached unless package.json changes
FROM node:20-alpine
WORKDIR /app
# Copy ONLY dependency files first
COPY package.json package-lock.json ./
# This layer is cached until package.json or package-lock.json changes
RUN npm install
# Copy source code AFTER installing deps
COPY . .
RUN npm run build
EXPOSE 8080
CMD ["node", "dist/server.js"]
 
# Time difference on a typical Node.js app:
# BAD:  every build = 3-5 minutes (npm install runs every time)
# GOOD: code-only changes = 20-30 seconds (npm install cached)
#       dependency changes = 3-5 minutes (expected — new packages)

Cache invalidation rules:

Bash

# Rule 1: If the instruction text changes, cache is busted
# "RUN npm install" -> "RUN npm install --verbose" = cache bust
 
# Rule 2: For COPY/ADD, if any copied file changes, cache is busted
# COPY package.json ./ -> if package.json changes, cache busted
# COPY . . -> if ANY file changes, cache busted
 
# Rule 3: Everything below a cache miss is also a miss
# Layer 3 busts -> Layer 4, 5, 6 all rebuild regardless of their content
 
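# See which steps hit the cache during a build
# (the sample output below uses the classic builder's format; BuildKit marks reused steps with CACHED instead)
docker build --progress=plain .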
# step 3/8 : COPY package.json package-lock.json ./
#  ---> Using cache                                   <- cache HIT
# step 4/8 : RUN npm install
#  ---> Using cache                                   <- cache HIT (deps unchanged)
# step 5/8 : COPY . .
#  ---> a84f9c2b1d3e                                  <- cache MISS (source changed)
# step 6/8 : RUN npm run build
#  ---> Running in b72c8a9f4e1d                       <- REBUILDING (cache miss cascade)
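
Rule 2 is easier to see with a concrete checksum. The sketch below mimics the idea only and is not Docker's actual algorithm: `copy_cache_key` is a hypothetical helper that checksums the contents of the files a COPY step would copy, and the file names and contents are made up for the demonstration. As in Docker's own cache check, file modification times play no part in the key.

```shell
#!/bin/sh
# Stand-in "cache key" for a COPY step: a checksum over the contents of the
# files that step copies.
copy_cache_key() {
  cat "$@" | cksum | cut -d ' ' -f 1
}

workdir=$(mktemp -d)
printf '{"name":"demo","version":"1.0.0"}\n' > "$workdir/package.json"
printf '{"lockfileVersion":3}\n' > "$workdir/package-lock.json"
printf 'console.log("v1")\n' > "$workdir/server.js"

key_before=$(copy_cache_key "$workdir/package.json" "$workdir/package-lock.json")

# Edit only application source: the key for the dependency COPY is unchanged
printf 'console.log("v2")\n' > "$workdir/server.js"
key_after_source_edit=$(copy_cache_key "$workdir/package.json" "$workdir/package-lock.json")

# Edit package.json: the key changes, so npm install would run again
printf '{"name":"demo","version":"1.0.1"}\n' > "$workdir/package.json"
key_after_dep_edit=$(copy_cache_key "$workdir/package.json" "$workdir/package-lock.json")

if [ "$key_before" = "$key_after_source_edit" ]; then
  echo "source edit: dependency layer cache HIT"
fi
if [ "$key_before" != "$key_after_dep_edit" ]; then
  echo "package.json edit: dependency layer cache MISS"
fi

rm -rf "$workdir"
```

A `COPY . .` step would include server.js in its key, which is why any source edit invalidates it.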

Milestone 3: The .dockerignore File

Without .dockerignore, docker build sends your entire project directory to the Docker daemon as the build context, and COPY . . then copies all of it into the image. That includes node_modules (often 1GB+), the .git directory (often hundreds of MB), build artifacts, and secrets.

TEXT

# .dockerignore — put this in the same directory as your Dockerfile
 
# Node.js
# Never copy node_modules; install inside the container instead
node_modules/
npm-debug.log
.npm
 
# Build output (usually copied from a build stage instead)
dist/
build/
.next/
out/
 
# Version control
.git/
.gitignore
 
# Environment files — NEVER copy into images
.env
.env.*
# Allow the example file (it has no real secrets)
!.env.example
 
# Development tools
.vscode/
.idea/
**/*.swp
 
# Testing
coverage/
.nyc_output
# A leading **/ matches at any depth; a bare pattern matches only at the context root
**/__tests__/
**/*.test.ts
**/*.spec.ts
 
# Documentation
docs/
*.md
# Allow README if your image serves docs
!README.md
 
# Docker files themselves
Dockerfile
Dockerfile.*
docker-compose.yml
docker-compose.*.yml
 
# macOS
**/.DS_Store

Bash

# Measure the impact of .dockerignore
# Before adding .dockerignore:
docker build .
# Sending build context to Docker daemon  890.5MB  <- 890MB sent
 
# After adding .dockerignore:
docker build .
# Sending build context to Docker daemon  2.3MB    <- 2.3MB sent
 
# 390x reduction in build context = dramatically faster builds
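
You can estimate the saving before touching Docker at all. The sketch below is an approximation under stated assumptions: `context_bytes` is a hypothetical helper, the exclusion is a plain regular expression rather than real .dockerignore matching, and the project tree is a throwaway directory with made-up file sizes.

```shell
#!/bin/sh
# Sum the bytes of all regular files under a directory, skipping any path
# that matches an optional extended regex. The regex is a crude stand-in for
# .dockerignore; real ignore patterns are richer than this.
context_bytes() {
  dir=$1
  exclude='^$'                      # default: exclude nothing
  if [ $# -ge 2 ]; then exclude=$2; fi
  (
    cd "$dir" || exit 1
    find . -type f | grep -Ev "$exclude" | while IFS= read -r path; do
      wc -c < "$path"
    done
  ) | awk '{ total += $1 } END { print total + 0 }'
}

# Throwaway project: 1000 bytes of dependencies, 500 of git data, 20 of source
project=$(mktemp -d)
mkdir -p "$project/node_modules/pkg" "$project/.git" "$project/src"
printf '%1000s' '' > "$project/node_modules/pkg/index.js"
printf '%500s' '' > "$project/.git/packfile"
printf '%20s' '' > "$project/src/server.js"

echo "Without ignore rules: $(context_bytes "$project") bytes"
echo "Ignoring node_modules and .git: $(context_bytes "$project" '^\./(node_modules|\.git)/') bytes"

rm -rf "$project"
```

Pointing `context_bytes` at a real repository gives a quick before-and-after number to compare with the context size that docker build reports.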

Milestone 4: Choosing the Right Base Image

The base image is the single biggest factor in image size and security.

Bash

# Compare sizes for Node.js base images (approximate; exact sizes vary by release and platform):
docker pull node:20         # ~1.1GB — full Debian with build tools
docker pull node:20-slim    # ~220MB — Debian without build tools
docker pull node:20-alpine  # ~55MB  — Alpine Linux (musl libc)
 
docker images | grep node
# node    20          sha256:...  1.1GB
# node    20-slim     sha256:...  220MB
# node    20-alpine   sha256:...  55MB
 
# Compare CVE counts (run with Trivy):
trivy image node:20        # typically 200+ CVEs
trivy image node:20-alpine # typically 10-20 CVEs

Choosing which to use:

TEXT

Use node:20-alpine when:
  * Building final production images (smallest, fewest CVEs)
  * Your application has no native module dependencies
  * You can test that Alpine's musl libc works with your dependencies
 
Use node:20-slim when:
  * Your app uses native modules that require glibc (not compatible with Alpine musl)
  * You need Debian tools but want a smaller image than full node:20
 
Use node:20 (full Debian) when:
  * You need native compilation tools (node-gyp, Python, gcc)
  * Usually only in the BUILD STAGE of a multi-stage build, not the final stage
 
Use distroless (gcr.io/distroless/nodejs20-debian12) when:
  * Maximum security: no shell, no package manager, no OS utilities
  * Only the runtime and your app binary
  * Smallest attack surface possible
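
The guidance above can be condensed into a small decision helper for the final stage. Treat it as one possible reading rather than a fixed rule: `suggest_final_image` is a hypothetical function, and checking the "no shell needed" case first is an assumption about priorities that your team may order differently.

```shell
#!/bin/sh
# Suggest a final-stage Node.js base image from two yes/no answers.
#   $1 = yes if native modules require glibc at runtime
#   $2 = yes if the final image needs a shell or package manager
suggest_final_image() {
  needs_glibc=$1
  needs_shell=$2
  if [ "$needs_shell" = no ]; then
    # Distroless: runtime only, no shell or package manager
    echo "gcr.io/distroless/nodejs20-debian12"
  elif [ "$needs_glibc" = yes ]; then
    echo "node:20-slim"
  else
    echo "node:20-alpine"
  fi
}

suggest_final_image no yes
suggest_final_image yes yes
suggest_final_image yes no
```

The full node:20 image is deliberately absent from the helper, since the text reserves it for the build stage of a multi-stage build.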

Milestone 5: CMD vs ENTRYPOINT — Getting It Right

This is one of the most misunderstood parts of Dockerfiles:

Dockerfile

# ENTRYPOINT — the fixed executable, always runs
# CMD — the default arguments to ENTRYPOINT, can be overridden
 
# Pattern 1: CMD only (most common for apps)
CMD ["node", "dist/server.js"]
# docker run myimage             -> node dist/server.js
# docker run myimage bash        -> bash (CMD overridden)
 
# Pattern 2: ENTRYPOINT + CMD (good for tools)
ENTRYPOINT ["node"]
CMD ["dist/server.js"]
# docker run myimage                -> node dist/server.js
# docker run myimage dist/other.js  -> node dist/other.js (CMD overridden)
# docker run --entrypoint bash myimage -> bash (ENTRYPOINT overridden)
 
# Pattern 3: ENTRYPOINT only (strict tools)
ENTRYPOINT ["node", "dist/server.js"]
# docker run myimage  -> node dist/server.js
# Arguments passed to docker run are APPENDED (not common use case)
 
# ALWAYS use exec form (JSON array) not shell form:
# Shell form:  CMD node dist/server.js
#   -> runs as: /bin/sh -c "node dist/server.js"
#   -> sh is PID 1 and node runs as its child
#   -> SIGTERM goes to sh, not to node — graceful shutdown breaks
 
# Exec form:   CMD ["node", "dist/server.js"]
#   -> runs as: node dist/server.js directly
#   -> node is PID 1 inside the container
#   -> SIGTERM goes directly to node — graceful shutdown works
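
The same parent and child relationship can be observed outside Docker. The sketch below uses two hypothetical helper functions that each print a pair of process IDs; it only illustrates how `exec` lets a command take over the shell's PID, which is what an entrypoint script relies on when it ends with `exec "$@"`.

```shell
#!/bin/sh
# Each function prints two PIDs: the outer shell's, then the PID of the
# command that shell starts.

# Without exec the command runs as a child process with its own PID.
# The trailing ":" keeps the child from being the last command, because some
# shells silently exec the final command of a -c string.
pids_without_exec() {
  sh -c 'echo $$; sh -c "echo \$\$"; :'
}

# With exec the command replaces the shell and keeps its PID, so signals
# sent to that PID reach the command directly.
pids_with_exec() {
  sh -c 'echo $$; exec sh -c "echo \$\$"'
}

echo "without exec: $(pids_without_exec | tr '\n' ' ')"
echo "with exec:    $(pids_with_exec | tr '\n' ' ')"
```

The first line shows two different PIDs and the second shows the same PID twice. Inside a container the outer shell would be PID 1, which is the process that receives SIGTERM from docker stop.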

Milestone 6: A Complete Production-Ready Dockerfile

Combining everything:

Dockerfile

# ---- Base Stage ----
FROM node:20-alpine AS base
WORKDIR /app
# Install only production OS dependencies
RUN apk add --no-cache tini
# tini is a minimal init system — handles zombie processes and signal forwarding
 
# ---- Dependencies Stage ----
FROM base AS deps
# Copy only what is needed for npm install
COPY package.json package-lock.json ./
# Install production dependencies only
RUN npm ci --omit=dev
# npm ci is faster and more deterministic than npm install in CI/production
 
# ---- Build Stage ----
FROM base AS builder
# Copy all dependencies (including devDependencies for build tools)
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
 
# ---- Production Stage ----
FROM base AS production
# Set production environment
ENV NODE_ENV=production
ENV PORT=8080
 
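# Create a non-root user and give it the (still empty) app directory
RUN addgroup -S appgroup && \
    adduser -S appuser -G appgroup && \
    chown appuser:appgroup /app

# Copy only the production node_modules (no devDependencies)
# --chown sets ownership during the copy; a later chown -R would store every file a second time in a new layer
COPY --from=deps --chown=appuser:appgroup /app/node_modules ./node_modules

# Copy only the build output (not source code)
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist

# Run as the non-root user from here on
USER appuser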
 
# Document the port
EXPOSE 8080
 
# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
 
# Use tini as the init system, then start the app
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/server.js"]
 
# Labels for traceability
LABEL maintainer="platform@swiggy.com"
LABEL version="production"

Common Mistakes

| Mistake | Cost | Fix |
| --- | --- | --- |
| `COPY . .` before `RUN npm install` | Every code change rebuilds dependencies | Copy `package.json` and `package-lock.json` first, run the install, then copy source |
| No `.dockerignore` file | Sends the whole project directory to the daemon on every build | Create `.dockerignore` as the first file in a new project |
| Using `latest` tag for base image | Builds break unpredictably when the base image updates | Pin to a specific version: `node:20.10.0-alpine3.18` |
| Running as root | A compromised process has root privileges inside the container | Add a `USER` instruction before `CMD` |
| Shell form for `CMD` | SIGTERM does not reach the app, so graceful shutdown breaks | Use exec form: `CMD ["node", "dist/server.js"]` |
| Installing dev dependencies in production | Larger image and larger attack surface | Use `npm ci --omit=dev` or multi-stage builds |
| Cleaning apt cache in separate `RUN` | Cache is still stored in the previous layer | Clean in the same `RUN`: `apt-get install && rm -rf /var/lib/apt/lists/*` |
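
Several of these mistakes are simple enough to catch with text patterns. The following is a minimal sketch, not a replacement for a real linter: `lint_dockerfile` is a hypothetical helper, its patterns only cover the npm-based examples in this guide, and it does not parse Dockerfile syntax such as line continuations.

```shell
#!/bin/sh
# Print one warning per common mistake found in a Dockerfile.
lint_dockerfile() {
  file=$1
  # Base image tagged :latest
  grep -Eiq '^FROM[[:space:]]+[^[:space:]]+:latest([[:space:]]|$)' "$file" \
    && echo "WARN: base image uses :latest"
  # No USER instruction anywhere
  grep -Eiq '^USER[[:space:]]' "$file" \
    || echo "WARN: no USER instruction, container runs as root"
  # CMD or ENTRYPOINT whose argument does not start with a JSON array
  grep -Ei '^(CMD|ENTRYPOINT)[[:space:]]' "$file" \
    | grep -Evq '^[A-Za-z]+[[:space:]]+\[' \
    && echo "WARN: shell form CMD or ENTRYPOINT"
  # COPY . . appearing before the first npm install or npm ci
  awk 'toupper($1) == "COPY" && $2 == "." && $3 == "." && !copy { copy = NR }
       toupper($1) == "RUN" && $2 == "npm" && ($3 == "install" || $3 == "ci") && !inst { inst = NR }
       END { exit !(copy && inst && copy < inst) }' "$file" \
    && echo "WARN: COPY . . runs before npm install"
  # ADD used at all
  grep -Eiq '^ADD[[:space:]]' "$file" \
    && echo "WARN: ADD used, prefer COPY for local files"
  return 0
}

# Demonstration on a deliberately bad Dockerfile
bad=$(mktemp)
cat > "$bad" <<'EOF'
FROM node:latest
WORKDIR /app
COPY . .
RUN npm install
CMD node dist/server.js
EOF
lint_dockerfile "$bad"
rm -f "$bad"
```

Running it against the bad example prints four warnings. A check like this fits well as a pre-commit hook, with a dedicated linter taking over once the team outgrows pattern matching.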

Troubleshooting Reference

| Problem | Symptom | Fix |
| --- | --- | --- |
| Slow builds even with unchanged dependencies | `npm install` running every build | Copy `package.json` and run `npm install` before `COPY . .` |
| Large image size | Image over 1GB for a simple app | Check `docker history <image>` for large layers, use multi-stage builds |
| Container crashes immediately | Exit code 127 | The `CMD` binary does not exist in the image; check the path |
| SIGTERM not handled | Container takes 10 seconds to stop (the default `docker stop` timeout) | Use exec form for `CMD`, add a signal handler in your application |
| Build context too large | `Sending build context... 890MB` | Add a `.dockerignore` file to exclude node_modules, .git, build artifacts |
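
Exit codes in the table follow ordinary shell conventions, so they can be reproduced without a container. In the sketch below `explain_exit_code` is a hypothetical helper and the command name is deliberately made up; the 137 entry corresponds to the SIGKILL that docker stop sends once its timeout expires.

```shell
#!/bin/sh
# Map an exit code to its usual meaning. Codes above 128 follow the shell
# convention of 128 plus the signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (128 + 9)" ;;
    143) echo "terminated by SIGTERM (128 + 15)" ;;
    *)   echo "application-specific exit code" ;;
  esac
}

# Reproduce exit code 127 by running a command that does not exist
status=0
sh -c 'no-such-binary-for-demo' 2>/dev/null || status=$?
echo "exit code $status: $(explain_exit_code "$status")"
```

Comparing the code reported by `docker ps -a` against this list is usually the quickest first step when a container stops unexpectedly.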

**Tip:** Use `docker build --progress=plain .` to see the detailed build output including exactly which layers are cache hits and which are rebuilding. This is the fastest way to understand why your build is slower than expected.

**Remember:** Every `RUN apt-get install` must clean the package manager cache in the same instruction: `RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*`. If you clean in a separate RUN instruction, the cache bytes are still stored in the previous layer and the cleanup has no effect on image size.

**Common Mistake:** Using `ADD` instead of `COPY` for copying local files. `ADD` has two extra behaviours — extracting tar archives and fetching URLs — that make it unpredictable when you just want to copy files. Always use `COPY` for local files unless you specifically need tar extraction. Save `ADD` for the rare case where you genuinely want auto-extraction.

**Security:** Never use `ENV` to set secrets in a Dockerfile. Environment variables set with `ENV` are baked into the image and visible to anyone who runs `docker inspect` or `docker history` on the image. Use BuildKit secret mounts (`--mount=type=secret`) for build-time secrets, and pass runtime secrets through environment variables at `docker run` time from a secure secret store like AWS Secrets Manager.
