Overview and What You Will Learn
A production Docker CI/CD pipeline does four things in sequence: it builds the image, scans it for vulnerabilities, pushes it to a registry, and deploys it. Skipping the scan step means you can ship known CVEs to production without noticing. Skipping proper tagging means you cannot roll back reliably. Getting the order wrong means you might deploy a broken image before the tests have run.
This lab walks through a complete, real GitHub Actions workflow that covers all four stages correctly.
By the end of this lab you will:
- Write a multi-stage Dockerfile that produces a minimal production image
- Build and tag images with both a git commit SHA and a `latest` tag
- Scan images for vulnerabilities using Trivy and fail the pipeline on HIGH/CRITICAL findings
- Push to AWS ECR with proper authentication
- Deploy to an EC2 instance using SSH with zero manual steps
- Use Docker layer caching in GitHub Actions to keep build times fast
Why This Matters in Production
At Razorpay, every backend service image goes through an automated Trivy scan in CI before it can be pushed to ECR. This is a hard gate — a HIGH or CRITICAL CVE in a base image or dependency fails the pipeline and blocks the deploy. The team found a critical OpenSSL vulnerability in a node:18 base image through this scan three days before it was publicly announced, giving them time to upgrade before it became an incident.
The commit SHA tag on every image is equally important. When a deploy causes a regression, rolling back means pointing the service at the previous SHA tag, pulling that image, and recreating the container. Without immutable tags, rollbacks become guesswork.
Core Principles
Complete CI/CD pipeline stage order:
```text
+------------------------------------------+
| Push to main / PR opened                 |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Run unit + integration tests             |  <- fail fast before building
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| docker build (multi-stage, layer cache)  |  <- produces minimal prod image
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| trivy image --exit-code 1 (scan)         |  <- blocks on HIGH/CRITICAL CVEs
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| docker push to ECR (SHA + latest tag)    |  <- immutable SHA tag for rollback
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| SSH deploy to EC2 / update compose file  |  <- only runs on main branch
+------------------------------------------+
```

Detailed Step-by-Step Practical Lab
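Before the milestones, it may help to see the stage order from Core Principles as a fail-fast chain: each stage gates the next, so a failed test or scan means nothing gets pushed or deployed. The sketch below mimics that behaviour locally. The function name `run_stages` and the idea of passing stages as command strings are illustrative assumptions; in the lab itself each stage is a GitHub Actions step or job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each stage command in order and stop at the
# first failure, so later stages (push, deploy) never run after a failed
# test or scan.
run_stages() {
  local stage
  for stage in "$@"; do
    echo "running: $stage"
    # Each argument is executed as one shell command line.
    if ! sh -c "$stage"; then
      echo "stage failed: $stage" >&2
      return 1
    fi
  done
  echo "all stages passed"
}
```

It could be called, for example, as `run_stages "npm test" "docker build -t app:dev ." "trivy image --exit-code 1 --severity HIGH,CRITICAL app:dev"`, where `app:dev` is a placeholder image name.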
Milestone 1: Write a multi-stage production Dockerfile
```dockerfile
# Dockerfile

# Stage 1: build. Installs all deps and compiles.
FROM node:20-alpine AS builder
WORKDIR /app

COPY package*.json ./
# Install all deps, including devDependencies, for the build step
RUN npm ci

COPY . .
# Compile TypeScript to JavaScript
RUN npm run build

# Stage 2: production. Minimal image, no dev tools, no source.
FROM node:20-alpine AS production
WORKDIR /app

# Run as non-root user for security
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

COPY package*.json ./
# Install only production dependencies
# (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev && npm cache clean --force

# Copy compiled output from builder stage only
COPY --from=builder /app/dist ./dist

USER appuser

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1

EXPOSE 3000
CMD ["node", "dist/server.js"]
```

Milestone 2: Write the complete GitHub Actions workflow
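The workflow in this milestone tags each build twice: once with the short commit SHA and once with `latest`. Here is a minimal sketch of how those two references are composed, so the step can be tried outside CI. The function name `image_refs` is hypothetical; the argument order (registry, repository, SHA) is an assumption made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the two image references for a build.
# $1: registry host, $2: repository name, $3: commit SHA (short or full)
image_refs() {
  local registry="$1" repository="$2" sha="$3"
  # Immutable reference, used for deploys and rollbacks.
  echo "${registry}/${repository}:${sha}"
  # Moving reference, overwritten on every release.
  echo "${registry}/${repository}:latest"
}
```

With the lab's values this would be invoked as `image_refs 123456789012.dkr.ecr.ap-south-1.amazonaws.com razorpay-api "$(git rev-parse --short HEAD)"`.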
```yaml
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  # AWS account and region (example values); credentials are stored as GitHub repo secrets
  AWS_REGION: ap-south-1
  ECR_REGISTRY: 123456789012.dkr.ecr.ap-south-1.amazonaws.com
  ECR_REPOSITORY: razorpay-api

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests
        run: npm test

  build-scan-push:
    needs: test
    runs-on: ubuntu-latest
    # Only build and push on main branch, not on PRs
    if: github.ref == 'refs/heads/main'
    outputs:
      # Pass the image tag to the deploy job
      image-tag: ${{ steps.meta.outputs.sha-tag }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Generate image tags
        id: meta
        run: |
          # Short commit SHA for immutable rollback tag
          SHA_TAG=$(git rev-parse --short HEAD)
          echo "sha-tag=$SHA_TAG" >> "$GITHUB_OUTPUT"
          echo "IMAGE_SHA=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:$SHA_TAG" >> "$GITHUB_ENV"
          echo "IMAGE_LATEST=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:latest" >> "$GITHUB_ENV"

      - name: Build Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          # Do not push yet; scan first
          push: false
          # Tag with both SHA and latest
          tags: |
            ${{ env.IMAGE_SHA }}
            ${{ env.IMAGE_LATEST }}
          # Use GitHub Actions cache to speed up repeated builds
          cache-from: type=gha
          cache-to: type=gha,mode=max
          # Load into local daemon for Trivy to scan
          load: true

      - name: Scan image with Trivy
        # Consider pinning to a release tag or commit SHA instead of master
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.IMAGE_SHA }}
          format: 'table'
          # Fail the pipeline if HIGH or CRITICAL CVEs are found
          exit-code: '1'
          severity: 'HIGH,CRITICAL'
          # Ignore vulnerabilities with no fix available
          ignore-unfixed: true

      - name: Push image to ECR
        # Only runs if Trivy scan passed (exit-code 0)
        run: |
          docker push ${{ env.IMAGE_SHA }}
          docker push ${{ env.IMAGE_LATEST }}

  deploy:
    needs: build-scan-push
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ubuntu
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            # Authenticate ECR on the EC2 host
            aws ecr get-login-password --region ap-south-1 | \
              docker login --username AWS --password-stdin \
              123456789012.dkr.ecr.ap-south-1.amazonaws.com

            # Pull the exact SHA-tagged image
            docker pull ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ needs.build-scan-push.outputs.image-tag }}

            # Update the image tag in the .env file
            sed -i 's/IMAGE_TAG=.*/IMAGE_TAG=${{ needs.build-scan-push.outputs.image-tag }}/' /opt/razorpay-api/.env

            # Recreate only the api service, leaving its dependencies running
            cd /opt/razorpay-api
            docker compose up -d --no-deps --pull always api
```

Milestone 3: Trivy scan options explained
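Conceptually, the Trivy step in the workflow fails the job when at least one reported finding has a severity listed in `severity`. The sketch below imitates that decision on a plain list of severity labels. It illustrates the gate logic only and is not Trivy's implementation; `gate_exit_code` is a hypothetical name, and the `ignore-unfixed` filter is left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of a severity gate (not Trivy's actual code).
# $1: comma-separated blocking severities, e.g. "HIGH,CRITICAL"
# remaining args: the severity label of each finding in a scan result
gate_exit_code() {
  local blocking=",$1,"
  shift
  local finding
  for finding in "$@"; do
    case "$blocking" in
      # A finding at a blocking severity fails the gate.
      *",$finding,"*) echo 1; return ;;
    esac
  done
  # Nothing at a blocking severity: the gate passes.
  echo 0
}
```

For instance, `gate_exit_code HIGH,CRITICAL LOW MEDIUM` models a scan that found only LOW and MEDIUM issues, which the workflow's configuration would let through.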
```shell
# Run Trivy scan manually on a local image before pushing
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$HOME/.cache/trivy:/root/.cache/trivy" \
  aquasec/trivy:latest \
  image \
  --exit-code 1 \
  --severity HIGH,CRITICAL \
  --ignore-unfixed \
  razorpay-api:latest

# Output formats: table (human), json (for tooling), sarif (for GitHub Security tab)
trivy image --format json -o trivy-results.json razorpay-api:latest
```

Milestone 4: Rollback using SHA tags
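Both the deploy step and the rollback in this milestone come down to rewriting the `IMAGE_TAG=` line in `.env`. A sketch of that edit wrapped in a function is shown here, assuming the tag contains no `/` or `&` characters (true for git SHAs). The name `set_image_tag` is hypothetical, and a temporary file is used instead of `sed -i` so the same code also runs on systems without GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point the compose .env file at a specific image tag.
# $1: path to the .env file, $2: tag to deploy (for example a short SHA)
set_image_tag() {
  local env_file="$1" tag="$2"
  # Refuse to continue if the file has no IMAGE_TAG line to rewrite.
  grep -q '^IMAGE_TAG=' "$env_file" || {
    echo "no IMAGE_TAG line in $env_file" >&2
    return 1
  }
  # Anchor on the line start so other variables are left untouched,
  # then swap the rewritten copy into place.
  sed "s/^IMAGE_TAG=.*/IMAGE_TAG=${tag}/" "$env_file" > "${env_file}.tmp" &&
    mv "${env_file}.tmp" "$env_file"
}
```

A rollback would then start with `set_image_tag /opt/razorpay-api/.env a3f8c12`, followed by the pull and recreate commands.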
```shell
# On the production EC2 host: roll back to the previous known-good SHA
cd /opt/razorpay-api

# Update IMAGE_TAG to the previous good commit SHA
sed -i 's/IMAGE_TAG=.*/IMAGE_TAG=a3f8c12/' .env

# Pull and recreate only the api container
docker compose pull api
docker compose up -d --no-deps api

# Verify the container is running the correct image
docker inspect razorpay-api-api-1 --format '{{.Config.Image}}'
```

Production Best Practices and Common Pitfalls
| Scenario | Wrong | Correct |
|---|---|---|
| Image tagging | Only `latest` tag | SHA tag + `latest`; the SHA tag enables rollback |
| Vulnerability scan | Skip scan to save time | Hard gate with --exit-code 1 on HIGH/CRITICAL |
| Deploy trigger | Deploy on every PR | Deploy only on push to main after tests pass |
| Secrets in workflow | Hardcoded in YAML | GitHub repo secrets via ${{ secrets.NAME }} |
| Rollback | Redeploy from source | Pull previous SHA-tagged image, recreate container |
| Layer caching | No cache in CI | cache-from: type=gha with cache-to: type=gha,mode=max |
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| Build with specific tag | `docker build -t image:$(git rev-parse --short HEAD) .` |
| Scan local image | `trivy image --severity HIGH,CRITICAL <image>` |
| Login to ECR | `aws ecr get-login-password --region ap-south-1 \| docker login --username AWS --password-stdin <registry>` |
| Push to ECR | `docker push <registry>/<repo>:<tag>` |
| Redeploy one service without restarting its dependencies | `docker compose up -d --no-deps <service>` |
| Check running image tag | `docker inspect <container> --format '{{.Config.Image}}'` |
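The last row prints the full image reference. When only the tag is needed, for example to compare it against the SHA you expected to deploy, plain parameter expansion is enough. This is a sketch that assumes a tag-style reference without an `@sha256:` digest; `image_tag_of` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the tag portion of an image reference.
# Assumes no digest suffix. Prints "latest" when no tag is present,
# which is the tag Docker assumes by default.
image_tag_of() {
  local ref="$1"
  # Drop the registry and path first so a registry port (host:5000/...)
  # is not mistaken for a tag.
  local name="${ref##*/}"
  case "$name" in
    *:*) echo "${name##*:}" ;;
    *)   echo "latest" ;;
  esac
}
```

Combined with the table's command, `image_tag_of "$(docker inspect <container> --format '{{.Config.Image}}')"` would print just the deployed tag.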
**Tip:** Use `--no-deps` with `docker compose up -d` during deploy to recreate only the target service without touching its dependencies (postgres, redis). This avoids unnecessary database container restarts during API deployments.
**Remember:** The Trivy scan must run against the locally loaded image (`load: true` in build-push-action) *before* the push step. If you scan after pushing, a vulnerable image is already in your registry and may be pulled by other jobs that started in parallel.
**Common Mistake:** Using only the `latest` tag in production. If a bad deploy happens and you need to roll back, `latest` has already been overwritten. Always push a second immutable tag (git SHA or build number) so you have something concrete to roll back to.
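One way to enforce this in a deploy script is to reject moving tags before anything is pulled. The sketch below assumes deployable tags are git SHAs of 7 to 40 lowercase hex characters. Both the function name `is_sha_tag` and that naming convention are assumptions for illustration, not something the lab's workflow does.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only for tags that look like a git SHA
# (7 to 40 lowercase hex characters). "latest" and other moving tags fail.
is_sha_tag() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}
```

A deploy script could begin with `is_sha_tag "$IMAGE_TAG" || exit 1`, where `IMAGE_TAG` is whatever variable holds the requested tag.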
**Security:** The EC2 instance's IAM role should have the minimum permissions needed: `ecr:GetAuthorizationToken`, `ecr:BatchGetImage`, `ecr:GetDownloadUrlForLayer`. Do not attach `AdministratorAccess` or `AmazonEC2ContainerRegistryFullAccess` to a production instance role.
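A pull-only policy along those lines might look like the output of the sketch below. The function name `ecr_pull_policy` is hypothetical, and the region, account ID, and repository name in the usage line are the placeholder values from this lab. `ecr:GetAuthorizationToken` cannot be scoped to a single repository, so it sits in its own statement with a wildcard resource.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a minimal ECR pull policy for one repository.
# $1: AWS region, $2: AWS account ID, $3: repository name
ecr_pull_policy() {
  local region="$1" account="$2" repo="$3"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"],
      "Resource": "arn:aws:ecr:${region}:${account}:repository/${repo}"
    }
  ]
}
EOF
}
```

Running `ecr_pull_policy ap-south-1 123456789012 razorpay-api > pull-policy.json` would write a document that can be reviewed before it is attached to the instance role.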