Designing Multi-Stage CI/CD Pipelines — Build, Test, Scan, and Deploy | DevOps Network

Overview and What You Will Learn

A pipeline with one stage that runs everything sequentially is a starting point. A pipeline designed for production — with parallel execution, artifact promotion, security gates, and environment-specific deployment — is an engineering asset that accelerates the entire team.

By the end of this topic you will:

Design a complete four-stage pipeline with parallel job execution
Implement artifact promotion so the same image passes through all environments
Configure security scanning gates that block deployments on HIGH or CRITICAL vulnerabilities
Set up manual approval gates for production with notification workflows
Implement pipeline caching, targeting a 40-60% reduction in install and build time
Add failure notifications so the right people are alerted immediately

Why This Matters in Production

PhonePe processes millions of transactions daily. Their deployment pipeline must be fast enough to deploy fixes quickly but rigorous enough that no security vulnerability or regression reaches production. That requires deliberate pipeline design — not just adding steps as problems arise, but architecting the pipeline with a clear model for what each stage is responsible for and what gates protect each promotion step.

Core Principles

Four-stage production pipeline with parallel execution:

Bash

+------------------------------------------+
| TRIGGER: push to main or PR              |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| STAGE 1: BUILD (2-4 minutes)             |
| compile / npm build / docker build       |
| push to registry with git SHA tag        |
| Output: image digest for downstream jobs |
+------------------------------------------+
                    |
        +-----------+-----------+
        |           |           |
        v           v           v
+----------+ +----------+ +-----------+
| STAGE 2  | | STAGE 2  | | STAGE 2   |
| Unit Test| | Lint     | | Sec Scan  |
| 3 min    | | 1 min    | | 4 min     |
+----------+ +----------+ +-----------+
        |           |           |
        +-----all pass----------+
                    |
                    v
+------------------------------------------+
| STAGE 3: INTEGRATION TEST (5-8 minutes)  |
| real database, real cache                |
| API contract tests                       |
| Performance baseline check               |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| STAGE 4: DEPLOY                          |
| -> dev: automatic                        |
| -> staging: automatic (main branch only) |
| -> production: manual approval gate      |
+------------------------------------------+

Artifact promotion — the core discipline:

Bash

BAD: rebuild for each environment
  CI: docker build payment-api:test-build
  Staging: docker build payment-api:staging-build  <- different image!
  Production: docker build payment-api:prod-build  <- different image!
  Problem: staging tested a different image than production runs
 
GOOD: build once, promote the digest
  CI: docker build, push payment-api:abc1234
  Staging: deploy payment-api:abc1234  <- same image that was tested
  Production: deploy payment-api:abc1234  <- same image staging tested
  Result: what was tested is exactly what runs in production
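
One way to make "promote the digest" concrete is to publish the digest as a job output and have each deploy job reference the image by digest rather than by tag. The sketch below is a minimal illustration under stated assumptions, not the pipeline built later in the lab: `registry.example.com` is a placeholder registry, the registry login step is omitted, and the deploy step only prints the image reference.

```yaml
## Sketch only: registry.example.com is a placeholder, and registry login is omitted.
name: promote-by-digest
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      ## build-push-action reports the digest of the pushed image (sha256:...)
      image-digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Build and push once
        id: push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: registry.example.com/payment-api:${{ github.sha }}

  deploy-staging:
    needs: build          ## direct dependency, so needs.build.outputs is readable
    runs-on: ubuntu-latest
    env:
      IMAGE: registry.example.com/payment-api@${{ needs.build.outputs.image-digest }}
    steps:
      - name: Deploy the exact image that was built
        run: echo "Deploying $IMAGE"   ## replace with the real deploy command
```

A tag such as `abc1234` can be re-pointed at a different image, while a digest identifies one specific image, which is why the digest is the safer handle to promote.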

Detailed Step-by-Step Practical Lab

Milestone 1 — Design the stage structure

Before writing any YAML, design the pipeline on paper (or a whiteboard):

Bash

Questions to answer for each stage:
1. What is the single responsibility of this stage?
2. What are its inputs (from previous stages)?
3. What are its outputs (artifacts, signals)?
4. What is the failure behaviour (block or warn)?
5. How long should it take? (set a budget)
 
Stage 1 - Build:
  Responsibility: produce a test-ready artifact
  Input: source code from git
  Output: Docker image digest
  On failure: block all downstream stages
  Time budget: under 4 minutes
 
Stage 2 - Validate (parallel jobs):
  Responsibility: verify quality and security
  Input: source code + image digest from Stage 1
  Output: test results, scan report
  On failure: block deploy stages
  Time budget: under 5 minutes (parallel)
 
Stage 3 - Integration:
  Responsibility: verify service works end-to-end
  Input: image digest from Stage 1
  Output: integration test results
  On failure: block deploy stages
  Time budget: under 8 minutes
 
Stage 4 - Deploy:
  Responsibility: deliver to environments
  Input: image digest (same one from Stage 1)
  Output: running service per environment
  On failure: rollback, notify, alert
  Time budget: under 3 minutes per environment
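
The time budgets above can be enforced rather than just documented. A minimal sketch, assuming GitHub Actions and using `timeout-minutes` at the job level; the workflow name and the `echo` steps are placeholders for real build and test commands. A job that exceeds its limit is cancelled, so treat the value as a hard ceiling and consider leaving a little headroom above the target.

```yaml
## Sketch only: the echo steps stand in for real build and test commands.
name: stage-budgets
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 4      ## Stage 1 budget: under 4 minutes
    steps:
      - run: echo "build"

  unit-test:
    needs: build
    runs-on: ubuntu-latest
    timeout-minutes: 5      ## Stage 2 budget: under 5 minutes
    steps:
      - run: echo "unit tests"

  integration-test:
    needs: unit-test
    runs-on: ubuntu-latest
    timeout-minutes: 8      ## Stage 3 budget: under 8 minutes
    steps:
      - run: echo "integration tests"
```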

Milestone 2 — Implement parallel jobs in Stage 2

YAML

jobs:
  ## Stage 1 -- runs first
  build:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   ## required for OIDC role assumption
      contents: read    ## required for checkout
    outputs:
      image-digest: ${{ steps.push.outputs.digest }}
      image-tag: ${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr         ## exposes the registry hostname as steps.ecr.outputs.registry
      - name: Build and push
        id: push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/payment-api:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
 
  ## Stage 2 -- three jobs run IN PARALLEL after build
  unit-test:
    needs: build     ## wait for build only
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm test -- --reporter=junit --outputFile=junit.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: junit-results, path: junit.xml }
 
  lint:
    needs: build     ## parallel with unit-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint && npm run format:check
 
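  security-scan:
    needs: build     ## parallel with unit-test and lint
    runs-on: ubuntu-latest
    permissions:
      id-token: write   ## required for OIDC role assumption
      contents: read
    steps:
      ## Trivy pulls the image from ECR, so this job needs its own registry login
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master   ## pin to a release tag or commit SHA in real pipelines
        with:
          image-ref: ${{ steps.ecr.outputs.registry }}/payment-api:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: '1'   ## fail pipeline on HIGH/CRITICAL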

Milestone 3 — Integration tests with service containers

YAML

  ## Stage 3 -- runs after ALL Stage 2 jobs pass
  integration-test:
    needs: [unit-test, lint, security-scan]
    runs-on: ubuntu-latest
 
    services:
      ## Spin up real PostgreSQL for integration tests
      postgres:
        image: postgres:15
        env:
          POSTGRES_DB: payment_test
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports:
          - 5432:5432   ## publish to the runner so localhost:5432 is reachable
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      ## Spin up real Redis
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379   ## publish to the runner so localhost:6379 is reachable
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-retries 5
    env:
      DATABASE_URL: postgresql://testuser:testpass@localhost:5432/payment_test
      REDIS_URL: redis://localhost:6379
 
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - name: Run database migrations
        run: npm run db:migrate
      - name: Run integration tests
        run: npm run test:integration

Milestone 4 — Caching for faster pipelines

YAML

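## Dependency cache: the npm download cache (~/.npm), not node_modules
  - uses: actions/setup-node@v4
    with:
      node-version: '20'
      cache: 'npm'  ## built-in npm caching
      ## Cache key: derived from the package-lock.json hash
      ## Cache hit: npm ci still runs, but restores packages from the local
      ## cache instead of the network (the 2-3 min saving is an estimate; measure your own)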
 
## Docker layer cache
  - uses: docker/build-push-action@v5
    with:
      cache-from: type=gha    ## read from GitHub Actions cache
      cache-to: type=gha,mode=max  ## write all layers to cache
      ## First run: full build (3 min)
      ## Subsequent runs with same base: 40 seconds
 
## Custom cache for other tools
  - uses: actions/cache@v4
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('requirements*.txt') }}
      restore-keys: |
        ${{ runner.os }}-pip-
## Measure cache impact:
## Check job duration before and after enabling cache
## Goal: 40-60% reduction in install/build time

Milestone 5 — Deploy with environment gates

YAML

  deploy-staging:
    needs: integration-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment:
      name: staging
      url: https://staging.payment.internal
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_STAGING_ROLE }}
          aws-region: ap-south-1
      - run: aws eks update-kubeconfig --name staging-cluster --region ap-south-1
      - name: Helm deploy to staging
        run: |
          helm upgrade --install payment-api ./charts/payment-api \
            --namespace payment-api-staging \
            --values ./charts/values-staging.yaml \
            --set image.tag=${{ github.sha }} \
            --atomic --timeout 5m --wait
      - name: Post-deploy smoke test
        run: |
          sleep 15
          curl -sf https://staging.payment.internal/health
 
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    ## Environment: production must have Required Reviewers set
    ## in GitHub Settings > Environments
    ## Pipeline PAUSES here until a reviewer approves
    environment:
      name: production
      url: https://api.payment.razorpay.com
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_PROD_ROLE }}
          aws-region: ap-south-1
      - run: aws eks update-kubeconfig --name prod-cluster --region ap-south-1
      - name: Helm deploy to production
        run: |
          helm upgrade --install payment-api ./charts/payment-api \
            --namespace payment-api-production \
            --values ./charts/values-production.yaml \
            --set image.tag=${{ github.sha }} \
            --atomic --timeout 10m --wait

Milestone 6 — Pipeline notifications and observability

YAML

  ## Run after all jobs -- notify regardless of outcome
  notify:
    runs-on: ubuntu-latest
    needs: [deploy-staging, deploy-production]
    if: always()   ## always run this job
    steps:
      - name: Determine status
        id: status
        run: |
          if [[ "${{ needs.deploy-production.result }}" == "success" ]]; then
            echo "color=#36a64f" >> $GITHUB_OUTPUT
            echo "status=SUCCESS" >> $GITHUB_OUTPUT
          elif [[ "${{ needs.deploy-staging.result }}" == "success" ]]; then
            echo "color=#ff9900" >> $GITHUB_OUTPUT
            echo "status=STAGING ONLY" >> $GITHUB_OUTPUT
          else
            echo "color=#ff0000" >> $GITHUB_OUTPUT
            echo "status=FAILED" >> $GITHUB_OUTPUT
          fi
      - name: Slack notification
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "attachments": [{
                "color": "${{ steps.status.outputs.color }}",
                "title": "Deployment ${{ steps.status.outputs.status }}",
                "fields": [
                  {"title": "Service", "value": "payment-api", "short": true},
                  {"title": "Commit", "value": "${{ github.sha }}", "short": true},
                  {"title": "Author", "value": "${{ github.actor }}", "short": true}
                ]
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Production Best Practices and Common Pitfalls

Scenario	Wrong	Correct
Stage ordering	Slow integration tests before fast unit tests	Fast tests first, slow tests later
Parallel jobs	All jobs run sequentially	Independent jobs run in parallel
Artifact passing	Rebuild image per environment	Build once, pass digest downstream
Cache strategy	No caching at all	Cache npm downloads and Docker layers
Failure notification	No alerts	Slack notification job with `if: always()` (or `if: failure()` for failure-only alerts)
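
For the failure notification row, a failure-only alert can be sketched with the `failure()` status function, which evaluates to true when any job listed in `needs` (or any of their ancestors) failed. This is a minimal illustration: the workflow and job names are placeholders, and the `echo` step stands in for a real Slack step.

```yaml
## Sketch only: job names are placeholders; the echo stands in for a Slack step.
name: failure-alert
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "tests"

  notify-failure:
    needs: test
    if: failure()          ## runs only when a job in the needs chain failed
    runs-on: ubuntu-latest
    steps:
      - run: echo "Pipeline failed at ${{ github.sha }}"
```

Compared with `if: always()`, this keeps the channel quiet on green runs, at the cost of not announcing successful deployments.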

Quick Reference and Troubleshooting Commands

Task	Command
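View pipeline YAML	`gh workflow view ci.yaml --yaml` (or open `.github/workflows/ci.yaml`)
Check job dependencies	`needs:` field in each job
Test workflow syntax	`actionlint .github/workflows/ci.yaml` (third-party linter; `gh workflow view` only displays a workflow, it does not validate it)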
Measure job durations	`gh run view RUN_ID --json jobs`
Check cache hit rate	Actions tab > job > Cache step output
Debug slow pipeline	Profile each job duration, optimise slowest

PLACEMENT PRO TIP
**Tip:** Use GitHub Actions `concurrency` to cancel stale pipeline runs. When an engineer pushes three commits quickly, you only care about the last one. `concurrency: { group: "${{ github.workflow }}-${{ github.ref }}", cancel-in-progress: true }` cancels in-progress runs when a new commit arrives on the same branch.
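
As a sketch of where the `concurrency` setting goes, the block below places it at the top level of a workflow so it applies to the whole run. The workflow name, triggers, and the single `build` job are placeholders.

```yaml
## Sketch only: name, triggers, and the build job are placeholders.
name: ci
on:
  push:
    branches: [main]
  pull_request:

## Workflow-level: one active run per workflow and ref; a newer run cancels the older one
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "build"
```

`concurrency` can also be set on an individual job. If a deploy should not be interrupted halfway through a rollout, scoping cancellation to the build and test jobs is one option to consider.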

REMEMBER THIS
**Remember:** The `needs` key in GitHub Actions defines a dependency DAG (Directed Acyclic Graph). A job starts as soon as every job it lists in `needs` has succeeded, so jobs that do not depend on one another, directly or transitively, run in parallel. Drawing this graph before writing YAML makes the parallelism structure clear.
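
The four-stage structure from this topic maps onto `needs` as in the skeleton below. It is a sketch with placeholder `echo` steps, intended only to show the shape of the graph: three Stage 2 jobs fan out from `build`, then fan back in at `integration-test`.

```yaml
## Sketch only: echo steps stand in for the real work of each stage.
name: dag-sketch
on: push

jobs:
  build:                       ## no needs: starts immediately
    runs-on: ubuntu-latest
    steps:
      - run: echo "build"

  unit-test:                   ## these three fan out from build
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "unit tests"

  lint:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint"

  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "scan"

  integration-test:            ## fan-in: waits for all three
    needs: [unit-test, lint, security-scan]
    runs-on: ubuntu-latest
    steps:
      - run: echo "integration tests"

  deploy:
    needs: integration-test
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy"
```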

COMMON MISTAKE / WARNING
**Security:** The security scan job must block deployment — not just report findings. Set `exit-code: '1'` in Trivy so the job fails on HIGH or CRITICAL findings. A security scan that posts results but does not block the pipeline gives engineers a false sense of security while shipping vulnerable images.

COMMON MISTAKE / WARNING
**Common Mistake:** Using `needs: [build, test, lint, scan]` on the deploy job when `test`, `lint`, and `scan` already declare `needs: build`. For ordering this is redundant: dependencies are transitive, so if `test` depends on `build`, deploy can never start before `build` finishes. List only the immediately preceding jobs, with one exception: a job can read `needs.<job>.outputs` only from jobs it lists directly, so keep `build` in `needs` if deploy consumes its image digest.
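
One caveat when trimming `needs`: the `needs` context exposes outputs only for jobs listed as direct dependencies. The sketch below shows a deploy job that lists `build` solely to read its output. The job names and the `sha256:placeholder` value are illustrative; a real build step would emit the pushed digest.

```yaml
## Sketch only: the digest value is a placeholder emitted by an echo step.
name: needs-outputs
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-digest: ${{ steps.meta.outputs.digest }}
    steps:
      - id: meta
        run: echo "digest=sha256:placeholder" >> "$GITHUB_OUTPUT"

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "tests"

  deploy:
    ## test provides the ordering; build is listed only because its output is read below
    needs: [build, test]
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying digest ${{ needs.build.outputs.image-digest }}"
```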
