Overview and What You Will Learn
A pipeline with one stage that runs everything sequentially is a starting point. A pipeline designed for production — with parallel execution, artifact promotion, security gates, and environment-specific deployment — is an engineering asset that accelerates the entire team.
By the end of this topic you will:
- Design a complete four-stage pipeline with parallel job execution
- Implement artifact promotion so the same image passes through all environments
- Configure security scanning gates that block deployments on HIGH or CRITICAL vulnerabilities
- Set up manual approval gates for production with notification workflows
- Implement pipeline caching to reduce build times by 40-60%
- Add failure notifications so the right people are alerted immediately
Why This Matters in Production
PhonePe processes millions of transactions daily. Their deployment pipeline must be fast enough to deploy fixes quickly but rigorous enough that no security vulnerability or regression reaches production. That requires deliberate pipeline design — not just adding steps as problems arise, but architecting the pipeline with a clear model for what each stage is responsible for and what gates protect each promotion step.
Core Principles
Four-stage production pipeline with parallel execution:

    +------------------------------------------+
    | TRIGGER: push to main or PR              |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | STAGE 1: BUILD (2-4 minutes)             |
    | compile / npm build / docker build       |
    | push to registry with git SHA tag        |
    | Output: image digest for downstream jobs |
    +------------------------------------------+
                         |
             +-----------+-----------+
             |           |           |
             v           v           v
       +----------+ +----------+ +-----------+
       | STAGE 2  | | STAGE 2  | | STAGE 2   |
       | Unit Test| | Lint     | | Sec Scan  |
       | 3 min    | | 1 min    | | 4 min     |
       +----------+ +----------+ +-----------+
             |           |           |
             +-----all pass----------+
                         |
                         v
    +------------------------------------------+
    | STAGE 3: INTEGRATION TEST (5-8 minutes)  |
    | real database, real cache                |
    | API contract tests                       |
    | Performance baseline check               |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | STAGE 4: DEPLOY                          |
    | -> dev: automatic                        |
    | -> staging: automatic (main branch only) |
    | -> production: manual approval gate      |
    +------------------------------------------+

Artifact promotion, the core discipline:

    BAD: rebuild for each environment
      CI:         docker build payment-api:test-build
      Staging:    docker build payment-api:staging-build   <- different image!
      Production: docker build payment-api:prod-build      <- different image!
      Problem: staging tested a different image than production runs

    GOOD: build once, promote the digest
      CI:         docker build, push payment-api:abc1234
      Staging:    deploy payment-api:abc1234   <- same image that was tested
      Production: deploy payment-api:abc1234   <- same image staging tested
      Result: what was tested is exactly what runs in production

Detailed Step-by-Step Practical Lab
Milestone 1: Design the stage structure
Before writing any YAML, design the pipeline on paper (or a whiteboard):
Questions to answer for each stage:
1. What is the single responsibility of this stage?
2. What are its inputs (from previous stages)?
3. What are its outputs (artifacts, signals)?
4. What is the failure behaviour (block or warn)?
5. How long should it take? (set a budget)

    Stage 1 - Build:
      Responsibility: produce a test-ready artifact
      Input:          source code from git
      Output:         Docker image digest
      On failure:     block all downstream stages
      Time budget:    under 4 minutes

    Stage 2 - Validate (parallel jobs):
      Responsibility: verify quality and security
      Input:          source code + image digest from Stage 1
      Output:         test results, scan report
      On failure:     block deploy stages
      Time budget:    under 5 minutes (parallel)

    Stage 3 - Integration:
      Responsibility: verify service works end-to-end
      Input:          image digest from Stage 1
      Output:         integration test results
      On failure:     block deploy stages
      Time budget:    under 8 minutes

    Stage 4 - Deploy:
      Responsibility: deliver to environments
      Input:          image digest (same one from Stage 1)
      Output:         running service per environment
      On failure:     rollback, notify, alert
      Time budget:    under 3 minutes per environment

Milestone 2: Implement parallel jobs in Stage 2
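The time budgets from the Milestone 1 design can be enforced by the workflow rather than only documented. The fragment below is a minimal sketch, assuming the job names used in this lab. The timeout values are illustrative ceilings with some headroom over each budget, not figures from the original design, and the single checkout step stands in for the real steps.

```yaml
# Sketch: enforce stage time budgets with job-level timeouts.
# GitHub cancels a job that runs longer than timeout-minutes (default: 360).
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 6       # budget: under 4 minutes, plus headroom (assumed value)
    steps:
      - uses: actions/checkout@v4
  unit-test:
    needs: build
    runs-on: ubuntu-latest
    timeout-minutes: 8       # budget: under 5 minutes for the parallel stage (assumed value)
    steps:
      - uses: actions/checkout@v4
  integration-test:
    needs: unit-test
    runs-on: ubuntu-latest
    timeout-minutes: 12      # budget: under 8 minutes (assumed value)
    steps:
      - uses: actions/checkout@v4
```

A job that hits its timeout is cancelled and blocks the jobs that depend on it, so a stage that drifts past its budget becomes visible instead of slowing the pipeline unnoticed.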

    # Assuming role-to-assume authenticates through GitHub OIDC,
    # the workflow needs these permissions
    permissions:
      id-token: write
      contents: read

    jobs:
      # Stage 1 -- runs first
      build:
        runs-on: ubuntu-latest
        outputs:
          image-digest: ${{ steps.push.outputs.digest }}
          image-tag: ${{ github.sha }}
        steps:
          - uses: actions/checkout@v4
          - uses: docker/setup-buildx-action@v3
          - uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ vars.AWS_DEPLOY_ROLE }}
              aws-region: ap-south-1
          - uses: aws-actions/amazon-ecr-login@v2
            id: ecr
          - name: Build and push
            id: push
            uses: docker/build-push-action@v5
            with:
              push: true
              # Registry host comes from the ECR login step output
              tags: ${{ steps.ecr.outputs.registry }}/payment-api:${{ github.sha }}
              cache-from: type=gha
              cache-to: type=gha,mode=max

      # Stage 2 -- three jobs run IN PARALLEL after build
      unit-test:
        needs: build          # wait for build only
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with: { node-version: '20', cache: 'npm' }
          - run: npm ci
          - run: npm test -- --reporter=junit --outputFile=junit.xml
          - uses: actions/upload-artifact@v4
            if: always()
            with: { name: junit-results, path: junit.xml }

      lint:
        needs: build          # parallel with unit-test
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with: { node-version: '20', cache: 'npm' }
          - run: npm ci
          - run: npm run lint && npm run format:check

      security-scan:
        needs: build          # parallel with unit-test and lint
        runs-on: ubuntu-latest
        steps:
          # Authenticate so Trivy can pull the image from the private registry
          - uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ vars.AWS_DEPLOY_ROLE }}
              aws-region: ap-south-1
          - uses: aws-actions/amazon-ecr-login@v2
            id: ecr
          - name: Scan image for vulnerabilities
            uses: aquasecurity/trivy-action@master   # pin a release tag or commit SHA in production
            with:
              image-ref: ${{ steps.ecr.outputs.registry }}/payment-api:${{ github.sha }}
              severity: HIGH,CRITICAL
              exit-code: '1'    # fail pipeline on HIGH/CRITICAL

Milestone 3: Integration tests with service containers

      # Stage 3 -- runs after ALL Stage 2 jobs pass
      integration-test:
        needs: [unit-test, lint, security-scan]
        runs-on: ubuntu-latest
        services:
          # Spin up real PostgreSQL for integration tests
          postgres:
            image: postgres:15
            env:
              POSTGRES_DB: payment_test
              POSTGRES_USER: testuser
              POSTGRES_PASSWORD: testpass
            ports:
              - 5432:5432     # publish to the runner so localhost:5432 resolves
            options: >-
              --health-cmd pg_isready
              --health-interval 10s
              --health-timeout 5s
              --health-retries 5
          # Spin up real Redis
          redis:
            image: redis:7-alpine
            ports:
              - 6379:6379     # publish to the runner so localhost:6379 resolves
            options: >-
              --health-cmd "redis-cli ping"
              --health-interval 10s
              --health-retries 5
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/payment_test
          REDIS_URL: redis://localhost:6379
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with: { node-version: '20', cache: 'npm' }
          - run: npm ci
          - name: Run database migrations
            run: npm run db:migrate
          - name: Run integration tests
            run: npm run test:integration

Milestone 4: Caching for faster pipelines

    # Dependency cache: npm download cache (not node_modules)
    - uses: actions/setup-node@v4
      with:
        node-version: '20'
        cache: 'npm'    # built-in npm caching
    # Cache key: package-lock.json hash
    # Cache hit: npm ci still runs, but installs from the restored npm
    # download cache instead of the network (saves 2-3 min)

    # Docker layer cache
    - uses: docker/build-push-action@v5
      with:
        cache-from: type=gha            # read from GitHub Actions cache
        cache-to: type=gha,mode=max     # write all layers to cache
    # First run: full build (3 min)
    # Subsequent runs with same base: 40 seconds

    # Custom cache for other tools
    - uses: actions/cache@v4
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('requirements*.txt') }}
        restore-keys: |
          ${{ runner.os }}-pip-

    # Measure cache impact:
    # Check job duration before and after enabling cache
    # Goal: 40-60% reduction in install/build time

Milestone 5: Deploy with environment gates

      deploy-staging:
        needs: integration-test
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'
        environment:
          name: staging
          url: https://staging.payment.internal
        steps:
          - uses: actions/checkout@v4
          - uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ vars.AWS_STAGING_ROLE }}
              aws-region: ap-south-1
          - run: aws eks update-kubeconfig --name staging-cluster --region ap-south-1
          - name: Helm deploy to staging
            run: |
              helm upgrade --install payment-api ./charts/payment-api \
                --namespace payment-api-staging \
                --values ./charts/values-staging.yaml \
                --set image.tag=${{ github.sha }} \
                --atomic --timeout 5m --wait
          - name: Post-deploy smoke test
            run: |
              sleep 15
              curl -sf https://staging.payment.internal/health

      deploy-production:
        needs: deploy-staging
        runs-on: ubuntu-latest
        # Environment: production must have Required Reviewers set
        # in GitHub Settings > Environments
        # Pipeline PAUSES here until a reviewer approves
        environment:
          name: production
          url: https://api.payment.razorpay.com
        steps:
          - uses: actions/checkout@v4
          - uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ vars.AWS_PROD_ROLE }}
              aws-region: ap-south-1
          - run: aws eks update-kubeconfig --name prod-cluster --region ap-south-1
          - name: Helm deploy to production
            run: |
              helm upgrade --install payment-api ./charts/payment-api \
                --namespace payment-api-production \
                --values ./charts/values-production.yaml \
                --set image.tag=${{ github.sha }} \
                --atomic --timeout 10m --wait

Milestone 6: Pipeline notifications and observability

      # Run after all jobs -- notify regardless of outcome
      notify:
        runs-on: ubuntu-latest
        # Every job is listed because the needs context only exposes the
        # results of direct dependencies; a failed build would otherwise
        # show up here only as a skipped deploy
        needs: [build, unit-test, lint, security-scan, integration-test, deploy-staging, deploy-production]
        if: always()    # always run this job
        steps:
          - name: Determine status
            id: status
            run: |
              if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
                echo "color=#ff0000" >> "$GITHUB_OUTPUT"
                echo "status=FAILED" >> "$GITHUB_OUTPUT"
              elif [[ "${{ needs.deploy-production.result }}" == "success" ]]; then
                echo "color=#36a64f" >> "$GITHUB_OUTPUT"
                echo "status=SUCCESS" >> "$GITHUB_OUTPUT"
              elif [[ "${{ needs.deploy-staging.result }}" == "success" ]]; then
                echo "color=#ff9900" >> "$GITHUB_OUTPUT"
                echo "status=STAGING ONLY" >> "$GITHUB_OUTPUT"
              else
                # Nothing failed and nothing was deployed, e.g. a pull request run
                echo "color=#808080" >> "$GITHUB_OUTPUT"
                echo "status=NO DEPLOY" >> "$GITHUB_OUTPUT"
              fi
          - name: Slack notification
            uses: slackapi/slack-github-action@v1
            with:
              payload: |
                {
                  "attachments": [{
                    "color": "${{ steps.status.outputs.color }}",
                    "title": "Deployment ${{ steps.status.outputs.status }}",
                    "fields": [
                      {"title": "Service", "value": "payment-api", "short": true},
                      {"title": "Commit", "value": "${{ github.sha }}", "short": true},
                      {"title": "Author", "value": "${{ github.actor }}", "short": true}
                    ]
                  }]
                }
            env:
              SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
              SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Production Best Practices and Common Pitfalls
| Scenario | Wrong | Correct |
|---|---|---|
| Stage ordering | Slow integration tests before fast unit tests | Fast tests first, slow tests later |
| Parallel jobs | All jobs run sequentially | Independent jobs run in parallel |
| Artifact passing | Rebuild image per environment | Build once, pass digest downstream |
| Cache strategy | No caching at all | Cache npm downloads and Docker layers |
| Failure notification | No alerts | Slack notification job with `if: always()` (or `if: failure()` to alert on failures only) |
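
The "build once, pass digest downstream" row can be sketched as a deploy job that reads the digest output of the `build` job. This is a minimal sketch under stated assumptions: `image.digest` is a hypothetical chart value (the chart would have to render the image reference as `repository@sha256:...`), and the AWS authentication and kubeconfig steps are left out for brevity.

```yaml
# Sketch: deploy by immutable digest instead of a mutable tag.
jobs:
  deploy-staging:
    # build is listed alongside integration-test because the needs context
    # only exposes outputs of direct dependencies.
    needs: [build, integration-test]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      # (cloud credentials and kubeconfig setup omitted in this sketch)
      - name: Helm deploy pinned to the build digest
        env:
          IMAGE_DIGEST: ${{ needs.build.outputs.image-digest }}
        run: |
          # image.digest is a hypothetical value name; adapt it to the chart.
          helm upgrade --install payment-api ./charts/payment-api \
            --namespace payment-api-staging \
            --values ./charts/values-staging.yaml \
            --set image.digest="$IMAGE_DIGEST" \
            --atomic --timeout 5m
```

A tag such as the commit SHA can in principle be pushed again with different content, whereas a digest identifies exactly one image, which is why promotion by digest gives the stronger guarantee.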
Quick Reference and Troubleshooting Commands
| Task | Command or location |
|---|---|
| View pipeline YAML | `.github/workflows/ci.yaml` |
| Check job dependencies | `needs:` field in each job |
| Inspect workflow definition (does not validate syntax) | `gh workflow view ci.yaml --yaml` |
| Measure job durations | `gh run view RUN_ID --json jobs` |
| Check cache hit rate | Actions tab > job > Cache step output |
| Debug slow pipeline | Profile each job duration, optimise slowest |
**Tip:** Use GitHub Actions `concurrency` to cancel stale pipeline runs. When an engineer pushes three commits in quick succession, only the last one matters. `concurrency: { group: "${{ github.workflow }}-${{ github.ref }}", cancel-in-progress: true }` cancels any in-progress run when a new commit arrives on the same branch. On a branch that deploys, consider leaving `cancel-in-progress` off so a deployment is not interrupted midway.
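
As a workflow-level block, the concurrency setting might look like the sketch below. It assumes `main` is the deploy branch, as in this lab, and makes cancellation conditional so that only feature-branch and pull-request runs are cancelled.

```yaml
# Sketch: workflow-level concurrency, placed alongside `on:` and `jobs:`.
concurrency:
  # One group per workflow and ref, so branches do not cancel each other.
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel superseded runs everywhere except main (assumed deploy branch),
  # where an in-flight deployment is allowed to finish.
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
```

With `cancel-in-progress` evaluating to false on `main`, a newer run waits for the running one to finish instead of cancelling it, and only the most recent waiting run is kept.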
**Remember:** The `needs` key in GitHub Actions defines a dependency DAG (directed acyclic graph). A job starts as soon as every job in its `needs` list has succeeded, so jobs that do not depend on each other run in parallel, while a job always runs after everything it lists. Drawing this graph before writing YAML makes the parallelism structure clear.
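
Stripped of real work, the dependency graph for this pipeline can be sketched as follows. The `echo` steps are placeholders and the workflow name is hypothetical. Only the `needs` keys matter here.

```yaml
# Sketch: the shape of the DAG only, with placeholder steps.
name: pipeline-shape
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "build"
  unit-test:
    needs: build              # fan-out: these three jobs start together
    runs-on: ubuntu-latest
    steps:
      - run: echo "unit tests"
  lint:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint"
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "scan"
  integration-test:
    needs: [unit-test, lint, security-scan]   # fan-in: waits for all three
    runs-on: ubuntu-latest
    steps:
      - run: echo "integration tests"
```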
**Security:** The security scan job must block deployment, not just report findings. Set `exit-code: '1'` in Trivy so the job fails on HIGH or CRITICAL findings. A security scan that posts results but does not block the pipeline gives engineers a false sense of security while shipping vulnerable images.
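
Blocking and reporting are not mutually exclusive. The sketch below fails the job on findings and still keeps the report for triage. It assumes the image lives in a private ECR registry and reuses the role variable from the lab; the report file name and artifact name are illustrative choices.

```yaml
# Sketch: gate on findings AND keep the scan report.
jobs:
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      # Authenticate so the scanner can pull the private image
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - name: Scan image (fails on HIGH/CRITICAL)
        uses: aquasecurity/trivy-action@master   # pin a release tag or SHA in practice
        with:
          image-ref: ${{ steps.ecr.outputs.registry }}/payment-api:${{ github.sha }}
          severity: HIGH,CRITICAL
          format: table
          output: trivy-report.txt               # illustrative file name
          exit-code: '1'
      - name: Upload scan report
        if: always()                             # runs even when the scan step failed
        uses: actions/upload-artifact@v4
        with:
          name: trivy-report
          path: trivy-report.txt
```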
**Common Mistake:** Using `needs: [build, test, lint, scan]` on the deploy job when `test`, `lint`, and `scan` already declare `needs: build`. Listing `build` again is redundant: the deploy job only has to list the immediately preceding jobs, because their own dependencies are enforced transitively. The one exception is a deploy job that reads `needs.build.outputs`, since the `needs` context only exposes direct dependencies.