Overview and What You Will Learn
Before CI/CD, shipping software was an event. Teams would accumulate changes for weeks, merge everything together, cross their fingers, and deploy on a Friday night. When something broke — and something always broke — engineers spent the weekend debugging. This model does not scale. It does not scale to the deployment frequency Razorpay needs (dozens per day), to the team size Hotstar operates at (hundreds of engineers), or to the risk tolerance of a payment company where downtime costs real money every minute.
CI/CD — Continuous Integration and Continuous Delivery — replaces the deployment event with a deployment process. Every code change is automatically built, tested, and prepared for delivery. Releases become routine rather than risky.
By the end of this topic you will:
- Explain precisely how Continuous Integration (CI), Continuous Delivery (CD), and Continuous Deployment (CDP) differ
- Read a pipeline and identify every stage, job, trigger, and artifact
- Apply the fail-fast principle to pipeline stage ordering
- Understand why pipeline-as-code matters for team workflows
- Recognise the cost of a slow or flaky pipeline
Why This Matters in Production
A 45-minute pipeline is a broken team workflow. When engineers wait 45 minutes to know if their code works, they context-switch away, start something new, and forget what they were debugging when the failure finally arrives. The cognitive cost of slow pipelines is enormous.
A 10-minute pipeline is a superpower. Engineers get near-instant feedback, fix issues while the context is fresh, and deploy with confidence multiple times a day. At Zerodha, fast pipelines are what make high deployment frequency possible — and high deployment frequency is what keeps each individual release small, low-risk, and easy to roll back.
Core Principles
CI vs CD vs Continuous Deployment:
```text
+------------------------------------------+
| Continuous Integration (CI)              |
|                                          |
| Every commit is automatically:           |
| - Built (compiled / packaged)            |
| - Tested (unit + integration tests)      |
| - Validated (lint, format, scan)         |
|                                          |
| Goal: detect integration problems fast   |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Continuous Delivery (CD)                 |
|                                          |
| Every change that passes CI is:          |
| - Packaged as a deployable artifact      |
| - Deployed to staging automatically      |
| - Ready for production deployment        |
|                                          |
| Human still approves production deploy   |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Continuous Deployment (CDP)              |
|                                          |
| Every change that passes all gates is:   |
| - Deployed to production automatically   |
| - No manual approval gate                |
|                                          |
| Requires: excellent test coverage,       |
|           fast rollback, feature flags   |
+------------------------------------------+
```
Pipeline anatomy — every component explained:
```text
  Trigger         Artifact
     |                |
     v                v
[Git push] -> [Build] -> [Test] -> [Scan] -> [Deploy]
                 |         |         |
               Job 1     Job 2     Job 3
              compile  unit-test  trivy-scan
                         Job 4
                         lint  <- parallel with Job 2
```
The fail-fast principle in practice:
```text
Fastest checks first -- most expensive checks last:

1. Lint and format     (10 seconds)  <- cheapest, catches typos
2. Unit tests          (2 minutes)   <- fast, catches logic errors
3. Build Docker image  (3 minutes)   <- medium, needs successful tests
4. Integration tests   (5 minutes)   <- slower, needs running service
5. Security scan       (4 minutes)   <- parallel with integration
6. Deploy to staging   (2 minutes)   <- only after all above pass

Total time: ~12 minutes to staging
```
Detailed Step-by-Step Practical Lab
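As a warm-up before reading the workflow files, the fail-fast ordering from Core Principles can be tried in a local shell. The following is a minimal sketch, not part of the lesson's pipeline: `run_stage`, `run_pipeline`, and the `*_CMD` variables are hypothetical names, and `true` / `false` stand in for real lint, test, and build commands.

```shell
# Fail-fast sketch: run checks cheapest-first and stop at the first failure.
# Stage names and commands are placeholders, not a real project's scripts.

run_stage() {
  # $1 = stage name, remaining arguments = the command to run
  name=$1
  shift
  echo "RUN  $name"
  if "$@"; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

run_pipeline() {
  # '&&' short-circuits: once a stage fails, later (costlier) stages never start.
  run_stage "lint"        "$LINT_CMD" &&
  run_stage "unit-test"   "$TEST_CMD" &&
  run_stage "build-image" "$BUILD_CMD"
}

# Placeholder commands: 'true' always succeeds, 'false' always fails.
# Each variable must hold a single-word command for this sketch to work.
LINT_CMD=true
TEST_CMD=false
BUILD_CMD=true

run_pipeline || echo "pipeline stopped early"
```

With `TEST_CMD=false`, the run stops after the unit-test stage and the image build is never attempted, which is the behaviour the ordering is meant to buy: the expensive stage costs nothing when a cheap one has already failed.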
Milestone 1 — Read a real pipeline definition
```yaml
# .github/workflows/ci.yaml -- read this and understand every line
name: Payment API CI/CD

on:
  push:
    branches: [main]          # trigger: push to main
  pull_request:
    branches: [main]          # trigger: PR targeting main

env:
  # Assumption: the registry host is stored as a repository variable.
  # Registry login is omitted here for brevity.
  ECR_REGISTRY: ${{ vars.ECR_REGISTRY }}

jobs:
  # Job 1: Build Docker image
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.tag.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - id: tag
        run: echo "tag=${{ github.sha }}" >> "$GITHUB_OUTPUT"
      # Tag with the registry prefix so the push below finds the image
      - run: docker build -t ${{ env.ECR_REGISTRY }}/payment-api:${{ github.sha }} .
      - run: docker push ${{ env.ECR_REGISTRY }}/payment-api:${{ github.sha }}

  # Job 2: Unit tests (runs after build)
  unit-test:
    needs: build              # dependency: build must complete first
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test

  # Job 3: Lint (runs PARALLEL with unit-test, both need build)
  lint:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint

  # Job 4: Deploy (runs after BOTH unit-test AND lint)
  deploy:
    # unit-test and lint must both pass; build is listed as well because
    # needs.build.outputs is only available for direct dependencies
    needs: [build, unit-test, lint]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # deploy.sh lives in the repository
      - run: ./deploy.sh staging ${{ needs.build.outputs.image-tag }}
```
Milestone 2 — Trace a commit through the pipeline
```shell
# 1. Engineer pushes a commit
git add payments/processor.js
git commit -m "fix: handle timeout in payment processor"
git push origin main

# 2. GitHub receives the push event
#    Webhook fires -> GitHub Actions scheduler receives event
#    Pipeline starts within seconds

# 3. Build job starts
#    ubuntu-latest VM spins up
#    Checks out code
#    docker build -t payment-api:abc1234 .
#    Pushes to ECR: 123456789.dkr.ecr.ap-south-1.amazonaws.com/payment-api:abc1234

# 4. unit-test and lint start IN PARALLEL
#    Two separate ubuntu-latest VMs spin up simultaneously
#    unit-test: runs 847 tests in 2m 14s -- PASS
#    lint: checks code style in 23s -- PASS

# 5. deploy job starts
#    Waits for BOTH unit-test AND lint to complete
#    Connects to EKS cluster
#    helm upgrade payment-api --set image.tag=abc1234
#    Verifies pods are healthy

# 6. Slack notification
#    "Deployment SUCCESS: payment-api abc1234 to staging"

# Total time: 8 minutes 43 seconds from push to staging
```
Milestone 3 — Understand pipeline triggers
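One detail to keep in mind when reading the schedule trigger in this milestone: GitHub Actions evaluates `cron` expressions in UTC, so a time intended for IST has to be shifted by 5 hours 30 minutes. A small sketch of that arithmetic follows; `utc_to_ist` is a hypothetical helper written for this lesson, not a GitHub or `gh` feature.

```shell
# Convert a UTC wall-clock time to IST (UTC+05:30).
# Pass hour and minute as plain numbers without leading zeros (9, not 09),
# because shell arithmetic reads a leading zero as octal.
utc_to_ist() {
  hour=$1
  minute=$2
  total=$(( (hour * 60 + minute + 330) % 1440 ))   # 330 minutes = 5h30m offset
  printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

utc_to_ist 21 0    # cron '0 21 * * *' -> prints 02:30
```

The modulo wraps the result past midnight, which is why a 21:00 UTC schedule lands on the following calendar day in IST.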
```yaml
on:
  # Push trigger: runs on commits to main
  push:
    branches: [main]
    # Path filter: run only when these paths change (docs-only changes are skipped)
    paths:
      - 'src/**'
      - 'Dockerfile'
      - 'package*.json'

  # PR trigger: runs on pull requests
  pull_request:
    branches: [main]
    types: [opened, synchronize, reopened]

  # Schedule trigger: nightly security scan
  schedule:
    - cron: '0 21 * * *'      # 21:00 UTC = 2:30 AM IST daily

  # Manual trigger with inputs
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        options: [staging, production]
        required: true
```
Milestone 4 — Work with pipeline artifacts
```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Build application
        run: npm run build

      # Artifact 1: built application files
      - uses: actions/upload-artifact@v4
        with:
          name: dist-files
          path: dist/
          retention-days: 7

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # Download the built files (not rebuilding from source)
      - uses: actions/download-artifact@v4
        with:
          name: dist-files
          path: dist/

      - name: Run tests against built artifacts
        run: npm test

      # Artifact 2: test results, kept so they can be inspected after the run
      - uses: actions/upload-artifact@v4
        if: always()          # upload even when tests fail
        with:
          name: test-results
          path: test-results/junit.xml
```
Milestone 5 — Diagnose and fix a failing pipeline
```shell
# Step 1: See which job failed
gh run list --workflow=ci.yaml --limit 5
# STATUS  TITLE                 BRANCH  EVENT  ID
# X       fix: payment timeout  main    push   9876543

# Step 2: See which step failed
gh run view 9876543
# Jobs:
#   X unit-test (3m 12s)
#     * checkout -- pass
#     * npm ci   -- pass
#     * npm test -- FAIL

# Step 3: Read the failure
gh run view 9876543 --log | grep -A 10 "FAIL"
# FAIL src/payments/processor.test.js
#   * timeout handler not called when connection fails
#     Expected: true
#     Received: false

# Step 4: Fix locally and verify
npm test src/payments/processor.test.js   # run just the failing test
# Investigate and fix the test
# Verify the fix

# Step 5: Push the fix
git add .
git commit -m "fix: processor timeout test -- mock timer correctly"
git push
# New pipeline run starts automatically
```
Milestone 6 — Measure pipeline health
```shell
# Pipeline speed: how long does each job take?
# GitHub Actions > workflow run > each job shows duration

# Check pipeline pass rate over time
# (group by conclusion: success, failure, cancelled, ...; status only says queued/in_progress/completed)
gh run list --workflow=ci.yaml --limit 50 --json conclusion,createdAt \
  | jq '[.[] | .conclusion] | group_by(.) | map({conclusion: .[0], count: length})'

# Find the slowest jobs of a completed run
# (timestamps are ISO 8601 strings, so convert them to seconds before subtracting)
gh run view RUN_ID --json jobs \
  | jq '.jobs[] | {name: .name, seconds: ((.completedAt | fromdateiso8601) - (.startedAt | fromdateiso8601))}'

# Identify flaky tests (tests that sometimes pass, sometimes fail)
# Look for jobs that fail intermittently:
gh run list --workflow=ci.yaml --limit 100 --json conclusion \
  | jq '[.[].conclusion] | group_by(.) | map({result: .[0], count: length})'
```
Production Best Practices and Common Pitfalls
| Mistake | Problem | Fix |
|---|---|---|
| One giant job with 20 steps | Cannot run steps in parallel, slow | Split into multiple jobs with dependencies |
| No path filters on triggers | Pipeline runs on README changes | Add paths: filter to push triggers |
| Rebuilding image per environment | Untested image goes to production | Build once, promote same digest everywhere |
| No artifact retention policy | Old artifacts consume storage quota | Set retention-days on all artifacts |
| Ignoring flaky tests | Team ignores pipeline failures | Quarantine flaky tests, fix or delete them |
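The last row's advice depends on knowing which failures are flaky in the first place. One rough signal, sketched here under assumptions, is a commit whose runs have both passed and failed: the code did not change, yet the outcome did. `find_flaky_commits` is a hypothetical helper, and the sample history is invented purely for illustration.

```shell
# Flag commits whose runs both passed and failed: same code, different outcome.
# Input: one "<commit-sha> <conclusion>" pair per line on stdin.
find_flaky_commits() {
  awk '
    $2 == "success" { ok[$1] = 1 }
    $2 == "failure" { bad[$1] = 1 }
    END { for (sha in ok) if (sha in bad) print sha }
  ' | sort
}

# Made-up sample history, for illustration only.
printf '%s\n' \
  'abc1234 success' \
  'def5678 failure' \
  'def5678 success' \
  '0a1b2c3 failure' | find_flaky_commits
```

Only `def5678` is reported, because it is the one commit with both outcomes. Real input could plausibly come from `gh run list --workflow=ci.yaml --limit 100 --json headSha,conclusion --jq '.[] | "\(.headSha) \(.conclusion)"'`. Treat the result as a starting point for investigation rather than proof, since a re-run that replaces an earlier attempt may not appear as a separate entry.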
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| List recent pipeline runs | gh run list --workflow=ci.yaml |
| View run details | gh run view RUN_ID |
| View run logs | gh run view RUN_ID --log |
| Re-run failed jobs only | gh run rerun RUN_ID --failed |
| Cancel a running pipeline | gh run cancel RUN_ID |
| Trigger manually | gh workflow run ci.yaml |
| Watch run in progress | gh run watch RUN_ID |
**Tip:** Add `paths-ignore: ['**.md', 'docs/**']` to your push trigger. Documentation changes should not trigger a full build, test, and deploy cycle, and every unnecessary pipeline run wastes runner minutes and adds noise to the deployment history. GitHub Actions does not allow `paths` and `paths-ignore` on the same event, so use this in place of a `paths:` filter rather than alongside it.
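To see what this filter decides for a given change set, the rule can be approximated locally: the pipeline is skipped only when every changed file matches an ignored pattern. This is a sketch under assumptions, with `should_run_pipeline` as a hypothetical helper and shell `case` patterns standing in for GitHub's own glob matcher, which has more rules than shown here.

```shell
# Approximate the effect of paths-ignore: ['**.md', 'docs/**'] locally.
# Reads changed file paths on stdin, one per line.
# Returns 0 (run) if any file falls outside the ignored patterns, else 1 (skip).
should_run_pipeline() {
  while IFS= read -r path; do
    case $path in
      *.md|docs/*) ;;      # ignored: Markdown anywhere, anything under docs/
      *) return 0 ;;       # a non-ignored file changed, so the pipeline runs
    esac
  done
  return 1
}

if printf '%s\n' 'README.md' 'docs/setup/install.txt' | should_run_pipeline; then
  echo "run pipeline"
else
  echo "skip pipeline"
fi
```

The sample change set touches only documentation, so the sketch prints `skip pipeline`. Adding a single source file such as `src/app.js` to the list flips the decision.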
**Remember:** The pipeline is not a safety net for bad code; it is a verification system for good code. If your team relies on the pipeline to catch bugs that code review should catch, the pipeline will become slow and flaky as the test suite grows to cover every edge case that should have been caught in review.
**Security:** Never print environment variables or secrets in pipeline logs. Even with secret masking enabled, a secret that is logged in a transformed form (encoded, or embedded in structured output) can appear in a format the masker does not recognise. Audit your pipeline logs regularly, and in GitHub Actions run `echo "::add-mask::$SECRET_VALUE"` to mask any dynamically generated sensitive values.
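The masking command is an ordinary line written to standard output that the runner interprets. A minimal sketch, assuming a value generated during the job: `mask_value` is a hypothetical wrapper and `GENERATED_TOKEN` a placeholder, and outside a GitHub Actions runner the function simply prints the command text.

```shell
# Emit the GitHub Actions workflow command that masks a value in later log output.
# Outside a runner this only prints the command; nothing is actually masked.
mask_value() {
  printf '::add-mask::%s\n' "$1"
}

# Placeholder standing in for a token generated during the job.
GENERATED_TOKEN="example-token-123"
mask_value "$GENERATED_TOKEN"
```

Register the mask before the value is used anywhere else in the step, because only output produced after the command is affected.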
**Common Mistake:** Treating the pipeline as healthy because it is green, even though it is slow. A green pipeline that takes 45 minutes is still failing, just on a different dimension. Set time budgets for each stage (Build < 5 min, Test < 8 min, Deploy < 3 min) and treat violations as bugs to fix, not acceptable trade-offs.
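Budgets are easier to enforce when a script compares them against measured durations. The sketch below assumes stage durations are already available in seconds; `check_budget` is a hypothetical helper and the measured numbers are invented for illustration.

```shell
# Compare measured stage durations (seconds) against the stage budgets.
check_budget() {
  # $1 = stage name, $2 = measured seconds, $3 = budget in seconds
  if [ "$2" -lt "$3" ]; then
    echo "OK   $1: ${2}s (budget ${3}s)"
  else
    echo "OVER $1: ${2}s (budget ${3}s)"
    return 1
  fi
}

status=0
check_budget build  250 300 || status=1   # Build  < 5 min
check_budget test   510 480 || status=1   # Test   < 8 min
check_budget deploy 120 180 || status=1   # Deploy < 3 min
echo "budget check exit status: $status"
```

Here the test stage exceeds its budget, so the final status is 1. Using that status as a job's exit code would make a slow pipeline fail visibly instead of staying green.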