Overview and What You Will Learn
The gap between a script that works and a script that is production-ready is filled with error handling, idempotency, logging, locking, retry logic, and notifications. This topic covers the patterns that experienced DevOps engineers apply to every script that touches production systems.
By the end of this lab you will:
- Parse arguments with `getopts` for professional CLI interfaces
- Write idempotent scripts that are safe to run multiple times
- Implement retry logic with exponential backoff
- Build deployment scripts with health verification and rollback
- Parse flags and implement `--dry-run` mode
- Send Slack or webhook notifications from scripts
- Use `shellcheck` and `bats` for script quality assurance
Why This Matters in Production
A Razorpay deployment script that runs correctly the first time is good. A deployment script that also handles partial failures, validates the deployment succeeded, automatically rolls back on failure, notifies the team, and can be re-run safely without side effects — that is production engineering. The patterns in this topic are what make the difference.
Core Principles
Production deployment script flow:
```
+------------------------------------------+
| 1. Parse and validate arguments          |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| 2. Check dependencies and preconditions  |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| 3. Acquire lock (prevent concurrent runs)|
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| 4. Execute deployment steps              |
+------------------------------------------+
                     |
                health check
                /          \
             pass          fail
              |              |
              v              v
+------------------+  +------------------+
| 5. Notify Slack  |  | 5. Rollback      |
|    success       |  |    Notify fail   |
+------------------+  +------------------+
```
Detailed Step-by-Step Practical Lab
Milestone 1 — Argument parsing with getopts
```bash
#!/usr/bin/env bash
set -euo pipefail

# Usage function: prints help and exits with the given code (default 1)
usage() {
    cat << EOF
Usage: $(basename "$0") [OPTIONS] <service> <version>

Deploy a service to a target environment.

Options:
  -e ENV      Target environment (default: production)
  -r REGION   AWS region (default: ap-south-1)
  -d          Dry run -- show what would happen without doing it
  -v          Verbose output
  -h          Show this help

Examples:
  $(basename "$0") payment-api v1.2.3
  $(basename "$0") -e staging -d payment-api v1.2.3
  $(basename "$0") -e production -r us-east-1 payment-api v1.2.3
EOF
    exit "${1:-1}"
}

# Parse options with getopts
ENVIRONMENT="production"
REGION="ap-south-1"
DRY_RUN=false
VERBOSE=false

while getopts "e:r:dvh" opt; do
    case "$opt" in
        e) ENVIRONMENT="$OPTARG" ;;
        r) REGION="$OPTARG" ;;
        d) DRY_RUN=true ;;
        v) VERBOSE=true ;;
        h) usage 0 ;;   # asking for help is not an error
        *) usage ;;
    esac
done
shift $((OPTIND - 1))

# After getopts, positional args remain
if [[ $# -lt 2 ]]; then
    echo "Error: service and version are required" >&2
    usage
fi

SERVICE="$1"
VERSION="$2"

# Dry run mode
run_cmd() {
    if [[ "$DRY_RUN" == true ]]; then
        echo "[DRY RUN] $*"
    else
        "$@"
    fi
}

# Explicit test instead of '$VERBOSE && echo ...', which returns a
# non-zero status when VERBOSE=false
if [[ "$VERBOSE" == true ]]; then
    echo "Environment: $ENVIRONMENT, Region: $REGION, Service: $SERVICE, Version: $VERSION"
fi
```
Milestone 2 — Idempotent scripts
An idempotent script produces the same result whether run once or ten times. Running it again should be safe.
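The check-before-act idea behind idempotency can be wrapped in a small reusable helper. The sketch below is illustrative only: `ensure_line` is a hypothetical function name, not part of any standard tool, and the demo writes to a temporary file so it is safe to try.

```bash
#!/usr/bin/env bash
set -euo pipefail

# ensure_line FILE LINE  (hypothetical helper, for illustration only)
# Appends LINE to FILE only if an identical line is not already present.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -qFx -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

demo_file="$(mktemp)"

# Run the same step three times; only the first run changes the file.
# Single quotes are intentional: $PATH must be written literally.
for _ in 1 2 3; do
    ensure_line "$demo_file" 'export PATH=/usr/local/bin:$PATH'
done

line_count="$(wc -l < "$demo_file" | tr -d ' ')"
echo "lines after three runs: $line_count"
```

After three runs the file still holds a single line, which is the property the rest of this milestone applies to directories, users, symlinks, and service configuration.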
```bash
# BAD: not idempotent
# Running twice creates duplicate entries
echo 'export PATH=/usr/local/bin:$PATH' >> ~/.bashrc

# GOOD: idempotent
# Check before adding. Single quotes keep $PATH literal so it is
# expanded at login, not frozen at the time this script runs.
if ! grep -qF 'export PATH=/usr/local/bin' ~/.bashrc; then
    echo 'export PATH=/usr/local/bin:$PATH' >> ~/.bashrc
fi

# BAD: not idempotent
mkdir /opt/payment-api
# Fails on the second run because the directory already exists

# GOOD: idempotent
mkdir -p /opt/payment-api
# -p makes it succeed even if the directory exists

# BAD: not idempotent
useradd payment-svc
# Fails on second run

# GOOD: idempotent
if ! id payment-svc > /dev/null 2>&1; then
    useradd --system --no-create-home --shell /sbin/nologin payment-svc
fi

# Idempotent symlink update
# ln -s fails if the symlink already exists
# ln -sfn replaces it; -n stops ln from following an existing link
# to a directory and creating the new link inside that directory
ln -sfn "/opt/payment-api/${VERSION}" /opt/payment-api/current

# Idempotent service configuration
# Only reload if config actually changed
if ! diff -q /etc/nginx/sites-enabled/payment /etc/nginx/sites-available/payment > /dev/null 2>&1; then
    cp /etc/nginx/sites-available/payment /etc/nginx/sites-enabled/payment
    sudo systemctl reload nginx
fi
```
Milestone 3 — Retry logic with exponential backoff
```bash
# Simple retry function
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY DESCRIPTION COMMAND [ARGS...]
retry() {
    local max_attempts="$1"
    local delay="$2"
    local description="$3"
    shift 3

    local attempt=1
    until "$@"; do
        if [[ $attempt -ge $max_attempts ]]; then
            echo "ERROR: $description failed after $max_attempts attempts" >&2
            return 1
        fi
        echo "WARN: $description failed (attempt $attempt/$max_attempts), retrying in ${delay}s..."
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))  # exponential backoff
    done
    echo "INFO: $description succeeded on attempt $attempt"
}

# Usage
retry 5 2 "pull Docker image" docker pull "payment-api:${VERSION}"
retry 10 3 "health check" curl -sf http://localhost:4000/health

# Retry with custom success condition
wait_for_healthy() {
    local url="$1"
    local max_wait="${2:-60}"
    local interval="${3:-3}"
    local elapsed=0

    echo "Waiting for $url to become healthy..."
    while [[ $elapsed -lt $max_wait ]]; do
        if curl -sf "$url" > /dev/null 2>&1; then
            echo "Service healthy after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
        echo "  Still waiting... (${elapsed}s / ${max_wait}s)"
    done
    echo "ERROR: Service did not become healthy after ${max_wait}s" >&2
    return 1
}

wait_for_healthy "http://localhost:4000/health" 120 5
```
Milestone 4 — Deployment script with rollback
```bash
PREVIOUS_VERSION=""

rollback() {
    if [[ -n "$PREVIOUS_VERSION" ]]; then
        echo "ERROR: Deployment failed, rolling back to $PREVIOUS_VERSION" >&2
        # -n replaces the symlink itself instead of following it into the directory
        ln -sfn "/opt/payment-api/${PREVIOUS_VERSION}" /opt/payment-api/current
        systemctl restart payment-api
        notify_slack "ROLLBACK" "payment-api rolled back to $PREVIOUS_VERSION"
    else
        echo "ERROR: No previous version to roll back to" >&2
    fi
}

deploy() {
    local service="$1"
    local version="$2"

    # Save current version for rollback (stays empty if no symlink exists yet)
    local current_target
    if current_target=$(readlink "/opt/${service}/current" 2>/dev/null); then
        PREVIOUS_VERSION=$(basename "$current_target")
    else
        PREVIOUS_VERSION=""
    fi
    echo "INFO: Previous version: $PREVIOUS_VERSION"
    echo "INFO: Deploying: $version"

    # Step 1: Pull new artifact
    echo "INFO: Pulling $service:$version"
    docker pull "${service}:${version}" || { rollback; return 1; }

    # Step 2: Stop old container gracefully (options go before the container name)
    echo "INFO: Stopping old container"
    docker stop --time 30 "$service" 2>/dev/null || true
    docker rm "$service" 2>/dev/null || true

    # Step 3: Start new container
    echo "INFO: Starting new container"
    docker run -d \
        --name "$service" \
        --restart unless-stopped \
        -p 4000:4000 \
        --env-file /etc/payment-api/env \
        "${service}:${version}" || { rollback; return 1; }

    # Step 4: Wait for health check
    if ! wait_for_healthy "http://localhost:4000/health" 60 3; then
        rollback
        return 1
    fi

    # Step 5: Update symlink only after the new version is confirmed healthy
    ln -sfn "/opt/${service}/${version}" "/opt/${service}/current"
    echo "INFO: Deployment successful: $service $version"
    return 0
}
```
Milestone 5 — Notifications via Slack webhook
```bash
# Send Slack notification
notify_slack() {
    local status="$1"
    local message="$2"
    local webhook="${SLACK_WEBHOOK:-}"

    if [[ -z "$webhook" ]]; then
        echo "WARN: SLACK_WEBHOOK not set, skipping notification"
        return 0
    fi

    local color
    case "$status" in
        SUCCESS)  color="#36a64f" ;;  # green
        FAILURE)  color="#ff0000" ;;  # red
        ROLLBACK) color="#ff9900" ;;  # orange
        *)        color="#808080" ;;  # grey
    esac

    # Build the JSON with jq so that quotes in the message are escaped correctly
    local payload
    payload=$(jq -n \
        --arg status "$status" \
        --arg message "$message" \
        --arg color "$color" \
        --arg service "$SERVICE" \
        --arg version "$VERSION" \
        --arg env "$ENVIRONMENT" \
        '{
            "attachments": [{
                "color": $color,
                "title": ("Deployment " + $status),
                "fields": [
                    {"title": "Service", "value": $service, "short": true},
                    {"title": "Version", "value": $version, "short": true},
                    {"title": "Environment", "value": $env, "short": true},
                    {"title": "Message", "value": $message, "short": false}
                ]
            }]
        }')

    # A failed notification is reported but must not abort the script under set -e
    if curl -sf -X POST "$webhook" \
        -H 'Content-type: application/json' \
        -d "$payload" > /dev/null; then
        echo "INFO: Slack notification sent: $status"
    else
        echo "WARN: Slack notification failed: $status" >&2
    fi
}
```
Milestone 6 — Script quality with shellcheck and bats
```bash
# shellcheck: static analysis for bash scripts
# Install: apt install shellcheck
shellcheck deploy.sh
# Finds: unquoted variables, incorrect comparisons,
# deprecated syntax, undefined variables, etc.

# Fix common shellcheck warnings:
# SC2086: Double quote to prevent globbing and word splitting
#   Bad:  cp $FILE /tmp/
#   Good: cp "$FILE" /tmp/

# SC2006: Use $(...) notation instead of legacy backticks
#   Bad:  DATE=`date +%Y%m%d`
#   Good: DATE=$(date +%Y%m%d)

# bats: bash automated testing system
# Install: apt install bats
# Write tests in test/deploy_test.bats:
mkdir -p test
cat > test/deploy_test.bats << 'EOF'
#!/usr/bin/env bats

# Load the script (in source-only mode).
# Assumes deploy.sh skips its main logic when sourced and defines
# validate_version, which this lab does not show.
setup() {
    source "${BATS_TEST_DIRNAME}/../deploy.sh"
}

@test "usage exits with code 1" {
    run usage
    [ "$status" -eq 1 ]
}

@test "validate_version accepts semver format" {
    run validate_version "v1.2.3"
    [ "$status" -eq 0 ]
}

@test "validate_version rejects invalid format" {
    run validate_version "not-a-version"
    [ "$status" -eq 1 ]
}

@test "rollback function works with previous version set" {
    PREVIOUS_VERSION="v1.1.0"
    run rollback
    [ "$status" -eq 0 ]
}
EOF

# Run tests
bats test/deploy_test.bats
```
Production Best Practices and Common Pitfalls
| Pattern | Wrong | Correct |
|---|---|---|
| Argument parsing | Manual parsing such as `if [ "$1" == "-e" ]` | Use `getopts` for proper flag handling |
| Idempotency | Run and hope | Check-before-act for every side effect |
| Failed deployment | Exit without cleanup | Rollback + notify + cleanup via trap |
| Retry logic | Fixed sleep between attempts | Exponential backoff with max attempts |
| Script testing | Manual testing only | shellcheck + bats automated tests |
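The "cleanup via trap" entry in the table is not shown elsewhere in this lab, so here is a minimal sketch of the idea. The names `run_deploy` and `rollback_stub` are hypothetical, and the rollback is only a stub that prints a message. An `EXIT` trap runs whether the script finishes normally or stops on an error, which is what makes it a reliable place for cleanup.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for a real rollback function (hypothetical).
rollback_stub() { echo "rolling back"; }

# run_deploy CMD...  (hypothetical wrapper)
# The body is a subshell, written with ( ) instead of { }, so the trap
# and any exit stay contained and the calling script keeps running.
run_deploy() (
    set -euo pipefail
    work_dir="$(mktemp -d)"
    deploy_ok=false

    cleanup() {
        local exit_code=$?          # status that caused the exit
        rm -rf "$work_dir"          # always remove scratch files
        if [[ "$deploy_ok" != true ]]; then
            rollback_stub           # only roll back when the deploy failed
        fi
        exit "$exit_code"
    }
    trap cleanup EXIT

    "$@"                            # deployment step; a failure exits here
    deploy_ok=true
)

# Errexit is relaxed around the calls so both statuses can be captured.
set +e
output_ok="$(run_deploy true)";    status_ok=$?
output_fail="$(run_deploy false)"; status_fail=$?
set -e

echo "success case: status=$status_ok output='$output_ok'"
echo "failure case: status=$status_fail output='$output_fail'"
```

In a real script the trap would usually be set once at the top level, with the lab's `rollback` and `notify_slack` functions called from the handler.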
Quick Reference and Troubleshooting Commands
| Task | Command |
|---|---|
| Lint script | `shellcheck deploy.sh` |
| Run tests | `bats test/` |
| Dry run flag | `./deploy.sh -d payment-api v1.2.3` |
| Debug trace | `bash -x deploy.sh` |
| Lock file check | `flock -n /var/lock/deploy.lock ./deploy.sh` |
| Slack test | `curl -X POST -H 'Content-type: application/json' -d '{"text":"test"}' "$SLACK_WEBHOOK"` |
**Tip:** Add `--dry-run` mode to every deployment script. It should print exactly what the script would do without making any changes. This lets you validate the logic before running in production and is invaluable for training new team members on how deployments work.
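Because `getopts` only understands single-character options, a long flag such as `--dry-run` needs its own handling. The sketch below shows one common approach, a plain `while`/`case` loop. `parse_flags` and `REMAINING` are hypothetical names chosen for this example, and `run_cmd` follows the same idea as the function in Milestone 1.

```bash
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN=false
REMAINING=()

# parse_flags ARGS...  (hypothetical helper)
# Accepts -d or --dry-run and collects everything else as positional
# arguments in the REMAINING array.
parse_flags() {
    REMAINING=()
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -d|--dry-run) DRY_RUN=true ;;
            --)           shift; REMAINING+=("$@"); break ;;
            -*)           echo "Unknown option: $1" >&2; return 2 ;;
            *)            REMAINING+=("$1") ;;
        esac
        shift
    done
}

# Print the command instead of running it when dry run is active.
run_cmd() {
    if [[ "$DRY_RUN" == true ]]; then
        echo "[DRY RUN] $*"
    else
        "$@"
    fi
}

parse_flags --dry-run payment-api v1.2.3

scratch="$(mktemp -d)/marker"   # a path that does not exist yet
run_cmd touch "$scratch"        # prints the command; creates nothing
```

Routing every state-changing command through `run_cmd` is what keeps the dry run honest: anything called directly would still execute.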
**Remember:** Always use `flock` for scripts that should not run concurrently. `flock -n /var/lock/myscript.lock -c ./myscript.sh` fails immediately if another instance is running. Without locking, two simultaneous deployments can corrupt shared state.
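Locking can also be done from inside the script, so that callers cannot forget the `flock` wrapper. The following is a minimal sketch, assuming `flock` from util-linux is installed. The lock path and the `acquire_lock` function name are hypothetical, and the demo uses a temporary directory rather than `/var/lock` so it runs without root.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lock path for the demo; production scripts often use /var/lock.
LOCK_FILE="${TMPDIR:-/tmp}/deploy-demo.lock"

# acquire_lock opens the lock file on file descriptor 9 and takes a
# non-blocking exclusive lock. The lock is released automatically when
# the script exits and the descriptor is closed.
acquire_lock() {
    exec 9>"$LOCK_FILE"
    if ! flock -n 9; then
        echo "Another instance holds $LOCK_FILE" >&2
        return 1
    fi
}

acquire_lock
echo "lock acquired"

# A second attempt on the same file is refused while this script holds it.
if flock -n "$LOCK_FILE" true; then
    second_attempt="acquired"
else
    second_attempt="refused"
fi
echo "second attempt: $second_attempt"
```

Holding the lock on a file descriptor means no explicit unlock step is needed, even if the script is killed partway through.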
**Security:** Never log environment variables or command output that may contain secrets. `set -x` (debug mode) prints each command after expansion, so the value of every variable used in that command appears in the trace. If a variable contains a password or API key, it will end up in logs. Turn tracing off with `set +x` before any command that uses secrets.
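One way to apply this is to switch tracing off only around the sensitive command and back on afterwards. The sketch below captures the trace into a variable so the effect can be inspected. `API_TOKEN`, `use_secret`, and `traced_steps` are made-up names for the demo.

```bash
#!/usr/bin/env bash
set -euo pipefail

API_TOKEN="s3cr3t-demo-value"   # hypothetical secret, for the demo only

# Stand-in for any command that receives the secret as an argument.
use_secret() { : "$1"; }

traced_steps() {
    set -x
    echo "step before the secret" > /dev/null
    { set +x; } 2>/dev/null     # tracing off, without logging this line
    use_secret "$API_TOKEN"     # not traced, so the value stays out of logs
    set -x
    echo "step after the secret" > /dev/null
    { set +x; } 2>/dev/null
}

# The trace is written to stderr; capture it to see what would be logged.
trace="$(traced_steps 2>&1)"
printf '%s\n' "$trace"
```

The captured trace contains the two `echo` steps but not the token. Had `use_secret` run while tracing was on, its expanded argument would have been printed in full.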
**Common Mistake:** Writing `exit 0` inside a function when you only mean to leave the function. In bash, `exit` terminates the entire script (or the current subshell), not just the function. Use `return 0` or `return 1` inside functions, and reserve `exit` for the top level of the script or for functions such as `usage` whose purpose is to stop the script.
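The difference is easy to see side by side. In this sketch both validators are hypothetical and differ only in `return` versus `exit`. The `exit` variant is called inside a subshell so that the demo itself survives to report what happened.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical validators, for illustration only.
check_with_return() {
    [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] || return 1
}

check_with_exit() {
    [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] || exit 1
}

# return: the caller decides what to do with the failure.
if check_with_return "not-a-version"; then
    return_result="valid"
else
    return_result="invalid, script continues"
fi
echo "return: $return_result"

# exit: the shell running the function terminates. The inner parentheses
# confine that to a subshell; "reached" is never printed.
exit_result="$( (check_with_exit "not-a-version"; echo "reached") \
    || echo "shell terminated inside the function" )"
echo "exit: $exit_result"
```

With `return`, the caller can log, retry, or roll back. With `exit`, none of the code after the call gets a chance to run.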