Kubernetes Jobs and CronJobs run tasks that are meant to complete, not run forever like a web server. A Job runs a pod to completion and then stops. A CronJob creates Jobs on a schedule. Database migrations, report generation, and data pipelines at companies like Swiggy or Razorpay that need Kubernetes-level reliability typically use one of these two.
+++
Kubernetes Jobs and CronJobs for Batch Workloads
Why Pods Alone Are Not Enough for Batch Tasks
A regular Deployment keeps pods running forever. If a pod crashes, the controller restarts it. That is the right behavior for a web server. It is the wrong behavior for a database migration: you do not want a migration to be restarted automatically, without limit, after it has partially completed.
A Job solves this. It runs a pod to completion, retries on failure up to a defined limit, and then stops. Once the task is done, the pod stays in Completed state instead of being deleted, so you can inspect its logs.
    +---------------------+     +---------------------+
    | Deployment          |     | Job                 |
    +---------------------+     +---------------------+
    | Runs forever        |     | Runs to completion  |
    | Restarts on crash   |     | Retries on failure  |
    | Never "done"        |     | Stops when done     |
    +---------------------+     +---------------------+
      Web servers,                Migrations, reports,
      API pods                    data pipelines

Anatomy of a Kubernetes Job
    # db-migration-job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: db-migration-v2
      namespace: production
    spec:
      backoffLimit: 3                # Retry the pod up to 3 times on failure before giving up
      completions: 1                 # How many times the task must succeed (default: 1)
      parallelism: 1                 # How many pods run simultaneously (default: 1)
      ttlSecondsAfterFinished: 600   # Delete the Job and its pods 10 minutes after completion
      template:
        spec:
          restartPolicy: Never       # REQUIRED for Jobs: must be Never or OnFailure, never Always
          containers:
            - name: migration
              image: registry.razorpay.in/payments-api:v2.5.1
              command: ["python", "manage.py", "migrate"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url

restartPolicy: Never vs OnFailure
| Policy | Behavior | Use When |
|---|---|---|
| Never | The Job controller creates a new pod on each failure | You want a clean pod for each attempt: DB migrations, one-time scripts |
| OnFailure | Restarts the failed container in the same pod | The task is safe to restart in place: simple data transforms |
Parallel Jobs: Processing a Queue
When you need to process a large number of independent tasks (sending 10,000 emails, resizing 50,000 images), you can run multiple pods in parallel.
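As a rough illustration of how the `completions` and `parallelism` fields interact, the Python sketch below splits the required completions into waves of pods. It rests on a simplifying assumption: every pod takes the same time, so pods finish together. In a real cluster the Job controller starts a replacement pod as soon as any one pod finishes, so the waves are not strictly aligned. The function name `plan_waves` is hypothetical and exists only for this example.

```python
def plan_waves(completions: int, parallelism: int) -> list[int]:
    """Return how many pods start in each wave.

    Assumption (not Kubernetes behavior): all pods in a wave finish
    at the same moment, so the next wave starts all at once.
    """
    if completions < 1 or parallelism < 1:
        raise ValueError("completions and parallelism must be >= 1")
    waves = []
    remaining = completions
    while remaining > 0:
        size = min(parallelism, remaining)  # never start more pods than tasks left
        waves.append(size)
        remaining -= size
    return waves


if __name__ == "__main__":
    waves = plan_waves(completions=50, parallelism=5)
    print(f"{len(waves)} waves of sizes {waves}")
```

With 50 completions and a parallelism of 5, the sketch yields 10 waves of 5 pods each. Raising `parallelism` shortens the total run only as long as the cluster has capacity to schedule that many pods at once.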
    spec:
      completions: 50    # Total tasks to complete
      parallelism: 5     # Run 5 pods at a time
      backoffLimit: 10

    completions=50, parallelism=5

    Batch 1: [Pod-1] [Pod-2] [Pod-3] [Pod-4] [Pod-5]    → all run simultaneously → complete
    Batch 2: [Pod-6] [Pod-7] [Pod-8] [Pod-9] [Pod-10]   → complete
    ...      continues until 50 total completions

Monitoring a Running Job
    # Watch job status in real time
    kubectl get jobs -n production -w

    # Output:
    # NAME              COMPLETIONS   DURATION   AGE
    # db-migration-v2   0/1           10s        10s
    # db-migration-v2   1/1           23s        23s   ← Job succeeded

    # Read logs from the job's pod
    kubectl logs -l job-name=db-migration-v2 -n production

    # Describe for events and failure details
    kubectl describe job db-migration-v2 -n production

CronJob: Running Jobs on a Schedule
A CronJob is a Job with a schedule. It creates a new Job object at each scheduled time, which in turn creates the pod.
    CronJob (schedule: "0 2 * * *")
      │
      ├─ creates → Job (at 2:00 AM Monday)    → Pod → Runs → Completes
      ├─ creates → Job (at 2:00 AM Tuesday)   → Pod → Runs → Completes
      └─ creates → Job (at 2:00 AM Wednesday) → Pod → Runs → Completes

    # nightly-report-cronjob.yaml
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: nightly-report
      namespace: production
    spec:
      schedule: "0 2 * * *"            # Every day at 2:00 AM in the time zone set below
      timeZone: "Asia/Kolkata"         # Stable since Kubernetes 1.27
      concurrencyPolicy: Forbid        # Do not start a new run if the previous one is still running
      successfulJobsHistoryLimit: 3    # Keep the last 3 successful job records
      failedJobsHistoryLimit: 5        # Keep the last 5 failed job records for debugging
      startingDeadlineSeconds: 300     # If a run cannot start within 5 minutes of its schedule, skip it
      jobTemplate:
        spec:
          backoffLimit: 2
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: report-generator
                  image: registry.hotstar.com/analytics:v1.8.2
                  command: ["python", "generate_report.py", "--date=yesterday"]

CronJob Schedule Syntax
    ┌───────────── Minute (0–59)
    │ ┌─────────── Hour (0–23)
    │ │ ┌───────── Day of month (1–31)
    │ │ │ ┌─────── Month (1–12)
    │ │ │ │ ┌───── Day of week (0–6, Sunday=0)
    │ │ │ │ │
    * * * * *

    Common schedules:
    "0 * * * *"        → Every hour at :00
    "*/15 * * * *"     → Every 15 minutes
    "0 2 * * *"        → Every day at 2 AM
    "0 2 * * 1"        → Every Monday at 2 AM
    "0 0 1 * *"        → First day of every month at midnight
    "0 9-18 * * 1-5"   → Every hour from 9 AM to 6 PM on weekdays

Tip: Use crontab.guru to validate cron expressions before deploying.
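To make the five-field syntax concrete, here is a minimal Python sketch that checks whether a given moment matches a cron expression. It is an illustration only, not the parser Kubernetes uses: it handles `*`, steps such as `*/15`, ranges such as `9-18`, and comma lists, and nothing else. The names `expand_field` and `matches` are hypothetical.

```python
from datetime import datetime

# Allowed (min, max) for minute, hour, day of month, month, day of week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, lo: int, hi: int) -> set[int]:
    """Expand one cron field ("*", "*/15", "9-18", "1,3", "5") into a set of values."""
    values: set[int] = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on the schedule described by `expr`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five space-separated fields")
    minute, hour, dom, month, dow = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    cron_weekday = when.isoweekday() % 7  # convert to cron numbering, Sunday=0
    dom_ok = when.day in dom
    dow_ok = cron_weekday in dow
    # Classic cron rule: if both day fields are restricted, either may match.
    if fields[2] != "*" and fields[4] != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return when.minute in minute and when.hour in hour and when.month in month and day_ok


if __name__ == "__main__":
    monday_2am = datetime(2024, 5, 27, 2, 0)  # 27 May 2024 was a Monday
    print(matches("0 2 * * 1", monday_2am))
```

A check like this can serve as a quick sanity test in a pipeline, but the times it evaluates carry no time zone, so it says nothing about which zone the cluster will apply to the schedule.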
concurrencyPolicy: The Most Important CronJob Setting
This controls what happens when a job is still running when its next scheduled run is due.
    concurrencyPolicy: Forbid   # Most common for production

| Policy | Behavior | Use When |
|---|---|---|
| Allow (default) | Multiple runs overlap freely | Task is fully idempotent and overlap is safe |
| Forbid | Skips the new run if the previous one is still running | Database reports, reconciliation jobs: overlap would cause duplicate data |
| Replace | Kills the current run, starts the new one | Hard deadline tasks where freshness matters more than completion |
Production mistake: Leaving concurrencyPolicy: Allow (the default) on a DB report job that sometimes runs long. Runs will start to overlap, which can write duplicate rows and corrupt your report. Always set this explicitly.
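The three policies can be compared with a small simulation. The Python sketch below is a simplified model, not the CronJob controller's actual logic: it assumes every run takes a fixed number of minutes and that schedule ticks arrive at a fixed interval. The function names `on_schedule_tick` and `simulate` are hypothetical.

```python
def on_schedule_tick(policy: str, previous_run_active: bool) -> str:
    """Decide what happens when a run comes due: "start", "skip", or "replace"."""
    if policy not in ("Allow", "Forbid", "Replace"):
        raise ValueError(f"unknown concurrencyPolicy: {policy}")
    if not previous_run_active:
        return "start"
    if policy == "Allow":
        return "start"    # overlaps with the run still in progress
    if policy == "Forbid":
        return "skip"     # the due run is dropped
    return "replace"      # stop the current run, then start the new one


def simulate(policy: str, interval_min: int, duration_min: int, ticks: int) -> list[str]:
    """Replay `ticks` schedule ticks and return the action taken at each one."""
    actions = []
    running_until: list[int] = []  # end times of runs still in progress
    for i in range(ticks):
        now = i * interval_min
        running_until = [end for end in running_until if end > now]
        action = on_schedule_tick(policy, bool(running_until))
        if action == "replace":
            running_until = []  # the old run is terminated
        if action in ("start", "replace"):
            running_until.append(now + duration_min)
        actions.append(action)
    return actions


if __name__ == "__main__":
    # An hourly schedule whose runs take 90 minutes (assumed figures).
    for policy in ("Allow", "Forbid", "Replace"):
        print(policy, simulate(policy, interval_min=60, duration_min=90, ticks=4))
```

Under these assumed figures, Allow starts a run at every tick, Forbid alternates between starting and skipping, and Replace never lets a run finish because each one is stopped after 60 of its 90 minutes. That last case is why Replace only suits tasks where a fresh start matters more than completion.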
Manually Triggering a CronJob
You cannot manually trigger a CronJob directly. The standard pattern is to create a one-off Job from the CronJob's template.
    # Trigger a CronJob immediately (Kubernetes 1.21+)
    JOB_NAME="manual-run-$(date +%s)"   # e.g. manual-run-1716900000
    kubectl create job --from=cronjob/nightly-report "$JOB_NAME" -n production

    # Watch it run
    kubectl get jobs -n production -w

    # Read its output
    kubectl logs -l job-name="$JOB_NAME" -n production

Handling Failures Correctly
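A Job does not retry a failed pod immediately. The Kubernetes documentation describes an exponential back-off delay (10s, 20s, 40s, and so on) capped at six minutes. The Python sketch below reproduces that timing for a given `backoffLimit`. The function name `retry_delays` and its parameters are hypothetical, and real delays in a cluster may differ.

```python
def retry_delays(backoff_limit: int, base_s: int = 10, cap_s: int = 360) -> list[int]:
    """Seconds waited before each retry: doubles from base_s, never above cap_s."""
    if backoff_limit < 0:
        raise ValueError("backoff_limit must be >= 0")
    return [min(base_s * 2 ** i, cap_s) for i in range(backoff_limit)]


if __name__ == "__main__":
    delays = retry_delays(3)
    print(delays)                         # [10, 20, 40]
    print("attempts:", len(delays) + 1)   # the first attempt plus one per retry
    print("total wait:", sum(delays), "seconds")
```

The total wait matters when choosing `activeDeadlineSeconds`: if the deadline is shorter than the task's runtime plus the accumulated back-off, the Job is terminated before it can use all of its retries.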
    spec:
      backoffLimit: 3              # Retry up to 3 times before marking the Job as Failed

      # Optional: Set a hard timeout on the entire job
      activeDeadlineSeconds: 600   # Kill the job if it has not completed in 10 minutes

    backoffLimit: 3

    Attempt 1 → FAILS → wait 10s
    Attempt 2 → FAILS → wait 20s
    Attempt 3 → FAILS → wait 40s
    Attempt 4 → FAILS → Job status = "Failed" → no more retries

    # Check why a job failed
    kubectl describe job db-migration-v2 -n production

    # Events section will show:
    # Warning  BackoffLimitExceeded  Job has reached the specified backoff limit

    # Read logs from the Job's failed pods.
    # With restartPolicy: OnFailure, add --previous to see the prior container instance.
    kubectl logs -l job-name=db-migration-v2 -n production

Real-World Pattern: Database Migration Before Deployment
At companies like Razorpay or Zerodha, database migrations must complete before the new application version starts taking traffic. This can be handled with Kubernetes init containers in a Deployment or, more commonly, with a pre-deploy Job in the CI/CD pipeline.
    # Pattern: Run migration Job, wait for it, then deploy the app

    # Step 1: Apply the migration Job
    kubectl apply -f db-migration-job.yaml -n production

    # Step 2: Wait for it to complete (blocks the pipeline)
    kubectl wait --for=condition=complete job/db-migration-v2 \
      --timeout=300s \
      -n production

    # Step 3: If the above exits 0, deploy the application
    kubectl apply -f deployment.yaml -n production

    # If the job fails, the Complete condition is never met, so
    # kubectl wait blocks until --timeout and then exits non-zero
    # Your pipeline stops here instead of deploying broken code
    # Exit code: 1
    # Error: timed out waiting for the condition

Cleaning Up Old Jobs
Jobs and their pods are not deleted automatically unless ttlSecondsAfterFinished is set. Without it, completed jobs pile up and waste etcd storage.
    # Delete a specific completed job (also deletes its pods)
    kubectl delete job db-migration-v2 -n production

    # Delete all completed jobs in a namespace
    kubectl delete jobs --field-selector status.successful=1 -n production

    # Better: set TTL in the Job spec
    spec:
      ttlSecondsAfterFinished: 3600   # Delete 1 hour after completion

Common Mistake: Not setting ttlSecondsAfterFinished on Jobs created by a CI/CD pipeline. After 100 deploys, you have 100 completed Job objects and 100 Completed pods consuming namespace resources. Set a TTL of 1–24 hours on every CI-created Job.
Tip: For CronJobs, always set successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 5. The default keeps 3 successful Jobs and 1 failed Job. With only 1 failed Job kept, you lose the earlier failure's history the moment a second failure happens, which makes debugging harder.