Kubernetes Jobs and CronJobs for Batch Workloads | DevOps Network

Kubernetes Jobs and CronJobs run tasks that are meant to complete, not run forever like a web server. A Job runs one or more pods until the task succeeds, then stops. A CronJob creates such Jobs on a schedule. Database migrations, report generation, and data pipelines at companies like Swiggy or Razorpay are typical workloads that need this kind of Kubernetes-level reliability.

Why Pods Alone Are Not Enough for Batch Tasks

A regular Deployment keeps pods running forever. If a container exits, even with a successful exit code, Kubernetes restarts it. That is the right behavior for a web server. It is the wrong behavior for a database migration: you do not want a migration to start over every time it finishes, or to be restarted without limit after it partially completed.

A Job solves this. It runs a pod to completion, retries on failure up to a defined limit, and then stops. Once the task is done, the pod stays in Completed state instead of being deleted — so you can inspect its logs.

◈ DIAGRAM

+---------------------+          +---------------------+
|   Deployment        |          |   Job               |
+---------------------+          +---------------------+
| Runs forever        |          | Runs to completion  |
| Restarts on crash   |          | Retries on failure  |
| Never "done"        |          | Stops when done     |
+---------------------+          +---------------------+
     ↑ Web servers                     ↑ Migrations, reports
       API pods                          Data pipelines

Anatomy of a Kubernetes Job

YAML

# db-migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v2
  namespace: production
spec:
  backoffLimit: 3          # Retry the pod up to 3 times on failure before giving up
  completions: 1           # How many times the task must succeed (default: 1)
  parallelism: 1           # How many pods run simultaneously (default: 1)
  ttlSecondsAfterFinished: 600  # Delete the Job and its pods 10 minutes after completion
  template:
    spec:
      restartPolicy: Never # REQUIRED for Jobs — must be Never or OnFailure, never Always
      containers:
        - name: migration
          image: registry.razorpay.in/payments-api:v2.5.1
          command: ["python", "manage.py", "migrate"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url

📌 restartPolicy: Never vs OnFailure

| Policy | Behavior | Use When |
|---|---|---|
| `Never` | The Job controller creates a new pod on each failure | You want a clean pod for each attempt: DB migrations, one-time scripts |
| `OnFailure` | The kubelet restarts the failed container inside the same pod | The task is safe to restart in-place: simple data transforms |

Parallel Jobs — Processing a Queue

When you need to process a large number of independent tasks (sending 10,000 emails, resizing 50,000 images), you can run multiple pods in parallel.

YAML

spec:
  completions: 50      # Total tasks to complete
  parallelism: 5       # Run 5 pods at a time
  backoffLimit: 10

◈ DIAGRAM

completions=50, parallelism=5
 
Start:    [Pod-1] [Pod-2] [Pod-3] [Pod-4] [Pod-5]  ← up to 5 pods run at the same time
          ↓ as soon as any pod succeeds, the controller starts a replacement
Next:     [Pod-6] [Pod-7] ...                      ← pods are not held back until a whole group finishes
          ↓
          ... continues until 50 total completions
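To get a feel for how long a parallel Job might run, you can estimate the minimum number of rounds it needs. The sketch below is only a back-of-the-envelope helper: `min_rounds` is a hypothetical function name, and it assumes every task takes roughly the same time, which real workloads rarely do.

```shell
# Rough lower bound on the number of "rounds" a parallel Job needs.
# Assumption: all tasks take about the same time. Hypothetical helper, not part of kubectl.
min_rounds() {
  completions="$1"
  parallelism="$2"
  # Integer ceiling division: ceil(completions / parallelism)
  echo $(( (completions + parallelism - 1) / parallelism ))
}
```

Calling `min_rounds 50 5` prints `10`. Multiply that by the typical task duration for a rough wall-clock estimate.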

Monitoring a Running Job

Bash

# Watch job status in real time
kubectl get jobs -n production -w
 
# Output:
# NAME               COMPLETIONS   DURATION   AGE
# db-migration-v2    0/1           10s        10s
# db-migration-v2    1/1           23s        23s  ← Job succeeded
 
# Read logs from the job's pod
kubectl logs -l job-name=db-migration-v2 -n production
 
# Describe for events and failure details
kubectl describe job db-migration-v2 -n production

CronJob — Running Jobs on a Schedule

A CronJob wraps a Job template in a schedule. At each scheduled time it creates a new Job object, which in turn creates the pod.

◈ DIAGRAM

CronJob (schedule: "0 2 * * *")
    │
    ├─ creates → Job (at 2:00 AM Monday)   → Pod → Runs → Completes
    ├─ creates → Job (at 2:00 AM Tuesday)  → Pod → Runs → Completes
    └─ creates → Job (at 2:00 AM Wednesday)→ Pod → Runs → Completes

YAML

# nightly-report-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: production
spec:
  schedule: "0 2 * * *"          # Every day at 2:00 AM in the time zone set below (IST here, not UTC)
  timeZone: "Asia/Kolkata"        # Stable since Kubernetes 1.27; if omitted, the controller's local time zone is used
  concurrencyPolicy: Forbid       # Do not start a new run if the previous one is still running
  successfulJobsHistoryLimit: 3   # Keep the last 3 successful job records
  failedJobsHistoryLimit: 5       # Keep the last 5 failed job records for debugging
  startingDeadlineSeconds: 300    # If the job misses its schedule by 5 minutes, skip this run
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report-generator
              image: registry.hotstar.com/analytics:v1.8.2
              command: ["python", "generate_report.py", "--date=yesterday"]
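The `startingDeadlineSeconds` setting is easier to reason about with a small model. The sketch below is a simplification, not the controller's actual logic: `should_start_run` is a hypothetical helper that takes the scheduled time and the current time as Unix timestamps and applies the deadline from the manifest above.

```shell
# Simplified model of startingDeadlineSeconds (hypothetical helper, not a kubectl command).
# $1 = scheduled time, $2 = current time (both Unix timestamps), $3 = deadline in seconds.
should_start_run() {
  scheduled="$1"
  now="$2"
  deadline="$3"
  late=$(( now - scheduled ))   # how many seconds past the scheduled time we are
  if [ "$late" -le "$deadline" ]; then
    echo "start"
  else
    echo "skip"
  fi
}
```

With a 300-second deadline, a run that is 200 seconds late still starts, while one that is 301 seconds late is skipped.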

CronJob Schedule Syntax

◈ DIAGRAM

┌─────────── Minute       (0–59)
│ ┌───────── Hour         (0–23)
│ │ ┌─────── Day of month  (1–31)
│ │ │ ┌───── Month         (1–12)
│ │ │ │ ┌─── Day of week   (0–6, Sunday=0)
│ │ │ │ │
* * * * *
 
Common schedules:
"0 * * * *"      → Every hour at :00
"*/15 * * * *"   → Every 15 minutes
"0 2 * * *"      → Every day at 2 AM
"0 2 * * 1"      → Every Monday at 2 AM
"0 0 1 * *"      → First day of every month at midnight
"0 9-18 * * 1-5" → Every hour from 9 AM to 6 PM on weekdays

PLACEMENT PRO TIP
Use [crontab.guru](https://crontab.guru) to validate cron expressions before deploying.
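For a quick local sanity check before you paste an expression into a manifest, a small function can split the schedule into its five fields. This is a minimal sketch: `cron_fields` is a hypothetical helper that only checks the field count and labels each field. It does not validate ranges or step values.

```shell
# Split a cron expression into its five named fields (hypothetical helper).
# Only the field count is checked; ranges such as 0-59 are NOT validated.
cron_fields() {
  set -f            # disable globbing so "*" stays a literal character
  set -- $1         # intentionally unquoted: split the expression on whitespace
  set +f
  if [ "$#" -ne 5 ]; then
    echo "expected 5 fields, got $#" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

For example, `cron_fields "0 2 * * 1"` prints `minute=0 hour=2 day-of-month=* month=* day-of-week=1`, and an expression with a missing field returns a non-zero exit code.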

concurrencyPolicy — The Most Important CronJob Setting

This controls what happens when the previous run is still in progress at the moment the next scheduled run is due.

YAML

concurrencyPolicy: Forbid # Most common for production

| Policy | Behavior | Use When |
|---|---|---|
| `Allow` (default) | Multiple runs overlap freely | Task is fully idempotent and overlap is safe |
| `Forbid` | Skips the new run if previous is still running | Database reports, reconciliation jobs (overlap would cause duplicate data) |
| `Replace` | Kills the current run, starts the new one | Hard deadline tasks where freshness matters more than completion |

⚠️ Production mistake: Leaving concurrencyPolicy: Allow (the default) on a DB report job that sometimes runs long. It will start overlapping runs, cause duplicate rows, and corrupt your report. Always set this explicitly.
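The three policies can be read as a small decision function. The sketch below only models the behavior described above. `concurrency_action` is a hypothetical helper, and the `yes`/`no` flag for "previous run still active" is an assumption made for illustration.

```shell
# Model of what a CronJob does at schedule time (hypothetical helper).
# $1 = concurrencyPolicy, $2 = "yes" if the previous run is still active, else "no".
concurrency_action() {
  policy="$1"
  previous_running="$2"
  if [ "$previous_running" != "yes" ]; then
    echo "start new run"          # nothing to conflict with, every policy behaves the same
    return 0
  fi
  case "$policy" in
    Allow)   echo "start new run alongside the old one" ;;
    Forbid)  echo "skip new run" ;;
    Replace) echo "stop old run, start new run" ;;
    *)       echo "unknown policy: $policy" >&2; return 1 ;;
  esac
}
```

`concurrency_action Forbid yes` prints `skip new run`, which is the outcome you want for a report job that occasionally runs long.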

Manually Triggering a CronJob

You cannot manually trigger a CronJob directly. The standard pattern is to create a one-off Job from the CronJob's template.

Bash

# Create a one-off Job from the CronJob's jobTemplate
JOB_NAME="manual-run-$(date +%s)"   # Unique name per run, reused in the commands below
kubectl create job --from=cronjob/nightly-report "$JOB_NAME" -n production
 
# Watch it run
kubectl get jobs -n production -w
 
# Read its output
kubectl logs -l job-name="$JOB_NAME" -n production

Handling Failures Correctly

YAML

spec:
  backoffLimit: 4   # Retry 4 times before marking the Job as Failed
 
  # Optional: Set a hard timeout on the entire job
  activeDeadlineSeconds: 600   # Kill the job if it has not completed in 10 minutes

◈ DIAGRAM

backoffLimit: 3
 
Attempt 1 → FAILS → wait 10s
Attempt 2 → FAILS → wait 20s
Attempt 3 → FAILS → wait 40s
Attempt 4 → FAILS → Job status = "Failed" → no more retries
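The retry delays in the diagram follow an exponential pattern: they start at 10 seconds and double each time, and Kubernetes caps the delay at six minutes. The sketch below reproduces that sequence. `backoff_delays` is a hypothetical helper, and the actual timing in a cluster can vary slightly.

```shell
# Print the approximate wait before each retry (hypothetical helper).
# $1 = number of retries (the backoffLimit). Delays double from 10s, capped at 360s.
backoff_delays() {
  retries="$1"
  delay=10          # first delay in seconds
  cap=360           # six-minute cap
  out=""
  i=0
  while [ "$i" -lt "$retries" ]; do
    out="${out:+$out }$delay"     # append the delay, space-separated
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$cap" ]; then
      delay="$cap"
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
}
```

`backoff_delays 3` prints `10 20 40`, matching the diagram. With a higher limit the cap becomes visible: `backoff_delays 7` prints `10 20 40 80 160 320 360`.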

Bash

# Check why a job failed
kubectl describe job db-migration-v2 -n production
 
# Events section will show:
# Warning  BackoffLimitExceeded  Job has reached the specified backoff limit
 
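# Read logs from the failed pods. With restartPolicy: Never each failed pod is kept,
# so --previous is not needed (it only applies to containers restarted inside the same pod)
kubectl logs -l job-name=db-migration-v2 -n production --tail=100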

Real-World Pattern — Database Migration Before Deployment

At Razorpay or Zerodha, database migrations must complete before the new application version starts taking traffic. This is handled with Kubernetes init containers in a Deployment, or more commonly, a pre-deploy Job in the CI/CD pipeline.

Bash

# Pattern: Run migration Job, wait for it, then deploy the app
 
# Step 1 — Apply the migration Job
kubectl apply -f db-migration-job.yaml -n production
 
# Step 2 — Wait for it to complete (blocks the pipeline)
kubectl wait --for=condition=complete job/db-migration-v2 \
  --timeout=300s \
  -n production
 
# Step 3 — If the above exits 0, deploy the application
kubectl apply -f deployment.yaml -n production

Bash

# If the Job fails, it never reaches condition=complete, so kubectl wait
# blocks until --timeout expires and only then exits non-zero
# Your pipeline stops here instead of deploying broken code
# Exit code: 1
# Error: timed out waiting for the condition
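If you would rather have the pipeline fail as soon as the Job is marked Failed, one option is to poll both conditions yourself. The sketch below is one possible approach, not the standard pattern: `wait_for_job` is a hypothetical helper, and its return codes (0 complete, 1 failed, 2 timed out) are a convention chosen here. It relies only on `kubectl get ... -o jsonpath` reading the Job's `Complete` and `Failed` conditions.

```shell
# Poll a Job until it completes, fails, or the timeout expires (hypothetical helper).
# $1 = job name, $2 = namespace, $3 = timeout in seconds (default 300), $4 = poll interval (default 5).
# Returns 0 on Complete, 1 on Failed, 2 on timeout.
wait_for_job() {
  job="$1"
  ns="$2"
  timeout="${3:-300}"
  interval="${4:-5}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    complete=$(kubectl get job "$job" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}')
    failed=$(kubectl get job "$job" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
    if [ "$complete" = "True" ]; then
      return 0
    fi
    if [ "$failed" = "True" ]; then
      return 1                    # fail fast instead of waiting for the timeout
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 2
}
```

In a pipeline this could replace Step 2: run `wait_for_job db-migration-v2 production 300` and deploy only if it returns 0.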

Cleaning Up Old Jobs

Jobs and their pods are not deleted automatically unless ttlSecondsAfterFinished is set. Without it, completed jobs pile up and waste etcd storage.

Bash

# Delete a specific completed job (also deletes its pods)
kubectl delete job db-migration-v2 -n production
 
# Delete all Jobs in a namespace that have exactly one successful completion
kubectl delete jobs --field-selector status.successful=1 -n production
 
# Better: set a TTL in the Job manifest (YAML, not a shell command)
#   spec:
#     ttlSecondsAfterFinished: 3600  # Delete 1 hour after completion

🔴 Common Mistake: Not setting ttlSecondsAfterFinished on Jobs created by a CI/CD pipeline. After 100 deploys, you have 100 completed Job objects and 100 Completed pods cluttering the namespace. Set a TTL of 1–24 hours on every CI-created Job.

💡 Tip: For CronJobs, always set successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 5. The default keeps 3 successful and 1 failed — the 1 failed limit means you lose failure history the moment a second failure happens, making debugging harder.
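To see why a failed-history limit of 1 loses information so quickly, it helps to model how a history limit trims old Jobs. The sketch below is a simplified illustration, not the controller's code: `prune_candidates` is a hypothetical helper that takes a limit followed by Job names ordered oldest first, and prints the ones that would be removed.

```shell
# Model of a CronJob history limit (hypothetical helper).
# $1 = history limit, remaining arguments = finished Job names, oldest first.
# Prints the names that exceed the limit and would be deleted.
prune_candidates() {
  limit="$1"
  shift
  excess=$(( $# - limit ))      # how many Jobs are over the limit
  out=""
  while [ "$excess" -gt 0 ]; do
    out="${out:+$out }$1"       # the oldest remaining Job goes first
    shift
    excess=$(( excess - 1 ))
  done
  echo "$out"
}
```

`prune_candidates 1 fail-mon fail-tue` prints `fail-mon`: with the default failed limit, Monday's failure is gone as soon as Tuesday's appears. With a limit of 5, the same call prints nothing.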
