Troubleshooting ImagePullBackOff and Registry Authentication Issues
### Overview and What You Will Learn
ImagePullBackOff is one of the most common errors engineers encounter when deploying to Kubernetes, and one of the most frustrating, because the same image that pulls fine on your laptop refuses to pull on the cluster. Five completely different root causes all produce the same status message. This lab walks through each cause systematically, with concrete diagnostic commands and fixes.
By the end of this guide you will be able to:
- Distinguish between `ErrImagePull` (first attempt) and `ImagePullBackOff` (retry with backoff), and explain why both map to the same root causes
- Diagnose authentication failures with private registries and create the correct `imagePullSecret`
- Fix image name, tag, and digest errors that cause pull failures
- Configure registry credentials for AWS ECR, GCP Artifact Registry, and private Harbor instances
- Resolve network-level pull failures caused by firewall rules or registry outages
### Why This Matters in Production
At Razorpay, a CI/CD pipeline successfully built and pushed a new payments service image to their private ECR registry, but the Kubernetes deployment sat in ImagePullBackOff for 11 minutes before an engineer noticed. The root cause: the imagePullSecret referencing the ECR credentials had expired. The new pods could not pull the image, but the old pods were still running, so no alerts fired. The fix took 30 seconds once diagnosed, but discovery took 11 minutes of confusion.
At Hotstar, a developer accidentally deployed with `image: video-encoder:latest` instead of `image: registry.hotstar.com/video-encoder:v3.2.1`, so the cluster tried to pull from Docker Hub (which doesn't have the image) instead of their private registry. Same error message, completely different cause, completely different fix.
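The Hotstar mix-up comes down to how an image reference is resolved when it carries no registry host. As a rough illustration only (a simplified sketch of the usual Docker-style normalisation rule, not the kubelet's actual parser), the hypothetical helper `resolve_registry` below treats the first path component as a registry host only if it contains a dot or a colon, or is exactly `localhost`. Everything else falls back to Docker Hub.

```bash
# resolve_registry IMAGE_REF
# Hypothetical helper: prints the registry host an image reference points at,
# using the simplified rule described above.
resolve_registry() {
  ref=$1
  first=${ref%%/*}          # text before the first "/", or the whole ref if none

  # No "/" at all (e.g. "video-encoder:latest"): always Docker Hub.
  if [ "$first" = "$ref" ]; then
    echo "docker.io"
    return 0
  fi

  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;     # looks like a host (domain, host:port, localhost)
    *)                 echo "docker.io" ;;  # e.g. "myorg/app" is a Docker Hub namespace
  esac
}
```

Running `resolve_registry video-encoder:latest` prints `docker.io`, while `resolve_registry registry.hotstar.com/video-encoder:v3.2.1` prints `registry.hotstar.com`, which is the difference between the two manifests in the incident.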
### Core Principles
The five root causes of ImagePullBackOff. Always check them in this order:

CAUSE 1: Wrong image name or tag
- `image: my-app:v2` (no registry prefix, so the pull goes to Docker Hub)
- `image: registry.razorpay.in/my-app` (missing tag, so the pull tries `latest`, which may not exist)
- `image: registry.razorpay.in/my-app:v2.0.0` (the tag doesn't exist in the registry)

CAUSE 2: Missing or incorrect imagePullSecret
- A private registry requires credentials.
- No imagePullSecret: 401 Unauthorized from the registry.
- Wrong secret: 403 Forbidden or "incorrect username or password".

CAUSE 3: Expired credentials (ECR tokens expire after 12 hours)
- AWS ECR tokens are time-limited. A secret created yesterday may be expired today.
- Symptom: it worked before, then suddenly fails when a new pod is scheduled.

CAUSE 4: Network cannot reach the registry
- A node firewall blocks outbound HTTPS to the registry domain.
- The private registry sits behind a VPN that cluster nodes cannot access.
- Symptom: "dial tcp: connection timed out" in pod events.

CAUSE 5: Registry rate limiting (Docker Hub)
- Docker Hub limits unauthenticated pulls to 100/6hr per IP.
- A shared NAT gateway means all nodes share one IP, so the rate limit is hit quickly.
- Symptom: "toomanyrequests: You have reached your pull rate limit"
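The symptom strings listed above can be turned into a small triage helper. The sketch below is illustrative only: `classify_pull_error` is a hypothetical function name, the cause labels are made up for this example, and the matching is plain substring matching on the signatures quoted in this guide, so an unusual runtime message may fall through to `unknown`.

```bash
# classify_pull_error MESSAGE
# Hypothetical helper: maps a "Failed to pull image" event message to one of
# the five causes, using only the error signatures quoted in this guide.
classify_pull_error() {
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')

  case "$msg" in
    *toomanyrequests*|*"pull rate limit"*)
      echo "cause-5: registry rate limiting" ;;
    *"no such host"*|*"i/o timeout"*|*"connection timed out"*)
      echo "cause-4: network cannot reach the registry" ;;
    *"no basic auth credentials"*|*"pull access denied"*)
      echo "cause-3: expired or absent credentials" ;;
    *unauthorized*|*forbidden*|*"incorrect username or password"*)
      echo "cause-2: missing or incorrect imagePullSecret" ;;
    *"manifest unknown"*|*"not found"*)
      echo "cause-1: wrong image name or tag" ;;
    *)
      echo "unknown: read the full event message" ;;
  esac
}

# Example usage (assumes kubectl access to the namespace):
#   kubectl get events -n production --field-selector reason=Failed \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#     while IFS= read -r line; do classify_pull_error "$line"; done
```

The network and rate-limit patterns are tested first because those messages are the least ambiguous. A 401 can also come from an expired token, so treat the cause-2 and cause-3 labels as a starting point, not a verdict.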
### Detailed Step-by-Step Practical Lab
#### Step 1 - Identify the Exact Error
```bash
kubectl get pods -n production
# NAME                        READY   STATUS             RESTARTS
# payments-api-6d8f9b-xkp2q   0/1     ImagePullBackOff   0

# ALWAYS describe the pod first: the Events section contains the actual error
kubectl describe pod payments-api-6d8f9b-xkp2q -n production

# Look for the Events section at the bottom:
# Events:
#   Warning  Failed   kubelet  Failed to pull image "registry.razorpay.in/payments-api:v2.1.0":
#     rpc error: code = Unknown desc = failed to pull and unpack image
#     "registry.razorpay.in/payments-api:v2.1.0":
#     unexpected status code 401 Unauthorized
#
#   Warning  Failed   kubelet  Error: ErrImagePull
#   Normal   BackOff  kubelet  Back-off pulling image "registry.razorpay.in/payments-api:v2.1.0"
#   Warning  Failed   kubelet  Error: ImagePullBackOff
```

> **Remember:** `ErrImagePull` is the first failed attempt. `ImagePullBackOff` is Kubernetes applying exponential backoff (10s → 20s → 40s → ... → 5min cap) before retrying. Both share the same root cause, so always read the `Failed to pull image` line above them for the actual error message.

#### Step 2 - Diagnose and Fix: Wrong Image Name or Tag

```bash
# Error signature: "manifest unknown" or "not found"
# Failed to pull image "my-app:v2": manifest unknown: manifest tagged by "v2" is not found

# Check what tags actually exist in your registry
# For AWS ECR:
aws ecr describe-images \
  --repository-name payments-api \
  --region ap-south-1 \
  --query 'imageDetails[*].imageTags' \
  --output table

# For Harbor (private registry):
curl -u rahul:password \
  https://registry.razorpay.in/v2/payments-api/tags/list

# Fix: Update the deployment with the correct image reference
kubectl set image deployment/payments-api \
  payments-api=registry.razorpay.in/payments-api:v2.1.0 \
  -n production

# Verify the image reference in the deployment spec
kubectl get deployment payments-api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# registry.razorpay.in/payments-api:v2.1.0
```

> ⚠️ **Security:** Never use `image: myapp:latest` in production manifests. The `latest` tag is mutable: anyone who can push to the repository can silently point it at a different image. Always pin to a versioned tag (`v2.1.0`) or, better, an image digest (`sha256:abc123...`) for reproducible deployments.

#### Step 3 - Diagnose and Fix: Missing imagePullSecret for Private Registry

```bash
# Error signature: "401 Unauthorized" or "403 Forbidden"
# Failed to pull image: unexpected status code 401 Unauthorized

# Step 1 - Create the imagePullSecret from registry credentials
# Method A: Pass the credentials directly (kubectl builds the Docker config for you)
kubectl create secret docker-registry razorpay-registry-secret \
  --docker-server=registry.razorpay.in \
  --docker-username=deploy-bot \
  --docker-password='sup3rs3cr3tP@ssword' \
  --docker-email=devops@razorpay.com \
  --namespace=production

# Method B: From an existing Docker config.json (if you've already logged in locally)
# Only works if config.json holds the auth inline; credential helpers store it elsewhere.
kubectl create secret generic razorpay-registry-secret \
  --from-file=.dockerconfigjson=$HOME/.docker/config.json \
  --type=kubernetes.io/dockerconfigjson \
  --namespace=production

# Verify the secret was created correctly
kubectl get secret razorpay-registry-secret -n production -o yaml
```

```yaml
# deployment-with-pull-secret.yaml: reference the secret in your pod spec
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  # apps/v1 requires a selector and matching pod labels.
  # The "app: payments-api" label is an assumption; match your existing Deployment.
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      imagePullSecrets:
        - name: razorpay-registry-secret   # Reference secret by name
      containers:
        - name: payments-api
          image: registry.razorpay.in/payments-api:v2.1.0
```

```bash
kubectl apply -f deployment-with-pull-secret.yaml

# Alternative: Patch an existing deployment to add imagePullSecrets
kubectl patch deployment payments-api -n production \
  --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/imagePullSecrets","value":[{"name":"razorpay-registry-secret"}]}]'
```

> 💡 **Tip:** Attach the imagePullSecret to the namespace's default ServiceAccount so every new pod that runs under that ServiceAccount inherits it, eliminating the need to add `imagePullSecrets` to every individual deployment:
>
> `kubectl patch serviceaccount default -n production -p '{"imagePullSecrets": [{"name": "razorpay-registry-secret"}]}'`

#### Step 4 - Diagnose and Fix: Expired AWS ECR Credentials

AWS ECR authentication tokens expire after 12 hours. A secret created during cluster setup will fail the next day:

```bash
# Error signature: "no basic auth credentials" or "401" from ECR
# Failed to pull image: pull access denied, repository does not exist or may require authorization

# Check when the ECR secret was created
# Note: later `kubectl apply` updates do not change creationTimestamp, so treat it as a hint.
kubectl get secret ecr-registry-secret -n production \
  -o jsonpath='{.metadata.creationTimestamp}'
# 2025-05-24T08:15:00Z   (created more than 12 hours ago = expired)

# Refresh the ECR token and update the secret
kubectl create secret docker-registry ecr-registry-secret \
  --docker-server=123456789.dkr.ecr.ap-south-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region ap-south-1)" \
  --namespace=production \
  --dry-run=client -o yaml | kubectl apply -f -
```

```yaml
# ecr-token-refresher-cronjob.yaml: automatically refresh the ECR token every 6 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ecr-token-refresher
  namespace: production
spec:
  schedule: "0 */6 * * *"   # Every 6 hours, well within the 12-hour expiry
  jobTemplate:
    spec:
      template:
        spec:
          # Needs an IAM role to call ECR and RBAC permission to create/update Secrets
          serviceAccountName: ecr-refresher-sa
          restartPolicy: OnFailure
          containers:
            - name: ecr-refresher
              # NOTE: amazon/aws-cli does not ship kubectl. Use an image that
              # bundles both CLIs, and pin its tag instead of using latest.
              image: amazon/aws-cli:latest
              command:
                - /bin/sh
                - -c
                - |
                  ECR_TOKEN=$(aws ecr get-login-password --region ap-south-1)
                  kubectl create secret docker-registry ecr-registry-secret \
                    --docker-server=123456789.dkr.ecr.ap-south-1.amazonaws.com \
                    --docker-username=AWS \
                    --docker-password="${ECR_TOKEN}" \
                    --namespace=production \
                    --dry-run=client -o yaml | kubectl apply -f -
                  echo "ECR token refreshed at $(date)"
```

> **Remember:** On EKS you normally do not need a pull secret for ECR at all. The kubelet authenticates pulls with the node's IAM role, so granting that role ECR read permissions (for example the `AmazonEC2ContainerRegistryReadOnly` managed policy) removes the expiring token and the CronJob. **IRSA (IAM Roles for Service Accounts)** grants AWS permissions to pods, such as the refresher job above; it does not authenticate image pulls.

#### Step 5 - Diagnose and Fix: Network Cannot Reach Registry

```bash
# Error signature: "connection timed out" or "no such host"
# Failed to pull image: dial tcp: lookup registry.razorpay.in: no such host
# Failed to pull image: dial tcp 10.20.30.40:443: i/o timeout

# Test DNS resolution for the registry from inside a pod
# (image pulls use the node's resolver, so confirm from the node as well)
kubectl run registry-test \
  --image=busybox:1.35 \
  --restart=Never \
  -n production \
  -- nslookup registry.razorpay.in
# Read the result with: kubectl logs registry-test -n production

# Test TCP connectivity to the registry port
kubectl run registry-test-2 \
  --image=nicolaka/netshoot \
  --restart=Never \
  -n production \
  -- nc -zv registry.razorpay.in 443
# Connection to registry.razorpay.in 443 port [tcp/https] succeeded!   (reachable)
# nc: connect to registry.razorpay.in port 443 (tcp) failed: Connection timed out   (blocked)

# If the connection times out, check node security group / firewall rules
# For AWS EKS: verify the node security group allows outbound HTTPS (443) to the registry IP
aws ec2 describe-security-groups \
  --group-ids sg-node-security-group-id \
  --query 'SecurityGroups[0].IpPermissionsEgress'
```

```bash
# For private registries: check if the registry is accessible from the VPC
# Test directly from the node via SSH
ssh ec2-user@mumbai-worker-node-ip
curl -v https://registry.razorpay.in/v2/
# Should return: {"errors":[{"code":"UNAUTHORIZED",...}]}   (reachable; auth error is expected)
# Or: curl: (6) Could not resolve host   (DNS failure)
# Or: curl: (28) Operation timed out   (network blocked)

# Clean up test pods
kubectl delete pod registry-test registry-test-2 -n production
```

#### Step 6 - Diagnose and Fix: Docker Hub Rate Limiting

```bash
# Error signature: "toomanyrequests"
# Failed to pull image: toomanyrequests:
# You have reached your pull rate limit. You may increase the limit by authenticating.

# Check current rate limit status from inside a pod
# The script is single-quoted so that $TOKEN is expanded inside the pod, not on your laptop.
kubectl run ratelimit-test \
  --image=nicolaka/netshoot \
  --restart=Never \
  -n production \
  -- sh -c '
    TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
    curl -s --head -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest 2>&1 | grep -i ratelimit
  '
# Read the result with: kubectl logs ratelimit-test -n production
# ratelimit-limit: 100;w=21600
# ratelimit-remaining: 0;w=21600   (exhausted)
```

```bash
# Fix 1: Authenticate Docker Hub pulls to get higher limits (200/6hr per account)
# Create a Docker Hub pull secret
kubectl create secret docker-registry dockerhub-secret \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=razorpay-devops \
  --docker-password=dckr_pat_xxxxxxxxxxxx \
  --namespace=production

# Fix 2 (Permanent): Mirror public images to your private registry
# Never pull from Docker Hub directly in production; mirror images first
```

```bash
# Mirror a public image to your private ECR registry
# Pull locally, retag, push to private registry
docker pull postgres:15.4
docker tag postgres:15.4 123456789.dkr.ecr.ap-south-1.amazonaws.com/postgres:15.4
docker push 123456789.dkr.ecr.ap-south-1.amazonaws.com/postgres:15.4

# Update deployments to use the mirrored image
kubectl set image statefulset/postgres \
  postgres=123456789.dkr.ecr.ap-south-1.amazonaws.com/postgres:15.4 \
  -n production
```

#### Step 7 - Verify the Fix and Confirm Successful Pull

```bash
# After applying any fix, force a new pod to attempt the pull
kubectl rollout restart deployment/payments-api -n production

# Watch the new pod status
kubectl get pods -n production -w
# payments-api-7f8g9h-mn3lp   0/1   ContainerCreating   0   5s
# payments-api-7f8g9h-mn3lp   1/1   Running             0   18s   (image pulled successfully)

# Confirm image was pulled by checking pod events
kubectl describe pod payments-api-7f8g9h-mn3lp -n production | grep -A5 Events
# Events:
#   Normal  Pulling  kubelet  Pulling image "registry.razorpay.in/payments-api:v2.1.0"
#   Normal  Pulled   kubelet  Successfully pulled image in 4.821s   (success)
#   Normal  Created  kubelet  Created container payments-api
#   Normal  Started  kubelet  Started container payments-api

# Verify which image digest was actually pulled
kubectl get pod payments-api-7f8g9h-mn3lp -n production \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
# docker-pullable://registry.razorpay.in/payments-api@sha256:abc123def456...
```

### Production Best Practices & Common Pitfalls

* Mirror all public images (Docker Hub, quay.io, gcr.io) to your private registry as part of your base image policy. Public registries have rate limits, availability incidents, and can remove images; your production cluster should never depend on them directly.
* Use image digests (`image: registry.razorpay.in/payments-api@sha256:abc123...`) instead of mutable tags in production GitOps manifests. Tags can be overwritten; digests are immutable.
* Rotate registry credentials on a schedule and automate the Kubernetes secret update via CI/CD or a CronJob. Manual rotation always gets forgotten until a deployment fails at 2am.
* For multi-namespace clusters, attach the imagePullSecret to each namespace's `default` ServiceAccount rather than adding it to every deployment manifest individually. One change covers every pod that runs under that ServiceAccount.
* Always test image pull independently of deployment configuration by running `kubectl run test --image=<your-image> --restart=Never -n <ns>`; this isolates the pull failure from any deployment spec issues.

> 🔴 **Common Mistake:** Deleting and recreating the pod to "force a retry" of an ImagePullBackOff. The backoff timer resets on pod recreation, but if the root cause is not fixed (wrong image name, missing secret, expired token), the new pod will fail identically. Fix the root cause first, confirmed by `kubectl describe pod`, before attempting any restart.

### Quick Reference & Troubleshooting Commands

| Command | Purpose |
|:---|:---|
| `kubectl describe pod <name> -n <ns>` | Primary diagnostic: read the Events section for the exact error |
| `kubectl get events -n <ns> --field-selector reason=Failed` | List all pull failure events in the namespace |
| `kubectl get secret <name> -n <ns> -o yaml` | Inspect imagePullSecret contents |
| `kubectl create secret docker-registry <name> --docker-server=... --docker-username=... --docker-password=...` | Create registry pull secret |
| `kubectl patch serviceaccount default -n <ns> -p '{"imagePullSecrets": [{"name": "<secret>"}]}'` | Attach pull secret to pods using the namespace's default ServiceAccount |
| `kubectl set image deployment/<name> <container>=<new-image> -n <ns>` | Fix image reference directly |
| `aws ecr get-login-password --region <region>` | Generate fresh ECR auth token |
| `kubectl run test --image=<image> --restart=Never -n <ns>` | Test image pull in isolation |
| `kubectl rollout restart deployment/<name> -n <ns>` | Force new pods after fixing the root cause |
| `kubectl get pod <name> -n <ns> -o jsonpath='{.status.containerStatuses[0].imageID}'` | Confirm which image digest was pulled |
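
To get a feel for the backoff noted in Step 1 and in the common mistake above, the retry delays can be worked out by hand. The sketch below assumes exactly the figures quoted in this guide (a 10-second first delay, doubling each time, capped at 5 minutes); `backoff_schedule` is a hypothetical helper name and the function only does arithmetic, it does not query the cluster.

```bash
# backoff_schedule ATTEMPTS
# Hypothetical helper: prints the delay in seconds before each of the first
# ATTEMPTS retries, assuming 10s initial delay, doubling, 300s (5 min) cap.
backoff_schedule() {
  attempts=$1
  delay=10
  cap=300
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out="${out:+$out }$delay"        # append the delay, space-separated
    delay=$(( delay * 2 ))           # exponential growth
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap                     # never wait longer than the cap
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
}

backoff_schedule 7
# prints: 10 20 40 80 160 300 300
```

Under these assumptions a pod that has been failing for a while retries only once every five minutes, which is why a freshly fixed secret can appear not to work until the next retry or a `kubectl rollout restart`.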