Bad Pods Labπ
Lab Overviewπ
Note
In this lab we will create various bad-pods
with intentional errors and then debug these bad-pods
Next we will cover how to take a node out of service for maintenance without disturbing our workloads.
Introπ
In this lab you will be introduced to some error scenarios and learn how to troubleshoot issues with your kubernetes cluster.
These lab activities assume that you have a namespace called ‘refresher’ in your cluster. You can create this namespace with this command:
kubectl create namespace refresher
Experience Gainedπ
- Debug pod creation failures
- Debug pod crashes
- Node maintenance scenario
Debug pod creation failuresπ
-
Create a file called
bad-pod.yaml
Paste the following into it:apiVersion: v1 kind: Pod metadata: name: bad-pod spec: containers: - name: fail image: busybox:training-fail tty: true command: - cat
-
Run the following command to create the pod:
kubectl apply -f bad-pod.yaml -n=refresher
-
Letβs watch the various pod transitions:
kubectl get pods --watch -n=refresher
Warning
You will see a set of transitions like ErrImagePull and ImagePullBackOff as show below
NAME READY STATUS RESTARTS AGE
bad-pod 0/1 ErrImagePull 0 3s
bad-pod 0/1 ImagePullBackOff 0 15s
bad-pod 0/1 ErrImagePull 0 28s
-
Letβs stop watching and describe the pod to see if we can figure out what is happening:
Ctrl-C kubectl describe pod bad-pod -n=refresher | grep -A25 Events:
In the
Events
section of the output you will see indicators of a problem downloading the image as seen below -
Letβs delete the pod and try attempt to fix the image being pulled:
kubectl delete -f bad-pod.yaml -n=refresher vi bad-pod.yaml
Replace
busybox:training-fail
tobusybox
and then save -
Letβs create the pod and watch the pod again:
kubectl apply -f bad-pod.yaml -n=refresher kubectl get pods --watch -n=refresher
The pod should be successfully created now
NAME READY STATUS RESTARTS AGE bad-pod 1/1 Running 0 11m
-
Lets clean up and delete the pod :
kubectl delete -f bad-pod.yaml -n=refresher
Debug pod crashesπ
Letβs create a pod that will continually restart because it exits with a non-zero exit code.
-
Create a file called
crash-loop.yaml
Paste the following into it:vi crash-loop.yaml
apiVersion: v1 kind: Pod metadata: name: crash-loop spec: containers: - name: fail image: busybox command: - sh - -c - sleep 10 && exit 1
-
Deploy the pod:
kubectl apply -f crash-loop.yaml -n=refresher
-
Letβs watch the various pod transitions
kubectl get pods --watch -n=refresher
Notice how after a few Running/Error transitions the pod goes into CrashLoopBackoff and the run time is delayed as shown below
NAME READY STATUS RESTARTS AGE crash-loop 0/1 ContainerCreating 0 3s crash-loop 1/1 Running 0 4s crash-loop 0/1 Error 0 14s crash-loop 1/1 Running 1 15s crash-loop 0/1 Error 1 25s crash-loop 0/1 CrashLoopBackOff 1 37s crash-loop 1/1 Running 2 39s crash-loop 0/1 Error 2 49s crash-loop 0/1 CrashLoopBackOff 2 60s
-
Lets stop watching and delete the pod:
Ctrl+C kubectl delete -f crash-loop.yaml -n=refresher
-
Lets fix the pod command definition:
vi crash-loop.yaml
Replace
- sleep 10 && exit 1
with- echo Container 1 is Running && sleep 3600
and then save -
Letβs create the pod and watch the pod again:
kubectl apply -f crash-loop.yaml -n=refresher kubectl get pods --watch -n=refresher
The pod should be successfully created now
-
Lets clean up and delete the pod :
kubectl delete -f crash-loop.yaml -n=refresher
Node maintenance scenarioπ
-
Letβs create a deployment:
kubectl create deployment maintenance-lab --image=gcr.io/google-samples/node-hello:1.0 -n=refresher
-
Find the node for this pod:
Copy the value under the Node column for later reference.kubectl get pods -o wide -n=refresher # You'll see a few columns one will have NODE and list the name of the node the pod is running on
Our node names look like this:
ip-10-10-1-17.us-gov-west-1.compute.internal
-
We need to perform some βmaintenanceβ on this node. Letβs make sure no more pods can be scheduled to it:
kubectl cordon node-name -n=refresher # EX: kubectl cordon ip-10-10-1-17.us-gov-west-1.compute.internal # node-name will be the name of the kubernetes node that the # pod that is part of the maintenance-lab deployment is running on
The output should be something like:
node/ip-10-10-58-44.us-gov-west-1.compute.internal cordoned
Describe the cordoned node:
kubectl describe node <cordoned node: e.g `ip-10-10-58-44.us-gov-west-1.compute.internal`>
Examine the output and look for the Taints towards the top of the output. You will see this:
And at the end of the output, something like:node.kubernetes.io/unschedulable:NoSchedule as shown below
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeNotSchedulable 5m27s kubelet Node ip-10-10-58-44.us-gov-west-1.compute.internal status is now: NodeNotSchedulable
-
List out the nodes:
kubectl get nodes
Notice the status of the node includes SchedulingDisabled as shown below
NAME STATUS ROLES AGE VERSION ip-10-10-3-88.us-gov-west-1.compute.internal Ready <none> 3d22h v1.18.12+rke2r2 ip-10-10-55-57.us-gov-west-1.compute.internal Ready etcd,master 3d22h v1.18.12+rke2r2 ip-10-10-58-44.us-gov-west-1.compute.internal Ready,SchedulingDisabled <none> 3d22h v1.18.12+rke2r2 ip-10-10-83-95.us-gov-west-1.compute.internal Ready <none> 3d22h v1.18.12+rke2r2
-
Letβs move the maintenance lab workload to the other node:
This step might take a few minutes. Notice that pod/maintenance-lab has been moved as shown belowkubectl drain node-name --ignore-daemonsets # Note: If you get an error about not being able to delete pods with local storage, you may also need to add the --delete-emptydir-data option
node/ip-10-10-58-44.us-gov-west-1.compute.internal already cordoned WARNING: ignoring DaemonSet-managed Pods: kube-system/kube-proxy-t75c7, kube-system/rke2-canal-78ctm evicting pod default/maintenance-lab-54545b884b-8j5rj pod/maintenance-lab-54545b884b-8j5rj evicted node/ip-10-10-58-44.us-gov-west-1.compute.internal evicted
-
Look at the node for our maintenance-lab pod:
Notice that it is now running on the other node.kubectl get pods -o wide -n refresher
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES maintenance-lab-54545b884b-n26db 1/1 Running 0 10m 10.42.2.4 ip-10-10-83-95.us-gov-west-1.compute.internal <none> <none>
-
Uncordon the node and look at the nodes again
kubectl uncordon node-name
node/ip-10-10-58-44.us-gov-west-1.compute.internal uncordoned NAME STATUS ROLES AGE VERSION ip-10-10-3-88.us-gov-west-1.compute.internal Ready <none> 3d22h v1.18.12+rke2r2 ip-10-10-55-57.us-gov-west-1.compute.internal Ready etcd,master 3d22h v1.18.12+rke2r2 ip-10-10-58-44.us-gov-west-1.compute.internal Ready <none> 3d22h v1.18.12+rke2r2 ip-10-10-83-95.us-gov-west-1.compute.internal Ready <none> 3d22h v1.18.12+rke2r2
-
Clean up and delete the deployment and refresher namespace:
kubectl delete deployment maintenance-lab -n=refresher # deployment.apps "maintenance-lab" deleted kubectl delete namespace refresher
Lab Summaryπ
You have been able to successfully troubleshoot issues with pod creation and inspect the lifecyle to infer underlying issues that were preventing a pod for getting created and/or stuck in a crash loop.
You also ran a typical maintenance scenario where you cordoned off a node and then drained that node off any workloads it was running. After performing some maintenance such as upgrades or repairs you have uncordoned the node so that it back available to run workloads.