Troubleshooting Flashcards
What logging solutions can be configured in a Kubernetes cluster? Describe them.
Cluster-level and node-level logging.
At cluster level, 3 options exist:
- Configure a node-level logging agent that runs as a DaemonSet, reads the log files on each node, and ships them to an external logging backend. Benefits: no changes to application code or pod configuration are required.
- Use a sidecar container, where each pod runs an extra container that writes the logs to its own stream or log file, which is then collected by a similar node-level logging agent and sent to the logging backend. Benefit: different streams (stdout and stderr) are easily separated.
- Push logs directly to the logging backend from the application container.
Node-level logging means the logs stay in log files on each node. It benefits from being less complex, but changing the logging backend requires application changes.
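A minimal sketch of reading the separate streams produced by the sidecar approach (the pod and container names here are hypothetical):
kubectl get pod counter -o jsonpath='{.spec.containers[*].name}'   # list main + sidecar containers
kubectl logs counter -c count-log-stdout                           # stream collected by the stdout sidecar
kubectl logs counter -c count-log-stderr                           # stream collected by the stderr sidecar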
What resource do you need to monitor cluster and application metrics?
A metrics server
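Once metrics-server is running, usage can be queried directly; a short sketch (the namespace is a placeholder):
kubectl top nodes                               # CPU/memory per node
kubectl top pods -n my-namespace                # CPU/memory per pod
kubectl top pods -n my-namespace --containers   # break usage down per container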
What are the steps to take when troubleshooting a Pod and what commands would you use for each one?
1st Retrieve high level information
- Run kubectl get pods and look at the READY, STATUS, and RESTARTS columns
2nd Inspect events
- kubectl get events, or kubectl describe pod <pod-name> and check the Events section
3rd Inspect logs
- kubectl logs <pod-name> (use --previous to get logs from the previous container instance)
4th Open interactive shell
- kubectl exec -it <pod-name> -- sh
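A sketch of the whole sequence against a single failing pod (the pod name web-7d4f is hypothetical):
kubectl get pods                           # 1. READY / STATUS / RESTARTS overview
kubectl describe pod web-7d4f              # 2. events and effective configuration
kubectl logs web-7d4f --previous           # 3. logs of the previously crashed instance
kubectl exec -it web-7d4f -- sh            # 4. interactive shell inside the container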
What are 3 common error STATUS values found on pods, what do they mean, and what are potential fixes?
ImagePullBackOff/ErrImagePull
- Image could not be pulled from the registry
- Check correct image name
- Check that image name exists in the registry
- Check network access from the node
- Ensure proper authentication
CrashLoopBackOff
- Application or command run in container crashes
- Check command executed
- Ensure the image can properly execute (use a docker container to test)
CreateContainerConfigError
- ConfigMap or Secret referenced cannot be found
- Check correct name of the configuration object
- Verify the existence of the configuration object in the namespace
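A few hedged checks that map to the statuses above (image, object, and namespace names are placeholders):
crictl pull registry.example.com/myapp:1.2.3        # ImagePullBackOff: can the node pull the image?
docker run --rm registry.example.com/myapp:1.2.3    # CrashLoopBackOff: does the image run outside Kubernetes?
kubectl get configmap app-config -n my-namespace    # CreateContainerConfigError: does the ConfigMap exist?
kubectl get secret app-secret -n my-namespace       # ...and the Secret?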
How would you troubleshoot a service?
- Check that the selector labels match the ones on the pods: kubectl describe service <service-name> and kubectl get pods --show-labels, then compare them
- Check the endpoints to see if the number of pods is the expected one: kubectl get endpoints <service-name>
- Check if the service type is the one you want
- Check if the port mapping is properly configured
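A sketch of these checks for a hypothetical service web in namespace shop:
kubectl describe service web -n shop       # selector, type, ports, endpoints
kubectl get pods -n shop --show-labels     # do the pod labels match the selector?
kubectl get endpoints web -n shop          # one address per ready pod expected
kubectl run tmp --rm -it --restart=Never --image=busybox -n shop -- wget -qO- http://web:80   # optional in-cluster connectivity test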
How would you troubleshoot a cluster failure?
Run kubectl get nodes, check if the nodes are all Ready
- Does the version of any node deviate from the version of the others?
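For example:
kubectl get nodes -o wide   # STATUS should be Ready everywhere; the VERSION column shows each node's kubelet version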
How would you troubleshoot a control plane node?
- Run kubectl get pods -n kube-system and check that all pods are healthy; apply the pod troubleshooting steps to any that are not
- Run kubectl cluster-info (add dump to get more detail)
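A sketch (the dump directory is an arbitrary choice):
kubectl get pods -n kube-system
kubectl cluster-info
kubectl cluster-info dump --output-directory=/tmp/cluster-dump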
How would you troubleshoot worker nodes?
A node can be NotReady in the following cases:
- Insufficient resources - run kubectl describe node worker1, then run top and df on the node
- Issues with the kubelet process - systemctl status kubelet; if it is not active/running, run journalctl -u kubelet.service
- Certificate issues - run openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -text -noout (verify this path on your cluster)
- Check the kube-proxy pod using the pod troubleshooting steps
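A sketch of the node-level checks, run from the control plane and then on the affected worker (the node name and certificate path assume kubeadm defaults):
kubectl describe node worker1              # conditions: MemoryPressure, DiskPressure, Ready
ssh worker1                                # the remaining commands run on the node itself
top                                        # CPU and memory usage
df -h                                      # disk usage
systemctl status kubelet
sudo journalctl -u kubelet.service --no-pager | tail -n 50
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -text -noout | grep -A 2 Validity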
How would you back up and restore etcd?
Run kubectl describe pod etcd-controlplane -n kube-system
Get the certificate flags from its spec:
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--key-file=/etc/kubernetes/pki/etcd/server.key
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
Run
sudo ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /opt/etcd.bak
Run to restore:
sudo ETCDCTL_API=3 etcdctl --data-dir=/var/bak snapshot restore /opt/etcd.bak
Edit the etcd manifest to point the data volume at the restored directory:
vim /etc/kubernetes/manifests/etcd.yaml and set the hostPath of the etcd data volume to /var/bak
Restart the kubelet (sudo systemctl restart kubelet)
If the etcd pod does not transition to a Running state, delete it so the static pod is recreated
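A quick way to sanity-check the snapshot before relying on it (a sketch; newer etcd releases also ship etcdutl for offline operations):
sudo ETCDCTL_API=3 etcdctl --write-out=table snapshot status /opt/etcd.bak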