Problem Description

Today one of our Kubernetes compute nodes went into the NotReady state. I first logged into the node and checked the kubelet and Docker processes; both looked fine.

I then checked the system log (/var/log/messages) and found the following errors:

Dec 31 12:44:16 docker18 kubelet: E1231 12:44:16.634146  707301 kubelet_volumes.go:128] Orphaned pod "356a8df1-0b4e-11e9-8afe-fa163e75de2b" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Dec 31 12:44:18 docker18 kubelet: E1231 12:44:18.629745 707301 kubelet_volumes.go:128] Orphaned pod "356a8df1-0b4e-11e9-8afe-fa163e75de2b" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
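The pod UID in these messages can be pulled out of the log with a small helper. A sketch, assuming the same log path as above; `orphaned_pod_uids` is a made-up function name:

```shell
# Hypothetical helper: extract the UIDs of orphaned pods reported by
# kubelet. Defaults to /var/log/messages; pass another file to test.
orphaned_pod_uids() {
    grep 'Orphaned pod' "${1:-/var/log/messages}" |
        grep -o '"[0-9a-f][0-9a-f-]*"' |   # the quoted UID in the message
        tr -d '"' |
        sort -u
}
```

With several repeats of the same error in the log, `sort -u` leaves one line per pod UID, which is handy when more than one orphaned pod piles up on a node.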

Problem Diagnosis

From the error message we can infer that this node has an orphaned pod, and that the pod still has a data volume mounted, which prevents kubelet from garbage-collecting it.

Note: an orphaned pod here is a "naked" pod, i.e. one that has no controller to adopt it.
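Such naked pods can be listed by looking for pods that carry no ownerReferences. A hedged sketch, assuming kubectl access to the cluster; `list_naked_pods` is a hypothetical helper, and `<none>` is the marker kubectl prints for a missing field in custom-columns output:

```shell
# Hypothetical helper: given rows of "<namespace> <name> <owner-kind>",
# keep only the pods that have no owning controller.
list_naked_pods() {
    awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Usage (on a workstation with cluster access):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
#     | list_naked_pods
```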

A Google search for the error message confirms this:
https://github.com/kubernetes/kubernetes/issues/60987
https://github.com/kubernetes/kubernetes/pull/68616

When kubelet meets an orphaned pod, it cleans up the pod and its directories (cleanupOrphanedPodDirs);
but if there are still mount paths in those directories, the cleanup is skipped.
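That skip condition can be reproduced by hand: kubelet will not delete a pod directory while /proc/mounts still lists a mount point inside it. A rough sketch of the same check; `pod_dir_has_mounts` is a made-up helper, and the UID and kubelet root below are the ones from this node:

```shell
# Mimic the cleanupOrphanedPodDirs precondition: a pod directory can
# only be removed once no mount points remain inside it.
pod_dir_has_mounts() {
    # true (exit 0) if any mount point in /proc/mounts lies under $1
    # (the mount point is the 2nd field, always preceded by a space)
    grep -q " $1" /proc/mounts
}

POD_DIR="/data/kubelet/pods/356a8df1-0b4e-11e9-8afe-fa163e75de2b"
if pod_dir_has_mounts "$POD_DIR"; then
    echo "volume paths still present on disk: kubelet skips cleanup"
else
    echo "no mounts under $POD_DIR: safe for kubelet to remove"
fi
```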

Fixing the Problem

1. First, use the pod UID to find the mount entries for the pod's data volumes:

# mount -l | grep 356a8df1-0b4e-11e9-8afe-fa163e75de2b
ceph01,ceph02,ceph03:/kube/volumes/kubernetes-dynamic-pvc-fad0f75d-f3ab-11e8-ad67-1e1c4625dec0 on /data/kubelet/pods/356a8df1-0b4e-11e9-8afe-fa163e75de2b/volumes/kubernetes.io~cephfs/pvc-fac32543-f3ab-11e8-acec-fa163e75de2b type ceph (rw,relatime,name=kubernetes-dynamic-user-fad0f7ae-f3ab-11e8-ad67-1e1c4625dec0,secret=<hidden>,acl)

2. To avoid losing data on the volume, unmount the mount point first (rather than deleting through the live mount):

umount /data/kubelet/pods/356a8df1-0b4e-11e9-8afe-fa163e75de2b/volumes/kubernetes.io~cephfs/pvc-fac32543-f3ab-11e8-acec-fa163e75de2b

3. Remove the pod's metadata from the node:

rm -r /data/kubelet/pods/356a8df1-0b4e-11e9-8afe-fa163e75de2b
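Steps 1 to 3 can be combined into one helper for reuse on nodes with several orphaned pods. A sketch; `cleanup_orphaned_pod` is a hypothetical function name, and the kubelet root path is the one used on this node:

```shell
# Hypothetical helper: unmount everything still mounted inside an
# orphaned pod's directory, then remove the directory itself.
cleanup_orphaned_pod() {
    pod_dir="$1"
    # Unmount deepest paths first so nested mounts release cleanly.
    awk -v d="$pod_dir" 'index($2, d) == 1 { print $2 }' /proc/mounts |
        sort -r |
        while read -r mnt; do umount "$mnt"; done
    # Remove the pod metadata only once nothing is mounted underneath,
    # mirroring kubelet's own precondition.
    if ! grep -q " $pod_dir" /proc/mounts; then
        rm -r "$pod_dir"
    fi
}

# Usage on the node (pod UID taken from the kubelet log above):
#   cleanup_orphaned_pod /data/kubelet/pods/356a8df1-0b4e-11e9-8afe-fa163e75de2b
```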

4. Check that the Kubernetes compute node is healthy again:

kubectl get nodes

OK, the compute node is back to normal :)