After an unexpected power outage, a small Kubernetes cluster in our development environment ran into the "PLEG is not healthy" problem: pods got stuck in Unknown or ContainerCreating, and the affected nodes went NotReady:
```
# kubectl get nodes
NAME             STATUS     ROLES    AGE    VERSION   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-dev-master   Ready      master   1y     v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://17.3.0
k8s-dev-node1    NotReady   node     1y     v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node2    NotReady   node     1y     v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node3    NotReady   node     289d   v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node4    Ready      node     289d   v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://17.3.0
```
The kubelet log keeps reporting "skipping pod synchronization - container runtime is down PLEG is not healthy":
```
9月 25 11:05:06 k8s-dev-node1 kubelet[546]: I0925 11:05:06.003645 546 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 21m18.877402888s ago; threshold is 3m0s]
9月 25 11:05:11 k8s-dev-node1 kubelet[546]: I0925 11:05:11.004116 546 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 21m23.877803484s ago; threshold is 3m0s]
9月 25 11:05:16 k8s-dev-node1 kubelet[546]: I0925 11:05:16.004382 546 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 21m28.878169681s ago; threshold is 3m0s]
```
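Before restarting anything, it is worth checking whether the docker daemon itself is hung, since a stuck runtime is the usual reason the PLEG relist falls behind its 3m0s threshold. A minimal diagnostic sketch; the exact checks below are my own assumptions rather than the commands originally run on this cluster:

```
# Run on the NotReady node, e.g. k8s-dev-node1.
# A docker CLI call that never returns points at a hung daemon.
timeout 10 docker ps > /dev/null; echo "docker ps exit code: $?"

# Follow recent kubelet journal entries related to PLEG / the runtime
journalctl -u kubelet --since "10 min ago" | grep -iE "pleg|runtime is down"

# Node conditions as seen by the API server (run on the master)
kubectl describe node k8s-dev-node1 | grep -A5 Conditions
```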
Restarting docker and kubelet on the affected node (roughly as sketched after the issue list below) brings it back, but before long it fails again and drops to NotReady. A quick search turned up related issues on Stack Overflow and in github/kubernetes:
- https://stackoverflow.com/questions/53872739/how-to-fix-container-runtime-is-down-pleg-is-not-healthy
- https://github.com/kubernetes/kubernetes/issues/45419
- https://github.com/kubernetes/kubernetes/issues/61117
- https://github.com/kubernetes/kubernetes/issues/72533
- https://github.com/Azure/AKS/issues/102
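For reference, the temporary recovery mentioned above was nothing more than restarting the runtime and kubelet on the affected node; a rough sketch (node name assumed):

```
# Temporary workaround only -- the node drops back to NotReady after a while.
systemctl restart docker
systemctl restart kubelet

# Watch the node flip back to Ready (run on the master)
kubectl get nodes -w
```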
However, #45419 was only fixed in v1.16, and upgrading from 1.10 to 1.16 felt like too much work. A comment in #61117 suggested that clearing the /var/lib/kubelet/pods directory on the node can resolve the problem. The first attempt failed: the directory could not be deleted because mounted volumes were still holding it, and the problem remained. In the end I went further and upgraded docker from 17.3.0 to 19.3.2, then wiped everything under /var/lib/kubelet/pods/ and /var/lib/docker on every node, after which the problem was resolved.
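If a reboot is not convenient, the mount occupation that blocked the first attempt can usually be cleared by unmounting the leftover pod volumes before deleting the directory. A rough sketch, assuming GNU userland on the node; this is not the exact sequence used here (we simply rebooted first, see below):

```
# Stop kubelet so it does not re-mount volumes while cleaning up
systemctl stop kubelet

# List the volume mounts that keep /var/lib/kubelet/pods busy
mount | grep /var/lib/kubelet/pods

# Unmount them all (a lazy "umount -l" is a fallback for stuck mounts), then delete
mount | grep /var/lib/kubelet/pods | awk '{print $3}' | xargs -r umount
rm -rf /var/lib/kubelet/pods/
```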
The rough procedure:
```
# First disable auto-start for docker and kubelet, then clear the data after a reboot:
systemctl disable docker && systemctl disable kubelet
reboot
rm -rf /var/lib/kubelet/pods/
rm -rf /var/lib/docker

# docker-ce was also upgraded from 17.3.0 to 19.3.2 at this point.
# After the upgrade, docker.service is edited to keep the storage driver that
# 17.3.0 used by default (overlay); overlay2, devicemapper and vfs were all tried,
# but kubelet reported errors with each of them -- not sure whether that is a
# kubernetes v1.10 support issue or leftover data that was not fully cleaned up.
vi /etc/systemd/system/docker.service
ExecStart=/usr/bin/dockerd ... --storage-driver=overlay

# Reload the unit file and start docker
systemctl daemon-reload
systemctl start docker && systemctl enable docker
systemctl status docker

# Since /var/lib/docker was deleted entirely, nodes that cannot reach the
# k8s image registry directly need the required base images loaded manually:
docker load -i kubernetes-v10.0-node.tar

# Start kubelet
systemctl start kubelet && systemctl enable kubelet
systemctl status kubelet
```
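The docker-ce upgrade itself is not shown above; on CentOS 7 it is roughly the following, assuming the official docker-ce yum repository is reachable from the node (the exact package version strings and the old packages to remove may differ):

```
# Add Docker's CE repository (skip if already configured)
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# Remove the old 17.03-era packages and install 19.03.2
yum remove -y docker-ce docker-ce-selinux
yum install -y docker-ce-19.03.2 docker-ce-cli-19.03.2 containerd.io
```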
Problem solved:
```
# kubectl get nodes -o wide
NAME             STATUS   ROLES    AGE    VERSION   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-dev-master   Ready    master   1y     v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://17.3.0
k8s-dev-node1    Ready    node     1y     v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
k8s-dev-node2    Ready    node     1y     v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
k8s-dev-node3    Ready    node     289d   v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
k8s-dev-node4    Ready    node     289d   v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
```
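Besides the node status, it is worth confirming that no pods are left in Unknown or ContainerCreating; a quick check (the command choice here is mine, not from the original notes):

```
# Any pod not in Running/Completed state still needs attention
kubectl get pods --all-namespaces -o wide | grep -vE "Running|Completed"
```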
Unfortunately, this power outage also cost us three months of configuration data on the kong gateway :(. Backups! Backups! Backups!
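As a closing note, a minimal backup sketch, assuming the kong gateway stores its configuration in PostgreSQL (the host, user, database name, and backup path below are placeholders, not this cluster's real values):

```
# Nightly dump of the kong database; keep the archive off the node that may lose power
pg_dump -h kong-postgres.example.com -U kong -d kong \
  | gzip > /backup/kong-$(date +%F).sql.gz
```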