IT漫步

Notes on technology and life © Yaohui

How to install specific hotfix on Windows Server

A quirk of Windows container environments is that the OS Build Number of the Host and the Container must match, and some scenarios even require the Revision Number to match, so you frequently have to install a hotfix for a specific Revision on a K8s Node. Installing online with PowerShell means a slow, unpredictable download, so the smoothest path is to locate the msu package for the exact Revision Number and install it directly:

1. Find the KB that corresponds to the target version on the Windows Update History site. For example, Windows Server 1809 OS Build 10.0.17763.1158: https://support.microsoft.com/en-us/help/4549949
2. Search the Windows Update Catalog by KB number (https://www.catalog.update.microsoft.com/) and locate the download package. For example, KB4549949 for 17763.1158: https://www.catalog.update.microsoft.com/Search.aspx?q=KB4549949
3. Download the msu package and install it with the wusa command:
wusa windows10.0-kb4549949-x64_90e8805e69944530b8d4d4877c7609b9a9e68d81.msu

Appendix: to keep the Windows Node from drifting to another version, also turn off Windows Auto Update so the Node OS cannot update itself:
a). Check the Auto Update status:
%systemroot%\system32\Cscript %systemroot%\system32\scregedit.wsf /AU /v
b). Disable Windows …
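For reference, a quick way to confirm which build and revision a node actually ended up on after installing the msu is to read the CurrentBuildNumber and UBR values from the registry (a minimal sketch; run from an elevated command prompt, and note that UBR is reported in hexadecimal):

C:\> reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /v CurrentBuildNumber
C:\> reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /v UBR

For 10.0.17763.1158 the expected values are 17763 and 0x486 (1158 in decimal).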


For Windows Container, you need to set --image-pull-progress-deadline for kubelet

Windows images easily run to several GB, and images based on Windows Server Core are 5~10 GB, so the kubelet on a Windows node often cancels image pulls: Failed to pull image "XXX": rpc error: code = Unknown desc = context canceled. The cause is that the default image pulling progress deadline is 1 minute; if the pull makes no progress within that minute, the download is cancelled, so larger images can never be pulled successfully. From the official documentation: If no pulling progress is made before this deadline, the image pulling will be cancelled. This docker-specific flag only works when container-runtime is set to docker. (default …
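On a Windows node where kubelet is registered as a service through nssm, one way to raise the deadline is to append the flag to the kubelet command line and restart the service. A rough sketch, assuming nssm launches kubelet.exe directly and using <existing parameters> as a placeholder for whatever arguments the service already carries; on nodes where nssm wraps a PowerShell start script instead (as in the next post), the flag belongs on the kubelet invocation inside that script:

C:\k> nssm get kubelet AppParameters
C:\k> nssm set kubelet AppParameters "<existing parameters> --image-pull-progress-deadline=30m"
C:\k> nssm restart kubelet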


Describe Kubelet Service Parameters on Azure Windows node

Query the Kubelet service managed by nssm:

C:\k>sc qc kubelet
[SC] QueryServiceConfig SUCCESS

SERVICE_NAME: kubelet
        TYPE               : 10  WIN32_OWN_PROCESS
        START_TYPE         : 2   AUTO_START
        ERROR_CONTROL      : 1   NORMAL
        BINARY_PATH_NAME   : C:\k\nssm.exe
        LOAD_ORDER_GROUP   :
        TAG                : 0
        DISPLAY_NAME       : Kubelet
        DEPENDENCIES       : docker
        SERVICE_START_NAME : LocalSystem

Query kubelet AppParameters by nssm:

C:\k>nssm get kubelet Application
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe

C:\k>nssm get …
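Since nssm here only points at powershell.exe (which in turn runs the kubelet start script), another way to see which flags kubelet was actually started with is to inspect the command line of the running kubelet.exe process. A minimal sketch, assuming the process is named kubelet.exe:

C:\k> wmic process where "name='kubelet.exe'" get commandline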


Run Windows container with Hyper-V isolation mode in Kubernetes

Windows Containers have two isolation modes, Hyper-V and Process (see: Isolation Modes), and the version compatibility between the host OS and the container OS differs between the two (see: Windows container version compatibility). Hyper-V mode clearly has better compatibility than Process mode: it is backward compatible, meaning a newer host OS can run an older container OS but not the other way around, whereas in Process mode Windows Server requires the host OS and container OS versions to match exactly, and Windows 10 does not support Process mode at all. One day I wanted to run a Container in Hyper-V mode on a Kubernetes Windows node, only to find this in the 1.17 documentation: Note: In this document, when we talk about Windows containers we mean Windows containers with process isolation. Windows containers with Hyper-V isolation is planned for a future release. Not ready to give up, I googled a bit more and found: 1. Someone had filed a bug, and it had already been fixed: https://github.com/kubernetes/kubernetes/issues/58750 2. The code had also been merged: …
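As a point of comparison outside Kubernetes, docker on Windows exposes the same two modes through the --isolation flag, which makes the compatibility difference easy to verify by hand. A rough sketch on a Windows Server 2019 (1809) host, where the ltsc2016 image is simply an older container OS chosen for illustration:

C:\> docker run --rm --isolation=hyperv mcr.microsoft.com/windows/servercore:ltsc2016 cmd /c ver
C:\> docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2016 cmd /c ver

The first command should run under Hyper-V isolation, while the second is expected to fail because process isolation requires the container OS build to match the host.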


kernel:unregister_netdevice: waiting for eth0 to become free. Usage count = 1

[Copied from]: https://access.redhat.com/solutions/3659011

RHEL7 and kubernetes: kernel:unregister_netdevice: waiting for eth0 to become free. Usage count = 1
SOLUTION VERIFIED - Updated October 13 2019 at 9:17 PM

Issue
We are trying to prototype kubernetes on top of RHEL and encounter the situation that the device seems to be frozen. There are repeated messages similar to: Raw …


[Kubernetes] Create deployment, service by Python client

Install Kubernetes Python Client and PyYaml:

# pip install kubernetes pyyaml

1. Get Namespaces or Pods by CoreV1Api:

# -*- coding: utf-8 -*-
from kubernetes import client, config, utils

config.kube_config.load_kube_config(config_file="../kubecfg.yaml")
coreV1Api = client.CoreV1Api()

print("\nListing all namespaces")
for ns in coreV1Api.list_namespace().items:
    print(ns.metadata.name)

print("\nListing pods with their IP, namespace, names:")
for pod in coreV1Api.list_pod_for_all_namespaces(watch=False).items:
    print("%s\t\t%s\t%s" % (pod.status.pod_ip, …


Customize hosts record on docker and kubernetes

Docker:

docker run -it --rm --add-host=host1:172.17.0.2 --add-host=host2:192.168.1.3 busybox

Use "--add-host" to add entries to /etc/hosts.

Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: hostaliases-pod
spec:
  hostAliases:
  - ip: "127.0.0.1"
    hostnames:
    - "foo.local"
    - "bar.local"
  - ip: "10.1.2.3"
    hostnames:
    - "foo.remote"
    - "bar.remote"
  containers:
  - name: cat-hosts
    image: busybox
    command:
    - cat
    args:
    - "/etc/hosts"

Use …
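To see the effect end to end, the pod above simply cats /etc/hosts, so one way to check the injected entries is to apply the manifest and read the pod logs (a small sketch, assuming the spec is saved as hostaliases-pod.yaml):

# kubectl apply -f hostaliases-pod.yaml
# kubectl logs hostaliases-pod

The output should contain the foo.local/bar.local and foo.remote/bar.remote lines alongside the default entries.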


Running into the legendary "container runtime is down PLEG is not healthy"

After an unexpected power outage, a small kubernetes cluster in our development environment ran into the infamous PLEG is not healthy problem. The symptoms: pods in k8s went to Unknown or ContainerCreating, and the k8s nodes went NotReady:

# kubectl get nodes
NAME             STATUS     ROLES   AGE  VERSION  EXTERNAL-IP  OS-IMAGE               KERNEL-VERSION               CONTAINER-RUNTIME
k8s-dev-master   Ready      master  1y   v1.10.0  <none>       CentOS Linux 7 (Core)  3.10.0-957.21.3.el7.x86_64   docker://17.3.0
k8s-dev-node1    NotReady   node    1y   v1.10.0  <none>       CentOS Linux 7 (Core)  3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node2    NotReady   node    1y   v1.10.0  <none>       CentOS Linux 7 (Core)  3.10.0-957.21.3.el7.x86_64   …
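The excerpt cuts off here; as a general first check on a node in this state (not necessarily the fix this post goes on to describe), it is worth confirming whether the docker daemon itself still responds, since docker://Unknown in the node list suggests kubelet cannot talk to it:

# docker ps
# systemctl status docker kubelet
# journalctl -u kubelet --since "10 min ago" | grep -i pleg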


Kubernetes CronJob failed to schedule: Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew

On Kubernetes v1.13.3, I scheduled a cronjob to run every 5 minutes, but noticed that no new pod had been created for 3 days:

# kubectl get cronjob/dingtalk-atndsyncer
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
dingtalk-atndsyncer   */5 * * * *   False     0        3d1h            4d21h

The cronjob's .spec.concurrencyPolicy is Forbid, so concurrent runs are not allowed. Describing the cronjob shows FailedNeedsStart, with the message "Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."

# kubectl describe …
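The excerpt stops before the resolution, but the error message itself points at .spec.startingDeadlineSeconds; one way to set it (a sketch, with 200 seconds as an arbitrary value) is a strategic merge patch on the cronjob, after which the controller only counts missed runs within that window:

# kubectl patch cronjob dingtalk-atndsyncer -p '{"spec":{"startingDeadlineSeconds":200}}'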


Kubernetes 1.13.3 external etcd clean up | Clearing data from an external Kubernetes etcd cluster

If something goes wrong while setting up Kubernetes, kubeadm reset can be used to reset the Kubernetes cluster state. But if an external etcd cluster is used, kubeadm reset does not clear the data in that external etcd cluster, which means that if you run kubeadm init again you will see the data from the previous kubernetes cluster. To query and manually clear the external etcd cluster (using Kubernetes 1.13.3 as an example):

1. Query all data:

docker run --rm -it --net host -v /etc/kubernetes:/etc/kubernetes -e ETCDCTL_API=3 k8s.gcr.io/etcd:3.2.24 etcdctl --cert="/etc/kubernetes/pki/etcd/healthcheck-client.crt" --key="/etc/kubernetes/pki/etcd/healthcheck-client.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints https://etcd1.cloud.k8s:2379 get "" --prefix

2. Delete all data:

docker run --rm -it --net host -v /etc/kubernetes:/etc/kubernetes -e ETCDCTL_API=3 k8s.gcr.io/etcd:3.2.24 etcdctl --cert="/etc/kubernetes/pki/etcd/healthcheck-client.crt" --key="/etc/kubernetes/pki/etcd/healthcheck-client.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints https://etcd1.cloud.k8s:2379 del "" --prefix …
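After wiping the keyspace, and before re-running kubeadm init, it can be worth confirming that the etcd endpoint is still healthy. A small sketch reusing the same client certificates, where the endpoint list should be whatever your external cluster actually uses:

docker run --rm -it --net host -v /etc/kubernetes:/etc/kubernetes -e ETCDCTL_API=3 k8s.gcr.io/etcd:3.2.24 etcdctl --cert="/etc/kubernetes/pki/etcd/healthcheck-client.crt" --key="/etc/kubernetes/pki/etcd/healthcheck-client.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --endpoints https://etcd1.cloud.k8s:2379 endpoint health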
