Disclaimer: this is something I learned at the second internet company I joined after graduating from university.


Background

On the last working day of the week, the whole office park suddenly lost power in the morning. The servers in our company's machine room have no UPS, so they died instantly and all we could do was wait for the power to return. It came back in under 10 minutes; I went to the machine room to power the servers on, and then discovered that the K8S cluster would not start.

Current State

The machine room runs a one-Master, multi-Node K8S cluster. After the machines rebooted, kubectl get nodes kept complaining that the apiserver was unreachable, so I went straight onto the master node.

$ kubectl get nodes 
The connection to the server 127.0.0.1:8443 was refused - did you specify the right host or port?

So the apiserver had failed to start. Check the apiserver's logs:

$ docker ps -a|grep apiserver
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eb44e1f500f0 registry.aliyuncs.com/kubeadm-ha/pause:3.2 "/pause" 4 seconds ago Exited (0) 2 seconds ago k8s_POD_kube-apiserver-xxx_kube-system_613abe9b942b6bc621a9d098614e6fe8_2
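The docker ps output only shows the pod's pause sandbox exiting; to see why the apiserver container itself keeps dying, its logs can be pulled directly (a sketch; the grep filter and --tail count are illustrative):

$ docker ps -a | grep kube-apiserver | grep -v pause
$ docker logs --tail 50 <apiserver-container-id>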

$ tail -f /var/log/messages

Node NotReady, with the kubelet suddenly giving the error "Failed to list ****: an error on the server ("") has prevented the request from succeeding":

E0305 11:14:38.953687 1000 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://127.0.0.1:8443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:8443: connect: connection refused

E0305 12:56:46.903299 1031 pod_workers.go:191] Error syncing pod 1916a059cb13894ce9385e67335904e2 ("etcd-xxx_kube-system(1916a059cb13894ce9385e67335904e2)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "back-off 10s restarting failed container=etcd pod=etcd-xxx_kube-system(1916a059cb13894ce9385e67335904e2)"
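The same kubelet errors can also be followed through systemd, which is quicker than grepping syslog (a sketch; this assumes kubelet runs as the standard systemd unit):

$ systemctl status kubelet
$ journalctl -u kubelet -n 100 -f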

Try a reboot first. After the reboot, the errors persisted:

Unable to register node with API server: Post "https://127.0.0.1:8443/api/v1/nodes": EOF

This is exactly the symptom described in "kubelet won't restart after reboot - Unable to register node with API server: connection refused" (see References).

It turned out the apiserver was unreachable because etcd had failed to start.

Next, check etcd's error output.
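As with the apiserver, the logs of the exited etcd container can be pulled from docker (a sketch; the name filter matches how kubelet names static-pod containers, k8s_<container>_<pod>_...):

$ docker logs $(docker ps -aq --filter name=k8s_etcd | head -n 1)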

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-03-05 11:39:10.063604 I | etcdmain: etcd Version: 3.4.13
2022-03-05 11:39:10.063656 I | etcdmain: Git SHA: ae9734ed2
2022-03-05 11:39:10.063660 I | etcdmain: Go Version: go1.12.17
2022-03-05 11:39:10.063664 I | etcdmain: Go OS/Arch: linux/amd64
2022-03-05 11:39:10.063668 I | etcdmain: setting maximum number of CPUs to 6, total number of available CPUs is 6
2022-03-05 11:39:10.063716 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-03-05 11:39:10.063752 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2022-03-05 11:39:10.064357 I | embed: name = etcd-xxx
2022-03-05 11:39:10.064368 I | embed: data dir = /var/lib/etcd
2022-03-05 11:39:10.064373 I | embed: member dir = /var/lib/etcd/member
2022-03-05 11:39:10.064383 I | embed: heartbeat = 100ms
2022-03-05 11:39:10.064386 I | embed: election = 1000ms
2022-03-05 11:39:10.064390 I | embed: snapshot count = 10000
2022-03-05 11:39:10.064399 I | embed: advertise client URLs = https://xxx:2379
2022-03-05 11:39:10.064404 I | embed: initial advertise peer URLs = https://xxx:2380
2022-03-05 11:39:10.064409 I | embed: initial cluster =
2022-03-05 11:39:10.250331 C | etcdmain: wal: crc mismatch

etcdmain: wal: crc mismatch

Some research showed the cause: etcd's persisted data had been corrupted, so it failed to start. The case I found online was likewise caused by data loss from a power outage!!!
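The WAL in question lives under the member directory that etcd reported at startup; every WAL record carries a CRC, and a write torn by the power cut makes that check fail. A quick look at what is actually on disk (paths taken from the log above):

$ ls -l /var/lib/etcd/member/wal/
$ ls -l /var/lib/etcd/member/snap/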

Solutions

  • Reset the etcd cluster
  • Reinstall the K8S cluster
  • Restore an etcd snapshot

Resetting the etcd cluster

Back up first:

On the master node:
$ cd /var/lib/
$ cp -r etcd etcd_backup
$ rm -rf etcd/*

Wait for the restart.
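kubelet re-creates the static etcd pod on its own; one way to watch it come back (a sketch):

$ watch 'docker ps -a | grep etcd | grep -v pause'

This time etcd came up only to crash with a different error: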

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-03-05 11:58:46.691947 I | etcdmain: etcd Version: 3.4.13
2022-03-05 11:58:46.692039 I | etcdmain: Git SHA: ae9734ed2
2022-03-05 11:58:46.692051 I | etcdmain: Go Version: go1.12.17
2022-03-05 11:58:46.692063 I | etcdmain: Go OS/Arch: linux/amd64
2022-03-05 11:58:46.692074 I | etcdmain: setting maximum number of CPUs to 6, total number of available CPUs is 6
2022-03-05 11:58:46.727766 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-03-05 11:58:46.727900 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2022-03-05 11:58:46.729577 I | embed: name = etcd-xxx
2022-03-05 11:58:46.729606 I | embed: data dir = /var/lib/etcd
2022-03-05 11:58:46.729619 I | embed: member dir = /var/lib/etcd/member
2022-03-05 11:58:46.729629 I | embed: heartbeat = 100ms
2022-03-05 11:58:46.729637 I | embed: election = 1000ms
2022-03-05 11:58:46.729647 I | embed: snapshot count = 10000
2022-03-05 11:58:46.729669 I | embed: advertise client URLs = https://xxx:2379
2022-03-05 11:58:46.765737 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs

A web search showed that this error happens when a node cannot obtain cluster information while initializing or joining. With multiple etcd members, resetting a single node works because the cluster metadata still exists on the other members; but ours is a single-master setup, so there is no second copy of the etcd data, and wiping it breaks the cluster. A re-join sketch for the multi-member case is shown below.
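For reference, in a multi-member cluster the wiped node would be removed and re-added through a surviving member, roughly like this (a sketch only, with hypothetical member id and endpoint; it does not apply to our single-master setup):

$ export ETCDCTL_API=3
$ etcdctl member list --endpoints=https://<healthy-member>:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
$ etcdctl member remove <old-member-id>   # same --endpoints/cert flags as above
$ etcdctl member add etcd-xxx --peer-urls=https://xxx:2380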

Honestly, at this point I was close to despair: were we going to have to reinstall the K8S cluster? The outage had already lasted an hour and nobody could work. After checking with my manager, we decided that if the problem was not solved by 1 PM, we would reinstall the K8S cluster.

Restoring the etcd snapshot

The official ETCD Disaster recovery documentation shows that a cluster can be recovered from an etcd snapshot, and fortunately an Ansible-deployed cron job had been snapshotting etcd on this machine every night:

crontab -l
#Ansible: Backup etcd databases
0 3 * * * /etc/kubernetes/backup/etcd/etcdtools backup
#Ansible: Clean etcd databases backup file
30 3 * * * /etc/kubernetes/backup/etcd/etcdtools cleanup
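The backup tool in that cron job presumably wraps etcdctl snapshot save; taken by hand it would look roughly like this (a sketch reusing this cluster's image and cert paths; the output filename is illustrative):

$ docker run --net host -e ETCDCTL_API=3 \
  -v /etc/kubernetes/pki/etcd/:/etc/kubernetes/pki/etcd/ \
  -v /etc/kubernetes/backup/etcd:/etc/kubernetes/backup/etcd \
  registry.aliyuncs.com/kubeadm-ha/etcd:3.4.13-0 \
  etcdctl snapshot save /etc/kubernetes/backup/etcd/etcd-snapshot-manual.db \
  --endpoints=https://xxx:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key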

The restore command:

  • etcdctl snapshot restore snapshot.db $ETCD_OPTIONS (where $ETCD_OPTIONS stands for the cluster-specific --name/--initial-* flags shown below)
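Before restoring, the snapshot file's own integrity can be verified (a sketch; run through the docker image shown below, since there is no local etcdctl):

$ etcdctl snapshot status /etc/kubernetes/backup/etcd/etcd-snapshot-20220305T030001.db --write-out=table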

Since etcdctl was not installed on the host, I ran it through docker instead:

Create a directory etcd1 to restore the snapshot into:
$ mkdir /var/lib/etcd1
$ docker run --net host -e ETCDCTL_API=3 \
-v /etc/kubernetes/backup/etcd:/etc/kubernetes/backup/etcd \
-v /etc/kubernetes/pki/etcd/:/etc/kubernetes/pki/etcd/ \
-v /etc/kubernetes/backup/etcd/etcd-snapshot-20220305T030001.db:/etc/kubernetes/backup/etcd/etcd-snapshot-20220305T030001.db \
-v /var/lib/etcd1/:/var/lib/etcd1/ \
registry.aliyuncs.com/kubeadm-ha/etcd:3.4.13-0 \
etcdctl snapshot restore \
/etc/kubernetes/backup/etcd/etcd-snapshot-20220305T030001.db \
--name etcd-xxx \
--endpoints=https://xxx:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--initial-advertise-peer-urls=https://xxx:2380 \
--initial-cluster=etcd-xxx=https://xxx:2380 \
--initial-cluster-token=etcd-cluster-token \
--data-dir=/var/lib/etcd1

At this point etcd1 contains the data restored from that day's 03:00 backup snapshot.
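A quick sanity check that the restore produced a proper data directory (etcd reads the member subdirectory under its data dir; a sketch):

$ find /var/lib/etcd1 -maxdepth 2 -type d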

Replace the existing etcd data directory (remove the emptied directory first so that mv replaces it rather than nesting etcd1 inside it):
$ rm -rf /var/lib/etcd
$ mv /var/lib/etcd1 /var/lib/etcd

Start etcd again.

Success!

Wait for etcd to finish starting, then start the apiserver.
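Both layers can be confirmed healthy from the master (a sketch reusing the image and cert paths from the restore command):

$ docker run --net host -e ETCDCTL_API=3 \
  -v /etc/kubernetes/pki/etcd/:/etc/kubernetes/pki/etcd/ \
  registry.aliyuncs.com/kubeadm-ha/etcd:3.4.13-0 \
  etcdctl endpoint health \
  --endpoints=https://xxx:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
$ kubectl get nodes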

The whole cluster was finally back!

Summary

This incident proves that a single machine really is unreliable. After the recovery I added two more etcd and apiserver nodes.

As for the office park's unreliable power supply, I really have to vent: there are at least 10 maintenance outages a year, which means constantly restarting servers. If that sounds like too much hassle, just move the machines into IDC colocation and save yourself a lot of trouble.

References

  • Troubleshoot kubectl connection refused
  • 3.3.21 crashing on every node with "etcdmain: walpb: crc mismatch" #11918
  • etcd panics with corrupted raft log errors
  • kubelet won't restart after reboot
  • etcd initialization - peer cluster information request failure results in unhealthy cluster #8337
  • New installation fails to connect peers
  • ETCD Disaster recovery
  • etcd data backup and restore verification