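Note: this is something I learned at the second internet company I joined after graduating from university.


Background

Today at work I created a namespace and attached a node dedicated to it. After work I deleted both the node and the namespace to save resources.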

$ kubectl delete ns test --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
namespace "test" force deleted

The namespace deletion hung.

Troubleshooting

Initial investigation

First, check the state of the namespace

$ kubectl get ns test
NAME STATUS AGE
test Terminating 18h

It is stuck in the Terminating state.

$ kubectl get ns test -o json
{
"apiVersion": "v1",
"kind": "Namespace",
"metadata": {
"annotations": {
"kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Namespace\",\"metadata\":{\"annotations\":{},\"labels\":{\"name\":\"test\"},\"name\":\"test\"}}\n"
},
"creationTimestamp": "2022-03-10T07:40:36Z",
"deletionTimestamp": "2022-03-10T12:02:47Z",
"labels": {
"kubernetes.io/metadata.name": "test",
"name": "test"
},
"name": "test",
"resourceVersion": "67782733",
"uid": "46eff73e-3ce2-439a-91cf-6464ea5eeebb"
},
"spec": {
"finalizers": [
"kubernetes"
]
},
"status": {
"conditions": [
{
"lastTransitionTime": "2022-03-10T12:02:54Z",
"message": "All resources successfully discovered",
"reason": "ResourcesDiscovered",
"status": "False",
"type": "NamespaceDeletionDiscoveryFailure"
},
{
"lastTransitionTime": "2022-03-10T12:02:54Z",
"message": "All legacy kube types successfully parsed",
"reason": "ParsedGroupVersions",
"status": "False",
"type": "NamespaceDeletionGroupVersionParsingFailure"
},
{
"lastTransitionTime": "2022-03-10T12:02:54Z",
"message": "All content successfully deleted, may be waiting on finalization",
"reason": "ContentDeleted",
"status": "False",
"type": "NamespaceDeletionContentFailure"
},
{
"lastTransitionTime": "2022-03-10T12:02:54Z",
"message": "Some resources are remaining: ingresses.extensions has 1 resource instances, ingresses.networking.k8s.io has 1 resource instances",
"reason": "SomeResourcesRemain",
"status": "True",
"type": "NamespaceContentRemaining"
},
{
"lastTransitionTime": "2022-03-10T12:02:54Z",
"message": "Some content in the namespace has finalizers remaining: ingress.k8s.aws/resources in 2 resource instances",
"reason": "SomeFinalizersRemain",
"status": "True",
"type": "NamespaceFinalizersRemaining"
}
],
"phase": "Terminating"
}
}

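From what I found online:

To delete a namespace, Kubernetes must first delete all the resources in the namespace and then check the status of the registered API services. If the namespace contains resources that Kubernetes cannot delete, or an API service has a False status, the namespace gets stuck in the Terminating state.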
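
The status conditions in the JSON output above already say what is blocking the deletion. As a minimal sketch, assuming a namespace object shaped like that output, the conditions whose status is "True" can be pulled out as follows. `blocking_conditions` is a hypothetical helper name, not part of kubectl, and the sample data is a trimmed copy of the output above.

```python
def blocking_conditions(namespace):
    """Return (type, message) for every status condition whose status is "True"."""
    conditions = namespace.get("status", {}).get("conditions", [])
    return [(c["type"], c["message"]) for c in conditions if c.get("status") == "True"]


# Trimmed copy of the `kubectl get ns test -o json` output from this post.
ns = {
    "kind": "Namespace",
    "metadata": {"name": "test"},
    "status": {
        "phase": "Terminating",
        "conditions": [
            {
                "type": "NamespaceDeletionContentFailure",
                "status": "False",
                "message": "All content successfully deleted, may be waiting on finalization",
            },
            {
                "type": "NamespaceContentRemaining",
                "status": "True",
                "message": "Some resources are remaining: ingresses.extensions has 1 resource instances, ingresses.networking.k8s.io has 1 resource instances",
            },
            {
                "type": "NamespaceFinalizersRemaining",
                "status": "True",
                "message": "Some content in the namespace has finalizers remaining: ingress.k8s.aws/resources in 2 resource instances",
            },
        ],
    },
}

for cond_type, message in blocking_conditions(ns):
    print(f"{cond_type}: {message}")
```

For the namespace in this post, that leaves only NamespaceContentRemaining and NamespaceFinalizersRemaining, both of which point at the ingress.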

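Save that JSON to a file, then delete the contents of spec and the conditions in status (if you remove the spec and status blocks entirely, also delete the comma that follows the metadata block).
What remains looks roughly like this: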

# Edit tempfile.json
$ vim tempfile.json

{
"apiVersion": "v1",
"kind": "Namespace",
"metadata": {
"annotations": {
"kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Namespace\",\"metadata\":{\"annotations\":{},\"labels\":{\"name\":\"test\"},\"name\":\"test\"}}\n"
},
"creationTimestamp": "2022-03-10T07:40:36Z",
"deletionTimestamp": "2022-03-10T12:02:47Z",
"labels": {
"kubernetes.io/metadata.name": "test",
"name": "test"
},
"name": "test",
"resourceVersion": "67782733",
"uid": "46eff73e-3ce2-439a-91cf-6464ea5eeebb"
},
"spec": {
},
"status": {
"phase": "Terminating"
}
}

# Send the edited object to the namespace's finalize endpoint
$ kubectl replace --raw "/api/v1/namespaces/test/finalize" -f ./tempfile.json

# Response
{
"kind": "Namespace",
"apiVersion": "v1",
"metadata": {
"name": "test",
"uid": "46eff73e-3ce2-439a-91cf-6464ea5eeebb",
"resourceVersion": "67782733",
"creationTimestamp": "2022-03-10T07:40:36Z",
"deletionTimestamp": "2022-03-10T12:02:47Z",
"labels": {
"kubernetes.io/metadata.name": "test",
"name": "test"
},
"annotations": {
"kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Namespace\",\"metadata\":{\"annotations\":{},\"labels\":{\"name\":\"test\"},\"name\":\"test\"}}\n"
},
"managedFields": [
{
"manager": "kubectl-client-side-apply",
"operation": "Update",
"apiVersion": "v1",
"time": "2022-03-10T07:40:36Z",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:metadata": {
"f:annotations": {
".": {},
"f:kubectl.kubernetes.io/last-applied-configuration": {}
},
"f:labels": {
".": {},
"f:kubernetes.io/metadata.name": {},
"f:name": {}
}
}
}
},
{
"manager": "kube-controller-manager",
"operation": "Update",
"apiVersion": "v1",
"time": "2022-03-10T12:02:54Z",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:status": {
"f:conditions": {
".": {},
"k:{\"type\":\"NamespaceContentRemaining\"}": {
".": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
},
"k:{\"type\":\"NamespaceDeletionContentFailure\"}": {
".": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
},
"k:{\"type\":\"NamespaceDeletionDiscoveryFailure\"}": {
".": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
},
"k:{\"type\":\"NamespaceDeletionGroupVersionParsingFailure\"}": {
".": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
},
"k:{\"type\":\"NamespaceFinalizersRemaining\"}": {
".": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
}
}
}
}
}
]
},
"spec": {},
"status": {
"phase": "Terminating",
"conditions": [
{
"type": "NamespaceDeletionDiscoveryFailure",
"status": "False",
"lastTransitionTime": "2022-03-10T12:02:54Z",
"reason": "ResourcesDiscovered",
"message": "All resources successfully discovered"
},
{
"type": "NamespaceDeletionGroupVersionParsingFailure",
"status": "False",
"lastTransitionTime": "2022-03-10T12:02:54Z",
"reason": "ParsedGroupVersions",
"message": "All legacy kube types successfully parsed"
},
{
"type": "NamespaceDeletionContentFailure",
"status": "False",
"lastTransitionTime": "2022-03-10T12:02:54Z",
"reason": "ContentDeleted",
"message": "All content successfully deleted, may be waiting on finalization"
},
{
"type": "NamespaceContentRemaining",
"status": "True",
"lastTransitionTime": "2022-03-10T12:02:54Z",
"reason": "SomeResourcesRemain",
"message": "Some resources are remaining: ingresses.extensions has 1 resource instances, ingresses.networking.k8s.io has 1 resource instances"
},
{
"type": "NamespaceFinalizersRemaining",
"status": "True",
"lastTransitionTime": "2022-03-10T12:02:54Z",
"reason": "SomeFinalizersRemain",
"message": "Some content in the namespace has finalizers remaining: ingress.k8s.aws/resources in 2 resource instances"
}
]
}
}
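
The manual edit of tempfile.json can also be scripted. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get ns test -o json`; `strip_for_finalize` is a hypothetical helper name, and the sample object is a shortened stand-in for the real one.

```python
import copy
import json


def strip_for_finalize(namespace):
    """Return a copy with spec.finalizers removed and status.conditions dropped."""
    body = copy.deepcopy(namespace)  # leave the caller's object untouched
    body.setdefault("spec", {}).pop("finalizers", None)
    body.get("status", {}).pop("conditions", None)
    return body


# Shortened stand-in for the namespace object shown earlier in this post.
ns = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {"name": "test"},
    "spec": {"finalizers": ["kubernetes"]},
    "status": {
        "conditions": [{"type": "NamespaceFinalizersRemaining", "status": "True"}],
        "phase": "Terminating",
    },
}

print(json.dumps(strip_for_finalize(ns), indent=2))
```

The printed JSON has the same shape as the tempfile.json above and could be saved and passed to the same `kubectl replace --raw` command. It only clears the namespace's own finalizer and does nothing about the resources still inside the namespace.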


Check the namespace again

$ kubectl get ns test
Error from server (NotFound): namespaces "test" not found

The namespace is gone.

Realizing something was wrong

The next day I recreated the namespace and re-applied the previous day's test manifest

$ kubectl apply -f ingress.yaml
namespace/test created
Warning: networking.k8s.io/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
Warning: Detected changes to resource test which is currently being deleted.
ingress.networking.k8s.io/test configured

I suddenly noticed a warning on the ingress I was creating:
Warning: Detected changes to resource test which is currently being deleted.

$ kubectl get ingress -A
NAMESPACE NAME CLASS HOSTS ADDRESS PORTS AGE
test test <none> * k8s-test-test-xxxxx.us-east-2.elb.amazonaws.com 80 30h

???

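Why had this ingress never been deleted?!

And applying ingress.yaml had not created a new ingress either.

Then I realized that this stuck ingress was most likely the reason the namespace could not be deleted in the first place.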
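
In hindsight, the namespace's status conditions had already named the culprit. As a rough sketch, the two "remaining" messages can be parsed into resource and finalizer counts. The regular expressions assume the exact message wording seen in this post, which may differ between Kubernetes versions, and both function names are hypothetical.

```python
import re


def remaining_resources(message):
    """Parse '<resource> has <n> resource instances' pairs from a NamespaceContentRemaining message."""
    pairs = re.findall(r"(\S+) has (\d+) resource instances", message)
    return {name: int(count) for name, count in pairs}


def remaining_finalizers(message):
    """Parse '<finalizer> in <n> resource instances' pairs from a NamespaceFinalizersRemaining message."""
    pairs = re.findall(r"(\S+) in (\d+) resource instances", message)
    return {name: int(count) for name, count in pairs}


# Messages copied from the namespace status shown earlier in this post.
content_msg = (
    "Some resources are remaining: ingresses.extensions has 1 resource instances, "
    "ingresses.networking.k8s.io has 1 resource instances"
)
finalizer_msg = (
    "Some content in the namespace has finalizers remaining: "
    "ingress.k8s.aws/resources in 2 resource instances"
)

print(remaining_resources(content_msg))
print(remaining_finalizers(finalizer_msg))
```

Both results point at the ingress and its ingress.k8s.aws/resources finalizer, which is where the investigation ends up.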

I first tried deleting the resources defined in the YAML file by hand

$ kubectl delete -f ingress.yaml
ingress.networking.k8s.io "test" deleted


^C

It hung, so I tried a forced delete

$ kubectl delete ingress test -n test --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
ingress.networking.k8s.io "test" force deleted

^C

I tried changing the ingress API version to see whether the version was the cause
Warning: extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress

Changing apiVersion to networking.k8s.io/v1 made no difference.

Checking the logs

# Check the state of the ingress again
$ kubectl describe ingress test -n test
Name: test
Namespace: test
Address: k8s-test-test-xxx.us-east-2.elb.amazonaws.com
Default backend: default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
Host Path Backends
---- ---- --------
* / test
Annotations: alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:xxxx
alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS": 443}]
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/ssl-redirect: 443
alb.ingress.kubernetes.io/subnets: xxxx
alb.ingress.kubernetes.io/target-type: ip
kubernetes.io/ingress.class: alb
Events: <none>

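The ingress still exists...

I then thought of checking the EKS apiserver logs, but control plane logging was not enabled, so I turned it on first:
Enabling Amazon EKS control plane logging

Then I ran the delete once more and looked at the apiserver log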

{
"kind": "Event",
"apiVersion": "audit.k8s.io/v1",
"level": "RequestResponse",
"auditID": "xxxx",
"stage": "ResponseComplete",
"requestURI": "/apis/networking.k8s.io/v1/namespaces/test/ingresses/test",
"verb": "delete",
"user": {
"username": "xxx",
"uid": "xxx",
"groups": [
"system:masters",
"system:authenticated"
]
},
"sourceIPs": [
"xxxx"
],
"userAgent": "kubectl/v1.20.4 (linux/amd64) kubernetes/6b74644",
"objectRef": {
"resource": "ingresses",
"namespace": "test",
"name": "test",
"apiGroup": "networking.k8s.io",
"apiVersion": "v1"
},
"responseStatus": {
"metadata": {},
"code": 200
},
"requestObject": {
"kind": "DeleteOptions",
"apiVersion": "networking.k8s.io/v1",
"propagationPolicy": "Background"
},
"responseObject": {
"kind": "Ingress",
"apiVersion": "networking.k8s.io/v1",
"metadata": {
"name": "test",
"namespace": "test",
"uid": "xxxx",
"resourceVersion": "68583106",
"generation": 3,
"creationTimestamp": "2022-03-11T03:19:00Z",
"deletionTimestamp": "2022-03-12T08:29:46Z",
"deletionGracePeriodSeconds": 0,
"finalizers": [
"ingress.k8s.aws/resources"
],
"managedFields": [
{
"manager": "kubectl-client-side-apply",
"operation": "Update",
"apiVersion": "networking.k8s.io/v1beta1",
"time": "2022-03-11T03:19:00Z",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:metadata": {
"f:annotations": {
".": {},
"f:alb.ingress.kubernetes.io/certificate-arn": {},
"f:alb.ingress.kubernetes.io/listen-ports": {},
"f:alb.ingress.kubernetes.io/scheme": {},
"f:alb.ingress.kubernetes.io/ssl-redirect": {},
"f:alb.ingress.kubernetes.io/subnets": {},
"f:alb.ingress.kubernetes.io/target-type": {},
"f:kubernetes.io/ingress.class": {}
}
},
"f:spec": {
"f:rules": {}
}
}
},
{
"manager": "controller",
"operation": "Update",
"apiVersion": "networking.k8s.io/v1beta1",
"time": "2022-03-11T03:19:03Z",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:metadata": {
"f:finalizers": {
".": {},
"v:\"ingress.k8s.aws/resources\"": {}
}
},
"f:status": {
"f:loadBalancer": {
"f:ingress": {}
}
}
}
},
{
"manager": "kubectl-client-side-apply",
"operation": "Update",
"apiVersion": "extensions/v1beta1",
"time": "2022-03-12T08:37:49Z",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:metadata": {
"f:annotations": {
"f:kubectl.kubernetes.io/last-applied-configuration": {}
}
}
}
}
]
},
"status": {
"loadBalancer": {
"ingress": [
{
"hostname": "k8s-test-test-xxx.us-east-2.elb.amazonaws.com"
}
]
}
}
},
"requestReceivedTimestamp": "2022-03-12T08:43:06.236374Z",
"stageTimestamp": "2022-03-12T08:43:06.256705Z",
"annotations": {
"authorization.k8s.io/decision": "allow",
"authorization.k8s.io/reason": ""
}
}

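The excerpt below shows that the apiserver did receive the delete request and answered with 200, yet the ingress was not removed from the cluster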

"responseStatus": {
"metadata": {},
"code": 200
}
"requestObject": {
"kind": "DeleteOptions",
"apiVersion": "networking.k8s.io/v1",
"propagationPolicy": "Background"
}

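Knowing how the apiserver handles a delete, the next component in the chain is the ingress controller: the apiserver only sets deletionTimestamp on the object, and the object stays until the controller that owns the ingress.k8s.aws/resources finalizer has cleaned up and removed that finalizer.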
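
The audit event shows this state directly: the ingress has a deletionTimestamp and still carries a finalizer. A minimal sketch of that check on a parsed Kubernetes object follows. `pending_finalizers` is a hypothetical helper, and the sample ingress is reduced to the metadata fields that matter here.

```python
def pending_finalizers(obj):
    """Return the finalizers still blocking removal, or [] if the object is not being deleted."""
    meta = obj.get("metadata", {})
    if "deletionTimestamp" not in meta:
        return []  # no delete has been requested yet
    return list(meta.get("finalizers", []))


# Reduced copy of the ingress from the audit event in this post.
ingress = {
    "kind": "Ingress",
    "apiVersion": "networking.k8s.io/v1",
    "metadata": {
        "name": "test",
        "namespace": "test",
        "deletionTimestamp": "2022-03-12T08:29:46Z",
        "finalizers": ["ingress.k8s.aws/resources"],
    },
}

print(pending_finalizers(ingress))
```

A non-empty result means the delete request was accepted but the object is still waiting on whoever owns those finalizers, here the ingress controller.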

So I looked at the ingress controller logs and found the problem

E0127 15:57:50.103408       1 leaderelection.go:320] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: etcdserver: leader changed
E0127 16:39:03.421513 1 leaderelection.go:320] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Unauthorized
{"level":"error","ts":1647076608.8784113,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"test","namespace":"test","error":"failed to delete securityGroup: timed out waiting for the condition"}

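So the security group could not be deleted. The controller timed out waiting for that deletion, and the ingress deletion timed out with it.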
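
The controller log mixes klog-style lines with JSON lines, and only the JSON ones carry the reconcile error. As a sketch, assuming the JSON lines have the fields seen in the excerpt above (level, namespace, name, error), they can be filtered as follows. `reconcile_errors` is a hypothetical helper name.

```python
import json


def reconcile_errors(lines):
    """Return (namespace, name, error) for every JSON log line whose level is "error"."""
    found = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # klog-style lines such as "E0127 ..." are not JSON
        if isinstance(entry, dict) and entry.get("level") == "error":
            found.append((entry.get("namespace"), entry.get("name"), entry.get("error")))
    return found


# Two of the log lines from this post: one klog-style, one JSON.
log_lines = [
    'E0127 16:39:03.421513 1 leaderelection.go:320] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Unauthorized',
    '{"level":"error","ts":1647076608.8784113,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"test","namespace":"test","error":"failed to delete securityGroup: timed out waiting for the condition"}',
]

for namespace, name, error in reconcile_errors(log_lines):
    print(f"{namespace}/{name}: {error}")
```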

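In the AWS console I then found that the security group created for this ingress was referenced by the cluster security group. That reference has to be removed before the ingress security group can be deleted.

Image link: xxxx

Solution

Remove the cluster security group's reference to the ingress security group, then delete the ingress security group, and finally delete the ingress.

Solved!

Follow-up

Deleting ingresses had worked before, so I did not know why it suddenly failed this time. Digging further, I found that alb-ingress-controller occasionally fails to delete security groups, and the container has to be restarted when that happens.

Afterwards I also upgraded alb-ingress-controller to the latest stable version.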

References

How do I troubleshoot namespaces in a terminated state in my Amazon EKS cluster?

Warning: Detected changes to resource object-store which is currently being deleted

Detect “kubectl apply” to resources currently being deleted #93363

Amazon EKS control plane logging

Can not get ingress controller running

Ingres resources are not getting deleted even though the alb ingress controller deployment is deleted