Note: this is something I learned at my second internet company after graduating from university.


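Background

Abnormal Prometheus alerts on K8S

Over the weekend an alert suddenly landed in our group chat: CPU usage at 100% on every server. Alarmed, I checked the servers right away and found them all healthy, yet Alertmanager had still fired. Let's take a look.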

Error log

level=error ts=2022-03-13T14:29:37.826Z caller=scrape.go:1098 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://172.31.29.72:9100/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000288: no space left on device"
level=error ts=2022-03-13T14:29:37.903Z caller=scrape.go:1098 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://172.31.42.63:9100/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000288: no space left on device"
level=error ts=2022-03-13T14:29:37.907Z caller=scrape.go:1098 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://172.31.33.131:8080/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000288: no space left on device"
level=error ts=2022-03-13T14:29:38.134Z caller=scrape.go:1098 component="scrape manager" scrape_pool=prometheus target=http://localhost:9090/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00000288: no space left on device"

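The disk is evidently full: every WAL write fails with "no space left on device". Querying the Prometheus API shows that all of the CPU alerts are in an abnormal state. I am not sure whether that is a bug, but either way the first step is to add disk space.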
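To see how widespread the failure is, the "no space left on device" errors can be tallied per scrape pool. This is only a minimal sketch that assumes the log format shown above; count_enospc_by_pool is a hypothetical helper, not something Prometheus ships.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus log lines on stdin and print, for each
# scrape_pool, how many "no space left on device" failures it logged.
count_enospc_by_pool() {
  grep 'no space left on device' \
    | sed -n 's/.*scrape_pool=\([^ ]*\).*/\1/p' \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

It could be fed live logs with something like `kubectl -n prometheus logs deploy/prometheus-server -c prometheus-server | count_enospc_by_pool`. The deployment and container names there are assumptions based on the chart's usual defaults, so adjust them to your release.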

Adding disk space

Check the storage class

kubectl get StorageClass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  244d

ALLOWVOLUMEEXPANSION is false.

PV expansion only becomes available once this value is changed to true.

Update the ALLOWVOLUMEEXPANSION value

kubectl patch sc gp2 -p '{"allowVolumeExpansion": true}'
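After patching, it may be worth confirming that the flag actually changed before touching the PVC. A minimal sketch, assuming kubectl is pointed at the same cluster; sc_allows_expansion and require_expansion are hypothetical helper names:

```shell
#!/bin/sh
# Print the allowVolumeExpansion field of a StorageClass (default: gp2).
# Prints nothing if the field is not set at all.
sc_allows_expansion() {
  kubectl get storageclass "${1:-gp2}" -o jsonpath='{.allowVolumeExpansion}'
}

# Hypothetical guard: succeed only when expansion is enabled.
require_expansion() {
  if [ "$(sc_allows_expansion "$1")" = "true" ]; then
    echo "StorageClass ${1:-gp2} allows volume expansion"
  else
    echo "StorageClass ${1:-gp2} does not allow volume expansion" >&2
    return 1
  fi
}
```

Running `require_expansion gp2` before the Helm upgrade gives an early, readable failure instead of a rejected PVC update.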

Increase the disk size

Raise the disk size from the default 8Gi to 50Gi.

Add the following to values.yaml:

server:
  persistentVolume:
    size: 50Gi
    storageClass: "gp2"

Check the change

$ helm diff upgrade  prometheus prometheus-community/prometheus     --namespace prometheus -f values.yaml
prometheus, prometheus-server, PersistentVolumeClaim (v1) has changed:
# Source: prometheus/templates/server/pvc.yaml
- storage: "8Gi"
+ storage: "50Gi"

Apply the upgrade

helm upgrade  prometheus prometheus-community/prometheus     --namespace prometheus -f values.yaml

Prometheus started up without problems, so the issue is resolved.
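Beyond watching the pod start, the new size can be read back from the PVC itself. The sketch below assumes the namespace and PVC name that appear in the helm diff output (prometheus and prometheus-server); pvc_capacity and check_pvc_resized are hypothetical helpers. The reported capacity may lag until the resize has fully completed.

```shell
#!/bin/sh
# Print the capacity currently reported in the PVC's status.
# $1 = PVC name (default: prometheus-server), $2 = namespace (default: prometheus)
pvc_capacity() {
  kubectl -n "${2:-prometheus}" get pvc "${1:-prometheus-server}" \
    -o jsonpath='{.status.capacity.storage}'
}

# Hypothetical check: compare the reported capacity with the expected value.
# $1 = expected size, $2 = PVC name (optional), $3 = namespace (optional)
check_pvc_resized() {
  want=$1
  got=$(pvc_capacity "$2" "$3")
  if [ "$got" = "$want" ]; then
    echo "PVC resized to $got"
  else
    echo "PVC reports $got, expected $want" >&2
    return 1
  fi
}
```

For this article's setup the call would be `check_pvc_resized 50Gi`.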

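Summary

As for why Prometheus metrics go wrong once the disk is full, I filed an issue on the Prometheus GitHub repository.

Errors encountered

StorageClass configuration not updated

If you try to expand the volume without first enabling allowVolumeExpansion on the StorageClass, you get an error saying the storage class does not support online expansion.

Wrong disk size format

At first I wrote 50G without giving it much thought, and found that 50G did not replace 8Gi the way I expected. Only later did I realize the unit suffix was the problem.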

$ helm diff upgrade  prometheus prometheus-community/prometheus     --namespace prometheus -f values.yaml
prometheus, prometheus-server, PersistentVolumeClaim (v1) has changed:
# Source: prometheus/templates/server/pvc.yaml
- storage: "8Gi"
+ storage: "50G"

This is not the form I wanted; the value should be 50Gi.

Strangely, helm diff accepts this spelling. Does it not check the format?
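One likely explanation is that Kubernetes quantities accept decimal suffixes (G = 10^9 bytes) as well as binary ones (Gi = 2^30 bytes), so 50G and 50Gi differ in size rather than in validity. A minimal sketch of the arithmetic follows; quantity_to_bytes is a hypothetical helper that handles only integer values and a handful of suffixes, far fewer than the real syntax allows.

```shell
#!/bin/sh
# Hypothetical helper: convert a Kubernetes-style quantity to bytes.
# Handles integers with M, G, T (powers of 10) and Mi, Gi, Ti (powers of 2).
quantity_to_bytes() {
  q=$1
  case $q in
    *Ti) echo $(( ${q%Ti} * 1024 * 1024 * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *T)  echo $(( ${q%T} * 1000 * 1000 * 1000 * 1000 )) ;;
    *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
    *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
    *)   echo "$q" ;;   # no recognized suffix: treat as plain bytes
  esac
}

quantity_to_bytes 8Gi    # 8589934592
quantity_to_bytes 50G    # 50000000000
quantity_to_bytes 50Gi   # 53687091200
```

By this arithmetic 50G requests roughly 46.6Gi: more than the original 8Gi, but less than the 50Gi that was intended.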

References

Resizing Persistent Volumes using Kubernetes

Expanding K8s PVs in EKS on AWS

AWS EKS: How to resize Persistent Volumes

Error While Resizing The Persistent Volume In Kubernetes