Note: this is something I learned at the second internet company I joined after graduating from university.


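Background

Today, while checking Grafana, I found that the data source for EKS had been empty for the past two days. I suspected a problem with Prometheus, and its logs showed that the disk was full, so I decided to add disk space.

Modify the YAML

Run a diff to preview the change: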

helm diff upgrade  prometheus prometheus-community/prometheus   -f values.yaml --namespace prometheus 
- size: 50Gi
+ size: 100Gi
- retention: "365d"
+ retention: "90d"

After confirming the diff looks right, apply the upgrade:

helm upgrade prometheus prometheus-community/prometheus -f values.yaml --namespace prometheus

The new pod then failed to start.

Error log:

level=info ts=2022-05-01T02:45:33.563Z caller=main.go:418 msg="Starting Prometheus" version="(version=2.26.0, branch=HEAD, revision=3cafc58827d1ebd1a67749f88be4218f0bab3d8d)"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:423 build_context="(go=go1.16.2, user=root@a67cafebe6d0, date=20210331-11:56:23)"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:424 host_details="(Linux 5.4.172-90.336.amzn2.x86_64 #1 SMP Wed Jan 19 23:08:01 UTC 2022 x86_64 prometheus-server-6886dcbd59-d4n59 (none))"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:425 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:426 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-05-01T02:45:33.567Z caller=web.go:540 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-05-01T02:45:33.567Z caller=main.go:795 msg="Starting TSDB ..."
level=info ts=2022-05-01T02:45:33.568Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646476579146 maxt=1646568000000 ulid=01FXG2SMAZTTAB0RQRGC640N8X
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646568000035 maxt=1646762400000 ulid=01FXNW69MWSR1BNW7JAPPMDVNP
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646762400008 maxt=1646956800000 ulid=01FXVNK0SQNW1E3YEG0P02T6D9
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646956800019 maxt=1647540000000 ulid=01FYD1SYWPZ1JFMZYNJ0XPY05K
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1647540000022 maxt=1648123200000 ulid=01FYYDZSVF8Y78N17S7957HKKA
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648123200002 maxt=1648706400000 ulid=01FZFT5S68NTH3KDN10DZGEV48
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648706400033 maxt=1649289600000 ulid=01G016BNVY5V7GN748QZJXKKCS
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1649289600145 maxt=1649872800000 ulid=01G0JJHM1F0YJK7RYE201ADTTD
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1649872800016 maxt=1650456000000 ulid=01G13YR2QSXKJGRK6CDVNTSR2H
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1650456000016 maxt=1650650400000 ulid=01G19R2M70BD03JKNNDCK5XD9G
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1650650400016 maxt=1650844800000 ulid=01G1FHF98ZFG21K3Y357Q165N2
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1650844800004 maxt=1651039200000 ulid=01G1NAW5A7KP79KXF3YMHNFZ8J
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651039200011 maxt=1651104000000 ulid=01G1Q8N4QSKTTG222X2EZSP821
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651104000006 maxt=1651125600000 ulid=01G1QX82G9SM7FQZNPAEA792WE
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651125600043 maxt=1651147200000 ulid=01G1RHV934VYCFQSKMF0NRS1H9
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651168800084 maxt=1651176000000 ulid=01G1RZJE99AD5T5VMYXQAS4KVJ
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651176000000 maxt=1651183200000 ulid=01G1YQR52163FPAHMV40YA5TN0
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651183200036 maxt=1651190400000 ulid=01G1YQWJVAM3QTD8N7AX1242EQ
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651190400089 maxt=1651197600000 ulid=01G1YQZ0M4TSWCFE099EYA7791
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651147200059 maxt=1651168800000 ulid=01G1YQZ5THSEQAZMPDMN25HGAW
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:668 msg="Stopping scrape discovery manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:682 msg="Stopping notify discovery manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:704 msg="Stopping scrape manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:678 msg="Notify discovery manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:664 msg="Scrape discovery manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=manager.go:934 component="rule manager" msg="Stopping rule manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=manager.go:944 component="rule manager" msg="Rule manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:872 msg="Notifier manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:698 msg="Scrape manager stopped"
level=error ts=2022-05-01T02:45:33.572Z caller=main.go:881 err="opening storage failed: lock DB directory: resource temporarily unavailable"

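After some searching, I found that this is a Prometheus lock issue: during a rolling update the old prometheus-server container has not been deleted yet, so the lock on the data directory is not released and the new container cannot open the storage. The fix is to change the deployment's update strategy.

Add the following parameters to values.yaml: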
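To make the failure mode concrete, here is a minimal local sketch of two processes competing for one exclusive, non-blocking file lock. It is an analogy only, not Prometheus code: it assumes the data-directory lock behaves like an advisory `flock` lock, and the temporary file and the `sleep` holder are hypothetical stand-ins for the lock file and the old pod.

```shell
#!/usr/bin/env bash
# Illustration only: mimics two processes contending for one lock file.
lockfile="$(mktemp)"   # hypothetical stand-in for the lock file in the data directory

# "Old pod": take the exclusive lock and hold it for 3 seconds.
flock -n "$lockfile" sleep 3 &
holder=$!
sleep 1                # give the background holder time to acquire the lock

# "New pod": a non-blocking attempt fails while the old holder is still alive.
if flock -n "$lockfile" true; then second=acquired; else second=blocked; fi

# Once the old holder has exited (as it would under Recreate), the lock is free.
wait "$holder"
if flock -n "$lockfile" true; then after=acquired; else after=blocked; fi

echo "while old holder runs: $second"
echo "after old holder exits: $after"
rm -f "$lockfile"
```

The first attempt corresponds to a rolling update, where the new pod starts while the old one is still running; the second corresponds to stopping the old pod first.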

helm diff upgrade  prometheus prometheus-community/prometheus   -f values.yaml --namespace prometheus 
+ strategy:
+   type: Recreate
+   rollingUpdate: null

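Prometheus started successfully.

Follow-up

The PV used by prometheus-server is an EKS gp2 volume. There is no monitoring or alerting on PV capacity yet, so for now, as a stopgap, a script runs once a day and sends the result to the alert group via a webhook.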

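# Check usage of the mounted gp2 path; mine is /data.
# MOUNT_PATH is used instead of PATH, which is the shell's own command search path.
MOUNT_PATH=/data
kubectl exec $(kubectl get pods -n prometheus | grep server | awk '{print $1}') -n prometheus -c prometheus-server -- df -hP | grep "$MOUNT_PATH"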
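The daily check could be sketched roughly as below. This is only an outline under stated assumptions: the function names `usage_percent` and `should_alert`, the 80% threshold, and the sample `df` output are all made up for illustration, and the webhook call is left as a comment because the payload format depends on the chat tool.

```shell
#!/usr/bin/env bash
# Sketch of a daily disk check; names, threshold, and sample data are hypothetical.

# Read `df -hP` output on stdin and print the Use% value (without the % sign)
# for the given mount point.
usage_percent() {
  awk -v mp="$1" 'NR > 1 && $6 == mp { gsub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold.
should_alert() {
  [ "$1" -ge "$2" ]
}

# Made-up sample output, so the parsing can be tried without a cluster.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     99G   85G   14G  86% /data'

pct="$(printf '%s\n' "$sample" | usage_percent /data)"
if should_alert "$pct" 80; then
  echo "ALERT: /data is ${pct}% full"
  # In the real script, send the message to the alert group here, e.g. with
  # curl -X POST against the webhook URL (URL and payload format not shown).
fi
```

In real use, the `sample` text would be replaced by the output of the `kubectl exec ... -- df -hP` command, piped into `usage_percent`.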

References

Use Recreate instead of RollingUpdate deployment strategy for Prometheus #8

Prometheus upgrade deadlock