Disclaimer: these are notes on what I learned at the second internet company I joined after graduating from university.
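Background
Checking Grafana today, I found that the EKS data source had been empty for the past two days, which pointed to a problem with Prometheus. The Prometheus logs showed that the disk was full, so I decided to enlarge the volume.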
Update the YAML
After editing values.yaml, run a diff to see what will change:
helm diff upgrade prometheus prometheus-community/prometheus -f values.yaml --namespace prometheus
- size: 50Gi
+ size: 100Gi
- retention: "365d"
+ retention: "90d"
Once the diff looks right, apply the upgrade:
helm upgrade prometheus prometheus-community/prometheus -f values.yaml --namespace prometheus
The new pod failed to start.
Error log
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:418 msg="Starting Prometheus" version="(version=2.26.0, branch=HEAD, revision=3cafc58827d1ebd1a67749f88be4218f0bab3d8d)"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:423 build_context="(go=go1.16.2, user=root@a67cafebe6d0, date=20210331-11:56:23)"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:424 host_details="(Linux 5.4.172-90.336.amzn2.x86_64 #1 SMP Wed Jan 19 23:08:01 UTC 2022 x86_64 prometheus-server-6886dcbd59-d4n59 (none))"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:425 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-05-01T02:45:33.563Z caller=main.go:426 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-05-01T02:45:33.567Z caller=web.go:540 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-05-01T02:45:33.567Z caller=main.go:795 msg="Starting TSDB ..."
level=info ts=2022-05-01T02:45:33.568Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646476579146 maxt=1646568000000 ulid=01FXG2SMAZTTAB0RQRGC640N8X
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646568000035 maxt=1646762400000 ulid=01FXNW69MWSR1BNW7JAPPMDVNP
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646762400008 maxt=1646956800000 ulid=01FXVNK0SQNW1E3YEG0P02T6D9
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646956800019 maxt=1647540000000 ulid=01FYD1SYWPZ1JFMZYNJ0XPY05K
level=info ts=2022-05-01T02:45:33.569Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1647540000022 maxt=1648123200000 ulid=01FYYDZSVF8Y78N17S7957HKKA
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648123200002 maxt=1648706400000 ulid=01FZFT5S68NTH3KDN10DZGEV48
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648706400033 maxt=1649289600000 ulid=01G016BNVY5V7GN748QZJXKKCS
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1649289600145 maxt=1649872800000 ulid=01G0JJHM1F0YJK7RYE201ADTTD
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1649872800016 maxt=1650456000000 ulid=01G13YR2QSXKJGRK6CDVNTSR2H
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1650456000016 maxt=1650650400000 ulid=01G19R2M70BD03JKNNDCK5XD9G
level=info ts=2022-05-01T02:45:33.570Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1650650400016 maxt=1650844800000 ulid=01G1FHF98ZFG21K3Y357Q165N2
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1650844800004 maxt=1651039200000 ulid=01G1NAW5A7KP79KXF3YMHNFZ8J
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651039200011 maxt=1651104000000 ulid=01G1Q8N4QSKTTG222X2EZSP821
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651104000006 maxt=1651125600000 ulid=01G1QX82G9SM7FQZNPAEA792WE
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651125600043 maxt=1651147200000 ulid=01G1RHV934VYCFQSKMF0NRS1H9
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651168800084 maxt=1651176000000 ulid=01G1RZJE99AD5T5VMYXQAS4KVJ
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651176000000 maxt=1651183200000 ulid=01G1YQR52163FPAHMV40YA5TN0
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651183200036 maxt=1651190400000 ulid=01G1YQWJVAM3QTD8N7AX1242EQ
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651190400089 maxt=1651197600000 ulid=01G1YQZ0M4TSWCFE099EYA7791
level=info ts=2022-05-01T02:45:33.571Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651147200059 maxt=1651168800000 ulid=01G1YQZ5THSEQAZMPDMN25HGAW
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:668 msg="Stopping scrape discovery manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:682 msg="Stopping notify discovery manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:704 msg="Stopping scrape manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:678 msg="Notify discovery manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:664 msg="Scrape discovery manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=manager.go:934 component="rule manager" msg="Stopping rule manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=manager.go:944 component="rule manager" msg="Rule manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:872 msg="Notifier manager stopped"
level=info ts=2022-05-01T02:45:33.572Z caller=main.go:698 msg="Scrape manager stopped"
level=error ts=2022-05-01T02:45:33.572Z caller=main.go:881 err="opening storage failed: lock DB directory: resource temporarily unavailable"
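Only the last of those 39 lines is an error, and it is easy to miss among the info-level output. As a rough sketch (the function name `error_messages` is my own, not part of any tool), the error text can be pulled out of logfmt-style Prometheus logs like this:

```shell
#!/bin/sh
# Read Prometheus logfmt lines on stdin and print only the err="..." value
# of error-level lines. error_messages is a hypothetical helper name.
error_messages() {
    grep 'level=error' | sed -n 's/.*err="\([^"]*\)".*/\1/p'
}

# Usage against a live cluster (pod name taken from the log above;
# --previous reads the logs of the container instance that crashed):
#   kubectl logs prometheus-server-6886dcbd59-d4n59 -n prometheus \
#       -c prometheus-server --previous | error_messages
```

This assumes the error value itself contains no double quotes, which holds for the message above.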
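After some searching, this turned out to be a Prometheus lock problem. With a rolling update, the new prometheus-server pod starts before the old one is removed; the old container still holds the lock on the data directory, so the lock is never released and the new instance cannot open its storage. The fix is to change the Deployment's update strategy.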
Add the strategy parameters to values.yaml
helm diff upgrade prometheus prometheus-community/prometheus -f values.yaml --namespace prometheus
+ strategy:
+   type: Recreate
+   rollingUpdate: null
After applying this change, Prometheus started successfully.
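To double-check that the strategy change is live and that old and new server pods no longer overlap, something like the following could be used. It is only a sketch: the Deployment name `prometheus-server` is inferred from the pod name in the log, and both function names are hypothetical.

```shell
#!/bin/sh
# Assumed names: namespace "prometheus", Deployment "prometheus-server"
# (inferred from the pod name prometheus-server-6886dcbd59-d4n59).

# Print the Deployment's update strategy: RollingUpdate or Recreate.
strategy_type() {
    kubectl get deployment prometheus-server -n prometheus \
        -o jsonpath='{.spec.strategy.type}'
}

# Read `kubectl get pods` output on stdin and count prometheus-server pods.
# A count above 1 during an upgrade means old and new pods overlap.
count_server_pods() {
    grep -c '^prometheus-server-'
}

# Usage against a live cluster:
#   strategy_type
#   kubectl get pods -n prometheus --no-headers | count_server_pods
```

With `Recreate`, the old pod is terminated before the new one is created, so the count should not exceed 1 and the data directory lock is free when the new instance starts.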
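Follow-up
The PV used by prometheus-server is an EKS gp2 volume. There is no monitoring or alerting on PV capacity yet, so as a stopgap a script runs once a day and posts the result to the alert group through a webhook.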
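# Check usage of the gp2 mount path, /data in my case. The path is stored in MOUNT_PATH rather than PATH, because PATH is the shell's command search path.
MOUNT_PATH=/data
kubectl exec -it $(kubectl get pods -n prometheus | grep server | awk '{print $1}') -n prometheus -c prometheus-server -- df -hP | grep "$MOUNT_PATH"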
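The daily script itself is not shown here, so the following is only a guess at what such a check might look like. The 80% threshold, the webhook URL, the JSON payload shape, and all function names are assumptions to be adapted to the chat tool in use.

```shell
#!/bin/sh
# Sketch of a daily PV usage check; every value below is a placeholder.
MOUNT_PATH="/data"                          # mount point from this post
THRESHOLD=80                                # hypothetical alert threshold, percent
WEBHOOK_URL="https://example.com/webhook"   # placeholder, not a real endpoint

# Read `df -P` output on stdin; print the usage percent (without "%")
# of the filesystem mounted exactly at $1.
usage_percent() {
    awk -v mp="$1" '$NF == mp { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage $1 is at or above threshold $2.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Post a plain text message to the webhook (payload shape is assumed).
send_alert() {
    curl -sS -X POST -H 'Content-Type: application/json' \
        -d "{\"text\": \"$1\"}" "$WEBHOOK_URL"
}

# Look up the server pod, read its disk usage, and alert if needed.
check_pv() {
    pod=$(kubectl get pods -n prometheus --no-headers |
        awk '/^prometheus-server-/ { print $1; exit }')
    used=$(kubectl exec "$pod" -n prometheus -c prometheus-server -- df -P |
        usage_percent "$MOUNT_PATH")
    if over_threshold "$used" "$THRESHOLD"; then
        send_alert "prometheus-server PV at ${used}% on ${MOUNT_PATH}"
    fi
}

# Call check_pv from cron or a Kubernetes CronJob; nothing runs on load.
```

Matching the mount point exactly on the last column avoids the false hits a plain `grep /data` would produce for paths such as `/data2`.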
References
Use Recreate instead of RollingUpdate deployment strategy for Prometheus #8
Prometheus upgrade deadlock