Note: this is what I learned after joining my second internet company following my university graduation.


Prerequisites

Kubernetes 1.16+
Helm 3+
NFS

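Install NFS

Prepare the disk that will hold the persisted Prometheus data

Pick one server to act as the NFS server. Every Kubernetes node that mounts the share also needs the NFS client package (nfs-utils) installed.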

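yum install -y nfs-utils
# create the two subdirectories that pv.yaml points at
mkdir -p /data/prom/alertmanager /data/prom/prometheus-server
vim /etc/exports
# add the following line to /etc/exports, then save:
/data/prom *(rw,no_root_squash)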

systemctl enable nfs
systemctl start nfs

Check that the export is active
showmount -e
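
The export above uses `*`, which lets any host mount the share. A minimal sketch of building a narrower entry is shown below; the client range 192.168.1.0/24 and the helper name `export_entry` are hypothetical, so substitute the subnet your Kubernetes nodes actually live in.

```shell
#!/bin/sh
# Build an /etc/exports entry that limits access to one client range.
# The CIDR used below is a made-up example, not taken from this setup.
export_entry() {
    dir="$1"      # exported directory, e.g. /data/prom
    clients="$2"  # allowed clients, e.g. 192.168.1.0/24
    printf '%s %s(rw,no_root_squash)\n' "$dir" "$clients"
}

export_entry /data/prom 192.168.1.0/24
# prints: /data/prom 192.168.1.0/24(rw,no_root_squash)

# To apply it on the NFS server (as root):
#   export_entry /data/prom 192.168.1.0/24 >> /etc/exports
#   exportfs -ra   # re-read /etc/exports without restarting the service
```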

Custom configuration files

.
├── prometheus.yaml
└── pv.yaml

prometheus.yaml

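The server is exposed through a NodePort because I run my own Grafana outside the cluster and it has to reach the Prometheus inside k8s. If Grafana is installed in k8s together with Prometheus, the NodePort is not needed.

retention defaults to 15 days, which I found too short, so I raised it to one year.

persistentVolume describes the PV to mount.

The following block works around a permissions problem that is covered later: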
securityContext:
    runAsUser: 0
    runAsNonRoot: false
    runAsGroup: 0
    fsGroup: 0

server:
  service:
    nodePort: 30003 
    type: NodePort
  retention: "365d"
  persistentVolume:
    storageClass: "prometheus-server"
  securityContext:
    runAsUser: 0
    runAsNonRoot: false
    runAsGroup: 0
    fsGroup: 0
alertmanager:
  persistentVolume:
    storageClass: "prometheus-alertmanager"
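
A one-year retention only helps if the volume can hold a year of data. The Prometheus documentation gives a rough sizing rule: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2). The sketch below applies that rule; the ingestion rate of 1000 samples per second and the function name `estimate_gib` are assumptions for illustration only.

```shell
#!/bin/sh
# Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample.
# Prints the result in whole GiB (rounded down).
estimate_gib() {
    days="$1"              # retention in days
    samples_per_sec="$2"   # ingestion rate (hypothetical value used below)
    bytes_per_sample="$3"  # 1 to 2 is typical according to the Prometheus docs
    bytes=$((days * 86400 * samples_per_sec * bytes_per_sample))
    echo $((bytes / 1024 / 1024 / 1024))
}

estimate_gib 365 1000 2
# prints: 58
```

On a running server, the real ingestion rate can be read with the query `rate(prometheus_tsdb_head_samples_appended_total[1h])`.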
    
    
pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-alertmanager
spec:
  capacity:
    storage: 2Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: prometheus-alertmanager
  nfs:
    path: /data/prom/alertmanager
    server: ${ip}

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-server
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: prometheus-server
  nfs:
    path: /data/prom/prometheus-server
    server: ${ip}
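
kubectl does not expand `${ip}`, so the placeholder has to be replaced with the NFS server address before the file is applied. A minimal sketch using sed follows; the address 192.168.1.10 and the helper name `render_pv` are hypothetical.

```shell
#!/bin/sh
# Replace the literal ${ip} placeholder on stdin with a real address.
render_pv() {
    nfs_ip="$1"
    # [$] matches a literal dollar sign; { and } are literal in basic regex
    sed 's/[$]{ip}/'"$nfs_ip"'/g'
}

printf '    server: ${ip}\n' | render_pv 192.168.1.10
# prints:     server: 192.168.1.10

# With a real file and cluster:
#   render_pv 192.168.1.10 < pv.yaml | kubectl apply -f -
```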

Install

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

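kubectl apply -f pv.yaml
# install into the same namespace that the upgrade command uses
kubectl create namespace prometheus
helm install prometheus -f prometheus.yaml prometheus-community/prometheus --namespace prometheus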

Upgrade

helm upgrade -i prometheus prometheus-community/prometheus -f prometheus.yaml --namespace prometheus
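
Once the release is installed or upgraded, the NodePort from prometheus.yaml can be used to check that the server answers, and the same URL is what an external Grafana would use as its data source. The sketch below only builds the URL; the node address 192.168.1.20 and the helper name `prom_url` are hypothetical.

```shell
#!/bin/sh
# Build the Prometheus URL exposed through the NodePort service.
prom_url() {
    node_ip="$1"             # address of any cluster node
    node_port="${2:-30003}"  # nodePort set in prometheus.yaml
    printf 'http://%s:%s\n' "$node_ip" "$node_port"
}

prom_url 192.168.1.20
# prints: http://192.168.1.20:30003

# With a reachable cluster, probe the readiness endpoint of Prometheus:
#   curl -fsS "$(prom_url 192.168.1.20)/-/ready"
```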

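Errors

node-exporter fails to start

Cause: port 9100 was already taken by a node-exporter running in Docker on the same host.

Fix: stop the Docker node-exporter, or change the node-exporter port in the YAML file.

I simply stopped the Docker node-exporter.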
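
To confirm that something is already listening on 9100 before touching the chart, the output of `ss -lnt` can be filtered by port. The sketch below is one way to do it; the helper name `listeners_on_port` and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Keep only the `ss -lnt` lines whose local address ends in the given port.
listeners_on_port() {
    port="$1"
    # column 4 of `ss -lnt` is "Local Address:Port"
    awk -v p=":${port}\$" '$4 ~ p'
}

# Sample input in the shape of `ss -lnt` output (not real data):
printf 'LISTEN 0 128 *:9100 *:*\nLISTEN 0 128 *:22 *:*\n' | listeners_on_port 9100
# prints: LISTEN 0 128 *:9100 *:*

# On the affected host:
#   ss -lnt | listeners_on_port 9100
#   docker ps --filter publish=9100   # shows a container publishing that port
```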

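prom-server fails to start

The chart enables persistentVolume for prom-server by default, so the server cannot start unless the disk has been prepared in advance (the alternative is to turn persistence off).

Fix: define the PersistentVolume and apply it before installing the chart.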
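
A quick way to see whether this is the problem is to look for claims that are not in the Bound state. The sketch below filters the output of `kubectl get pvc --no-headers`, whose first two columns are the claim name and its status; the helper name `unbound_pvcs` and the sample lines are hypothetical.

```shell
#!/bin/sh
# Print the names of claims whose STATUS column is anything but "Bound".
unbound_pvcs() {
    awk '$2 != "Bound" { print $1 }'
}

# Sample input in the shape of `kubectl get pvc --no-headers` (not real data):
printf 'prometheus-server Pending\nprometheus-alertmanager Bound\n' | unbound_pvcs
# prints: prometheus-server

# Against a real cluster:
#   kubectl get pvc --namespace prometheus --no-headers | unbound_pvcs
```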

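Insufficient permissions

Even with the disk mounted, the pod still failed to start

Container

prometheus-server CrashLoopBackOff

The prometheus-server log contained a single line:

prometheus-server Watching directory: "/etc/config" failed

I fixed it by following the Helm chart documentation.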

Change
securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
to
securityContext:
    runAsUser: 0
    runAsNonRoot: false
    runAsGroup: 0
    fsGroup: 0
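
A possible alternative to running as root, not tested here, is to keep the default securityContext and give the NFS directory to UID/GID 65534 on the NFS server. That would only help if the crash comes from the data directory not being writable by that user. The sketch below checks ownership; the helper name `owned_by` is hypothetical.

```shell
#!/bin/sh
# Succeed if the directory is owned by the given numeric UID.
owned_by() {
    uid="$1"
    dir="$2"
    [ "$(stat -c %u "$dir")" = "$uid" ]
}

# On the NFS server (as root), hand the data directory to UID/GID 65534:
#   chown -R 65534:65534 /data/prom/prometheus-server
#   owned_by 65534 /data/prom/prometheus-server && echo "ownership ok"
```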

References

prom-helm

pod has unbound immediate PersistentVolumeClaims #21820

prometheus-server CrashLoopBackOff #15742