The Prometheus community offers three ways to deploy on Kubernetes:

  1. Deploy Prometheus-Operator and define the resources yourself (plain YAML; this article)
  2. Deploy a full, general-purpose monitoring stack via kube-prometheus
  3. Deploy via kube-prometheus-stack (Helm; a packaged form of option 2)

The previous post used the second option to stand up an integrated Prometheus monitoring stack quickly.

That approach is fast and comprehensive, but it leaves little room for customization, and the bundled rules and dashboards are a mixed bag.

This post and the ones that follow use the Prometheus-Operator approach instead, building a lean, fully customized monitoring stack of our own.

Deployment

Compatibility

Unlike kube-prometheus, Prometheus-Operator does not require tracking a version compatibility matrix.

Because it uses apiextensions.k8s.io/v1 CustomResourceDefinitions, prometheus-operator requires Kubernetes >= v1.16.0.
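This requirement can be gated in an install script. A minimal sketch, assuming `kubectl` and `jq` are available on the live cluster; `supports_v1_crds` is a helper name invented here:

```shell
# supports_v1_crds takes the server's minor version string (kubectl reports e.g.
# "28", or "28+" on some managed clusters) and checks that it is >= 16, the
# first release with apiextensions.k8s.io/v1 CustomResourceDefinitions.
supports_v1_crds() {
  minor=$(printf '%s' "$1" | tr -d '+')  # strip the "+" suffix some distributions add
  [ "${minor:-0}" -ge 16 ]
}

# On a live cluster you would feed it the real value:
#   supports_v1_crds "$(kubectl version -o json | jq -r .serverVersion.minor)"
supports_v1_crds "28" && echo "v1 CRDs supported"
```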

Deploying Prometheus-Operator

Official docs: https://prometheus-operator.dev/docs/getting-started/installation

Get the latest release

```shell
LATEST=$(curl -s https://api.github.com/repos/prometheus-operator/prometheus-operator/releases/latest | jq -cr .tag_name)
```
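If `jq` is not installed, the tag can be scraped with `sed` instead. A rougher sketch than the `jq` pipeline, but adequate for GitHub's release JSON; `get_tag` is a name made up here:

```shell
# get_tag extracts the "tag_name" value from GitHub's /releases/latest JSON.
get_tag() {
  printf '%s' "$1" | sed -n 's/.*"tag_name": *"\([^"]*\)".*/\1/p'
}

# Equivalent to the jq pipeline above:
#   LATEST=$(get_tag "$(curl -s https://api.github.com/repos/prometheus-operator/prometheus-operator/releases/latest)")
get_tag '{"tag_name": "v0.81.0"}'   # prints v0.81.0
```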

Download

```shell
curl -sLO https://github.com/prometheus-operator/prometheus-operator/releases/download/${LATEST}/bundle.yaml
# this saves bundle.yaml to the current directory
```

```shell
# create the namespace
kubectl create ns monitoring

# point the manifests at it
sed -i 's/namespace: default/namespace: monitoring/g' bundle.yaml
# verify with: grep -r 'namespace: monitoring' bundle.yaml

# install the operator
kubectl create -f bundle.yaml
```
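Before creating any custom resources, it is worth confirming that the operator registered its CRDs. A sketch for a live cluster; `required_crds_present` is a helper invented here, scanning a `kubectl get crd -o name` listing:

```shell
# required_crds_present returns success only if every core prometheus-operator
# CRD appears in the listing passed as its first argument.
required_crds_present() {
  for crd in prometheuses alertmanagers servicemonitors podmonitors prometheusrules; do
    printf '%s\n' "$1" | grep -q "${crd}\.monitoring\.coreos\.com" || return 1
  done
}

# On a live cluster:
#   required_crds_present "$(kubectl get crd -o name)" && echo "operator CRDs installed"
```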

Deploying Prometheus

rbac.yaml

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```

prometheus.yaml

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 3                      # number of Prometheus replicas; 3 gives a highly available deployment
  image: quay.io/prometheus/prometheus:v3.4.2  # Prometheus image to run
  version: v3.4.2                  # Prometheus version, used by the Operator
  serviceAccountName: prometheus   # ServiceAccount created in rbac.yaml
  ruleSelector:
    matchLabels:
      role: alert-rules            # select PrometheusRule resources labeled role=alert-rules
  scrapeInterval: 5s               # how often targets are scraped
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager           # name of the Alertmanager to send alerts to
      namespace: monitoring        # namespace the Alertmanager lives in
      port: web                    # port name; must match a port name on the Alertmanager Service
  enableFeatures: []               # Prometheus feature flags to enable (none by default)
  externalLabels: {}               # labels attached to every outgoing time series
  nodeSelector:
    kubernetes.io/os: linux        # schedule onto Linux nodes only
  podMonitorNamespaceSelector: {}  # namespaces searched for PodMonitors (empty = all)
  podMonitorSelector: {}           # which PodMonitors to select (empty = all)
  probeNamespaceSelector: {}       # namespaces searched for Probes (empty = all)
  probeSelector: {}                # which Probes to select (empty = all)
  ruleNamespaceSelector: {}        # namespaces searched for PrometheusRules (empty = all)
  scrapeConfigNamespaceSelector: {}  # namespaces searched for ScrapeConfigs (empty = all)
  scrapeConfigSelector: {}         # which ScrapeConfigs to select (empty = all)
  securityContext:
    fsGroup: 2000                  # filesystem group ID for the pod
    runAsNonRoot: true             # refuse to run as root
    runAsUser: 1000                # run as this user ID
  serviceMonitorNamespaceSelector: {}  # namespaces searched for ServiceMonitors (empty = all)
  serviceMonitorSelector: {}       # which ServiceMonitors to select (empty = all)
  resources:
    requests:
      memory: "400Mi"              # requested memory
      cpu: "500m"                  # requested CPU (500m = 0.5 core)
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: "nfs-storage"   # StorageClass to provision from
        accessModes: [ "ReadWriteOnce" ]  # single-node read-write
        resources:
          requests:
            storage: 5Gi           # requested volume size

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  - name: reloader-web
    port: 8080
    targetPort: reloader-web
  selector:
    prometheus: prometheus

---
# Prometheus's own metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: web
  - interval: 30s
    port: reloader-web
  selector:
    matchLabels:
      app: prometheus
```

Deploying Alertmanager

alertmanager.yaml

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  version: 0.28.1
  image: quay.io/prometheus/alertmanager:v0.28.1
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app: alertmanager
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 4m
      memory: 100Mi
  #alertmanagerConfiguration:  # global Alertmanager configuration
  #  name: alertroute
  alertmanagerConfigSelector:  # select AlertmanagerConfig resources by label
    matchLabels:
      alert: alert-config
  secrets: []
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: nfs-storage
        resources:
          requests:
            storage: 5Gi

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: alertmanager
  name: alertmanager
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9093
    targetPort: web
    nodePort: 30903
  - name: reloader-web
    port: 8080
    targetPort: reloader-web
    nodePort: 30980
  selector:
    app: alertmanager
  sessionAffinity: ClientIP
  type: NodePort

---
# Alertmanager's own metrics, scraped by Prometheus via this ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: web
  - interval: 30s
    port: reloader-web
  selector:
    matchLabels:
      app: alertmanager

---
# Alertmanager configuration
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  namespace: monitoring
  labels:
    alert: alert-config
spec:
  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'webhook'
  receivers:
  - name: 'webhook'
    webhookConfigs:
    - url: 'http://example.com/'
```
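To exercise the webhook route above, a test alert can be POSTed to Alertmanager's v2 API. A sketch against the NodePort service defined earlier; `alert_payload` is a throwaway helper invented here:

```shell
# alert_payload builds a minimal v2 API alert list with the given alertname.
alert_payload() {
  printf '[{"labels": {"alertname": "%s", "job": "test"}}]' "$1"
}

# On a live cluster (substitute a real node IP):
#   curl -XPOST -H 'Content-Type: application/json' \
#     -d "$(alert_payload TestAlert)" http://<nodeip>:30903/api/v2/alerts
alert_payload TestAlert
```

The alert should then show up in the Alertmanager UI and, after `groupWait` (30s above), be delivered to the webhook receiver.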

Check the pods

```shell
kubectl get pod -n monitoring

NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-alertmanager-0            2/2     Running   0          3m6s
prometheus-operator-7587858ff6-pcncq   1/1     Running   0          31m
prometheus-prometheus-0                2/2     Running   0          27m
prometheus-prometheus-1                2/2     Running   0          27m
prometheus-prometheus-2                2/2     Running   0          27m
```

Open http://nodeip:30900/targets

Verify that the service is reachable and that the Alertmanager metrics are already being scraped via the ServiceMonitor (covered in detail in the next post).
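The same check can be scripted against Prometheus' HTTP API. `count_up_targets` is a helper made up here; it counts active targets reporting health "up" in the JSON returned by `/api/v1/targets` (grep-based, so it needs no jq):

```shell
# count_up_targets counts "up" targets in a /api/v1/targets response body.
count_up_targets() {
  printf '%s' "$1" | grep -o '"health":"up"' | wc -l
}

# On a live cluster (substitute a real node IP):
#   count_up_targets "$(curl -s http://<nodeip>:30900/api/v1/targets)"
count_up_targets '{"data":{"activeTargets":[{"health":"up"},{"health":"up"},{"health":"down"}]}}'
```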