Background:
Once a Kubernetes cluster is deployed, a reliable, stable, low-latency monitoring and alerting system is essential to keep the cluster running in good order. After repeated research and testing, we settled on a Prometheus + AlertManager + Grafana + PrometheusAlert stack, with failure alerts delivered to a DingTalk group and to email. If extra metrics are needed, a Pushgateway can be deployed so that jobs can push data for Prometheus to collect.
Deployment plan:
Prometheus + AlertManager + Grafana + PrometheusAlert + DingTalk (a Pushgateway can be deployed separately if needed)
Prerequisites:
The Kubernetes cluster is already deployed; for details see "Building a production single-master Kubernetes cluster with kubeadm".
Deployment
1. Prometheus Deployment
Prometheus is made up of several components, some of which are optional:
Prometheus Server: scrapes metrics and stores time-series data
exporter: exposes metrics for Prometheus to scrape
pushgateway: a gateway that short-lived jobs push their metrics to, for Prometheus to collect
alertmanager: the component that handles alerts
adhoc: used for ad-hoc data queries
Prometheus architecture diagram:
Prometheus scrapes metric data directly from targets, or indirectly via the intermediate Pushgateway for jobs that push their metrics. It stores all scraped data locally and evaluates rules over it to produce aggregated series or alerts; Grafana or other tools are used to visualize the data.
1.1 Create the namespace
kubectl create ns monitor
1.2 Create the Prometheus configuration file
# prometheus-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager:9093"]
    rule_files:
    # - "first.rules"
    # - "second.rules"
    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets: ["localhost:9090"]
      - job_name: "kubernetes-apiservers"
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels:
              [
                __meta_kubernetes_namespace,
                __meta_kubernetes_service_name,
                __meta_kubernetes_endpoint_port_name,
              ]
            action: keep
            regex: default;kubernetes;https
      - job_name: "kubernetes-nodes"
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: "(.*):10250"
            replacement: "${1}:9100"
            target_label: __address__
            action: replace
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'controller-manager'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: kube-system;kube-controller-manager;https
      - job_name: 'kube-scheduler'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: kube-system;kube-scheduler;https
      - job_name: 'etcd'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: http
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: kube-system;etcd;http
      - job_name: "etcd-https"
        metrics_path: "/metrics"
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /opt/categraf/pki/etcd/ca.crt
          cert_file: /opt/categraf/pki/etcd/client.crt
          key_file: /opt/categraf/pki/etcd/client.key
          insecure_skip_verify: true
        relabel_configs:
          - source_labels:
              [
                __meta_kubernetes_namespace,
                __meta_kubernetes_service_name,
                __meta_kubernetes_endpoint_port_name,
              ]
            action: keep
            regex: kube-system;etcd;https
#] kubectl apply -f prometheus-cm.yaml
configmap "prometheus-config" created
The global block controls the Prometheus Server's global configuration:
scrape_interval: how often Prometheus scrapes metric data; the default is 15s, and it can be overridden per job
evaluation_interval: how often Prometheus evaluates rules; rules are used to produce new time series or to generate alerts
rule_files: where the alerting rules live, so Prometheus can load them to generate new time series or alert information; no rules are configured yet at this point
scrape_configs: controls which resources Prometheus monitors
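Before applying the ConfigMap, the configuration can be sanity-checked offline with promtool, which ships with Prometheus. A minimal sketch, assuming prometheus.yml has been extracted from the ConfigMap into the current directory (promtool also resolves referenced file paths such as the service-account certificate, so expect errors for paths that only exist inside the Pod):
# validate the configuration syntax locally
promtool check config prometheus.yml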
1.3 Create the Prometheus Pod resource
# prometheus-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        runAsUser: 0 # run the container as the root user
      serviceAccountName: prometheus
      containers:
        # - image: prom/prometheus:v2.34.0
        - image: prom/prometheus:v2.44.0
          name: prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus" # TSDB data path
            - "--storage.tsdb.retention.time=24h"
            - "--web.enable-admin-api" # enables the admin HTTP API, including features such as deleting time series
            - "--web.enable-lifecycle" # enables hot reload: a request to localhost:9090/-/reload takes effect immediately
          ports:
            - containerPort: 9090
              name: http
          volumeMounts:
            - mountPath: "/etc/prometheus"
              name: config-volume
            - mountPath: "/prometheus"
              name: data
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
            limits:
              cpu: 100m
              memory: 512Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
        - configMap:
            name: prometheus-config
          name: config-volume
--storage.tsdb.path=/prometheus specifies the data directory.
Create the PVC resource object shown below. Note that it is a local PV with node affinity to the master node:
mkdir -p /data/k8s/localpv/prometheus
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local
  labels:
    app: prometheus
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage-prometheus
  local:
    path: /data/k8s/localpv/prometheus # directory on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - master
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: prometheus
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-storage-prometheus
---
# local-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-prometheus # the StorageClass name declared in the PV above
provisioner: kubernetes.io/no-provisioner # the PV is created manually, so no dynamic provisioning is needed
volumeBindingMode: WaitForFirstConsumer # delayed binding
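Because of WaitForFirstConsumer, the PVC stays Pending until a Pod that uses it is scheduled, so don't be alarmed by its initial status. A quick check after applying the manifests:
# the PVC remains Pending until the Prometheus Pod is scheduled onto the node
kubectl get pv pv-local
kubectl get pvc prometheus-data -n monitor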
Prometheus needs to access a number of Kubernetes resource objects, so the related RBAC must be configured. Here we use a ServiceAccount named prometheus:
# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitor
#] kubectl apply -f prometheus-rbac.yaml
serviceaccount "prometheus" created
clusterrole.rbac.authorization.k8s.io "prometheus" created
clusterrolebinding.rbac.authorization.k8s.io "prometheus" created
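To verify that the binding actually grants the permissions Prometheus needs, kubectl can evaluate access as the ServiceAccount. A quick sketch:
# each command should print "yes" once the ClusterRoleBinding is active
kubectl auth can-i list nodes --as=system:serviceaccount:monitor:prometheus
kubectl auth can-i watch endpoints --as=system:serviceaccount:monitor:prometheus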
1.4 Create Prometheus
kubectl apply -f prometheus-deploy.yaml
deployment.apps/prometheus created
#] kubectl get pods -n monitor
NAME                         READY   STATUS    RESTARTS   AGE
prometheus-df4f47d95-vksmc   1/1     Running   3          98s
1.5 Create the Service
After the Pod is up, we still need a Service object so that the Prometheus web UI can be reached from outside the cluster:
# prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
    - name: web
      port: 9090
      targetPort: http
#] kubectl apply -f prometheus-svc.yaml
service "prometheus" created
#] kubectl get svc -n monitor
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus NodePort 10.96.194.29 <none> 9090:30980/TCP 13h
Now the Prometheus web UI is reachable at http://<any-node-IP>:30980:
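Besides the web UI, Prometheus's HTTP API can confirm that the scrape targets discovered above are healthy. A minimal sketch against the NodePort from the output above:
# list the health of every active scrape target (expect "up" entries)
curl -s http://<any-node-IP>:30980/api/v1/targets | grep -o '"health":"[a-z]*"'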
2. AlertManager Deployment
2.1 Install AlertManager
The AlertManager configuration file:
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: monitor
data:
  config.yml: |-
    global:
      # how long alertmanager waits without receiving an alert before marking it resolved
      resolve_timeout: 5m
      # email delivery settings
      smtp_smarthost: 'smtp.qq.com:25'
      smtp_from: '257*******@qq.com'
      smtp_auth_username: '257*******@qq.com'
      smtp_auth_password: '<mailbox password>'
      smtp_hello: 'qq.com'
      smtp_require_tls: false
    # the root route that all alerts enter; it defines the dispatch policy
    route:
      # labels used to regroup incoming alerts; e.g. many alerts carrying
      # cluster=A and alertname=LatencyHigh would be aggregated into one group
      group_by: ['alertname', 'cluster']
      # after a new alert group is created, wait at least group_wait before the first
      # notification, so several alerts for the same group can be sent together
      group_wait: 30s
      # interval between notifications within the same group
      group_interval: 30s
      # once an alert has been sent successfully, wait repeat_interval before
      # resending it; tune this per alert type
      repeat_interval: 1h
      # default receiver: used when an alert matches no sub-route
      receiver: default
      # all attributes above are inherited by sub-routes and can be overridden per route
      routes:
        - receiver: email
          group_wait: 10s
          group_by: ['instance'] # group by instance
          match:
            team: node
    receivers:
      - name: 'default'
        email_configs:
          - to: '257*******@qq.com'
            send_resolved: true # also notify when an alert is resolved
      - name: 'email'
        email_configs:
          - to: '257*******@qq.com'
            send_resolved: true
kubectl apply -f alertmanager-config.yaml
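The routing tree can also be validated offline with amtool, which ships with Alertmanager. A sketch, assuming config.yml has been extracted from the ConfigMap locally:
# parse the config and print the discovered routes and receivers; fails on syntax errors
amtool check-config config.yml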
The AlertManager container itself can simply be managed by a Deployment; the corresponding YAML resource declaration is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor
  labels:
    app: alertmanager
spec:
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      volumes:
        - name: alertcfg
          configMap:
            name: alert-config
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.24.0
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/etc/alertmanager/config.yml"
          ports:
            - containerPort: 9093
              name: http
          volumeMounts:
            - mountPath: "/etc/alertmanager"
              name: alertcfg
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 100m
              memory: 256Mi
---
# alertmanager-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitor
  labels:
    app: alertmanager
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
    - name: web
      port: 9093
      targetPort: http
Once the AlertManager container is up, Prometheus needs to be configured with AlertManager's address so it can reach it. Add the following to the Prometheus ConfigMap resource manifest:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
then trigger a reload, as sketched below.
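Because the Deployment passes --web.enable-lifecycle, the running Prometheus reloads its configuration without a restart. A minimal sketch, run from a Pod inside the cluster (the Service DNS name resolves cluster-wide):
# the lifecycle endpoint only accepts POST (or PUT)
curl -X POST http://prometheus.monitor.svc:9090/-/reload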
Next, add the following alerting-rule configuration to the Prometheus configuration file:
rule_files:
  - /etc/prometheus/rules.yml
rule_files points at the alerting rules; we likewise ship rules.yml in the ConfigMap so that it is mounted under /etc/prometheus, for example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
      evaluation_interval: 30s # by default, rules are evaluated every minute
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager:9093"]
    rule_files:
      - /etc/prometheus/rules.yml
    ...... # the rest of the prometheus config is omitted
  rules.yml: |
    groups:
      - name: test-node-mem
        rules:
          - alert: NodeMemoryUsage
            expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 20
            for: 2m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High Memory usage detected"
              description: "{{$labels.instance}}: Memory usage is above 20% (current value is: {{ $value }})"
This defines an alerting rule named NodeMemoryUsage. An alerting rule consists of the following parts:
alert: the name of the alerting rule
expr: the PromQL query expression the rule evaluates
for: the pending duration; the alert fires only after the trigger condition has held for this long, and alerts raised during the wait stay in the pending state
labels: custom labels, letting the user attach an extra list of labels to the alert
annotations: another set of labels that are not part of the alert instance's identity; they typically carry extra information used when rendering the alert
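Rule files are easy to get subtly wrong, so it is worth linting them with promtool before updating the ConfigMap. A sketch, assuming rules.yml has been extracted locally:
# checks the group structure and parses every PromQL expression
promtool check rules rules.yml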
3. Grafana Deployment
Grafana is a visualization dashboard with polished charts and layouts, a full-featured metrics dashboard and graph editor, and support for Graphite, Zabbix, InfluxDB, Prometheus, OpenTSDB, Elasticsearch, and more as data sources. It is far more powerful and flexible than Prometheus's built-in graphing and has a rich plugin ecosystem.
3.1 Install Grafana
You could specify storageClassName: managed-nfs-storage here, which requires that StorageClass to be deployed in advance; a PVC declaration then provisions the PV automatically.
This article uses a local StorageClass instead, so create the path ahead of time:
mkdir -p /data/k8s/localpv
---
# grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-pvc
      securityContext:
        runAsUser: 0
      containers:
        - name: grafana
          # image: grafana/grafana:8.4.6
          image: grafana/grafana:10.0.1
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
              name: grafana
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: admin321
          readinessProbe:
            failureThreshold: 10
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 30
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 150m
              memory: 512Mi
            requests:
              cpu: 150m
              memory: 512Mi
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: storage
---
# grafana-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor
spec:
  type: NodePort
  ports:
    - port: 3000
  selector:
    app: grafana
---
# grafana-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  labels:
    app: grafana
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /data/k8s/localpv # directory on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - master
---
# grafana-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitor
  labels:
    app: grafana
spec:
  # storageClassName: managed-nfs-storage
  storageClassName: local-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
# local-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage # the StorageClass name declared in the PV above
provisioner: kubernetes.io/no-provisioner # the PV is created manually, so no dynamic provisioning is needed
volumeBindingMode: WaitForFirstConsumer # delayed binding
The environment variables GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD configure Grafana's admin user and password.
Grafana keeps dashboards, plugins, and other data under /var/lib/grafana, so that is the directory to declare a volume mount for if the data should be persisted.
Check that the Grafana Pod is running normally:
[root@master grafana]# kubectl get pods -n monitor -l app=grafana
NAME READY STATUS RESTARTS AGE
grafana-85794dc4d9-mhcj7 1/1 Running 0 7m12s
[root@master grafana]# kubectl logs -f grafana-85794dc4d9-mhcj7 -n monitor
...
logger=settings var="GF_SECURITY_ADMIN_USER=admin"
t=2019-12-13T06:35:08+0000 lvl=info msg="Config overridden from Environment variable"
......
t=2019-12-13T06:35:08+0000 lvl=info msg="Initializing Stream Manager"
t=2019-12-13T06:35:08+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=[::]:3000 protocol=http subUrl= socket=
[root@master grafana]# kubectl get svc -n monitor
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 10.98.74.79 <none> 3000:31197/TCP 8m26s
Open http://<any-node-IP>:31197 in a browser to reach Grafana and configure the data source:
Prometheus and Grafana both live in the same monitor namespace, so the data source address is http://prometheus:9090 (since they share a namespace, the bare Service name works).
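As an alternative to clicking through the UI, the data source can be provisioned from a file placed in Grafana's provisioning directory. A sketch, assuming it is mounted into the container under /etc/grafana/provisioning/datasources/ (the file name is arbitrary):
# prometheus-datasource.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090 # the Service name resolves within the monitor namespace
    isDefault: true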
4. PrometheusAlert Deployment
GitHub: https://github.com/feiyu563/PrometheusAlert
PrometheusAlert is an open-source alert-forwarding hub for operations. It accepts warning messages from the mainstream monitoring systems Prometheus and Zabbix, the log systems Graylog2 and Graylog3, the visualization system Grafana, SonarQube, Alibaba Cloud CloudMonitor, and any system that can call a WebHook, and forwards them to DingTalk, WeChat, email, Feishu, Tencent SMS and phone calls, Alibaba Cloud SMS and phone calls, Huawei SMS, Baidu Cloud SMS, Ronglian Cloud phone calls, Qimo SMS and voice, Telegram, Baidu Hi (Infoflow), and more.
PrometheusAlert can be deployed on-premises or on cloud platforms, and supports Windows, Linux, public, private, and hybrid clouds, containers, and Kubernetes. Pick the deployment method that fits your scenario or needs:
https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/base-install.md
4.1 Run in Kubernetes
This article runs it in Kubernetes.
Pull the image in advance:
docker pull feiyu563/prometheus-alert
[root@master ~]# docker images | grep prometheus-alert
feiyu563/prometheus-alert latest d68864d68c3e 19 months ago 38.9MB
# in Kubernetes, the upstream manifest can be applied directly (note: the default manifest does not mount the template database file db/PrometheusAlertDB.db; add a volume mount yourself so template data is not lost)
kubectl apply -n monitoring -f https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/example/kubernetes/PrometheusAlert-Deployment.yaml
# once started, open http://[YOUR-PrometheusAlert-URL]:8080 in a browser
# the default login account and password are configured in app.conf
[root@master prometheusalert]# kubectl logs prometheus-alert-center-7f76d88c98-fnjzz
pass!
table `prometheus_alert_d_b` already exists, skip
table `alert_record` already exists, skip
2023/08/14 10:07:46.483 [I] [proc.go:225] [main] 構(gòu)建的Go版本: go1.16.5
2023/08/14 10:07:46.483 [I] [proc.go:225] [main] 應(yīng)用當(dāng)前版本: v4.6.1
2023/08/14 10:07:46.483 [I] [proc.go:225] [main] 應(yīng)用當(dāng)前提交: 1bc0791a637b633257ce69de05d57b79ddd76f7c
2023/08/14 10:07:46.483 [I] [proc.go:225] [main] 應(yīng)用構(gòu)建時(shí)間: 2021-12-23T12:37:35+0000
2023/08/14 10:07:46.483 [I] [proc.go:225] [main] 應(yīng)用構(gòu)建用戶: root@c14786b5a1cd
2023/08/14 10:07:46.491 [I] [asm_amd64.s:1371] http server Running on http://0.0.0.0:8080
Change the Service type to NodePort:
[root@master prometheusalert]# kubectl edit svc prometheus-alert-center
service/prometheus-alert-center edited
[root@master prometheusalert]# kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 11d
default service/prometheus-alert-center NodePort 10.105.133.163 <none> 8080:32021/TCP 2m19s
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 11d
Then access http://<any-node-IP>:32021 from a browser.
Because the container image referenced on GitHub is outdated, you can also write your own Dockerfile that downloads the binary release, package it into an image, and deploy that to Kubernetes, as sketched below.
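A minimal sketch of such a Dockerfile, built around the v4.9 Linux release used in the next section; the base image and paths are assumptions, and the zip extracts to a linux/ directory containing the PrometheusAlert binary and app.conf:
# Dockerfile (illustrative)
FROM debian:bookworm-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends wget unzip ca-certificates \
    && wget -O /tmp/linux.zip https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip \
    && unzip /tmp/linux.zip -d /tmp \
    && cp -r /tmp/linux/. /app/ \
    && chmod +x /app/PrometheusAlert \
    && rm -rf /tmp/linux /tmp/linux.zip /var/lib/apt/lists/*
EXPOSE 8080
CMD ["./PrometheusAlert"]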
4.2 PrometheusAlert deployment (binary)
# open the PrometheusAlert releases page, pick the version you need, download it, unzip it, and enter the extracted directory
# e.g. the Linux build (https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip)
# wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip && unzip linux.zip && cd linux/
# run PrometheusAlert
./PrometheusAlert # to run it in the background: nohup ./PrometheusAlert &
# once started, open http://127.0.0.1:8080 in a browser
# the default login account and password are configured in app.conf
Notes:
1. Configure alert routing:
https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/web-router.md
2. Enable alert records in app.conf:
# whether to record alerts: 0 = off, 1 = on
AlertRecord=1
5. DingTalk Configuration
Enable a DingTalk robot
Open DingTalk, go into the target group, and choose Group Settings -> Group Assistant -> Add Robot -> Custom, as illustrated below:
Newer DingTalk versions added security settings; simply choose "custom keywords" under the security settings and set the keyword to Prometheus, or to the title value configured in app.conf, as illustrated below.
Copy the Webhook address shown there and fill it into the corresponding option in the PrometheusAlert configuration file app.conf.
PS: DingTalk robots now support @mentions; to use this you need the mobile number linked to the target user's DingTalk account, as illustrated below:
DingTalk currently supports only a subset of Markdown; the supported elements are as follows:
Headings
# Heading level 1
## Heading level 2
### Heading level 3
#### Heading level 4
##### Heading level 5
###### Heading level 6
Blockquotes
> A man who stands for nothing will fall for anything.
Bold and italic
**bold**
*italic*
Links
[this is a link](http://name.com)
Images

Unordered lists
- item1
- item2
Ordered lists
1. item1
2. item2
DingTalk-related configuration:
#---------------------↓ global settings -----------------------
# alert message title
title=PrometheusAlert
# DingTalk alerts: logo URL for alert messages
logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
# DingTalk alerts: logo URL for recovery messages
rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#---------------------↓ webhook -----------------------
# whether to enable the DingTalk alert channel (several channels can be enabled at once): 0 = off, 1 = on
open-dingding=1
# default DingTalk robot address
ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx
# whether to @ everyone: 0 = off, 1 = on
dd_isatall=1
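Before wiring Alertmanager to PrometheusAlert, the robot itself can be tested with a direct call to DingTalk's webhook API. A sketch, assuming the access_token above and a keyword filter of "Prometheus" (the message must contain the keyword or DingTalk rejects it):
curl -H 'Content-Type: application/json' \
     -d '{"msgtype": "text", "text": {"content": "Prometheus test alert"}}' \
     'https://oapi.dingtalk.com/robot/send?access_token=xxxxx'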
Taking Prometheus with a custom template as an example, the Alertmanager configuration for reference:
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10m
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      - url: 'http://[prometheusalert_url]:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=<DingTalk robot URL 1>,<DingTalk robot URL 2>&at=18888888888,18888888889'
To send an alert through several channels at once, for example email plus DingTalk, add continue: true to the matching entry under routes, as sketched below.
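A sketch of such a routing tree, reusing the email receiver from section 2 alongside the PrometheusAlert webhook receiver above (the receiver names are illustrative):
route:
  receiver: 'web.hook.prometheusalert'
  routes:
    - receiver: email
      match:
        team: node
      continue: true # keep matching sibling routes instead of stopping here
    - receiver: 'web.hook.prometheusalert'
      match:
        team: node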
This concludes the walkthrough of Kubernetes cluster monitoring and alerting with Prometheus + AlertManager + Grafana + PrometheusAlert + DingTalk.