Basic Architecture
Prometheus, originally released by SoundCloud, is an open-source combination of monitoring, alerting, and a time-series database, written in Go.
Its basic principle is to periodically scrape the state of monitored components over HTTP; any component can be monitored simply by exposing a suitable HTTP endpoint, with no SDK or other integration work required. This makes it well suited to monitoring virtualized environments such as VMs, Docker, and Kubernetes.
The main Prometheus components are:
- Prometheus Server: periodically pulls metrics from statically configured targets or from targets found via service discovery (DNS, Consul, Kubernetes, Mesos, and so on).
- Exporter: reports data to the Prometheus server. Different exporters cover different systems, for example node-exporter for host metrics and the MySQL server exporter for MySQL.
- Pushgateway: besides pulling from exporters, Prometheus also lets services push data to a Pushgateway first; the server then pulls from the Pushgateway.
- Alertmanager: implements Prometheus's alerting.
- Web UI: usually provided by Grafana.
In practice the basic flow is:
each service exposes its monitoring data as metrics (for example via the exporters described below) --> the Prometheus server scrapes and stores the data on a schedule --> Grafana is configured to visualize the data, and alerting rules are configured to fire notifications.
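As a minimal sketch of the server-side scrape step in this flow, a static target can be declared in prometheus.yml roughly like this (the job name and address are placeholders, not values from this article):

```yaml
scrape_configs:
  - job_name: "my-service"               # placeholder job name
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["192.168.1.10:9100"]   # placeholder host:port
```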
Deploying the Prometheus Platform with Helm
Deploy kube-prometheus-stack using helm.
Helm chart address: (link)
GitHub address: (link)
First, install the helm CLI on the server; installation is widely documented online, so it is not repeated here. Installing Prometheus with helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack
Exporter
To collect metrics from a target, a collection component must first be installed for it; these components are called exporters. prometheus.io hosts many of them; see the official exporter list.
How do the collected metrics reach Prometheus?
An exporter exposes an HTTP endpoint, and Prometheus pulls from it: it periodically scrapes the monitored component's data over HTTP.
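What such an endpoint serves is Prometheus's plain-text exposition format. An illustrative (made-up) scrape response looks like:

```text
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 4.2
```

Each non-comment line is one sample; the # HELP and # TYPE lines describe the metric.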
Prometheus also supports a push mode: a service can push its data to a Pushgateway, and Prometheus then pulls from the Pushgateway.
Integrating the collection component into a Go application
The kratos framework
An example of wiring the Prometheus collection component into the kratos microservice framework (from the official kratos tutorial):
package main

import (
	"context"
	"fmt"
	"log"

	prom "github.com/go-kratos/kratos/contrib/metrics/prometheus/v2"
	"github.com/go-kratos/kratos/v2/middleware/metrics"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/go-kratos/examples/helloworld/helloworld"
	"github.com/go-kratos/kratos/v2"
	"github.com/go-kratos/kratos/v2/transport/grpc"
	"github.com/go-kratos/kratos/v2/transport/http"
	"github.com/prometheus/client_golang/prometheus"
)

// go build -ldflags "-X main.Version=x.y.z"
var (
	// Name is the name of the compiled software.
	Name = "metrics"
	// Version is the version of the compiled software.
	// Version = "v1.0.0"

	_metricSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "server",
		Subsystem: "requests",
		Name:      "duration_sec",
		Help:      "server requests duration(sec).",
		Buckets:   []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.250, 0.5, 1},
	}, []string{"kind", "operation"})

	_metricRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "client",
		Subsystem: "requests",
		Name:      "code_total",
		Help:      "The total number of processed requests",
	}, []string{"kind", "operation", "code", "reason"})
)

// server is used to implement helloworld.GreeterServer.
type server struct {
	helloworld.UnimplementedGreeterServer
}

// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *helloworld.HelloRequest) (*helloworld.HelloReply, error) {
	return &helloworld.HelloReply{Message: fmt.Sprintf("Hello %+v", in.Name)}, nil
}

func init() {
	prometheus.MustRegister(_metricSeconds, _metricRequests)
}

func main() {
	grpcSrv := grpc.NewServer(
		grpc.Address(":9000"),
		grpc.Middleware(
			metrics.Server(
				metrics.WithSeconds(prom.NewHistogram(_metricSeconds)),
				metrics.WithRequests(prom.NewCounter(_metricRequests)),
			),
		),
	)
	httpSrv := http.NewServer(
		http.Address(":8000"),
		http.Middleware(
			metrics.Server(
				metrics.WithSeconds(prom.NewHistogram(_metricSeconds)),
				metrics.WithRequests(prom.NewCounter(_metricRequests)),
			),
		),
	)
	httpSrv.Handle("/metrics", promhttp.Handler())

	s := &server{}
	helloworld.RegisterGreeterServer(grpcSrv, s)
	helloworld.RegisterGreeterHTTPServer(httpSrv, s)

	app := kratos.New(
		kratos.Name(Name),
		kratos.Server(
			httpSrv,
			grpcSrv,
		),
	)
	if err := app.Run(); err != nil {
		log.Fatal(err)
	}
}
This exposes an HTTP endpoint at http://127.0.0.1:8000/metrics, from which Prometheus can pull the monitoring data.
The Gin framework
An example of wiring the Prometheus collection component into the lightweight HTTP framework Gin:
package main

import (
	"strconv"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	handler = promhttp.Handler()

	_metricSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "server",
		Subsystem: "requests",
		Name:      "duration_sec",
		Help:      "server requests duration(sec).",
		Buckets:   []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.250, 0.5, 1},
	}, []string{"method", "path"})

	_metricRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "client",
		Subsystem: "requests",
		Name:      "code_total",
		Help:      "The total number of processed requests",
	}, []string{"method", "path", "code"})
)

func init() {
	prometheus.MustRegister(_metricSeconds, _metricRequests)
}

// HandlerMetrics wraps the promhttp handler as a gin handler.
func HandlerMetrics() func(c *gin.Context) {
	return func(c *gin.Context) {
		handler.ServeHTTP(c.Writer, c.Request)
	}
}

// WithProm is a middleware that records duration and request-count
// metrics for every request.
func WithProm() gin.HandlerFunc {
	return func(c *gin.Context) {
		var (
			method string
			path   string
			code   int
		)
		startTime := time.Now()
		method = c.Request.Method
		path = c.Request.URL.Path

		c.Next()

		code = c.Writer.Status()
		_metricSeconds.WithLabelValues(method, path).Observe(time.Since(startTime).Seconds())
		_metricRequests.WithLabelValues(method, path, strconv.Itoa(code)).Inc()
	}
}

func main() {
	r := gin.Default()
	r.Use(WithProm())
	r.GET("/ping", func(c *gin.Context) {
		c.JSON(200, gin.H{
			"message": "pong",
		})
	})
	r.GET("/metrics", HandlerMetrics())
	r.Run() // listen and serve on 0.0.0.0:8080
}
This exposes an HTTP endpoint at http://127.0.0.1:8080/metrics, from which Prometheus can pull the monitoring data.
Scraping data sources outside the cluster
Background: a kube-prometheus-stack was deployed with helm into an existing K8s cluster to monitor servers and services. The cluster's nodes, pods, and other components are already wired into Prometheus; what remains is to bring in application services deployed outside the K8s cluster.
Prometheus can scrape data outside the cluster in the following ways:
- ServiceMonitor
- Additional scrape configuration
Scraping an HTTP source with a ServiceMonitor
ServiceMonitor is a CRD that declares which service endpoints Prometheus should scrape and at what interval.
Monitoring a service outside the cluster through a ServiceMonitor requires configuring a Service, an Endpoints object, and the ServiceMonitor itself.
Suppose a backend service is already deployed at 192.168.1.100:8000 and exposes its metrics at /metrics. To hook it into Prometheus, enter on the command line:
$ touch external-application.yaml
$ vim external-application.yaml
Then paste in the following YAML:
---
apiVersion: v1
kind: Service
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    app.kubernetes.io/name: application-exporter
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9101
    protocol: TCP
    targetPort: 9101
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    app.kubernetes.io/name: application-exporter
subsets:
- addresses:
  - ip: 192.168.1.100   # external target
  ports:
  - name: metrics
    port: 8000
- addresses:
  - ip: 192.168.1.100   # external target 2
  ports:
  - name: metrics
    port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    release: prometheus
spec:
  selector:
    matchLabels:          # Service selector
      app: external-application-exporter
  namespaceSelector:      # namespace selector
    matchNames:
    - monitoring
  endpoints:
  - port: metrics         # port name as defined in the Service
    interval: 10s         # scrape interval; adjust to your needs
    path: /metrics        # defaults to /metrics
After saving the file, run:
kubectl apply -f external-application.yaml
Then open the Prometheus console and go to the Targets page; the new external-application-exporter now shows up:
Scraping an HTTPS source with additional scrape configuration
Besides HTTP services exposed by IP and port, I also run HTTPS services on other servers that are reached by domain name, and I want to bring those in the same way.
The first attempt was to modify the Endpoints object, but the official K8s docs show that Endpoints only accepts IPs and offers no way to specify the HTTPS protocol.
So let's try a different approach.
First approach
The official Prometheus documentation describes scrape configuration under the scrape_config keyword.
Our Prometheus was deployed via the kube-prometheus-stack helm chart, so check that chart's values.yaml for a matching setting:
$ cat values.yaml | grep -C 20 scrape_config
The comments in the output show that kube-prometheus-stack configures its scrape policy through additionalScrapeConfigs.
So write a config file and use it to upgrade the Prometheus release that helm already deployed:
$ touch prometheus.yaml
$ vim prometheus.yaml
Write the following content into it:
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: external-application-exporter-https
      scrape_interval: 10s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: https
      tls_config:
        insecure_skip_verify: true
      static_configs:
      - targets: ["www.baidu.com:443"]
Finally, upgrade the release:
$ helm upgrade -nmonitoring -f prometheus.yaml prometheus kube-prometheus-stack-40.0.0.tgz
The release is upgraded with prometheus.yaml; kube-prometheus-stack-40.0.0.tgz is the chart file I had already pulled locally with helm pull when first deploying Prometheus.
The newly added data source now appears under the Targets page of the Prometheus console.
This would be enough, except for one drawback: every new domain to monitor requires another helm release upgrade, which is not particularly convenient.
Second approach
Digging through the prometheus-operator source, its documentation describes hot-reloading of additional scrape configuration. In short: the scrape targets are driven by a Secret, and when the Secret's content changes, Prometheus's scrape configuration is reloaded without redeploying.
Step 1: create the prometheus-additional.yaml file
$ touch prometheus-additional.yaml
$ vim prometheus-additional.yaml
Contents of prometheus-additional.yaml:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets: ["www.baidu.com:443"]
Step 2: create the Secret
Generate the manifest for the Secret:
$ kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml > additional-scrape-configs.yaml
$ cat additional-scrape-configs.yaml
The generated additional-scrape-configs.yaml looks like this:
apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogZXh0ZXJuYWwtYXBwbGljYXRpb24tZXhwb3J0ZXItaHR0cHMKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiAxMHMKICBtZXRyaWNzX3BhdGg6IC9tZXRyaWNzCiAgc2NoZW1lOiBodHRwcwogIHRsc19jb25maWc6CiAgICBpbnNlY3VyZV9za2lwX3ZlcmlmeTogdHJ1ZQogIHN0YXRpY19jb25maWdzOgogICAgLSB0YXJnZXRzOiBbImNpYW10ZXN0LnNtb2EuY2M6NDQzIl0K
kind: Secret
metadata:
  creationTimestamp: null
  name: additional-scrape-configs
Decode the base64 payload to double-check its content:
$ echo "LSBqb2JfbmFtZTogZXh0ZXJuYWwtYXBwbGljYXRpb24tZXhwb3J0ZXItaHR0cHMKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiAxMHMKICBtZXRyaWNzX3BhdGg6IC9tZXRyaWNzCiAgc2NoZW1lOiBodHRwcwogIHRsc19jb25maWc6CiAgICBpbnNlY3VyZV9za2lwX3ZlcmlmeTogdHJ1ZQogIHN0YXRpY19jb25maWdzOgogICAgLSB0YXJnZXRzOiBbImNpYW10ZXN0LnNtb2EuY2M6NDQzIl0K" | base64 -d
which yields:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets: ["www.baidu.com:443"]
This confirms the config file was generated correctly, so create the Secret:
$ kubectl apply -f additional-scrape-configs.yaml -n monitoring
monitoring is the namespace Prometheus is deployed in; keep both in the same namespace.
Confirm the Secret exists:
$ kubectl get secret -n monitoring
Finally, modify the CRD
As the official docs put it: "Finally, reference this additional configuration in your prometheus.yaml CRD." In other words, modify the Prometheus object's configuration.
First find the Prometheus CRD object:
$ kubectl get prometheus -n monitoring
NAME                                    VERSION   REPLICAS   AGE
prometheus-kube-prometheus-prometheus   v2.38.0   1          2d18h
Then edit it:
$ kubectl edit prometheus prometheus-kube-prometheus-prometheus -n monitoring
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: prometheus
spec:
  ...
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  ...
Finally, check the effect in the Prometheus console:
The domain-based service is now being monitored. From now on, adding another domain to monitor only requires editing the Secret. Great!
Alerting
For alerting we use the prometheus + alertmanager combination. The main flow from an alert notification to handling the alert event is as follows:
Our business requirement is to be notified when a service goes down so it can be dealt with promptly. The alerting rule we need therefore collects each application's liveness; when a target is detected as down, the alert enters the pending state. Once it has stayed pending past a time threshold, it transitions to firing: the alert is triggered and submitted to alertmanager, which routes it according to its rules to the configured message receivers, such as WeCom, DingTalk, or email.
Concretely:
Step 1: Prometheus alert rules
Reference: kube-prometheus-stack alert configuration
Since I deployed kube-prometheus-stack with helm, to keep versions consistent I downloaded the chart kube-prometheus-stack-40.0.0.tgz in advance (helm pull prometheus-community/kube-prometheus-stack --version=40.0.0). After unpacking it, the following PrometheusRules entry points can be found in the chart's values.yaml:
## Deprecated way to provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRules: []
#  - name: my-rule-file
#    groups:
#      - name: my_group
#        rules:
#          - record: my_record
#            expr: 100 * my_record

## Provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRulesMap: {}
#  rule-name:
#    groups:
#    - name: my_group
#      rules:
#      - record: my_record
#        expr: 100 * my_record
Modify values.yaml:
## Deprecated way to provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRules: []
#  - name: my-rule-file
#    groups:
#      - name: my_group
#        rules:
#          - record: my_record
#            expr: 100 * my_record

## Provide custom recording or alerting rules to be deployed into the cluster.
##
additionalPrometheusRulesMap:
  rule-name:
    groups:
    - name: Instance
      rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
Then upgrade the helm release:
helm upgrade -nmonitoring prometheus --values=values.yaml ../kube-prometheus-stack-40.0.0.tgz
After the upgrade, check the result in the Prometheus console:
The alert rule is now configured. Per the rule, whenever any instance's up metric equals 0 (i.e. the target is down), the alert state becomes pending; if the target has still not recovered after 5 minutes, the state changes to firing and the alert notification is triggered.
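More rules can be added to the same additionalPrometheusRulesMap block. As an illustration only, a latency rule built on the server_requests_duration_sec histogram from the kratos example above might look like this (the latency-rules/Latency/HighRequestLatency names and the 0.5s threshold are invented here):

```yaml
additionalPrometheusRulesMap:
  latency-rules:
    groups:
    - name: Latency
      rules:
      - alert: HighRequestLatency
        # p99 request duration over the last 5 minutes, per operation
        expr: histogram_quantile(0.99, sum(rate(server_requests_duration_sec_bucket[5m])) by (le, operation)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.operation }}"
```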
Step 2: alertmanager notifications
Reference: kube-prometheus-stack Alertmanager configuration
After the Prometheus trigger collects alerts, they are sent to alertmanager for central management. alertmanager applies configured rules to dispatch the alert messages to different receivers.
In the chart's values.yaml, find the alertmanager.config entry point. alertmanager.config lets you supply the alertmanager configuration, so you can define your own custom receivers. The original configuration:
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:
  ...
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
        - receiver: 'null'
          matchers:
            - alertname =~ "InfoInhibitor|Watchdog"
    receivers:
      - name: 'null'
    templates:
      - '/etc/alertmanager/config/*.tmpl'
We change it to:
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:
  ...
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'wx-webhook'
    receivers:
      - name: 'wx-webhook'
        webhook_configs:
          - url: "http://wx-webhook:80/adapter/wx"
            send_resolved: true
    templates:
      - '/etc/alertmanager/config/*.tmpl'
The address in webhook_configs[0].url: "http://wx-webhook:80/adapter/wx" points at the WeCom group-bot webhook adapter that receives the alert messages; building it is covered in detail in the next step.
Then upgrade the helm release:
helm upgrade -nmonitoring prometheus --values=values.yaml ../kube-prometheus-stack-40.0.0.tgz
Once configured, shut down a service and check the result in the WeCom group:
Step 3: building the WeCom group-bot webhook
Reference: Prometheus alerting through a WeCom bot
Create a WeCom bot
In the group settings, open the group bot feature:
Then add a group bot and copy the new bot's Webhook address.
Write the Deployment manifest wx-webhook-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wx-webhook
  labels:
    app: wx-webhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wx-webhook
  template:
    metadata:
      labels:
        app: wx-webhook
    spec:
      containers:
      - name: wx-webhook
        image: guyongquan/webhook-adapter:latest
        imagePullPolicy: IfNotPresent
        args: ["--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxxxxxxxxxxx"]
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: wx-webhook
  labels:
    app: wx-webhook
spec:
  selector:
    app: wx-webhook
  ports:
  - name: wx-webhook
    port: 80
    protocol: TCP
    targetPort: 80
    nodePort: 30904
  type: NodePort
The value in args ("--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxxxxxxxxxxx") embeds the WeCom bot Webhook address created in the previous step.
Next, run:
$ kubectl apply -f wx-webhook-deployment.yaml -nmonitoring
$ kubectl get pod -n monitoring | grep wx-webhook
wx-webhook-78d4dc95fc-9nsjn   1/1   Running   0   26d
$ kubectl get service -n monitoring | grep wx-webhook
wx-webhook   NodePort   10.106.111.183   <none>   80:30904/TCP   27d
That completes the WeCom group-bot webhook.
Here WeCom is used as the alert receiver, but alertmanager supports other receivers as well; see: kube-prometheus monitoring and alerting in detail (email, DingTalk, WeChat, WeCom bot, in-house platforms).
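For reference when writing or debugging such an adapter: when the route fires, alertmanager POSTs a JSON body of roughly the following shape to the webhook URL (abbreviated; the field values here are made up):

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "wx-webhook",
  "groupLabels": { "instance": "192.168.1.100:8000" },
  "commonLabels": { "alertname": "InstanceDown", "severity": "page" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "InstanceDown", "instance": "192.168.1.100:8000" },
      "annotations": { "summary": "Instance 192.168.1.100:8000 down" },
      "startsAt": "2023-07-29T09:30:54.000Z"
    }
  ]
}
```

The adapter's job is simply to reshape this payload into the WeCom bot's message format.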
Problems encountered
- After updating the scrape-config Secret, no change showed up in the Prometheus console.
Restarting the pod prometheus-prometheus-kube-prometheus-prometheus-0 produced the error:
ts=2023-07-29T09:30:54.188Z caller=main.go:454 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" file=/etc/prometheus/config_out/prometheus.env.yaml err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"external-application-exporter-https\""
The cause: a mistake in the custom scrape configuration kept Prometheus from starting; scrape_timeout was greater than scrape_interval:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets: ["www.baidu.com:443"]
It had to be changed to:
- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets: ["www.baidu.com:443"]
References
- Getting started with Grafana & Prometheus
- Prometheus monitoring + Grafana + Alertmanager installation and usage (illustrated)
- Official Prometheus documentation
- Helm chart repository
- kube-prometheus project on GitHub
- Official kratos tutorial
- Official Kubernetes documentation
- prometheus-operator source code
- kube-prometheus-stack alert configuration
- kube-prometheus-stack Alertmanager configuration
- Prometheus alerting through a WeCom bot
- kube-prometheus monitoring and alerting in detail (email, DingTalk, WeChat, WeCom bot, in-house platforms)