目錄
(一)Kubernetes監(jiān)控體系
1.Kubernetes監(jiān)控策略
(二)K8s-ApiServer組件監(jiān)控
(1)我們先創(chuàng)建一個namespace來專門做夜鶯監(jiān)控采集指標(biāo)
(2)創(chuàng)建認(rèn)證授權(quán)信息rbac
(3)使用prometheus-agent進(jìn)行指標(biāo)采集
① 創(chuàng)建Prometheus的配置文件
② 部署Prometheus Agent
(三)K8s-ControllerManager組件監(jiān)控
(1)創(chuàng)建prometheus的配置文件
(2)重新創(chuàng)建controller的endpoints
(3)更改controller的bind-address
(4)指標(biāo)測試
(四)K8s-Scheduler組件監(jiān)控
(1)創(chuàng)建prometheus的配置文件
(2)配置Scheduler的service
(3)重啟prometheus-agent
(4)測試指標(biāo)導(dǎo)入儀表盤
(五)K8s-Etcd組件監(jiān)控
(1)更改etcd配置文件監(jiān)聽地址為0.0.0.0
(2)數(shù)據(jù)采集
(3)指標(biāo)測試
(六)K8s-kubelet組件監(jiān)控
(1)配置Prometheus-agent configmap配置文件
(2)配置kubelet的service和endpoints
(3)測試指標(biāo)
(七)K8s-KubeProxy組件監(jiān)控
(1)配置Prometheus-agent configmap配置文件
(2)配置kube-proxy的endpoints
(3)更改kube-proxy的metricsBindAddress
(4)指標(biāo)測試
最后的最后
這一期我們講一下用夜鶯來監(jiān)控 k8s 的組件。因為 k8s 的組件復(fù)雜、內(nèi)容多,所以我們分成上下兩部分來學(xué)習(xí),這一期先學(xué)習(xí)監(jiān)控 k8s 的幾大組件。首先我們先來認(rèn)識一下 k8s 的架構(gòu)和監(jiān)控概述。
(一)Kubernetes監(jiān)控體系
當(dāng)我們談及 Kubernetes 監(jiān)控的時候,我們在談?wù)撌裁??顯然是 Kubernetes 架構(gòu)下各個部分的監(jiān)控:Kubernetes 所跑的環(huán)境、Kubernetes 本身、跑在 Kubernetes 上面的應(yīng)用等等。Kubernetes 所跑的環(huán)境可能是物理機(jī)、虛擬機(jī),并且依賴底層的基礎(chǔ)網(wǎng)絡(luò);Kubernetes 上面的應(yīng)用可能是業(yè)務(wù)應(yīng)用程序,也可能是各類中間件、數(shù)據(jù)庫;Kubernetes 本身則包含很多組件,我們通過一張 Kubernetes 架構(gòu)圖來說明。
最左側(cè)是 UI 層,包括頁面 UI 以及命令行工具 kubectl;中間部分是 Kubernetes 控制面組件;右側(cè)部分是工作負(fù)載節(jié)點,包含兩個工作負(fù)載節(jié)點。
k8s的這個架構(gòu)我們可以大致分為兩個模塊來理解:
1.Master組件
apiserver:是Kubernetes集群中所有組件之間通信的中心組件,也是集群的前端接口。kube-apiserver負(fù)責(zé)驗證和處理API請求,并將它們轉(zhuǎn)發(fā)給其他組件。
scheduler:Kubernetes Scheduler負(fù)責(zé)在Kubernetes集群中選擇最合適的Node來運行新創(chuàng)建的Pod,會考慮節(jié)點的資源利用率、Pod的調(diào)度限制、網(wǎng)絡(luò)位置等因素。
controller-manager:Kubernetes Controller Manager包含多個控制器,負(fù)責(zé)監(jiān)視并確保集群狀態(tài)符合預(yù)期。例如ReplicationController、NamespaceController、ServiceAccountController等等。
etcd:etcd是Kubernetes的后端數(shù)據(jù)庫,用于存儲和管理Kubernetes集群狀態(tài)信息,例如Pod、Service、ConfigMap等對象的配置和狀態(tài)信息。
2.Slave-node組件
kubelet:Kubelet是在每個Node上運行的代理服務(wù),負(fù)責(zé)管理和監(jiān)視該Node上的容器,并與kube-apiserver進(jìn)行通信以保持節(jié)點狀態(tài)最新。
kube-proxy:Kubernetes Proxy負(fù)責(zé)為容器提供網(wǎng)絡(luò)代理和負(fù)載均衡功能,使得容器可以訪問其他Pod、Service等網(wǎng)絡(luò)資源。
Container Runtime:如Docker、rkt、runc等,提供容器運行時環(huán)境。
1.Kubernetes監(jiān)控策略
Kubernetes作為開源的容器編排工具,為用戶提供了一個可以統(tǒng)一調(diào)度、統(tǒng)一管理的云操作系統(tǒng),解決了用戶應(yīng)用程序如何運行的問題。而一旦在生產(chǎn)環(huán)境中大量基于Kubernetes部署和管理應(yīng)用程序后,作為系統(tǒng)管理員,還需要充分了解應(yīng)用程序以及Kubernetes集群服務(wù)的運行質(zhì)量,通過對應(yīng)用以及集群運行狀態(tài)數(shù)據(jù)的收集和分析,持續(xù)優(yōu)化和改進(jìn),從而提供一個安全可靠的生產(chǎn)運行環(huán)境。這一小節(jié)中我們將討論使用Kubernetes時的監(jiān)控策略該如何設(shè)計。
從物理結(jié)構(gòu)上講,Kubernetes主要用于整合和管理底層的基礎(chǔ)設(shè)施資源,對外提供應(yīng)用容器的自動化部署和管理能力,這些基礎(chǔ)設(shè)施可能是物理機(jī)、虛擬機(jī)、云主機(jī)等等。因此,基礎(chǔ)資源的使用直接影響當(dāng)前集群的容量和應(yīng)用的狀態(tài)。在這部分,我們需要關(guān)注集群中各個節(jié)點的主機(jī)負(fù)載、CPU使用率、內(nèi)存使用率、存儲空間以及網(wǎng)絡(luò)吞吐等監(jiān)控指標(biāo)。
從自身架構(gòu)上講,kube-apiserver是Kubernetes提供所有服務(wù)的入口,無論是外部的客戶端還是集群內(nèi)部的組件都直接與kube-apiserver進(jìn)行通訊。因此,kube-apiserver的并發(fā)和吞吐量直接決定了集群性能的好壞。其次,對于外部用戶而言,Kubernetes是否能夠快速地完成Pod的調(diào)度以及啟動,是影響其使用體驗的關(guān)鍵因素:這個過程主要由kube-scheduler負(fù)責(zé)完成調(diào)度工作,由kubelet完成Pod的創(chuàng)建和啟動工作。因此對Kubernetes集群本身,我們需要評價其自身的服務(wù)質(zhì)量,主要關(guān)注Kubernetes的API響應(yīng)時間,以及Pod的啟動時間等指標(biāo)。
Kubernetes的最終目標(biāo)還是需要為業(yè)務(wù)服務(wù),因此我們還需要能夠監(jiān)控應(yīng)用容器的資源使用情況。對于內(nèi)置了Prometheus支持的應(yīng)用程序,也要支持從這些應(yīng)用程序中采集內(nèi)部的監(jiān)控指標(biāo)。最后,結(jié)合黑盒監(jiān)控模式,對集群中部署的服務(wù)進(jìn)行探測,從而在應(yīng)用發(fā)生故障后,能夠快速處理和恢復(fù)。
綜上所述,我們需要綜合使用白盒監(jiān)控和黑盒監(jiān)控模式,建立覆蓋基礎(chǔ)設(shè)施、Kubernetes核心組件、應(yīng)用容器等的全面監(jiān)控體系。
在白盒監(jiān)控層面我們需要關(guān)注:
- 基礎(chǔ)設(shè)施層(Node):為整個集群和應(yīng)用提供運行時資源,需要通過各節(jié)點的kubelet獲取節(jié)點的基本狀態(tài),同時通過在節(jié)點上部署Node Exporter獲取節(jié)點的資源使用情況;
- 容器基礎(chǔ)設(shè)施(Container):為應(yīng)用提供運行時環(huán)境,Kubelet內(nèi)置了對cAdvisor的支持,用戶可以直接通過Kubelet組件獲取給節(jié)點上容器相關(guān)監(jiān)控指標(biāo);
- 用戶應(yīng)用(Pod):Pod中會包含一組容器,它們一起工作,并且對外提供一個(或者一組)功能。如果用戶部署的應(yīng)用程序內(nèi)置了對Prometheus的支持,那么我們還應(yīng)該采集這些Pod暴露的監(jiān)控指標(biāo);
- Kubernetes組件:獲取并監(jiān)控Kubernetes核心組件的運行狀態(tài),確保平臺自身的穩(wěn)定運行。
而在黑盒監(jiān)控層面,則主要需要關(guān)注以下:
- 內(nèi)部服務(wù)負(fù)載均衡(Service):在集群內(nèi),通過Service在集群暴露應(yīng)用功能,集群內(nèi)應(yīng)用和應(yīng)用之間訪問時提供內(nèi)部的負(fù)載均衡。通過Blackbox Exporter探測Service的可用性,確保當(dāng)Service不可用時能夠快速得到告警通知;
- 外部訪問入口(Ingress):通過Ingress提供集群外的訪問入口,從而可以使外部客戶端能夠訪問到部署在Kubernetes集群內(nèi)的服務(wù)。因此也需要通過Blackbox Exporter對Ingress的可用性進(jìn)行探測,確保外部用戶能夠正常訪問集群內(nèi)的功能;
說了這么多,大家肯定對k8s的監(jiān)控有了一點初步的了解,那我們接下來趁熱打鐵,直接上實踐,用夜鶯來監(jiān)控k8s的六大組件。
(二)K8s-ApiServer組件監(jiān)控
ApiServer 是 Kubernetes 架構(gòu)中的核心,是所有 API 的入口,它串聯(lián)所有的系統(tǒng)組件。
為了方便監(jiān)控管理 ApiServer,設(shè)計者們?yōu)樗┞读艘幌盗械闹笜?biāo)數(shù)據(jù)。當(dāng)你部署完集群,默認(rèn)會在default名稱空間下創(chuàng)建一個名叫kubernetes的 service,它就是 ApiServer 的地址;當(dāng)然也可以用 ss -tlnp 查看本機(jī)暴露的 apiserver 端口。
[root@k8s-master ~]# kubectl get service -A | grep kubernetes
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 52d
[root@k8s-master ~]# ss -tlpn | grep apiserver
LISTEN 0 128 [::]:6443 [::]:* users:(("kube-apiserver",pid=2287,fd=7))
但是當(dāng)我們想要去抓取 metrics 數(shù)據(jù)的時候,會發(fā)現(xiàn)抓取不了,因為沒有權(quán)限(證書/Token):
[root@k8s-master ~]# curl -s -k https://localhost:6443/metrics
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
"reason": "Forbidden",
"details": {},
"code": 403
}[root@k8s-master ~]#
所以,要監(jiān)控 ApiServer,采集到對應(yīng)的指標(biāo),就需要先授權(quán)。為此,我們先準(zhǔn)備認(rèn)證信息。
(1)我們先創(chuàng)建一個namespace來專門做夜鶯監(jiān)控采集指標(biāo)
[root@k8s-master ~]# kubectl create namespace flashcat
(2)創(chuàng)建認(rèn)證授權(quán)信息rbac
這個yaml文件的意思是:我們創(chuàng)建一個名為categraf的ServiceAccount,然后通過ClusterRole和ClusterRoleBinding給它綁定對應(yīng)resources的verbs權(quán)限,讓categraf這個賬號有足夠的權(quán)限來采集k8s各個組件的指標(biāo)。
vim apiserver-auth.yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: categraf
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- nodes/stats
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: categraf
namespace: flashcat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: categraf
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: categraf
subjects:
- kind: ServiceAccount
name: categraf
namespace: flashcat
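寫好之后先 apply 一下,然后可以簡單驗證這個 ServiceAccount 的 Token 是否真的能訪問 /metrics。下面是一個驗證思路的示例(僅供參考:kubectl create token 需要 k8s 1.24+,低版本要從對應(yīng)的 Secret 里取 token):
kubectl apply -f apiserver-auth.yaml
## 1.24+ 可以直接給 ServiceAccount 簽發(fā)一個臨時 token
TOKEN=$(kubectl create token categraf -n flashcat)
## 帶上 token 再訪問 /metrics,就不會再報 403 了
curl -s -k -H "Authorization: Bearer $TOKEN" https://localhost:6443/metrics | head -n 5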
(3)使用prometheus-agent進(jìn)行指標(biāo)采集
支持 Kubernetes 服務(wù)發(fā)現(xiàn)的 agent 有不少,但是要說最原汁原味的還是 Prometheus 自身。Prometheus 從 v2.32.0 開始支持 agent mode,即把 Prometheus 進(jìn)程當(dāng)做采集器 agent,采集了數(shù)據(jù)之后通過 remote write 方式傳給中心(這里使用早就準(zhǔn)備好的 Nightingale 作為數(shù)據(jù)接收服務(wù)端)。那這里我就使用 Prometheus 的 agent mode 方式來采集 APIServer。
① 創(chuàng)建Prometheus的配置文件
這里給大家解釋一下這個配置文件的一些內(nèi)容,給一些對Prometheus還不是很了解的小伙伴參考:
global:第一部分定義的名為global的模塊
scrape_interval:采集間隔
evaluation_interval:評估間隔,用于控制數(shù)據(jù)的收集和處理頻率
scrape_configs:第二部分定義的模塊,用來配置Prometheus要監(jiān)控的目標(biāo)
job_name:表示該配置是用于監(jiān)控Kubernetes APIServer的
kubernetes_sd_configs:指定了從Kubernetes Service Discovery中獲取目標(biāo)對象的方式,此處使用 role: endpoints 獲取endpoints對象,也就是APIServer的IP地址和端口信息
scheme:指定了網(wǎng)絡(luò)通信協(xié)議是HTTPS
tls_config:指定了TLS證書的相關(guān)配置,包括是否驗證服務(wù)器端證書等
insecure_skip_verify:是一個bool類型的參數(shù),如果為true,表示跳過對服務(wù)器端證書的驗證。在生產(chǎn)環(huán)境中不應(yīng)該這樣用,因為會導(dǎo)致通信不安全;正常情況下,我們需要在客戶端上配置ca證書來驗證服務(wù)器端證書的合法性
authorization:指定了認(rèn)證信息的來源,這里使用了Pod里默認(rèn)掛載的Kubernetes ServiceAccount的Token
relabel_configs:用于對原始標(biāo)簽進(jìn)行變換,篩選出需要的目標(biāo)數(shù)據(jù)
source_labels:定義了三個用來匹配的標(biāo)簽,其中__meta_kubernetes_namespace表示Kubernetes命名空間,__meta_kubernetes_service_name表示服務(wù)名稱,__meta_kubernetes_endpoint_port_name表示端口名稱
action:指定該操作是keep,也就是只保留符合正則表達(dá)式的目標(biāo)
regex:用來對標(biāo)簽進(jìn)行過濾的正則表達(dá)式,這里是default;kubernetes;https,表示要保留的目標(biāo)是default命名空間下的kubernetes服務(wù),并且端口名是https
通過這個relabel_configs塊,Prometheus只保留default命名空間下kubernetes服務(wù)、端口名為https的采集目標(biāo),并把采集到的數(shù)據(jù)推送給后續(xù)的n9e夜鶯
remote_write:用于將Prometheus采集的數(shù)據(jù)寫入外部存儲,這里我們填的是夜鶯的地址,/prometheus/v1/write是外部存儲的接口路徑
vim prometheus-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
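在把這個 ConfigMap 應(yīng)用到集群之前,可以先把 data 里的 prometheus.yml 內(nèi)容單獨存成一個文件,用 promtool 做一次語法檢查(這里假設(shè)本地已經(jīng)安裝了 promtool;配置里引用的本地文件路徑不存在時檢查可能會有告警,僅作參考):
## 先本地校驗配置語法,避免 agent 加載配置時才發(fā)現(xiàn)寫錯
promtool check config prometheus.yml
## 校驗沒問題后再把 ConfigMap 應(yīng)用到集群
kubectl apply -f prometheus-cm.yaml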
② 部署Prometheus Agent
這里我們使用deployment的方式部署,其中--enable-feature=agent表示啟動的是 agent 模式。
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-agent
namespace: flashcat
labels:
app: prometheus-agent
spec:
replicas: 1
selector:
matchLabels:
app: prometheus-agent
template:
metadata:
labels:
app: prometheus-agent
spec:
serviceAccountName: categraf
containers:
- name: prometheus
image: prom/prometheus
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--web.enable-lifecycle"
- "--enable-feature=agent"
ports:
- containerPort: 9090
resources:
requests:
cpu: 500m
memory: 500M
limits:
cpu: 1
memory: 1Gi
volumeMounts:
- name: prometheus-config-volume
mountPath: /etc/prometheus/
- name: prometheus-storage-volume
mountPath: /prometheus/
volumes:
- name: prometheus-config-volume
configMap:
defaultMode: 420
name: prometheus-agent-conf
- name: prometheus-storage-volume
emptyDir: {}
查看是否部署成功
[root@k8s-master ~]# kubectl get pod -n flashcat
NAME READY STATUS RESTARTS AGE
prometheus-agent-7c8d7bc7bb-42djw 1/1 Running 0 115m
然后可以到夜鶯web頁面查看指標(biāo),測試指標(biāo):apiserver_request_total
獲取到了指標(biāo)數(shù)據(jù),后面就是合理利用指標(biāo)做其他動作,比如構(gòu)建面板、告警處理等。
導(dǎo)入Apiserver的監(jiān)控大盤,監(jiān)控的json文件在categraf/apiserver-dash.json · GitHub
直接復(fù)制導(dǎo)入json文件的內(nèi)容即可
另外,Apiserver 的關(guān)鍵指標(biāo)的含義也貼出來
# HELP apiserver_request_duration_seconds [STABLE] Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
# TYPE apiserver_request_duration_seconds histogram
apiserver響應(yīng)的時間分布,按照url 和 verb 分類
一般按照instance和verb+時間 匯聚
# HELP apiserver_request_total [STABLE] Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
# TYPE apiserver_request_total counter
apiserver的請求總數(shù),按照verb、 version、 group、resource、scope、component、 http返回碼分類統(tǒng)計
# HELP apiserver_current_inflight_requests [STABLE] Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
# TYPE apiserver_current_inflight_requests gauge
最大并發(fā)請求數(shù), 按mutating(非get list watch的請求)和readOnly(get list watch)分別限制
超過max-requests-inflight(默認(rèn)值400)和max-mutating-requests-inflight(默認(rèn)200)的請求會被限流
apiserver變更時要注意觀察,也是反饋集群容量的一個重要指標(biāo)
# HELP apiserver_response_sizes [STABLE] Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
# TYPE apiserver_response_sizes histogram
apiserver 響應(yīng)大小,單位byte, 按照verb、 version、 group、resource、scope、component分類統(tǒng)計
# HELP watch_cache_capacity [ALPHA] Total capacity of watch cache broken by resource type.
# TYPE watch_cache_capacity gauge
按照資源類型統(tǒng)計的watch緩存大小
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
每秒鐘用戶態(tài)和系統(tǒng)態(tài)cpu消耗時間, 計算apiserver進(jìn)程的cpu的使用率
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
apiserver的內(nèi)存使用量(單位:Byte)
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
apiserver中包含的controller的工作隊列,已處理的任務(wù)總數(shù)
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
apiserver中包含的controller的工作隊列深度,表示當(dāng)前隊列中要處理的任務(wù)的數(shù)量,數(shù)值越小越好
例如APIServiceRegistrationController admission_quota_controller
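結(jié)合上面這些關(guān)鍵指標(biāo),這里再給出幾個可以直接在夜鶯即時查詢里試用的 PromQL 示例(僅供參考,標(biāo)簽名以實際采集到的數(shù)據(jù)為準(zhǔn)):
## apiserver 每秒請求數(shù),按 verb 和返回碼匯聚
sum(rate(apiserver_request_total[5m])) by (verb, code)
## 請求響應(yīng)延遲的 P99,按 verb 匯聚
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, le))
## 當(dāng)前并發(fā)請求數(shù),按請求類型(mutating/readOnly)區(qū)分
sum(apiserver_current_inflight_requests) by (request_kind)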
(三)K8s-ControllerManager組件監(jiān)控
controller-manager 是 Kubernetes 控制面的組件,通常不太可能出問題,一般監(jiān)控一下通用的進(jìn)程指標(biāo)就問題不大了。不過 controller-manager 確實也通過 /metrics 暴露了很多白盒指標(biāo),我們也一并梳理一下相關(guān)內(nèi)容。
監(jiān)控思路跟上面一樣,也是用Prometheus-Agent的方式進(jìn)行指標(biāo)采集。
(1)創(chuàng)建prometheus的配置文件
因為我們上面做apiserver的時候已經(jīng)做了權(quán)限綁定和一些基礎(chǔ)配置,所以這里直接在Prometheus的配置文件里添加對應(yīng)的job模塊內(nèi)容即可。
這里我們可以直接打開之前創(chuàng)建的prometheus-cm這個configmap配置文件,添加一個關(guān)于controller-manager的job即可:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
## 這里添加即可以下內(nèi)容即可
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)重新創(chuàng)建controller的endpoints
先查看一下自己有沒有對應(yīng)的controller-manager的endpoints,如果沒有,創(chuàng)建一個service即可。
為什么需要endpoints呢?因為我們上面Prometheus的采集規(guī)則role用的就是endpoints。
[root@k8s-master ~]# kubectl get endpoints -A | grep controller
這里如果沒有查詢到,就創(chuàng)建一個service文件:
vim controller-manager-service.yaml
apiVersion: v1
kind: Service
metadata:
annotations:
labels:
k8s-app: kube-controller-manager
name: kube-controller-manager
namespace: kube-system
spec:
clusterIP: None
ports:
- name: https-metrics
port: 10257
protocol: TCP
targetPort: 10257
selector:
component: kube-controller-manager
sessionAffinity: None
type: ClusterIP
運行yaml: kubectl apply -f controller-manager-service.yaml
(3)更改controller的bind-address
注意:如果你使用的是kubeadm安裝的k8s集群,需要把controller-manager的bind-address改為0.0.0.0:
[root@k8s-master ~]# vim /etc/kubernetes/manifests/kube-controller-manager.yaml
....
....
- --bind-address=0.0.0.0 ##找到bind-address 把127.0.0.1 改為 0.0.0.0
(4)指標(biāo)測試
然后重新apply一下configmap,并重啟Prometheus-agent的pod,讓新的配置生效,示例如下。
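下面是一個重載配置的示例做法,二選一即可(Pod 的標(biāo)簽、IP 以你的實際環(huán)境為準(zhǔn)):
## 方式一:apply 新配置后滾動重啟 agent
kubectl apply -f prometheus-cm.yaml
kubectl rollout restart deployment prometheus-agent -n flashcat
## 方式二:利用 --web.enable-lifecycle,向 agent 的 9090 端口發(fā) reload 請求
## 注意:ConfigMap 掛載內(nèi)容同步到 Pod 里有延遲(通常一分鐘左右),reload 前先確認(rèn)文件已更新
POD_IP=$(kubectl get pod -n flashcat -l app=prometheus-agent -o jsonpath='{.items[0].status.podIP}')
curl -X POST "http://$POD_IP:9090/-/reload"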
重啟后先在夜鶯的web頁面查詢指標(biāo),測試指標(biāo):daemon_controller_rate_limiter_use
導(dǎo)入監(jiān)控大盤,大盤鏈接:categraf/cm-dash.json at main · flashcatcloud/categraf · GitHub
查看儀表盤(導(dǎo)入儀表盤的操作跟上面導(dǎo)入apiserver的儀表盤一樣,把json文件內(nèi)容復(fù)制進(jìn)去即可)
controller-manager關(guān)鍵指標(biāo)意思也貼出來
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求apiserver的耗時分布,按照url+verb統(tǒng)計
# HELP cronjob_controller_cronjob_job_creation_skew_duration_seconds [ALPHA] Time between when a cronjob is scheduled to be run, and when the corresponding job is created
# TYPE cronjob_controller_cronjob_job_creation_skew_duration_seconds histogram
cronjob 創(chuàng)建到運行的時間分布
# HELP leader_election_master_status [ALPHA] Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
控制器的選舉狀態(tài),0表示backup, 1表示master
# HELP node_collector_zone_health [ALPHA] Gauge measuring percentage of healthy nodes per zone.
# TYPE node_collector_zone_health gauge
每個zone的健康node占比
# HELP node_collector_zone_size [ALPHA] Gauge measuring number of registered Nodes per zones.
# TYPE node_collector_zone_size gauge
每個zone的node數(shù)
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
cpu使用量(也可以理解為cpu使用率)
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
控制器打開的fd數(shù)
# HELP pv_collector_bound_pv_count [ALPHA] Gauge measuring number of persistent volume currently bound
# TYPE pv_collector_bound_pv_count gauge
當(dāng)前綁定的pv數(shù)量
# HELP pv_collector_unbound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently unbound
# TYPE pv_collector_unbound_pvc_count gauge
當(dāng)前沒有綁定的pvc數(shù)量
# HELP pv_collector_bound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently bound
# TYPE pv_collector_bound_pvc_count gauge
當(dāng)前綁定的pvc數(shù)量
# HELP pv_collector_total_pv_count [ALPHA] Gauge measuring total number of persistent volumes
# TYPE pv_collector_total_pv_count gauge
pv總數(shù)量
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
各個controller已接受的任務(wù)總數(shù)
與apiserver的workqueue_adds_total指標(biāo)類似
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
各個controller隊列深度,表示一個controller中的任務(wù)的數(shù)量
與apiserver的workqueue_depth類似,這個是指各個controller中隊列的深度,數(shù)值越小越好
# HELP workqueue_queue_duration_seconds [ALPHA] How long in seconds an item stays in workqueue before being requested.
# TYPE workqueue_queue_duration_seconds histogram
任務(wù)在隊列中的等待耗時,按照控制器分別統(tǒng)計
# HELP workqueue_work_duration_seconds [ALPHA] How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram
任務(wù)出隊到被處理完成的時間,按照控制分別統(tǒng)計
# HELP workqueue_retries_total [ALPHA] Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter
任務(wù)進(jìn)入隊列重試的次數(shù)
# HELP workqueue_longest_running_processor_seconds [ALPHA] How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
正在處理的任務(wù)中,最長耗時任務(wù)的處理時間
# HELP endpoint_slice_controller_syncs [ALPHA] Number of EndpointSlice syncs
# TYPE endpoint_slice_controller_syncs counter
endpoint_slice 同步的數(shù)量(1.20以上)
# HELP get_token_fail_count [ALPHA] Counter of failed Token() requests to the alternate token source
# TYPE get_token_fail_count counter
獲取token失敗的次數(shù)
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
controller gc的cpu使用率
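同樣,結(jié)合上面的指標(biāo),給出幾個 controller-manager 的 PromQL 查詢示例(僅作參考):
## 各 controller 工作隊列的當(dāng)前深度,數(shù)值持續(xù)偏大說明處理不過來
sum(workqueue_depth) by (name)
## 任務(wù)在隊列中等待時間的 P99,按 controller(name)匯聚
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket[5m])) by (name, le))
## 當(dāng)前實例是否為 leader(1 為 master,0 為 backup)
leader_election_master_status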
(四)K8s-Scheduler組件監(jiān)控
scheduler 是 Kubernetes 的控制面組件,負(fù)責(zé)把對象調(diào)度到合適的 node 上,會有一系列的規(guī)則計算和篩選,重點關(guān)注調(diào)度相關(guān)的指標(biāo)。相關(guān)監(jiān)控數(shù)據(jù)也是通過 /metrics 接口暴露,scheduler 暴露的端口是10259。
接下來就是采集數(shù)據(jù)了,我們還是使用 prometheus agent 來拉取數(shù)據(jù),原汁原味的,只要在前面的 configmap 中增加 scheduler 相關(guān)的job配置即可。
(1)創(chuàng)建prometheus的配置文件
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
##添加以下scheduler的job即可
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)配置Scheduler的service
跟上面一樣,首先我們要查看有沒有相關(guān)的scheduler的endpoints,如果沒有,就要創(chuàng)建一個service來暴露:
[root@k8s-master ~]# kubectl get endpoints -A | grep schedu
## 如果沒有我們就創(chuàng)建一個service的yaml
vim scheduler-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: kube-scheduler
name: kube-scheduler
namespace: kube-system
spec:
clusterIP: None
ports:
- name: https
port: 10259
protocol: TCP
targetPort: 10259
selector:
component: kube-scheduler
sessionAffinity: None
type: ClusterIP
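寫好之后同樣先 apply,并確認(rèn) endpoints 已經(jīng)生成(示例):
kubectl apply -f scheduler-service.yaml
## 確認(rèn) kube-scheduler 的 endpoints 已經(jīng)出現(xiàn)
kubectl get endpoints kube-scheduler -n kube-system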
(3)重啟prometheus-agent
更新完configmap后要重新apply一下(或者直接kubectl edit修改),更改完成后如果還是無法獲取指標(biāo),就重啟一下Prometheus-agent的pod,或者用 curl -X POST "http://<PROMETHEUS_IP>:9090/-/reload" 重載 Prometheus。這里的IP是Prometheus pod的IP,可以用 kubectl get pod -o wide -n flashcat 查看。
(注意這里如果你的k8s是kubeadm安裝的,也要去scheduler的manifests文件把bind-address更改為0.0.0.0)
[root@k8s-master manifests]# vim /etc/kubernetes/manifests/kube-scheduler.yaml
......
......
......
- --bind-address=0.0.0.0 ##找到這行更改為0.0.0.0即可
(4)測試指標(biāo)導(dǎo)入儀表盤
測試指標(biāo):scheduler_scheduler_cache_size
導(dǎo)入監(jiān)控大盤,大盤json鏈接:categraf/scheduler-dash.json at main · · GitHub
這里也貼出常用scheduler關(guān)鍵指標(biāo)意思:
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求apiserver的延遲分布
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
請求apiserver的總數(shù) ,按照host code method 統(tǒng)計
# HELP leader_election_master_status [ALPHA] Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
調(diào)度器的選舉狀態(tài),0表示backup, 1表示master
# HELP scheduler_queue_incoming_pods_total [STABLE] Number of pods added to scheduling queues by event and queue type.
# TYPE scheduler_queue_incoming_pods_total counter
進(jìn)入調(diào)度隊列的pod數(shù)
# HELP scheduler_preemption_attempts_total [STABLE] Total preemption attempts in the cluster till now
# TYPE scheduler_preemption_attempts_total counter
調(diào)度器驅(qū)逐容器的次數(shù)
# HELP scheduler_scheduler_cache_size [ALPHA] Number of nodes, pods, and assumed (bound) pods in the scheduler cache.
# TYPE scheduler_scheduler_cache_size gauge
調(diào)度器cache中node pod和綁定pod的數(shù)目
# HELP scheduler_pending_pods [STABLE] Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulableQ.
# TYPE scheduler_pending_pods gauge
調(diào)度pending的pod數(shù)量,按照queue type分別統(tǒng)計
# HELP scheduler_plugin_execution_duration_seconds [ALPHA] Duration for running a plugin at a specific extension point.
# TYPE scheduler_plugin_execution_duration_seconds histogram
調(diào)度插件在每個擴(kuò)展點的執(zhí)行時間,按照extension_point+plugin+status 分別統(tǒng)計
# HELP scheduler_e2e_scheduling_duration_seconds [ALPHA] (Deprecated since 1.23.0) E2e scheduling latency in seconds (scheduling algorithm + binding). This metric is replaced by scheduling_attempt_duration_seconds.
# TYPE scheduler_e2e_scheduling_duration_seconds histogram
調(diào)度延遲分布,1.23.0 以后會被scheduling_attempt_duration_seconds替代
# HELP scheduler_framework_extension_point_duration_seconds [STABLE] Latency for running all plugins of a specific extension point.
# TYPE scheduler_framework_extension_point_duration_seconds histogram
調(diào)度框架的擴(kuò)展點延遲分布,按extension_point(擴(kuò)展點Bind Filter Permit PreBind/PostBind PreFilter/PostFilter Reserve)
+profile(調(diào)度器)+ status(調(diào)度成功) 統(tǒng)計
# HELP scheduler_pod_scheduling_attempts [STABLE] Number of attempts to successfully schedule a pod.
# TYPE scheduler_pod_scheduling_attempts histogram
pod調(diào)度成功前,調(diào)度重試的次數(shù)分布
# HELP scheduler_schedule_attempts_total [STABLE] Number of attempts to schedule pods, by the result. 'unschedulable' means a pod could not be scheduled, while 'error' means an internal scheduler problem.
# TYPE scheduler_schedule_attempts_total counter
按照調(diào)度結(jié)果統(tǒng)計的調(diào)度重試次數(shù)。 "unschedulable" 表示無法調(diào)度,"error"表示調(diào)度器內(nèi)部錯誤
# HELP scheduler_scheduler_goroutines [ALPHA] Number of running goroutines split by the work they do such as binding.
# TYPE scheduler_scheduler_goroutines gauge
按照功能(binding filter之類)統(tǒng)計的goroutine數(shù)量
# HELP scheduler_scheduling_algorithm_duration_seconds [ALPHA] Scheduling algorithm latency in seconds
# TYPE scheduler_scheduling_algorithm_duration_seconds histogram
調(diào)度算法的耗時分布
# HELP scheduler_scheduling_attempt_duration_seconds [STABLE] Scheduling attempt latency in seconds (scheduling algorithm + binding)
# TYPE scheduler_scheduling_attempt_duration_seconds histogram
調(diào)度算法+binding的耗時分布
# HELP scheduler_scheduler_goroutines [ALPHA] Number of running goroutines split by the work they do such as binding.
# TYPE scheduler_scheduler_goroutines gauge
調(diào)度器的goroutines數(shù)目
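結(jié)合上面的指標(biāo),scheduler 這邊可以先驗證這樣幾個查詢(示例):
## 各隊列中 pending 的 pod 數(shù)量(active/backoff/unschedulable)
sum(scheduler_pending_pods) by (queue)
## 一次調(diào)度嘗試(調(diào)度算法+綁定)的 P99 耗時
histogram_quantile(0.99, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))
## 按結(jié)果統(tǒng)計的調(diào)度嘗試速率,重點關(guān)注 unschedulable 和 error
sum(rate(scheduler_schedule_attempts_total[5m])) by (result)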
(五)K8s-Etcd組件監(jiān)控
ETCD 是 Kubernetes 控制面的重要組件和依賴,Kubernetes 的各類信息都存儲在 ETCD 中,所以監(jiān)控 ETCD 就顯得尤為重要。ETCD 在 Kubernetes 中的架構(gòu)角色如下(只與 APIServer 交互):
ETCD 是一個類似 Zookeeper 的產(chǎn)品,通常由多個節(jié)點組成集群,節(jié)點之間使用 raft 協(xié)議保證一致性。ETCD 具有以下特點:
- 每個節(jié)點都有一個角色狀態(tài),F(xiàn)ollower、Candidate、Leader
- 如果 Follower 找不到當(dāng)前 Leader 節(jié)點的時候,就會變成 Candidate
- 選舉系統(tǒng)會從 Candidate 中選出 Leader
- 所有的寫操作都通過 Leader 進(jìn)行
- 一旦 Leader 從大多數(shù) Follower 拿到 ack,該寫操作就被認(rèn)為是“已提交”狀態(tài)
- 只要大多數(shù)節(jié)點存活,整個 ETCD 就是存活的,個別節(jié)點掛掉不影響整個集群的可用性
- ETCD 使用 restful 風(fēng)格的 HTTP API 來操作,這使得 ETCD 的使用非常方便,這也是 ETCD 流行的一個關(guān)鍵因素
ETCD 這么云原生的組件,顯然是內(nèi)置支持了 /metrics 接口的,不過 ETCD 很講求安全性,默認(rèn)的 2379 端口的訪問是要用證書的,我來測試一下先:
[root@tt-fc-dev01.nj ~]# curl -k https://localhost:2379/metrics
curl: (35) error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
[root@tt-fc-dev01.nj ~]# ls /etc/kubernetes/pki/etcd
ca.crt ca.key healthcheck-client.crt healthcheck-client.key peer.crt peer.key server.crt server.key
[root@tt-fc-dev01.nj ~]# curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://localhost:2379/metrics | head -n 6
# HELP etcd_cluster_version Which version is running. 1 for 'cluster_version' label with current cluster version
# TYPE etcd_cluster_version gauge
etcd_cluster_version{cluster_version="3.5"} 1
# HELP etcd_debugging_auth_revision The current revision of auth store.
# TYPE etcd_debugging_auth_revision gauge
etcd_debugging_auth_revision 1
使用 kubeadm 安裝的 Kubernetes 集群,相關(guān)證書是在 /etc/kubernetes/pki/etcd 目錄下,為 curl 命令指定相關(guān)證書,是可以訪問得通的。后面使用 Categraf 的 prometheus 插件直接采集相關(guān)數(shù)據(jù)即可。
不過指標(biāo)數(shù)據(jù)實在沒必要做這么強的安全管控,整得挺麻煩。實際上,ETCD 也確實提供了另一個端口來獲取指標(biāo)數(shù)據(jù),無需走這套證書認(rèn)證機(jī)制。
(1)更改etcd配置文件監(jiān)聽地址為0.0.0.0
這里我們首先去etcd的manifests文件更改metrics監(jiān)聽地址:
[root@k8s-master manifests]# vim /etc/kubernetes/manifests/etcd.yaml
......
......
- --listen-metrics-urls=http://0.0.0.0:2381 ##找到listen-metrics-urls這行,把地址改為0.0.0.0
這樣改完以后,我們就能直接通過2381端口來抓取metrics數(shù)據(jù):
[root@tt-fc-dev01.nj ~]# curl -s localhost:2381/metrics | head -n 6
# HELP etcd_cluster_version Which version is running. 1 for 'cluster_version' label with current cluster version
# TYPE etcd_cluster_version gauge
etcd_cluster_version{cluster_version="3.5"} 1
# HELP etcd_debugging_auth_revision The current revision of auth store.
# TYPE etcd_debugging_auth_revision gauge
etcd_debugging_auth_revision 1
(2)數(shù)據(jù)采集
ETCD 的數(shù)據(jù)采集通常使用 3 種方式:
- 使用 ETCD 所在宿主的 agent 直接來采集,因為 ETCD 是個靜態(tài) Pod,采用的 hostNetwork,所以 agent 直接連上去采集即可
- 把采集器和 ETCD 做成 sidecar 的模式,ETCD 的使用其實已經(jīng)越來越廣泛,不只是給 Kubernetes 使用,很多業(yè)務(wù)也在使用,在 Kubernetes 里創(chuàng)建和管理 ETCD 也是很常見的做法,sidecar 這種模式非常干凈,隨著 ETCD 創(chuàng)建而創(chuàng)建,隨著其銷毀而銷毀,省事
- 使用服務(wù)發(fā)現(xiàn)機(jī)制,在中心端部署采集器,就像前面 APIServer、Controller-manager、Scheduler 的做法,使用 Prometheus agent mode 采集監(jiān)控數(shù)據(jù)。當(dāng)然,這種方式需要有對應(yīng)的 etcd endpoints,可以用 kubectl get endpoints -n kube-system 自行檢查一下,如果沒有,創(chuàng)建一下即可
[root@k8s-master manifests]# kubectl get endpoints -A | grep etcd
##如果沒有對應(yīng)的endpoint 就創(chuàng)建一個service
vim etcd-service.yaml
apiVersion: v1
kind: Service
metadata:
namespace: kube-system
name: etcd
labels:
k8s-app: etcd
spec:
selector:
component: etcd
type: ClusterIP
clusterIP: None
ports:
- name: http
port: 2381
targetPort: 2381
protocol: TCP
把上面的 service 先 kubectl apply 一下,然后更改我們之前寫的configmap里的Prometheus配置文件,添加etcd的job模塊即可:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
## 添加以下etcd字段
- job_name: 'etcd'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;etcd;http
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(3)指標(biāo)測試
更改完成后,重新加載yaml文件和Prometheus-agent,然后打開夜鶯的web頁面進(jìn)行指標(biāo)查詢,測試指標(biāo)是否查詢得到:etcd_cluster_version
查詢到指標(biāo)后,導(dǎo)入監(jiān)控儀表盤,儀表盤json地址:categraf/etcd-dash. · fl/categraf · GitHub
復(fù)制json文件內(nèi)容導(dǎo)入到儀表盤即可
ETCD關(guān)鍵指標(biāo)含義:
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
是否為 leader,1 表示是 leader,0 表示不是
# HELP etcd_server_health_success The total number of successful health checks
# TYPE etcd_server_health_success counter
etcd server 健康檢查成功次數(shù)
# HELP etcd_server_health_failures The total number of failed health checks
# TYPE etcd_server_health_failures counter
etcd server 健康檢查失敗次數(shù)
# HELP etcd_disk_defrag_inflight Whether or not defrag is active on the member. 1 means active, 0 means not.
# TYPE etcd_disk_defrag_inflight gauge
是否啟動數(shù)據(jù)壓縮,1表示壓縮,0表示沒有啟動壓縮
# HELP etcd_server_snapshot_apply_in_progress_total 1 if the server is applying the incoming snapshot. 0 if none.
# TYPE etcd_server_snapshot_apply_in_progress_total gauge
是否在應(yīng)用快照中,1 表示在快照中,0 表示沒有
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
集群leader切換的次數(shù)
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc 調(diào)用總數(shù)
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd wal同步耗時
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd proposal(提議)失敗總次數(shù)(proposal就是完成raft協(xié)議的一次請求)
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd proposal(提議)pending總次數(shù)(proposal就是完成raft協(xié)議的一次請求)
# HELP etcd_server_read_indexes_failed_total The total number of failed read indexes seen.
# TYPE etcd_server_read_indexes_failed_total counter
讀取索引失敗的次數(shù)統(tǒng)計(v3索引為所有key都建了索引,索引是為了加快range操作)
# HELP etcd_server_slow_read_indexes_total The total number of pending read indexes not in sync with leader's or timed out read index requests.
# TYPE etcd_server_slow_read_indexes_total counter
讀取到過期索引或者讀取超時的次數(shù)
# HELP etcd_server_quota_backend_bytes Current backend storage quota size in bytes.
# TYPE etcd_server_quota_backend_bytes gauge
當(dāng)前后端的存儲quota(db大小的上限)
通過參數(shù)quota-backend-bytes調(diào)整大小,默認(rèn)2G,官方建議不超過8G
# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database physically allocated in bytes.
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd 分配的db大小(使用量大小+空閑大小)
# HELP etcd_mvcc_db_total_size_in_use_in_bytes Total size of the underlying database logically in use in bytes.
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd db的使用量大小
# HELP etcd_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_mvcc_range_total counter
etcd執(zhí)行range的數(shù)量
# HELP etcd_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_mvcc_put_total counter
etcd執(zhí)行put的數(shù)量
# HELP etcd_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_mvcc_txn_total counter
etcd實例執(zhí)行事務(wù)的數(shù)量
# HELP etcd_mvcc_delete_total Total number of deletes seen by this member.
# TYPE etcd_mvcc_delete_total counter
etcd實例執(zhí)行delete操作的數(shù)量
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
etcd cpu使用量
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
etcd 內(nèi)存使用量
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
etcd 打開的fd數(shù)目
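結(jié)合上面的指標(biāo),etcd 可以先重點盯這幾個查詢(示例,閾值只是經(jīng)驗參考):
## wal fsync 的 P99 耗時,經(jīng)驗上持續(xù)高于 10ms 說明磁盤可能成為瓶頸
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))
## 最近一小時 leader 切換次數(shù),頻繁切換通常意味著網(wǎng)絡(luò)或磁盤抖動
increase(etcd_server_leader_changes_seen_total[1h])
## db 使用量占 quota 的比例,接近 1 時需要考慮壓縮/碎片整理或調(diào)大 quota
etcd_mvcc_db_total_size_in_use_in_bytes / etcd_server_quota_backend_bytes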
(六)K8s-kubelet組件監(jiān)控
接下來我們要監(jiān)控的是k8s第二個模塊slave-node的組件。kubelet監(jiān)聽兩個固定端口,一個是10248,一個是10250,可以用ss -ntlp | grep kubelet命令查看。
10248是健康檢測的端口,用于檢測節(jié)點狀態(tài),可以使用curl localhost:10248/healthz查看:
[root@k8s-master ~]# curl localhost:10248/healthz
ok
10250是kubelet默認(rèn)的端口,/metrics接口就在這個端口下,但是你不能直接通過這個端口獲取metrics數(shù)據(jù),因為它有認(rèn)證機(jī)制。這一期我們還是使用Prometheus-agent的方式來采集metrics數(shù)據(jù),下一期再講通過認(rèn)證、使用daemonset的方式部署categraf來采集。
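可以用前面為 categraf 這個 ServiceAccount 簽發(fā)的 token 先手工驗證一下 10250 的認(rèn)證(示例,token 的獲取方式見上文 apiserver 部分):
## 不帶認(rèn)證直接訪問會被拒絕
curl -s -k https://localhost:10250/metrics | head -n 3
## 帶上有 nodes/metrics、nodes/proxy 權(quán)限的 ServiceAccount token 就可以拿到數(shù)據(jù)
TOKEN=$(kubectl create token categraf -n flashcat)
curl -s -k -H "Authorization: Bearer $TOKEN" https://localhost:10250/metrics | head -n 5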
(1)配置Prometheus-agent configmap配置文件
跟上面的操作一樣,在configmap下面添加名為kubelet的job字段即可,然后重新加載configmap的yaml文件:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
- job_name: 'etcd'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;etcd;http
## 以下為添加的kubelete內(nèi)容
- job_name: 'kubelet'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-kubelet;https
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)配置kubelet的service和endpoints
跟之前一樣,我們要先查看集群里有沒有kubelet的endpoints,如果沒有就要添加:
[root@k8s-master ~]# kubectl get endpoints -A | grep kubelet
##如果沒有就添加
vim kubelet-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: kubelet
name: kube-kubelet
namespace: kube-system
spec:
clusterIP: None
ports:
- name: https
port: 10250
protocol: TCP
targetPort: 10250
sessionAffinity: None
type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
labels:
k8s-app: kubelet
name: kube-kubelet
namespace: kube-system
subsets:
- addresses:
- ip: 192.168.120.101
- ip: 192.168.120.102 ##這是我們自定義的Endpoints,addresses里填的是需要監(jiān)控的k8s節(jié)點IP,這里我寫的是master和node的IP地址
ports:
- name: https
port: 10250
protocol: TCP
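Endpoints 里的 addresses 需要列出每個要監(jiān)控節(jié)點的 IP,可以用下面的命令確認(rèn)后再填(示例):
## 查看各節(jié)點的 InternalIP,填到上面 Endpoints 的 addresses 里
kubectl get nodes -o wide
## 或者只取 IP 列表
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
## 填好之后 apply
kubectl apply -f kubelet-service.yaml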
(3)測試指標(biāo)
然后打開夜鶯的web頁面,查看指標(biāo)是否采集上來。測試指標(biāo):kubelet_running_pods
導(dǎo)入儀表盤,儀表盤地址:categraf/dashboard-by-ident.json at main · · GitHub
kubelet相關(guān)指標(biāo)含義:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
gc的時間統(tǒng)計(summary指標(biāo))
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
goroutine 數(shù)量
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
線程數(shù)量
# HELP kubelet_cgroup_manager_duration_seconds [ALPHA] Duration in seconds for cgroup manager operations. Broken down by method.
# TYPE kubelet_cgroup_manager_duration_seconds histogram
操作cgroup的時長分布,按照操作類型統(tǒng)計
# HELP kubelet_containers_per_pod_count [ALPHA] The number of containers per pod.
# TYPE kubelet_containers_per_pod_count histogram
pod中container數(shù)量的統(tǒng)計(spec.containers的數(shù)量)
# HELP kubelet_docker_operations_duration_seconds [ALPHA] Latency in seconds of Docker operations. Broken down by operation type.
# TYPE kubelet_docker_operations_duration_seconds histogram
操作docker的時長分布,按照操作類型統(tǒng)計
# HELP kubelet_docker_operations_errors_total [ALPHA] Cumulative number of Docker operation errors by operation type.
# TYPE kubelet_docker_operations_errors_total counter
操作docker的錯誤累計次數(shù),按照操作類型統(tǒng)計
# HELP kubelet_docker_operations_timeout_total [ALPHA] Cumulative number of Docker operation timeout by operation type.
# TYPE kubelet_docker_operations_timeout_total counter
操作docker的超時統(tǒng)計,按照操作類型統(tǒng)計
# HELP kubelet_docker_operations_total [ALPHA] Cumulative number of Docker operations by operation type.
# TYPE kubelet_docker_operations_total counter
操作docker的累計次數(shù),按照操作類型統(tǒng)計
# HELP kubelet_eviction_stats_age_seconds [ALPHA] Time between when stats are collected, and when pod is evicted based on those stats by eviction signal
# TYPE kubelet_eviction_stats_age_seconds histogram
驅(qū)逐操作的時間分布,按照驅(qū)逐信號(原因)分類統(tǒng)計
# HELP kubelet_evictions [ALPHA] Cumulative number of pod evictions by eviction signal
# TYPE kubelet_evictions counter
驅(qū)逐次數(shù)統(tǒng)計,按照驅(qū)逐信號(原因)統(tǒng)計
# HELP kubelet_http_inflight_requests [ALPHA] Number of the inflight http requests
# TYPE kubelet_http_inflight_requests gauge
請求kubelet的inflight請求數(shù),按照method path server_type統(tǒng)計, 注意與每秒的request數(shù)區(qū)別開
# HELP kubelet_http_requests_duration_seconds [ALPHA] Duration in seconds to serve http requests
# TYPE kubelet_http_requests_duration_seconds histogram
請求kubelet的請求時間統(tǒng)計, 按照method path server_type統(tǒng)計
# HELP kubelet_http_requests_total [ALPHA] Number of the http requests received since the server started
# TYPE kubelet_http_requests_total counter
請求kubelet的請求數(shù)統(tǒng)計,按照method path server_type統(tǒng)計
# HELP kubelet_managed_ephemeral_containers [ALPHA] Current number of ephemeral containers in pods managed by this kubelet. Ephemeral containers will be ignored if disabled by the EphemeralContainers feature gate, and this number will be 0.
# TYPE kubelet_managed_ephemeral_containers gauge
當(dāng)前kubelet管理的臨時容器數(shù)量,如果 --feature-gates=EphemeralContainers=false,則一直為0
# HELP kubelet_network_plugin_operations_duration_seconds [ALPHA] Latency in seconds of network plugin operations. Broken down by operation type.
# TYPE kubelet_network_plugin_operations_duration_seconds histogram
網(wǎng)絡(luò)插件的操作耗時分布,按照操作類型(operation_type)統(tǒng)計
# HELP kubelet_network_plugin_operations_errors_total [ALPHA] Cumulative number of network plugin operation errors by operation type.
# TYPE kubelet_network_plugin_operations_errors_total counter
網(wǎng)絡(luò)插件累計操作錯誤數(shù)統(tǒng)計,按照操作類型(operation_type)統(tǒng)計
# HELP kubelet_network_plugin_operations_total [ALPHA] Cumulative number of network plugin operations by operation type.
# TYPE kubelet_network_plugin_operations_total counter
網(wǎng)絡(luò)插件累計操作數(shù)統(tǒng)計,按照操作類型(operation_type)統(tǒng)計
# HELP kubelet_node_name [ALPHA] The node's name. The count is always 1.
# TYPE kubelet_node_name gauge
node name
# HELP kubelet_pleg_discard_events [ALPHA] The number of discard events in PLEG.
# TYPE kubelet_pleg_discard_events counter
PLEG(pod lifecycle event generator) 丟棄的event數(shù)統(tǒng)計
# HELP kubelet_pleg_last_seen_seconds [ALPHA] Timestamp in seconds when PLEG was last seen active.
# TYPE kubelet_pleg_last_seen_seconds gauge
PLEG上次活躍的時間戳
# HELP kubelet_pleg_relist_duration_seconds [ALPHA] Duration in seconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_duration_seconds histogram
PLEG relist pod時間分布
# HELP kubelet_pleg_relist_interval_seconds [ALPHA] Interval in seconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_seconds histogram
PLEG relist 間隔時間分布
# HELP kubelet_pod_start_duration_seconds [ALPHA] Duration in seconds for a single pod to go from pending to running.
# TYPE kubelet_pod_start_duration_seconds histogram
pod啟動時間(從pending到running)分布,即從kubelet watch到pod開始、到pod中container都running為止(watch各種source channel的pod變更)
# HELP kubelet_pod_worker_duration_seconds [ALPHA] Duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync
# TYPE kubelet_pod_worker_duration_seconds histogram
pod狀態(tài)變化的時間分布, 按照操作類型(create update sync)統(tǒng)計, worker就是kubelet中處理一個pod的邏輯工作單位
# HELP kubelet_pod_worker_start_duration_seconds [ALPHA] Duration in seconds from seeing a pod to starting a worker.
# TYPE kubelet_pod_worker_start_duration_seconds histogram
kubelet watch到pod到worker啟動的時間分布
# HELP kubelet_run_podsandbox_duration_seconds [ALPHA] Duration in seconds of the run_podsandbox operations. Broken down by RuntimeClass.Handler.
# TYPE kubelet_run_podsandbox_duration_seconds histogram
啟動sandbox的時間分布
# HELP kubelet_run_podsandbox_errors_total [ALPHA] Cumulative number of the run_podsandbox operation errors by RuntimeClass.Handler.
# TYPE kubelet_run_podsandbox_errors_total counter
啟動sanbox出現(xiàn)error的總數(shù)
# HELP kubelet_running_containers [ALPHA] Number of containers currently running
# TYPE kubelet_running_containers gauge
當(dāng)前containers運行狀態(tài)的統(tǒng)計, 按照container狀態(tài)統(tǒng)計,created running exited
# HELP kubelet_running_pods [ALPHA] Number of pods that have a running pod sandbox
# TYPE kubelet_running_pods gauge
當(dāng)前處于running狀態(tài)pod數(shù)量
# HELP kubelet_runtime_operations_duration_seconds [ALPHA] Duration in seconds of runtime operations. Broken down by operation type.
# TYPE kubelet_runtime_operations_duration_seconds histogram
容器運行時的操作耗時(container在create list exec remove stop等的耗時)
# HELP kubelet_runtime_operations_errors_total [ALPHA] Cumulative number of runtime operation errors by operation type.
# TYPE kubelet_runtime_operations_errors_total counter
容器運行時的操作錯誤數(shù)統(tǒng)計(按操作類型統(tǒng)計)
# HELP kubelet_runtime_operations_total [ALPHA] Cumulative number of runtime operations by operation type.
# TYPE kubelet_runtime_operations_total counter
容器運行時的操作總數(shù)統(tǒng)計(按操作類型統(tǒng)計)
# HELP kubelet_started_containers_errors_total [ALPHA] Cumulative number of errors when starting containers
# TYPE kubelet_started_containers_errors_total counter
kubelet啟動容器錯誤總數(shù)統(tǒng)計(按code和container_type統(tǒng)計)
code包括ErrImagePull ErrImageInspect ErrImagePull ErrRegistryUnavailable ErrInvalidImageName等
container_type一般為"container" "podsandbox"
# HELP kubelet_started_containers_total [ALPHA] Cumulative number of containers started
# TYPE kubelet_started_containers_total counter
kubelet啟動容器總數(shù)
# HELP kubelet_started_pods_errors_total [ALPHA] Cumulative number of errors when starting pods
# TYPE kubelet_started_pods_errors_total counter
kubelet啟動pod遇到的錯誤總數(shù)(只有創(chuàng)建sandbox遇到錯誤才會統(tǒng)計)
# HELP kubelet_started_pods_total [ALPHA] Cumulative number of pods started
# TYPE kubelet_started_pods_total counter
kubelet啟動的pod總數(shù)
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
統(tǒng)計cpu使用率
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
允許進(jìn)程打開的最大fd數(shù)
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
當(dāng)前打開的fd數(shù)量
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
進(jìn)程駐留內(nèi)存大小
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
進(jìn)程啟動時間
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求apiserver的耗時統(tǒng)計(按照url和請求類型統(tǒng)計verb)
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
請求apiserver的總次數(shù)(按照返回碼code和請求類型method統(tǒng)計)
# HELP storage_operation_duration_seconds [ALPHA] Storage operation duration
# TYPE storage_operation_duration_seconds histogram
存儲操作耗時(按照存儲plugin(configmap emptydir hostpath 等 )和operation_name分類統(tǒng)計)
# HELP volume_manager_total_volumes [ALPHA] Number of volumes in Volume Manager
# TYPE volume_manager_total_volumes gauge
本機(jī)掛載的volume數(shù)量統(tǒng)計(按照plugin_name和state統(tǒng)計
plugin_name包括"host-path" "empty-dir" "configmap" "projected")
state(desired_state_of_world期狀態(tài)/actual_state_of_world實際狀態(tài))
cAdvisor指標(biāo)梳理
# HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
# TYPE container_cpu_cfs_periods_total counter
cfs時間片總數(shù), 完全公平調(diào)度的時間片總數(shù)(分配到cpu的時間片數(shù))
# HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals.
# TYPE container_cpu_cfs_throttled_periods_total counter
容器被throttle的時間片總數(shù)
# HELP container_cpu_cfs_throttled_seconds_total Total time duration the container has been throttled.
# TYPE container_cpu_cfs_throttled_seconds_total counter
容器被throttle的時間
# HELP container_file_descriptors Number of open file descriptors for the container.
# TYPE container_file_descriptors gauge
容器打開的fd數(shù)
# HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed
# TYPE container_memory_usage_bytes gauge
容器內(nèi)存使用量,單位byte
# HELP container_network_receive_bytes_total Cumulative count of bytes received
# TYPE container_network_receive_bytes_total counter
容器入方向的流量
# HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted
# TYPE container_network_transmit_bytes_total counter
容器出方向的流量
# HELP container_spec_cpu_period CPU period of the container.
# TYPE container_spec_cpu_period gauge
容器的cpu調(diào)度單位時間
# HELP container_spec_cpu_quota CPU quota of the container.
# TYPE container_spec_cpu_quota gauge
容器的cpu規(guī)格 ,除以單位調(diào)度時間可以計算核數(shù)
# HELP container_spec_memory_limit_bytes Memory limit for the container.
# TYPE container_spec_memory_limit_bytes gauge
容器的內(nèi)存規(guī)格,單位byte
# HELP container_threads Number of threads running inside the container
# TYPE container_threads gauge
容器當(dāng)前的線程數(shù)
# HELP container_threads_max Maximum number of threads allowed inside the container, infinity if value is zero
# TYPE container_threads_max gauge
允許容器啟動的最大線程數(shù)
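結(jié)合上面 kubelet 和 cAdvisor 的指標(biāo),可以先試試這樣幾個查詢(示例;cAdvisor 指標(biāo)來自 kubelet 的 /metrics/cadvisor 路徑,需要確認(rèn)你的采集任務(wù)確實抓到了這些指標(biāo)):
## 每個節(jié)點上處于 running 狀態(tài)的 pod 數(shù)
kubelet_running_pods
## pod 從 pending 到 running 的 P95 啟動耗時
histogram_quantile(0.95, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))
## 容器內(nèi)存使用量相對內(nèi)存 limit 的比例(只統(tǒng)計設(shè)置了 limit 的容器)
container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)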
(七)K8s-KubeProxy組件監(jiān)控
KubeProxy 主要負(fù)責(zé)節(jié)點的網(wǎng)絡(luò)管理,它在每個節(jié)點上都會存在,通過10249端口暴露監(jiān)控指標(biāo)。
這里的指標(biāo)采集我們也用上面的方法,使用Prometheus-agent的方式。
(1)配置Prometheus-agent configmap配置文件
在之前的configmap的yaml文件中添加名為kube-proxy的job模塊字段,添加完記得重新加載yaml文件和Prometheus-agent的pod:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
- job_name: 'etcd'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;etcd;http
- job_name: 'kubelet'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-kubelet;https
##這里是添加的模塊
- job_name: 'kube-proxy'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-proxy;http
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)配置kube-proxy的endpoints
跟之前一樣,先查看有沒有kube-proxy的endpoints,如果沒有就添加:
[root@k8s-master ~]# kubectl get endpoints -A | grep kube-pro
## 如果沒有 添加service
vim kube-proxy-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: proxy
name: kube-proxy
namespace: kube-system
spec:
clusterIP: None
selector:
k8s-app: kube-proxy
ports:
- name: http
port: 10249
protocol: TCP
targetPort: 10249
sessionAffinity: None
type: ClusterIP
(3)更改kube-proxy的metricsBindAddress
查看 kube-proxy 的10249端口是否綁定到127.0.0.1了,如果是,就修改成0.0.0.0,通過kubectl edit cm -n kube-system kube-proxy修改metricsBindAddress即可:
[root@k8s-master ~]# kubectl edit cm -n kube-system kube-proxy
......
......
......
kind: KubeProxyConfiguration
metricsBindAddress: "0.0.0.0" ## 這里修改為0.0.0.0 即可
mode: ""
nodePortAddresses: null
oomScoreAdj: null
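修改 ConfigMap 后,kube-proxy 不會自動加載新配置,需要重啟它的 DaemonSet Pod,然后再驗證 10249 端口(示例):
## 重啟 kube-proxy 的 DaemonSet,讓新的 metricsBindAddress 生效
kubectl rollout restart daemonset kube-proxy -n kube-system
## 在任意節(jié)點上驗證 metrics 端口已經(jīng)可以訪問
curl -s http://localhost:10249/metrics | head -n 5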
(4)指標(biāo)測試
在夜鶯的web頁面輸入指標(biāo)測試:kubeproxy_network_programming_duration_seconds_bucket
導(dǎo)入監(jiān)控大盤,儀表盤json文件:https://github.com/flin/inputs/kube_proxy/dashboard-by-ident.json
kube-proxy關(guān)鍵指標(biāo)含義:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
gc時間
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
goroutine數(shù)量
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
線程數(shù)量
# HELP kubeproxy_network_programming_duration_seconds [ALPHA] In Cluster Network Programming Latency in seconds
# TYPE kubeproxy_network_programming_duration_seconds histogram
service或者pod發(fā)生變化到kube-proxy規(guī)則同步完成的時間。該指標(biāo)含義較復(fù)雜,參照https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md
# HELP kubeproxy_sync_proxy_rules_duration_seconds [ALPHA] SyncProxyRules latency in seconds
# TYPE kubeproxy_sync_proxy_rules_duration_seconds histogram
規(guī)則同步耗時
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_pending [ALPHA] Pending proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_pending gauge
endpoint 發(fā)生變化后規(guī)則同步pending的次數(shù)
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_total [ALPHA] Cumulative proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_total counter
endpoint 發(fā)生變化后規(guī)則同步的總次數(shù)
# HELP kubeproxy_sync_proxy_rules_iptables_restore_failures_total [ALPHA] Cumulative proxy iptables restore failures
# TYPE kubeproxy_sync_proxy_rules_iptables_restore_failures_total counter
本機(jī)上 iptables restore 失敗的總次數(shù)
# HELP kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds [ALPHA] The last time a sync of proxy rules was queued
# TYPE kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds gauge
最近一次規(guī)則同步的請求時間戳,如果比下一個指標(biāo) kubeproxy_sync_proxy_rules_last_timestamp_seconds 大很多,那說明同步 hung 住了
# HELP kubeproxy_sync_proxy_rules_last_timestamp_seconds [ALPHA] The last time proxy rules were successfully synced
# TYPE kubeproxy_sync_proxy_rules_last_timestamp_seconds gauge
最近一次規(guī)則同步的完成時間戳
# HELP kubeproxy_sync_proxy_rules_service_changes_pending [ALPHA] Pending proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_pending gauge
service變化引起的規(guī)則同步pending數(shù)量
# HELP kubeproxy_sync_proxy_rules_service_changes_total [ALPHA] Cumulative proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_total counter
service變化引起的規(guī)則同步總數(shù)
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
利用這個指標(biāo)統(tǒng)計cpu使用率
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
進(jìn)程可以打開的最大fd數(shù)
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
進(jìn)程當(dāng)前打開的fd數(shù)
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
統(tǒng)計內(nèi)存使用大小
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
進(jìn)程啟動時間戳
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求 apiserver 的耗時(按照url和verb統(tǒng)計)
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
請求 apiserver 的總數(shù)(按照code method host統(tǒng)計)
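結(jié)合上面的指標(biāo),kube-proxy 可以先關(guān)注這兩個查詢(示例):
## 規(guī)則同步(SyncProxyRules)的 P99 耗時,按實例匯聚
histogram_quantile(0.99, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (instance, le))
## 最近 10 分鐘 iptables restore 失敗次數(shù),持續(xù)增長需要排查
increase(kubeproxy_sync_proxy_rules_iptables_restore_failures_total[10m])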
最后的最后
夜鶯監(jiān)控k8s的方法,夜鶯的官網(wǎng)也做了合集專欄,有興趣的伙伴可以去看看Kubernetes監(jiān)控專欄,無論是指標(biāo)還是原理都做了解釋:初識Kubernetes -(flashcat.cloud)。如果在部署中遇到問題,歡迎在本文章留言,24小時內(nèi)必回復(fù)。
看完這一期肯定會有小伙伴有疑問:我的業(yè)務(wù)都跑在pod上面,光監(jiān)控這些組件沒啥大用啊,我想知道總共有幾個 Namespace,有幾個 Service、Deployment、Statefulset,某個 Deployment 期望有幾個 Pod 要運行、實際有幾個 Pod 在運行,這些既有的指標(biāo)就無法回答了。當(dāng)然這一點肯定是重中之重,這個問題我們下一期詳細(xì)講解:使用 kube-state-metrics(俗稱KSM)監(jiān)控 Kubernetes 對象,監(jiān)聽各個Kubernetes對象的狀態(tài),生成指標(biāo)暴露出來讓我們查看。下一期還會講用daemonset的最佳實踐方案來采集監(jiān)控。