目錄
(一)Kubernetes監(jiān)控體系
1.Kubernetes監(jiān)控策略
(二)K8s-ApiServer組件監(jiān)控
(1)我們先創(chuàng)建一個namespace來專門做夜鶯監(jiān)控采集指標(biāo)
(2)創(chuàng)建認(rèn)證授權(quán)信息rbac
(3)使用prometheus-agent進(jìn)行指標(biāo)采集
① 創(chuàng)建Prometheus的配置文件
② 部署Prometheus Agent
(三)K8s-ControllerManager組件監(jiān)控
(1)創(chuàng)建prometheus的配置文件
(2)重新創(chuàng)建controller的endpoints
(3)更改controller的bind-address
(4)指標(biāo)測試
(四)K8s-Scheduler組件監(jiān)控
(1)創(chuàng)建prometheus的配置文件
(2)配置Scheduler的service
(3)重啟prometheus-agent
(4)測試指標(biāo)導(dǎo)入儀表盤
(五)K8s-Etcd組件監(jiān)控
(1)更改etcd配置文件監(jiān)聽地址為0.0.0.0
(2)數(shù)據(jù)采集
(3)指標(biāo)測試
(六)K8s-kubelet組件監(jiān)控
(1)配置Prometheus-agent configmap配置文件
(2)配置kubelet的service和endpoints
(3)測試指標(biāo)
(七)K8s-KubeProxy組件監(jiān)控
(1)配置Prometheus-agent configmap配置文件
(2)配置kube-proxy的endpoints
(3)更改kube-proxy的metricsBindAddress
(4)指標(biāo)測試
最后的最后
這一期我們講一下用夜鶯來監(jiān)控 k8s 的組件。因為 k8s 的組件復(fù)雜、內(nèi)容多,所以我們分成上下兩部分來學(xué)習(xí),這一期先學(xué)習(xí)監(jiān)控 k8s 的幾大組件。首先我們先來認(rèn)識一下 k8s 的架構(gòu)和監(jiān)控概述。
(一)Kubernetes監(jiān)控體系
當(dāng)我們談及 Kubernetes 監(jiān)控的時候,我們在談?wù)撌裁??顯然是 Kubernetes 架構(gòu)下各個部分的監(jiān)控:Kubernetes 所跑的環(huán)境、Kubernetes 本身、跑在 Kubernetes 上面的應(yīng)用等等。Kubernetes 所跑的環(huán)境可能是物理機(jī)、虛擬機(jī),并且依賴底層的基礎(chǔ)網(wǎng)絡(luò);Kubernetes 上面的應(yīng)用可能是業(yè)務(wù)應(yīng)用程序,也可能是各類中間件、數(shù)據(jù)庫;Kubernetes 本身則包含很多組件,我們通過一張 Kubernetes 架構(gòu)圖來說明。
最左側(cè)是 UI 層,包括頁面 UI 以及命令行工具 kubectl;中間部分是 Kubernetes 控制面組件;右側(cè)部分是工作負(fù)載節(jié)點,包含兩個工作負(fù)載節(jié)點。
k8s的這個架構(gòu)我們可以大致分為兩個模塊來理解:
1.Master組件
apiserver:是Kubernetes集群中所有組件之間通信的中心組件,也是集群的前端接口。kube-apiserver負(fù)責(zé)驗證和處理API請求,并將它們轉(zhuǎn)發(fā)給其他組件。
scheduler:Kubernetes Scheduler負(fù)責(zé)在Kubernetes集群中選擇最合適的Node來運行新創(chuàng)建的Pod,會考慮節(jié)點的資源利用率、Pod的調(diào)度限制、網(wǎng)絡(luò)位置等因素。
controller-manager:Kubernetes Controller Manager包含多個控制器,負(fù)責(zé)監(jiān)視并確保集群狀態(tài)符合預(yù)期。例如ReplicationController、NamespaceController、ServiceAccountController等等。
etcd:etcd是Kubernetes的后端數(shù)據(jù)庫,用于存儲和管理Kubernetes集群狀態(tài)信息,例如Pod、Service、ConfigMap等對象的配置和狀態(tài)信息。
2.Slave-node組件
kubelet:Kubelet是在每個Node上運行的代理服務(wù),負(fù)責(zé)管理和監(jiān)視該Node上的容器,并與kube-apiserver進(jìn)行通信以保持節(jié)點狀態(tài)最新。
kube-proxy:Kubernetes Proxy負(fù)責(zé)為容器提供網(wǎng)絡(luò)代理和負(fù)載均衡功能,使得容器可以訪問其他Pod、Service等網(wǎng)絡(luò)資源。
Container Runtime:如Docker、rkt、runc等,提供容器運行時環(huán)境。
1.Kubernetes監(jiān)控策略
Kubernetes作為開源的容器編排工具,為用戶提供了一個可以統(tǒng)一調(diào)度、統(tǒng)一管理的云操作系統(tǒng),解決了用戶應(yīng)用程序如何運行的問題。而一旦在生產(chǎn)環(huán)境中大量基于Kubernetes部署和管理應(yīng)用程序后,作為系統(tǒng)管理員,還需要充分了解應(yīng)用程序以及Kubernetes集群服務(wù)的運行質(zhì)量,通過對應(yīng)用以及集群運行狀態(tài)數(shù)據(jù)的收集和分析,持續(xù)優(yōu)化和改進(jìn),從而提供一個安全可靠的生產(chǎn)運行環(huán)境。這一小節(jié)中我們將討論使用Kubernetes時的監(jiān)控策略該如何設(shè)計。
從物理結(jié)構(gòu)上講,Kubernetes主要用于整合和管理底層的基礎(chǔ)設(shè)施資源,對外提供應(yīng)用容器的自動化部署和管理能力,這些基礎(chǔ)設(shè)施可能是物理機(jī)、虛擬機(jī)、云主機(jī)等等。因此,基礎(chǔ)資源的使用直接影響當(dāng)前集群的容量和應(yīng)用的狀態(tài)。在這部分,我們需要關(guān)注集群中各個節(jié)點的主機(jī)負(fù)載、CPU使用率、內(nèi)存使用率、存儲空間以及網(wǎng)絡(luò)吞吐等監(jiān)控指標(biāo)。
從自身架構(gòu)上講,kube-apiserver是Kubernetes提供所有服務(wù)的入口,無論是外部的客戶端還是集群內(nèi)部的組件都直接與kube-apiserver進(jìn)行通訊。因此,kube-apiserver的并發(fā)和吞吐量直接決定了集群性能的好壞。其次,對于外部用戶而言,Kubernetes是否能夠快速地完成Pod的調(diào)度以及啟動,是影響其使用體驗的關(guān)鍵因素:這個過程主要由kube-scheduler負(fù)責(zé)完成調(diào)度工作,由kubelet完成Pod的創(chuàng)建和啟動工作。因此對Kubernetes集群本身,我們需要評價其自身的服務(wù)質(zhì)量,主要關(guān)注Kubernetes的API響應(yīng)時間,以及Pod的啟動時間等指標(biāo)。
Kubernetes的最終目標(biāo)還是需要為業(yè)務(wù)服務(wù),因此我們還需要能夠監(jiān)控應(yīng)用容器的資源使用情況。對于內(nèi)置了Prometheus支持的應(yīng)用程序,也要支持從這些應(yīng)用程序中采集內(nèi)部的監(jiān)控指標(biāo)。最后,結(jié)合黑盒監(jiān)控模式,對集群中部署的服務(wù)進(jìn)行探測,從而在應(yīng)用發(fā)生故障后,能夠快速處理和恢復(fù)。
綜上所述,我們需要綜合使用白盒監(jiān)控和黑盒監(jiān)控模式,建立覆蓋基礎(chǔ)設(shè)施、Kubernetes核心組件、應(yīng)用容器等的全面監(jiān)控體系。
在白盒監(jiān)控層面我們需要關(guān)注:
- 基礎(chǔ)設(shè)施層(Node):為整個集群和應(yīng)用提供運行時資源,需要通過各節(jié)點的kubelet獲取節(jié)點的基本狀態(tài),同時通過在節(jié)點上部署Node Exporter獲取節(jié)點的資源使用情況;
- 容器基礎(chǔ)設(shè)施(Container):為應(yīng)用提供運行時環(huán)境,Kubelet內(nèi)置了對cAdvisor的支持,用戶可以直接通過Kubelet組件獲取給節(jié)點上容器相關(guān)監(jiān)控指標(biāo);
- 用戶應(yīng)用(Pod):Pod中會包含一組容器,它們一起工作,并且對外提供一個(或者一組)功能。如果用戶部署的應(yīng)用程序內(nèi)置了對Prometheus的支持,那么我們還應(yīng)該采集這些Pod暴露的監(jiān)控指標(biāo);
- Kubernetes組件:獲取并監(jiān)控Kubernetes核心組件的運行狀態(tài),確保平臺自身的穩(wěn)定運行。
而在黑盒監(jiān)控層面,則主要需要關(guān)注以下:
- 內(nèi)部服務(wù)負(fù)載均衡(Service):在集群內(nèi),通過Service在集群暴露應(yīng)用功能,集群內(nèi)應(yīng)用和應(yīng)用之間訪問時提供內(nèi)部的負(fù)載均衡。通過Blackbox Exporter探測Service的可用性,確保當(dāng)Service不可用時能夠快速得到告警通知;
- 外部訪問入口(Ingress):通過Ingress提供集群外的訪問入口,從而可以使外部客戶端能夠訪問到部署在Kubernetes集群內(nèi)的服務(wù)。因此也需要通過Blackbox Exporter對Ingress的可用性進(jìn)行探測,確保外部用戶能夠正常訪問集群內(nèi)的功能;
說了這么多,大家肯定對k8s的監(jiān)控有了一點初步的了解,那我們接下來趁熱打鐵,直接上實踐,用夜鶯來監(jiān)控k8s的六大組件。
(二)K8s-ApiServer組件監(jiān)控
ApiServer 是 Kubernetes 架構(gòu)中的核心,是所有 API 的入口,它串聯(lián)所有的系統(tǒng)組件。
為了方便監(jiān)控管理 ApiServer,設(shè)計者們?yōu)樗┞读艘幌盗械闹笜?biāo)數(shù)據(jù)。當(dāng)你部署完集群,默認(rèn)會在default名稱空間下創(chuàng)建一個名叫kubernetes的 service,它就是 ApiServer 的地址;當(dāng)然也可以用 ss -tlnp 查看本機(jī)暴露的 apiserver 端口。
[root@k8s-master ~]# kubectl get service -A | grep kubernetes
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 52d
[root@k8s-master ~]# ss -tlpn | grep apiserver
LISTEN 0 128 [::]:6443 [::]:* users:(("kube-apiserver",pid=2287,fd=7))
但是當(dāng)我們想要去抓取 metrics 數(shù)據(jù)的時候,會發(fā)現(xiàn)抓取不了,因為沒有權(quán)限(證書/Token):
[root@k8s-master ~]# curl -s -k https://localhost:6443/metrics
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
"reason": "Forbidden",
"details": {},
"code": 403
}[root@k8s-master ~]#
所以,要監(jiān)控 ApiServer,采集到對應(yīng)的指標(biāo),就需要先授權(quán)。為此,我們先準(zhǔn)備認(rèn)證信息。
(1)我們先創(chuàng)建一個namespace來專門做夜鶯監(jiān)控采集指標(biāo)
[root@k8s-master ~]# kubectl create namespace flashcat
(2)創(chuàng)建認(rèn)證授權(quán)信息rbac
這個yaml文件的意思是:我們創(chuàng)建一個名為categraf的ServiceAccount,然后通過ClusterRole和ClusterRoleBinding給它綁定對應(yīng)resources的verbs權(quán)限,讓categraf這個賬號有足夠的權(quán)限來采集k8s各個組件的指標(biāo)。
vim apiserver-auth.yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: categraf
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/metrics
- nodes/stats
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: categraf
namespace: flashcat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: categraf
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: categraf
subjects:
- kind: ServiceAccount
name: categraf
namespace: flashcat
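寫好之后先 apply 一下,然后可以簡單驗證這個 ServiceAccount 的 Token 是否真的能訪問 /metrics。下面是一個驗證思路的示例(僅供參考:kubectl create token 需要 k8s 1.24+,低版本要從對應(yīng)的 Secret 里取 token):
kubectl apply -f apiserver-auth.yaml
## 1.24+ 可以直接給 ServiceAccount 簽發(fā)一個臨時 token
TOKEN=$(kubectl create token categraf -n flashcat)
## 帶上 token 再訪問 /metrics,就不會再報 403 了
curl -s -k -H "Authorization: Bearer $TOKEN" https://localhost:6443/metrics | head -n 5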
(3)使用prometheus-agent進(jìn)行指標(biāo)采集
支持 Kubernetes 服務(wù)發(fā)現(xiàn)的 agent 有不少,但是要說最原汁原味的還是 Prometheus 自身。Prometheus 從 v2.32.0 開始支持 agent mode,即把 Prometheus 進(jìn)程當(dāng)做采集器 agent,采集了數(shù)據(jù)之后通過 remote write 方式傳給中心(這里使用早就準(zhǔn)備好的 Nightingale 作為數(shù)據(jù)接收服務(wù)端)。那這里我就使用 Prometheus 的 agent mode 方式來采集 APIServer。
① 創(chuàng)建Prometheus的配置文件
這里給大家解釋一下這個配置文件的一些內(nèi)容,給一些對Prometheus還不是很了解的小伙伴參考:
global:第一部分定義的名為global的模塊
scrape_interval:采集間隔
evaluation_interval:評估間隔,用于控制數(shù)據(jù)的收集和處理頻率
scrape_configs:第二部分定義的模塊,用來配置Prometheus要監(jiān)控的目標(biāo)
job_name:表示該配置是用于監(jiān)控Kubernetes APIServer的
kubernetes_sd_configs:指定了從Kubernetes Service Discovery中獲取目標(biāo)對象的方式,此處使用 role: endpoints 獲取endpoints對象,也就是APIServer的IP地址和端口信息
scheme:指定了網(wǎng)絡(luò)通信協(xié)議是HTTPS
tls_config:指定了TLS證書的相關(guān)配置,包括是否驗證服務(wù)器端證書等
insecure_skip_verify:是一個bool類型的參數(shù),如果為true,表示跳過對服務(wù)器端證書的驗證。在生產(chǎn)環(huán)境中不應(yīng)該這樣用,因為會導(dǎo)致通信不安全;正常情況下,我們需要在客戶端上配置ca證書來驗證服務(wù)器端證書的合法性
authorization:指定了認(rèn)證信息的來源,這里使用了Pod里默認(rèn)掛載的Kubernetes ServiceAccount的Token
relabel_configs:用于對原始標(biāo)簽進(jìn)行變換,篩選出需要的目標(biāo)數(shù)據(jù)
source_labels:定義了三個用來匹配的標(biāo)簽,其中__meta_kubernetes_namespace表示Kubernetes命名空間,__meta_kubernetes_service_name表示服務(wù)名稱,__meta_kubernetes_endpoint_port_name表示端口名稱
action:指定該操作是keep,也就是只保留符合正則表達(dá)式的目標(biāo)
regex:用來對標(biāo)簽進(jìn)行過濾的正則表達(dá)式,這里是default;kubernetes;https,表示要保留的目標(biāo)是default命名空間下的kubernetes服務(wù),并且端口名是https
通過這個relabel_configs塊,Prometheus只保留default命名空間下kubernetes服務(wù)、端口名為https的采集目標(biāo),并把采集到的數(shù)據(jù)推送給后續(xù)的n9e夜鶯
remote_write:用于將Prometheus采集的數(shù)據(jù)寫入外部存儲,這里我們填的是夜鶯的地址,/prometheus/v1/write是外部存儲的接口路徑
vim prometheus-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
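在把這個 ConfigMap 應(yīng)用到集群之前,可以先把 data 里的 prometheus.yml 內(nèi)容單獨存成一個文件,用 promtool 做一次語法檢查(這里假設(shè)本地已經(jīng)安裝了 promtool;配置里引用的本地文件路徑不存在時檢查可能會有告警,僅作參考):
## 先本地校驗配置語法,避免 agent 加載配置時才發(fā)現(xiàn)寫錯
promtool check config prometheus.yml
## 校驗沒問題后再把 ConfigMap 應(yīng)用到集群
kubectl apply -f prometheus-cm.yaml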
② 部署Prometheus Agent
這里我們使用deployment的方式部署,其中--enable-feature=agent表示啟動的是 agent 模式。
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus-agent
namespace: flashcat
labels:
app: prometheus-agent
spec:
replicas: 1
selector:
matchLabels:
app: prometheus-agent
template:
metadata:
labels:
app: prometheus-agent
spec:
serviceAccountName: categraf
containers:
- name: prometheus
image: prom/prometheus
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--web.enable-lifecycle"
- "--enable-feature=agent"
ports:
- containerPort: 9090
resources:
requests:
cpu: 500m
memory: 500M
limits:
cpu: 1
memory: 1Gi
volumeMounts:
- name: prometheus-config-volume
mountPath: /etc/prometheus/
- name: prometheus-storage-volume
mountPath: /prometheus/
volumes:
- name: prometheus-config-volume
configMap:
defaultMode: 420
name: prometheus-agent-conf
- name: prometheus-storage-volume
emptyDir: {}
查看是否部署成功
[root@k8s-master ~]# kubectl get pod -n flashcat
NAME READY STATUS RESTARTS AGE
prometheus-agent-7c8d7bc7bb-42djw 1/1 Running 0 115m
然后可以到夜鶯web頁面查看指標(biāo),測試指標(biāo):apiserver_request_total
獲取到了指標(biāo)數(shù)據(jù),后面就是合理利用指標(biāo)做其他動作,比如構(gòu)建面板、告警處理等。
導(dǎo)入Apiserver的監(jiān)控大盤,監(jiān)控的json文件在categraf/apiserver-dash.json · GitHub
直接復(fù)制導(dǎo)入json文件的內(nèi)容即可
另外,Apiserver 的關(guān)鍵指標(biāo)的含義也貼出來
# HELP apiserver_request_duration_seconds [STABLE] Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
# TYPE apiserver_request_duration_seconds histogram
apiserver響應(yīng)的時間分布,按照url 和 verb 分類
一般按照instance和verb+時間 匯聚
# HELP apiserver_request_total [STABLE] Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
# TYPE apiserver_request_total counter
apiserver的請求總數(shù),按照verb、 version、 group、resource、scope、component、 http返回碼分類統(tǒng)計
# HELP apiserver_current_inflight_requests [STABLE] Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
# TYPE apiserver_current_inflight_requests gauge
最大并發(fā)請求數(shù), 按mutating(非get list watch的請求)和readOnly(get list watch)分別限制
超過max-requests-inflight(默認(rèn)值400)和max-mutating-requests-inflight(默認(rèn)200)的請求會被限流
apiserver變更時要注意觀察,也是反饋集群容量的一個重要指標(biāo)
# HELP apiserver_response_sizes [STABLE] Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
# TYPE apiserver_response_sizes histogram
apiserver 響應(yīng)大小,單位byte, 按照verb、 version、 group、resource、scope、component分類統(tǒng)計
# HELP watch_cache_capacity [ALPHA] Total capacity of watch cache broken by resource type.
# TYPE watch_cache_capacity gauge
按照資源類型統(tǒng)計的watch緩存大小
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
每秒鐘用戶態(tài)和系統(tǒng)態(tài)cpu消耗時間, 計算apiserver進(jìn)程的cpu的使用率
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
apiserver的內(nèi)存使用量(單位:Byte)
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
apiserver中包含的controller的工作隊列,已處理的任務(wù)總數(shù)
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
apiserver中包含的controller的工作隊列深度,表示當(dāng)前隊列中要處理的任務(wù)的數(shù)量,數(shù)值越小越好
例如APIServiceRegistrationController admission_quota_controller
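結(jié)合上面這些關(guān)鍵指標(biāo),這里再給出幾個可以直接在夜鶯即時查詢里試用的 PromQL 示例(僅供參考,標(biāo)簽名以實際采集到的數(shù)據(jù)為準(zhǔn)):
## apiserver 每秒請求數(shù),按 verb 和返回碼匯聚
sum(rate(apiserver_request_total[5m])) by (verb, code)
## 請求響應(yīng)延遲的 P99,按 verb 匯聚
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, le))
## 當(dāng)前并發(fā)請求數(shù),按請求類型(mutating/readOnly)區(qū)分
sum(apiserver_current_inflight_requests) by (request_kind)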
(三)K8s-ControllerManager組件監(jiān)控
controller-manager 是 Kubernetes 控制面的組件,通常不太可能出問題,一般監(jiān)控一下通用的進(jìn)程指標(biāo)就問題不大了。不過 controller-manager 確實也通過 /metrics 暴露了很多白盒指標(biāo),我們也一并梳理一下相關(guān)內(nèi)容。
監(jiān)控思路跟上面一樣,也是用Prometheus-Agent的方式進(jìn)行指標(biāo)采集。
(1)創(chuàng)建prometheus的配置文件
因為我們上面做apiserver的時候已經(jīng)做了權(quán)限綁定和一些基礎(chǔ)配置,所以這里直接在Prometheus的配置文件里添加對應(yīng)的job模塊內(nèi)容即可。
這里我們可以直接打開之前創(chuàng)建的prometheus-cm這個configmap配置文件,添加一個關(guān)于controller-manager的job即可:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
## 這里添加即可以下內(nèi)容即可
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)重新創(chuàng)建controller的endpoints
先查看一下自己有沒有對應(yīng)的controller-manager的endpoints,如果沒有,創(chuàng)建一個service即可。
為什么需要endpoints呢?因為我們上面Prometheus的采集規(guī)則role用的就是endpoints。
[root@k8s-master ~]# kubectl get endpoints -A | grep controller
這里如果沒有查詢到,就創(chuàng)建一個service文件:
vim controller-manager-service.yaml
apiVersion: v1
kind: Service
metadata:
annotations:
labels:
k8s-app: kube-controller-manager
name: kube-controller-manager
namespace: kube-system
spec:
clusterIP: None
ports:
- name: https-metrics
port: 10257
protocol: TCP
targetPort: 10257
selector:
component: kube-controller-manager
sessionAffinity: None
type: ClusterIP
運行yaml: kubectl apply -f controller-manager-service.yaml
(3)更改controller的bind-address
注意:如果你使用的是kubeadm安裝的k8s集群,需要把controller-manager的bind-address改為0.0.0.0:
[root@k8s-master ~]# vim /etc/kubernetes/manifests/kube-controller-manager.yaml
....
....
- --bind-address=0.0.0.0 ##找到bind-address 把127.0.0.1 改為 0.0.0.0
(4)指標(biāo)測試
然后重新apply一下configmap,并重啟Prometheus-agent的pod,讓新的配置生效,示例如下。
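下面是一個重載配置的示例做法,二選一即可(Pod 的標(biāo)簽、IP 以你的實際環(huán)境為準(zhǔn)):
## 方式一:apply 新配置后滾動重啟 agent
kubectl apply -f prometheus-cm.yaml
kubectl rollout restart deployment prometheus-agent -n flashcat
## 方式二:利用 --web.enable-lifecycle,向 agent 的 9090 端口發(fā) reload 請求
## 注意:ConfigMap 掛載內(nèi)容同步到 Pod 里有延遲(通常一分鐘左右),reload 前先確認(rèn)文件已更新
POD_IP=$(kubectl get pod -n flashcat -l app=prometheus-agent -o jsonpath='{.items[0].status.podIP}')
curl -X POST "http://$POD_IP:9090/-/reload"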
重啟后先在夜鶯的web頁面查詢指標(biāo),測試指標(biāo):daemon_controller_rate_limiter_use
導(dǎo)入監(jiān)控大盤,大盤鏈接:categraf/cm-dash.json at main · flashcatcloud/categraf · GitHub
查看儀表盤(導(dǎo)入儀表盤的操作跟上面導(dǎo)入apiserver的儀表盤一樣,把json文件內(nèi)容復(fù)制進(jìn)去即可)
controller-manager關(guān)鍵指標(biāo)意思也貼出來
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求apiserver的耗時分布,按照url+verb統(tǒng)計
# HELP cronjob_controller_cronjob_job_creation_skew_duration_seconds [ALPHA] Time between when a cronjob is scheduled to be run, and when the corresponding job is created
# TYPE cronjob_controller_cronjob_job_creation_skew_duration_seconds histogram
cronjob 創(chuàng)建到運行的時間分布
# HELP leader_election_master_status [ALPHA] Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
控制器的選舉狀態(tài),0表示backup, 1表示master
# HELP node_collector_zone_health [ALPHA] Gauge measuring percentage of healthy nodes per zone.
# TYPE node_collector_zone_health gauge
每個zone的健康node占比
# HELP node_collector_zone_size [ALPHA] Gauge measuring number of registered Nodes per zones.
# TYPE node_collector_zone_size gauge
每個zone的node數(shù)
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
cpu使用量(也可以理解為cpu使用率)
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
控制器打開的fd數(shù)
# HELP pv_collector_bound_pv_count [ALPHA] Gauge measuring number of persistent volume currently bound
# TYPE pv_collector_bound_pv_count gauge
當(dāng)前綁定的pv數(shù)量
# HELP pv_collector_unbound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently unbound
# TYPE pv_collector_unbound_pvc_count gauge
當(dāng)前沒有綁定的pvc數(shù)量
# HELP pv_collector_bound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently bound
# TYPE pv_collector_bound_pvc_count gauge
當(dāng)前綁定的pvc數(shù)量
# HELP pv_collector_total_pv_count [ALPHA] Gauge measuring total number of persistent volumes
# TYPE pv_collector_total_pv_count gauge
pv總數(shù)量
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
各個controller已接受的任務(wù)總數(shù)
與apiserver的workqueue_adds_total指標(biāo)類似
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
各個controller隊列深度,表示一個controller中的任務(wù)的數(shù)量
與apiserver的workqueue_depth類似,這個是指各個controller中隊列的深度,數(shù)值越小越好
# HELP workqueue_queue_duration_seconds [ALPHA] How long in seconds an item stays in workqueue before being requested.
# TYPE workqueue_queue_duration_seconds histogram
任務(wù)在隊列中的等待耗時,按照控制器分別統(tǒng)計
# HELP workqueue_work_duration_seconds [ALPHA] How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram
任務(wù)出隊到被處理完成的時間,按照控制分別統(tǒng)計
# HELP workqueue_retries_total [ALPHA] Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter
任務(wù)進(jìn)入隊列重試的次數(shù)
# HELP workqueue_longest_running_processor_seconds [ALPHA] How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
正在處理的任務(wù)中,最長耗時任務(wù)的處理時間
# HELP endpoint_slice_controller_syncs [ALPHA] Number of EndpointSlice syncs
# TYPE endpoint_slice_controller_syncs counter
endpoint_slice 同步的數(shù)量(1.20以上)
# HELP get_token_fail_count [ALPHA] Counter of failed Token() requests to the alternate token source
# TYPE get_token_fail_count counter
獲取token失敗的次數(shù)
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
controller gc的cpu使用率
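同樣,結(jié)合上面的指標(biāo),給出幾個 controller-manager 的 PromQL 查詢示例(僅作參考):
## 各 controller 工作隊列的當(dāng)前深度,數(shù)值持續(xù)偏大說明處理不過來
sum(workqueue_depth) by (name)
## 任務(wù)在隊列中等待時間的 P99,按 controller(name)匯聚
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket[5m])) by (name, le))
## 當(dāng)前實例是否為 leader(1 為 master,0 為 backup)
leader_election_master_status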
(四)K8s-Scheduler組件監(jiān)控
scheduler 是 Kubernetes 的控制面組件,負(fù)責(zé)把對象調(diào)度到合適的 node 上,會有一系列的規(guī)則計算和篩選,重點關(guān)注調(diào)度相關(guān)的指標(biāo)。相關(guān)監(jiān)控數(shù)據(jù)也是通過 /metrics 接口暴露,scheduler 暴露的端口是10259。
接下來就是采集數(shù)據(jù)了,我們還是使用 prometheus agent 來拉取數(shù)據(jù),原汁原味的,只要在前面的 configmap 中增加 scheduler 相關(guān)的job配置即可。
(1)創(chuàng)建prometheus的配置文件
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
##添加以下scheduler的job即可
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)配置Scheduler的service
跟上面一樣,首先我們要查看有沒有相關(guān)的scheduler的endpoints,如果沒有,就要創(chuàng)建一個service來暴露:
[root@k8s-master ~]# kubectl get endpoints -A | grep schedu
## 如果沒有我們就創(chuàng)建一個service的yaml
vim scheduler-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: kube-scheduler
name: kube-scheduler
namespace: kube-system
spec:
clusterIP: None
ports:
- name: https
port: 10259
protocol: TCP
targetPort: 10259
selector:
component: kube-scheduler
sessionAffinity: None
type: ClusterIP
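寫好之后同樣先 apply,并確認(rèn) endpoints 已經(jīng)生成(示例):
kubectl apply -f scheduler-service.yaml
## 確認(rèn) kube-scheduler 的 endpoints 已經(jīng)出現(xiàn)
kubectl get endpoints kube-scheduler -n kube-system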
(3)重啟prometheus-agent
更新完configmap后要重新apply一下(或者直接kubectl edit修改),更改完成后如果還是無法獲取指標(biāo),就重啟一下Prometheus-agent的pod,或者用 curl -X POST "http://<PROMETHEUS_IP>:9090/-/reload" 重載 Prometheus。這里的IP是Prometheus pod的IP,可以用 kubectl get pod -o wide -n flashcat 查看。
(注意這里如果你的k8s是kubeadm安裝的,也要去scheduler的manifests文件把bind-address更改為0.0.0.0)
[root@k8s-master manifests]# vim /etc/kubernetes/manifests/kube-scheduler.yaml
......
......
......
- --bind-address=0.0.0.0 ##找到這行更改為0.0.0.0即可
(4)測試指標(biāo)導(dǎo)入儀表盤
測試指標(biāo):scheduler_scheduler_cache_size
導(dǎo)入監(jiān)控大盤,大盤json鏈接:categraf/scheduler-dash.json at main · · GitHub
這里也貼出常用scheduler關(guān)鍵指標(biāo)意思:
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求apiserver的延遲分布
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
請求apiserver的總數(shù) ,按照host code method 統(tǒng)計
# HELP leader_election_master_status [ALPHA] Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
調(diào)度器的選舉狀態(tài),0表示backup, 1表示master
# HELP scheduler_queue_incoming_pods_total [STABLE] Number of pods added to scheduling queues by event and queue type.
# TYPE scheduler_queue_incoming_pods_total counter
進(jìn)入調(diào)度隊列的pod數(shù)
# HELP scheduler_preemption_attempts_total [STABLE] Total preemption attempts in the cluster till now
# TYPE scheduler_preemption_attempts_total counter
調(diào)度器驅(qū)逐容器的次數(shù)
# HELP scheduler_scheduler_cache_size [ALPHA] Number of nodes, pods, and assumed (bound) pods in the scheduler cache.
# TYPE scheduler_scheduler_cache_size gauge
調(diào)度器cache中node pod和綁定pod的數(shù)目
# HELP scheduler_pending_pods [STABLE] Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulableQ.
# TYPE scheduler_pending_pods gauge
調(diào)度pending的pod數(shù)量,按照queue type分別統(tǒng)計
# HELP scheduler_plugin_execution_duration_seconds [ALPHA] Duration for running a plugin at a specific extension point.
# TYPE scheduler_plugin_execution_duration_seconds histogram
調(diào)度插件在每個擴(kuò)展點的執(zhí)行時間,按照extension_point+plugin+status 分別統(tǒng)計
# HELP scheduler_e2e_scheduling_duration_seconds [ALPHA] (Deprecated since 1.23.0) E2e scheduling latency in seconds (scheduling algorithm + binding). This metric is replaced by scheduling_attempt_duration_seconds.
# TYPE scheduler_e2e_scheduling_duration_seconds histogram
調(diào)度延遲分布,1.23.0 以后會被scheduling_attempt_duration_seconds替代
# HELP scheduler_framework_extension_point_duration_seconds [STABLE] Latency for running all plugins of a specific extension point.
# TYPE scheduler_framework_extension_point_duration_seconds histogram
調(diào)度框架的擴(kuò)展點延遲分布,按extension_point(擴(kuò)展點Bind Filter Permit PreBind/PostBind PreFilter/PostFilter Reserve)
+profile(調(diào)度器)+ status(調(diào)度成功) 統(tǒng)計
# HELP scheduler_pod_scheduling_attempts [STABLE] Number of attempts to successfully schedule a pod.
# TYPE scheduler_pod_scheduling_attempts histogram
pod調(diào)度成功前,調(diào)度重試的次數(shù)分布
# HELP scheduler_schedule_attempts_total [STABLE] Number of attempts to schedule pods, by the result. 'unschedulable' means a pod could not be scheduled, while 'error' means an internal scheduler problem.
# TYPE scheduler_schedule_attempts_total counter
按照調(diào)度結(jié)果統(tǒng)計的調(diào)度重試次數(shù)。 "unschedulable" 表示無法調(diào)度,"error"表示調(diào)度器內(nèi)部錯誤
# HELP scheduler_scheduler_goroutines [ALPHA] Number of running goroutines split by the work they do such as binding.
# TYPE scheduler_scheduler_goroutines gauge
按照功能(binding filter之類)統(tǒng)計的goroutine數(shù)量
# HELP scheduler_scheduling_algorithm_duration_seconds [ALPHA] Scheduling algorithm latency in seconds
# TYPE scheduler_scheduling_algorithm_duration_seconds histogram
調(diào)度算法的耗時分布
# HELP scheduler_scheduling_attempt_duration_seconds [STABLE] Scheduling attempt latency in seconds (scheduling algorithm + binding)
# TYPE scheduler_scheduling_attempt_duration_seconds histogram
調(diào)度算法+binding的耗時分布
# HELP scheduler_scheduler_goroutines [ALPHA] Number of running goroutines split by the work they do such as binding.
# TYPE scheduler_scheduler_goroutines gauge
調(diào)度器的goroutines數(shù)目
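結(jié)合上面的指標(biāo),scheduler 這邊可以先驗證這樣幾個查詢(示例):
## 各隊列中 pending 的 pod 數(shù)量(active/backoff/unschedulable)
sum(scheduler_pending_pods) by (queue)
## 一次調(diào)度嘗試(調(diào)度算法+綁定)的 P99 耗時
histogram_quantile(0.99, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le))
## 按結(jié)果統(tǒng)計的調(diào)度嘗試速率,重點關(guān)注 unschedulable 和 error
sum(rate(scheduler_schedule_attempts_total[5m])) by (result)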
(五)K8s-Etcd組件監(jiān)控
ETCD 是 Kubernetes 控制面的重要組件和依賴,Kubernetes 的各類信息都存儲在 ETCD 中,所以監(jiān)控 ETCD 就顯得尤為重要。ETCD 在 Kubernetes 中的架構(gòu)角色如下(只與 APIServer 交互):
ETCD 是一個類似 Zookeeper 的產(chǎn)品,通常由多個節(jié)點組成集群,節(jié)點之間使用 raft 協(xié)議保證一致性。ETCD 具有以下特點:
- 每個節(jié)點都有一個角色狀態(tài),F(xiàn)ollower、Candidate、Leader
- 如果 Follower 找不到當(dāng)前 Leader 節(jié)點的時候,就會變成 Candidate
- 選舉系統(tǒng)會從 Candidate 中選出 Leader
- 所有的寫操作都通過 Leader 進(jìn)行
- 一旦 Leader 從大多數(shù) Follower 拿到 ack,該寫操作就被認(rèn)為是“已提交”狀態(tài)
- 只要大多數(shù)節(jié)點存活,整個 ETCD 就是存活的,個別節(jié)點掛掉不影響整個集群的可用性
- ETCD 使用 restful 風(fēng)格的 HTTP API 來操作,這使得 ETCD 的使用非常方便,這也是 ETCD 流行的一個關(guān)鍵因素
ETCD 這么云原生的組件,顯然是內(nèi)置支持了 /metrics 接口的,不過 ETCD 很講求安全性,默認(rèn)的 2379 端口的訪問是要用證書的,我來測試一下先:
[root@tt-fc-dev01.nj ~]# curl -k https://localhost:2379/metrics
curl: (35) error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate
[root@tt-fc-dev01.nj ~]# ls /etc/kubernetes/pki/etcd
ca.crt ca.key healthcheck-client.crt healthcheck-client.key peer.crt peer.key server.crt server.key
[root@tt-fc-dev01.nj ~]# curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://localhost:2379/metrics | head -n 6
# HELP etcd_cluster_version Which version is running. 1 for 'cluster_version' label with current cluster version
# TYPE etcd_cluster_version gauge
etcd_cluster_version{cluster_version="3.5"} 1
# HELP etcd_debugging_auth_revision The current revision of auth store.
# TYPE etcd_debugging_auth_revision gauge
etcd_debugging_auth_revision 1
使用 kubeadm 安裝的 Kubernetes 集群,相關(guān)證書是在 /etc/kubernetes/pki/etcd 目錄下,為 curl 命令指定相關(guān)證書,是可以訪問得通的。后面使用 Categraf 的 prometheus 插件直接采集相關(guān)數(shù)據(jù)即可。
不過指標(biāo)數(shù)據(jù)實在沒必要做這么強的安全管控,整得挺麻煩。實際上,ETCD 也確實提供了另一個端口來獲取指標(biāo)數(shù)據(jù),無需走這套證書認(rèn)證機(jī)制。
(1)更改etcd配置文件監(jiān)聽地址為0.0.0.0
這里我們首先去etcd的manifests文件更改metrics監(jiān)聽地址:
[root@k8s-master manifests]# vim /etc/kubernetes/manifests/etcd.yaml
......
......
- --listen-metrics-urls=http://0.0.0.0:2381 ##找到listen-metrics-urls這行,把地址改為0.0.0.0
這樣改完以后,我們就能直接通過2381端口來抓取metrics數(shù)據(jù):
[root@tt-fc-dev01.nj ~]# curl -s localhost:2381/metrics | head -n 6
# HELP etcd_cluster_version Which version is running. 1 for 'cluster_version' label with current cluster version
# TYPE etcd_cluster_version gauge
etcd_cluster_version{cluster_version="3.5"} 1
# HELP etcd_debugging_auth_revision The current revision of auth store.
# TYPE etcd_debugging_auth_revision gauge
etcd_debugging_auth_revision 1
(2)數(shù)據(jù)采集
ETCD 的數(shù)據(jù)采集通常使用 3 種方式:
- 使用 ETCD 所在宿主的 agent 直接來采集,因為 ETCD 是個靜態(tài) Pod,采用的 hostNetwork,所以 agent 直接連上去采集即可
- 把采集器和 ETCD 做成 sidecar 的模式,ETCD 的使用其實已經(jīng)越來越廣泛,不只是給 Kubernetes 使用,很多業(yè)務(wù)也在使用,在 Kubernetes 里創(chuàng)建和管理 ETCD 也是很常見的做法,sidecar 這種模式非常干凈,隨著 ETCD 創(chuàng)建而創(chuàng)建,隨著其銷毀而銷毀,省事
- 使用服務(wù)發(fā)現(xiàn)機(jī)制,在中心端部署采集器,就像前面 APIServer、Controller-manager、Scheduler 的做法,使用 Prometheus agent mode 采集監(jiān)控數(shù)據(jù)。當(dāng)然,這種方式需要有對應(yīng)的 etcd endpoints,可以用 kubectl get endpoints -n kube-system 自行檢查一下,如果沒有,創(chuàng)建一下即可
[root@k8s-master manifests]# kubectl get endpoints -A | grep etcd
##如果沒有對應(yīng)的endpoint 就創(chuàng)建一個service
vim etcd-service.yaml
apiVersion: v1
kind: Service
metadata:
namespace: kube-system
name: etcd
labels:
k8s-app: etcd
spec:
selector:
component: etcd
type: ClusterIP
clusterIP: None
ports:
- name: http
port: 2381
targetPort: 2381
protocol: TCP
把上面的 service 先 kubectl apply 一下,然后更改我們之前寫的configmap里的Prometheus配置文件,添加etcd的job模塊即可:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
## 添加以下etcd字段
- job_name: 'etcd'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;etcd;http
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(3)指標(biāo)測試
更改完成后,重新加載yaml文件和Prometheus-agent,然后打開夜鶯的web頁面進(jìn)行指標(biāo)查詢,測試指標(biāo)是否查詢得到:etcd_cluster_version
查詢到指標(biāo)后,導(dǎo)入監(jiān)控儀表盤,儀表盤json地址:categraf/etcd-dash. · fl/categraf · GitHub
復(fù)制json文件內(nèi)容導(dǎo)入到儀表盤即可
ETCD關(guān)鍵指標(biāo)含義:
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
是否為 leader,1 表示是 leader,0 表示不是
# HELP etcd_server_health_success The total number of successful health checks
# TYPE etcd_server_health_success counter
etcd server 健康檢查成功次數(shù)
# HELP etcd_server_health_failures The total number of failed health checks
# TYPE etcd_server_health_failures counter
etcd server 健康檢查失敗次數(shù)
# HELP etcd_disk_defrag_inflight Whether or not defrag is active on the member. 1 means active, 0 means not.
# TYPE etcd_disk_defrag_inflight gauge
是否啟動數(shù)據(jù)壓縮,1表示壓縮,0表示沒有啟動壓縮
# HELP etcd_server_snapshot_apply_in_progress_total 1 if the server is applying the incoming snapshot. 0 if none.
# TYPE etcd_server_snapshot_apply_in_progress_total gauge
是否在應(yīng)用快照中,1 表示在快照中,0 表示沒有
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
集群leader切換的次數(shù)
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc 調(diào)用總數(shù)
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd wal同步耗時
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd proposal(提議)失敗總次數(shù)(proposal就是完成raft協(xié)議的一次請求)
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd proposal(提議)pending總次數(shù)(proposal就是完成raft協(xié)議的一次請求)
# HELP etcd_server_read_indexes_failed_total The total number of failed read indexes seen.
# TYPE etcd_server_read_indexes_failed_total counter
讀取索引失敗的次數(shù)統(tǒng)計(v3索引為所有key都建了索引,索引是為了加快range操作)
# HELP etcd_server_slow_read_indexes_total The total number of pending read indexes not in sync with leader's or timed out read index requests.
# TYPE etcd_server_slow_read_indexes_total counter
讀取到過期索引或者讀取超時的次數(shù)
# HELP etcd_server_quota_backend_bytes Current backend storage quota size in bytes.
# TYPE etcd_server_quota_backend_bytes gauge
當(dāng)前后端的存儲quota(db大小的上限)
通過參數(shù)quota-backend-bytes調(diào)整大小,默認(rèn)2G,官方建議不超過8G
# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database physically allocated in bytes.
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd 分配的db大小(使用量大小+空閑大小)
# HELP etcd_mvcc_db_total_size_in_use_in_bytes Total size of the underlying database logically in use in bytes.
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd db的使用量大小
# HELP etcd_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_mvcc_range_total counter
etcd執(zhí)行range的數(shù)量
# HELP etcd_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_mvcc_put_total counter
etcd執(zhí)行put的數(shù)量
# HELP etcd_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_mvcc_txn_total counter
etcd實例執(zhí)行事務(wù)的數(shù)量
# HELP etcd_mvcc_delete_total Total number of deletes seen by this member.
# TYPE etcd_mvcc_delete_total counter
etcd實例執(zhí)行delete操作的數(shù)量
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
etcd cpu使用量
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
etcd 內(nèi)存使用量
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
etcd 打開的fd數(shù)目
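結(jié)合上面的指標(biāo),etcd 可以先重點盯這幾個查詢(示例,閾值只是經(jīng)驗參考):
## wal fsync 的 P99 耗時,經(jīng)驗上持續(xù)高于 10ms 說明磁盤可能成為瓶頸
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))
## 最近一小時 leader 切換次數(shù),頻繁切換通常意味著網(wǎng)絡(luò)或磁盤抖動
increase(etcd_server_leader_changes_seen_total[1h])
## db 使用量占 quota 的比例,接近 1 時需要考慮壓縮/碎片整理或調(diào)大 quota
etcd_mvcc_db_total_size_in_use_in_bytes / etcd_server_quota_backend_bytes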
(六)K8s-kubelet組件監(jiān)控
接下來我們要監(jiān)控的是k8s第二個模塊slave-node的組件。kubelet監(jiān)聽兩個固定端口,一個是10248,一個是10250,可以用ss -ntlp | grep kubelet命令查看。
10248是健康檢測的端口,用于檢測節(jié)點狀態(tài),可以使用curl localhost:10248/healthz查看:
[root@k8s-master ~]# curl localhost:10248/healthz
ok
10250是kubelet默認(rèn)的端口,/metrics接口就在這個端口下,但是你不能直接通過這個端口獲取metrics數(shù)據(jù),因為它有認(rèn)證機(jī)制。這一期我們還是使用Prometheus-agent的方式來采集metrics數(shù)據(jù),下一期再講通過認(rèn)證、使用daemonset的方式部署categraf來采集。
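可以用前面為 categraf 這個 ServiceAccount 簽發(fā)的 token 先手工驗證一下 10250 的認(rèn)證(示例,token 的獲取方式見上文 apiserver 部分):
## 不帶認(rèn)證直接訪問會被拒絕
curl -s -k https://localhost:10250/metrics | head -n 3
## 帶上有 nodes/metrics、nodes/proxy 權(quán)限的 ServiceAccount token 就可以拿到數(shù)據(jù)
TOKEN=$(kubectl create token categraf -n flashcat)
curl -s -k -H "Authorization: Bearer $TOKEN" https://localhost:10250/metrics | head -n 5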
(1)配置Prometheus-agent configmap配置文件
跟上面的操作一樣,在configmap下面添加名為kubelet的job字段即可,然后重新加載configmap的yaml文件:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
- job_name: 'etcd'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;etcd;http
## 以下為添加的kubelete內(nèi)容
- job_name: 'kubelet'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-kubelet;https
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)配置kubelet的service和endpoints
跟之前一樣,我們要先查看集群里有沒有kubelet的endpoints,如果沒有就要添加:
[root@k8s-master ~]# kubectl get endpoints -A | grep kubelet
##如果沒有就添加
vim kubelet-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: kubelet
name: kube-kubelet
namespace: kube-system
spec:
clusterIP: None
ports:
- name: https
port: 10250
protocol: TCP
targetPort: 10250
sessionAffinity: None
type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
labels:
k8s-app: kubelet
name: kube-kubelet
namespace: kube-system
subsets:
- addresses:
- ip: 192.168.120.101
- ip: 192.168.120.102 ##這是我們自定義的Endpoints,addresses里填的是需要監(jiān)控的k8s節(jié)點IP,這里我寫的是master和node的IP地址
ports:
- name: https
port: 10250
protocol: TCP
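Endpoints 里的 addresses 需要列出每個要監(jiān)控節(jié)點的 IP,可以用下面的命令確認(rèn)后再填(示例):
## 查看各節(jié)點的 InternalIP,填到上面 Endpoints 的 addresses 里
kubectl get nodes -o wide
## 或者只取 IP 列表
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
## 填好之后 apply
kubectl apply -f kubelet-service.yaml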
(3)測試指標(biāo)
然后打開夜鶯的web頁面,查看指標(biāo)是否采集上來。測試指標(biāo):kubelet_running_pods
導(dǎo)入儀表盤,儀表盤地址:categraf/dashboard-by-ident.json at main · · GitHub
kubelet相關(guān)指標(biāo)含義:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
gc的時間統(tǒng)計(summary指標(biāo))
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
goroutine 數(shù)量
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
線程數(shù)量
# HELP kubelet_cgroup_manager_duration_seconds [ALPHA] Duration in seconds for cgroup manager operations. Broken down by method.
# TYPE kubelet_cgroup_manager_duration_seconds histogram
操作cgroup的時長分布,按照操作類型統(tǒng)計
# HELP kubelet_containers_per_pod_count [ALPHA] The number of containers per pod.
# TYPE kubelet_containers_per_pod_count histogram
pod中container數(shù)量的統(tǒng)計(spec.containers的數(shù)量)
# HELP kubelet_docker_operations_duration_seconds [ALPHA] Latency in seconds of Docker operations. Broken down by operation type.
# TYPE kubelet_docker_operations_duration_seconds histogram
操作docker的時長分布,按照操作類型統(tǒng)計
# HELP kubelet_docker_operations_errors_total [ALPHA] Cumulative number of Docker operation errors by operation type.
# TYPE kubelet_docker_operations_errors_total counter
操作docker的錯誤累計次數(shù),按照操作類型統(tǒng)計
# HELP kubelet_docker_operations_timeout_total [ALPHA] Cumulative number of Docker operation timeout by operation type.
# TYPE kubelet_docker_operations_timeout_total counter
操作docker的超時統(tǒng)計,按照操作類型統(tǒng)計
# HELP kubelet_docker_operations_total [ALPHA] Cumulative number of Docker operations by operation type.
# TYPE kubelet_docker_operations_total counter
操作docker的累計次數(shù),按照操作類型統(tǒng)計
# HELP kubelet_eviction_stats_age_seconds [ALPHA] Time between when stats are collected, and when pod is evicted based on those stats by eviction signal
# TYPE kubelet_eviction_stats_age_seconds histogram
驅(qū)逐操作的時間分布,按照驅(qū)逐信號(原因)分類統(tǒng)計
# HELP kubelet_evictions [ALPHA] Cumulative number of pod evictions by eviction signal
# TYPE kubelet_evictions counter
驅(qū)逐次數(shù)統(tǒng)計,按照驅(qū)逐信號(原因)統(tǒng)計
# HELP kubelet_http_inflight_requests [ALPHA] Number of the inflight http requests
# TYPE kubelet_http_inflight_requests gauge
請求kubelet的inflight請求數(shù),按照method path server_type統(tǒng)計, 注意與每秒的request數(shù)區(qū)別開
# HELP kubelet_http_requests_duration_seconds [ALPHA] Duration in seconds to serve http requests
# TYPE kubelet_http_requests_duration_seconds histogram
請求kubelet的請求時間統(tǒng)計, 按照method path server_type統(tǒng)計
# HELP kubelet_http_requests_total [ALPHA] Number of the http requests received since the server started
# TYPE kubelet_http_requests_total counter
請求kubelet的請求數(shù)統(tǒng)計,按照method path server_type統(tǒng)計
# HELP kubelet_managed_ephemeral_containers [ALPHA] Current number of ephemeral containers in pods managed by this kubelet. Ephemeral containers will be ignored if disabled by the EphemeralContainers feature gate, and this number will be 0.
# TYPE kubelet_managed_ephemeral_containers gauge
當(dāng)前kubelet管理的臨時容器數(shù)量,如果 --feature-gates=EphemeralContainers=false,則一直為0
# HELP kubelet_network_plugin_operations_duration_seconds [ALPHA] Latency in seconds of network plugin operations. Broken down by operation type.
# TYPE kubelet_network_plugin_operations_duration_seconds histogram
網(wǎng)絡(luò)插件的操作耗時分布,按照操作類型(operation_type)統(tǒng)計
# HELP kubelet_network_plugin_operations_errors_total [ALPHA] Cumulative number of network plugin operation errors by operation type.
# TYPE kubelet_network_plugin_operations_errors_total counter
網(wǎng)絡(luò)插件累計操作錯誤數(shù)統(tǒng)計,按照操作類型(operation_type)統(tǒng)計
# HELP kubelet_network_plugin_operations_total [ALPHA] Cumulative number of network plugin operations by operation type.
# TYPE kubelet_network_plugin_operations_total counter
網(wǎng)絡(luò)插件累計操作數(shù)統(tǒng)計,按照操作類型(operation_type)統(tǒng)計
# HELP kubelet_node_name [ALPHA] The node's name. The count is always 1.
# TYPE kubelet_node_name gauge
node name
# HELP kubelet_pleg_discard_events [ALPHA] The number of discard events in PLEG.
# TYPE kubelet_pleg_discard_events counter
PLEG(pod lifecycle event generator) 丟棄的event數(shù)統(tǒng)計
# HELP kubelet_pleg_last_seen_seconds [ALPHA] Timestamp in seconds when PLEG was last seen active.
# TYPE kubelet_pleg_last_seen_seconds gauge
PLEG上次活躍的時間戳
# HELP kubelet_pleg_relist_duration_seconds [ALPHA] Duration in seconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_duration_seconds histogram
PLEG relist pod時間分布
# HELP kubelet_pleg_relist_interval_seconds [ALPHA] Interval in seconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_seconds histogram
PLEG relist 間隔時間分布
# HELP kubelet_pod_start_duration_seconds [ALPHA] Duration in seconds for a single pod to go from pending to running.
# TYPE kubelet_pod_start_duration_seconds histogram
pod啟動時間(從pending到running)分布,即從kubelet watch到pod開始、到pod中container都running為止(watch各種source channel的pod變更)
# HELP kubelet_pod_worker_duration_seconds [ALPHA] Duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync
# TYPE kubelet_pod_worker_duration_seconds histogram
pod狀態(tài)變化的時間分布, 按照操作類型(create update sync)統(tǒng)計, worker就是kubelet中處理一個pod的邏輯工作單位
# HELP kubelet_pod_worker_start_duration_seconds [ALPHA] Duration in seconds from seeing a pod to starting a worker.
# TYPE kubelet_pod_worker_start_duration_seconds histogram
kubelet watch到pod到worker啟動的時間分布
# HELP kubelet_run_podsandbox_duration_seconds [ALPHA] Duration in seconds of the run_podsandbox operations. Broken down by RuntimeClass.Handler.
# TYPE kubelet_run_podsandbox_duration_seconds histogram
啟動sandbox的時間分布
# HELP kubelet_run_podsandbox_errors_total [ALPHA] Cumulative number of the run_podsandbox operation errors by RuntimeClass.Handler.
# TYPE kubelet_run_podsandbox_errors_total counter
啟動sanbox出現(xiàn)error的總數(shù)
# HELP kubelet_running_containers [ALPHA] Number of containers currently running
# TYPE kubelet_running_containers gauge
當(dāng)前containers運行狀態(tài)的統(tǒng)計, 按照container狀態(tài)統(tǒng)計,created running exited
# HELP kubelet_running_pods [ALPHA] Number of pods that have a running pod sandbox
# TYPE kubelet_running_pods gauge
當(dāng)前處于running狀態(tài)pod數(shù)量
# HELP kubelet_runtime_operations_duration_seconds [ALPHA] Duration in seconds of runtime operations. Broken down by operation type.
# TYPE kubelet_runtime_operations_duration_seconds histogram
容器運行時的操作耗時(container在create list exec remove stop等的耗時)
# HELP kubelet_runtime_operations_errors_total [ALPHA] Cumulative number of runtime operation errors by operation type.
# TYPE kubelet_runtime_operations_errors_total counter
容器運行時的操作錯誤數(shù)統(tǒng)計(按操作類型統(tǒng)計)
# HELP kubelet_runtime_operations_total [ALPHA] Cumulative number of runtime operations by operation type.
# TYPE kubelet_runtime_operations_total counter
容器運行時的操作總數(shù)統(tǒng)計(按操作類型統(tǒng)計)
# HELP kubelet_started_containers_errors_total [ALPHA] Cumulative number of errors when starting containers
# TYPE kubelet_started_containers_errors_total counter
kubelet啟動容器錯誤總數(shù)統(tǒng)計(按code和container_type統(tǒng)計)
code包括ErrImagePull ErrImageInspect ErrImagePull ErrRegistryUnavailable ErrInvalidImageName等
container_type一般為"container" "podsandbox"
# HELP kubelet_started_containers_total [ALPHA] Cumulative number of containers started
# TYPE kubelet_started_containers_total counter
kubelet啟動容器總數(shù)
# HELP kubelet_started_pods_errors_total [ALPHA] Cumulative number of errors when starting pods
# TYPE kubelet_started_pods_errors_total counter
kubelet啟動pod遇到的錯誤總數(shù)(只有創(chuàng)建sandbox遇到錯誤才會統(tǒng)計)
# HELP kubelet_started_pods_total [ALPHA] Cumulative number of pods started
# TYPE kubelet_started_pods_total counter
kubelet啟動的pod總數(shù)
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
統(tǒng)計cpu使用率
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
允許進(jìn)程打開的最大fd數(shù)
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
當(dāng)前打開的fd數(shù)量
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
進(jìn)程駐留內(nèi)存大小
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
進(jìn)程啟動時間
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求apiserver的耗時統(tǒng)計(按照url和請求類型統(tǒng)計verb)
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
請求apiserver的總次數(shù)(按照返回碼code和請求類型method統(tǒng)計)
# HELP storage_operation_duration_seconds [ALPHA] Storage operation duration
# TYPE storage_operation_duration_seconds histogram
存儲操作耗時(按照存儲plugin(configmap emptydir hostpath 等 )和operation_name分類統(tǒng)計)
# HELP volume_manager_total_volumes [ALPHA] Number of volumes in Volume Manager
# TYPE volume_manager_total_volumes gauge
本機(jī)掛載的volume數(shù)量統(tǒng)計(按照plugin_name和state統(tǒng)計
plugin_name包括"host-path" "empty-dir" "configmap" "projected")
state(desired_state_of_world期狀態(tài)/actual_state_of_world實際狀態(tài))
cAdvisor指標(biāo)梳理
# HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
# TYPE container_cpu_cfs_periods_total counter
cfs時間片總數(shù), 完全公平調(diào)度的時間片總數(shù)(分配到cpu的時間片數(shù))
# HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals.
# TYPE container_cpu_cfs_throttled_periods_total counter
容器被throttle的時間片總數(shù)
# HELP container_cpu_cfs_throttled_seconds_total Total time duration the container has been throttled.
# TYPE container_cpu_cfs_throttled_seconds_total counter
容器被throttle的時間
# HELP container_file_descriptors Number of open file descriptors for the container.
# TYPE container_file_descriptors gauge
容器打開的fd數(shù)
# HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed
# TYPE container_memory_usage_bytes gauge
容器內(nèi)存使用量,單位byte
# HELP container_network_receive_bytes_total Cumulative count of bytes received
# TYPE container_network_receive_bytes_total counter
容器入方向的流量
# HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted
# TYPE container_network_transmit_bytes_total counter
容器出方向的流量
# HELP container_spec_cpu_period CPU period of the container.
# TYPE container_spec_cpu_period gauge
容器的cpu調(diào)度單位時間
# HELP container_spec_cpu_quota CPU quota of the container.
# TYPE container_spec_cpu_quota gauge
容器的cpu規(guī)格 ,除以單位調(diào)度時間可以計算核數(shù)
# HELP container_spec_memory_limit_bytes Memory limit for the container.
# TYPE container_spec_memory_limit_bytes gauge
容器的內(nèi)存規(guī)格,單位byte
# HELP container_threads Number of threads running inside the container
# TYPE container_threads gauge
容器當(dāng)前的線程數(shù)
# HELP container_threads_max Maximum number of threads allowed inside the container, infinity if value is zero
# TYPE container_threads_max gauge
允許容器啟動的最大線程數(shù)
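結(jié)合上面 kubelet 和 cAdvisor 的指標(biāo),可以先試試這樣幾個查詢(示例;cAdvisor 指標(biāo)來自 kubelet 的 /metrics/cadvisor 路徑,需要確認(rèn)你的采集任務(wù)確實抓到了這些指標(biāo)):
## 每個節(jié)點上處于 running 狀態(tài)的 pod 數(shù)
kubelet_running_pods
## pod 從 pending 到 running 的 P95 啟動耗時
histogram_quantile(0.95, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))
## 容器內(nèi)存使用量相對內(nèi)存 limit 的比例(只統(tǒng)計設(shè)置了 limit 的容器)
container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)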
(七)K8s-KubeProxy組件監(jiān)控
KubeProxy 主要負(fù)責(zé)節(jié)點的網(wǎng)絡(luò)管理,它在每個節(jié)點上都會存在,通過10249端口暴露監(jiān)控指標(biāo)。
這里的指標(biāo)采集我們也用上面的方法,使用Prometheus-agent的方式。
(1)配置Prometheus-agent configmap配置文件
在之前的configmap的yaml文件中添加名為kube-proxy的job模塊字段,添加完記得重新加載yaml文件和Prometheus-agent的pod:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-agent-conf
labels:
name: prometheus-agent-conf
namespace: flashcat
data:
prometheus.yml: |-
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'apiserver'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'controller-manager'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-controller-manager;https-metrics
- job_name: 'scheduler'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-scheduler;https
- job_name: 'etcd'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;etcd;http
- job_name: 'kubelet'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-kubelet;https
##這里是添加的模塊
- job_name: 'kube-proxy'
kubernetes_sd_configs:
- role: endpoints
scheme: http
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: kube-system;kube-proxy;http
remote_write:
- url: 'http://192.168.120.17:17000/prometheus/v1/write'
(2)配置kube-proxy的endpoints
跟之前一樣,先查看有沒有kube-proxy的endpoints,如果沒有就添加:
[root@k8s-master ~]# kubectl get endpoints -A | grep kube-pro
## 如果沒有 添加service
vim kube-proxy-service.yaml
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: proxy
name: kube-proxy
namespace: kube-system
spec:
clusterIP: None
selector:
k8s-app: kube-proxy
ports:
- name: http
port: 10249
protocol: TCP
targetPort: 10249
sessionAffinity: None
type: ClusterIP
(3)更改kube-proxy的metricsBindAddress
查看 kube-proxy 的10249端口是否綁定到127.0.0.1了,如果是,就修改成0.0.0.0,通過kubectl edit cm -n kube-system kube-proxy修改metricsBindAddress即可:
[root@k8s-master ~]# kubectl edit cm -n kube-system kube-proxy
......
......
......
kind: KubeProxyConfiguration
metricsBindAddress: "0.0.0.0" ## 這里修改為0.0.0.0 即可
mode: ""
nodePortAddresses: null
oomScoreAdj: null
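修改 ConfigMap 后,kube-proxy 不會自動加載新配置,需要重啟它的 DaemonSet Pod,然后再驗證 10249 端口(示例):
## 重啟 kube-proxy 的 DaemonSet,讓新的 metricsBindAddress 生效
kubectl rollout restart daemonset kube-proxy -n kube-system
## 在任意節(jié)點上驗證 metrics 端口已經(jīng)可以訪問
curl -s http://localhost:10249/metrics | head -n 5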
(4)指標(biāo)測試
在夜鶯的web頁面輸入指標(biāo)測試:kubeproxy_network_programming_duration_seconds_bucket
導(dǎo)入監(jiān)控大盤,儀表盤json文件:https://github.com/flin/inputs/kube_proxy/dashboard-by-ident.json
kube-proxy關(guān)鍵指標(biāo)含義:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
gc時間
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
goroutine數(shù)量
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
線程數(shù)量
# HELP kubeproxy_network_programming_duration_seconds [ALPHA] In Cluster Network Programming Latency in seconds
# TYPE kubeproxy_network_programming_duration_seconds histogram
service或者pod發(fā)生變化到kube-proxy規(guī)則同步完成的時間。該指標(biāo)含義較復(fù)雜,參照https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md
# HELP kubeproxy_sync_proxy_rules_duration_seconds [ALPHA] SyncProxyRules latency in seconds
# TYPE kubeproxy_sync_proxy_rules_duration_seconds histogram
規(guī)則同步耗時
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_pending [ALPHA] Pending proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_pending gauge
endpoint 發(fā)生變化后規(guī)則同步pending的次數(shù)
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_total [ALPHA] Cumulative proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_total counter
endpoint 發(fā)生變化后規(guī)則同步的總次數(shù)
# HELP kubeproxy_sync_proxy_rules_iptables_restore_failures_total [ALPHA] Cumulative proxy iptables restore failures
# TYPE kubeproxy_sync_proxy_rules_iptables_restore_failures_total counter
本機(jī)上 iptables restore 失敗的總次數(shù)
# HELP kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds [ALPHA] The last time a sync of proxy rules was queued
# TYPE kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds gauge
最近一次規(guī)則同步的請求時間戳,如果比下一個指標(biāo) kubeproxy_sync_proxy_rules_last_timestamp_seconds 大很多,那說明同步 hung 住了
# HELP kubeproxy_sync_proxy_rules_last_timestamp_seconds [ALPHA] The last time proxy rules were successfully synced
# TYPE kubeproxy_sync_proxy_rules_last_timestamp_seconds gauge
最近一次規(guī)則同步的完成時間戳
# HELP kubeproxy_sync_proxy_rules_service_changes_pending [ALPHA] Pending proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_pending gauge
service變化引起的規(guī)則同步pending數(shù)量
# HELP kubeproxy_sync_proxy_rules_service_changes_total [ALPHA] Cumulative proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_total counter
service變化引起的規(guī)則同步總數(shù)
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
利用這個指標(biāo)統(tǒng)計cpu使用率
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
進(jìn)程可以打開的最大fd數(shù)
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
進(jìn)程當(dāng)前打開的fd數(shù)
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
統(tǒng)計內(nèi)存使用大小
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
進(jìn)程啟動時間戳
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
請求 apiserver 的耗時(按照url和verb統(tǒng)計)
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
請求 apiserver 的總數(shù)(按照code method host統(tǒng)計)
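結(jié)合上面的指標(biāo),kube-proxy 可以先關(guān)注這兩個查詢(示例):
## 規(guī)則同步(SyncProxyRules)的 P99 耗時,按實例匯聚
histogram_quantile(0.99, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (instance, le))
## 最近 10 分鐘 iptables restore 失敗次數(shù),持續(xù)增長需要排查
increase(kubeproxy_sync_proxy_rules_iptables_restore_failures_total[10m])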
最后的最后
夜鶯監(jiān)控k8s的方法,夜鶯的官網(wǎng)也做了合集專欄,有興趣的伙伴可以去看看Kubernetes監(jiān)控專欄,無論是指標(biāo)還是原理都做了解釋:初識Kubernetes -(flashcat.cloud)。如果在部署中遇到問題,歡迎在本文章留言,24小時內(nèi)必回復(fù)。
看完這一期肯定會有小伙伴有疑問:我的業(yè)務(wù)都跑在pod上面,光監(jiān)控這些組件沒啥大用啊,我想知道總共有幾個 Namespace,有幾個 Service、Deployment、Statefulset,某個 Deployment 期望有幾個 Pod 要運行、實際有幾個 Pod 在運行,這些既有的指標(biāo)就無法回答了。當(dāng)然這一點肯定是重中之重,這個問題我們下一期詳細(xì)講解:使用 kube-state-metrics(俗稱KSM)監(jiān)控 Kubernetes 對象,監(jiān)聽各個Kubernetes對象的狀態(tài),生成指標(biāo)暴露出來讓我們查看。下一期還會講用daemonset的最佳實踐方案來采集監(jiān)控。