
Troubleshooting PLEG Issues in a K8S Cluster


1. Background

Troubleshooting a k8s cluster is genuinely tedious.

Today a colleague came to me: nodes were reporting PLEG is not healthy, and some of them had gone NotReady. What causes this?
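Before reading any source code, it is worth confirming the symptom from the cluster side. A quick check, with the node name as a placeholder:

# List nodes and find the NotReady ones
kubectl get nodes
# The PLEG message shows up under Conditions in the node description
kubectl describe node <node-name>
# On the node itself, grep the kubelet journal for PLEG errors
journalctl -u kubelet --since "10 minutes ago" | grep -i pleg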

2. Kubernetes Source Code Analysis

PLEG is not healthy is itself a frequently seen error.

PLEG stands for Pod Lifecycle Event Generator.

The PLEG code lives in the kubelet. Here is the comment from the kubelet source:

// GenericPLEG is an extremely simple generic PLEG that relies solely on
// periodic listing to discover container changes. It should be used
// as temporary replacement for container runtimes do not support a proper
// event generator yet.
//
// Note that GenericPLEG assumes that a container would not be created,
// terminated, and garbage collected within one relist period. If such an
// incident happens, GenenricPLEG would miss all events regarding this
// container. In the case of relisting failure, the window may become longer.
// Note that this assumption is not unique -- many kubelet internal components
// rely on terminated containers as tombstones for bookkeeping purposes. The
// garbage collector is implemented to work with such situations. However, to
// guarantee that kubelet can handle missing container events, it is
// recommended to set the relist period short and have an auxiliary, longer
// periodic sync in kubelet as the safety net.
type GenericPLEG struct {
	// The period for relisting.
	relistPeriod time.Duration
	// The container runtime.
	runtime kubecontainer.Runtime
	// The channel from which the subscriber listens events.
	eventChannel chan *PodLifecycleEvent
	// The internal cache for pod/container information.
	podRecords podRecords
	// Time of the last relisting.
	relistTime atomic.Value
	// Cache for storing the runtime states required for syncing pods.
	cache kubecontainer.Cache
	// For testability.
	clock clock.Clock
	// Pods that failed to have their status retrieved during a relist. These pods will be
	// retried during the next relisting.
	podsToReinspect map[types.UID]*kubecontainer.Pod
}

In other words, the kubelet periodically fetches the pod list and records the result.

On startup it spawns a goroutine that invokes the relist function on a fixed period:

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

The key code inside the relist function:

	// Get all the pods.
	podList, err := g.runtime.GetPods(true)
	if err != nil {
		klog.ErrorS(err, "GenericPLEG: Unable to retrieve pods")
		return
	}

	g.updateRelistTime(timestamp)

So on each relist the kubelet calls the CRI over docker.sock or containerd.sock to fetch the pod list, and on success updates the relist timestamp.
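You can exercise this same call path by hand with crictl; a sketch, assuming a containerd socket (point --runtime-endpoint at your actual CRI socket):

# List pod sandboxes, which is exactly what relist asks the runtime for
crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods
# List all containers, running and exited
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a

If these commands hang or error out while the kubelet is reporting PLEG problems, the fault is on the runtime side rather than in the kubelet.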

Now look at the Healthy function, the handler invoked by the periodic health check:

// Healthy check if PLEG work properly.
// relistThreshold is the maximum interval between two relist.
func (g *GenericPLEG) Healthy() (bool, error) {
	relistTime := g.getRelistTime()
	if relistTime.IsZero() {
		return false, fmt.Errorf("pleg has yet to be successful")
	}
	// Expose as metric so you can alert on `time()-pleg_last_seen_seconds > nn`
	metrics.PLEGLastSeen.Set(float64(relistTime.Unix()))
	elapsed := g.clock.Since(relistTime)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

他是用當(dāng)前時間 減去 relist更新時間,得到的時間如果超過relistThreshold就代表可能不健康

	// The threshold needs to be greater than the relisting period + the
	// relisting time, which can vary significantly. Set a conservative
	// threshold to avoid flipping between healthy and unhealthy.
	relistThreshold = 3 * time.Minute
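As the comment in Healthy suggests, relistTime is exported as a gauge, so you can watch it from outside and alert when time() - kubelet_pleg_last_seen_seconds approaches the 3-minute threshold (the metric name here is taken from recent kubelet versions; verify it on your release):

# Pull one node's kubelet metrics through the API server proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep pleg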

進(jìn)一步思考這個問題,我們就把問題鎖定在了CRI 容器運(yùn)行時的地方

3. Pinpointing the Error

The root cause is a container runtime timeout, meaning dockerd or containerd is in trouble. On the affected node, the kubelet log was full of CRI timeouts and runtime-unavailable errors:

Nov 02 13:41:43 app04 kubelet[8411]: E1102 13:41:43.111882    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.036729    8411 kubelet.go:2396] "Container runtime not ready" runtimeReady="RuntimeReady=false reason:DockerDaemonNotReady messag
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.112993    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113027    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113041    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114281    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114319    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114335    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.344912    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345214    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345501    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.630715    8411 kubelet.go:2040] "Skipping pod synchronization" err="[container runtime is down, PLEG is not healthy: pleg was las
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115226    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115265    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115280    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116608    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116647    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116667    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081612    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081611    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082134    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082201    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082378    8411 remote_runtime.go:6
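With the kubelet ruled out, inspect the runtime daemon on the node itself; a sketch assuming a systemd-managed dockerd (substitute containerd.service on a containerd node):

# Is the daemon alive, and what has it been logging?
systemctl status docker
journalctl -u docker --since "1 hour ago" | tail -n 50
# How many file descriptors is dockerd holding, and what is its limit?
pidof dockerd | xargs -I{} sh -c 'ls /proc/{}/fd | wc -l; grep "open files" /proc/{}/limits'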

The remedy is to restart the runtime, or to dig into containerd/dockerd directly. The dockerd journal on that node showed:

Nov 02 12:58:45 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:46 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:47 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:48 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:49 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:50 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:51 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:52 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:53 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:54 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:55 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:56 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:57 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:58 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:59 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:00 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:01 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:02 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:03 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:04 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s

發(fā)現(xiàn)是CRI 服務(wù)端接受太多套接字,導(dǎo)致accept 失敗了,可以適當(dāng)調(diào)大ulimit文章來源地址http://www.zghlxwxcb.cn/news/detail-787950.html
