
k8s eviction mechanism source code analysis


How it works

1. Eviction overview
The kubelet periodically monitors node resources such as memory, disk space, and filesystem inodes. When a resource crosses the configured threshold, the kubelet first tries to reclaim node-level resources, for example deleting unused images when disk space runs low. If usage is still above the threshold, it starts evicting pods to reclaim resources.

2. Eviction signals
The kubelet defines the following eviction signals; when a signal reaches its eviction threshold, the eviction flow is triggered:
memory.available: available memory on the node
nodefs.available / nodefs.inodesFree: free space / free inodes on the node's main filesystem
imagefs.available / imagefs.inodesFree: free space / free inodes on the filesystem used for container images and writable layers
pid.available: available process IDs on the node
3. Eviction thresholds
An eviction threshold specifies at what value of an eviction signal the eviction flow is triggered. The format is [eviction-signal][operator][quantity], where eviction-signal is one of the signals defined above, operator is a relational operator such as <, and quantity is the threshold value, given either as an absolute quantity or as a percentage.
For example, on a node with 10Gi of memory, to trigger eviction when available memory drops below 1Gi, the threshold can be written as memory.available<10% or memory.available<1Gi.
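
To make the format concrete, here is a minimal sketch of how a threshold statement such as memory.available<1Gi or memory.available<10% could be parsed. It is not the kubelet's actual ParseThresholdConfig; the helper parseThreshold and its signature are invented for illustration, only the < operator is handled, and percentages are resolved against a capacity passed in by the caller.

package main

import (
	"fmt"
	"strconv"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

// parseThreshold parses "signal<quantity" and returns the signal name and the
// absolute threshold in bytes. capacity is the node's total amount of the
// resource and is only used for percentage thresholds.
func parseThreshold(statement string, capacity int64) (string, int64, error) {
	parts := strings.SplitN(statement, "<", 2)
	if len(parts) != 2 {
		return "", 0, fmt.Errorf("unsupported statement %q", statement)
	}
	signal, quantity := parts[0], parts[1]

	if strings.HasSuffix(quantity, "%") {
		// percentage threshold, e.g. "10%" of the node's capacity
		percent, err := strconv.ParseFloat(strings.TrimSuffix(quantity, "%"), 64)
		if err != nil {
			return "", 0, err
		}
		return signal, int64(float64(capacity) * percent / 100), nil
	}

	// absolute threshold, e.g. "1Gi"
	q, err := resource.ParseQuantity(quantity)
	if err != nil {
		return "", 0, err
	}
	return signal, q.Value(), nil
}

func main() {
	capacity := int64(10 << 30) // a node with 10Gi of memory
	for _, s := range []string{"memory.available<1Gi", "memory.available<10%"} {
		sig, v, _ := parseThreshold(s, capacity)
		fmt.Printf("%s -> signal=%s, threshold=%d bytes\n", s, sig, v)
	}
}

On a 10Gi node both statements resolve to the same 1Gi threshold, which is exactly the equivalence described in the example above.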

a. Soft eviction thresholds
A soft eviction threshold is paired with a grace period; the eviction flow runs only after the threshold has been exceeded for longer than the grace period.
There are three related parameters:
--eviction-soft: the set of soft eviction thresholds, e.g. memory.available<1.5Gi
--eviction-soft-grace-period: the grace period for each signal, e.g. memory.available=1m30s
--eviction-max-pod-grace-period: the maximum grace period, in seconds, granted to a pod that is terminated because a soft eviction threshold was met. Do not confuse it with --eviction-soft-grace-period:
the latter is how long the threshold must stay exceeded before eviction starts, while --eviction-max-pod-grace-period is how long an evicted pod is given to shut down.

b. Hard eviction thresholds
A hard eviction threshold has no grace period: as soon as it is reached, the eviction flow runs. The related parameter is:
--eviction-hard: the set of hard eviction thresholds

If --eviction-hard is not specified, the following defaults are used:

//pkg/kubelet/apis/config/v1beta1/default_linux.go
// DefaultEvictionHard includes default options for hard eviction.
var DefaultEvictionHard = map[string]string{
	"memory.available":  "100Mi",
	"nodefs.available":  "10%",
	"nodefs.inodesFree": "5%",
	"imagefs.available": "15%",
}

All of these eviction parameters can also be set in the KubeletConfiguration file; you only specify the eviction signal and its value, and the operator defaults to less-than. See the official documentation for details.

4. Node conditions
When a soft eviction threshold is reached (without waiting for the grace period) or a hard eviction threshold is reached, the kubelet reports node conditions to the kube-apiserver to reflect the pressure on the node.

The mapping between eviction signals and node conditions is as follows:
memory.available -> MemoryPressure
nodefs.available, nodefs.inodesFree, imagefs.available, imagefs.inodesFree -> DiskPressure
pid.available -> PIDPressure

In some cases the node condition may oscillate around a soft threshold, flipping between healthy and under pressure, which can lead to bad eviction decisions. To prevent this, the parameter --eviction-pressure-transition-period controls how long the kubelet must wait before transitioning a node condition out of a pressure state; the default is 5 minutes.

5. Selecting pods for eviction
If the node is still above the threshold after the kubelet has reclaimed node-level resources, it must evict user-created pods. The factors that influence which pods are selected are:
a. whether the pod's resource usage exceeds its requests
b. the pod's priority
c. the pod's resource usage relative to its requests

Based on these three factors, the kubelet ranks pods and evicts them in the following order (a simplified sketch of the ranking is shown after this list):
a. BestEffort and Burstable pods whose resource usage exceeds their requests, ordered by pod priority and by how far usage exceeds the requests
b. Guaranteed and Burstable pods whose resource usage is below their requests, ordered by pod priority
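
The ordering above is essentially a lexicographic sort over three comparators. The sketch below illustrates that idea for the memory signal; it is a simplified stand-in, not the kubelet's rankMemoryPressure/multiSorter implementation, and the podInfo struct and its fields are invented for illustration.

package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical, flattened view of what the kubelet knows about a
// pod when ranking it for memory eviction.
type podInfo struct {
	name         string
	usageBytes   int64 // current memory usage
	requestBytes int64 // memory request (0 for BestEffort pods)
	priority     int32
}

func exceedsRequest(p podInfo) bool { return p.usageBytes > p.requestBytes }

// rankForMemoryEviction orders pods so that the ones listed first are evicted first:
// 1) pods using more memory than they requested come before pods within their request,
// 2) then lower priority before higher priority,
// 3) then larger usage-over-request before smaller.
func rankForMemoryEviction(pods []podInfo) {
	sort.SliceStable(pods, func(i, j int) bool {
		pi, pj := pods[i], pods[j]
		if exceedsRequest(pi) != exceedsRequest(pj) {
			return exceedsRequest(pi) // over-request pods first
		}
		if pi.priority != pj.priority {
			return pi.priority < pj.priority // lower priority first
		}
		return pi.usageBytes-pi.requestBytes > pj.usageBytes-pj.requestBytes
	})
}

func main() {
	pods := []podInfo{
		{"guaranteed-db", 900 << 20, 1 << 30, 1000},
		{"besteffort-batch", 800 << 20, 0, 0},
		{"burstable-web", 600 << 20, 256 << 20, 100},
	}
	rankForMemoryEviction(pods)
	for _, p := range pods {
		fmt.Println(p.name) // besteffort-batch, burstable-web, guaranteed-db
	}
}

Note that this ranking only orders candidates; whether a pod can actually be evicted (for example, critical pods are skipped) is decided later, as the source analysis below shows.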

6. Minimum reclaim
In some cases evicting a pod reclaims only a small amount of the starved resource, which can cause the kubelet to repeatedly hit the threshold and evict again. To prevent this, the parameter --eviction-minimum-reclaim configures a minimum amount to reclaim per resource: when the kubelet evicts, it reclaims the configured amount on top of what is needed to get back under the eviction threshold. The default is 0.

For example, with the configuration below, when nodefs.available reaches its eviction threshold the kubelet evicts pods to reclaim disk space until 1Gi is available, and then keeps reclaiming until 1.5Gi is available before it stops.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"

7. KernelMemcgNotification
The kubelet runs a goroutine that periodically checks whether eviction thresholds have been reached. If an important pod's memory usage grows very quickly, the kubelet may not notice in time even though the memory threshold has been crossed, and the pod ends up being OOM-killed. If the kubelet could detect the pressure sooner, it could evict other lower-priority pods and free memory for the high-priority pod.
The parameter --kernel-memcg-notification enables the memcg notification mechanism: the kubelet listens with epoll and is notified by the kernel as soon as the threshold is crossed, so it can run the eviction flow right away.
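
On cgroup v1, the kernel mechanism behind this is a memcg threshold notification: an eventfd is registered through the cgroup's cgroup.event_control file and then waited on with epoll. The following standalone sketch illustrates that flow; it is not the kubelet's MemoryThresholdNotifier, the cgroup path and threshold value are made-up examples, and it must run as root on a host using cgroup v1.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical example values: a cgroup v1 memory cgroup and an absolute
	// usage threshold (in bytes) at which we want to be notified.
	cgroupDir := "/sys/fs/cgroup/memory/kubepods"
	threshold := int64(8 << 30)

	// fd of memory.usage_in_bytes, referenced by cgroup.event_control.
	usageFd, err := unix.Open(cgroupDir+"/memory.usage_in_bytes", unix.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(usageFd)

	// eventfd the kernel signals when usage crosses the threshold.
	eventFd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(eventFd)

	// Arm the notification: "<eventfd> <usage fd> <threshold>".
	arm := fmt.Sprintf("%d %d %d", eventFd, usageFd, threshold)
	if err := os.WriteFile(cgroupDir+"/cgroup.event_control", []byte(arm), 0700); err != nil {
		panic(err)
	}

	// Wait for the notification with epoll, roughly what the kubelet's cgroup
	// notifier does before triggering a synchronize() pass.
	epFd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epFd)
	event := unix.EpollEvent{Fd: int32(eventFd), Events: unix.EPOLLIN}
	if err := unix.EpollCtl(epFd, unix.EPOLL_CTL_ADD, eventFd, &event); err != nil {
		panic(err)
	}

	events := make([]unix.EpollEvent, 1)
	buf := make([]byte, 8)
	for {
		n, err := unix.EpollWait(epFd, events, -1)
		if err != nil || n == 0 {
			continue
		}
		unix.Read(eventFd, buf) // drain the eventfd counter
		fmt.Println("memcg threshold crossed; the eviction manager would run synchronize() now")
	}
}

cgroup v2 exposes memory events differently (through memory.events), so this sketch only applies to the cgroup v1 layout.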

Source code analysis

1. Parsing the eviction threshold configuration
The eviction-related settings in the KubeletConfiguration file are ultimately stored in the following structs. Thresholds holds the soft eviction thresholds together with their grace periods, as well as the hard eviction thresholds.

// Config holds information about how eviction is configured.
type Config struct {
	// PressureTransitionPeriod is duration the kubelet has to wait before transitioning out of a pressure condition.
	PressureTransitionPeriod time.Duration
	// Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
	MaxPodGracePeriodSeconds int64
	// Thresholds define the set of conditions monitored to trigger eviction.
	Thresholds []evictionapi.Threshold
	// KernelMemcgNotification if true will integrate with the kernel memcg notification to determine if memory thresholds are crossed.
	KernelMemcgNotification bool
	// PodCgroupRoot is the cgroup which contains all pods.
	PodCgroupRoot string
}

// Threshold defines a metric for when eviction should occur.
type Threshold struct {
	// Signal defines the entity that was measured.
	Signal Signal
	// Operator represents a relationship of a signal to a value.
	Operator ThresholdOperator
	// Value is the threshold the resource is evaluated against.
	Value ThresholdValue
	// GracePeriod represents the amount of time that a threshold must be met before eviction is triggered.
	GracePeriod time.Duration
	// MinReclaim represents the minimum amount of resource to reclaim if the threshold is met.
	MinReclaim *ThresholdValue
}

ParseThresholdConfig is called to parse the user configuration:

//pkg/kubelet/kubelet.go
func NewMainKubelet(...)
	thresholds, err := eviction.ParseThresholdConfig(enforceNodeAllocatable, kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)

	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  kernelMemcgNotification,
		PodCgroupRoot:            kubeDeps.ContainerManager.GetPodCgroupRoot(),
	}

2. Creating the evictionManager
eviction.NewManager is called to create the evictionManager. klet.resourceAnalyzer is used to obtain node and pod statistics, evictionConfig holds the user-configured eviction thresholds, and killPodNow is the function used to kill pods.

//pkg/kubelet/kubelet.go
func NewMainKubelet(...)
	// setup eviction manager
	evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)

	klet.evictionManager = evictionManager
	//add the evictionManager to admitHandlers; when a pod is created, evictionManager.Admit is called and rejects the pod if the node is under resource pressure
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)

3. Starting the evictionManager
The evictionManager's Start function starts eviction management. kl.GetActivePods returns the active pods running on this node, kl.podResourcesAreReclaimed checks whether a pod's resources have been fully released, and evictionMonitoringPeriod is the sleep interval used when no pod was evicted.

// Period for performing eviction monitoring.
// ensure this is kept in sync with internal cadvisor housekeeping.
evictionMonitoringPeriod = time.Second * 10

func initializeRuntimeDependentModules(...)
	// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
	kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)

If --kernel-memcg-notification is specified, notifier-driven (real-time) eviction is started; in either case a goroutine is also created for the polling-based eviction loop.

//pkg/kubelet/eviction/eviction_manager.go
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	thresholdHandler := func(message string) {
		klog.InfoS(message)
		m.synchronize(diskInfoProvider, podFunc)
	}
	//real-time eviction: if --kernel-memcg-notification is specified, the memcg notification mechanism is enabled, so synchronize is called promptly once the threshold is crossed
	if m.config.KernelMemcgNotification {
		for _, threshold := range m.config.Thresholds {
			if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
				notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
				if err != nil {
					klog.InfoS("Eviction manager: failed to create memory threshold notifier", "err", err)
				} else {
					go notifier.Start()
					m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
				}
			}
		}
	}
	
	//polling-based eviction
	// start the eviction manager monitoring
	go func() {
		for {
			//if synchronize returns a non-empty result, pods were evicted; call waitForPodsCleanup to wait for their resources to be released
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.InfoS("Eviction manager: pods evicted, waiting for pod to be cleaned up", "pods", format.Pods(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else { //no pod was evicted, so sleep for monitoringInterval (10s)
				time.Sleep(monitoringInterval)
			}
		}
	}()
}

waitForPodsCleanup blocks until the evicted pods' resources have been reclaimed, or until it times out.

func (m *managerImpl) waitForPodsCleanup(podCleanedUpFunc PodCleanedUpFunc, pods []*v1.Pod) {
	//wait at most 30s (podCleanupTimeout)
	timeout := m.clock.NewTimer(podCleanupTimeout)
	defer timeout.Stop()
	//poll once per second (podCleanupPollFreq)
	ticker := m.clock.NewTicker(podCleanupPollFreq)
	defer ticker.Stop()
	for {
		select {
		case <-timeout.C():
			klog.InfoS("Eviction manager: timed out waiting for pods to be cleaned up", "pods", format.Pods(pods))
			return
		case <-ticker.C():
			for i, pod := range pods {
				//podCleanedUpFunc is pkg/kubelet/kubelet_pods.go:podResourcesAreReclaimed, which reports whether the pod's resources have been reclaimed;
				//if they have not been reclaimed yet it returns false, so break out of the loop and wait for the next tick
				if !podCleanedUpFunc(pod) {
					break
				}
				if i == len(pods)-1 {
					klog.InfoS("Eviction manager: pods successfully cleaned up", "pods", format.Pods(pods))
					return
				}
			}
		}
	}
}

synchronize is the core function of the evictionManager; both the notifier-driven path and the polling path call it.

// synchronize is the main control loop that enforces eviction thresholds.
// Returns the pod that was killed, or nil if no pod was killed.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	// if we have nothing to do, just return
	thresholds := m.config.Thresholds
	//if the configured set of eviction thresholds is empty, return; because of the defaults it will never actually be empty
	if len(thresholds) == 0 && !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
		return nil
	}

	klog.V(3).InfoS("Eviction manager: synchronize housekeeping")
	// build the ranking functions (if not yet known)
	// TODO: have a function in cadvisor that lets us know if global housekeeping has completed
	if m.dedicatedImageFs == nil {
		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
		if ok != nil {
			return nil
		}
		m.dedicatedImageFs = &hasImageFs
		//build the map from eviction signal to ranking function, e.g. SignalMemoryAvailable is ranked with rankMemoryPressure
		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
		//build the map from eviction signal to node-level resource reclaim functions
		m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
	}

	//podFunc() is kl.GetActivePods, which returns the active pods running on this node
	activePods := podFunc()
	updateStats := true
	//call summaryProviderImpl.Get to fetch node and pod statistics
	summary, err := m.summaryProvider.Get(updateStats)
	if err != nil {
		klog.ErrorS(err, "Eviction manager: failed to get summary stats")
		return nil
	}

	//notifier-related refresh, ignore for now
	if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
		m.thresholdsLastUpdated = m.clock.Now()
		for _, notifier := range m.thresholdNotifiers {
			if err := notifier.UpdateThreshold(summary); err != nil {
				klog.InfoS("Eviction manager: failed to update notifier", "notifier", notifier.Description(), "err", err)
			}
		}
	}

	//convert the node and pod statistics in summary into observations, a map from eviction signal to signalObservation;
	//a signalObservation holds the signal's capacity, its available value, and the time the statistics were collected
	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc := makeSignalObservations(summary)
	debugLogObservations("observations", observations)

	//compare the resource data in observations against the configured eviction thresholds and return the thresholds that have been met;
	//e.g. with thresholds memory.available: "500Mi" and nodefs.available: "1Gi", if observations show 400Mi of available memory
	//and 2Gi of available nodefs, only the memory.available threshold is returned
	// determine the set of thresholds met independent of grace period
	thresholds = thresholdsMet(thresholds, observations, false)
	debugLogThresholdsWithObservation("thresholds - ignoring grace period", thresholds, observations)

	//m.thresholdsMet holds the thresholds that were met during the previous sync
	// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
	if len(m.thresholdsMet) > 0 {
		//the previous eviction pass may already have reclaimed resources, so re-check whether the previously met thresholds
		//have dropped back below their limits; if not, merge this round's thresholds with the previous m.thresholdsMet.
		//There is another case: for soft eviction, when the threshold is first met the grace period may not have elapsed yet, so it is saved
		//into m.thresholdsMet; the next time we get here it is merged in again and the flow continues until the grace period is exceeded
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}
	debugLogThresholdsWithObservation("thresholds - reclaim not satisfied", thresholds, observations)

	//record when each eviction signal first crossed its threshold, so that we can later check whether the grace period has elapsed.
	//thresholds holds the signals found over their thresholds in this round; m.thresholdsFirstObservedAt maps each signal to the time
	//it was first observed over its threshold
	// track when a threshold was first observed
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

	//map the met thresholds to the corresponding node conditions, e.g. NodeMemoryPressure when the memory availability threshold is met
	// the set of node conditions that are triggered by currently observed thresholds
	nodeConditions := nodeConditions(thresholds)
	if len(nodeConditions) > 0 {
		klog.V(3).InfoS("Eviction manager: node conditions - observed", "nodeCondition", nodeConditions)
	}

	//record when each node condition was last observed, so we can tell whether PressureTransitionPeriod has elapsed and avoid the condition flapping
	// track when a node condition was last observed
	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

	//a node condition observed within the last PressureTransitionPeriod is kept as true, i.e. the node is still considered under pressure.
	//For example, with a PressureTransitionPeriod of 5 minutes: if available memory crosses the threshold in the first minute, the node condition
	//is set to NodeMemoryPressure; even if available memory rises back above the threshold in the second minute, NodeMemoryPressure is not removed.
	//Only after the 5 minutes have passed, and only if the threshold is no longer met, is NodeMemoryPressure cleared; the condition is held for at least 5 minutes regardless of how memory fluctuates in that window.
	// node conditions report true if it has been observed within the transition period window
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
	if len(nodeConditions) > 0 {
		klog.V(3).InfoS("Eviction manager: node conditions - transition period not met", "nodeCondition", nodeConditions)
	}

	//return the thresholds whose grace period has elapsed; this mainly matters for soft eviction, hard eviction thresholds are always returned
	// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	debugLogThresholdsWithObservation("thresholds - grace periods satisfied", thresholds, observations)

	//save the state needed by the next sync
	// update internal state
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds

	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
	debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)

	m.lastObservations = observations
	m.Unlock()

	// evict pods if there is a resource usage violation from local volume temporary storage
	// If eviction happens in localStorageEviction function, skip the rest of eviction action
	if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
		if evictedPods := m.localStorageEviction(activePods, statsFunc); len(evictedPods) > 0 {
			return evictedPods
		}
	}

	//if empty, no eviction signal is over its threshold, i.e. no resource is starved, so return
	if len(thresholds) == 0 {
		klog.V(3).InfoS("Eviction manager: no resources are starved")
		return nil
	}

	//sort the thresholds by eviction priority, with the memory signals first
	// rank the thresholds by eviction priority
	sort.Sort(byEvictionPriority(thresholds))
	//take the first reclaimable threshold; after the sort above the memory signals come first
	thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
	if !foundAny {
		return nil
	}
	klog.InfoS("Eviction manager: attempting to reclaim", "resourceName", resourceToReclaim)

	// record an event about the resources we are now attempting to reclaim via eviction
	m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

	//first check whether node-level resources can be reclaimed; after reclaiming, m.summaryProvider.Get is called again and the result is compared
	//against m.config.Thresholds. If no eviction signal is over its threshold any more it returns true, meaning the node-level reclamation relieved the pressure.
	//The memory signals have no node-level resources to reclaim
	// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
	if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
		klog.InfoS("Eviction manager: able to reduce resource pressure without evicting pods.", "resourceName", resourceToReclaim)
		return nil
	}

	klog.InfoS("Eviction manager: must evict pod(s) to reclaim", "resourceName", resourceToReclaim)

	//look up the ranking function for the signal, e.g. rankMemoryPressure for the memory signals
	// rank the pods for eviction
	rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
	if !ok {
		klog.ErrorS(nil, "Eviction manager: no ranking function for signal", "threshold", thresholdToReclaim.Signal)
		return nil
	}

	//no active pods, return
	// the only candidates viable for eviction are those pods that had anything running.
	if len(activePods) == 0 {
		klog.ErrorS(nil, "Eviction manager: eviction thresholds have been met, but no pods are active to evict")
		return nil
	}

	//rank activePods for eviction
	// rank the running pods for eviction for the specified resource
	rank(activePods, statsFunc)

	klog.InfoS("Eviction manager: pods ranked for eviction", "pods", format.Pods(activePods))

	//record age of metrics for met thresholds that we are using for evictions.
	for _, t := range thresholds {
		timeObserved := observations[t.Signal].time
		if !timeObserved.IsZero() {
			metrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInSeconds(timeObserved.Time))
		}
	}

	//iterate over activePods and evict; at most one pod is evicted per interval. The loop is needed because some pods are critical pods that must not be evicted
	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		gracePeriodOverride := int64(0)
		//for soft eviction, gracePeriodOverride is set to MaxPodGracePeriodSeconds;
		//for hard eviction it stays 0
		if !isHardEvictionThreshold(thresholdToReclaim) {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
		//perform the eviction; evictPod returns false if the pod is a critical pod, and true if an eviction was attempted, regardless of whether the eviction succeeded
		if m.evictPod(pod, gracePeriodOverride, message, annotations) {
			metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
			return []*v1.Pod{pod}
		}
	}
	klog.InfoS("Eviction manager: unable to evict any pods from the node")
	return nil
}

evictPod evicts the pod; if the pod is a critical pod it returns false immediately.

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {
	// If the pod is marked as critical and static, and support for critical pod annotations is enabled,
	// do not evict such pods. Static pods are not re-admitted after evictions.
	// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
	if kubelettypes.IsCriticalPod(pod) {
		klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod))
		return false
	}
	// record that we are evicting the pod
	m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
	// this is a blocking call and should only return when the pod and its containers are killed.
	klog.V(3).InfoS("Evicting pod", "pod", klog.KObj(pod), "podUID", pod.UID, "message", evictMsg)
	//killPodFunc is pkg/kubelet/pod_workers.go:killPodNow
	err := m.killPodFunc(pod, true, &gracePeriodOverride, func(status *v1.PodStatus) {
		status.Phase = v1.PodFailed
		status.Reason = Reason
		status.Message = evictMsg
	})
	if err != nil {
		klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod))
	} else {
		klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod))
	}
	return true
}

How a critical pod is identified:

// IsCriticalPod returns true if pod's priority is greater than or equal to SystemCriticalPriority.
func IsCriticalPod(pod *v1.Pod) bool {
	//a pod whose kubernetes.io/config.source annotation is not "api", i.e. a pod that was not obtained from the apiserver, is a static pod
	if IsStaticPod(pod) {
		return true
	}
	//a pod whose kubernetes.io/config.mirror annotation is non-empty is a mirror pod
	if IsMirrorPod(pod) {
		return true
	}
	//the pod has a priority set and it is greater than or equal to 2*1000000000 (SystemCriticalPriority)
	if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
		return true
	}
	return false
}

killPodNow calls the pod workers' UpdatePod to kill the pod. It is a blocking call: it either returns success or waits until the timeout expires.

//pkg/kubelet/pod_workers.go
// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
	return func(pod *v1.Pod, isEvicted bool, gracePeriodOverride *int64, statusFn func(*v1.PodStatus)) error {
		// determine the grace period to use when killing the pod
		gracePeriod := int64(0)
		if gracePeriodOverride != nil {
			gracePeriod = *gracePeriodOverride
		} else if pod.Spec.TerminationGracePeriodSeconds != nil {
			gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
		}

		// we timeout and return an error if we don't get a callback within a reasonable time.
		// the default timeout is relative to the grace period (we settle on 10s to wait for kubelet->runtime traffic to complete in sigkill)
		timeout := int64(gracePeriod + (gracePeriod / 2))
		minTimeout := int64(10)
		if timeout < minTimeout {
			timeout = minTimeout
		}
		timeoutDuration := time.Duration(timeout) * time.Second

		// open a channel we block against until we get a result
		ch := make(chan struct{}, 1)
		podWorkers.UpdatePod(UpdatePodOptions{
			Pod:        pod,
			UpdateType: kubetypes.SyncPodKill, //the update type is kill
			KillPodOptions: &KillPodOptions{
				CompletedCh:                              ch,
				Evict:                                    isEvicted,
				PodStatusFunc:                            statusFn,
				PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
			},
		})

		// wait for either a response, or a timeout
		select {
		//block on the channel; once the kill succeeds the channel is closed and nil is returned here
		case <-ch:
			return nil
		case <-time.After(timeoutDuration): //timed out, return an error
			recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
			return fmt.Errorf("timeout waiting to kill pod")
		}
	}
}

The call chain from UpdatePod is as follows:

UpdatePod-> managePodLoop -> syncTerminatingPod -> killPod -> containerRuntime.KillPod

4. What nodeConditions are used for
They serve two purposes.
a. If a node condition indicates pressure, it is reported to the kube-apiserver, as in the following code:

func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error
	setters = append(setters,
		//report MemoryPressure
		nodestatus.MemoryPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderMemoryPressure, kl.recordNodeStatusEvent),
		//report DiskPressure
		nodestatus.DiskPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderDiskPressure, kl.recordNodeStatusEvent),
		//report PIDPressure
		nodestatus.PIDPressureCondition(kl.clock.Now, kl.evictionManager.IsUnderPIDPressure, kl.recordNodeStatusEvent),
	)

If m.nodeConditions contains NodeMemoryPressure, the node's currently available memory has crossed the memory eviction threshold.

// IsUnderMemoryPressure returns true if the node is under memory pressure.
func (m *managerImpl) IsUnderMemoryPressure() bool {
	m.RLock()
	defer m.RUnlock()
	return hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure)
}

This shows up in two ways: the node gets the taint node.kubernetes.io/memory-pressure:NoSchedule, and MemoryPressure appears in the node's Conditions. Both can be seen with kubectl describe node master.

root@master:~# kubectl describe node master
...
Taints:             node.kubernetes.io/memory-pressure:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  master
  AcquireTime:     <unset>
  RenewTime:       Sat, 10 Dec 2022 10:23:38 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                         Message
  ----                 ------  -----------------                 ------------------                ------                         -------
  MemoryPressure       True    Sat, 10 Dec 2022 10:22:06 +0000   Sat, 10 Dec 2022 10:22:06 +0000   KubeletHasInsufficientMemory   kubelet has insufficient memory available

Once the node has the node.kubernetes.io/memory-pressure:NoSchedule taint, scheduling of new pods can be affected: the TaintToleration plugin checks taints at the Filter extension point.

// Filter invoked at the filter extension point.
func (pl *TaintToleration) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo == nil || nodeInfo.Node() == nil {
		return framework.AsStatus(fmt.Errorf("invalid nodeInfo"))
	}

	filterPredicate := func(t *v1.Taint) bool {
		// PodToleratesNodeTaints is only interested in NoSchedule and NoExecute taints.
		return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute
	}

	taint, isUntolerated := v1helper.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, filterPredicate)
	if !isUntolerated {
		return nil
	}

	errReason := fmt.Sprintf("node(s) had taint {%s: %s}, that the pod didn't tolerate",
		taint.Key, taint.Value)
	return framework.NewStatus(framework.UnschedulableAndUnresolvable, errReason)
}

If a newly created pod does not tolerate the memory-pressure taint, scheduling fails with an error like the following:

I1210 12:43:04.147715   92808 scheduler.go:351] "Unable to schedule pod; no fit; waiting" pod="default/nginx-demo3-574cdd99c7-lwk9x" err="0/2 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/memory-pressure: }, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling."

b. Even if a new pod is scheduled successfully, it must still pass the canAdmitPod check on the target node before it is allowed to run there. The call chain is:
HandlePodAdditions -> canAdmitPod -> podAdmitHandler.Admit

The evictionManager's Admit checks the node conditions in m.nodeConditions. If the set is empty there is no resource pressure and it returns true immediately; critical pods are also admitted. If the only condition is memory pressure, the decision depends on the pod's QoS class: pods that are not BestEffort are admitted, and BestEffort pods are admitted only if they tolerate the memory-pressure taint. Under any other pressure condition the pod is rejected, i.e. it is not allowed to run on this node.

// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	//an empty set of node conditions means no resource pressure, so admit the pod
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	//critical pods are always admitted
	// Admit Critical pods even under resource pressure since they are required for system stability.
	// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
	if kubelettypes.IsCriticalPod(attrs.Pod) {
		return lifecycle.PodAdmitResult{Admit: true}
	}

	//the node has memory pressure and nothing else
	// Conditions other than memory pressure reject all pods
	nodeOnlyHasMemoryPressureCondition := hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) && len(m.nodeConditions) == 1
	if nodeOnlyHasMemoryPressureCondition {
		//determine the pod's QoS class
		notBestEffort := v1.PodQOSBestEffort != v1qos.GetPodQOS(attrs.Pod)
		//admit pods that are not BestEffort
		if notBestEffort {
			return lifecycle.PodAdmitResult{Admit: true}
		}

		//also admit the pod if it tolerates the memory-pressure taint
		// When node has memory pressure, check BestEffort Pod's toleration:
		// admit it if tolerates memory pressure taint, fail for other tolerations, e.g. DiskPressure.
		if v1helper.TolerationsTolerateTaint(attrs.Pod.Spec.Tolerations, &v1.Taint{
			Key:    v1.TaintNodeMemoryPressure,
			Effect: v1.TaintEffectNoSchedule,
		}) {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}

	//reject in all other cases
	// reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
	klog.InfoS("Failed to admit pod to node", "pod", klog.KObj(attrs.Pod), "nodeCondition", m.nodeConditions)
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  Reason,
		Message: fmt.Sprintf(nodeConditionMessageFmt, m.nodeConditions),
	}
}

References
https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#eviction-thresholds
https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-eviction.md
