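Project scenario:
Recently the Flink jobs on our real-time platform have been failing frequently with checkpoint-related errors. Around the same time, the cluster's HDFS has often been raising "bad health" alerts; I don't know whether the two are related, but my state backend stores its data on HDFS. Enough preamble, let's get to the useful part~
Problem description
The log shows the following error: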
2022-07-16 06:26:46,566 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 670223 of job 61103d713243c4a71befb436fa3f32ee expired before completing.
2022-07-16 06:26:46,571 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:98) ~[flink-dist_2.11-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:67) ~[flink-dist_2.11-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1934) ~[flink-dist_2.11-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1906) ~[flink-dist_2.11-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:96) ~[flink-dist_2.11-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1990) ~[flink-dist_2.11-1.13.1.jar:1.13.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_201]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_201]
Note:
Before the error
Exceeded checkpoint tolerable failure threshold.
appears, the log first reports Checkpoint expired before completing.
In other words, the checkpoint timed out before it could complete.
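To check how often and at what time checkpoints expire, the JobManager log can be scanned for the INFO line shown above. The following is a minimal sketch, not part of Flink: the class and method names are hypothetical, and the regular expression assumes the exact log layout of the excerpt in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds "expired before completing" lines in a JobManager log. */
final class ExpiredCheckpointScan {

    // Assumes lines shaped like the excerpt above:
    // "<date> <time>,<millis> INFO ... - Checkpoint <id> of job <jobId> expired before completing."
    private static final Pattern EXPIRED = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d{3} .*"
                    + "Checkpoint (\\d+) of job ([0-9a-f]+) expired before completing\\.");

    /** Returns one "timestamp checkpoint=id" entry per expired checkpoint, in log order. */
    static List<String> expiredCheckpoints(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = EXPIRED.matcher(line);
            if (m.find()) {
                hits.add(m.group(1) + " checkpoint=" + m.group(2));
            }
        }
        return hits;
    }
}
```

Feeding it the lines of a log file (for example via `java.nio.file.Files.readAllLines`) gives a list of timestamps that makes any recurring time-of-day pattern easy to spot.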
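Solution:
This was the first time I had seen this error. What made me even more curious was that it kept occurring at roughly the same time of day (about once every two days, shortly after 6 a.m.).
At first I assumed checkpoints were being taken too often, so I changed the checkpoint interval (5s -> 10s) and the job restart delay (5s -> 30s), but that did not solve the problem.
Next I increased the memory of the JobManager and TaskManager, but that did not solve it either…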
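While going through Flink's checkpoint-related settings, I found the option TolerableCheckpointFailureNumber,
which sets how many checkpoint failures are tolerated. The default value of 0 means no checkpoint failure is tolerated.
Since the error says that the tolerable checkpoint failure threshold was exceeded, I decided to try this setting and observe, so I added it to the program.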
// Set the number of tolerable checkpoint failures; the default of 0 means no checkpoint failure is tolerated
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
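Configuration notes:
The setting caps the number of consecutive failed checkpoints, which Flink tracks in continuousFailureCounter. For example, with tolerableCheckpointFailureNumber set to 3, three consecutive failures are tolerated; once continuousFailureCounter exceeds 3 (that is, on the fourth consecutive failure), the job attempts a restart. If any checkpoint succeeds in between, continuousFailureCounter is reset to zero.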
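The counting rule can be sketched as a small standalone model. This is not Flink's actual class: the class and method names are hypothetical, and it assumes the job is failed only when the counter is strictly greater than the tolerance. That reading is what makes a tolerance of 0 mean "no failure is tolerated", and it implies that a tolerance of 3 trips on the fourth consecutive failure.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative model of a consecutive-failure counter; not Flink's implementation. */
final class ContinuousFailureCounterSketch {

    private final int tolerableFailures;
    private final AtomicInteger continuousFailureCounter = new AtomicInteger(0);

    ContinuousFailureCounterSketch(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records one failed checkpoint; returns true if the tolerance is now exceeded. */
    boolean onCheckpointFailure() {
        // Assumption: strict comparison, so tolerance 0 fails on the first failure.
        return continuousFailureCounter.incrementAndGet() > tolerableFailures;
    }

    /** A successful checkpoint resets the run of consecutive failures. */
    void onCheckpointSuccess() {
        continuousFailureCounter.set(0);
    }

    int currentCount() {
        return continuousFailureCounter.get();
    }

    public static void main(String[] args) {
        ContinuousFailureCounterSketch counter = new ContinuousFailureCounterSketch(2);
        System.out.println(counter.onCheckpointFailure()); // false: 1 failure, tolerance 2
        counter.onCheckpointSuccess();                     // counter back to 0
        System.out.println(counter.onCheckpointFailure()); // false: count 1
        System.out.println(counter.onCheckpointFailure()); // false: count 2
        System.out.println(counter.onCheckpointFailure()); // true: count 3 exceeds 2
    }
}
```

With a tolerance of 2, as configured in this post, a single expired checkpoint followed by a successful one never brings the counter above 1, so the job keeps running.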
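Going by the earlier pattern, the job should have failed with this error the next day. When I checked the Flink web UI, the job was running normally, but one checkpoint had indeed failed, at about the same time of day as before and for the same reason: Checkpoint expired before completing.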
This shows that the setting is effective against this error. Problem solved!!!
Remember to like and bookmark this post. I'll keep updating as I run into new problems, so follow along~