Problem
This morning a scheduled-task alert fired: the Flink jobs had failed. In the Flink web UI, all jobs were gone and the slot count was 0.
The Flink log contained the following error:
2022-12-07 08:00:05,444 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_191]
at java.lang.ClassLoader.defineClass(ClassLoader.java:763) ~[?:1.8.0_191]
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_191]
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_191]
at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_191]
at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_191]
at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_191]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_191]
at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.13.6.jar:1.13.6]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) [?:1.8.0_191]
at java.lang.invoke.MethodHandleNatives.resolve(Native Method) ~[?:1.8.0_191]
at java.lang.invoke.MemberName$Factory.resolve(MemberName.java:975) [?:1.8.0_191]
at java.lang.invoke.MemberName$Factory.resolveOrFail(MemberName.java:1000) [?:1.8.0_191]
at java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:1394) [?:1.8.0_191]
at java.lang.invoke.MethodHandles$Lookup.linkMethodHandleConstant(MethodHandles.java:1750) [?:1.8.0_191]
at java.lang.invoke.MethodHandleNatives.linkMethodHandleConstant(MethodHandleNatives.java:477) [?:1.8.0_191]
at org.apache.poi.xssf.usermodel.XSSFRelation.<clinit>(XSSFRelation.java:124) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at org.apache.poi.xssf.usermodel.XSSFWorkbookType.<clinit>(XSSFWorkbookType.java:26) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:247) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.WorkbookUtil.createBook(WorkbookUtil.java:133) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.WorkbookUtil.createBookForWriter(WorkbookUtil.java:73) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.ExcelWriter.<init>(ExcelWriter.java:145) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.ExcelWriter.<init>(ExcelWriter.java:135) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at cn.hutool.poi.excel.ExcelUtil.getWriter(ExcelUtil.java:418) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at com.ucloud.provider.flink.imsafe.job.FlinkJobCheatFind.operateResult(FlinkJobCheatFind.java:276) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at com.ucloud.provider.flink.imsafe.job.FlinkJobCheatFind$2.close(FlinkJobCheatFind.java:149) [blob_p-17a50df9a1ef5c556557b4f62f21ad96d7b2f961-bf8906a9771fafff8b2c8f75c3f656b7:?]
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:247) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779) [flink-dist_2.12-1.13.6.jar:1.13.6]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) [flink-dist_2.12-1.13.6.jar:1.13.6]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
2022-12-07 08:00:05,445 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2022-12-07 08:00:05,445 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2022-12-07 08:00:05,446 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-4b019f58-a9a8-49ce-9429-41054270ef41
2022-12-07 08:00:06,134 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
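Analysis
The exception makes it clear that the failure was caused by the metaspace running out of memory.
The log also shows that a batch job (FlinkJobCheatFind) triggered it; this batch job is scheduled by a scheduled-task center.
So we know roughly where the problem is, but why does memory leak?
First, what is metaspace? Java developers usually run into heap space OOM errors and rarely metaspace ones. This comes down to the JVM runtime memory model, which includes both metaspace and heap. Metaspace (which replaced the permanent generation, PermGen, in JDK 1.8) stores class metadata: relatively stable information such as class structures, method bytecode and constant pools. On HotSpot the values of static fields live on the heap, but they stay reachable for as long as their class is loaded. The heap holds the objects created from classes and is usually the much larger region. The two regions are also reclaimed differently: class metadata is freed only when the class loader that defined the classes becomes unreachable and is collected.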
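To make the heap/metaspace distinction concrete, here is a minimal sketch that reads both from the standard java.lang.management API. The class name MetaspaceProbe is made up for illustration, and the pool name "Metaspace" is an assumption that holds on HotSpot JVMs (JDK 8 and later); other JVMs may not expose such a pool, which is why the lookup returns an Optional.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.Optional;

/** Hypothetical helper: reports heap and metaspace usage of the current JVM. */
final class MetaspaceProbe {
    private MetaspaceProbe() {}

    /** Heap usage as reported by the memory MXBean. */
    static MemoryUsage heapUsage() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Usage of the non-heap pool named "Metaspace" (HotSpot naming, an assumption). */
    static Optional<MemoryUsage> metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP && "Metaspace".equals(pool.getName())) {
                return Optional.of(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsage().getUsed());
        // Prints nothing for metaspace on JVMs that do not expose the pool.
        metaspaceUsage().ifPresent(
                u -> System.out.println("metaspace used (bytes): " + u.getUsed()));
    }
}
```

This only inspects the JVM it runs in. To watch a TaskManager you would have to call it from inside the job or read the same beans over JMX.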
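That explains what metaspace is, but not why it overflowed. Looking at the log again, the batch job uses POI's Excel utilities (through Hutool) to process Excel files. These libraries bring in many classes with a large amount of static data, and all of them are loaded by a ChildFirstClassLoader. When ChildFirstClassLoader instances holding that many classes are never released, memory leaks.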
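**So why does ChildFirstClassLoader leak memory?** To answer that we need to understand how Flink's class loading works:
Flink has two class-loading modes, Parent-First and Child-First:
Parent-First: similar to Java's standard parent-delegation model. The Parent-First ClassLoader is essentially just a URLClassLoader.
Child-First: also built on URLClassLoader, but it first matches the class name against the list formed by concatenating classloader.parent-first-patterns.default and classloader.parent-first-patterns.additional. If the class-name prefix matches, parent delegation is used first. Otherwise the ChildFirstClassLoader loads the class itself from the user jar, falling back to the parent only if the class is not found there.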
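The Child-First lookup order described above can be sketched in a few lines. This is a simplified illustration, not Flink's actual ChildFirstClassLoader: the class name SimpleChildFirstClassLoader and the prefix list passed in main are assumptions made for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import java.util.List;

/** Simplified child-first loader: parent-first for listed prefixes, own URLs first otherwise. */
final class SimpleChildFirstClassLoader extends URLClassLoader {
    private final List<String> parentFirstPrefixes;

    SimpleChildFirstClassLoader(URL[] urls, ClassLoader parent, List<String> parentFirstPrefixes) {
        super(urls, parent);
        this.parentFirstPrefixes = parentFirstPrefixes;
    }

    /** True if the class name starts with one of the parent-first prefixes. */
    boolean isParentFirst(String className) {
        for (String prefix : parentFirstPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (isParentFirst(name)) {
                    // Matching prefix: ordinary parent delegation.
                    return super.loadClass(name, resolve);
                }
                try {
                    // Child first: look in this loader's own URLs.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not in the user jar: fall back to the parent.
                    return super.loadClass(name, resolve);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> prefixes = Arrays.asList("java.", "org.apache.flink.");
        try (SimpleChildFirstClassLoader loader = new SimpleChildFirstClassLoader(
                new URL[0], SimpleChildFirstClassLoader.class.getClassLoader(), prefixes)) {
            System.out.println(loader.isParentFirst("java.lang.String"));              // true
            System.out.println(loader.isParentFirst("cn.hutool.poi.excel.ExcelUtil")); // false
        }
    }
}
```

Because every class that does not match a prefix is defined by the child loader itself, each new loader instance gets its own copy of those classes in metaspace. That is why an unreleased loader is expensive.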
Child-First is the default. In standalone mode, every batch job execution creates a ChildFirstClassLoader to load all of the job's classes, and when the job finishes Flink closes and releases it. This is where the problem lies: Flink closes the class loader only by calling URLClassLoader's close method, which merely closes the opened jar files. The classes already loaded remain, and if anything else still references the ClassLoader, that ChildFirstClassLoader is never released. After the batch job has run many times, the metaspace overflows.
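We now know the principle and roughly where the overflow happens, but what exactly is left unreleased and keeps the ChildFirstClassLoader alive?
Only analysis of an actual run can tell. We ran the batch job several times in the test environment and watched metaspace in the web UI: it kept growing and was never reclaimed. We then dumped the TaskManager process with jmap and opened the dump in Memory Analyzer:
In Leak Suspects, this ChildFirstClassLoader accounts for 4,408,232 bytes, which is clearly abnormal:
In the Histogram view, we search for this class:
There are 3 instances that stay alive. To find out what is holding them, we look up their references:
The references show that the mysql-cj-abandoned-connection-cleanup thread holds this loader.
Now the truth is out: the batch job uses database connections, and the MySQL driver starts a cleanup thread that prevents the ClassLoader from ever being released.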
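A rough illustration of how a lingering thread can pin a class loader: a new thread inherits the context class loader of the thread that creates it, and keeps that reference even after the loader is closed. The sketch below uses only the JDK. LingeringThreadDemo and the thread name are made up, and this shows one common reference path, not necessarily the exact one Memory Analyzer reported for the MySQL thread.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo: a thread created under a loader keeps referencing it after close(). */
final class LingeringThreadDemo {
    private LingeringThreadDemo() {}

    /** Starts a daemon thread while 'loader' is the creator's context class loader. */
    static Thread startThreadUnder(ClassLoader loader, CountDownLatch stop) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            // The new thread inherits the creator's context class loader.
            Thread t = new Thread(() -> {
                try {
                    stop.await(); // stands in for a cleanup loop that never ends
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "demo-cleanup-thread");
            t.setDaemon(true);
            t.start();
            return t;
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        URLClassLoader loader =
                new URLClassLoader(new URL[0], LingeringThreadDemo.class.getClassLoader());
        CountDownLatch stop = new CountDownLatch(1);
        Thread t = startThreadUnder(loader, stop);
        loader.close(); // closes jar handles only; the loader object stays referenced
        System.out.println(t.getContextClassLoader() == loader); // true
        stop.countDown(); // let the thread exit, which drops the reference
        t.join();
    }
}
```

As long as such a thread is alive, the garbage collector cannot reclaim the loader or any class it defined, which matches the growth seen in metaspace.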
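Solution
Flink provides a hook with which you can register an action to run when the ClassLoader is released, so that cleanup can be done before the release.
We can use this hook to clean up:
(getRuntimeContext() here is available in rich function and format classes such as RichInputFormat, RichSinkFunction and RichOutputFormat.)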
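import com.mysql.cj.jdbc.AbandonedConnectionCleanupThread;
import org.apache.flink.api.common.functions.RuntimeContext;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

// Inside a rich function or format that has access to getRuntimeContext():
log.info("Registering a hook to release classes that dependencies fail to release");
RuntimeContext ctx = this.getRuntimeContext();
// The hook is keyed by name and registered only if no hook with that name exists yet.
ctx.registerUserCodeClassLoaderReleaseHookIfAbsent(FlinkJobCheatFind.class.getName() + "_clsreleasehook", new Runnable() {
    @Override
    public void run() {
        log.info("release hook");
        // Deregister JDBC drivers so that DriverManager no longer references them.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            try {
                log.info("Deregistering driver: {}", driver);
                DriverManager.deregisterDriver(driver);
            } catch (SQLException e) {
                log.error("Failed to deregister driver {}", driver, e);
            }
        }
        // Stop the MySQL cleanup thread that keeps the class loader reachable.
        log.info("Shutting down the MySQL cleanup thread");
        AbandonedConnectionCleanupThread.uncheckedShutdown();
    }
});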
The main purpose of this hook is to deregister the driver and shut down the cleanup thread before the class loader is released, so that nothing else references the ChildFirstClassLoader.
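A possible refinement, sketched under assumptions: rather than deregistering every driver that DriverManager returns, deregister only those whose class was defined by a given class loader. DriverCleanup and NoopDriver are hypothetical names; NoopDriver exists only so the example runs without a real JDBC driver.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Collections;
import java.util.Properties;
import java.util.logging.Logger;

/** Hypothetical helper: deregisters only the drivers defined by one class loader. */
final class DriverCleanup {
    private DriverCleanup() {}

    /** Returns how many drivers loaded by 'loader' were deregistered. */
    static int deregisterDriversLoadedBy(ClassLoader loader) throws SQLException {
        int removed = 0;
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == loader) {
                DriverManager.deregisterDriver(driver);
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) throws SQLException {
        DriverManager.registerDriver(new NoopDriver());
        int removed = deregisterDriversLoadedBy(NoopDriver.class.getClassLoader());
        System.out.println("deregistered drivers: " + removed);
    }
}

/** Do-nothing driver used only to make the sketch self-contained. */
final class NoopDriver implements Driver {
    @Override public Connection connect(String url, Properties info) { return null; }
    @Override public boolean acceptsURL(String url) { return false; }
    @Override public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
        return new DriverPropertyInfo[0];
    }
    @Override public int getMajorVersion() { return 1; }
    @Override public int getMinorVersion() { return 0; }
    @Override public boolean jdbcCompliant() { return false; }
    @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException {
        throw new SQLFeatureNotSupportedException("no logging in this sketch");
    }
}
```

Inside the release hook, the loader to pass would presumably be the one returned by getRuntimeContext().getUserCodeClassLoader(). That wiring is not shown here and has not been tested against the job in this article.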
As for what needs to be released manually, the official documentation also gives some guidance:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/debugging/debugging_classloading/
The key passage is:
Common causes for class leaks and suggested fixes:
Lingering Threads: Make sure the application functions/sources/sinks shuts down all threads. Lingering threads cost resources themselves and additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes.
Interners: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava’s interners, or Avro’s class/object caches in the serializers.
JDBC: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should either add the driver jars to Flink’s lib/ folder, or add the driver classes to the list of parent-first loaded class via classloader.parent-first-patterns-additional.
Verification
Run the batch job repeatedly and watch the metaspace usage.
I used arthas to inspect memory here: usage dropped from 113M to 79M, which shows that metaspace is being reclaimed.
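Besides the metaspace size, the count of unloaded classes is another signal that loaders are being released. The sketch below reads it from the standard ClassLoadingMXBean. ClassLoadingStats is a made-up name, and it reports on the JVM it runs in, so for a TaskManager it would need to be called from job code or read over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: one-line snapshot of class loading counters for the current JVM. */
final class ClassLoadingStats {
    private ClassLoadingStats() {}

    static String snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return String.format("loaded=%d totalLoaded=%d unloaded=%d",
                bean.getLoadedClassCount(),       // classes currently loaded
                bean.getTotalLoadedClassCount(),  // classes loaded since JVM start
                bean.getUnloadedClassCount());    // classes unloaded since JVM start
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

If the fix works, the unloaded counter should rise after batch runs complete and a garbage collection has happened, while the currently loaded count should stop growing without bound.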
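Other options
The solution above is still somewhat cumbersome: if the project has many dependencies, it is hard to tell exactly what needs to be released.
There are two other options:
1. Submit jobs to YARN. With YARN submission, the JobManager and TaskManager processes are started on demand in containers for the job, and the whole process is destroyed when the job finishes, so the problem does not arise.
2. Put the dependency jars in Flink's lib directory, so that the job jar contains only its own basic classes. The dependencies are then loaded once by the parent class loader instead of being loaded again every time a job starts.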