While testing Flink HA, I rebooted one node (which ran both a JobManager and a NameNode). After it came back up, the NameNode failed to start, with an error roughly like this:

org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory /tmp/hadoop/dfs/journalnode/xxxx not formatted
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkFormatted(Journal.java:457)

Cause: roughly, the metadata stored by the JournalNode was inconsistent with the NameNode's. 2 of the 3 machines reported this error.
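
To confirm which machines are affected, it may help to check the journal directory named in the exception on each one. The sketch below assumes that a formatted JournalNode storage directory contains a current/VERSION file. That is my understanding of the HDFS storage layout, not something stated in the log, and the function name journal_is_formatted is made up for illustration.

```shell
#!/bin/sh
# Assumption: a formatted JournalNode storage directory contains current/VERSION.
# Returns 0 if the given directory looks formatted, 1 otherwise.
journal_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Hypothetical usage: pass the directory from the exception message.
# journal_is_formatted /tmp/hadoop/dfs/journalnode/xxxx && echo "formatted" || echo "not formatted"
```

The directory in the error above sits under /tmp, which some systems clear on reboot, so that may also be worth ruling out.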

Fix: start the JournalNode on nn1, then run hdfs namenode -initializeSharedEdits so that the JournalNode is brought back in sync with the NameNode. After that, restarting the NameNode worked.
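
The fix above can be sketched as a small script. It assumes Hadoop 3.x daemon syntax (hdfs --daemon start ...). On Hadoop 2.x the equivalent is hadoop-daemon.sh start <daemon>. The helper names run and recover_shared_edits are made up for illustration. By default it only prints the commands, so you can review them before running anything.

```shell
#!/bin/sh
# Dry-run helper: prints the command unless RUN=1 is set, in which case it executes it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Recovery steps from the text, in order. Stops at the first failing step.
# Assumes Hadoop 3.x; on 2.x replace `hdfs --daemon start X` with `hadoop-daemon.sh start X`.
recover_shared_edits() {
    # 1. Start the JournalNode on nn1.
    run hdfs --daemon start journalnode || return 1
    # 2. Re-initialize the shared edits from the NameNode's local metadata.
    run hdfs namenode -initializeSharedEdits || return 1
    # 3. Start the NameNode again.
    run hdfs --daemon start namenode || return 1
}

recover_shared_edits
```

Run it on nn1 with RUN=1 set to actually execute the steps. As described further down, re-initializing the shared edits cost some files in this case, so treat it as a last resort on a cluster that has live data.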

But then the Flink JobManager would not start, with the following error:

ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint   - Fatal error occurred in the cluster entrypoint.
	org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id xxxxxxxxxxxxxxxxxxxxxxxxxx
	....
caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /xxxxx. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
..
caused by: java.io.FileNotFoundException: File does not exist: /xxxx/submittedJobGraphe439cfc979db

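A job was running when the node was rebooted, and the initializeSharedEdits step above caused some files to be lost, so the JobManager failed when it tried to read the submitted job. The fix is to delete the Flink job's references in ZooKeeper: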

./zkCli.sh -server <zookeeper host>

set /flink/default/running_job_registry/xxxxx DONE
delete /flink/default/jobgraphs/xxxx
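
When several jobs are affected, the two commands can be generated per job. This is a minimal sketch that assumes both znodes are named after the same Flink job ID and that the HA root is /flink/default, as in the paths above. The function name flink_zk_cleanup_cmds is made up for illustration, and xxxxx stays a placeholder for the real job ID.

```shell
#!/bin/sh
# Prints the zkCli commands that drop one job's references.
# $1: Flink job ID (assumed to name both znodes)
# $2: ZooKeeper root of the Flink cluster, defaults to /flink/default
flink_zk_cleanup_cmds() {
    job_id=$1
    zk_root=${2:-/flink/default}
    # Mark the job as finished in the running-job registry.
    echo "set $zk_root/running_job_registry/$job_id DONE"
    # Remove the pointer to the missing submitted JobGraph.
    echo "delete $zk_root/jobgraphs/$job_id"
}

# Placeholder job ID; substitute the one from the error message.
flink_zk_cleanup_cmds xxxxx
```

The printed lines can be pasted into the zkCli.sh session. Running ls /flink/default/jobgraphs first shows which job IDs are actually present.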

After this, the JobManager and TaskManager restarted without problems, and jobs could be submitted again.
