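A Hadoop cluster ran out of disk space, which brought its services down and even crashed some of the machines.
After the machines were rebooted, both the NameNode and a JournalNode reported errors on startup.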

1. NameNode startup error

2023-03-27 10:23:23,484 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 20 on 8020, call Call#26630 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 192.17.128.187:21662
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /livy2-covery/v1/batch/6213. Name node is in safe mode.
The reported blocks 229819 needs additional 35467 blocks to reach the threshold 0.9900 of total blocks 267966.
The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. NamenodeHostName:ambari-d-146.te.td
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1404)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2850)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1065)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:641)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /livy2-covery/v1/batch/6213. Name node is in safe mode.
The reported blocks 229819 needs additional 35467 blocks to reach the threshold 0.9900 of total blocks 267966.
The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. NamenodeHostName:ambari-d-146.te.td
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1412)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1400)
… 12 more
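
The figures in this message are consistent with one another, and that can be checked by hand. The following is a minimal sketch, not Hadoop code: `blocks_needed` is a hypothetical helper, and it assumes the required block count is the total multiplied by the threshold and truncated to an integer, an assumption that reproduces the logged value.

```shell
#!/usr/bin/env bash
# blocks_needed TOTAL REPORTED THRESHOLD_PERMILLE
# Prints how many more blocks must be reported before safe mode can end.
# Hypothetical helper for illustration; it is not part of Hadoop.
blocks_needed() {
  local total=$1 reported=$2 permille=$3
  local required missing
  # Integer arithmetic only: the threshold 0.9900 is passed as 990 (per mille).
  required=$(( total * permille / 1000 ))
  missing=$(( required - reported ))
  if [ "$missing" -lt 0 ]; then missing=0; fi
  echo "$missing"
}

# Values taken from the log above: total 267966, reported 229819, threshold 0.9900.
blocks_needed 267966 229819 990
```

With the logged values this prints 35467, the same number as in "needs additional 35467 blocks". In other words, the DataNodes had reported 229819 of 267966 blocks, roughly 86%, when the delete request was rejected.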

2. JournalNode startup error

2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1162)) - After resync, position is 1024000
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1157)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/ns74/current/edits_inprogress_0000000000075656572 while determining its valid length. Position was 1024000
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4913)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1153)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:97)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:106)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:201)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1162)) - After resync, position is 1024000
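
The warning names a single in-progress segment under /hadoop/hdfs/journal/ns74/current. One way to see how the damaged JournalNode differs from a healthy one is to list the edit segments with their sizes on each node and compare the listings. This is a minimal sketch: `list_segments` is a hypothetical helper, and the expectation that finalized segments have matching sizes on healthy JournalNodes, while an `edits_inprogress_*` file may legitimately differ, is my assumption rather than something stated in the log.

```shell
#!/usr/bin/env bash
# list_segments DIR
# Prints "<size-in-bytes> <file-name>" for every edits_* file in DIR,
# in the order the shell expands the glob (sorted by name).
# Hypothetical helper for illustration; it only reads the directory.
list_segments() {
  local dir=$1 f size
  for f in "$dir"/edits_*; do
    [ -e "$f" ] || continue              # the glob matched nothing
    size=$(wc -c < "$f" | tr -d '[:space:]')
    printf '%s %s\n' "$size" "$(basename "$f")"
  done
}

# Example (run on each JournalNode, then diff the two listings):
#   list_segments /hadoop/hdfs/journal/ns74/current > /tmp/segments-$(hostname).txt
```

A segment that is missing, or much smaller than its counterpart on the healthy node, is a reasonable first suspect.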

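3. Approach

First restore the faulty JournalNode, then restore the NameNode. Only after both nodes have been fixed should the other services be restarted.
To restore the JournalNode, copy the complete journal directory from another JournalNode that is not reporting errors.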

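4. Restore the JournalNode

1) Copy the complete journal directory over from a JournalNode that has no errors.
The JournalNode on machine 146 is healthy, so pack its journal data directory and send it to machine 74.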

cd /hadoop/hdfs/
sudo tar -zcvf journal.tar.gz journal
scp journal.tar.gz admin@192.17.128.74:/home/admin/

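2) Log in to machine 74
sudo cp /home/admin/journal.tar.gz /hadoop/hdfs/journal.tar.gz
cd /hadoop/hdfs/
Back up the original journal directory
sudo mv journal journal-bak
Unpack the copy taken from machine 146
sudo tar -zxvf journal.tar.gz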

3) Restart the JournalNode service
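
Because the archive crosses the network, it may be worth confirming that the tarball arrives intact before it replaces the journal directory on 74. A minimal sketch, assuming GNU `tar` and `sha256sum` are available on both machines; `pack_with_checksum` and `verify_checksum` are hypothetical helper names, not part of Hadoop:

```shell
#!/usr/bin/env bash
# pack_with_checksum PARENT NAME
# Creates PARENT/NAME.tar.gz from PARENT/NAME and records its SHA-256 next to it.
# Hypothetical helper; run it with sufficient privileges to read PARENT/NAME.
pack_with_checksum() {
  local parent=$1 name=$2
  (
    cd "$parent" || exit 1
    tar -zcf "$name.tar.gz" "$name" || exit 1
    sha256sum "$name.tar.gz" > "$name.tar.gz.sha256"
  )
}

# verify_checksum PARENT NAME
# Returns 0 only if PARENT/NAME.tar.gz still matches the recorded SHA-256.
verify_checksum() {
  local parent=$1 name=$2
  ( cd "$parent" && sha256sum -c "$name.tar.gz.sha256" >/dev/null 2>&1 )
}

# Example, following the steps above:
#   on 146:  pack_with_checksum /hadoop/hdfs journal   # then scp both files to 74
#   on 74:   verify_checksum /home/admin journal && echo "archive intact"
```

The checksum file stores a relative file name, so the check works in whichever directory holds both the archive and its `.sha256` file.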

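5. Restore the NameNode

Make sure the JournalNode service is healthy again, then proceed as follows:

1) Switch to the hdfs user
sudo su - hdfs

2) Take HDFS out of safe mode
hdfs dfsadmin -safemode leave

3) Restart the NameNode
Once the NameNode has restarted successfully, start the remaining cluster services, such as YARN, Hive, and HBase.
At this point the cluster usually returns to a normal state.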
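
Leaving safe mode by hand skips the NameNode's own check that enough blocks have been reported, so it seems prudent to confirm the outcome afterwards. The sketch below parses the text printed by `hdfs dfsadmin -safemode get` and `hdfs fsck /`. The helper names are hypothetical, and the matched strings ("Safe mode is ON", "Safe mode is OFF", "is HEALTHY") are what I would expect these commands to print; they may differ between Hadoop versions.

```shell
#!/usr/bin/env bash
# safemode_is_off: reads `hdfs dfsadmin -safemode get` output on stdin.
# Returns 0 if the output reports OFF and no NameNode reports ON.
safemode_is_off() {
  local text
  text=$(cat)
  case $text in
    *'Safe mode is ON'*)  return 1 ;;
    *'Safe mode is OFF'*) return 0 ;;
    *)                    return 1 ;;   # unrecognised output: treat as not verified
  esac
}

# fsck_is_healthy: reads `hdfs fsck /` output on stdin.
# Returns 0 if some line declares the filesystem HEALTHY.
fsck_is_healthy() {
  grep -q 'is HEALTHY'
}

# Example, run as the hdfs user:
#   hdfs dfsadmin -safemode get | safemode_is_off && echo "safe mode is off"
#   hdfs fsck / | fsck_is_healthy || echo "missing or corrupt blocks remain"
```

If the second check fails, the blocks that were still unreported when safe mode was left are the likely cause and deserve a closer look before the cluster is handed back to users.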
