I. Problem Description
I set up a Hadoop demo environment for some functional testing. After it had been in use for a while, Flink jobs could no longer be submitted to Hadoop. Cluster resources were sufficient, but inspecting HDFS revealed missing and corrupt files. This article walks through resolving those HDFS file problems.
II. Problem Analysis and Resolution
1. HDFS Block Corruption
1.1. Symptoms
Run the following command:
hdfs fsck /
The report shows missing and corrupt files:
.....
/dodb/datalake/jars/110/e24d18b0014183c95f56a26724353c15.jar: Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073742003_1179. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
.................................................................................
/flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata: CORRUPT blockpool BP-1704786246-10.101.1.140-1663251681207 block blk_1073742316
/flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata: MISSING 1 blocks of total size 17009 B..................................
/flink-savepoint/ced56a39eca402a8050c99a5187b995a/chk-151/_metadata: CORRUPT blockpool BP-1704786246-10.101.1.140-1663251681207 block blk_1073742317
/flink-savepoint/ced56a39eca402a8050c99a5187b995a/chk-151/_metadata: MISSING 1 blocks of total size 2439 B.............
/user/commonuser/.flink/application_1663251697076_0001/dataflow-flink.jar: Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073741830_1006. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
......
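Before deciding how to handle each file, it helps to reduce the fsck report to just the damaged entries. A minimal sketch using standard fsck options (the savepoint path comes from the output above):

hdfs fsck / -list-corruptfileblocks
hdfs fsck /flink-savepoint -files -blocks -locations

The first command prints only the files with missing or corrupt blocks; the second shows, per file, the block IDs and which datanodes hold (or should hold) each replica.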
1.2. Resolution
The missing blocks belong to savepoints taken while running Flink jobs; savepoints are generally used for scenarios such as cluster migration.
Case 1: the files can simply be deleted
Locate the problematic files with the fsck command above; once you have confirmed they are no longer needed, delete them and then re-check the health of HDFS:
hdfs dfs -rm -f -R /flink-savepoint/34c63dd8507daaa028860d984baa6597
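When many files are affected, fsck can handle them in bulk. Both of the following are standard fsck options, but they act on every corrupt file under the given path, so run them only after confirming nothing there needs to be recovered:

hdfs fsck /flink-savepoint -move
hdfs fsck /flink-savepoint -delete

-move relocates the corrupt files to /lost+found, while -delete removes them outright. Re-run hdfs fsck / afterwards to confirm the corrupt block count has dropped.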
Case 2: the files need to be recovered
Reference:
How to fix corrupt HDFS FIles
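As a rough sketch of the recovery approach described there: first find out where the replicas of the damaged block are supposed to live, then search the datanode disks for the raw block file. The block ID below is taken from the fsck output above; the data directory /hadoop/hdfs/data is an assumed example, the real one is whatever dfs.datanode.data.dir points to in hdfs-site.xml:

hdfs fsck /flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata -files -blocks -locations
find /hadoop/hdfs/data -name 'blk_1073742316*'

If a copy of the block file turns up on some disk, restarting that datanode lets it report the block back to the namenode; only when no replica exists anywhere is the data truly gone.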
2. Replica Synchronization Issues
2.1. Symptoms
After repairing the corrupt files, check the health of HDFS again:
hdfs fsck /
Status: HEALTHY
Total size: 1039836161 B
Total dirs: 179
Total files: 188
Total symlinks: 0
Total blocks (validated): 188 (avg. block size 5531043 B)
Minimally replicated blocks: 188 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 69 (36.70213 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 138 (42.331287 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Thu Sep 29 13:53:02 CST 2022 in 7 milliseconds
The filesystem under path '/' is HEALTHY
A large number of blocks have replica problems:
Missing replicas: 138 (42.331287 %)
The demo environment is a single node, so each block can have only one replica, yet the report shows a target of three replicas per block:
/user/commonuser/.flink/application_1664426842010_0004/log4j.properties: Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073752973_12149.
Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
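To list every file in this state instead of picking them out of the full report by hand, the fsck output can be filtered with ordinary shell tools (a rough sketch; it simply keys off the "Under replicated" marker seen above):

hdfs fsck / | grep 'Under replicated' | awk -F ':' '{print $1}'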
2.2. Resolution
First check the configured replication factor: dfs.replication=1, which is as expected. However, this setting only guarantees that newly created files get one replica; the replication factor is recorded per file at write time from the client's configuration, which is why files written earlier by clients requesting three replicas (the Hadoop default) still carry a target of 3.
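To double-check what value the namenode is actually running with (rather than what the XML file is believed to say), the effective configuration can be read back with the standard getconf utility:

hdfs getconf -confKey dfs.replication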
Then change the replication factor of all existing files to 1 (setrep recurses into directories automatically when given a directory path):
hdfs dfs -setrep -w 1 /
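The -w flag makes the command wait until replication has actually converged, which can take a while on a large tree. A single file can then be spot-checked; %r is the replication-count field of the stat format string, and the path below is the log4j.properties file flagged earlier:

hdfs dfs -stat '%r %n' /user/commonuser/.flink/application_1664426842010_0004/log4j.properties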
Check HDFS health once more; it is now back to normal:
hdfs fsck /
Status: HEALTHY
...
Corrupt blocks: 0
Missing replicas: 0
...
FSCK ended at Fri Sep 30 10:49:05 CST 2022 in 5 milliseconds
The filesystem under path '/' is HEALTHY