Problem description
A Flink job submitted as a jar runs for a few hours, or up to a day, and then fails.
The first error is:
2023-02-01 23:43:08,083 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(60000), EventTimeTrigger, getHvcDownLine) -> Sink: Unnamed (1/1) (8672ad64cfc4ddce37756e60242432be) switched from RUNNING to FAILED on 11.11.1.102:40227-006cac @ flinkc (dataPort=37255).
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 11.11.1.102:40227-006cac timed out.
The second error is:
2023-02-01 23:43:08,111 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job T4301_productDownLine (fef0fb9f856277bc9d9da05df7d63bf6) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
The third error is:
2023-02-03 23:42:35,875 ERROR akka.remote.Remoting [] - Association to [akka.tcp://flink-metrics@11.11.1.102:34546] with UID [-1590851144] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Remote system has been silent for too long. (more than 48.0 hours)
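The suggestions I found on Baidu were:
Add a restart strategy in the Java program.
The jar versions bundled with the Java program conflict with those on the Flink cluster.
The Flink cluster's slot allocation is misconfigured.
The Flink cluster's heartbeat timeout is too short; set it longer: heartbeat.timeout: 180000
In Flink's flink-conf.yaml, give priority to the jars already present on the Flink cluster.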
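For reference, three of the suggestions above map onto settings in flink-conf.yaml. The fragment below is only a sketch, not the configuration used in this post: the heartbeat value is the one quoted above, while the restart attempts and delay are illustrative values chosen here. The key names are those used by Flink releases from around the time of these logs; newer releases may name the restart strategy key differently, so check them against the documentation for your version.

```yaml
# Heartbeat timeout in milliseconds (the value suggested above).
heartbeat.timeout: 180000

# Restart the job after a failure instead of failing it immediately.
# Without a restart strategy, Flink reports
# "Recovery is suppressed by NoRestartBackoffTimeStrategy".
# The attempts and delay below are illustrative assumptions.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Prefer classes from the cluster's own jars over those bundled in the job jar.
# Flink's default is child-first.
classloader.resolve-order: parent-first
```

A restart strategy set in the Java program itself takes precedence over the cluster-wide value, so this fragment only supplies a default for jobs that do not set one.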
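About two months later, a follow-up comment, dated 2023-03-27:
I switched to YARN on a Hadoop cluster, analyzed the memory sizes of the TaskManager and JobManager, and worked through the cause of each error.
I think the likely causes were: 1. A bug in the code that only surfaces after the job has been running for a long time. This is the most likely cause (90%). The default memory at the time was 1G, so running out of memory could not have been the reason. In the end I did not use any of the Baidu suggestions above, and the job is still running robustly.
2. I did not analyze the logs carefully. I often tried to save time by taking shortcuts, only to find that they were all detours.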
Copyright belongs to the original author, qq_37591637. If there is any infringement, please contact us for removal.