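I. How the problem was discovered
Recently, while a digital-human avatar model was being trained, the server reported the following error: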
[2024-03-22 03:13:57] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] [W CUDAGuardImpl.h:112] Warning: CUDA warning: uncorrectable ECC error encountered (function destroyEvent)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 206, in <module>
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] main(args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 178, in main
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn(subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] while not context.join():
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] raise ProcessRaisedException(msg, error_index, failed_process.pid)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn.ProcessRaisedException:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] -- Process 2 terminated with the following error:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] fn(i, *args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 165, in subprocess_fn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] training_loop(rank=rank, args=args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 90, in training_loop
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] input = input.to(device)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] RuntimeError: CUDA error: uncorrectable ECC error encountered
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Reading through the exception log, the key message is: RuntimeError: CUDA error: uncorrectable ECC error encountered
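Picking the key lines out of a long log like this can be scripted. The following is a minimal sketch, not part of the original troubleshooting: the function name `summarize_ecc_failure` and the shortened sample lines are illustrative assumptions. It collects every line mentioning the uncorrectable ECC error and the index of the spawned process that failed. Whether that process index equals the GPU index depends on how the training script assigns devices, which the log alone does not confirm.

```python
import re

# Patterns taken from the wording of the log above.
ECC_PATTERN = re.compile(r"uncorrectable ECC error", re.IGNORECASE)
FAILED_PROCESS_PATTERN = re.compile(r"-- Process (\d+) terminated")


def summarize_ecc_failure(log_lines):
    """Return the ECC-related lines and the indices of failed spawned processes."""
    ecc_lines = []
    failed_processes = set()
    for line in log_lines:
        if ECC_PATTERN.search(line):
            ecc_lines.append(line.strip())
        match = FAILED_PROCESS_PATTERN.search(line)
        if match:
            failed_processes.add(int(match.group(1)))
    return {"ecc_lines": ecc_lines, "failed_processes": sorted(failed_processes)}


# Shortened copies of lines from the log above (prefixes trimmed for brevity).
sample_log = [
    "[W CUDAGuardImpl.h:112] Warning: CUDA warning: uncorrectable ECC error encountered (function destroyEvent)",
    "Traceback (most recent call last):",
    "-- Process 2 terminated with the following error:",
    "input = input.to(device)",
    "RuntimeError: CUDA error: uncorrectable ECC error encountered",
]

summary = summarize_ecc_failure(sample_log)
print(summary["failed_processes"])
print(len(summary["ecc_lines"]))
```

If the training script maps each spawned process to the GPU with the same index (a common convention, assumed here), the failed process index is a useful first hint about which card to inspect.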
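II. Impact of the problem
The exception caused the avatar model training to fail, so no digital-human avatar could be produced for the customer, which in turn affected delivery.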
III. The troubleshooting process in detail
I first searched Baidu for [CUDA warning: uncorrectable ECC error encountered] to learn what ECC is and what it does.
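ECC stands for Error Correcting Code. GPUs with ECC memory store extra check bits so that the hardware can detect errors in video memory and automatically correct single-bit errors. Errors that are detected but cannot be corrected are reported as uncorrectable. The Volatile Uncorr. ECC field in the nvidia-smi output is the number of uncorrectable ECC errors detected since the driver was last loaded; it is a counter, not an on/off flag. It reads 0 when no such error has occurred and N/A when ECC is unsupported or turned off.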
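Next I searched for a fix. Baidu did not turn up a good one, so I switched to Google.
The results included a thread on NVIDIA's official forum, which I opened straight away to look for a solution.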
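IV. How the problem was solved
1. Check the GPU status with nvidia-smi. The key field is [Volatile Uncorr. ECC]: of the 4 GPUs, the third one (index 2) showed a value different from the other three, which pinpointed the faulty card.
2. Run nvidia-smi -i 2 -p 0 to reset the card's state. Here -i 2 selects GPU index 2, and -p 0 (--reset-ecc-errors) clears its volatile ECC error counters.
After that, the GPU status was back to normal.
I then contacted business operations and restarted the avatar model training job.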
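The two steps above can be sketched as a small script. This is an illustration under stated assumptions, not the procedure actually used: it assumes the installed nvidia-smi supports the `ecc.errors.uncorrected.volatile.total` query field (listed by `nvidia-smi --help-query-gpu`), and the helper names and sample output are made up for the example. It only builds the reset command and does not execute it, since the reset typically requires root.

```python
import subprocess

# Query one CSV row per GPU: "<index>, <volatile uncorrectable ECC count>".
QUERY_COMMAND = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]


def parse_ecc_csv(text):
    """Map GPU index -> error count (None when the value is not a number, e.g. [N/A])."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        index_str, value = [part.strip() for part in line.split(",", 1)]
        counts[int(index_str)] = int(value) if value.isdigit() else None
    return counts


def gpus_with_uncorrectable_errors(counts):
    """Return the indices of GPUs whose counter is greater than zero."""
    return [index for index, count in sorted(counts.items()) if count]


def reset_volatile_ecc_command(gpu_index):
    """Build the command from step 2: reset the volatile ECC counters of one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-p", "0"]


def query_ecc_counts():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver on the host)."""
    result = subprocess.run(QUERY_COMMAND, capture_output=True, text=True, check=True)
    return parse_ecc_csv(result.stdout)


# Made-up sample output for a 4-GPU host where only GPU 2 reports errors.
sample_output = "0, 0\n1, 0\n2, 3\n3, 0\n"
faulty = gpus_with_uncorrectable_errors(parse_ecc_csv(sample_output))
for gpu_index in faulty:
    print(" ".join(reset_volatile_ecc_command(gpu_index)))
```

On a real host, `query_ecc_counts()` would replace the sample string. Keeping the parsing separate from the subprocess call makes the logic easy to check without a GPU.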
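V. Summary and reflection
When a production issue comes up and Baidu does not turn up a solution, try Google as well. There are always more solutions than problems.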
Copyright belongs to the original author, 卡亦克. If there is any infringement, please contact us for removal.