PyTorch multi-node multi-GPU error: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA
PyTorch multi-node multi-GPU error
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
Solution
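The CUDA and PyTorch versions were not the same on all of the machines. Every machine taking part in the job must have the same CUDA version and the same PyTorch version, otherwise the run will not succeed.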
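One way to check this is to print the versions on every machine and compare them. The following is only a minimal sketch: the helper names collect_versions and find_mismatches are hypothetical (they do not come from PyTorch), and it assumes you gather the per-machine reports yourself, for example by running the script over SSH on each node. It reads torch.__version__ and torch.version.cuda, which is the CUDA version PyTorch was built with.

```python
import socket


def collect_versions():
    """Return this machine's PyTorch and CUDA versions (hypothetical helper)."""
    # Imported lazily so find_mismatches can be used on a machine without PyTorch.
    import torch

    return {
        "host": socket.gethostname(),
        "torch": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "cuda": torch.version.cuda,
    }


def find_mismatches(reports):
    """Given one report dict per machine, return the fields whose values differ.

    The result maps each differing field to a {host: value} dict, and is
    empty when all machines agree.
    """
    fields = ("torch", "cuda")
    return {
        field: {r["host"]: r[field] for r in reports}
        for field in fields
        if len({str(r[field]) for r in reports}) > 1
    }
```

Run print(collect_versions()) on each machine, collect the results into a list, and pass it to find_mismatches. An empty result means the PyTorch and CUDA versions agree; any entry it returns names the field that differs and which host has which version.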