PyTorch multi-node, multi-GPU training error

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

Solution

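The cause was that the CUDA and PyTorch versions differed across the machines. Every machine must have the same CUDA version and the same PyTorch version for the job to run successfully.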
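One way to check for this is to print the relevant versions on each machine and compare them. The following is a minimal sketch, not part of the original fix: the helper names `collect_versions` and `find_mismatches` are hypothetical, and it only reads `torch.__version__` and `torch.version.cuda` (the CUDA version PyTorch was built against, which may differ from the installed driver or toolkit).

```python
import json
import socket


def collect_versions():
    """Return the PyTorch and CUDA versions seen on this machine."""
    info = {"host": socket.gethostname()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed here; record that instead of failing.
        info["torch"] = None
        info["cuda"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled with (None for CPU-only builds).
    info["cuda"] = torch.version.cuda
    return info


def find_mismatches(reports):
    """Given one report per machine, return the keys whose values differ."""
    mismatched = {}
    for key in ("torch", "cuda"):
        values = {r["host"]: r.get(key) for r in reports}
        if len(set(values.values())) > 1:
            mismatched[key] = values
    return mismatched


if __name__ == "__main__":
    # Run this on every machine, then compare the printed lines.
    print(json.dumps(collect_versions()))
```

Running the script on each node prints one JSON line per machine. Collecting those lines into a list and passing it to `find_mismatches` returns an empty dict when all machines agree, or the differing values keyed by host name otherwise.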
