Fixing the loss.backward() error in distributed model training

zouxiaolv · Published 2024-10-24 19:43:53

https://zhuanlan.zhihu.com/p/670519711

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 512, 16, 16]], which is output 0 of ConstantPadNdBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)

Solution:

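At first, following advice found online, I removed every in-place operation. For example, I changed nn.ReLU(inplace=True) to nn.ReLU(inplace=False), or simply used the default nn.ReLU(), and rewrote every y += x as y = y + x.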
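The in-place cleanup described above can be sketched as follows. This is only an illustration, not the code from the original project: the helper names disable_inplace and backward_succeeds are hypothetical, and torch.exp is used here simply because it is a small operation whose backward pass needs its own output.

```python
import torch
import torch.nn as nn


def disable_inplace(model: nn.Module) -> nn.Module:
    """Switch off the `inplace` flag on every submodule that has it set."""
    for module in model.modules():
        if getattr(module, "inplace", False):
            module.inplace = False
    return model


def backward_succeeds(use_inplace_add: bool) -> bool:
    """Return True if backward() runs without raising a RuntimeError."""
    x = torch.ones(3, requires_grad=True)
    y = torch.exp(x)  # exp keeps its output around for the backward pass
    if use_inplace_add:
        y += 1        # overwrites that saved output in place
    else:
        y = y + 1     # allocates a new tensor, the saved output is untouched
    try:
        y.sum().backward()
        return True
    except RuntimeError:
        return False


# Hypothetical toy model, only to show the helper in use.
net = disable_inplace(nn.Sequential(nn.Linear(4, 4), nn.ReLU(inplace=True)))
```

Calling backward_succeeds(True) reproduces the same class of error on CPU, while backward_succeeds(False) runs cleanly, which is why y = y + x is the usual first thing to try.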

The error still occurred.

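Next, as the hint in the error message suggests, I enabled torch.autograd.set_detect_anomaly(True) to print more detailed information.

The output included this message: UserWarning: Error detected in CudnnBatchNormBackward0. No forward pass information available.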
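A minimal sketch of turning anomaly detection on around a failing backward pass is shown below. The toy computation is an assumption made for illustration (it is not the model from this post) and the function name failing_backward_message is hypothetical. Anomaly detection slows training noticeably, so the sketch switches it off again afterwards.

```python
import torch


def failing_backward_message() -> str:
    """Run a backward pass that hits the in-place error and return its message."""
    x = torch.ones(3, requires_grad=True)
    y = torch.exp(x)
    y += 1  # in-place edit of a tensor that exp needs for its gradient
    try:
        y.sum().backward()
    except RuntimeError as err:
        return str(err)
    return ""


torch.autograd.set_detect_anomaly(True)  # extra diagnostics about the failing op
try:
    message = failing_backward_message()
finally:
    torch.autograd.set_detect_anomaly(False)  # debugging aid only, turn it off again
```

With anomaly detection enabled, PyTorch additionally reports which backward function failed (CudnnBatchNormBackward0 in the case described here), which is what points the investigation toward the BatchNorm layers.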

Following an explanation of how to resolve "UserWarning: Error detected in CudnnBatchNormBackward0. No forward pass information available.", I set broadcast_buffers=False when wrapping the model in DistributedDataParallel:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], broadcast_buffers=False)

The error disappeared and training could continue.
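For context, broadcast_buffers controls whether DistributedDataParallel syncs the module's buffers at the beginning of each forward call (the default is True). BatchNorm layers keep their running statistics in such buffers. A plausible reading, stated here as an assumption rather than something the post confirms, is that this sync overwrites the BatchNorm buffers in place while an earlier forward pass still needs them for its gradient. The sketch below lists the buffers involved and shows the wrapping call; buffer_names and wrap_ddp are hypothetical helper names, and wrap_ddp assumes torch.distributed.init_process_group has already been called.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def buffer_names(model: nn.Module) -> list:
    """List the buffers that DDP would broadcast when broadcast_buffers=True."""
    return [name for name, _ in model.named_buffers()]


def wrap_ddp(model: nn.Module, gpu: int) -> DistributedDataParallel:
    """Wrap a model as in the post; requires an initialized process group."""
    return DistributedDataParallel(
        model,
        device_ids=[gpu],
        broadcast_buffers=False,  # skip the per-forward buffer sync
    )


# Hypothetical toy model: the BatchNorm layer is what owns the buffers.
toy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
names = buffer_names(toy)
```

One trade-off to keep in mind: with broadcast_buffers=False each process keeps its own BatchNorm running statistics, so they are no longer kept identical across ranks.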
