https://zhuanlan.zhihu.com/p/670519711

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 512, 16, 16]], which is output 0 of ConstantPadNdBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)

Solution:

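At first, based on what I found online, I removed every in-place operation. For example, I dropped the inplace flag from nn.ReLU(inplace=True) (either setting it to False or using the default nn.ReLU()), and rewrote every y += x as y = y + x.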
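As a rough illustration of those two changes, here is a minimal sketch. The ResidualBlock class, its channel count, and the input shape are hypothetical and are not taken from the original model; the point is only to show the out-of-place forms.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Hypothetical block used only to illustrate the out-of-place forms."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Before: nn.ReLU(inplace=True). The default is inplace=False.
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.conv(x)
        # Before: y += x (modifies y in place).
        # After: y = y + x allocates a new tensor and leaves the old one intact.
        y = y + x
        return self.relu(y)


block = ResidualBlock(8)
x = torch.randn(2, 8, 16, 16, requires_grad=True)
out = block(x)
out.sum().backward()  # backward runs without the in-place error
```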

The error still occurred.

Next, following the hint in the error message, I enabled torch.autograd.set_detect_anomaly(True) to print more detailed information.

In the output, I noticed this message: UserWarning: Error detected in CudnnBatchNormBackward0. No forward pass information available.
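To show what anomaly detection does in isolation, here is a small sketch that triggers the same class of error on purpose. The tensors are a toy example and have nothing to do with the original model; exp() is used only because it saves its output for the backward pass.

```python
import torch

# Ask autograd to record forward-pass tracebacks and report the failing op.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, requires_grad=True)
y = x.exp()  # exp() saves its output y for use in backward
y.add_(1)    # in-place edit bumps y's version counter from 0 to 1

message = ""
try:
    y.sum().backward()
except RuntimeError as err:
    # Same family of error as in the post: a tensor needed for gradient
    # computation was modified by an in-place operation.
    message = str(err)

print(message)

# Anomaly detection slows training noticeably, so turn it off after debugging.
torch.autograd.set_detect_anomaly(False)
```

With anomaly detection on, PyTorch also emits a warning naming the backward node that failed, which is how a message like "Error detected in CudnnBatchNormBackward0" shows up.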

Following the explanation in a post on resolving "UserWarning: Error detected in CudnnBatchNormBackward0. No forward pass information available.", I set broadcast_buffers=False when wrapping the model in DistributedDataParallel:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], broadcast_buffers=False)

The error disappeared and training could continue.
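A likely reason this works: by default DistributedDataParallel broadcasts module buffers (such as the BatchNorm running_mean and running_var) from rank 0 at the start of each forward pass, which overwrites them in place. If the model is called more than once before backward, that overwrite can invalidate tensors saved by an earlier forward pass. The sketch below is an assumption-laden, single-process CPU version of the setup, using the gloo backend and a file-based rendezvous so it can run without GPUs; the small model is hypothetical.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group on CPU, for illustration only. A real job would
# normally be launched with one process per GPU.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend="gloo", init_method="file://" + init_file, rank=0, world_size=1
)

# Hypothetical model containing a BatchNorm layer, whose running statistics
# are stored as buffers.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# broadcast_buffers=False stops DDP from re-syncing buffers on every forward.
ddp_model = DDP(model, broadcast_buffers=False)

# Two forward passes before a single backward, the pattern that tends to
# trigger the error when buffers are broadcast.
out_a = ddp_model(torch.randn(2, 3, 16, 16))
out_b = ddp_model(torch.randn(2, 3, 16, 16))
(out_a.sum() + out_b.sum()).backward()

dist.destroy_process_group()
```

One trade-off to keep in mind: with broadcast_buffers=False, each rank keeps its own BatchNorm running statistics, so they are no longer synchronized across processes.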
