Fixing a PaddlePaddle NCCL error [Solved]: An internal check failed. This is either a bug in NCCL or due to memory



keepython · Published 2024-08-15 14:40:46

Problem: paddle.utils.run_check() fails

Error output:

python -c "import paddle; paddle.utils.run_check()"
Running verify PaddlePaddle program ... 
I0815 14:25:08.157060 690548 program_interpreter.cc:212] New Executor is Running.
W0815 14:25:08.158898 690548 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:25:08.160075 690548 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:25:08.350580 690548 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0815 14:25:09.524173 690618 tcp_utils.cc:181] The server starts to listen on IP_ANY:43661
I0815 14:25:09.524746 690618 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43661
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I0815 14:25:09.542655 690619 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43661
I0815 14:25:09.614272 690619 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I0815 14:25:09.714462 690618 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W0815 14:25:09.831360 690619 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:25:09.832465 690619 gpu_resources.cc:164] device: 1, cuDNN Version: 8.9.
W0815 14:25:09.910295 690618 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:25:09.911039 690618 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:25:10.265226 690618 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I0815 14:25:10.307034 690636 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1723703110 (unix time) try "date -d @1723703110" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3e9000a8974) received by PID 690619 (TID 0x73013050a480) from PID 690548 ***]

[2024-08-15 14:25:10,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests 
 to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-08-15 14:25:10,881] [ WARNING] install_check.py:297 - 
 Original Error is: 

----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
    result = func(*args)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
    dp_layer = paddle.DataParallel(layer)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
    sync_params_buffers(self._layers, fuse_params=False)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
    return func(*args, **kwargs)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
    paddle.distributed.broadcast(
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
    return stream.broadcast(
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
    return _broadcast_in_dygraph(
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
    task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
OSError: (External) NCCL error(3), internal error - please report this issue to the NCCL developers. 
  [Hint: 'ncclInternalError'. An internal check failed. This is either a bug in NCCL or due to memory corruption.] (at ../paddle/phi/core/distributed/comm_context_manager.cc:72)


PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 302, in run_check
    raise e
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 283, in run_check
    _run_parallel(device_list)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 210, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 614, in spawn
    while not context.join():
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 423, in join
    self._throw_exception(error_index)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 447, in _throw_exception
    raise Exception(msg)
Exception: 

----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
    result = func(*args)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
    dp_layer = paddle.DataParallel(layer)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
    sync_params_buffers(self._layers, fuse_params=False)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
    return func(*args, **kwargs)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
    paddle.distributed.broadcast(
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
    return stream.broadcast(
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
    return _broadcast_in_dygraph(
  File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
    task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
OSError: (External) NCCL error(3), internal error - please report this issue to the NCCL developers. 
  [Hint: 'ncclInternalError'. An internal check failed. This is either a bug in NCCL or due to memory corruption.] (at ../paddle/phi/core/distributed/comm_context_manager.cc:72)

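The environment was installed with Docker, following the official guide "Docker installation on Linux".

Approach:

Because the Docker image is the official one and I checked that NCCL is installed correctly, the error is most likely a communication problem.
We need to check the communication between the GPUs.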

Check communication

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make

In the official environment this all goes through without errors. Next, test multi-GPU communication:

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

Note: -g 2 here means the test uses two GPUs.
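For readers new to nccl-tests, the remaining flags of that command set the message sizes being swept: -b is the smallest size, -e the largest, and -f the factor by which the size grows at each step. The sketch below only spells out that arithmetic; the helper name sweep_sizes is hypothetical and is not part of nccl-tests.

```python
def sweep_sizes(min_bytes, max_bytes, factor):
    """List the message sizes visited when growing from min_bytes to max_bytes by factor."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


# -b 8 -e 128M -f 2, with 128M read as 128 * 1024 * 1024 bytes
sizes = sweep_sizes(8, 128 * 1024 * 1024, 2)
print(len(sizes), sizes[0], sizes[-1])
```

So a single run covers everything from tiny 8-byte messages up to 128 MB ones, which is why it is a reasonable quick check of the whole communication path.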
When the problem is present, the test output typically looks like this:

bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
kings-System-Product-Name: Test NCCL failure common.cu:992 'internal error'
 .. kings-System-Product-Name pid 10130: Test failure common.cu:925

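This shows that both GPUs are present, but something is wrong with the communication address: NCCL cannot find a usable network interface.

Check the communication address

ifconfig

Output: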


eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether c8:7f:54:a9:fc:1b  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0x86600000-866fffff

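The default interface eno1 turns out not to be configured correctly: it has no IP address and shows no traffic.

So we point NCCL at a different network interface. In my case that is wlp0s20f3: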

export NCCL_SOCKET_IFNAME=wlp0s20f3

Note: replace wlp0s20f3 in NCCL_SOCKET_IFNAME=wlp0s20f3 with the interface name from your own ifconfig output.
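If reading through the interface list by hand is tedious, a small script can pick out the candidates. The following is a minimal sketch and not part of the original fix. It assumes the iproute2 command `ip -o -4 addr show` is available and that an interface with an IPv4 address is a reasonable choice. The function name ipv4_interfaces is hypothetical. Skipping lo and docker* is my own filtering choice, based on NCCL's documentation saying it avoids those interfaces by default.

```python
import subprocess


def ipv4_interfaces(ip_output):
    """Return [(name, address)] for interfaces with an IPv4 address.

    Expects the text printed by `ip -o -4 addr show`, one address per line,
    e.g. "3: wlp0s20f3    inet <address>/<prefix> ...".
    """
    found = []
    for line in ip_output.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[2] != "inet":
            continue
        name = parts[1].split("@")[0]
        address = parts[3].split("/")[0]
        # Assumption: loopback and docker bridges are not useful candidates here.
        if name == "lo" or name.startswith("docker"):
            continue
        found.append((name, address))
    return found


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["ip", "-o", "-4", "addr", "show"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("could not run `ip`; fall back to reading ifconfig by hand")
    else:
        for name, address in ipv4_interfaces(result.stdout):
            print(f"export NCCL_SOCKET_IFNAME={name}  # has address {address}")
```

On my machine eno1 would not be listed, since it has no IPv4 address, while the wireless interface would be.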

You can also add it directly to ~/.bashrc:

vim ~/.bashrc

Add:

export NCCL_SOCKET_IFNAME=wlp0s20f3

Save and exit, then reload bashrc:

source ~/.bashrc
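If you would rather not touch ~/.bashrc, the variable can also be set for a single run only. This is a minimal sketch under that assumption. The helper names nccl_env and run_check_with are hypothetical. NCCL_SOCKET_IFNAME is the variable used above, and NCCL_DEBUG=INFO is NCCL's standard switch for verbose logging, which helps if the check still fails.

```python
import os
import subprocess
import sys


def nccl_env(ifname, debug=False, base=None):
    """Return a copy of the environment with the NCCL settings applied."""
    env = dict(os.environ if base is None else base)
    env["NCCL_SOCKET_IFNAME"] = ifname
    if debug:
        env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log what it is doing
    return env


def run_check_with(ifname, debug=False):
    """Run paddle.utils.run_check() in a child process using that environment."""
    cmd = [sys.executable, "-c", "import paddle; paddle.utils.run_check()"]
    return subprocess.run(cmd, env=nccl_env(ifname, debug)).returncode
```

Calling run_check_with("wlp0s20f3", debug=True) from the same virtual environment as PaddlePaddle runs the check once without changing the shell configuration.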

Test:

python -c "import paddle; paddle.utils.run_check()"

It succeeds:

Running verify PaddlePaddle program ... 
I0815 14:39:17.257519 711714 program_interpreter.cc:212] New Executor is Running.
W0815 14:39:17.258880 711714 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:39:17.259541 711714 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:39:17.449719 711714 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I0815 14:39:18.777268 711798 tcp_utils.cc:107] Retry to connect to 127.0.0.1:41833 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0815 14:39:18.927515 711797 tcp_utils.cc:181] The server starts to listen on IP_ANY:41833
I0815 14:39:18.927894 711797 tcp_utils.cc:130] Successfully connected to 127.0.0.1:41833
I0815 14:39:21.777606 711798 tcp_utils.cc:130] Successfully connected to 127.0.0.1:41833
I0815 14:39:21.778041 711798 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I0815 14:39:21.870291 711797 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W0815 14:39:22.051515 711798 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:39:22.053068 711798 gpu_resources.cc:164] device: 1, cuDNN Version: 8.9.
W0815 14:39:22.107273 711797 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:39:22.108410 711797 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:39:23.415659 711797 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I0815 14:39:23.521483 711855 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
I0815 14:39:23.564513 711798 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
PaddlePaddle works well on 2 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
