Fixing a PaddlePaddle NCCL error [Solved]: An internal check failed. This is either a bug in NCCL or due to memory corruption
Problem: paddle.utils.run_check() fails
Error output:
python -c "import paddle; paddle.utils.run_check()"
Running verify PaddlePaddle program ...
I0815 14:25:08.157060 690548 program_interpreter.cc:212] New Executor is Running.
W0815 14:25:08.158898 690548 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:25:08.160075 690548 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:25:08.350580 690548 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0815 14:25:09.524173 690618 tcp_utils.cc:181] The server starts to listen on IP_ANY:43661
I0815 14:25:09.524746 690618 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43661
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I0815 14:25:09.542655 690619 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43661
I0815 14:25:09.614272 690619 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I0815 14:25:09.714462 690618 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W0815 14:25:09.831360 690619 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:25:09.832465 690619 gpu_resources.cc:164] device: 1, cuDNN Version: 8.9.
W0815 14:25:09.910295 690618 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:25:09.911039 690618 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:25:10.265226 690618 process_group_nccl.cc:132] ProcessGroupNCCL destruct
I0815 14:25:10.307034 690636 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1723703110 (unix time) try "date -d @1723703110" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3e9000a8974) received by PID 690619 (TID 0x73013050a480) from PID 690548 ***]
[2024-08-15 14:25:10,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
1. There is not enough GPUs visible on your system
2. Some GPUs are occupied by other process now
3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-08-15 14:25:10,881] [ WARNING] install_check.py:297 -
Original Error is:
----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
result = func(*args)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
dp_layer = paddle.DataParallel(layer)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
sync_params_buffers(self._layers, fuse_params=False)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
return func(*args, **kwargs)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
return wrapped_func(*args, **kwargs)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
return func(*args, **kwargs)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
paddle.distributed.broadcast(
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
return stream.broadcast(
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
return _broadcast_in_dygraph(
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
OSError: (External) NCCL error(3), internal error - please report this issue to the NCCL developers.
[Hint: 'ncclInternalError'. An internal check failed. This is either a bug in NCCL or due to memory corruption.] (at ../paddle/phi/core/distributed/comm_context_manager.cc:72)
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 614, in spawn
while not context.join():
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 423, in join
self._throw_exception(error_index)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 447, in _throw_exception
raise Exception(msg)
Exception:
----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
result = func(*args)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
dp_layer = paddle.DataParallel(layer)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
sync_params_buffers(self._layers, fuse_params=False)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
return func(*args, **kwargs)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
return wrapped_func(*args, **kwargs)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
return func(*args, **kwargs)
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
paddle.distributed.broadcast(
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
return stream.broadcast(
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
return _broadcast_in_dygraph(
File "/home/yangkelang/program/companyfile/PaddleDetection/venv/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
OSError: (External) NCCL error(3), internal error - please report this issue to the NCCL developers.
[Hint: 'ncclInternalError'. An internal check failed. This is either a bug in NCCL or due to memory corruption.] (at ../paddle/phi/core/distributed/comm_context_manager.cc:72)
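The environment was installed from Docker.
Approach:
The Docker image is the official one, and a check confirmed that NCCL is installed correctly, so the most likely cause is a communication error.
We therefore need to check communication between the GPUs.
Checking communication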
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
In the official environment this all completes without errors. Next, test communication across the GPUs:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
Note: -g 2 here means the test uses two GPUs.
With this problem, the test typically fails with output like:
bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
kings-System-Product-Name: Test NCCL failure common.cu:992 'internal error'
.. kings-System-Product-Name pid 10130: Test failure common.cu:925
This shows that both GPUs are present, but NCCL cannot find a usable network interface to communicate over.
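To see more detail about why the test fails, it can be rerun with NCCL's debug logging enabled. The following is a minimal sketch, assuming the binary was built at ./build/all_reduce_perf as in the steps above. NCCL_DEBUG=INFO is NCCL's standard logging variable; the helper names build_nccl_test_command and run_nccl_test are hypothetical and only used for illustration.

```python
import os
import subprocess


def build_nccl_test_command(num_gpus, binary="./build/all_reduce_perf"):
    """Return the argv for the all_reduce_perf run shown above."""
    # -b / -e: smallest and largest message size, -f: size multiplication
    # factor between steps, -g: number of GPUs per thread.
    return [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(num_gpus)]


def run_nccl_test(num_gpus, extra_env=None):
    """Run the test with NCCL_DEBUG=INFO; return (exit code, combined output)."""
    env = dict(os.environ)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log its initialization steps
    env.update(extra_env or {})  # e.g. {"NCCL_SOCKET_IFNAME": "<your interface>"}
    proc = subprocess.run(
        build_nccl_test_command(num_gpus),
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # a failing test is an expected outcome here
    )
    return proc.returncode, proc.stdout


# Example (needs the compiled binary and two GPUs, so it is not run here):
# code, output = run_nccl_test(2)
# print(output)
```

Passing extra_env makes it easy to compare a run with and without a particular interface setting while leaving the shell environment untouched.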
Checking the network interfaces
ifconfig
Output:
eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether c8:7f:54:a9:fc:1b txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x86600000-866fffff
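The default interface eno1 is not configured correctly: its flags do not include RUNNING and it has no inet (IPv4) address.
So we point NCCL at a different network interface. In my case that is wlp0s20f3: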
export NCCL_SOCKET_IFNAME=wlp0s20f3
Note: replace wlp0s20f3 in NCCL_SOCKET_IFNAME=wlp0s20f3 with an interface from your own ifconfig output.
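Picking a candidate interface from the ifconfig output can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it expects the net-tools ifconfig layout (a "name: flags=...<...>" header followed by indented detail lines), it treats an interface as usable when it is RUNNING and has an IPv4 address, and it skips loopback as a choice of this sketch. The function name usable_interfaces is hypothetical.

```python
import re


def usable_interfaces(ifconfig_output):
    """Return non-loopback interfaces that are RUNNING and have an IPv4 address."""
    usable = []
    # A new interface block starts at a line that is not indented.
    blocks = re.split(r"\n(?=\S)", ifconfig_output.strip())
    for block in blocks:
        header = re.match(r"([^\s:]+): flags=\d+<([^>]*)>", block)
        if not header:
            continue  # not an interface header in the expected layout
        name = header.group(1)
        flags = header.group(2).split(",")
        # "inet " followed by a dotted address; "inet6" lines do not match.
        has_ipv4 = re.search(r"^\s+inet \d+\.\d+\.\d+\.\d+", block, re.MULTILINE)
        if "RUNNING" in flags and "LOOPBACK" not in flags and has_ipv4:
            usable.append(name)
    return usable


# Example (runs the real command, so it is left commented out):
# import subprocess
# print(usable_interfaces(subprocess.check_output(["ifconfig"], text=True)))
```

An interface such as eno1 above, which is UP but not RUNNING and has no inet line, is filtered out by this check.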
You can also add it to ~/.bashrc so that it persists across shells:
vim ~/.bashrc
Add:
export NCCL_SOCKET_IFNAME=wlp0s20f3
Save and exit, then reload bashrc:
source ~/.bashrc
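If editing the shell configuration is not convenient, the same variable can be set from Python before the check runs. The sketch below assumes that setting it in os.environ before NCCL initializes is sufficient, since child processes started afterwards inherit the parent's environment. The helper names set_nccl_interface and run_paddle_check are hypothetical.

```python
import os


def set_nccl_interface(ifname):
    """Set NCCL_SOCKET_IFNAME for this process and any children it starts later."""
    os.environ["NCCL_SOCKET_IFNAME"] = ifname
    return os.environ["NCCL_SOCKET_IFNAME"]


def run_paddle_check(ifname):
    """Set the interface, then run Paddle's install check (needs paddle and GPUs)."""
    set_nccl_interface(ifname)
    # Imported lazily so that set_nccl_interface works without paddle installed.
    import paddle

    paddle.utils.run_check()


# Example (use the interface name from your own ifconfig output):
# run_paddle_check("wlp0s20f3")
```

Setting the variable inside the script keeps the fix next to the code that depends on it, at the cost of hard-coding a machine-specific interface name.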
Test:
python -c "import paddle; paddle.utils.run_check()"
This time it succeeds:
Running verify PaddlePaddle program ...
I0815 14:39:17.257519 711714 program_interpreter.cc:212] New Executor is Running.
W0815 14:39:17.258880 711714 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:39:17.259541 711714 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:39:17.449719 711714 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I0815 14:39:18.777268 711798 tcp_utils.cc:107] Retry to connect to 127.0.0.1:41833 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0815 14:39:18.927515 711797 tcp_utils.cc:181] The server starts to listen on IP_ANY:41833
I0815 14:39:18.927894 711797 tcp_utils.cc:130] Successfully connected to 127.0.0.1:41833
I0815 14:39:21.777606 711798 tcp_utils.cc:130] Successfully connected to 127.0.0.1:41833
I0815 14:39:21.778041 711798 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I0815 14:39:21.870291 711797 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W0815 14:39:22.051515 711798 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:39:22.053068 711798 gpu_resources.cc:164] device: 1, cuDNN Version: 8.9.
W0815 14:39:22.107273 711797 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.0
W0815 14:39:22.108410 711797 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0815 14:39:23.415659 711797 process_group_nccl.cc:132] ProcessGroupNCCL destruct
I0815 14:39:23.521483 711855 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
I0815 14:39:23.564513 711798 process_group_nccl.cc:132] ProcessGroupNCCL destruct
PaddlePaddle works well on 2 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.