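The Docker container was created with the GPU devices attached, but after installing CUDA inside the container the GPU could not be used. Even the simplest command, nvidia-smi, failed with: Failed to initialize NVML: Driver/library version mismatch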

Checking the NVIDIA driver and the NVIDIA libraries separately inside the container shows:

    cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  465.19.01  Fri Mar 19 07:44:41 UTC 2021

    cat /var/log/dpkg.log|grep nvidia

2022-08-14 14:52:45 install libnvidia-cfg1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status half-installed libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-common-470:all <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-compute-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-decode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-encode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-extra-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-fbc1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-fbc1-470:amd64 470.141.03-0ubuntu0.18.04.1
...

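This problem usually occurs because the CUDA version installed in the container is newer than the driver and does not match it. The driver the container uses is the one installed on the host, not the one pulled in when CUDA was installed inside the container. Here the kernel module is 465.19.01 while the libraries installed in the container are 470.141.03.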
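The manual comparison above can be sketched as a small script. This is only an illustration: `driver_version` and `library_version` are hypothetical helper names, and using `libnvidia-compute-*` as the package to inspect is an assumption based on the dpkg log shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any NVIDIA tool) for comparing versions.

# Print the first dotted version number found on stdin, e.g. "465.19.01".
driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Strip the Debian revision from a package version:
# "470.141.03-0ubuntu0.18.04.1" becomes "470.141.03".
library_version() {
    printf '%s\n' "${1%%-*}"
}

# Only run the live check where the NVIDIA kernel module and dpkg are present.
if [ -r /proc/driver/nvidia/version ] && command -v dpkg-query >/dev/null 2>&1; then
    drv="$(driver_version < /proc/driver/nvidia/version)"
    # Assumption: the libnvidia-compute package reflects the library version.
    pkg="$(dpkg-query -W -f='${Version}\n' 'libnvidia-compute-*' 2>/dev/null | head -n 1)"
    lib="$(library_version "$pkg")"
    echo "kernel module: ${drv:-unknown}, user-space libraries: ${lib:-unknown}"
    if [ -n "$drv" ] && [ -n "$lib" ] && [ "$drv" != "$lib" ]; then
        echo "version mismatch"
    fi
fi
```

Run inside the container, this prints both versions side by side, which makes the 465 versus 470 gap visible without reading the whole dpkg log.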

The fix is simple: upgrade the NVIDIA driver on the host to a version no lower than that of the NVIDIA libraries in the container, for example:

    sudo apt install nvidia-driver-470

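Then reboot. The reboot is required: without it, cat /proc/driver/nvidia/version still reports driver 465 rather than the newly installed 470, because the new driver only takes effect after a restart.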
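After the reboot, the loaded driver can be checked against the library version with a short script. This is a minimal sketch: `version_ge` is a hypothetical helper, it relies on GNU `sort -V`, and the `required` value is simply the library version from the dpkg log above.

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort), available on Ubuntu.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Library version seen in the container's dpkg.log; adjust as needed.
required="470.141.03"

# Only run the live check where the NVIDIA kernel module is loaded.
if [ -r /proc/driver/nvidia/version ]; then
    loaded="$(grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' /proc/driver/nvidia/version | head -n 1)"
    if version_ge "$loaded" "$required"; then
        echo "loaded driver $loaded is new enough"
    else
        echo "loaded driver $loaded is older than $required"
    fi
fi
```

If the script still reports the old version, the new kernel module has not been loaded yet.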
