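References
What to do when the GPU cannot be used in Docker (Failed to initialize NVML: Unknown Error)
I followed the post quoted in the article below (reproduced in Appendix 1).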
SOLVED Docker with GPU: “Failed to initialize NVML: Unknown Error”
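Prerequisites for this solution:
You need to be on the server's docker admin list. Full admin rights on the server are not required. When I created my docker container, I asked the administrator to add me to the docker list. If you are able to create docker containers, you already meet this condition.
Problem description:
nvidia-smi works on the host, but inside the docker container it fails with the error in the title.
Solution: apply the fix from the post referenced above, with a few differences:
- My container had no /etc/nvidia-container-runtime/config.toml, so I created one myself. Note that creating this file requires the admin password inside the container (not the administrator password for the docker command on the server host).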
# inside the docker container
cd /etc/nvidia-container-runtime/
sudo touch config.toml
sudo vim config.toml
# paste the config.toml contents shown below into the file
# press ESC, then type :wq to save and quit
- The contents of config.toml were copied from the server and are reproduced below:
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
"docker-runc",
"runc",
]
mode = "auto"
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
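- There is no need to restart docker; restarting the container is enough. This requires being on the server's docker admin list.
The post linked above restarts docker with sudo systemctl restart docker, which needs server admin rights, a fairly high privilege level. I am only on the docker list.
I first ran sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi. (Update 1.18: I did not even run this step; if the problem shows up again, I will try just restarting my docker container.)
Then I restarted my container from the host.
I used docker ps -a to look up my container_id (36e1b3a9c2af), then docker stop <container_id> to stop the container, and docker start <container_id> to start it again.
After that it worked.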
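The stop/start steps above can be wrapped in a small helper. This is only a minimal sketch: the function name restart_and_check and the DOCKER override variable are made up for illustration, and the final docker exec check assumes nvidia-smi is available inside the container.

```shell
# Hypothetical helper: restart one container, then run nvidia-smi inside it.
# DOCKER is an assumption of this sketch; set DOCKER=echo for a dry run
# that only prints the docker subcommands instead of executing them.
restart_and_check() {
    container_id="$1"
    docker_cmd="${DOCKER:-docker}"
    "$docker_cmd" stop "$container_id" &&
    "$docker_cmd" start "$container_id" &&
    "$docker_cmd" exec "$container_id" nvidia-smi
}
```

Calling restart_and_check 36e1b3a9c2af on the host would stop and start that container and then report whether the GPU is visible from inside it. The steps are chained with &&, so a failed stop or start skips the GPU check.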
Appendix 1
I’ve bumped to the same issue after recent update of nvidia related packages. Fortunately, I managed to fix it.
Method 1, recommended
- Kernel parameter
The easiest way to ensure the presence of systemd.unified_cgroup_hierarchy=false param is to check /proc/cmdline:
cat /proc/cmdline
It’s of course related to a method with usage of boot loader. You can hijack this file to set the parameter on runtime https://wiki.archlinux.org/title/Kernel_parameters#Hijacking_cmdline
- nvidia-container configuration
In the file /etc/nvidia-container-runtime/config.toml set the parameter
no-cgroups = false
After that restart docker and run test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
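The kernel-parameter check in Method 1 can also be scripted. The following is a rough sketch and not part of the quoted post: the function name has_cgroup_v1_param is hypothetical, and it only recognizes the exact spelling systemd.unified_cgroup_hierarchy=false used above.

```shell
# Hypothetical helper: succeed if a kernel command line contains
# systemd.unified_cgroup_hierarchy=false as a separate word.
# The command line is taken from $1, or from /proc/cmdline when no
# argument is given.
has_cgroup_v1_param() {
    cmdline="${1-$(cat /proc/cmdline)}"
    # Pad with spaces so the parameter matches only as a whole word.
    case " $cmdline " in
        *" systemd.unified_cgroup_hierarchy=false "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the host, has_cgroup_v1_param && echo "parameter present" reports whether the running kernel was booted with the parameter.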
Method 2
Actually, you can try to bypass cgroupsv2 by setting (in file mentioned above)
no-cgroups = true
Then you must manually pass all gpu devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-851039827
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
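Both methods depend on which no-cgroups value is actually active in config.toml; a line that starts with # is a comment and has no effect. A minimal sketch for checking this, not part of the quoted post, with the hypothetical function name no_cgroups_value:

```shell
# Hypothetical helper: print the active (uncommented) no-cgroups value
# from a config file, or "unset" if there is none.
# The path defaults to the config.toml location discussed above.
no_cgroups_value() {
    config="${1:-/etc/nvidia-container-runtime/config.toml}"
    # Lines starting with '#' do not match, so commented settings are ignored.
    value="$(sed -n 's/^[[:space:]]*no-cgroups[[:space:]]*=[[:space:]]*\([a-z]*\).*$/\1/p' "$config" | tail -n 1)"
    printf '%s\n' "${value:-unset}"
}
```

Since "unset" is printed whenever no uncommented no-cgroups line is found, a file containing only #no-cgroups = false reports "unset" rather than "false".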
Good luck
Last edited by szalinski (2021-06-04 23:41:06)