1. Background
The cloud-rendering pod is made up of four containers: xx, xx, xx, and unity. The unity container image is built on Vulkan (cudagl-related) and cuda-base layers; CUDA here is an NVIDIA driver-level component, comparable to a graphics driver. Symptom: after the unity container starts, neither nvidia-smi nor vulkaninfo will run.
Initial triage:
The container depends on the host machine's GPU resources, so the host must have the NVIDIA driver installed and the container must be able to map the host's devices.
We finally traced the empty nvidia-smi output inside the container to the nvidia-container-toolkit component: it had not mounted the GPU devices into the container, because its nvidia-container-runtime could not be managed and invoked by containerd.
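The diagnosis above can be reproduced with two quick host-side checks (a sketch; paths assume a default CentOS install of nvidia-container-toolkit and containerd):

```shell
# 1. Is the NVIDIA runtime shim installed at all?
if command -v nvidia-container-runtime >/dev/null 2>&1; then
  echo "nvidia-container-runtime: $(command -v nvidia-container-runtime)"
else
  echo "nvidia-container-runtime not found in PATH"
fi

# 2. Does containerd's config reference it? If not, containerd cannot
#    hand GPU devices to the container, which matches the symptom above.
cfg=/etc/containerd/config.toml
if [ -f "$cfg" ] && grep -q nvidia-container-runtime "$cfg"; then
  echo "containerd config references nvidia-container-runtime"
else
  echo "containerd config is missing the nvidia runtime entry"
fi

checks_done=yes
```

Both checks print a finding rather than failing, so they are safe to run on non-GPU machines too.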
2. Deployment
2.1. Install the NVIDIA driver on the host
- Pick the operating system and package on the NVIDIA download site and download the matching driver version
- Run the installer on the host
chmod a+x NVIDIA-Linux-x86_64-460.73.01.run && ./NVIDIA-Linux-x86_64-460.73.01.run --ui=none --no-questions
- Verify on the host by running nvidia-smi; if it prints the usual device table, the install succeeded
- CUDA driver install
- Note: this is already baked into the packaged container image and can be skipped
- The driver version can be downloaded from the official site
- Add the nvidia-docker repository and install the nvidia-container-toolkit package
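A compact verification sketch that avoids the full table (the query flags below are standard nvidia-smi options):

```shell
# Print just the GPU model and driver version; degrade gracefully when
# nvidia-smi is absent (e.g. on a non-GPU build host).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
  echo "nvidia-smi not found -- driver not installed on this host"
fi
driver_verified=yes
```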
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-container-toolkit
- Install X for a graphical desktop
- Edit /etc/X11/xorg.conf so the PCI BusID matches the one reported by nvidia-smi
- Start the gdm service
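A sketch of looking up the GPU's PCI address for the xorg.conf BusID. Note one trap: nvidia-smi prints domain:bus:device.function in hex, while xorg.conf expects the decimal `PCI:bus:device:function` form (nvidia-xconfig can write it for you):

```shell
# Query the PCI bus id of the first GPU; harmless no-op without a GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
  busid=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader | head -n1)
  echo "GPU PCI bus id (hex, from nvidia-smi): $busid"
  # After converting to decimal, apply it with e.g.:
  #   nvidia-xconfig --busid="PCI:0:7:0"    # needs root; rewrites xorg.conf
else
  echo "nvidia-smi unavailable; skipping BusID lookup"
fi
busid_done=yes
```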
2.2. Deploy the driver plugin in the k8s cluster
- Deploy the NVIDIA GPU device plugin in the cluster (the manifest is reproduced below; note that a github.com/.../blob/... link serves an HTML page, so feed kubectl the raw.githubusercontent.com form of the URL or a local copy of the file)
kubectl apply -f nvidia-device-plugin.yml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0-rc.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
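Once the DaemonSet is running, the plugin advertises an nvidia.com/gpu resource on each GPU node; a sketch of confirming that before testing inside a container:

```shell
# Check the plugin pods and whether any node advertises nvidia.com/gpu.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
  kubectl describe nodes | grep -i 'nvidia.com/gpu' \
    || echo "no node advertises nvidia.com/gpu yet"
else
  echo "kubectl not available on this machine"
fi
plugin_check=yes
```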
- Enter the unity container and run nvidia-smi as a test
- Or test directly with containerd's ctr CLI (note that ctr run requires a container ID argument before the command)
ctr images pull docker.io/nvidia/cuda:9.0-base
ctr run --rm -t --gpus 0 docker.io/nvidia/cuda:9.0-base gpu-test nvidia-smi
3. Troubleshooting
3.1. Lead 1: sealos reports an error when the node joins the cluster
- After configuring the host, join the cluster with sealos
[root@iZbp1329l07uu7gp2xxijhZ ~]# sealos join --node xx.xx.xx.xx
15:26:33 [EROR] [check.go:91] docker exist error when kubernetes version >= 1.20.
sealos install kubernetes version >= 1.20 use containerd cri instead.
please uninstall docker on [[10.0.1.88:22]]. For example run on centos7: "yum remove docker-ce containerd-io -y",
see details: https://github.com/fanux/sealos/issues/582
- 因?yàn)橹霸诩尤爰褐?,安裝了docker-ce進(jìn)行測試,和kubernetes下載的運(yùn)行時(shí)containerd相沖突,根據(jù)提示需要將這些刪除
- 根據(jù)官網(wǎng)安裝步驟
- 更新yum源并添加源
- 安裝docker-ce
- 安裝nvidia container tookit,參見宿主機(jī)安裝過程
- 安裝nvidia-docker2
- 驗(yàn)證,容器內(nèi)是否能映射到gpu資源
yum update -y
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum makecache fast
yum install docker-ce -y
systemctl --now enable docker
yum clean expire-cache
yum install -y nvidia-docker2
systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Conclusion:
These steps had been run before joining the k8s cluster. docker-ce and containerd.io must be uninstalled first; only after sealos has joined the node to the cluster should nvidia-docker2 be installed.
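A non-destructive sketch of spotting the conflicting packages before joining (remove them with `yum remove docker-ce containerd.io -y`, as the sealos message suggests, then reinstall nvidia-docker2 after the join):

```shell
# Scan for the packages that make `sealos join` refuse to proceed.
for pkg in docker-ce containerd.io; do
  if command -v rpm >/dev/null 2>&1 && rpm -q "$pkg" >/dev/null 2>&1; then
    echo "$pkg is installed -- remove it before sealos join"
  else
    echo "$pkg not installed (or not an rpm-based system)"
  fi
done
conflict_scan=yes
```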
3.2. Lead 2: the cluster's containerd did not load the plugin, and docker failed to start
- After the node joined, docker errored out once daemon.json was modified
[root@al-media-other-03 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2022-11-15 17:29:31 CST; 7s ago
Docs: https://docs.docker.com
Process: 17379 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE)
Main PID: 17379 (code=exited, status=1/FAILURE)
Nov 15 17:29:28 al-media-other-03 systemd[1]: Failed to start Docker Application Container Engine.
Nov 15 17:29:28 al-media-other-03 systemd[1]: Unit docker.service entered failed state.
Nov 15 17:29:28 al-media-other-03 systemd[1]: docker.service failed.
Nov 15 17:29:31 al-media-other-03 systemd[1]: docker.service holdoff time over, scheduling restart.
Nov 15 17:29:31 al-media-other-03 systemd[1]: Stopped Docker Application Container Engine.
Nov 15 17:29:31 al-media-other-03 systemd[1]: start request repeated too quickly for docker.service
Nov 15 17:29:31 al-media-other-03 systemd[1]: Failed to start Docker Application Container Engine.
Nov 15 17:29:31 al-media-other-03 systemd[1]: Unit docker.service entered failed state.
Nov 15 17:29:31 al-media-other-03 systemd[1]: docker.service failed.
- Following the official docs, edit daemon.json as below, then restart docker
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
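A malformed daemon.json produces exactly the start-limit restart loop shown in the systemd log above, so it is worth syntax-checking the file before restarting docker. A defensive sketch:

```shell
# Validate /etc/docker/daemon.json before touching the docker service.
f=/etc/docker/daemon.json
if command -v python3 >/dev/null 2>&1 && [ -f "$f" ]; then
  if python3 -m json.tool "$f" >/dev/null 2>&1; then
    echo "daemon.json is valid JSON; safe to run: systemctl restart docker"
  else
    echo "daemon.json has a syntax error -- fix it before restarting"
  fi
else
  echo "$f not present (or python3 unavailable); skipping check"
fi
json_checked=yes
```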
- After the node joins the cluster, containerd's config file does not load nvidia-container-runtime
- Per the official docs above, first run containerd config default > /etc/containerd/config.toml to generate the default config, then edit /etc/containerd/config.toml as below: switch the default runtime from runc to nvidia and add the plugin entry, then restart containerd
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
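After editing, a quick sketch to confirm the two key lines actually landed before restarting (the restart itself needs root, so it is left commented out here):

```shell
# Verify the nvidia runtime entries exist in containerd's config.
cfg=/etc/containerd/config.toml
if [ -f "$cfg" ]; then
  grep -n -e 'default_runtime_name' -e 'BinaryName' "$cfg" \
    || echo "expected entries not found in $cfg"
  # systemctl restart containerd
else
  echo "$cfg not found on this machine"
fi
containerd_checked=yes
```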
Conclusion:
Both docker's and containerd's config files must be changed so that nvidia-container-runtime is loaded at runtime.
3.3. Lead 3: errors in the nvidia-device-plugin pod log
- When the driver plugin yaml was deployed earlier, the pod log showed errors
[root@al-master-01 ~]# kubectl logs nvidia-device-plugin-daemonset-4qdqw -n kube-system
2022/11/15 03:43:58 Loading NVML
2022/11/15 03:43:58 Failed to initialize NVML: could not load NVML library.
2022/11/15 03:43:58 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2022/11/15 03:43:58 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/11/15 03:43:58 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/11/15 03:43:58 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
- Searching the official docs pointed to nvidia-container-runtime not being loaded; this lead was not resolved on its own
- After setting a nodeSelector in deployment.yaml so the pod is pinned exclusively to the GPU node, nvidia-smi and vulkaninfo now run inside the container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      labels:
        app: cuda-vector-add
    spec:
      nodeSelector:
        node-scope: gpu-node
      imagePullSecrets:
      - name: xxx
      containers:
      - name: cuda-vector-add
        image: "k8s.gcr.io/cuda-vector-add:v0.1"
        imagePullPolicy: IfNotPresent
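The nodeSelector above only matches nodes carrying the node-scope=gpu-node label, so the GPU node must be labeled first. A sketch ("gpu-node-01" is a placeholder node name, not from the original setup):

```shell
# Label the GPU node and confirm the selector would match it.
if command -v kubectl >/dev/null 2>&1; then
  kubectl label nodes gpu-node-01 node-scope=gpu-node --overwrite
  kubectl get nodes -l node-scope=gpu-node
else
  echo "kubectl not available on this machine"
fi
label_step=yes
```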