Notes: Fixing "Failed to initialize NVML: Unknown Error" on NVIDIA A100 GPUs in Kubernetes Pods
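Problem description
For a project, we needed to run GPU workloads on Kubernetes (k8s), preferably on NVIDIA A100 cards. We did a basic deployment of the NVIDIA device plugin for Kubernetes (GitHub: NVIDIA/k8s-device-plugin) following the official documentation. Afterwards, the GPU disappeared from the pod after it had been running for a while. Inside the container, nvidia-smi failed with "Failed to initialize NVML: Unknown Error". After deleting and recreating the container, nvidia-smi worked at first, but the same error came back after about 10 seconds.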
Problem analysis
Several people have reported this problem on GitHub, for example:
nvidia-smi command in container returns "Failed to initialize NVML: Unknown Error" after couple of times · Issue #1678 · NVIDIA/nvidia-docker
"Failed to initialize NVML: Unknown Error" after random amount of time · Issue #1671 · NVIDIA/nvidia-docker
The discussions describe the same symptom we saw. The command stops working because, after a while, the GPU devices disappear from the container's devices.list (path: /sys/fs/cgroup/devices/devices.list).
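To check for this symptom from inside a container, one option is to look for NVIDIA entries in devices.list. The sketch below is only an illustration: the function name `gpu_allowed_in_cgroup` is made up here, it assumes cgroup v1 (where devices.list exists), and it relies on the NVIDIA character devices using major number 195. An "a *:* rwm" entry (all devices allowed, as in a privileged container) is also treated as a pass.

```shell
# Succeed when the given devices.list grants access to NVIDIA character
# devices (major 195) or to all devices ("a *:* rwm").
# Returns 2 when the file cannot be read (for example on cgroup v2).
gpu_allowed_in_cgroup() {
    list_file=${1:-/sys/fs/cgroup/devices/devices.list}
    [ -r "$list_file" ] || return 2
    grep -Eq '^(a \*:\*|c 195:)' "$list_file"
}

# Example: report the state of the current container, if the file exists.
if [ -r /sys/fs/cgroup/devices/devices.list ]; then
    if gpu_allowed_in_cgroup; then
        echo "GPU device rules present"
    else
        echo "GPU device rules missing"
    fi
fi
```

Running this right after the container starts and again once nvidia-smi begins to fail should show the rules going from present to missing if the container is hitting the problem described here.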
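The trigger is the k8s CPU manager policy being set to static. Changing the policy to none does make the problem go away, so if you have no strict requirement on the CPU manager policy, you can stop here. We require the static policy, so we traced the problem further to the following GitHub issue:
Updating cpu-manager-policy=static causes NVML unknown error · Issue #966 · NVIDIA/nvidia-docker
For the root cause, see https://zhuanlan.zhihu.com/p/344561710. In short, as discussed in that issue, with the static policy the kubelet periodically updates the container's resources, and that update resets the device cgroup rules and drops the GPU devices that were injected without the kubelet's knowledge.
The comment at https://github.com/NVIDIA/nvidia-docker/issues/966#issuecomment-610928514 describes a fix, and the official plugin has supported it for several releases: add the parameter **--pass-device-specs=true** when deploying the plugin. On rereading the official deployment documentation, we did find this parameter described there. After deploying with it, though, the problem persisted. Going back to the discussions, we found that the fix depends on the runc version (https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1330466432). Our version was 1.1.4; after downgrading runc, the problem was solved.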
Solution steps
- Check the runc version. If it is lower than 1.1.3, skip directly to step 3 (installing the NVIDIA GPU plugin):
      # runc -v
      runc version 1.1.4
      commit: v1.1.4-0-xxxxx
      spec: 1.0.2-dev
      go: go1.17.10
      libseccomp: 2.5.3
- Change the runc version:
  - Download the runc release you want. This article uses 1.1.2 (https://github.com/opencontainers/runc/releases/tag/v1.1.2).
  - Upload the downloaded runc.amd64 file to the server, rename it, and make it executable:
        mv runc.amd64 runc && chmod +x runc
  - Back up the existing runc:
        mv /usr/bin/runc /home/runcbak
  - Stop docker:
        systemctl stop docker
  - Install the replacement runc:
        cp runc /usr/bin/runc
  - Start docker:
        systemctl start docker
  - Check that runc was replaced successfully:
        # runc -v
        runc version 1.1.2
        commit: v1.1.2-0-ga916309f
        spec: 1.0.2-dev
        go: go1.17.10
        libseccomp: 2.5.3
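The version check in step 1 and the verification at the end of step 2 can also be scripted. The following is a minimal sketch, not part of the original procedure: the helper names `parse_runc_version` and `version_lt` are made up for illustration, and the 1.1.3 threshold is simply the one this article uses.

```shell
# Read `runc --version` output on stdin and print the version number,
# e.g. "1.1.4" from the line "runc version 1.1.4".
parse_runc_version() {
    awk '/^runc version/ { print $3; exit }'
}

# Return 0 (true) when dotted version $1 is strictly lower than $2.
# Components are compared numerically, left to right; missing ones count as 0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1
    }'
}

# Example: only runs when runc is installed on this machine.
if command -v runc >/dev/null 2>&1; then
    current=$(runc --version | parse_runc_version)
    if version_lt "$current" 1.1.3; then
        echo "runc $current is below 1.1.3: no downgrade needed"
    else
        echo "runc $current is 1.1.3 or later: consider the downgrade described above"
    fi
fi
```

Comparing components numerically avoids the trap of a plain string comparison, which would sort 1.1.10 before 1.1.3.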
- Install the NVIDIA GPU plugin:
  - Create plugin.yml. The main difference from a default deployment is the PASS_DEVICE_SPECS environment variable:
        # You may obtain a copy of the License at
        #
        #     http://www.apache.org/licenses/LICENSE-2.0
        #
        # Unless required by applicable law or agreed to in writing, software
        # distributed under the License is distributed on an "AS IS" BASIS,
        # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        # See the License for the specific language governing permissions and
        # limitations under the License.

        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: nvidia-device-plugin-daemonset
          namespace: kube-system
        spec:
          selector:
            matchLabels:
              name: nvidia-device-plugin-ds
          updateStrategy:
            type: RollingUpdate
          template:
            metadata:
              labels:
                name: nvidia-device-plugin-ds
            spec:
              tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
              # Mark this pod as a critical add-on; when enabled, the critical add-on
              # scheduler reserves resources for critical add-on pods so that they can
              # be rescheduled after a failure.
              # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
              priorityClassName: "system-node-critical"
              containers:
              - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
                name: nvidia-device-plugin-ctr
                env:
                  - name: FAIL_ON_INIT_ERROR
                    value: "false"
                  - name: PASS_DEVICE_SPECS
                    value: "true"
                securityContext:
                  privileged: true
                volumeMounts:
                - name: device-plugin
                  mountPath: /var/lib/kubelet/device-plugins
              volumes:
              - name: device-plugin
                hostPath:
                  path: /var/lib/kubelet/device-plugins
  - Create the plugin:
        $ kubectl create -f plugin.yml
- Create a GPU pod and verify.
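The article does not show the verification pod itself, so here is one possible sketch. Everything specific in it is an assumption for illustration: the pod name `gpu-nvml-check`, the manifest path, and the image tag `nvidia/cuda:11.8.0-base-ubuntu22.04` (any image that ships nvidia-smi should do). The container lists the GPUs every 10 seconds and exits as soon as nvidia-smi fails, which makes the delayed failure described above easy to spot.

```shell
# Write a minimal test pod manifest. The pod name, file path, and image tag
# are assumptions for illustration; substitute your own.
manifest="${TMPDIR:-/tmp}/gpu-nvml-check.yml"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-nvml-check
spec:
  restartPolicy: Never
  containers:
  - name: nvml-check
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    # Print the GPU list every 10 seconds; stop at the first failure.
    command: ["sh", "-c", "while true; do date; nvidia-smi -L || exit 1; sleep 10; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
echo "wrote $manifest"

# Then, against your cluster:
#   kubectl apply -f "$manifest"
#   kubectl logs -f gpu-nvml-check
#   kubectl delete pod gpu-nvml-check
```

If the log keeps listing the GPU for well beyond the roughly 10 seconds after which the error used to appear, the fix is probably holding; leave the pod running longer for more confidence.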
Appendix
Switching the CPU manager policy
- Stop kubelet:
      systemctl stop kubelet
- Delete cpu_manager_state:
      rm /var/lib/kubelet/cpu_manager_state
- Edit config.yaml:
      vi /var/lib/kubelet/config.yaml

      apiVersion: kubelet.config.k8s.io/v1beta1
      authentication:
        anonymous:
          enabled: false
        webhook:
          cacheTTL: 0s
          enabled: true
        x509:
          clientCAFile: /etc/kubernetes/pki/ca.crt
      authorization:
        mode: Webhook
        webhook:
          cacheAuthorizedTTL: 0s
          cacheUnauthorizedTTL: 0s
      cgroupDriver: systemd
      clusterDNS:
      - 10.96.0.10
      clusterDomain: cluster.local
      # Set the CPU manager policy: none or static
      cpuManagerPolicy: static
      cpuManagerReconcilePeriod: 0s
      evictionPressureTransitionPeriod: 0s
      featureGates:
        TopologyManager: true
      fileCheckFrequency: 0s
      healthzBindAddress: 127.0.0.1
      healthzPort: 10248
      httpCheckFrequency: 0s
      imageMinimumGCAge: 0s
      kind: KubeletConfiguration
      logging: {}
      memorySwap: {}
      nodeStatusReportFrequency: 0s
      nodeStatusUpdateFrequency: 0s
      podPidsLimit: 4096
      reservedSystemCPUs: 0,1
      resolvConf: /run/systemd/resolve/resolv.conf
      rotateCertificates: true
      runtimeRequestTimeout: 0s
      shutdownGracePeriod: 0s
      shutdownGracePeriodCriticalPods: 0s
      staticPodPath: /etc/kubernetes/manifests
      streamingConnectionIdleTimeout: 0s
      syncFrequency: 0s
      tlsCipherSuites:
      - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
      - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
      - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      tlsMinVersion: VersionTLS12
      topologyManagerPolicy: best-effort
      volumeStatsAggPeriod: 0s
- Start kubelet:
      systemctl start kubelet
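Before and after switching, it can help to confirm which policy the kubelet configuration file actually sets. The sketch below is an illustration only: the function name `get_cpu_manager_policy` is made up, and it assumes cpuManagerPolicy appears as a top-level key on its own line, as in the file shown above. When the key is absent it prints none, which is the kubelet default.

```shell
# Print the cpuManagerPolicy value from a kubelet config file.
# Prints "none" (the kubelet default) when the key is not set.
# Returns 1 when the file cannot be read.
get_cpu_manager_policy() {
    config=${1:-/var/lib/kubelet/config.yaml}
    [ -r "$config" ] || return 1
    policy=$(awk '$1 == "cpuManagerPolicy:" { print $2; exit }' "$config" | tr -d "\"'")
    printf '%s\n' "${policy:-none}"
}

# Example: only runs when the default config file exists on this machine.
if [ -r /var/lib/kubelet/config.yaml ]; then
    echo "cpuManagerPolicy: $(get_cpu_manager_policy)"
fi
```

This reads the file only; it does not tell you what a running kubelet has loaded, so check again after the restart.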
Changing the containerd version
https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1238644201
Reference: https://blog.csdn.net/Ivan_Wz/article/details/111932120
- Download the containerd binary release from GitHub (https://github.com/containerd/containerd/releases/tag/v1.6.16).
- Extract containerd:
      tar -zxvf containerd-1.6.16-linux-amd64.tar.gz
- Check the current containerd version:
      docker info
      containerd -v
- Stop docker:
      systemctl stop docker
- Replace the containerd binaries (run these from the bin directory that the archive extracts to):
      cp containerd /usr/bin/containerd
      cp containerd-shim /usr/bin/containerd-shim
      cp containerd-shim-runc-v1 /usr/bin/containerd-shim-runc-v1
      cp containerd-shim-runc-v2 /usr/bin/containerd-shim-runc-v2
      cp ctr /usr/bin/ctr
- Restart docker and check that the containerd version was replaced.
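Unlike the runc steps, the containerd steps above overwrite the installed binaries without keeping a copy. If you want a way back, a small helper along these lines could keep a backup first. This is a sketch, not part of the original procedure: the function name `replace_binary` and the backup directory in the usage comment are made up for illustration.

```shell
# Copy src over dst, first saving any existing dst as <backup_dir>/<name>.bak.
# Returns non-zero when src is missing or when any copy step fails.
replace_binary() {
    src=$1
    dst=$2
    backup_dir=$3
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    if [ -f "$dst" ]; then
        # -p keeps the mode and timestamps of the original binary.
        cp -p "$dst" "$backup_dir/$(basename "$dst").bak" || return 1
    fi
    cp "$src" "$dst" && chmod 0755 "$dst"
}

# Example (run as root, with docker stopped, from the extracted bin directory):
#   for b in containerd containerd-shim containerd-shim-runc-v1 containerd-shim-runc-v2 ctr; do
#       replace_binary "$b" "/usr/bin/$b" /root/containerd-backup || break
#   done
```

To roll back, stop docker again and copy the .bak files back to their original names.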