Resolving "Failed to initialize NVML: Unknown Error" on an NVIDIA A100 GPU in a K8s Pod

Posted 2 years ago · Author: jiageibuuuyi · Category: Toy Blog

Problem description

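For project reasons we needed to run GPU code on k8s, preferably on NVIDIA A100 cards. After deploying the NVIDIA device plugin for Kubernetes (GitHub - NVIDIA/k8s-device-plugin) by following the official documentation, GPUs in pods went missing after running for a while: inside the container, nvidia-smi failed with "Failed to initialize NVML: Unknown Error". After deleting and recreating the container, nvidia-smi worked at first, but the same error came back after about 10 seconds.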

Problem analysis

Several people have reported this problem on GitHub, for example:

nvidia-smi command in container returns “Failed to initialize NVML: Unknown Error” after couple of times · Issue #1678 · NVIDIA/nvidia-docker · GitHub

“Failed to initialize NVML: Unknown Error” after random amount of time · Issue #1671 · NVIDIA/nvidia-docker · GitHub

From those discussions, our symptoms match what others have seen: the command stops working because, after a while, the GPU devices disappear from devices.list (path: /sys/fs/cgroup/devices/devices.list).
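To confirm this symptom inside an affected container, one can check whether devices.list still allows the NVIDIA character devices. The following is a minimal sketch, assuming cgroup v1 and the conventional character major number 195 for /dev/nvidia* devices; the function name `has_nvidia_devices` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a cgroup v1 devices.list (read from stdin)
# still contains a rule that allows the NVIDIA character devices.
# Assumption: /dev/nvidia0, /dev/nvidiactl, ... use character major 195.
# A privileged container shows "a *:* rwm" (everything allowed) instead.
has_nvidia_devices() {
    grep -Eq '^(a \*:\* |c 195:)'
}
```

Inside the container it could be used as `has_nvidia_devices < /sys/fs/cgroup/devices/devices.list || echo "GPU devices missing from devices.list"`.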

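The cause is that the k8s CPU manager policy is set to static. Changing the policy to none does resolve the problem, so if you have no strict requirement on the CPU manager policy, you can stop here. We do require the static policy, however, so we traced the problem further to the following GitHub issue.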

Updating cpu-manager-policy=static causes NVML unknown error · Issue #966 · NVIDIA/nvidia-docker · GitHub

For an explanation of the root cause, see https://zhuanlan.zhihu.com/p/344561710

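In https://github.com/NVIDIA/nvidia-docker/issues/966#issuecomment-610928514 the author describes a fix, and the official plugin has shipped a related solution for several releases: add the parameter **--pass-device-specs=true** when deploying the official plugin. Rereading the official deployment documentation, we did find this parameter described. After deploying with it, however, the problem was still there. Going back to the discussions, we found that the runc version matters as well (https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1330466432). Our version was 1.1.4; after downgrading runc, the problem was solved.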

Solution steps

Check the runc version; if it is lower than 1.1.3, skip ahead to step 3 (installing the NVIDIA GPU plugin):

# runc -v
runc version 1.1.4
commit: v1.1.4-0-xxxxx
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.3
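The version check above can also be scripted. The following is a small sketch, assuming `sort -V` (version sort, as in GNU coreutils) is available; the function names `runc_version` and `version_lt` are hypothetical.

```shell
#!/bin/sh
# Extract the version number from `runc -v` output read on stdin.
runc_version() {
    sed -n 's/^runc version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeeds if version $1 is strictly lower than version $2.
# Relies on `sort -V` to order dotted version strings.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_lt "$(runc -v | runc_version)" 1.1.3 && echo "runc is below 1.1.3, no downgrade needed"`.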

Replace the runc version:
- Download the target runc release; this article uses 1.1.2 (https://github.com/opencontainers/runc/releases/tag/v1.1.2)
- Upload the downloaded runc.amd64 file to the server, rename it, and make it executable
```
mv runc.amd64 runc && chmod +x runc
```
- Back up the existing runc
```
mv /usr/bin/runc /home/runcbak
```
- Stop docker
```
systemctl stop docker
```
- Put the replacement runc in place
```
cp runc /usr/bin/runc
```
- Start docker
```
systemctl start docker
```
- Check that runc was replaced successfully
```
# runc -v
runc version 1.1.2
commit: v1.1.2-0-ga916309f
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.3
```
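The steps above can be collected into one function. This is only a sketch under stated assumptions: the names `run` and `replace_runc` are hypothetical, the paths mirror the ones used above, and the order is slightly changed so that docker is stopped before /usr/bin/runc is touched and the backup is a copy rather than a move. By default it only prints the commands.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) each command is only printed;
# set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Replace /usr/bin/runc with the binary given as $1.
replace_runc() {
    new_runc=$1                      # e.g. the downloaded runc.amd64
    run chmod +x "$new_runc" &&
    run systemctl stop docker &&
    run cp /usr/bin/runc /home/runcbak &&
    run cp "$new_runc" /usr/bin/runc &&
    run systemctl start docker &&
    run runc -v
}
```

Running `replace_runc ./runc.amd64` first shows the planned commands; `DRY_RUN=0 replace_runc ./runc.amd64` as root would perform them.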

Install the NVIDIA GPU plugin

Create plugin.yml. The main difference from a standard deployment is the PASS_DEVICE_SPECS environment variable.

# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
          - name: PASS_DEVICE_SPECS
            value: "true"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Create the plugin
```
$ kubectl create -f plugin.yml
```

Create a GPU pod and verify
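The original gives no manifest for this verification step. One possible sketch is shown here, assuming the image `nvidia/cuda:11.8.0-base-ubuntu22.04` and the hypothetical pod name `gpu-nvml-check`. The pod calls nvidia-smi every 10 seconds, so a GPU that is lost after start-up shows up in the logs and the pod ends in an error state.

```shell
#!/bin/sh
# Print a test pod manifest that requests one GPU.
# The pod name and image are assumptions for illustration only.
gpu_test_pod() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-nvml-check
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sh", "-c", "while true; do nvidia-smi -L || exit 1; sleep 10; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}
```

It could be applied with `gpu_test_pod | kubectl apply -f -` and observed with `kubectl logs -f gpu-nvml-check`; if the GPU list keeps printing well past the first 10 seconds, the fix is holding.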

Appendix

Switching the CPU manager policy

Stop kubelet
```
systemctl stop kubelet
```
Delete cpu_manager_state
```
rm /var/lib/kubelet/cpu_manager_state
```

Edit config.yaml

vi /var/lib/kubelet/config.yaml

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local

# Set the CPU manager policy: none or static
cpuManagerPolicy: static

cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
featureGates:
  TopologyManager: true
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging: {}
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
podPidsLimit: 4096
reservedSystemCPUs: 0,1
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
tlsCipherSuites:
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
tlsMinVersion: VersionTLS12
topologyManagerPolicy: best-effort
volumeStatsAggPeriod: 0s

Start kubelet
```
systemctl start kubelet
```
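After kubelet restarts it recreates cpu_manager_state, so the active policy can be read back from that file. This is a minimal sketch, assuming the file is a single JSON object containing a `"policyName"` field; the function name `cpu_manager_policy` is hypothetical.

```shell
#!/bin/sh
# Print the policyName recorded in a kubelet cpu_manager_state file (stdin).
# Assumption: the file is one JSON object such as
#   {"policyName":"static","defaultCpuSet":"...","checksum":...}
cpu_manager_policy() {
    sed -n 's/.*"policyName"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

For example, `cpu_manager_policy < /var/lib/kubelet/cpu_manager_state` should print the policy set in config.yaml.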

Changing the containerd version

https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1238644201

See https://blog.csdn.net/Ivan_Wz/article/details/111932120

Download the containerd binary release from GitHub (https://github.com/containerd/containerd/releases/tag/v1.6.16)

Extract containerd

tar -zxvf containerd-1.6.16-linux-amd64.tar.gz

Check the current containerd version
```
docker info 
containerd -v
```
Stop docker
```
systemctl stop docker
```

Replace the containerd binaries (run these from the bin directory extracted from the archive)

cp containerd /usr/bin/containerd
cp containerd-shim /usr/bin/containerd-shim
cp containerd-shim-runc-v1 /usr/bin/containerd-shim-runc-v1
cp containerd-shim-runc-v2 /usr/bin/containerd-shim-runc-v2
cp ctr /usr/bin/ctr

Restart docker and check that the containerd version was replaced.
