Date | Version | Description | Author |
---|---|---|---|
2022-06-09 15:33:12 | V0.1 | Created the K8S GPU resource configuration guide; wrote up the NVIDIA driver installation process | |
2022-06-10 11:16:52 | V0.2 | Added the section on calling GPUs under K8S container orchestration | |
Introduction
Document description
This document describes the process of configuring Kubernetes to use GPU resources. It records the problems encountered during configuration and how they were resolved, and it documents the verification result of each step in detail, which should help with understanding both Kubernetes usage and GPU resources. The text mainly covers calling GPUs from the TensorFlow framework, and also covers GPU support for PyTorch. It is intended for operations and development staff.
Configuration goals
The main goal of the configuration is to schedule the underlying GPU resources through a yaml file. To achieve this, the following sub-goals must be met:
- Install the NVIDIA GPU driver
- Verify the driver installation with a TensorFlow example program, i.e. verify that a Docker container can use GPU resources
- Install k8s-device-plugin
- Verify with an example program that k8s-device-plugin started successfully, i.e. verify that GPU resources can be used under k8s container orchestration
Overall architecture
Docker uses containers to create virtual environments that isolate a TensorFlow installation from the rest of the system. TensorFlow programs run inside this virtual environment, which can still share resources with its host (access directories, use the GPU, connect to the internet, and so on). Throughout this document, the nvidia-docker2 runtime is installed on every GPU node, while k8s-device-plugin is what lets k8s pods reach the GPU resources, which it in turn does through nvidia-docker2.
Environment
The current environment consists of two servers, each with eight GPUs; running nvidia-smi shows the model as NVIDIA A100-SXM4-40GB. The details are as follows:
root@node33-a100:/mnt/nas# nvidia-smi
Thu Jun 9 07:34:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:07:00.0 Off | 0 |
| N/A 28C P0 59W / 400W | 31922MiB / 39538MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:0A:00.0 Off | 0 |
| N/A 24C P0 58W / 400W | 32724MiB / 39538MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | 0 |
| N/A 34C P0 183W / 400W | 26509MiB / 39538MiB | 58% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:4D:00.0 Off | 0 |
| N/A 31C P0 83W / 400W | 14036MiB / 39538MiB | 15% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:87:00.0 Off | 0 |
| N/A 29C P0 75W / 400W | 24175MiB / 39538MiB | 26% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... Off | 00000000:8D:00.0 Off | 0 |
| N/A 25C P0 60W / 400W | 31039MiB / 39538MiB | 31% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... Off | 00000000:C7:00.0 Off | 0 |
| N/A 24C P0 58W / 400W | 31397MiB / 39538MiB | 11% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... Off | 00000000:CA:00.0 Off | 0 |
| N/A 28C P0 61W / 400W | 26737MiB / 39538MiB | 20% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
The meaning of each field in the output above is covered by the nvidia-smi documentation; for common nvidia-smi commands and their options, see the nvidia-smi introduction referenced in the original article. A small programmatic check of the same metrics is sketched below.
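The same per-GPU information can also be read programmatically via NVML. The sketch below is not from the original article; it assumes the nvidia-ml-py package (imported as pynvml) is installed on the node.

# Minimal sketch: read the per-GPU metrics that nvidia-smi prints,
# assuming the nvidia-ml-py package (imported as pynvml) is installed.
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):          # older bindings return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: {name}, "
              f"{mem.used // 1024**2}MiB / {mem.total // 1024**2}MiB, "
              f"util {util.gpu}%")
finally:
    pynvml.nvmlShutdown()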
The two Ubuntu servers form a K8S cluster.
root@node33-a100:/mnt/nas# kubectl get nodes
NAME STATUS ROLES AGE VERSION
node33-a100 Ready master 15d v1.18.2
node34-a100 Ready <none> 15d v1.18.2
NVIDIA driver architecture
In short, the stack is layered: the NVIDIA kernel driver runs on each GPU node, nvidia-docker2 (the NVIDIA container runtime) exposes the GPUs to Docker containers, and k8s-device-plugin advertises them to Kubernetes. (The architecture diagrams from the original article are omitted.)
Configuration steps
To support multiple TensorFlow versions on one system, isolating them in Docker containers is a simple and efficient approach. We start on the two nodes where k8s is deployed (the Docker runtime is of course already installed there); based on the Docker version, the TensorFlow website directs us to install the NVIDIA driver support first.
root@node33-a100:~/gpu# docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:36 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:06 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
nvidia:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.18.0
GitCommit: fec3683
The examples from the TensorFlow website can then be used to verify whether TensorFlow running in a Docker container can access the GPU.
TensorFlow calling the GPU
NVIDIA driver installation
The NVIDIA driver must be installed on every GPU node.
Since both nodes run Ubuntu, we can follow the NVIDIA Installation Guide for installing the driver on Ubuntu:
root@node33-a100:~/gpu# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
To install the stable release, set up the package repository and the GPG key:
# distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Update the package lists and install nvidia-docker2:
# sudo apt-get update
# sudo apt-get install -y nvidia-docker2
During installation you will be asked whether to overwrite the contents of /etc/docker/daemon.json. Back up the existing file and merge its contents with the newly generated one; the resulting daemon.json must keep nvidia as the default runtime.
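A sketch of the merged file is below (the same runtime block appears again in the K8S orchestration section; any keys that were already present, such as registry mirrors, should be kept alongside it):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}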
Restart Docker to complete the installation:
# systemctl restart docker
Problems during installation
Reading from proxy failed
The lab's network and proxy architecture was not well understood; when running apt-get update, the following error appeared:
Reading from proxy failed - read (115: Operation now in progress) [IP: *.*.*.* 443]
On the management node apt-get update refreshed the package lists without trouble, but the other node kept hitting the error above. The problem took a long time to track down: of several other machines tried, two cloud hosts could run the update normally while one more k8s node could not. After much searching, it was finally solved by disabling the proxy on the affected node.
apt could not acquire the lock
While installing, apt failed to acquire its lock (see the usual references on the apt install lock problem).
In addition, warnings of the following form appeared:
... unsandboxed as root as file /root/iscos/depends/deb/InRelease ...
This was caused by the two-server k8s cluster having been installed offline from a local package source; the fix was to delete the local source files under /etc/apt/.
Program verification
NVIDIA's official example
The Installation Guide gives a verification example:
# docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
If the GPU information is printed correctly, the GPUs can be accessed from a container.
Multi-version TensorFlow support
To verify support for multiple TensorFlow versions, TensorFlow 2 and TensorFlow 1 were both tested. First pull the two images:
# docker pull tensorflow/tensorflow:2.9.1-gpu
# docker pull tensorflow/tensorflow:1.15.5-gpu
Then run the example from the TensorFlow website:
# docker run -it --rm tensorflow/tensorflow:2.9.1-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2022-06-09 09:30:00.944529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 6026 MB memory: -> device: 6, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:c7:00.0, compute capability: 8.0
2022-06-09 09:30:00.949071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 10598 MB memory: -> device: 7, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:ca:00.0, compute capability: 8.0
tf.Tensor(-921.5332, shape=(), dtype=float32)
If the console prints a tf.Tensor(...) value as above, the driver installation succeeded.
The above starts a container and runs the command directly inside it. Alternatively, you can interact with the container from the console and verify the driver installation by printing all of the GPU devices from an interactive session inside the container.
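A minimal sketch of such an interactive check (not the original session; it assumes the container is started with something like docker run -it --rm --gpus all tensorflow/tensorflow:2.9.1-gpu python):

# Inside the container's Python interpreter: list every GPU TensorFlow can see.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print(len(gpus), "GPU(s) visible")
for gpu in gpus:
    print(gpu)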
Finally, you can also verify by restricting the interactive session to one specific GPU.
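A minimal sketch of such a check, again run inside the container's Python interpreter (device index 0 is just an example, not from the original article):

# Restrict TensorFlow to a single GPU and run one op on it explicitly.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')   # only expose GPU:0 to this process
    with tf.device('/GPU:0'):
        x = tf.random.normal([1000, 1000])
        print(tf.reduce_sum(x))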
API-level checks can also be used:
>>> print(tf.test.is_gpu_available())
>>> tf.random.normal([2, 2])
We will not go into further detail here.
Calling GPUs from K8S container orchestration
The k8s-device-plugin project is hosted on GitHub; the installation process follows its README.md.
Note: make sure that nvidia is set as the default low-level runtime in /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Installing k8s-device-plugin
Once the NVIDIA driver is configured on all GPU nodes in the k8s cluster, GPU support can be enabled by deploying the following DaemonSet.
Run the following on the k8s management node:
# wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml
# kubectl create -f nvidia-device-plugin.yml
Note: kubectl create is the simple way to deploy the k8s-device-plugin, but the project officially recommends deploying it with Helm; Helm is not covered here, although it was tried and also works. Keep in mind that kubectl create and Helm are alternatives, so use one or the other.
Use the following command to check whether the plugin is running:
root@node33-a100:~/gpu# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66bff467f8-q7pks 1/1 Running 8 15d
coredns-66bff467f8-wtm8j 1/1 Running 8 15d
etcd-node33-a100 1/1 Running 9 15d
kube-apiserver-node33-a100 1/1 Running 8 15d
kube-controller-manager-node33-a100 1/1 Running 11 15d
kube-flannel-ds-amd64-6zl2n 1/1 Running 3 3h12m
kube-flannel-ds-amd64-vl84w 1/1 Running 763 8d
kube-proxy-5nmt2 1/1 Running 9 15d
kube-proxy-j9k96 1/1 Running 8 15d
kube-scheduler-node33-a100 1/1 Running 11 15d
nvidia-device-plugin-daemonset-9tfhr 1/1 Running 0 7d1h
nvidia-device-plugin-daemonset-p27ph 1/1 Running 0 7d1h
As shown, the nvidia-device-plugin-daemonset is running on both GPU nodes in the k8s cluster, using the image:
image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0
At this point, be sure to use kubectl logs to check whether the plugin started correctly. From the console output above, the DaemonSet lives in the kube-system namespace.
root@node33-a100:~/gpu# kubectl logs nvidia-device-plugin-daemonset-p27ph
Error from server (NotFound): pods "nvidia-device-plugin-daemonset-p27ph" not found
root@node33-a100:~/gpu# kubectl logs nvidia-device-plugin-daemonset-p27ph -n kube-system
2022/06/02 09:40:59 Loading NVML
2022/06/02 09:40:59 Starting FS watcher.
2022/06/02 09:40:59 Starting OS watcher.
2022/06/02 09:40:59 Retreiving plugins.
2022/06/02 09:40:59 Starting GRPC server for 'nvidia.com/gpu'
2022/06/02 09:40:59 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/02 09:40:59 Registered device plugin for 'nvidia.com/gpu' with Kubelet
Note: for pods that are not in the default namespace, you must pass -n to specify the namespace when viewing pod logs, otherwise an error is returned.
Then use kubectl describe to inspect the details of the specific resource or resource group:
root@node33-a100:~/gpu# kubectl describe pod nvidia-device-plugin-daemonset-p27ph -n kube-system
Name: nvidia-device-plugin-daemonset-p27ph
Namespace: kube-system
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
nvidia.com/gpu:NoSchedule
Events: <none>
The console output above has been trimmed for readability. From the Conditions section of the output we can conclude that the plugin is running normally.
In addition, the startup log of the nvidia-device-plugin container can be inspected directly with Docker commands:
root@node33-a100:~/gpu# docker ps -f name=nvidia
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6af0cf230428 6bf6d481d77e "nvidia-device-plugi…" 7 days ago Up 7 days k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-9tfhr_kube-system_58bb23d9-7692-44be-86ef-d815a7061b21_0
c8103716e8dd k8s.gcr.io/pause:3.2 "/pause" 7 days ago Up 7 days k8s_POD_nvidia-device-plugin-daemonset-9tfhr_kube-system_58bb23d9-7692-44be-86ef-d815a7061b21_0
root@node33-a100:~/gpu# docker logs 6af
2022/06/02 09:40:59 Loading NVML
2022/06/02 09:40:59 Starting FS watcher.
2022/06/02 09:40:59 Starting OS watcher.
2022/06/02 09:40:59 Retreiving plugins.
2022/06/02 09:40:59 Starting GRPC server for 'nvidia.com/gpu'
2022/06/02 09:40:59 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/02 09:41:00 Registered device plugin for 'nvidia.com/gpu' with Kubelet
As shown, the Docker container log matches the pod log reported by kubectl.
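Beyond the plugin logs, it is worth confirming that each node now advertises nvidia.com/gpu as an allocatable resource (kubectl describe node shows the same information). The sketch below is not from the original article; it assumes the official kubernetes Python client is installed and a kubeconfig is available.

# Minimal sketch: list how many nvidia.com/gpu each node reports as allocatable.
from kubernetes import client, config

config.load_kube_config()               # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu allocatable = {gpus}")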
Problems installing k8s-device-plugin
Error from server (NotFound)
This happens because kubectl describe and kubectl logs query the default namespace unless a namespace is specified.
The fix is simply to specify the namespace the pod belongs to:
root@node33-a100:~/gpu# kubectl get pods nvidia-device-plugin-daemonset-9tfhr
Error from server (NotFound): pods "nvidia-device-plugin-daemonset-9tfhr" not found
root@node33-a100:~/gpu# kubectl get pods nvidia-device-plugin-daemonset-9tfhr -n kube-system
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-9tfhr 1/1 Running 0 7d16h
root@node33-a100:~/gpu# kubectl logs nvidia-device-plugin-daemonset-9tfhr --namespace kube-system
2022/06/02 09:40:59 Loading NVML
2022/06/02 09:40:59 Starting FS watcher.
2022/06/02 09:40:59 Starting OS watcher.
2022/06/02 09:40:59 Retreiving plugins.
2022/06/02 09:40:59 Starting GRPC server for 'nvidia.com/gpu'
2022/06/02 09:40:59 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/06/02 09:41:00 Registered device plugin for 'nvidia.com/gpu' with Kubelet
Insufficient nvidia.com/gpu
While verifying the plugin, an "Insufficient nvidia.com/gpu" error appeared.
When this happens, double-check whether the nvidia default runtime is configured, i.e. check /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
After checking, remember to restart every GPU node and then restart the Docker service; we will not go into further detail here.
Incidentally, others have run into the same problem, and it is also reported on GitHub.
Program verification
TensorFlow calling the GPU
Use the following yaml file to access GPU resources. The manifest below requests one GPU, and the container prints the GPU device information and a tensor.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: tf-test
    image: tensorflow/tensorflow:2.9.1-gpu
    command:
    - python
    - -c
    - "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); gpus = tf.config.experimental.list_physical_devices(device_type='GPU'); print(gpus)"
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - effect: NoSchedule
    operator: Exists
Create the pod:
root@node33-a100:~/gpu# kubectl create -f gpu_job.yaml
pod/gpu-pod created
Check the pod log:
root@node33-a100:~/gpu# kubectl logs gpu-pod
2022-06-10 02:50:20.852512: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-10 02:50:21.515055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5501 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0, compute capability: 8.0
tf.Tensor(-508.727, shape=(), dtype=float32)
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
The log shows that GPU device 0 was used and that the tf.Tensor value was printed. Because the Pod was started with a command, it moves to the Completed state once the command finishes.
PyTorch calling the GPU
First, pull the pytorch/pytorch:latest image on every node in the cluster.
Use the following yaml to verify that PyTorch can use GPU resources:
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-gpu
  labels:
    test-gpu: "true"
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    command:
    - python
    - -c
    - "import torch as torch; print('gpu available', torch.cuda.is_available());print(torch.cuda.device_count());print(torch.cuda.get_device_name(0))"
    env:
    # - name: NVIDIA_VISIBLE_DEVICES
    #   value: none
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - effect: NoSchedule
    operator: Exists
The manifest above requests one GPU resource and prints CUDA availability, the device count, and the name of GPU device 0.
Start the Pod to verify that PyTorch workloads can be scheduled onto GPU resources:
# kubectl create -f pytorch-gpu.yaml
The pod log is as follows:
root@node33-a100:~/gpu# kubectl logs pytorch-gpu -f
gpu available True
1
NVIDIA A100-SXM4-40GB
This matches the resource limit in the Pod's yaml; the output shows that the PyTorch framework can correctly request GPU resources in the k8s cluster.
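As a slightly stronger check than printing the device name, a tensor can be placed on the GPU and used in a computation. The sketch below is not from the original article; it could be run the same way via the pod's command.

# Minimal sketch: confirm a computation really runs on the GPU, not just that CUDA is visible.
import torch

assert torch.cuda.is_available(), "CUDA not visible inside the container"
device = torch.device("cuda:0")
x = torch.randn(1000, 1000, device=device)   # allocated directly on GPU 0
y = x @ x                                    # matrix multiply on the GPU
print("result device:", y.device, "sum:", y.sum().item())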
Summary
This document walked through the two pieces of configuration needed for a K8s cluster to discover its GPU resources: first installing the NVIDIA driver and nvidia-docker2 on every GPU node, then deploying k8s-device-plugin in the k8s cluster so that the cluster can discover the underlying GPUs. The problems encountered along the way, and the reasoning behind their solutions, were recorded in detail.
References
The main references used while writing this document:
- TensorFlow installation (Docker)
- NVIDIA Container Toolkit Installation Guide
- k8s-device-plugin project README
Download
An XMind mind map recording the configuration process was produced alongside this work.