0. Install k8s and Docker
Reference: the first two steps of "Ubuntu cloud-native environment installation: docker + k8s + kubeedge (tested, works well)" (CSDN blog of 愛吃關東煮)
Configure GPUs on the nodes:
"Guide to configuring GPU resources in K8S" (CSDN blog of 思影影思)
1. Reset and clean up the old deployment (run on every node host):
kubeadm reset
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
2. Deploy the new k8s cluster:
Run on the master node only; set apiserver-advertise-address to the master node's IP.
sudo kubeadm init \
--apiserver-advertise-address=192.168.1.117 \
--control-plane-endpoint=node4212 \
--image-repository registry.cn-hangzhou.aliyuncs.com/google_containers \
--kubernetes-version v1.21.10 \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Once the master node has finished, run on each worker node the join command that kubeadm init printed at the end of its output.
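If the join command has been lost, or its bootstrap token has expired (kubeadm tokens are valid for 24 hours by default), a new one can be generated on the master node. A minimal sketch; the helper name print_join_command is made up for illustration, and the kubeadm call needs root privileges:

```shell
# Hypothetical helper: print a fresh "kubeadm join ..." line.
# Run on the master node as root; the output is the command to run on each worker.
print_join_command() {
    kubeadm token create --print-join-command
}

# Usage (on the master, as root):
#   print_join_command
```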
3. Install the network plugin
curl https://docs.projectcalico.org/manifests/calico.yaml -O
kubectl apply -f calico.yaml
Wait for it to finish.
4. Install the dashboard (run on the master node):
sudo kubectl apply -f /dashbord.yaml
sudo kubectl edit svc kubernetes-dashboard -n kubernetes-dashboard
Change type: ClusterIP to type: NodePort
# Find the assigned port and make sure the firewall does not block it
sudo kubectl get svc -A |grep kubernetes-dashboard
The dashboard is then reachable at <any node IP>:<NodePort>; here the port is 31678 (https://192.168.1.109:31678/)
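The port lookup can also be scripted rather than read off the grep output. The sketch below assumes the service is named kubernetes-dashboard in the kubernetes-dashboard namespace, as in the commands above; the helper name dashboard_url is hypothetical:

```shell
# Hypothetical helper: print the dashboard URL for a given node IP.
# Reads the NodePort that Kubernetes assigned to the kubernetes-dashboard service.
dashboard_url() {
    node_ip="$1"
    port=$(kubectl -n kubernetes-dashboard get svc kubernetes-dashboard \
        -o jsonpath='{.spec.ports[0].nodePort}')
    echo "https://${node_ip}:${port}/"
}

# Usage:
#   dashboard_url 192.168.1.109
#
# Non-interactive alternative to "kubectl edit" for switching the service type:
#   kubectl -n kubernetes-dashboard patch svc kubernetes-dashboard \
#       -p '{"spec":{"type":"NodePort"}}'
```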
Verify that all pods are in the Running state; if not, recheck the previous steps.
kubectl get pods --all-namespaces -o wide
# Inspect a pod's status
kubectl describe pod kubernetes-dashboard-57c9bfc8c8-lmb67 --namespace kubernetes-dashboard
# Print a pod's logs
kubectl logs nvidia-device-plugin-daemonset-xn7hx --namespace kube-system
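Checking by eye that every pod is Running gets tedious on a three-host cluster. As a rough sketch, the check can be reduced to a single number; the helper name count_unhealthy_pods is hypothetical, and it assumes STATUS is the fourth column of the kubectl output:

```shell
# Hypothetical helper: count pods whose STATUS column is neither Running nor Completed.
# With --no-headers the columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
count_unhealthy_pods() {
    kubectl get pods --all-namespaces --no-headers |
        awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Usage:
#   [ "$(count_unhealthy_pods)" -eq 0 ] && echo "all pods healthy"
```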
Create an access account
kubectl apply -f /dashuser.yaml
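The contents of /dashuser.yaml are not shown in this post. A plausible version, assuming it follows the usual Dashboard sample user (a ServiceAccount named admin-user bound to the cluster-admin role, which matches the sa/admin-user lookup in the token command), could be generated like this. The helper name and output path are hypothetical, and cluster-admin grants full access to the cluster:

```shell
# Hypothetical helper: write an admin-user manifest to the given path.
# Assumption: this mirrors what the author's /dashuser.yaml contains.
write_dashuser_manifest() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard
EOF
}

# Usage:
#   write_dashuser_manifest ./dashuser-example.yaml
#   kubectl apply -f ./dashuser-example.yaml
```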
Get the access token (run on the master node; it is refreshed every day):
kubectl -n kubernetes-dashboard get secret $(kubectl -n kubernetes-dashboard get sa/admin-user -o jsonpath="{.secrets[0].name}") -o go-template="{{.data.token | base64decode}}"
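Paste the token into the dashboard login page.
5. Build an image and push it to Docker Hub:
List local images: docker images
Log in to your Docker account (docker login)
Tag the image. Left: local name:tag; right: Hub username/repository:tag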
docker tag deeplabv3plus:1.0.0 chenzishu/deepmodel:labv3
Push to Docker Hub
docker push chenzishu/deepmodel:labv3
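The tag and push steps always go together, so they can be wrapped in one function. This is only a sketch; the helper name tag_and_push and its argument order are made up for illustration:

```shell
# Hypothetical helper: tag a local image for Docker Hub and push it.
# Arguments: <local-name:tag> <hub-user> <repository> <tag>
tag_and_push() {
    remote="$2/$3:$4"
    # Push only if tagging succeeded.
    docker tag "$1" "$remote" && docker push "$remote"
}

# Usage (after "docker login"):
#   tag_and_push deeplabv3plus:1.0.0 chenzishu deepmodel labv3
```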
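6. Using the dashboard
Create a deployment
The app name is arbitrary; for the image, enter the address of the corresponding image on Docker Hub (e.g. chenzishu/deepmodel:pytorch)
Wait for the container to start; this takes some time.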
########
# If the pod keeps restarting after startup and reports "Back-off restarting failed container",
# find the corresponding deployment and add:
command: ["/bin/bash", "-ce", "tail -f /dev/null"]
########
7. Run the pod:
List local containers: docker ps -a
Find the pod:
kubectl get pods --all-namespaces -o wide
Enter the container:
kubectl exec -it segnet-747b798bf5-4bjqk -- /bin/bash
List the files in the container:
ls
Run nvidia-smi to check whether the container can access the GPU
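The GPU check does not need an interactive shell; kubectl exec can run nvidia-smi directly. A small sketch, with a hypothetical helper name and the namespace defaulting to "default":

```shell
# Hypothetical helper: run nvidia-smi inside a pod without opening a shell.
# Arguments: <pod-name> [namespace]   (namespace defaults to "default")
pod_sees_gpu() {
    kubectl exec -n "${2:-default}" "$1" -- nvidia-smi
}

# Usage:
#   pod_sees_gpu segnet-747b798bf5-4bjqk && echo "GPU visible in the container"
```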
8. Using GPU resources in containers; GPU sharing
https://gitcode.net/mirrors/AliyunContainerService/gpushare-scheduler-extender/-/blob/master/docs/install.md
First install nvidia-docker2:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# Test
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
apt-get update may fail; see the section "Conflicting values set for option Signed-By error when running apt update" in the official documentation:
E: Conflicting values set for option Signed-By regarding source https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64/ /: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg !=
E: The list of sources could not be read.
Fix:
grep -l "nvidia.github.io" /etc/apt/sources.list.d/* | grep -vE "/nvidia-container-toolkit.list\$"
Delete the files it lists.
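The same grep pipeline can be wrapped in a function so that its output feeds the deletion step directly. A sketch under the assumption that only nvidia-container-toolkit.list should survive; the helper name is hypothetical:

```shell
# Hypothetical helper: list apt source files that mention nvidia.github.io,
# except nvidia-container-toolkit.list, which is the one to keep.
# Argument: [sources directory]   (defaults to /etc/apt/sources.list.d)
list_conflicting_nvidia_lists() {
    dir="${1:-/etc/apt/sources.list.d}"
    grep -l "nvidia.github.io" "$dir"/* 2>/dev/null |
        grep -vE '/nvidia-container-toolkit\.list$'
}

# Usage: review the list first, then remove the files and retry.
#   list_conflicting_nvidia_lists
#   list_conflicting_nvidia_lists | xargs -r sudo rm -v
#   sudo apt-get update
```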
Before installing gpushare-device-plugin, make sure the Nvidia driver and nvidia-docker2 are installed on every GPU node, and that Docker's default runtime is set to nvidia.
Configure the runtime in /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
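Docker only picks up changes to daemon.json after a restart, and it is worth confirming that the change took effect before moving on. A minimal sketch; the helper name is hypothetical, and it relies on docker info exposing the DefaultRuntime field:

```shell
# Hypothetical helper: succeed only if Docker reports "nvidia" as its default runtime.
default_runtime_is_nvidia() {
    [ "$(docker info --format '{{.DefaultRuntime}}')" = "nvidia" ]
}

# Usage (after editing /etc/docker/daemon.json):
#   sudo systemctl restart docker
#   default_runtime_is_nvidia && echo "default runtime is nvidia"
```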
Deploying GPU Sharing
Next, follow the Alibaba developer documentation, which covers configuring and using nvidia-share in detail: https://developer.aliyun.com/article/690623
See also: "Using Alibaba Cloud GPU sharing for GPU scheduling in a K8S cluster" (dianjilingqu.com)
Deploy the GPU sharing scheduler plugin gpushare-schd-extender
cd /tmp/
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
kubectl create -f gpushare-schd-extender.yaml
# To allow scheduling on the master, delete these two lines
# nodeSelector:
# node-role.kubernetes.io/master: ""
# from gpushare-schd-extender.yaml, so that k8s can schedule GPU workloads on the master
### If the download fails, use the following link instead:
wget http://49.232.8.65/yaml/gpushare-schd-extender.yaml
Deploy the device plugin gpushare-device-plugin
cd /tmp/
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
# By default GPU memory is measured in GiB; to use MiB instead, change --memory-unit=GiB to --memory-unit=MiB in this file
kubectl create -f device-plugin-ds.yaml
### If the download fails, use the following links instead:
wget http://49.232.8.65/yaml/device-plugin-rbac.yaml
wget http://49.232.8.65/yaml/device-plugin-ds.yaml
Label the GPU nodes
# To schedule GPU workloads onto servers that have GPUs, label those nodes with gpushare=true
kubectl get nodes
# Pick the GPU nodes and label them
kubectl label node <target_node> gpushare=true
kubectl describe node <target_node>
Add the kubectl inspect gpushare extension executable
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x kubectl-inspect-gpushare
mv kubectl-inspect-gpushare /usr/local/bin
### If the download fails, use the following link instead:
wget http://49.232.8.64/k8s/kubectl-inspect-gpushare
Check the GPU information: if GPU information is displayed, the installation succeeded.
root@dell21[/root]# kubectl inspect gpushare
NAME         IPADDRESS    GPU0(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
10.45.61.22  10.45.61.22  0/7                    2                   2/7
------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
2/7 (28%)
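With the plugin running, a pod asks for a slice of GPU memory through a resource limit. The sketch below assumes the resource name aliyun.com/gpu-mem described in the gpushare project's user guide; the pod name, the helper name, and the requested amount are made up for illustration, and the unit follows the --memory-unit setting of the device plugin:

```shell
# Hypothetical helper: write a pod manifest that requests 2 units of shared GPU memory.
write_gpushare_pod() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-demo
spec:
  containers:
  - name: demo
    image: chenzishu/deepmodel:labv3
    # Keep the container alive so it can be inspected with kubectl exec.
    command: ["/bin/bash", "-ce", "tail -f /dev/null"]
    resources:
      limits:
        # GiB or MiB, depending on --memory-unit in device-plugin-ds.yaml
        aliyun.com/gpu-mem: 2
EOF
}

# Usage:
#   write_gpushare_pod ./gpushare-demo.yaml
#   kubectl apply -f ./gpushare-demo.yaml
#   kubectl inspect gpushare
```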
9. Some common problems
Pod cannot start because of insufficient resources:
# Set the taint (eviction) threshold; first check the kubelet status
systemctl status -l kubelet
# Directory containing the kubelet config file
/etc/systemd/system/kubelet.service.d/
# Relax the threshold
# Edit the config file to pass an extra argument; add this option: --eviction-hard=nodefs.available<3%
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --eviction-hard=nodefs.available<3%"
systemctl daemon-reload
systemctl restart kubelet
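To judge whether the node is actually close to the nodefs.available threshold, it helps to look at how much space is left on the filesystem that holds kubelet's data. A rough sketch: the helper name is hypothetical, /var/lib/kubelet is kubelet's default root directory, and the percentage derived from df is only an approximation of what kubelet computes:

```shell
# Hypothetical helper: print the free-space percentage of the filesystem holding a path.
# Argument: [path]   (defaults to /var/lib/kubelet)
nodefs_available_pct() {
    # Line 2 of "df -P" holds the data; field 5 is the used percentage, e.g. "45%".
    df -P "${1:-/var/lib/kubelet}" |
        awk 'NR == 2 { sub("%", "", $5); print 100 - $5 }'
}

# Usage:
#   echo "nodefs available: $(nodefs_available_pct)%"
```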
Pod restarts repeatedly:
After starting, the pod keeps restarting and reports "Back-off restarting failed container".
Find the corresponding deployment and add:
command: ["/bin/bash", "-ce", "tail -f /dev/null"]
spec:
  containers:
  - name: test-file
    image: xxx:v1
    command: ["/bin/bash", "-ce", "tail -f /dev/null"]
    imagePullPolicy: IfNotPresent
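The restart loop usually means the container's main process exits right away; tail -f /dev/null never exits, so the container stays up. Instead of editing the deployment by hand, the same command can be added with a JSON patch. This is a sketch only: the helper name is hypothetical, and it assumes the container to fix is the first one in the pod template:

```shell
# Hypothetical helper: add the keep-alive command to the first container of a deployment.
# Arguments: <deployment-name> [namespace]   (namespace defaults to "default")
add_keepalive_command() {
    patch='[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["/bin/bash","-ce","tail -f /dev/null"]}]'
    kubectl -n "${2:-default}" patch deployment "$1" --type=json -p "$patch"
}

# Usage:
#   add_keepalive_command segnet
```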