Install a FATE cluster on three CentOS nodes using full Kubernetes (k8s) rather than minikube.
Cluster configuration
The configuration of the three nodes is shown in the figure below:
At the time, the latest KubeFATE release was v1.9.0, which depends on the following Kubernetes and ingress-nginx versions:
Recommended version of dependent software:
Kubernetes: v1.23.5
Ingress-nginx: v1.1.3
Upgrade Kubernetes to v1.23.5
If your cluster's Kubernetes version is above v1.19.0, you can skip this step. Kubernetes can either be upgraded in place or reinstalled at the target version.
Uninstall the old FATE release
If FATE has never been installed on your cluster, skip this step. The steps for the version I installed previously are recorded at:
https://blog.csdn.net/Acecai01/article/details/127979608
List the previously installed old FATE namespaces, then delete them:
[root@harbor kubefate]# kubectl get ns
NAME STATUS AGE
default Active 504d
fate-10000 Active 459d
fate-9999 Active 459d
fate-9998 Active 459d
ingress-nginx Active 465d
....
First switch to the directory holding the original installation files (e.g. /home/FATE_V180/kubefate) and remove FATE from all three nodes: look up each cluster ID with kubefate cluster ls, then delete it with kubefate cluster delete:
[root@harbor kubefate]# kubefate cluster ls
UUID NAME NAMESPACE REVISION STATUS CHART ChartVERSION AGE
5d57a5e4-abdc-4dbd-94be-3966940f36dd fate-10000 fate-10000 1 Running fate v1.8.0 7d22h
1c83526e-9c1e-4a7d-b364-40775544abcc fate-9999 fate-9999 1 Running fate v1.8.0 7d22h
2dc9eede-2c9b-4a27-a58a-96fd84edd31a fate-9998 fate-9998 1 Running fate v1.8.0 7d22h
[root@harbor kubefate]# kubefate cluster delete 5d57a5e4-abdc-4dbd-94be-3966940f36dd
create job Success, job id=bc3276bf-5a2a-425e-a4e5-4a831785736e
[root@harbor kubefate]# kubefate cluster delete 1c83526e-9c1e-4a7d-b364-40775544abcc
create job Success, job id=b36feca8-e575-4f03-998f-3264fdb541e6
[root@harbor kubefate]# kubefate cluster delete 2dc9eede-2c9b-4a27-a58a-96fd84edd31a
create job Success, job id=c50fcb1f-2632-487d-94dd-88beb7018eba
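When several clusters need removing, the UUID lookup and deletion can be scripted instead of done one by one. A minimal sketch, assuming the `kubefate cluster ls` column layout shown above (the actual delete call is left commented out so the loop is safe to dry-run):

```shell
# Extract the UUID column from `kubefate cluster ls` output and delete each cluster.
# Sample output captured above; in practice pipe `kubefate cluster ls` in directly.
ls_output='UUID NAME NAMESPACE
5d57a5e4-abdc-4dbd-94be-3966940f36dd fate-10000 fate-10000
1c83526e-9c1e-4a7d-b364-40775544abcc fate-9999 fate-9999
2dc9eede-2c9b-4a27-a58a-96fd84edd31a fate-9998 fate-9998'

echo "$ls_output" | awk 'NR > 1 {print $1}' | while read -r uuid; do
  echo "deleting cluster $uuid"
  # kubefate cluster delete "$uuid"
done
```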
Then delete the namespaces fate-10000, fate-9999, and fate-9998 one by one, using the same YAML files that originally installed them:
[root@harbor kubefate]# kubectl delete -f ./cluster.yaml
....
Then delete:
[root@harbor kubefate]# kubectl delete -f ./rbac-config.yaml
....
Finally, remove ingress-nginx:
[root@harbor kubefate]# kubectl delete -f ./deploy.yaml # this deploy.yaml is the one I downloaded back then; see my old-version installation blog for the download source
....
Download KubeFATE v1.7.2
Link: link
Package: kubefate-k8s-v1.7.2.tar.gz
The following operations are performed on the master node.
Deploy ingress-nginx
Reference: https://blog.csdn.net/qq_41296573/article/details/125809696
The deploy.yaml below installs ingress-nginx v1.1.3 (the latest at the time was v1.5.0); you may need a proxy to download it: https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.3/deploy/static/provider/cloud/deploy.yaml
The file references two images that require a proxy to pull; replace them with domestic mirrors (three occurrences in total):
k8s.gcr.io/ingress-nginx/controller:v1.1.3@sha256:31f47c1e202b39fadecf822a9b76370bd4baed199a005b3e7d4d1455f4fd3fe2
change to:
registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v1.1.3
k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660
change to:
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-webhook-certgen:v1.1.1
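The two image substitutions can also be done with sed instead of hand-editing. A sketch, run here against a stand-in file named deploy-sample.yaml (an assumption for illustration; point it at your real deploy.yaml), dropping the @sha256 digests in the process:

```shell
# Rewrite the two k8s.gcr.io images to their Aliyun mirrors, stripping the digests.
cat > deploy-sample.yaml <<'EOF'
image: k8s.gcr.io/ingress-nginx/controller:v1.1.3@sha256:31f47c1e202b39fadecf822a9b76370bd4baed199a005b3e7d4d1455f4fd3fe2
image: k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660
EOF

sed -i \
  -e 's|k8s.gcr.io/ingress-nginx/controller:v1.1.3@sha256:[0-9a-f]*|registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v1.1.3|' \
  -e 's|k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:[0-9a-f]*|registry.cn-hangzhou.aliyuncs.com/google_containers/kube-webhook-certgen:v1.1.1|' \
  deploy-sample.yaml
cat deploy-sample.yaml
```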
The pods in deploy.yaml must be pinned to a specific server. First label that server (I added this step later; the labeling instructions were originally written further down — follow the steps under "Install FATE with KubeFATE → Add labels to the cluster nodes" below). Then edit deploy.yaml: there are three places containing a nodeSelector attribute, and each gets the following addition:
...
nodeSelector:
type: node2 # schedule the pod onto the server labeled node2
Then deploy ingress-nginx: kubectl apply -f ./deploy.yaml
Check whether ingress-nginx came up successfully:
[root@harbor kubefate]# kubectl get pods -n ingress-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-admission-create-zh96h 0/1 Completed 0 2d23h 10.244.1.26 gpu-51 <none> <none>
ingress-nginx-admission-patch-hmgr5 0/1 Completed 1 2d23h 10.244.1.27 gpu-51 <none> <none>
ingress-nginx-controller-6995ffb95b-m87gh 1/1 Running 0 2d18h 172.17.0.8 k8s-node02 <none> <none>
As shown, ingress-nginx was installed on the k8s-node02 node rather than the master; this is normal (even when operating from the master, it can be scheduled elsewhere).
Run the following command to check that the configuration took effect: kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.196.14 <pending> 80:30428/TCP,443:30338/TCP 16m
ingress-nginx-controller-admission ClusterIP 10.1.32.33 <none> 443/TCP 16m
The EXTERNAL-IP of ingress-nginx-controller is stuck in pending. After some research, I followed this blog:
Link: link
Set the EXTERNAL-IP of the ingress-nginx-controller service to the IP of the k8s-node02 node: kubectl edit -n ingress-nginx service/ingress-nginx-controller
Add externalIPs at roughly this location:
spec:
allocateLoadBalancerNodePorts: true
clusterIP: 10.1.86.240
clusterIPs:
- 10.1.86.240
externalIPs:
- 10.6.17.106
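The same change can be applied non-interactively with kubectl patch rather than kubectl edit. A sketch, assuming 10.6.17.106 is your node's IP (the patch command itself is shown but left commented out here):

```shell
# Build the externalIPs patch and show the equivalent non-interactive command.
NODE_IP="10.6.17.106"   # replace with your own node's IP
PATCH="{\"spec\":{\"externalIPs\":[\"${NODE_IP}\"]}}"
echo "$PATCH"
# kubectl -n ingress-nginx patch svc ingress-nginx-controller --type=merge -p "$PATCH"
```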
Check again; the EXTERNAL-IP is now populated:
[root@harbor kubefate]# kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.86.240 10.6.17.106 80:31872/TCP,443:32412/TCP 2d23h
ingress-nginx-controller-admission ClusterIP 10.1.41.173 <none> 443/TCP 2d23h
Install the KubeFATE service
Create a directory: mkdir /home/FATE_V172
Copy kubefate-k8s-v1.7.2.tar.gz into the new directory and extract it: tar -zxvf kubefate-k8s-v1.7.2.tar.gz
The extracted directory contains the kubefate executable; move it into a PATH directory for convenience: [root@harbor kubefate]# chmod +x ./kubefate && sudo mv ./kubefate /usr/bin
Test whether the kubefate command works: [root@harbor kubefate]# kubefate version
* kubefate commandLine version=v1.4.4
* kubefate service connection error, resp.StatusCode=404, error: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>404 - Not Found</title>
</head>
<body>
<h1>404 - Not Found</h1>
<script type="text/javascript" src="http://wpc.75674.betacdn.net/0075674/www/ec_tpm_bcon.js"></script>
</body>
</html>
The error above is expected at this point and will be resolved later.
Apply rbac-config.yaml to create the namespace for the KubeFATE service: [root@harbor kubefate]# kubectl apply -f ./rbac-config.yaml
Because Docker Hub recently tightened its rate limits (Dockerhub latest limitation), I recommend using the NetEase mirror registry instead of the official Docker Hub:
1. In kubefate.yaml, change the image federatedai/kubefate:v1.4.4 to hub.c.163.com/federatedai/kubefate:v1.4.3
2. sed 's/mariadb:10/hub.c.163.com\/federatedai\/mariadb:10/g' kubefate.yaml > kubefate_163.yaml
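Both substitutions can be combined into a single sed invocation. A sketch over a stand-in file (kubefate-sample.yaml is an assumption for illustration; run against your real kubefate.yaml):

```shell
# Rewrite both images to the NetEase mirrors in one pass.
cat > kubefate-sample.yaml <<'EOF'
image: federatedai/kubefate:v1.4.4
image: mariadb:10
EOF

sed -e 's|federatedai/kubefate:v1.4.4|hub.c.163.com/federatedai/kubefate:v1.4.3|' \
    -e 's|mariadb:10|hub.c.163.com/federatedai/mariadb:10|' \
    kubefate-sample.yaml > kubefate_163-sample.yaml
cat kubefate_163-sample.yaml
```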
Deploy the KubeFATE service in the kube-fate namespace; the relevant YAML file is already in the working directory, so just apply it: [root@harbor kubefate]# kubectl apply -f ./kubefate_163.yaml
Note: if you removed kubefate and ingress-nginx and are re-running this step, you may hit an error; see https://blog.csdn.net/qq_39218530/article/details/115372879 for a fix.
Wait a moment, roughly ten-odd seconds, then check whether the KubeFATE service is up. If, of the two KubeFATE pods, the kubefate pod has not started:
As in the figure above, the likely cause is that kubefate and mariadb were scheduled onto different nodes, so kubefate cannot reach mariadb. You can delete everything installed via rbac-config and kubefate_163 and start over; with luck the two pods land on the same node and kubefate comes up fine, as shown below:
Relying on luck is slow, of course; the following blog shows how to pin pods to a designated node:
http://t.zoukankan.com/wucaiyun1-p-11698320.html
If the output resembles the following (in particular, the pods' STATUS is Running), the KubeFATE service is deployed and running normally:
[root@harbor kubefate]# kubectl get all,ingress -n kube-fate
NAME READY STATUS RESTARTS AGE
pod/kubefate-5bf485957b-tznw6 1/1 Running 0 2d20h
pod/mariadb-574d4679f8-f5wc2 1/1 Running 0 2d20h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubefate NodePort 10.1.151.34 <none> 8080:30053/TCP 3d1h
service/mariadb ClusterIP 10.1.150.151 <none> 3306/TCP 3d1h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kubefate 1/1 1 1 3d1h
deployment.apps/mariadb 1/1 1 1 3d1h
NAME DESIRED CURRENT READY AGE
replicaset.apps/kubefate-5bf485957b 1 1 1 3d1h
replicaset.apps/mariadb-574d4679f8 1 1 1 3d1h
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/kubefate nginx example.com 10.6.17.106 80 3d1h
Add example.com to the hosts file
Because we will access the KubeFATE service via the example.com domain (defined in the ingress; change it there if needed), the hosts file must be configured on the machine where the kubefate command line runs (note: not the Kubernetes machine, but the machine running ingress-nginx, as covered in the ingress-nginx installation section above). The FATE clusters deployed below also use example.com as their default domain. If your network has a DNS service, you can instead point example.com at the master machine's IP and skip the hosts file. (Be sure to substitute your own IP address.) sudo -- sh -c "echo \"10.6.17.106 example.com\" >> /etc/hosts"
[root@harbor kubefate]# ping example.com
PING example.com (10.6.17.106) 56(84) bytes of data.
64 bytes from k8s-master (10.6.17.106): icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from k8s-master (10.6.17.106): icmp_seq=2 ttl=64 time=0.054 ms
64 bytes from k8s-master (10.6.17.106): icmp_seq=3 ttl=64 time=0.050 ms
^C
--- example.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.041/0.048/0.054/0.007 ms
Edit config.yaml with vi. Only one change is needed: set serviceurl: example.com:32303, appending the mapped port. If you have forgotten the port, look up the NodePort mapped to port 80 again:
[root@harbor kubefate]# kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.209.99 10.6.17.106 80:32303/TCP,443:31648/TCP 43h
ingress-nginx-controller-admission ClusterIP 10.1.241.232 <none> 443/TCP 43h
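The mapped port can also be extracted programmatically rather than read off by eye. A sketch over the captured svc line above (with a live cluster, kubectl's -o jsonpath output would be a cleaner source):

```shell
# Pull the NodePort mapped to port 80 out of the `kubectl get svc` line.
svc_line='ingress-nginx-controller LoadBalancer 10.1.209.99 10.6.17.106 80:32303/TCP,443:31648/TCP 43h'
port=$(echo "$svc_line" | sed -n 's/.*80:\([0-9]*\)\/TCP.*/\1/p')
echo "serviceurl: example.com:${port}"
# sed -i "s/^serviceurl:.*/serviceurl: example.com:${port}/" config.yaml
```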
After the change, verify; it now shows:
[root@harbor kubefate]# kubefate version
* kubefate commandLine version=v1.4.3
* kubefate service version=v1.4.3
Install FATE with KubeFATE
Add labels to the cluster nodes
Background (nothing to run here)
When pods in the same namespace are scheduled onto different nodes, they cannot reach each other and deployment fails. In the figure below, for example, python and mysql were placed on different nodes; python could never connect to mysql, so the python pod never deployed successfully:
The figure shows that pods in the same namespace were not co-located on one node, and that pod placement was uncontrolled; the scheduler may not place pods the way you want (I wanted the pods of the three namespaces on three different nodes). I suspected that pod placement could be pinned to specific nodes, and the official configuration parameters do indeed provide a node-selection mechanism.
Steps to run
To deploy the pods of each namespace onto designated nodes, first label each cluster node:
[root@harbor kubefate]# kubectl get node # first list the cluster node names
NAME STATUS ROLES AGE VERSION
gpu-51 Ready <none> 15d v1.23.5
harbor.clife.io Ready control-plane,master 15d v1.23.5
k8s-node02 Ready <none> 15d v1.20.2
[root@harbor ~]# kubectl label node harbor.clife.io type=master
node/harbor.clife.io labeled
[root@harbor ~]# kubectl label node k8s-node02 type=node2
node/k8s-node02 labeled
[root@harbor ~]# kubectl label node gpu-51 type=node1
node/gpu-51 labeled
[root@harbor ~]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
gpu-51 Ready <none> 14d v1.23.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=gpu-51,kubernetes.io/os=linux,type=node1
harbor.clife.io Ready control-plane,master 14d v1.23.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=harbor.clife.io,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,type=master
k8s-node02 Ready <none> 14d v1.20.2 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node02,kubernetes.io/os=linux,type=node2
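The three label commands can be driven from a single node-to-label mapping. A sketch using the node names from the kubectl get node output above (the kubectl call is commented out; the loop only prints what it would run):

```shell
# Map each node name to its label value and apply it.
labels='harbor.clife.io master
k8s-node02 node2
gpu-51 node1'

echo "$labels" | while read -r node type; do
  echo "kubectl label node $node type=$type"
  # kubectl label node "$node" "type=$type"
done
```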
Configure the deployment parameters
Per the plan above, we install three federation parties with IDs 9998, 9999, and 10000. In reality these would be completely independent, isolated organizations, so to simulate that we first create a separate Kubernetes namespace for each: fate-9998 for party 9998, fate-9999 for 9999, and fate-10000 for 10000.
kubectl create namespace fate-9998
kubectl create namespace fate-9999
kubectl create namespace fate-10000
Three examples are prepared in the examples directory (party-9998 I created by copying): /kubefate/examples/party-9998/, /kubefate/examples/party-9999/, and /kubefate/examples/party-10000/. The configurations come first; their key points are discussed afterwards.
For /kubefate/examples/party-9998/cluster.yaml, modify as follows:
name: fate-9998
namespace: fate-9998
chartName: fate
chartVersion: v1.7.2
partyId: 9998
registry: "hub.c.163.com/federatedai" # changed to a domestic mirror registry
imageTag: "1.7.2-release"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
enabled: false
podSecurityPolicy:
enabled: false
modules:
- rollsite
- clustermanager
- nodemanager
- mysql
- python
- fateboard
- client
backend: eggroll
rollsite:
type: NodePort
nodePort: 30081
partyList: # fill in the other two parties' information
- partyId: 10000
partyIp: 10.6.17.104
partyPort: 30101
- partyId: 9999
partyIp: 10.6.17.106
partyPort: 30091
nodeSelector: # pin this pod to a node
type: node1
clustermanager:
nodeSelector: # not in the official docs; added to force this pod onto the target node
type: node1
nodemanager:
count: 3
sessionProcessorsPerNode: 4
storageClass: "nodemanagers"
accessMode: ReadWriteOnce
size: 2Gi
nodeSelector: # pin this pod to a node; also absent from the official docs, added by me
type: node1
list:
- name: nodemanager
nodeSelector: # pin this pod to a node
type: node1
sessionProcessorsPerNode: 4
subPath: "nodemanager"
existingClaim: ""
storageClass: "nodemanager"
accessMode: ReadWriteOnce
size: 1Gi
mysql:
nodeSelector: # pin this pod to a node
type: node1
ip: mysql
port: 3306
database: eggroll_meta
user: fate
password: fate_dev
subPath: ""
existingClaim: ""
storageClass: "mysql"
accessMode: ReadWriteOnce
size: 1Gi
ingress:
fateboard:
annotations:
kubernetes.io/ingress.class: "nginx"
hosts:
- name: party9998.fateboard.example.com
client:
annotations:
kubernetes.io/ingress.class: "nginx"
hosts:
- name: party9998.notebook.example.com
python:
type: NodePort
httpNodePort: 30087
grpcNodePort: 30082
logLevel: INFO # must be set, otherwise no logs appear in FATEBoard
nodeSelector: # pin this pod to a node
type: node1
fateboard: # this service is provided by the python pod above, so no nodeSelector is needed
type: ClusterIP
username: admin
password: admin
client:
nodeSelector: # pin this pod to a node
type: node1
subPath: ""
existingClaim: ""
storageClass: "client"
accessMode: ReadWriteOnce
size: 1Gi
servingIp: 10.6.14.13
servingPort: 30085
For /kubefate/examples/party-9999/cluster.yaml, modify as follows:
name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.7.2
partyId: 9999
registry: "hub.c.163.com/federatedai" # changed to a domestic mirror registry
imageTag: "1.7.2-release"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
enabled: false
podSecurityPolicy:
enabled: false
modules:
- rollsite
- clustermanager
- nodemanager
- mysql
- python
- fateboard
- client
backend: eggroll
rollsite:
type: NodePort
nodePort: 30091
partyList: # fill in the other two parties' information
- partyId: 10000
partyIp: 10.6.17.104
partyPort: 30101
- partyId: 9998
partyIp: 10.6.14.13
partyPort: 30081
nodeSelector: # pin this pod to a node
type: node2
clustermanager:
nodeSelector: # not in the official docs; added to force this pod onto the target node
type: node2
nodemanager:
count: 3
sessionProcessorsPerNode: 4
storageClass: "nodemanagers"
accessMode: ReadWriteOnce
size: 2Gi
nodeSelector: # pin this pod to a node; also absent from the official docs, added by me
type: node2
list:
- name: nodemanager
nodeSelector: # pin this pod to a node
type: node2
sessionProcessorsPerNode: 4
subPath: "nodemanager"
existingClaim: ""
storageClass: "nodemanager"
accessMode: ReadWriteOnce
size: 1Gi
mysql:
nodeSelector: # pin this pod to a node
type: node2
ip: mysql
port: 3306
database: eggroll_meta
user: fate
password: fate_dev
subPath: ""
existingClaim: ""
storageClass: "mysql"
accessMode: ReadWriteOnce
size: 1Gi
ingress:
fateboard:
annotations:
kubernetes.io/ingress.class: "nginx"
hosts:
- name: party9999.fateboard.example.com
client:
annotations:
kubernetes.io/ingress.class: "nginx"
hosts:
- name: party9999.notebook.example.com
python:
type: NodePort
httpNodePort: 30097
grpcNodePort: 30092
logLevel: INFO # must be set, otherwise no logs appear in FATEBoard
nodeSelector: # pin this pod to a node
type: node2
fateboard: # this service is provided by the python pod above, so no nodeSelector is needed
type: ClusterIP
username: admin
password: admin
client:
nodeSelector: # pin this pod to a node
type: node2
subPath: ""
existingClaim: ""
storageClass: "client"
accessMode: ReadWriteOnce
size: 1Gi
servingIp: 10.6.17.106
servingPort: 30095
For /kubefate/examples/party-10000/cluster.yaml, modify as follows:
name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.7.2
partyId: 10000
registry: "hub.c.163.com/federatedai" # changed to a domestic mirror registry
imageTag: "1.7.2-release"
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
enabled: false
podSecurityPolicy:
enabled: false
modules:
- rollsite
- clustermanager
- nodemanager
- mysql
- python
- fateboard
- client
backend: eggroll
rollsite:
type: NodePort
nodePort: 30101
partyList: # fill in the other two parties' information
- partyId: 9999
partyIp: 10.6.17.106
partyPort: 30091
- partyId: 9998
partyIp: 10.6.14.13
partyPort: 30081
nodeSelector: # pin this pod to a node
type: master
clustermanager:
nodeSelector: # not in the official docs; added to force this pod onto the target node
type: master
nodemanager:
count: 3
sessionProcessorsPerNode: 4
storageClass: "nodemanagers"
accessMode: ReadWriteOnce
size: 2Gi
nodeSelector: # pin this pod to a node; also absent from the official docs, added by me
type: master
list:
- name: nodemanager
nodeSelector: # pin this pod to a node
type: master
sessionProcessorsPerNode: 4
subPath: "nodemanager"
existingClaim: ""
storageClass: "nodemanager"
accessMode: ReadWriteOnce
size: 1Gi
mysql:
nodeSelector: # pin this pod to a node
type: master
ip: mysql
port: 3306
database: eggroll_meta
user: fate
password: fate_dev
subPath: ""
existingClaim: ""
storageClass: "mysql"
accessMode: ReadWriteOnce
size: 1Gi
ingress:
fateboard:
annotations:
kubernetes.io/ingress.class: "nginx"
hosts:
- name: party10000.fateboard.example.com
client:
annotations:
kubernetes.io/ingress.class: "nginx"
hosts:
- name: party10000.notebook.example.com
python:
type: NodePort
httpNodePort: 30107
grpcNodePort: 30102
logLevel: INFO # must be set, otherwise no logs appear in FATEBoard
nodeSelector: # pin this pod to a node
type: master
fateboard: # this service is provided by the python pod above, so no nodeSelector is needed
type: ClusterIP
username: admin
password: admin
client:
nodeSelector: # pin this pod to a node
type: master
subPath: ""
existingClaim: ""
storageClass: "client"
accessMode: ReadWriteOnce
size: 1Gi
servingIp: 10.6.17.104
servingPort: 30105
The key points of the configurations above:
1. Change the namespace names;
2. Change the image registry;
3. Set each party's own service IP and port, plus the IPs and ports of the other parties;
4. Configure a nodeSelector for every pod to pin it to a specific cluster node. This step is crucial: the official configuration omits it, and without it problems arise later. nodeSelector matches on node labels, so the labeling steps in the previous section are a prerequisite for this setting.
Deploy the FATE clusters
If everything is in order, you can now deploy the three FATE clusters with kubefate cluster install (for the steps without pitfalls, simply follow the official docs):
kubefate cluster install -f ./examples/party-10000/cluster10000.yaml
kubefate cluster install -f ./examples/party-9999/cluster9999.yaml
kubefate cluster install -f ./examples/party-9998/cluster9998.yaml
KubeFATE then creates three jobs, one per FATE cluster. You can inspect them with kubefate job ls, or simply watch the cluster status in KubeFATE until it becomes Running:
[root@harbor kubefate]# watch kubefate cluster ls
UUID NAME NAMESPACE REVISION STATUS CHART ChartVERSION AGE
7bca70c1-236c-4931-81f8-1350cce579d4 fate-9998 fate-9998 1 Running fate v1.8.0 18m
143378db-b84d-4045-8615-11d36335d5b2 fate-9999 fate-9999 0 Creating fate v1.8.0 17m
d3e27a39-c8de-4615-96f2-29012f3edc68 fate-10000 fate-10000 0 Creating fate v1.8.0 17m
This step downloads roughly 10 GB of images from the NetEase mirror registry, so the first run takes a while depending on your network (wait patiently until the status becomes Running). Check download progress with:
kubectl get po -n fate-9998
kubectl get po -n fate-9999
kubectl get po -n fate-10000
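The three checks above can be wrapped in a small loop so they are easy to rerun. A sketch (the kubectl call is commented out so the loop itself is shown; uncomment it on a live cluster):

```shell
# Check pod status in all three party namespaces in one go.
for ns in fate-9998 fate-9999 fate-10000; do
  echo "== $ns =="
  # kubectl get po -n "$ns" -o wide
done
```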
Once all images are downloaded, the result looks like this:
[root@harbor kubefate]# kubectl get po -n fate-9998 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
client-6f64dfc96d-45dzd 1/1 Running 0 21h 10.244.1.152 gpu-51 <none> <none>
clustermanager-578ddd9665-whwxq 1/1 Running 0 21h 10.244.1.153 gpu-51 <none> <none>
mysql-5d5b7bd654-78wp7 1/1 Running 0 21h 10.244.1.150 gpu-51 <none> <none>
nodemanager-0-5c4868fb85-mrd6h 2/2 Running 0 21h 10.244.1.151 gpu-51 <none> <none>
nodemanager-1-787588cd7c-2ds68 2/2 Running 0 21h 10.244.1.154 gpu-51 <none> <none>
nodemanager-2-d7f986fb5-wclkr 2/2 Running 0 21h 10.244.1.148 gpu-51 <none> <none>
python-f6c4f885c-mh8ws 2/2 Running 0 21h 10.244.1.149 gpu-51 <none> <none>
rollsite-c946d6989-znm7b 1/1 Running 0 21h 10.244.1.147 gpu-51 <none> <none>
fate-9998 and fate-9999 came up normally, but fate-10000 did not: its pods were pinned to the master node, and every pod pinned to the master stayed Pending. Describing a pending pod shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3s (x5 over 4m19s) default-scheduler 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity/selector.
The error occurs because the master node disallows pod scheduling by default. I solved it following this blog:
https://blog.csdn.net/weixin_43114954/article/details/119153903
[root@harbor ~]# kubectl taint nodes --all node-role.kubernetes.io/master-
node/harbor.clife.io untainted
taint "node-role.kubernetes.io/master" not found
taint "node-role.kubernetes.io/master" not found
The "not found" messages above can be ignored. The master node can now schedule pods, and shortly afterwards all pods under fate-10000 deployed successfully.
Frequent mysql pod restarts
While using FATEBoard, I found that fate-9999's mysql pod kept restarting, making FATEBoard unreachable. Its logs showed nothing obviously wrong:
[root@harbor kubefate]# kubectl logs mysql-846476f9bf-j96nz -n fate-9999
2022-12-09 02:37:22+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.28-1debian10 started.
2022-12-09 02:37:22+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2022-12-09 02:37:22+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.28-1debian10 started.
2022-12-09T02:37:22.874490Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.28) starting as process 1
2022-12-09T02:37:23.027833Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-12-09T02:37:23.630021Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-12-09T02:37:23.861099Z 0 [System] [MY-010229] [Server] Starting XA crash recovery...
2022-12-09T02:37:23.878257Z 0 [System] [MY-010232] [Server] XA crash recovery finished.
2022-12-09T02:37:23.982436Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2022-12-09T02:37:23.982493Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2022-12-09T02:37:23.984665Z 0 [Warning] [MY-011810] [Server] Insecure configuration for --pid-file: Location '/var/run/mysqld' in the path is accessible to all OS users. Consider choosing a different directory.
2022-12-09T02:37:24.108885Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2022-12-09T02:37:24.108958Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.28' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MySQL Community Server - GPL.
Some users suggested this was the server running out of memory, so I created a 16 GB swap file on k8s-node02, the server hosting fate-9999:
[root@k8s-node02 ~]# dd if=/dev/zero of=/home/swapfile bs=1024 count=16777216
16777216+0 records in
16777216+0 records out
17179869184 bytes (17 GB) copied, 62.5734 s, 275 MB/s
[root@k8s-node02 ~]# mkswap /home/swapfile
Setting up swapspace version 1, size = 16777212 KiB
no label, UUID=d0a7f218-10a6-406a-9bea-be90b8493828
[root@k8s-node02 ~]# swapon /home/swapfile
swapon: /home/swapfile: insecure permissions 0644, 0600 suggested.
[root@k8s-node02 ~]# vim /etc/fstab # edit /etc/fstab so the swap file is activated at every boot; append the following line at the end:
...
/home/swapfile swap swap defaults 0 0
...
[root@k8s-node02 ~]# free -m
total used free shared buff/cache available
Mem: 15847 14440 242 760 1164 315
Swap: 16383 5 16378
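As a sanity check on the dd parameters used above, the block size times the block count should come out to exactly 16 GiB:

```shell
# 1024-byte blocks x 16777216 blocks = 17179869184 bytes = 16 GiB.
BLOCK_SIZE=1024
COUNT=16777216
BYTES=$((BLOCK_SIZE * COUNT))
GIB=$((BYTES / 1024 / 1024 / 1024))
echo "${BYTES} bytes = ${GIB} GiB"
```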
After that, fate-9999's mysql pod behaved normally and stopped restarting.
Verify the FATE deployment
From the kubefate cluster ls output above, the cluster ID of fate-9998 is 7bca70c1-236c-4931-81f8-1350cce579d4, fate-9999 is 143378db-b84d-4045-8615-11d36335d5b2, and fate-10000 is d3e27a39-c8de-4615-96f2-29012f3edc68. Use kubefate cluster describe to query each cluster's concrete access information:
[root@harbor kubefate]# kubefate cluster describe 7bca70c1-236c-4931-81f8-1350cce579d4
....
Info dashboard:
- party9998.notebook.example.com
- party9998.fateboard.example.com
ip: 10.6.17.106
port: 30081
status:
containers:
client: Running
clustermanager: Running
fateboard: Running
mysql: Running
nodemanager-0: Running
nodemanager-0-eggrollpair: Running
nodemanager-1: Running
nodemanager-1-eggrollpair: Running
python: Running
rollsite: Running
deployments:
client: Available
clustermanager: Available
mysql: Available
nodemanager-0: Available
nodemanager-1: Available
python: Available
rollsite: Available
From the output, Info->dashboard contains:
- The Jupyter Notebook address: party9998.notebook.example.com. This is the platform for data scientists to do modeling and analysis; FATE-Client is already integrated.
- The FATEBoard address: party9998.fateboard.example.com. FATEBoard lets us check the status of current training jobs.
Querying fate-10000's info likewise shows that although the dashboard hostnames differ, the IP is always 10.6.17.106, i.e. the ingress-nginx address. So even a request to party10000.fateboard.example.com goes through 10.6.17.106 first, not 10.6.17.104, the host where fate-10000 actually runs.
Configure hosts entries on the machine whose browser will access the FATE cluster
On a Windows machine, add the domain mappings to C:\WINDOWS\system32\drivers\etc\hosts:
10.6.17.106 party9998.notebook.example.com
10.6.17.106 party9998.fateboard.example.com
10.6.17.106 party9999.notebook.example.com
10.6.17.106 party9999.fateboard.example.com
10.6.17.106 party10000.notebook.example.com
10.6.17.106 party10000.fateboard.example.com
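The six entries can be generated rather than typed. A sketch that writes to a temporary file (an assumption for safe demonstration; append the result to your real hosts file, /etc/hosts on Linux or the Windows path above):

```shell
# Generate notebook + fateboard hosts entries for all three parties.
INGRESS_IP="10.6.17.106"
HOSTS_FRAGMENT="$(mktemp)"
for party in 9998 9999 10000; do
  printf '%s party%s.notebook.example.com\n' "$INGRESS_IP" "$party" >> "$HOSTS_FRAGMENT"
  printf '%s party%s.fateboard.example.com\n' "$INGRESS_IP" "$party" >> "$HOSTS_FRAGMENT"
done
cat "$HOSTS_FRAGMENT"
```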
Note that all of the hostnames above map to 10.6.17.106.
Log in to party 10000's FATEBoard at party10000.fateboard.example.com:32303; the username and password are shown in the figure below:
The port in that URL is the ingress service's port, shown by:
[root@harbor kubefate]# kubectl -n ingress-nginx get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.1.209.99 10.6.17.106 80:32303/TCP,443:31648/TCP 120d
ingress-nginx-controller-admission ClusterIP 10.1.241.232 <none> 443/TCP 120d
Problems:
1. The FATEBoard UI became unreachable
A day later, the FATEBoard UIs for namespaces fate-9998 and fate-10000 stopped responding; only fate-9999's still worked. On inspection:
[root@harbor kubefate]# kubectl get pods -n fate-9998
NAME READY STATUS RESTARTS AGE
client-7ccbc89559-njr2m 1/1 Running 0 3d21h
clustermanager-fcb86747f-8zzh7 1/1 Running 0 3d21h
mysql-6d546bd578-9mfvn 1/1 Running 37 (117m ago) 3d21h
nodemanager-0-66dfd58cdc-76wqc 2/2 Running 0 3d21h
nodemanager-1-7b7c65c685-jb2gs 2/2 Running 0 3d21h
python-594cd5c47b-vl4mb 1/2 CrashLoopBackOff 473 (117s ago) 3d21h
rollsite-6b77d9f5f7-lk6dm 1/1 Running 0 3d21h
The python pod is in CrashLoopBackOff. It contains two containers, fateboard and ping-mysql; check the ping-mysql container: [root@harbor kubefate]# kubectl logs -f python-594cd5c47b-vl4mb -n fate-9998 -c ping-mysql
This showed that mysql was at fault, so I redeployed fate-9998's mysql: kubectl rollout restart deployment mysql -n fate-9998
Then redeploy fate-9998's python: kubectl rollout restart deployment python -n fate-9998
Problem solved.
After the restarts, a new problem may appear; taking fate-9998 as an example:
(app-root) bash-4.2# flow
bash: flow: command not found
The flow command no longer works and must be reinstalled manually.
Enter fate-9998's python container and install fate-client:
[root@harbor kubefate]# kubectl exec -it svc/fateflow -c python -n fate-9998 -- bash
(app-root) bash-4.2# pip install fate-client -i https://pypi.tuna.tsinghua.edu.cn/simple
On the master node, look up the fateflow service IP:
[root@harbor kubefate]# kubectl describe svc fateflow -n fate-9998
Name: fateflow
Namespace: fate-9998
Labels: app.kubernetes.io/managed-by=Helm
chart=fate
cluster=fate
fateMoudle=fateflow
heritage=Helm
name=fate-9998
owner=kubefate
partyId=9998
release=fate-9998
Annotations: meta.helm.sh/release-name: fate-9998
meta.helm.sh/release-namespace: fate-9998
Selector: fateMoudle=python,name=fate-9998,partyId=9998
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: None
IPs: None
Port: tcp-grpc 9360/TCP
TargetPort: 9360/TCP
Endpoints: 10.244.1.195:9360
Port: tcp-http 9380/TCP
TargetPort: 9380/TCP
Endpoints: 10.244.1.195:9380
Session Affinity: None
Events: <none>
Using the Endpoints above as the flow service IP, go back into fate-9998's python container:
(app-root) bash-4.2# flow init --ip 10.244.1.195 --port 9380 # initialize flow
{
"retcode": 0,
"retmsg": "Fate Flow CLI has been initialized successfully."
}
(app-root) bash-4.2# pipeline init --ip 10.244.1.195 --port 9380 # initialize pipeline
Pipeline configuration succeeded.
(app-root) bash-4.2# pipeline config check
Flow server status normal, Flow version: 1.7.2
2. Many Evicted pods in fate-10000
Resource pressure on the fate-10000 node caused the python pod to be recreated many times, each copy ending up Evicted. The long trail of failed pods clutters the status view, so delete those failed pod records:
[root@harbor kubefate]# kubectl get pods -n fate-10000 | awk '/Evicted/{print $1}' | xargs -r kubectl delete pod -n fate-10000
Also delete the pods stuck in ContainerStatusUnknown state:
[root@harbor kubefate]# kubectl get pods -n fate-10000 | awk '/ContainerStatusUnknown/{print $1}' | xargs -r kubectl delete pod -n fate-10000
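The awk filter can be sanity-checked against captured output before pointing it at a live cluster. A sketch (the pod names below are made up for illustration):

```shell
# Select the names of Evicted pods from `kubectl get pods` output.
pods='NAME READY STATUS RESTARTS AGE
python-594cd5c47b-aaaaa 0/2 Evicted 0 3d
python-594cd5c47b-bbbbb 0/2 ContainerStatusUnknown 0 3d
rollsite-6b77d9f5f7-ccccc 1/1 Running 0 3d'

echo "$pods" | awk '/Evicted/{print $1}'
# then: ... | xargs -r kubectl delete pod -n fate-10000
```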
This concludes the walkthrough of installing a 3-node federated learning FATE cluster v1.7.2 on k8s.