Environment
Master nodes
| Host | IP | Version |
| --- | --- | --- |
| master01 | 192.168.66.50 | k8s-1.23.17 |
| master02 | 192.168.66.55 | k8s-1.23.17 |
| master03 | 192.168.66.56 | k8s-1.23.17 |
etcd cluster nodes
| Host | IP | Version |
| --- | --- | --- |
| etcd01 | 192.168.66.58 | 3.5.6 |
| etcd02 | 192.168.66.59 | 3.5.6 |
| etcd03 | 192.168.66.57 | 3.5.6 |
In production, to protect against mis-operations or server hardware failures that take down our VMs or the k8s cluster, we build multi-node high-availability clusters, including running k8s with an external etcd cluster. Even so, data can still be lost, so the etcd data must be backed up regularly.
etcd backup
We back up the etcd cluster data with etcdctl snapshot.
Backups are usually done with a script, run separately on each of the three etcd nodes. Although every etcd node holds the same data, backing up all three protects against a VM that cannot be brought back up. Back it up once an hour via a scheduled job (a sample crontab entry is shown after the script):
#!/bin/bash
#
### etcd cluster backup
### run on every etcd node, pointing --endpoints at that node's own address (this example is etcd01)
time_back=`date +%Y%m%d-%H%M%S`
path='/etc/etcd/snapshot/'
/usr/bin/etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://192.168.66.58:2379 snapshot save ${path}etcd-snapshot-${time_back}.db
### delete backup files older than 7 days
/usr/bin/find /etc/etcd/snapshot -name "*.db" -mtime +7 | xargs rm -f
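For the hourly schedule mentioned above, a possible /etc/crontab entry, assuming the script is saved as /etc/etcd/etcd-backup.sh (an illustrative path) and is executable:
# /etc/crontab on each etcd node: run the backup at the top of every hour
0 * * * * root /bin/bash /etc/etcd/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1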
To keep backups from piling up and filling the disk, snapshots older than 7 days are deleted; the oldest backups are of little value anyway.
Restoring the etcd cluster from a snapshot
1: To restore the etcd cluster from a snapshot, first stop the docker and kubelet services on the master nodes and the kubelet service on the etcd nodes, making sure no service is still calling etcd.
[root@master01 ~]# systemctl stop kubelet
[root@master01 ~]# systemctl stop docker.socket && systemctl stop docker
[root@master01 ~]# docker ps
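The etcd nodes need the same treatment so that the etcd static Pod containers are actually stopped before the data directory is touched; a sketch, to be run on each etcd node, mirroring the docker/kubelet restarts used later in this article:
# run on etcd01, etcd02 and etcd03
systemctl stop kubelet
systemctl stop docker.socket && systemctl stop docker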
2: etcd stores its data under /var/lib/etcd/, so remove (or move aside) the old data under /var/lib/etcd/ on every etcd node. A different directory can also be used, but then the etcd configuration must be changed, i.e. the --data-dir flag in the etcd static Pod manifest, shown below as taken from etcd01; a snippet for moving the old data aside follows the manifest.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.66.58:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.66.58:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --initial-advertise-peer-urls=https://192.168.66.58:2380
    - --initial-cluster=infra2=https://192.168.66.59:2380,infra1=https://192.168.66.58:2380,infra0=https://192.168.66.57:2380
    - --initial-cluster-state=new
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://192.168.66.58:2379,https://127.0.0.1:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.66.58:2380
    - --name=infra1
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: registry.k8s.io/etcd:3.5.6-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /health
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
status: {}
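With the services stopped, clear the old data on every etcd node. A minimal sketch; moving the directory aside instead of deleting it keeps a fallback copy (the backup suffix is illustrative), and etcdctl snapshot restore will recreate /var/lib/etcd itself:
# run on etcd01, etcd02 and etcd03
mv /var/lib/etcd /var/lib/etcd.bak-$(date +%Y%m%d)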
Once the old data is out of the way, restore the data from the snapshot on the three nodes one after another, starting with etcd01.
Restore command on etcd01:
etcdctl snapshot restore etcd-snapshot-20231213-121501.db \
--data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
--name=infra1 --initial-advertise-peer-urls=https://192.168.66.58:2380 \
--initial-cluster="infra0=https://192.168.66.57:2380,infra1=https://192.168.66.58:2380,infra2=https://192.168.66.59:2380"
Restore commands on etcd02 and etcd03:
etcdctl snapshot restore etcd-snapshot-20231213-121501.db \
--data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
--name=infra2 --initial-advertise-peer-urls=https://192.168.66.59:2380 \
--initial-cluster="infra0=https://192.168.66.57:2380,infra1=https://192.168.66.58:2380,infra2=https://192.168.66.59:2380"
etcdctl snapshot restore etcd-snapshot-20231213-121501.db \
--data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
--name=infra0 --initial-advertise-peer-urls=https://192.168.66.57:2380 \
--initial-cluster="infra0=https://192.168.66.57:2380,infra1=https://192.168.66.58:2380,infra2=https://192.168.66.59:2380"
Note: restore all three members from the same, most recent snapshot file, and double-check the path to the snapshot file.
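Before restoring, the chosen snapshot can be sanity-checked. A quick sketch using etcdctl snapshot status, which prints the hash, revision, key count and size:
# verify the snapshot file before restoring (run where the file is located)
etcdctl snapshot status etcd-snapshot-20231213-121501.db -w table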
3: After the data has been restored, start the docker and kubelet services again; do this as soon as possible on all three etcd nodes.
[root@etcd01 snapshot]# systemctl start docker && systemctl start kubelet
[root@etcd02 snapshot]# systemctl start docker && systemctl start kubelet
[root@etcd03 snapshot]# systemctl start docker && systemctl start kubelet
4: Check the service status, whether the etcd service has started, and whether the cluster is healthy.
[root@etcd01 snapshot]# systemctl status docker && systemctl status kubelet
[root@etcd02 snapshot]# systemctl status docker && systemctl status kubelet
[root@etcd03 snapshot]# systemctl status docker && systemctl status kubelet
Run docker ps on each of the three nodes to confirm that the etcd container is up.
[root@etcd01 snapshot]# docker ps
[root@etcd02 snapshot]# docker ps
[root@etcd03 snapshot]# docker ps
Check the cluster members with etcdctl member list, or use a script like the following to check the cluster status (a member list example follows the script):
#!/bin/bash
#
### etcd cluster status check
#### check the health of each cluster endpoint
echo "etcd cluster endpoint health check"
/usr/bin/etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://192.168.66.58:2379,https://192.168.66.59:2379,https://192.168.66.57:2379 endpoint health -w table
#### check the detailed cluster status
echo "etcd=`hostname`: detailed member status, including leader or follower"
/usr/bin/etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://192.168.66.58:2379,https://192.168.66.59:2379,https://192.168.66.57:2379 endpoint status --write-out=table
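For a plain membership listing, the same certificates can be reused; a minimal sketch, run from any etcd node:
# list member IDs, names, peer URLs and client URLs
etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://192.168.66.58:2379 member list -w table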
In the IS LEADER column of the second table, etcd01 shows true while etcd02 and etcd03 show false, so there is no split brain and the cluster state is normal.
Bring the master node services back and check the state of the k8s cluster.
[root@master01 ~]# systemctl start docker && systemctl start kubelet
[root@master02 ~]# systemctl start docker && systemctl start kubelet
[root@master03 ~]# systemctl start docker && systemctl start kubelet
Check the service status:
[root@master01 ~]# systemctl status docker && systemctl status kubelet
[root@master02 ~]# systemctl status docker && systemctl status kubelet
[root@master03 ~]# systemctl status docker && systemctl status kubelet
Check whether the k8s cluster has recovered:
[root@master01 ~]# kubectl get nodes
As we can see, the k8s cluster has recovered and the pods created earlier still exist; a few pods are not in a healthy state, so keep troubleshooting those.
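To find the unhealthy pods and see why they are failing, the usual kubectl checks can be used; a short sketch (pod and namespace names are placeholders):
# list all pods across namespaces and spot the unhealthy ones
kubectl get pods -A -o wide
# inspect events and logs of a problematic pod
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>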