This article introduces the various placement group (PG) states and how PG states change while a Ceph cluster experiences and recovers from failures.
Placement Group States
creating
Ceph is still creating the placement group.
activating
The placement group is peered but not yet active.
active
Ceph will process requests to the placement group.
clean
Ceph replicated all objects in the placement group the correct
number of times.
down
A replica with necessary data is down, so the placement group is
offline.
scrubbing
Ceph is checking the placement group metadata for inconsistencies.
deep
Ceph is checking the placement group data against stored checksums.
degraded
Ceph has not replicated some objects in the placement group the
correct number of times yet.
inconsistent
Ceph detects inconsistencies in one or more replicas of an
object in the placement group (e.g. objects are the wrong size,
objects are missing from one replica *after* recovery finished,
etc.).
peering
The placement group is undergoing the peering process.
repair
Ceph is checking the placement group and repairing any
inconsistencies it finds (if possible).
recovering
Ceph is migrating/synchronizing objects and their replicas.
forced_recovery
High recovery priority of that PG is enforced by user.
recovery_wait
The placement group is waiting in line to start recovery.
recovery_toofull
A recovery operation is waiting because the destination OSD is over
its full ratio.
recovery_unfound
Recovery stopped due to unfound objects.
backfilling
Ceph is scanning and synchronizing the entire contents of a
placement group instead of inferring what contents need to be
synchronized from the logs of recent operations. Backfill is a
special case of recovery.
forced_backfill
High backfill priority of that PG is enforced by user.
backfill_wait
The placement group is waiting in line to start backfill.
backfill_toofull
A backfill operation is waiting because the destination OSD is over
its full ratio.
backfill_unfound
Backfill stopped due to unfound objects.
incomplete
Ceph detects that a placement group is missing information about
writes that may have occurred, or does not have any healthy copies.
If you see this state, try to start any failed OSDs that may contain
the needed information. In the case of an erasure coded pool,
temporarily reducing min_size may allow recovery.
stale
The placement group is in an unknown state - the monitors have not
received an update for it since the placement group mapping changed.
remapped
The placement group is temporarily mapped to a different set of OSDs
from what CRUSH specified.
undersized
The placement group has fewer copies than the configured pool
replication level.
peered
The placement group has peered, but cannot serve client IO due to
not having enough copies to reach the pool's configured min_size
parameter. Recovery may occur in this state, so the PG may heal up
to min_size eventually.
snaptrim
Trimming snaps.
snaptrim_wait
Queued to trim snaps.
snaptrim_error
Error stopped trimming snaps.
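The state names above can be used directly to find the PGs currently in a given state. A minimal sketch; the pool name and the states listed are examples, and the exact output format varies by Ceph release:
```
# Overall summary of PG states in the cluster
ceph pg stat

# List PGs that are currently degraded or undersized
ceph pg ls degraded
ceph pg ls-by-pool test_pool undersized

# PGs that have been stuck unclean / inactive / stale for a while
ceph pg dump_stuck unclean
ceph pg dump_stuck stale
```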
Placement Group Concepts
Peering
The process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the objects (and their
metadata) in that PG. Note that agreeing on the state does not mean
that they all have the latest contents.
Acting Set
The ordered list of OSDs who are (or were as of some epoch)
responsible for a particular placement group.
Up Set
The ordered list of OSDs responsible for a particular placement group
for a particular epoch according to CRUSH. It is normally the same as
the *Acting Set*, except when the *Acting Set* has been explicitly
overridden via `pg_temp` in the OSD map.
a. acting set & up set: every PG has both sets. The acting set records the OSDs that hold the PG's replicas; for example acting [0,1,2] means the PG's replicas live on osd.0, osd.1 and osd.2, and the first entry, osd.0, is the PG's primary. Normally the up set and the acting set are identical; to see how they can differ, you first need to understand pg_temp.
b. pg_temp: suppose a PG is short of replicas and its mapping is up/acting = [1,2]/[1,2], and osd.3 is then added as a replica. CRUSH now decides that osd.3 should be the PG's primary, but osd.3 does not yet hold any of the PG's data and cannot act as primary, so a pg_temp entry is requested that keeps osd.1 as the primary for the time being. The up set is what CRUSH computed, while the acting set is the temporary pg_temp mapping, so the PG becomes up/acting = [3,1,2]/[1,2,3]. Once all of the data has been copied to osd.3, the pg_temp entry is removed and the mapping becomes [3,1,2]/[3,1,2].
**The up set is the list of OSDs that CRUSH currently maps the PG to; the acting set is the list of OSDs that actually handles client requests for the PG.** In most cases the two sets are identical; when they differ, Ceph is migrating data or an OSD is recovering, or the cluster may have some other fault.
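To inspect a PG's current up and acting sets (and whether a pg_temp mapping is in effect), the cluster can be queried directly. A minimal sketch; the PG id 1.0, the pool test_pool and the object name myobject are illustrative placeholders:
```
# Map a PG id to its current up and acting sets and primaries
ceph pg map 1.0

# Map an object name to its PG and the OSDs it lives on
ceph osd map test_pool myobject

# Per-PG brief dump, showing the UP and ACTING columns side by side
ceph pg dump pgs_brief
```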
Current Interval or Past Interval
A sequence of OSD map epochs during which the *Acting Set* and *Up
Set* for a particular placement group do not change.
Primary
The member (and by convention first) of the *Acting Set*, that is
responsible for coordinating peering, and is the only OSD that will
accept client-initiated writes to objects in a placement group.
Replica
A non-primary OSD in the *Acting Set* for a placement group (and who
has been recognized as such and *activated* by the primary).
Stray
An OSD that is not a member of the current *Acting Set*, but has not
yet been told that it can delete its copies of a particular
placement group.
Recovery
Ensuring that copies of all of the objects in a placement group are
on all of the OSDs in the *Acting Set*. Once *Peering* has been
performed, the *Primary* can start accepting write operations, and
*Recovery* can proceed in the background.
PG Info
Basic metadata about the placement group's creation epoch, the
version for the most recent write to the placement group, *last
epoch started*, *last epoch clean*, and the beginning of the
*current interval*. Any inter-OSD communication about placement
groups includes the *PG Info*, such that any OSD that knows a
placement group exists (or once existed) also has a lower bound on
*last epoch clean* or *last epoch started*.
PG Log
A list of recent updates made to objects in a placement group. Note
that these logs can be truncated after all OSDs in the *Acting Set*
have acknowledged up to a certain point.
Missing Set
Each OSD notes update log entries and if they imply updates to the
contents of an object, adds that object to a list of needed updates.
This list is called the *Missing Set* for that `<OSD,PG>`.
Authoritative History
A complete, and fully ordered set of operations that, if performed,
would bring an OSD's copy of a placement group up to date.
Epoch
A (monotonically increasing) OSD map version number.
Last Epoch Start
The last epoch at which all nodes in the *Acting Set* for a
particular placement group agreed on an *Authoritative History*. At
this point, *Peering* is deemed to have been successful.
up_thru
Before a *Primary* can successfully complete the *Peering* process,
it must inform a monitor that it is alive through the current OSD map
*Epoch* by having the monitor set its *up_thru* in the osd map.
This helps *Peering* ignore previous *Acting Sets* for which
*Peering* never completed after certain sequences of failures, such
as the second interval below:

- *acting set* = [A,B]
- *acting set* = [A]
- *acting set* = [] very shortly after (e.g., simultaneous
  failure, but staggered detection)
- *acting set* = [B] (B restarts, A does not)
Last Epoch Clean
The last *Epoch* at which all nodes in the *Acting Set* for a
particular placement group were completely up to date (both
placement group logs and object contents). At this point, *recovery*
is deemed to have been completed.
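Most of the per-PG metadata described above (PG info, last epoch started/clean, past intervals, per-peer missing sets) can be inspected on a live cluster, and the on-disk PG log can be read with the objectstore tool. A minimal sketch; the PG id 1.0 and OSD data path are placeholders, and ceph-objectstore-tool has to be run with the target OSD daemon stopped:
```
# Dump a PG's full peering/recovery state as JSON (info, past intervals, peers)
ceph pg 1.0 query

# Read the PG info and PG log stored by a single OSD (run on that OSD's host,
# with the OSD daemon stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.0 --op info
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.0 --op log
```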
3. Fault Simulation
**3.1 The undersized+degraded state**
* a. Stop osd.1
```
systemctl stop ceph-osd@1
```
* b. Check the PG status
```
ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12/36 objects degraded (33.333%)
```
* c. Check the cluster health
```
ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.1 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is active+undersized+degraded, acting [0,2]
    pg 1.1 is active+undersized+degraded, acting [2,0]
```
* d. Client I/O
```
# Write an object: store the ceph.conf file in test_pool as an object named myobject
$ bin/rados -p test_pool put myobject ceph.conf
# Read the object back into a file
$ bin/rados -p test_pool get myobject myobject.old
# List the files
$ ll ceph.conf*
-rw-r--r-- 1 root root 6211 Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6211 Jul  3 19:57 ceph.conf.old
```
* e. Summary:
To simulate a failure (size = 3, min_size = 2) we manually stopped osd.1 and then checked the PG states: they are now active+undersized+degraded. When an OSD hosting a PG goes down, the PG enters the undersized+degraded state. The trailing [0,2] means the two surviving replicas are on osd.0 and osd.2, and at this point clients can still read and write normally.
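To see which PGs a down OSD affects before deciding whether to intervene, the PG list can be filtered by OSD and by state. A minimal sketch; osd.1 matches the OSD stopped above:
```
# PGs mapped to osd.1
ceph pg ls-by-osd osd.1

# Only the PGs that are currently degraded or undersized
ceph pg ls degraded undersized

# Where osd.1 sits in the CRUSH tree (host / failure domain)
ceph osd find 1
```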
**3.2 The Peered state**
Peering has completed, but the PG's current Acting Set is smaller than the minimum number of replicas (min_size) configured for the pool.
* a. Stop two replicas: osd.1 and osd.0
```
$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0
```
* b. Check the cluster health
```
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 4 pgs inactive; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 1.6 is stuck inactive for 516.741081, current state undersized+degraded+peered, last acting [2]
    pg 1.10 is stuck inactive for 516.737888, current state undersized+degraded+peered, last acting [2]
    pg 1.11 is stuck inactive for 516.737408, current state undersized+degraded+peered, last acting [2]
    pg 1.12 is stuck inactive for 516.736955, current state undersized+degraded+peered, last acting [2]
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is undersized+degraded+peered, acting [2]
    pg 1.1 is undersized+degraded+peered, acting [2]
```
* c. Client I/O (hangs)
```
# Reading the object back hangs
$ bin/rados -p test_pool get myobject ceph.conf.old
```
* d. Summary:
Only the copy on osd.2 survives now, and the PGs have gained an extra state: peered. Literally the word means "looked at closely"; here it can be read as the PG having negotiated and searched among its OSDs.
Reading the file now blocks forever. Why can nothing be read? Because min_size is set to 2: if fewer than 2 replicas survive (here only 1), the PG no longer serves external I/O requests.
* e. Setting min_size=1 resolves the hung I/O
```
# Set min_size = 1
$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1
```
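Lowering min_size like this trades safety for availability, so it is usually put back once the failed OSDs return and the PGs are clean again. A minimal sketch, assuming the same test_pool:
```
# Check the pool's replica settings
ceph osd pool get test_pool size
ceph osd pool get test_pool min_size

# Restore min_size after the failed OSDs are back and recovery has finished
ceph osd pool set test_pool min_size 2
```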
* f. Check the cluster health
```
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized
    pg 1.0 is stuck undersized for 65.958983, current state active+undersized+degraded, last acting [2]
    pg 1.1 is stuck undersized for 65.960092, current state active+undersized+degraded, last acting [2]
    pg 1.2 is stuck undersized for 65.960974, current state active+undersized+degraded, last acting [2]
```
* g. Client I/O
```
# Read the object back into a file, then list the files
$ ll -lh ceph.conf*
-rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6.1K Jul  3 20:11 ceph.conf.old
-rw-r--r-- 1 root root 6.1K Jul  3 20:11 ceph.conf.old.1
```
* h. Summary:
The Peered state is gone and client file I/O works normally again.
With min_size=1, a single surviving replica in the cluster is enough for the PG to serve external I/O requests.
> The Peered state can be understood as the PG waiting for more replicas to come online.
> With min_size = 2, the Peered state clears as soon as at least two replicas are alive.
> A PG in the Peered state cannot serve external requests, and its I/O is suspended.
**3.3 The Remapped state**
A PG shows Remapped when peering has finished but its current Acting Set does not match its Up Set.
* a. Stop osd.x
```
systemctl stop ceph-osd@x
```
* b. Wait 5 minutes, then start osd.x
```
systemctl start ceph-osd@x
```
* c. Check the PG status
```
$ ceph pg stat
1416 pgs: 6 active+clean+remapped, 1288 active+clean, 3 stale+active+clean, 119 active+undersized+degraded; 74940 MB data, 250 GB used, 185 TB / 185 TB avail; 1292/48152 objects degraded (2.683%)
$ ceph pg dump | grep remapped
dumped all
13.cd ? ? ? ? 0 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 ? ? ? ? 0 ? ?2 ? ? ? ?2 ? ? ?active+clean+remapped 2018-07-03 20:26:14.478665 ? ? ? 9453'2 ? 20716:11343 ? ?[10,23] ? ? ? ? 10 [10,23,14] ? ? ? ? ? ? 10 ? ? ? 9453'2 2018-07-03 20:26:14.478597 ? ? ? ? ?9453'2 2018-07-01 13:11:43.262605
3.1a ? ? ? ? 44 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 373293056 1500 ? ? 1500 ? ? ?active+clean+remapped 2018-07-03 20:25:47.885366 ?20272'79063 ?20716:109173 ? ? [9,23] ? ? ? ? ?9 ?[9,23,12] ? ? ? ? ? ? ?9 ?20272'79063 2018-07-03 03:14:23.960537 ? ? 20272'79063 2018-07-03 03:14:23.960537
5.f ? ? ? ? ? 0 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 ? ? ? ? 0 ? ?0 ? ? ? ?0 ? ? ?active+clean+remapped 2018-07-03 20:25:47.888430 ? ? ? ? ?0'0 ? 20716:15530 ? ? [23,8] ? ? ? ? 23 ?[23,8,22] ? ? ? ? ? ? 23 ? ? ? ? ?0'0 2018-07-03 06:44:05.232179 ? ? ? ? ? ? 0'0 2018-06-30 22:27:16.778466
3.4a ? ? ? ? 45 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 390070272 1500 ? ? 1500 ? ? ?active+clean+remapped 2018-07-03 20:25:47.886669 ?20272'78385 ?20716:108086 ? ? [7,23] ? ? ? ? ?7 ?[7,23,17] ? ? ? ? ? ? ?7 ?20272'78385 2018-07-03 13:49:08.190133 ? ? ?7998'78363 2018-06-28 10:30:38.201993
13.102 ? ? ? ?0 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 ? ? ? ? 0 ? ?5 ? ? ? ?5 ? ? ?active+clean+remapped 2018-07-03 20:25:47.884983 ? ? ? 9453'5 ? 20716:11334 ? ? [1,23] ? ? ? ? ?1 ?[1,23,14] ? ? ? ? ? ? ?1 ? ? ? 9453'5 2018-07-02 21:10:42.028288 ? ? ? ? ?9453'5 2018-07-02 21:10:42.028288
13.11d        1          0        0         0       0   4194304 1539     1539      active+clean+remapped 2018-07-03 20:25:47.886535  20343'22439   20716:86294     [4,23]          4  [4,23,15]              4  20343'22439 2018-07-03 17:21:18.567771     20343'22439 2018-07-03 17:21:18.567771

# Query again 2 minutes later
$ ceph pg stat
1416 pgs: 2 active+undersized+degraded+remapped+backfilling, 10 active+undersized+degraded+remapped+backfill_wait, 1401 active+clean, 3 stale+active+clean; 74940 MB data, 247 GB used, 179 TB / 179 TB avail; 260/48152 objects degraded (0.540%); 49665 kB/s, 9 objects/s recovering

$ ceph pg dump | grep remapped
dumped all
13.1e8 2 0 2 0 0 8388608 1527 1527 active+undersized+degraded+remapped+backfill_wait 2018-07-03 20:30:13.999637 9493'38727 20754:165663 [18,33,10] 18 [18,10] 18 9493'38727 2018-07-03 19:53:43.462188 0'0 2018-06-28 20:09:36.303126
```
* d. Client I/O
```
# rados reads and writes work normally
rados -p test_pool put myobject /tmp/test.log
```
* e. Summary:
1. When an OSD goes down or the cluster is expanded, CRUSH recalculates which OSDs a PG belongs to, and the PG is remapped onto other OSDs.
2. While a PG is Remapped, its current Acting Set differs from its Up Set.
3. Client I/O continues to work normally.
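To watch the remapped PGs drain back onto their CRUSH-chosen OSDs, the misplaced/degraded counters and the UP versus ACTING columns can be polled. A minimal sketch:
```
# Overall recovery/backfill progress (degraded and misplaced object counts)
ceph -s

# PGs whose acting set still differs from their up set
ceph pg dump pgs_brief | grep remapped

# Re-check periodically until everything is active+clean
watch -n 10 ceph pg stat
```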
**3.4 The Recovery state**
Recovery is the process in which a PG uses its PG log to synchronize and repair objects whose data is inconsistent or out of date.
* a. Stop osd.x
```
systemctl stop ceph-osd@x
```
* b. After 1 minute, start osd.x
```
systemctl start ceph-osd@x
```
* c. Check the cluster health
```
$ ceph health detail
HEALTH_WARN Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
PG_DEGRADED Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
    pg 1.19 is active+recovery_wait+degraded, acting [29,9,17]
```
* d. Summary:
Recovery restores data from the recorded PG log.
As long as the missed updates fit within the recorded PG log (osd_max_pg_log_entries = 10000 entries), the data can be recovered incrementally from the PG log alone.
**3.5 The Backfill state**
When a PG's replicas can no longer be caught up from the PG log, a full synchronization is needed; the data is recovered by copying all of the current Primary's objects in their entirety.
* a. Stop osd.x
```
systemctl stop ceph-osd@x
```
* b. Wait 10 minutes, then start osd.x
```
systemctl start ceph-osd@x
```
* c. Check the cluster health
```
$ ceph health detail
HEALTH_WARN Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
PG_DEGRADED Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
    pg 3.7f is active+undersized+degraded+remapped+backfilling, acting [21,29]
```
* d. Summary:
1. When data can no longer be recovered from the recorded PG log, a Backfill is needed to copy the data in full.
2. If the number of missed updates exceeds osd_max_pg_log_entries (10000 entries), a full copy of the data is required.
**3.6 The Stale state**
A PG typically becomes stale when:
* The monitors detect that the OSD hosting the PG's Primary is down.
* The Primary fails to report PG information to the monitors in time (for example because of network congestion).
* All three replicas of the PG are down.
* a. Stop the PG's three replica OSDs one after another; first stop osd.23
```
systemctl stop ceph-osd@23
```
* b. Then stop osd.24
```
systemctl stop ceph-osd@24
```
* c. With two replicas stopped, check the state of PG 1.45 (undersized+degraded+peered)
```
$ ceph health detail
HEALTH_WARN 2 osds down; Reduced data availability: 9 pgs inactive; Degraded data redundancy: 3041/47574 objects degraded (6.392%), 149 pgs unclean, 149 pgs degraded, 149 pgs undersized
OSD_DOWN 2 osds down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 9 pgs inactive
    pg 1.45 is stuck inactive for 281.355588, current state undersized+degraded+peered, last acting [10]
```
* d. Then stop osd.10, the third replica of PG 1.45
```
systemctl stop ceph-osd@10
```
* e. With all three replicas stopped, check the state of PG 1.45 (stale+undersized+degraded+peered)
```
ceph health detail
HEALTH_WARN 3 osds down; Reduced data availability: 26 pgs inactive, 2 pgs stale; Degraded data redundancy: 4770/47574 objects degraded (10.026%), 222 pgs unclean, 222 pgs degraded, 222 pgs undersized
OSD_DOWN 3 osds down
    osd.10 (root=default,host=ceph-xx-osd01) is down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 26 pgs inactive, 2 pgs stale
    pg 1.9 is stuck inactive for 171.200290, current state undersized+degraded+peered, last acting [13]
    pg 1.45 is stuck stale for 171.206909, current state stale+undersized+degraded+peered, last acting [10]
    pg 1.89 is stuck inactive for 435.573694, current state undersized+degraded+peered, last acting [32]
    pg 1.119 is stuck inactive for 435.574626, current state undersized+degraded+peered, last acting [28]
```
* f. Client I/O
```
# I/O against the mounted filesystem hangs
ll /mnt/
```
* g. Summary:
After stopping two replicas of the same PG, the PG state is undersized+degraded+peered.
After also stopping the third replica of that PG, the state becomes stale+undersized+degraded+peered.
1. A PG enters the stale state when all three of its replicas are down.
2. Such a PG cannot serve client reads or writes; I/O hangs.
3. A PG also goes stale when its Primary fails to report PG information to the monitors in time (for example because of network congestion).
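Stale and otherwise stuck PGs, along with the OSDs they were last served by, can be listed directly when diagnosing this situation. A minimal sketch:
```
# PGs stuck in the stale state, with their last acting OSDs
ceph pg dump_stuck stale

# PGs that are inactive and cannot serve I/O at all
ceph pg dump_stuck inactive

# Map the affected OSD ids back to hosts
ceph osd tree
```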
**3.7 The Inconsistent state**
Scrubbing has detected that one or more objects are inconsistent across the PG's replicas.
* a. Delete an object's head file from the osd.34 replica of PG 3.0
```
rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3
```
* b. Manually scrub PG 3.0
```
$ ceph pg scrub 3.0
instructing pg 3.0 on osd.34 to scrub
```
* c. Check the cluster health
```
$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.0 is active+clean+inconsistent, acting [34,23,1]
```
* d. Repair PG 3.0
```
$ ceph pg repair 3.0
instructing pg 3.0 on osd.34 to repair

# Check the cluster health again
$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent, 1 pg repair
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent, 1 pg repair
    pg 3.0 is active+clean+scrubbing+deep+inconsistent+repair, acting [34,23,1]

# A moment later the cluster health is back to normal
$ ceph health detail
HEALTH_OK
```
* e. Summary:
When data is inconsistent among a PG's three replicas, the inconsistent files can be repaired simply by running the ceph pg repair command; Ceph copies the missing or damaged files back from the other replicas.
Similarly, when an OSD goes down briefly, writes still succeed because two replicas remain in the cluster, but the data on osd.34 is not updated. Once osd.34 comes back online its data is stale, so the other OSDs push data to osd.34 to bring it up to date; during this recovery the PG state moves from inconsistent -> recovering -> clean, and the cluster returns to normal.
This is one scenario in which the cluster heals itself.
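Before running a repair it is often worth checking exactly which objects the scrub flagged and on which replica. A minimal sketch, reusing PG 3.0 from the steps above:
```
# List the objects that scrub found inconsistent in this PG
rados list-inconsistent-obj 3.0 --format=json-pretty

# Trigger a fresh deep scrub if the inconsistency list is empty or out of date
ceph pg deep-scrub 3.0
```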
**3.8 The Down state**
A PG goes Down when the data on a surviving OSD is too old and the other OSDs still online are not sufficient to complete recovery.
While in this state the PG cannot serve client I/O.
* a. Check the replicas of PG 3.7f
```
$ ceph pg dump | grep ^3.7f
dumped all
3.7f ? ? ? ? 43 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 494927872 1569 ? ? 1569 ? ? ? ? ? ? ? active+clean 2018-07-05 02:52:51.512598 ?21315'80115 ?21356:111666 ?[5,21,29] ? ? ? ? ?5 ?[5,21,29] ? ? ? ? ? ? ?5 ?21315'80115 2018-07-05 02:52:51.512568 ? ? ?6206'80083 2018-06-29 22:51:05.831219
```
* b. Stop osd.21, one replica of PG 3.7f
```
systemctl stop ceph-osd@21
```
* c. Check the state of PG 3.7f
```
$ ceph pg dump | grep ^3.7f
dumped all
3.7f ? ? ? ? 66 ? ? ? ? ? ? ? ? ?0 ? ? ? 89 ? ? ? ? 0 ? ? ? 0 591396864 1615 ? ? 1615 active+undersized+degraded 2018-07-05 15:29:15.741318 ?21361'80161 ?21365:128307 ? ? [5,29] ? ? ? ? ?5 ? ? [5,29] ? ? ? ? ? ? ?5 ?21315'80115 2018-07-05 02:52:51.512568 ? ? ?6206'80083 2018-06-29 22:51:05.831219
```
* d. Write data from the client, making sure it lands on PG 3.7f's surviving replicas [5,29]
```
# Write data with fio
```
* e. Stop replica osd.29 of PG 3.7f and check the PG state (undersized+degraded+peered)
```
# Stop replica osd.29 of this PG
systemctl stop ceph-osd@29

# PG 3.7f is now undersized+degraded+peered
ceph pg dump | grep ^3.7f
dumped all
3.7f ? ? ? ? 70 ? ? ? ? ? ? ? ? ?0 ? ? ?140 ? ? ? ? 0 ? ? ? 0 608174080 1623 ? ? 1623 undersized+degraded+peered 2018-07-05 15:35:51.629636 ?21365'80169 ?21367:132165 ? ? ? ?[5] ? ? ? ? ?5 ? ? ? ?[5] ? ? ? ? ? ? ?5 ?21315'80115 2018-07-05 02:52:51.512568 ? ? ?6206'80083 2018-06-29 22:51:05.831219
```
* f. Stop replica osd.5 of PG 3.7f and check the PG state (now stale+undersized+degraded+peered)
```
# Stop replica osd.5 of this PG
$ systemctl stop ceph-osd@5

# PG 3.7f is now stale+undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f ? ? ? ? 70 ? ? ? ? ? ? ? ? ?0 ? ? ?140 ? ? ? ? 0 ? ? ? 0 608174080 1623 ? ? 1623 stale+undersized+degraded+peered 2018-07-05 15:35:51.629636 ?21365'80169 ?21367:132165 ? ? ? ?[5] ? ? ? ? ?5 ? ? ? ?[5] ? ? ? ? ? ? ?5 ?21315'80115 2018-07-05 02:52:51.512568 ? ? ?6206'80083 2018-06-29 22:51:05.831219
```
* g. Bring replica osd.21 of PG 3.7f back up (its data is now out of date) and check the PG state (down)
```
# Bring osd.21 back up
$ systemctl start ceph-osd@21

# The PG is now in the down state
$ ceph pg dump | grep ^3.7f
dumped all
3.7f ? ? ? ? 66 ? ? ? ? ? ? ? ? ?0 ? ? ? ?0 ? ? ? ? 0 ? ? ? 0 591396864 1548 ? ? 1548 ? ? ? ? ? ? ? ? ? ? ? ? ?down 2018-07-05 15:36:38.365500 ?21361'80161 ?21370:111729 ? ? ? [21] ? ? ? ? 21 ? ? ? [21] ? ? ? ? ? ? 21 ?21315'80115 2018-07-05 02:52:51.512568 ? ? ?6206'80083 2018-06-29 22:51:05.831219
```
* h. Client I/O
```
# Client I/O hangs at this point
ll /mnt/
```
* i. Summary:
PG 3.7f started with three replicas [5,21,29]. After osd.21 was stopped, new data was written to osd.5 and osd.29. Then osd.29 and osd.5 were stopped as well, and finally osd.21 was brought back up. At that point osd.21 holds stale data, so the PG goes Down, client I/O hangs, and the only way to repair it is to bring the stopped OSDs back up.
The typical scenario, with replicas A (primary), B and C:
a. First kill B.
b. Write new data to A and C.
c. Kill A and C.
d. Bring B back up.
A PG goes Down because the data on the surviving OSD is too old and the other OSDs still online are not sufficient to complete recovery.
While the PG is Down it cannot serve client reads or writes; I/O is suspended.
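When a PG is stuck Down, its peering state records which OSDs it is waiting for, and querying the PG is the quickest way to see that (field names such as blocked_by vary slightly between releases). A minimal sketch, reusing PG 3.7f; marking an OSD as lost is deliberately not shown here because it discards data:
```
# Why is the PG down? Look for "blocked_by" / "down_osds_we_would_probe"
# in the recovery_state section of the output
ceph pg 3.7f query

# All PGs currently stuck inactive, with their last acting OSDs
ceph pg dump_stuck inactive
```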
-----------------------------------
Reference: Ceph PG states and fault simulation, https://blog.51cto.com/wendashuai/2491723