Source-code walkthrough reference: "Understanding the TCP/IP Implementation in Depth: IP Fragment Reassembly" (based on Linux 1.2.13)
Networking theory reference: Computer Networking: A Top-Down Approach
Linux 1.2.13 source repository: read-linux-1.2.13-net-code
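Introduction
After finishing the CS144 labs, I realized that my understanding of IP-layer fragmentation was hazy. I read the corresponding chapter of Computer Networking: A Top-Down Approach, but its treatment of fragmentation is fairly brief, so I turned to the Linux networking subsystem source code instead.
Before getting to the main topic, here are the questions I had before studying the source:
- The book says IP is an unreliable protocol. Yet the IP layer fragments datagrams and must reassemble them, and only a fully reassembled datagram can be handed to the upper layer. What happens if a fragment is lost or never arrives in time?
- If fragmented data must be completely reassembled before being passed up, does IP need sequence numbers, ACKs, retransmission, or similar mechanisms to make that reliable?
- If the IP layer had to implement reliable delivery, why would IP still be called unreliable?
- ...
With these questions in mind, I started exploring the net module of Linux 1.2.13.
This article is not guaranteed to be entirely correct. If you spot a mistake, please point it out in the comments.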
Why fragmentation is needed
Different link-layer protocols can carry network-layer packets of different sizes: some can carry large datagrams, others only small ones. For example:
- An Ethernet frame can carry up to 1500 bytes of data, while frames on some wide-area links can carry no more than 576 bytes.
The maximum amount of data a link-layer frame can carry is called the maximum transmission unit (MTU). Because each IP datagram is encapsulated in a link-layer frame on its way from one router to the next, the MTU of the link-layer protocol places a hard limit on the length of an IP datagram. Moreover, each link on the path from sender to destination may use a different link-layer protocol, each with its own MTU, so a datagram that has already been fragmented may need to be fragmented again. How is that handled?
- When a fragment reaches a link with a smaller MTU, it is split into two or more smaller IP datagrams, each of which is encapsulated in its own link-layer frame and sent over the outgoing link.
Only IPv4 routers perform this re-fragmentation. An IPv6 router never fragments; it drops the oversized packet and returns an ICMPv6 "Packet Too Big" error.
TCP and UDP both expect to receive complete, unfragmented segments from the network layer. Would it make sense to reassemble datagrams in the routers?
- Clearly not. Reassembling in routers would add considerable complexity to the protocol and hurt router performance. To keep the network core simple, the IPv4 designers put reassembly in the end systems rather than in the routers.
When a destination host receives a series of datagrams from the same source, it has to determine whether some of them are fragments of a larger original datagram. If they are, it must further determine when it has received the last fragment and how the received fragments should be pieced together to rebuild the original datagram. How is this done?
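The IPv4 designers put identification, flag, and fragment offset fields in the IP datagram header:
- Identification: fragments that carry the same identification number (together with the same source and destination addresses) belong to the same original datagram.
- Flag: marks whether this is the last fragment. The last fragment has the flag bit set to 0; all other fragments have it set to 1. Because IP is an unreliable service, one or more fragments may never reach the destination, so receiving the last fragment does not mean that all fragments have arrived.
- Fragment offset: specifies where the fragment belongs within the original IP datagram. The destination also uses the offsets to detect whether any fragment is still missing.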
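To make the three header fields concrete, here is a minimal user-space sketch of how a sender or router might plan the fragments of one datagram. The names `frag_plan` and `plan_fragments` are hypothetical and do not come from the kernel source; the only rule assumed is that every fragment except the last carries a multiple of 8 bytes, because the offset field counts 8-byte units.

```c
#include <assert.h>

/* One planned fragment. Illustrative type, not a kernel structure. */
struct frag_plan {
    int offset_units; /* value for the 13-bit offset field (8-byte units) */
    int data_len;     /* payload bytes carried by this fragment */
    int more_frags;   /* MF flag: 1 for every fragment except the last */
};

/*
 * Split `data_len` payload bytes for a link with the given MTU.
 * `ihl` is the IP header length in bytes (20 without options).
 * Returns the number of fragments written to `out` (at most `max`).
 */
int plan_fragments(int data_len, int mtu, int ihl,
                   struct frag_plan *out, int max)
{
    assert(data_len > 0 && mtu - ihl >= 8);

    /* Every fragment but the last must carry a multiple of 8 bytes. */
    int per_frag = ((mtu - ihl) / 8) * 8;
    int done = 0;
    int n = 0;

    while (done < data_len && n < max) {
        int chunk = data_len - done;
        if (chunk > per_frag)
            chunk = per_frag;
        out[n].offset_units = done / 8;
        out[n].data_len = chunk;
        out[n].more_frags = (done + chunk < data_len);
        done += chunk;
        n++;
    }
    return n;
}
```

For a 4,000-byte datagram (20-byte header plus 3,980 bytes of data) and an MTU of 1,500, this plan has three fragments carrying 1,480, 1,480 and 1,020 bytes, with offset fields 0, 185 and 370; only the last one has MF cleared.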
Does the transport layer segment data?
Whether the transport layer segments data depends on the protocol. For UDP, which is unreliable and connectionless, the answer is no. Apart from multiplexing/demultiplexing by port and some light error checking, UDP adds almost nothing to IP. An application developer who picks UDP over TCP is talking almost directly to IP.
The UDP stack wraps whatever the application hands down into a single large UDP datagram and passes it to the network layer. If the datagram exceeds the MTU of the host's link, the IP layer fragments and later reassembles it, as described in the previous section. The receiver gets the complete UDP datagram rebuilt by the IP layer, verifies the checksum, and passes the payload to the application. At no point does UDP itself split the data.
(Note: the figure that originally accompanied this paragraph did not account for the UDP header size.)
The UDP header does contain a 16-bit length field, so a whole UDP datagram, header included, can never exceed 65,535 bytes.
For TCP, the answer is yes. TCP is a reliable, connection-oriented byte-stream protocol. It achieves reliable delivery with a hybrid of Go-Back-N (GBN) and Selective Repeat (SR), flow control with a sliding window, and congestion control with a congestion window.
When an application hands TCP data to send, whether the data goes out in a single TCP segment or is split across several depends on the following factors:
- the free space left in the connection's send window
- the free space left in the peer's receive window
- the maximum segment size (MSS)
- the congestion window size
- the amount of data the application has handed down
Bytes sent this time = min(free send window, free peer receive window, MSS, congestion window, application data size)
Note that the TCP header has no payload-length field. The segment length is derived from the total length in the IP header, and that 16-bit limit is far above any typical MSS, so it rarely matters in practice.
The MSS is usually derived from the largest link-layer frame (MTU) the local sending host can transmit. Its value can be taken as MTU - IP header - TCP header, so the MSS is the maximum amount of application data in a TCP segment, not the maximum size of the whole segment including the TCP header.
The two ends normally announce their MSS to each other through the options field of the TCP header during connection setup.
So if an application hands TCP a large chunk of data, the stack may send it in several batches, that is, segment it into several smaller segments. Because the TCP stack keeps each segment within the local MTU, fragmentation at the IP layer normally does not occur. If a link with a smaller MTU shows up along the path, however, IP fragmentation and reassembly can still happen.
Another difference from UDP: as soon as TCP has received a stretch of in-order bytes and the application is waiting to read, the stack hands those bytes to the application and advances the read pointer of the receive window. This is why TCP is called a stream protocol: like a tap, whatever water is there flows out.
If a UDP sender transmits one large datagram, the UDP receiver delivers it to the application only after the entire datagram has been received. This is why UDP is called a datagram protocol.
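The MSS arithmetic and the "minimum of all limits" rule can be sketched in a few lines of C. This is a simplified model rather than what any real TCP stack does: the function names are hypothetical, header sizes are passed in explicitly (20 bytes each when no options are present), and the windows are plain byte counts.

```c
#include <assert.h>

/* MSS = MTU minus the IP header and the TCP header. */
int mss_from_mtu(int mtu, int ip_hdr_len, int tcp_hdr_len)
{
    assert(mtu > ip_hdr_len + tcp_hdr_len);
    return mtu - ip_hdr_len - tcp_hdr_len;
}

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Bytes the sender may put into the next segment under this model. */
int next_segment_size(int send_window_free, int peer_window_free,
                      int cwnd, int mss, int app_bytes)
{
    int n = min_int(send_window_free, peer_window_free);
    n = min_int(n, cwnd);
    n = min_int(n, mss);
    return min_int(n, app_bytes);
}
```

With an Ethernet MTU of 1,500 and no header options, `mss_from_mtu(1500, 20, 20)` gives 1,460, so a 10,000-byte write with ample windows goes out at most 1,460 bytes at a time.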
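IP fragment reassembly: source-code analysis
With the theory in place, this section turns to practice and checks whether IP fragment reassembly really works as described.
The net module of Linux 1.2.13 uses struct ipfrag to describe a single IP fragment and struct ipq to describe one original IP datagram that is still being reassembled: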
ip.h:
/* Describe an IP fragment. */
// Describes one IP fragment
struct ipfrag {
int offset; /* offset of fragment in IP datagram */
int end; /* last byte of data in datagram: offset of the fragment's last byte + 1, not a "last fragment" flag */
int len; /* length of this fragment */
struct sk_buff *skb; /* complete received fragment */
unsigned char *ptr; /* pointer into real fragment data */
struct ipfrag *next; /* linked list pointers: link neighbouring fragments */
struct ipfrag *prev;
};
/* Describe an entry in the "incomplete datagrams" queue. */
// Describes one original IP datagram awaiting reassembly; next/prev chain all incomplete datagrams together
struct ipq {
unsigned char *mac; /* pointer to MAC header */
struct iphdr *iph; /* pointer to IP header */
int len; /* total length of original datagram */
short ihlen; /* length of the IP header */
short maclen; /* length of the MAC header */
struct timer_list timer; /* when will this queue expire? (maximum wait for reassembly) */
struct ipfrag *fragments; /* linked list of received fragments */
struct ipq *next; /* linked list pointers: chain of incomplete datagrams */
struct ipq *prev;
struct device *dev; /* Device - for icmp replies (used to send ICMP if reassembly fails) */
};
ip.c:
ip_create
- ip_create adds a new ipq node to the existing ipq queue. The node waits for all fragments of one new IP datagram to arrive and holds the fragments that belong to that same original datagram.
/*
* Add an entry to the 'ipq' queue for a newly received IP datagram.
* We will (hopefully :-) receive all other fragments of this datagram
* in time, so we just create a queue for this datagram, in which we
* will insert the received fragments at their respective positions.
*/
// Create a queue entry used to reassemble fragments
// Parameters: the sk_buff carrying the current fragment, its IP header, and the link-layer device the frame arrived on
static struct ipq *ip_create(struct sk_buff *skb, struct iphdr *iph, struct device *dev)
{
struct ipq *qp;
int maclen;
int ihlen;
// Allocate a new node representing a fragment queue
qp = (struct ipq *) kmalloc(sizeof(struct ipq), GFP_ATOMIC);
if (qp == NULL)
{
printk("IP: create: no memory left !\n");
return(NULL);
skb->dev = qp->dev; // unreachable: this statement follows return(NULL)
}
memset(qp, 0, sizeof(struct ipq));
/*
* Allocate memory for the MAC header.
*
* FIXME: We have a maximum MAC address size limit and define
* elsewhere. We should use it here and avoid the 3 kmalloc() calls
*/
// MAC header length = address of the IP header minus the start address of the MAC header
maclen = ((unsigned long) iph) - ((unsigned long) skb->data);
qp->mac = (unsigned char *) kmalloc(maclen, GFP_ATOMIC);
if (qp->mac == NULL)
{
printk("IP: create: no memory left !\n");
kfree_s(qp, sizeof(struct ipq));
return(NULL);
}
/*
* Allocate memory for the IP header (plus 8 octets for ICMP).
*/
// IP header length comes from the ihl field; 8 extra bytes are allocated for ICMP
ihlen = (iph->ihl * sizeof(unsigned long));
qp->iph = (struct iphdr *) kmalloc(ihlen + 8, GFP_ATOMIC);
if (qp->iph == NULL)
{
printk("IP: create: no memory left !\n");
kfree_s(qp->mac, maclen);
kfree_s(qp, sizeof(struct ipq));
return(NULL);
}
/* Fill in the structure. */
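// Copy the MAC header into qp->mac
// memcpy(dst, src, n): copies the relevant data from skb into qp
memcpy(qp->mac, skb->data, maclen);
// Copy the IP header plus the first 8 bytes of the transport-layer data into qp->iph; those 8 bytes are needed when an ICMP message is sent
memcpy(qp->iph, iph, ihlen + 8);
// Total length of the original (unfragmented) datagram: unknown for now, set when the final fragment arrives
qp->len = 0;
// IP header and MAC header lengths of the current fragment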
qp->ihlen = ihlen;
qp->maclen = maclen;
qp->fragments = NULL;
qp->dev = dev;
/* Start a timer for this entry. */
// Start the timer: if the datagram is not complete in time, reassembly fails and an ICMP message is sent
qp->timer.expires = IP_FRAG_TIME; /* about 30 seconds */
qp->timer.data = (unsigned long) qp; /* pointer to queue */
qp->timer.function = ip_expire; /* expire function */
add_timer(&qp->timer);
/* Add this entry to the queue. */
qp->prev = NULL;
cli();
// Insert at the head of the reassembly queue
// ipqueue is the global head pointer of the ipq queue
qp->next = ipqueue;
// If the queue was not empty, point the old head's prev at the new node
if (qp->next != NULL)
qp->next->prev = qp;
// Update ipqueue: the new node is now the head
ipqueue = qp;
sti();
return(qp);
}
ip_find
- ip_find looks up the ipq entry that matches an IP header
/*
* Find the correct entry in the "incomplete datagrams" queue for
* this IP datagram, and return the queue entry address if found.
*/
// Find the fragment queue entry for this IP header
static struct ipq *ip_find(struct iphdr *iph)
{
struct ipq *qp;
struct ipq *qplast;
cli();
qplast = NULL;
for(qp = ipqueue; qp != NULL; qplast = qp, qp = qp->next)
{ // Compare several fields of the IP header
if (iph->id== qp->iph->id && iph->saddr == qp->iph->saddr &&
iph->daddr == qp->iph->daddr && iph->protocol == qp->iph->protocol)
{ // Found: stop the timer here; the caller (ip_defrag) re-arms it after ip_find returns
del_timer(&qp->timer); /* So it doesn't vanish on us. The timer will be reset anyway */
sti();
return(qp);
}
}
sti();
return(NULL);
}
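The lookup key used by ip_find can be isolated in a small user-space sketch. The `frag_key` struct and `same_datagram` function below are hypothetical names introduced only for illustration; the kernel compares the fields of struct iphdr directly, as shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative key: the four header fields ip_find compares. */
struct frag_key {
    uint32_t saddr;    /* source address */
    uint32_t daddr;    /* destination address */
    uint16_t id;       /* identification field */
    uint8_t  protocol; /* upper-layer protocol number */
};

/* Return 1 when both keys identify the same original datagram. */
int same_datagram(const struct frag_key *a, const struct frag_key *b)
{
    assert(a != NULL && b != NULL);
    return a->id == b->id && a->saddr == b->saddr &&
           a->daddr == b->daddr && a->protocol == b->protocol;
}
```

The identification field alone is not enough, because two different senders may well pick the same id; including both addresses and the protocol keeps their fragments in separate queues.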
ip_frag_create
- ip_frag_create creates an ipfrag structure that represents a single IP fragment
/*
* Create a new fragment entry.
*/
// Create a structure representing one IP fragment
static struct ipfrag *ip_frag_create(int offset, int end, struct sk_buff *skb, unsigned char *ptr)
{
struct ipfrag *fp;
fp = (struct ipfrag *) kmalloc(sizeof(struct ipfrag), GFP_ATOMIC);
if (fp == NULL)
{
printk("IP: frag_create: no memory left !\n");
return(NULL);
}
memset(fp, 0, sizeof(struct ipfrag));
/* Fill in the structure. */
fp->offset = offset; // offset of the fragment's first byte within the original data
fp->end = end; // offset of the last byte + 1, i.e. the offset where the next fragment starts
fp->len = end - offset; // fragment length
fp->skb = skb;
fp->ptr = ptr; // start of the fragment's data
return(fp);
}
ip_done
- ip_done checks whether all fragments have arrived
/*
* See if a fragment queue is complete.
*/
// Check whether every fragment has arrived
static int ip_done(struct ipq *qp)
{
struct ipfrag *fp;
int offset;
/* Only possible if we received the final fragment. */
// len is set when the final fragment arrives and is 0 until then, so 0 means the final fragment is still missing: not complete
if (qp->len == 0)
return(0);
// The final fragment has arrived, but fragments may arrive out of order, so check that every fragment of this datagram is present
/* Check all fragment offsets to see if they connect. */
fp = qp->fragments;
offset = 0;
// Walk the fragments; the list is sorted by ascending offset because each fragment is inserted at its proper position on arrival
while (fp != NULL)
{ /*
If this node's offset is greater than the expected offset (the previous node's end, i.e. its last byte + 1),
a fragment in between is missing: not complete
*/
if (fp->offset > offset)
return(0); /* fragment(s) missing */
offset = fp->end;
fp = fp->next;
}
/* All fragments are present. */
// All fragments are present and contiguous: reassembly can proceed
return(1);
}
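The hole check in ip_done can be reproduced outside the kernel with a stripped-down list. The sketch below uses a hypothetical `struct frag` that keeps only the fields the check needs. It also adds one explicit final comparison against the total length, which is an assumption of this sketch; ip_done itself simply returns 1 once the walk finds no hole.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct ipfrag (illustrative only). */
struct frag {
    int offset;        /* first byte of this fragment in the datagram */
    int end;           /* offset of the last byte + 1 */
    struct frag *next; /* list is kept sorted by offset */
};

/*
 * Return 1 if the sorted list covers [0, total_len) without holes.
 * total_len == 0 means the final fragment (MF == 0) has not arrived yet.
 */
int frags_complete(const struct frag *head, int total_len)
{
    int expected = 0;

    assert(total_len >= 0);
    if (total_len == 0)
        return 0;
    for (const struct frag *fp = head; fp != NULL; fp = fp->next) {
        if (fp->offset > expected)
            return 0;            /* hole before this fragment */
        expected = fp->end;
    }
    /* Extra check added in this sketch: the data must reach the end. */
    return expected >= total_len;
}
```

Because the list is sorted and overlaps are trimmed on insertion, one pass that tracks the next expected offset is enough; no per-byte bitmap is required.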
ip_glue
- ip_glue reassembles all IP fragments queued for one datagram
/*
* Build a new IP datagram from all its fragments.
*
* FIXME: We copy here because we lack an effective way of handling lists
* of bits on input. Until the new skb data handling is in I'm not going
* to touch this with a bargepole. This also causes a 4Kish limit on
* packet sizes.
*/
// Build the complete IP datagram once all fragments are present
static struct sk_buff *ip_glue(struct ipq *qp)
{
struct sk_buff *skb;
struct iphdr *iph;
struct ipfrag *fp;
unsigned char *ptr;
int count, len;
/*
* Allocate a new buffer for the datagram.
*/
// Total buffer length = MAC header length + IP header length + data length
len = qp->maclen + qp->ihlen + qp->len;
// Allocate a new skb
if ((skb = alloc_skb(len,GFP_ATOMIC)) == NULL)
{
ip_statistics.IpReasmFails++;
printk("IP: queue_glue: no memory for gluing queue 0x%X\n", (int) qp);
ip_free(qp);
return(NULL);
}
/* Fill in the basic details. */
// skb->len = IP header length + data length (the MAC header is not counted)
skb->len = (len - qp->maclen);
skb->h.raw = skb->data; // data points to the start of the newly allocated buffer
skb->free = 1;
/* Copy the original MAC and IP headers into the new buffer. */
ptr = (unsigned char *) skb->h.raw;
memcpy(ptr, ((unsigned char *) qp->mac), qp->maclen); // copy the MAC header into the new buffer
ptr += qp->maclen;
memcpy(ptr, ((unsigned char *) qp->iph), qp->ihlen); // copy the IP header into the new buffer
ptr += qp->ihlen; // now points at the start of the data area
skb->h.raw += qp->maclen;// now points at the IP header
count = 0;
/* Copy the data portions of all fragments into the new buffer. */
fp = qp->fragments;
// Copy the data portions
while(fp != NULL)
{ // If the bytes copied so far plus this fragment exceed skb->len, the data would overflow: drop the datagram
if(count+fp->len > skb->len)
{
printk("Invalid fragment list: Fragment over size.\n");
ip_free(qp);
kfree_skb(skb,FREE_WRITE);
ip_statistics.IpReasmFails++;
return NULL;
}
// Copy the fragment's data to its offset
memcpy((ptr + fp->offset), fp->ptr, fp->len);
// Bytes copied so far
count += fp->len;
fp = fp->next;
}
/* We glued together all fragments, so remove the queue entry. */
ip_free(qp);// all data copied, the fragment queue can be released
/* Done with all fragments. Fixup the new IP header. */
iph = skb->h.iph; // h.raw was set to the IP header above; h.iph is the same pointer
iph->frag_off = 0; // clear the fragmentation field
// Update total length to IP header + data
iph->tot_len = htons((iph->ihl * sizeof(unsigned long)) + count);
skb->ip_hdr = iph;
ip_statistics.IpReasmOKs++;
return(skb);
}
Reassembly in short: allocate a new buffer, copy the MAC and IP headers into it, walk the fragment queue and copy each fragment's data into place, then fix up a few header fields.
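The copy step at the heart of ip_glue can be tried in user space with plain byte buffers. This is a minimal sketch under simplifying assumptions: `struct piece` and `glue_pieces` are hypothetical names, headers are left out, and the bounds check is done per piece against the buffer size, which is stricter than the running-count check in the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Payload of one fragment (illustrative, not a kernel type). */
struct piece {
    int offset;               /* byte offset within the original payload */
    int len;                  /* number of payload bytes */
    const unsigned char *ptr; /* the payload bytes themselves */
    struct piece *next;
};

/*
 * Copy every piece into `out` (capacity `total` bytes) at its offset.
 * Returns the number of bytes copied, or -1 if a piece does not fit.
 */
int glue_pieces(const struct piece *head, unsigned char *out, int total)
{
    int count = 0;

    assert(out != NULL && total >= 0);
    for (const struct piece *p = head; p != NULL; p = p->next) {
        if (p->offset < 0 || p->len < 0 || p->offset + p->len > total)
            return -1;           /* would write past the buffer */
        memcpy(out + p->offset, p->ptr, (size_t)p->len);
        count += p->len;
    }
    return count;
}
```

Since each piece is written at its own offset, the order in which pieces are visited does not affect the result; the sorted order only matters for the completeness check that runs beforehand.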
ip_free
- ip_free releases a fragment queue
/*
* Remove an entry from the "incomplete datagrams" queue, either
* because we completed, reassembled and processed it, or because
* it timed out.
*/
// Release a fragment queue
static void ip_free(struct ipq *qp)
{
struct ipfrag *fp;
struct ipfrag *xp;
/*
* Stop the timer for this entry.
*/
// Delete the timer
del_timer(&qp->timer);
/* Remove this entry from the "incomplete datagrams" queue. */
cli();
/*
If the node has no predecessor it is the head (the list is not circular):
point ipqueue at the next node and, if that node exists, clear its prev pointer
because it is now the head.
*/
if (qp->prev == NULL)
{
ipqueue = qp->next;
if (ipqueue != NULL)
ipqueue->prev = NULL;
}
else
{
/*
The node is not the head, but it may be the tail:
link the previous node's next to the node after this one,
and if that following node exists, point its prev at
the previous node
*/
qp->prev->next = qp->next;
if (qp->next != NULL)
qp->next->prev = qp->prev;
}
/* Release all fragment data. */
fp = qp->fragments;
// Free every fragment node
while (fp != NULL)
{
xp = fp->next;
IS_SKB(fp->skb);
kfree_skb(fp->skb,FREE_READ);
kfree_s(fp, sizeof(struct ipfrag));
fp = xp;
}
// Free the MAC header and the IP header copy; the 8 extra bytes held the first 8 bytes of the transport-layer data for ICMP
/* Release the MAC header. */
kfree_s(qp->mac, qp->maclen);
/* Release the IP header. */
kfree_s(qp->iph, qp->ihlen + 8);
/* Finally, release the queue descriptor itself. */
kfree_s(qp, sizeof(struct ipq));
sti();
}
ip_expire
- ip_expire handles a reassembly timeout
/*
* Oops- a fragment queue timed out. Kill it and send an ICMP reply.
*/
// Reassembly timeout handler
static void ip_expire(unsigned long arg)
{
struct ipq *qp;
qp = (struct ipq *)arg;
/*
* Send an ICMP "Fragment Reassembly Timeout" message.
*/
ip_statistics.IpReasmTimeout++;
ip_statistics.IpReasmFails++;
/* This if is always true... shrug */
// Send an ICMP time-exceeded message
if(qp->fragments!=NULL)
icmp_send(qp->fragments->skb,ICMP_TIME_EXCEEDED,
ICMP_EXC_FRAGTIME, 0, qp->dev);
/*
* Nuke the fragment queue.
*/
// Release the fragment queue
ip_free(qp);
}
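The timer that fires ip_expire behaves as an idle timeout: ip_create arms it, and ip_defrag re-arms it every time another fragment of the same datagram arrives. The sketch below models only that policy with a plain deadline value. `reasm_timer`, `timer_restart`, `timer_expired` and the 30-second constant are stand-ins chosen for illustration (the source comment on IP_FRAG_TIME says "about 30 seconds"); the kernel uses struct timer_list with add_timer and del_timer instead.

```c
#include <assert.h>
#include <stddef.h>

#define FRAG_IDLE_TIMEOUT 30 /* seconds; stands in for IP_FRAG_TIME */

/* Illustrative model of one queue's timer. */
struct reasm_timer {
    long deadline; /* absolute time at which the queue expires */
};

/* Called when the queue is created and again for every new fragment. */
void timer_restart(struct reasm_timer *t, long now)
{
    assert(t != NULL);
    t->deadline = now + FRAG_IDLE_TIMEOUT;
}

/* Return 1 once no fragment has arrived for FRAG_IDLE_TIMEOUT seconds. */
int timer_expired(const struct reasm_timer *t, long now)
{
    return now >= t->deadline;
}
```

Under this policy a slow but steady trickle of fragments keeps a queue alive indefinitely, since what is measured is the gap between fragments and not the total time since the first one.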
ip_defrag
- ip_defrag receives an IP datagram and determines whether it is a fragment of a larger datagram. If so, it resolves any overlap, inserts the fragment at the right position in its ipq entry, and then checks whether all fragments of the datagram have arrived. If they have, it reassembles them and returns the rebuilt IP datagram.
/*
* Process an incoming IP datagram fragment.
*/
// Process a fragment
static struct sk_buff *ip_defrag(struct iphdr *iph, struct sk_buff *skb, struct device *dev)
{
struct ipfrag *prev, *next;
struct ipfrag *tfp;
struct ipq *qp;
struct sk_buff *skb2;
unsigned char *ptr;
int flags, offset;
int i, ihl, end;
ip_statistics.IpReasmReqds++;
/* Find the entry of this IP datagram in the "incomplete datagrams" queue. */
qp = ip_find(iph); // look for an existing fragment queue for this IP header
/* Is this a non-fragmented datagram? */
offset = ntohs(iph->frag_off);
flags = offset & ~IP_OFFSET; // extract the three flag bits
offset &= IP_OFFSET; // extract the fragment offset
// MF == 0 and offset == 0: this is not a fragment at all, so return it unchanged (after discarding any stale queue entry)
if (((flags & IP_MF) == 0) && (offset == 0))
{
if (qp != NULL)
ip_free(qp); /* Huh? How could this exist?? */
return(skb);
}
// Multiply by 8 to get the real byte offset
offset <<= 3; /* offset is in 8-byte chunks */
/*
* If the queue already existed, keep restarting its timer as long
* as we still are receiving fragments. Otherwise, create a fresh
* queue entry.
*/
/*
If a queue already exists, earlier fragments have arrived, so restart the timer. The timeout therefore means
"no fragment arrived for IP_FRAG_TIME"; total elapsed time is not what is measured.
*/
if (qp != NULL)
{
del_timer(&qp->timer);
qp->timer.expires = IP_FRAG_TIME; /* about 30 seconds */
qp->timer.data = (unsigned long) qp; /* pointer to queue */
qp->timer.function = ip_expire; /* expire function */
add_timer(&qp->timer);
}
else
{
/*
* If we failed to create it, then discard the frame
*/
// Create a new node to manage this datagram's fragments
if ((qp = ip_create(skb, iph, dev)) == NULL)
{
skb->sk = NULL;
kfree_skb(skb, FREE_READ);
ip_statistics.IpReasmFails++;
return NULL;
}
}
/*
* Determine the position of this fragment.
*/
// IP header length
ihl = (iph->ihl * sizeof(unsigned long));
// end = offset + data length, i.e. the offset of the last byte + 1
end = offset + ntohs(iph->tot_len) - ihl;
/*
* Point into the IP datagram 'data' part.
*/
// skb->data points at the start of the frame (the MAC header); ptr points at the data part of the IP datagram
ptr = skb->data + dev->hard_header_len + ihl;
/*
* Is this the final fragment?
*/
// Final fragment? If so, the original datagram's data length is end (offset of its last byte + 1, since offsets start at 0)
if ((flags & IP_MF) == 0)
qp->len = end;
/*
* Find out which fragments are in front and at the back of us
* in the chain of fragments so far. We must know where to put
* this fragment, right?
*/
prev = NULL;
// Find the insertion point that keeps the list sorted
for(next = qp->fragments; next != NULL; next = next->next)
{ // stop at the first node whose offset is greater than ours
if (next->offset > offset)
break; /* bingo! */
prev = next;
}
/*
* We found where to put this one.
* Check for overlap with preceding fragment, and, if needed,
* align things so that any overlaps are eliminated.
*/
// Resolve overlaps
/*
Overlap with the preceding node: the search above guarantees offset >= prev->offset,
so it is enough to compare our offset with prev->end
*/
if (prev != NULL && offset < prev->end)
{
// Overlap: compute its size, drop the overlapping head of the current fragment and advance offset and ptr; the fully-overlapped case is not handled
i = prev->end - offset;
offset += i; /* ptr into datagram */
ptr += i; /* ptr into fragment data */
}
/*
* Look for overlap with succeeding segments.
* If we can merge fragments, do it.
*/
// Resolve overlaps with the following nodes
for(; next != NULL; next = tfp)
{
tfp = next->next;
// No overlap with this node or any node after it
if (next->offset >= end)
break; /* no overlaps at all */
// Otherwise there is overlap: compute its size
i = end - next->offset; /* overlap is 'i' bytes */
// Shrink the overlapping queued node from the front
next->len -= i; /* so reduce size of */
next->offset += i; /* next fragment */
next->ptr += i;
/*
* If we get a frag size of <= 0, remove it and the packet
* that it goes with.
*/
// Fully overlapped: remove the old node
if (next->len <= 0)
{
if (next->prev != NULL)
next->prev->next = next->next;// the old node is not the head
else
qp->fragments = next->next;// the old node is the head
// Likely should test tfp != NULL: tfp (== next->next) may be NULL here, and dereferencing it would crash
if (tfp->next != NULL)
next->next->prev = next->prev;
kfree_skb(next->skb,FREE_READ);
kfree_s(next, sizeof(struct ipfrag));
}
}
/*
* Insert this fragment in the chain of fragments.
*/
tfp = NULL;
// Create a fragment node
tfp = ip_frag_create(offset, end, skb, ptr);
/*
* No memory to save the fragment - so throw the lot
*/
if (!tfp)
{
skb->sk = NULL;
kfree_skb(skb, FREE_READ);
return NULL;
}
// Insert into the fragment list
tfp->prev = prev;
tfp->next = next;
if (prev != NULL)
prev->next = tfp;
else
qp->fragments = tfp;
if (next != NULL)
next->prev = tfp;
/*
* OK, so we inserted this new fragment into the chain.
* Check if we now have a full IP datagram which we can
* bump up to the IP layer...
*/
// If all fragments have arrived, reassemble
if (ip_done(qp))
{
skb2 = ip_glue(qp); /* glue together the fragments */
return(skb2);
}
return(NULL);
}
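The two overlap rules in ip_defrag are easier to follow when reduced to byte ranges. The sketch below is a simplified model, not the kernel code: `struct span`, `trim_against_prev` and `trim_next` are hypothetical names, and data pointers are left out. As in the kernel, a new fragment that lies entirely inside its predecessor is not treated specially.

```c
#include <assert.h>

/* Byte range [offset, end) of a fragment inside the original datagram. */
struct span {
    int offset;
    int end;
};

/*
 * New fragment vs. the queued fragment before it: skip the bytes that
 * `prev` already covers. Returns how many bytes were skipped.
 */
int trim_against_prev(const struct span *prev, struct span *cur)
{
    assert(prev->offset <= cur->offset); /* guaranteed by the sorted search */
    if (cur->offset >= prev->end)
        return 0;                        /* no overlap */
    int skip = prev->end - cur->offset;
    cur->offset += skip;
    return skip;
}

/*
 * New fragment (ending at new_end) vs. a queued fragment after it: cut
 * the overlapped head off `next`. Returns the bytes left in `next`; a
 * result <= 0 means `next` is fully covered and should be unlinked.
 */
int trim_next(struct span *next, int new_end)
{
    if (next->offset < new_end)
        next->offset = new_end;
    return next->end - next->offset;
}
```

The asymmetry is deliberate in the original: towards the front, the newly arrived fragment gives up its overlapping bytes, while towards the back the already queued fragments are the ones that get shortened or dropped.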
ip_rcv
- ip_rcv handles the reception of an IP datagram
/*
* This function receives all incoming IP datagrams.
*/
int ip_rcv(struct sk_buff *skb, struct device *dev, struct packet_type *pt)
{
struct iphdr *iph = skb->h.iph;
struct sock *raw_sk=NULL;
unsigned char hash;
unsigned char flag = 0;
unsigned char opts_p = 0; /* Set iff the packet has options. */
struct inet_protocol *ipprot;
static struct options opt; /* since we don't use these yet, and they
take up stack space. */
int brd=IS_MYADDR;
int is_frag=0;
#ifdef CONFIG_IP_FIREWALL
int err;
#endif
ip_statistics.IpInReceives++;
/*
* Tag the ip header of this packet so we can find it
*/
skb->ip_hdr = iph;
/*
* Is the datagram acceptable?
*
* 1. Length at least the size of an ip header
* 2. Version of 4
* 3. Checksums correctly. [Speed optimisation for later, skip loopback checksums]
* (4. We ought to check for IP multicast addresses and undefined types.. does this matter ?)
*/
// Sanity checks
if (skb->len<sizeof(struct iphdr) || iph->ihl<5 || iph->version != 4 ||
skb->len<ntohs(iph->tot_len) || ip_fast_csum((unsigned char *)iph, iph->ihl) !=0)
{
ip_statistics.IpInHdrErrors++;
kfree_skb(skb, FREE_WRITE);
return(0);
}
/*
* See if the firewall wants to dispose of the packet.
*/
// If a firewall is configured, check the packet against its rules first; drop it if it does not pass
#ifdef CONFIG_IP_FIREWALL
if ((err=ip_fw_chk(iph,dev,ip_fw_blk_chain,ip_fw_blk_policy, 0))!=1)
{
if(err==-1)
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0, dev);
kfree_skb(skb, FREE_WRITE);
return 0;
}
#endif
/*
* Our transport medium may have padded the buffer out. Now we know it
* is IP we can trim to the true length of the frame.
*/
skb->len=ntohs(iph->tot_len);
/*
* Next analyse the packet for options. Studies show under one packet in
* a thousand have options....
*/
// An IP header longer than 20 bytes carries options
if (iph->ihl != 5)
{ /* Fast path for the typical optionless IP packet. */
memset((char *) &opt, 0, sizeof(opt));
if (do_options(iph, &opt) != 0)
return 0;
opts_p = 1;
}
/*
* Remember if the frame is fragmented.
*/
// Non-zero: a flag or an offset is set, so look closer (DF alone does not make the packet a fragment)
if(iph->frag_off)
{
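// MF set (more fragments follow): set bit 0 of is_frag. 0x0020 is the MF bit (0x2000) as it appears in the network-order field on a little-endian host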
if (iph->frag_off & 0x0020)
is_frag|=1;
/*
* Last fragment ?
*/
// Non-zero offset: not the first fragment
if (ntohs(iph->frag_off) & 0x1fff)
is_frag|=2;
}
/*
* Do any IP forwarding required. chk_addr() is expensive -- avoid it someday.
*
* This is inefficient. While finding out if it is for us we could also compute
* the routing table entry. This is where the great unified cache theory comes
* in as and when someone implements it
*
* For most hosts over 99% of packets match the first conditional
* and don't go via ip_chk_addr. Note: brd is set to IS_MYADDR at
* function entry.
*/
if ( iph->daddr != skb->dev->pa_addr && (brd = ip_chk_addr(iph->daddr)) == 0)
{
/*
* Don't forward multicast or broadcast frames.
*/
if(skb->pkt_type!=PACKET_HOST || brd==IS_BROADCAST)
{
kfree_skb(skb,FREE_WRITE);
return 0;
}
/*
* The packet is for another target. Forward the frame
*/
#ifdef CONFIG_IP_FORWARD
ip_forward(skb, dev, is_frag);
#else
/* printk("Machine %lx tried to use us as a forwarder to %lx but we have forwarding disabled!\n",
iph->saddr,iph->daddr);*/
ip_statistics.IpInAddrErrors++;
#endif
/*
* The forwarder is inefficient and copies the packet. We
* free the original now.
*/
kfree_skb(skb, FREE_WRITE);
return(0);
}
#ifdef CONFIG_IP_MULTICAST
if(brd==IS_MULTICAST && iph->daddr!=IGMP_ALL_HOSTS && !(dev->flags&IFF_LOOPBACK))
{
/*
* Check it is for one of our groups
*/
struct ip_mc_list *ip_mc=dev->ip_mc_list;
do
{
if(ip_mc==NULL)
{
kfree_skb(skb, FREE_WRITE);
return 0;
}
if(ip_mc->multiaddr==iph->daddr)
break;
ip_mc=ip_mc->next;
}
while(1);
}
#endif
/*
* Account for the packet
*/
#ifdef CONFIG_IP_ACCT
ip_acct_cnt(iph,dev, ip_acct_chain);
#endif
/*
* Reassemble IP fragments.
*/
// MF set (1), non-zero offset (2), or both (3): go through reassembly
if(is_frag)
{
/* Defragment. Obtain the complete packet if there is one */
skb=ip_defrag(iph,skb,dev);
if(skb==NULL)
return 0;
skb->dev = dev;
iph=skb->h.iph;
}
/*
* Point into the IP datagram, just past the header.
*/
skb->ip_hdr = iph;
// Point at the upper-layer header before passing the packet up
skb->h.raw += iph->ihl*4;
/*
* Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies.
*/
hash = iph->protocol & (SOCK_ARRAY_SIZE-1);
/* If there maybe a raw socket we must check - if not we don't care less */
if((raw_sk=raw_prot.sock_array[hash])!=NULL)
{
struct sock *sknext=NULL;
struct sk_buff *skb1;
// Find a matching socket
raw_sk=get_sock_raw(raw_sk, hash, iph->saddr, iph->daddr);
if(raw_sk) /* Any raw sockets */
{
do
{
/* Find the next */
// Search from the node after raw_sk for the next matching socket; earlier nodes cannot match any more
sknext=get_sock_raw(raw_sk->next, hash, iph->saddr, iph->daddr);
// Clone the skb for the matching socket
if(sknext)
skb1=skb_clone(skb, GFP_ATOMIC);
else
break; /* One pending raw socket left */
if(skb1)
raw_rcv(raw_sk, skb1, dev, iph->saddr,iph->daddr);
// Remember the most recent matching socket
raw_sk=sknext;
}
while(raw_sk!=NULL);
/* Here either raw_sk is the last raw socket, or NULL if none */
/* We deliver to the last raw socket AFTER the protocol checks as it avoids a surplus copy */
}
}
/*
* skb->h.raw now points at the protocol beyond the IP header.
*/
// Hand the packet to the protocol above IP
hash = iph->protocol & (MAX_INET_PROTOS -1);
// Walk one chain of the protocol hash table
for (ipprot = (struct inet_protocol *)inet_protos[hash];ipprot != NULL;ipprot=(struct inet_protocol *)ipprot->next)
{
struct sk_buff *skb2;
if (ipprot->protocol != iph->protocol)
continue;
/*
* See if we need to make a copy of it. This will
* only be set if more than one protocol wants it.
* and then not for the last one. If there is a pending
* raw delivery wait for that
*/
/*
Decide whether the skb must be cloned. The copy field is always 0 in this version; a clone is only needed
when more than one consumer wants the packet. raw_sk was set by the code above.
*/
if (ipprot->copy || raw_sk)
{
skb2 = skb_clone(skb, GFP_ATOMIC);
if(skb2==NULL)
continue;
}
else
{
skb2 = skb;
}
// Found an upper-layer protocol that handles this packet
flag = 1;
/*
* Pass on the datagram to each protocol that wants it,
* based on the datagram protocol. We should really
* check the protocol handler's return values here...
*/
ipprot->handler(skb2, dev, opts_p ? &opt : 0, iph->daddr,
(ntohs(iph->tot_len) - (iph->ihl * 4)),
iph->saddr, 0, ipprot);
}
/*
* All protocols checked.
* If this packet was a broadcast, we may *not* reply to it, since that
* causes (proven, grin) ARP storms and a leakage of memory (i.e. all
* ICMP reply messages get queued up for transmission...)
*/
if(raw_sk!=NULL) /* Shift to last raw user */
raw_rcv(raw_sk, skb, dev, iph->saddr, iph->daddr);
// No upper-layer protocol handles this packet: report an error
else if (!flag) /* Free and report errors */
{
// Neither broadcast nor multicast: send an ICMP destination-unreachable message
if (brd != IS_BROADCAST && brd!=IS_MULTICAST)
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0, dev);
kfree_skb(skb, FREE_WRITE);
}
return(0);
}
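The way ip_rcv decides that a packet needs reassembly comes down to two bit tests on the frag_off field. The sketch below repeats those tests on a value that has already been converted to host byte order, which is an assumption made to keep the example portable; the kernel code above tests the MF bit on the raw network-order field with the constant 0x0020. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_MF      0x2000u /* "more fragments" flag, host byte order */
#define FRAG_OFFMASK 0x1FFFu /* 13-bit fragment offset, in 8-byte units */

/* Byte offset encoded in a host-order frag_off value. */
int frag_byte_offset(uint16_t frag_off_host)
{
    int off = (int)(frag_off_host & FRAG_OFFMASK) * 8;
    assert(off <= 65528); /* 8191 * 8 is the largest possible offset */
    return off;
}

/* Same encoding as is_frag in ip_rcv: bit 0 = MF set, bit 1 = offset != 0. */
int classify_fragment(uint16_t frag_off_host)
{
    int is_frag = 0;

    if (frag_off_host & FRAG_MF)
        is_frag |= 1;
    if (frag_off_host & FRAG_OFFMASK)
        is_frag |= 2;
    return is_frag;
}
```

A result of 1 is a first fragment, 3 a middle fragment, 2 the final fragment, and 0 an unfragmented datagram, including one that only has the "don't fragment" bit (0x4000) set.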
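Summary
After both the theory and the source code, the answer to the questions posed at the start should be clear:
- IP is an unreliable protocol. Although the IP layer fragments and reassembles datagrams, it uses no ACKs, retransmission, or similar mechanisms to make that process reliable. It relies solely on a timer to decide whether reassembly has timed out, and if it has, it answers with an ICMP reassembly-timeout error message.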