
LLaMA-7B Fine-Tuning Notes

This article documents the process of fine-tuning LLaMA-7B. If anything is wrong or incomplete, corrections and feedback are welcome.

Alpaca (https://github.com/tatsu-lab/stanford_alpaca) fine-tunes the 7-billion-parameter LLaMA-7B on 52k instruction examples (https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json). Training takes about 3 hours on eight 80GB A100s, and the resulting model's conversational quality is comparable to GPT-3.5.

Instruction-tuning data source: self-instruct

The 52k examples in alpaca_data.json form a list of dictionaries, each with the following fields:

  • instruction: each of the 52k instructions is unique;
  • input: may be empty (the instruction alone serves as context) or may contain text related to the instruction;
  • output: the corresponding answer, generated by the GPT-3.5 API.

For example, in the Chinese version of the 52k instructions (https://github.com/hikariming/alpaca_chinese_dataset):

[
  {
    "instruction": "\"給出三個(gè)保持健康的小貼士。\"",
    "input": "",
    "output": "1. 飲食要均衡,確保攝入足夠的水果和蔬菜。\n2. 經(jīng)常鍛煉,保持身體活躍和強(qiáng)壯。\n3. 要保證充足的睡眠,并保持一個(gè)穩(wěn)定的睡眠時(shí)間表。"
  },
  {
    "instruction": "\"三原色是什么?\"",
    "input": "",
    "output": "\"三原色是紅色、藍(lán)色和黃色。\""
  },
  ...
  {
    "instruction": "將以下內(nèi)容分類為動(dòng)物、植物和礦物。",
    "input": "\"橡樹(shù),銅礦石,大象\"",
    "output": "\"橡樹(shù):植物\n銅礦石:礦物\n大象:動(dòng)物\""
  },  
]  
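A record in this format can be loaded and sanity-checked with a few lines of Python. This is a minimal sketch: the three field names come from alpaca_data.json, but the validation rules and the sample file name below are my own choices for illustration.

```python
import json

def load_alpaca_records(path):
    """Load an Alpaca-style JSON file and sanity-check each record."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # the file is a list of dicts
    for rec in records:
        # every record carries the three fields; only "input" may be empty
        assert {"instruction", "input", "output"} <= rec.keys()
        assert rec["instruction"], "instruction must be non-empty"
    return records

# A tiny inline example instead of the real 52k file:
sample = [{"instruction": "What are the three primary colors?",
           "input": "",
           "output": "Red, blue, and yellow."}]
with open("sample.json", "w", encoding="utf-8") as f:
    json.dump(sample, f)

print(len(load_alpaca_records("sample.json")))  # → 1
```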

The 52k examples were generated through the OpenAI API using the self-instruct method, at a cost of about $500. Self-instruct is a method for aligning a pretrained language model with instructions (https://github.com/yizhongw/self-instruct):
It consists of four steps:

  • Step 1: Generate new instructions with the model, starting from 175 human-written seed tasks, each given as an (instruction, input, output) or (instruction, output) tuple;
  • Step 2: Classify each generated instruction: does it describe a classification task?
  • Step 3: Generate instances depending on the result of Step 2:
    for classification tasks, the model produces the class label first and then the input (output-first);
    for non-classification tasks, the model produces the input first and then the output (input-first);
  • Step 4: Filter and post-process the generated data, then add what survives to the seed pool.

These four steps are repeated until the seed pool holds enough data (a target is usually set in advance, e.g. 52,000), at which point generation stops.
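The four-step loop can be sketched as follows. This is schematic only: the four callbacks are hypothetical stand-ins for the OpenAI API calls made by the real self-instruct scripts.

```python
def self_instruct_loop(seed_pool, target_size, gen_instruction,
                       is_classification, gen_instance, keep):
    """Grow seed_pool until it holds target_size tasks.

    The four callbacks stand in for the model/API calls:
      gen_instruction(pool)  -> new instruction string   (Step 1)
      is_classification(ins) -> bool                     (Step 2)
      gen_instance(ins, clf) -> (input, output) pair     (Step 3)
      keep(task)             -> bool, the Step 4 filter  (Step 4)
    """
    pool = list(seed_pool)
    while len(pool) < target_size:
        ins = gen_instruction(pool)                 # Step 1
        clf = is_classification(ins)                # Step 2
        inp, out = gen_instance(ins, clf)           # Step 3: output-first
        task = {"instruction": ins, "input": inp,   # for clf, input-first
                "output": out}                      # otherwise
        if keep(task):                              # Step 4
            pool.append(task)
    return pool

# Toy run with deterministic stand-ins instead of real API calls:
pool = self_instruct_loop(
    seed_pool=[{"instruction": "seed", "input": "", "output": "ok"}],
    target_size=5,
    gen_instruction=lambda p: f"task_{len(p)}",
    is_classification=lambda ins: False,
    gen_instance=lambda ins, clf: ("", "answer"),
    keep=lambda task: True,
)
print(len(pool))  # → 5
```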

For example, in the project https://github.com/yizhongw/self-instruct, run in order:

# 1. Generate instructions from the seed tasks
./scripts/generate_instructions.sh

# 2. Identify whether the instruction represents a classification task or not
./scripts/is_clf_or_not.sh

# 3. Generate instances for each instruction
./scripts/generate_instances.sh

# 4. Filtering, processing, and reformatting
./scripts/prepare_for_finetuning.sh

The 175 human-written seed tasks live in https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl:

{"id": "seed_task_0", 
 "name": "breakfast_suggestion", 
 "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?", 
 "instances": 
 	[{
 	  "input": "", 
 	  "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup watter, totalling about 550 calories. The 4 strips of bacon contains about 200 calories."
 	  }], 
 "is_classification": false}
...
{"id": "seed_task_174", 
 "name": "fact_checking", 
 "instruction": "Fact checking - tell me if the statement is true, false, or unknown, based on your knowledge and common sense.", 
 "instances": 
 [{
 	"input": "Philadelphia is among the top 10 safest cities in the US.", 
 	"output": "false"
 	}], 
 "is_classification": true}

Run:

python self_instruct/bootstrap_instructions.py --batch_dir "your output directory, e.g. data/gpt3.5" --num_instructions_to_generate 100 --seed_tasks_path data/seed_tasks.jsonl --engine "davinci" --api_key "your OpenAI API key"

The command above generates 100 instructions (keeping the API cost low) and writes them to data/gpt3.5/machine_generated_instructions.jsonl. These are task descriptions generated via the OpenAI API that are only weakly related to the seed tasks, since instructions too similar to existing ones add little value for fine-tuning.
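That "weak relatedness" requirement is enforced in self-instruct with a ROUGE-L check: a candidate instruction is dropped if its longest-common-subsequence overlap with any existing instruction is too high. The 0.7 threshold below follows the self-instruct paper; the helper functions themselves are a simplified sketch, not the repository's implementation.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l(a, b):
    """F-measure-style ROUGE-L over whitespace tokens (simplified)."""
    ta, tb = a.split(), b.split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_len(ta, tb)
    p, r = lcs / len(ta), lcs / len(tb)
    return 2 * p * r / (p + r) if p + r else 0.0

def is_novel(candidate, pool, threshold=0.7):
    """Keep a candidate only if it overlaps no pooled instruction too much."""
    return all(rouge_l(candidate, seen) < threshold for seen in pool)

pool = ["Give three tips for staying healthy."]
print(is_novel("Give three tips for staying healthy.", pool))  # → False
print(is_novel("Summarize the following paragraph.", pool))    # → True
```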

Next, determine whether each instruction is a classification task:

python self_instruct/identify_clf_or_not.py --batch_dir data/gpt3.5 --engine "davinci" --request_batch_size 5 --api_key "your OpenAI API key"

The results are written to data/gpt3.5/is_clf_or_not_davinci_template_1.jsonl. Then generate instances based on the Step 2 results:

python self_instruct/generate_instances.py --batch_dir data/gpt3.5 --input_file machine_generated_instructions.jsonl --output_file machine_generated_instances.jsonl --max_instances_to_gen 5 --engine "davinci" --request_batch_size 5 --api_key "your OpenAI API key"

The results are written to data/gpt3.5/machine_generated_instances.jsonl. Finally, filter and post-process:

python self_instruct/prepare_for_finetuning.py --instance_files data/gpt3.5/machine_generated_instances.jsonl --classification_type_files data/gpt3.5/is_clf_or_not_davinci_template_1.jsonl --output_dir data/gpt3.5/finetuning_data --include_seed_tasks --seed_tasks_path data/seed_tasks.jsonl

This produces two data files, both under data/gpt3.5/finetuning_data:

  • all_generated_instances.jsonl contains instruction, input, output triples: the format used to fine-tune LLaMA-7B.
  • gpt3_finetuning_data_xxx.jsonl contains prompt, completion pairs: the format used to fine-tune GPT-3.
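The two formats differ only in how the fields are stitched together. A sketch of the conversion follows; the prompt wording mirrors Stanford Alpaca's published template, but treat the exact text as an assumption rather than a guarantee of what prepare_for_finetuning.py emits.

```python
def to_prompt_completion(rec):
    """Turn an {instruction, input, output} record into {prompt, completion}."""
    if rec["input"]:
        # template for records that carry additional input context
        prompt = (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n### Response:\n"
        )
    else:
        # template for records with an empty "input" field
        prompt = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{rec['instruction']}\n\n### Response:\n"
        )
    return {"prompt": prompt, "completion": rec["output"]}

pair = to_prompt_completion(
    {"instruction": "What are the three primary colors?",
     "input": "",
     "output": "Red, blue, and yellow."})
print(pair["prompt"].endswith("### Response:\n"))  # → True
```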

Alpaca-LoRA

LoRA lowers the cost of fine-tuning an LLM. In a neural network, parameters are stored as matrices, and a pretrained model's weight matrices already encode a great deal of useful information; adapting the model to a specific task means adjusting them. LoRA's core idea is to make that adjustment with a low-rank update: a low-rank matrix can be approximated by the product of two much smaller matrices (LoRA: Low-Rank Adaptation of Large Language Models).
LoRA proceeds as follows:

  • 1. Choose target layers: pick the layers of the pretrained network to apply LoRA to, typically ones tied to the task, such as the Q and K projection matrices in self-attention;
  • 2. Initialize the projection matrices: create two small matrices A and B for each target layer;
    A is the down-projection, usually initialized from a random Gaussian (DeepSpeed-Chat's LoRA implementation uses a zero matrix as a placeholder instead);
    B is the up-projection, initialized to zero;
  • 3. Transform the parameters: the layer's original weight matrix W becomes W' = W + AB, where W' is the adapted matrix;
  • 4. Fine-tune: use W' in place of W while training on the task-specific data;
  • 5. Gradient updates: during fine-tuning, compute the gradients of the loss with respect to A and B and update them with an optimizer such as Adam or SGD. The original matrix W stays fixed: the base LLM is frozen and only A and B are trained;
  • 6. Repeat steps 3–5 until a preset number of epochs is reached or the model converges.
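Steps 2–5 amount to very little code. Below is a NumPy sketch of a LoRA-adapted linear layer, using toy dimensions and the W' = W + AB convention from the list above; real implementations such as PEFT additionally scale the update by alpha/r, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2            # layer dims and the low rank r << min(d, k)

W = rng.normal(size=(d, k))               # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d, r))   # down-projection, Gaussian init
B = np.zeros((r, k))                      # up-projection, zero init -> AB = 0

def forward(x):
    # W stays frozen; only A and B receive gradient updates
    return x @ (W + A @ B)

x = rng.normal(size=(1, d))
# Because B starts at zero, the adapted layer matches the base model exactly:
assert np.allclose(forward(x), x @ W)

# One SGD-style step touches only A and B (gradients here are placeholders):
grad_A = rng.normal(size=A.shape)
grad_B = rng.normal(size=B.shape)
A -= 1e-2 * grad_A
B -= 1e-2 * grad_B

print("trainable params:", A.size + B.size, "vs frozen:", W.size)  # → 32 vs 64
```

With realistic dimensions (e.g. d = k = 4096, r = 8) the trainable fraction shrinks to well under 1% of the layer, which is why the saved LoRA adapters later in this article are only 67MB.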

HuggingFace has packaged LoRA into PEFT (Parameter-Efficient Fine-Tuning). The PEFT library adapts pretrained language models to downstream tasks without fine-tuning all of their parameters: only a small number of parameters are updated, which greatly reduces compute and storage costs.


History
Alpaca popularized self-instruct and inspired later projects to generate data by prompting the GPT API, e.g. BELLE, ChatLLaMA, and ColossalChat, which addressed the data-scaling problem. Newer LLMs in turn fine-tune on Alpaca-generated data: ChatDoctor, for instance, uses Alpaca's data for fine-tuning, and others have fine-tuned ChatGLM on BELLE data.


Fine-tuning LLaMA-7B

Download the Alpaca-LoRA project and install its dependencies:

$ git clone https://github.com/tloen/alpaca-lora.git
$ cd alpaca-lora
$ pip install -r requirements.txt

Download the pretrained weights, plus the fine-tuning data as further cleaned by Stanford (the original 52k examples contained some problematic entries):

$ git clone https://huggingface.co/decapoda-research/llama-7b-hf
$ git clone https://huggingface.co/datasets/yahma/alpaca-cleaned

The pretrained model consists of 33 bin files of about 405MB each, roughly 14GB in total.

In alpaca-lora-main/finetune.py, set the batch size to 4 (micro_batch_size: int = 4) to fit a single 16GB GPU (about 9GB of VRAM in use). Fine-tuning takes a long time, roughly 60 hours, so create finetune.sh and run it in the background:

nohup python -u finetune.py \
	--base_model '/data/temp/my-alpaca-lora/llama-7b-hf' \
	--data_path '/students/julyedu_636353/alpaca-lora-main/alpaca-cleaned' \
	--output_dir '/data/temp/my-alpaca-lora' \
	>> log.out 2>&1 & # run in the background, logging to log.out

Alternatively, download ready-made LoRA weights (67MB):

git clone https://huggingface.co/tloen/alpaca-lora-7b

Or fetch LoRA weights fine-tuned on GPT-4-generated instruction data (base model LLaMA-7B, Alpaca-style fine-tuning, with LoRA as the low-cost strategy). Because LoRA weights sit on top of the base model, they are called adapter weights; the GPT-4 variant should likewise be about 67MB:

git clone https://huggingface.co/chansung/gpt4-alpaca-lora-7b

Use alpaca-lora-main/generate.py for inference; it builds a quick visual interface via import gradio as gr. Create inference.sh; inference uses about 8GB of VRAM:

python generate.py \
    --load_8bit \
    --base_model '/data/temp/my-alpaca-lora/llama-7b-hf' \
    --lora_weights '/home/user/alpaca-lora-main/gpt4-alpaca-lora-7b'

Generating an answer to a single question on one GPU is still slow, about one minute. In one example (the prompt appears only as a screenshot in the original post), the generated answer was:
Based on the MRI scan of the patient’s brain, it is possible that the patient may have Alzheimer’s disease. However, it is important to note that the presence of a “false shadow” in the MRI scan caused by the patient’s physical activity does not necessarily mean that the patient has Alzheimer’s disease. There are several methods that can be used to reduce or eliminate the “false shadow” in the MRI scan, such as:
Asking the patient to remain still during the MRI scan.
Asking the patient to wear earplugs.

A second Q&A attempt (the prompt again appears only as a screenshot in the original post) produced:
The expression of Alzheimer’s disease is a decline in memory.
The doctor showed a scenario in which there was a dog and a person playing the violin in a garden. When the patient was asked to recall the scene, the patient did not mention the dog, which could indicate that the patient may have Alzheimer’s disease.
However, it is important to note that this is only one scenario and does not necessarily mean that the patient has Alzheimer’s disease. It is recommended that the patient be evaluated by a medical professional to confirm the diagnosis and receive appropriate treatment and care.

The model can also be queried with an instruction alone (prompt again shown only as a screenshot in the original post); the generated answer was:
Alzheimer’s disease is a progressive neurodegenerative disorder that affects memory, thinking, and behavior. It is the most common form of dementia, accounting for 60 to 80 percent of cases. The exact cause of Alzheimer’s is unknown, but it is believed to be the result of a combination of genetic, environmental, and lifestyle factors. There is no cure for Alzheimer’s, but medications and lifestyle changes can help manage symptoms and slow the progression of the disease.
