Preface
Not long ago, Microsoft officially released Phi-2, a 2.7-billion-parameter language model. It is a text-to-text AI model with strong reasoning and language-understanding capabilities. Microsoft Research also stated on its official X account: "Phi-2's performance is better than other existing small language models, yet it is small enough to run on a laptop or a mobile device."
Microsoft compared Phi-2 against the 7B- and 13B-parameter Mistral and Llama-2 models on benchmarks such as Big Bench Hard (BBH), commonsense reasoning (PIQA, WinoGrande, ARC easy and Challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2, BoolQ), math (GSM8k), and coding (HumanEval).
The result: with only 2.7 billion parameters, Phi-2 surpasses Mistral 7B as well as Llama-2 7B and 13B. Notably, compared with Llama-2-70B, a model 25 times its size, Phi-2 also achieves better performance on multi-step reasoning tasks, i.e. coding and math.
This tutorial fine-tunes the Phi-2 model with QLoRA on a riddle dataset.
Model Fine-tuning
- The Phi-2 model is open-sourced on Hugging Face, and the riddle dataset is also available on Hugging Face.
- All of the code below was run on the Kaggle platform in a 2× T4 GPU environment.
Environment Setup
- Fine-tuning requires the libraries transformers, peft, datasets, evaluate, einops, and bitsandbytes.
from IPython.display import clear_output
!pip install -U transformers peft datasets evaluate einops bitsandbytes -q
!pip install accelerate==0.21.0 -q
clear_output()
- Import the required libraries.
import os
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
clear_output()
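Since everything below runs on Kaggle's 2× T4 setup, an optional sanity check can confirm that both GPUs are visible (not part of the original workflow, just a quick verification):
# Optional sanity check: make sure both Kaggle T4 GPUs are visible
print(torch.cuda.device_count())       # expected: 2
print(torch.cuda.get_device_name(0))   # expected: Tesla T4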
Data Preprocessing
Load the Data
- Load the dataset, split it into training and test sets, and print an example.
dataset = load_dataset("g-ronimo/riddles_evolved")
dataset = dataset["train"].train_test_split(test_size=0.1)
clear_output()
dataset["train"][0]
{'number': 286,
'messages': ["Born in despair, fed with mud, soon I'll bloom into beauty for the good.",
'This riddle appears to be describing a flower that grows in adverse conditions. Here\'s a step-by-step explanation:\n\n1. "Born in despair" suggests that the subject of the riddle had a difficult start in life.\n2. "Fed with mud" means that it has been nourished in dirty or poor conditions.\n3. "Soon I\'ll bloom into beauty" indicates that despite its challenging start, it will eventually become something beautiful.\n4. "Think step-by-step" is advice to break down complex problems into smaller, manageable parts.\n5. "Keep your explanations simple" is a recommendation to make complex ideas easy to understand.\n6. "Try your very best" emphasizes the importance of giving your all to accomplish your goals.\n7. "If you there is information missing for you to come up with a specific answer, just ask me a short question at the end of your answer" is a request to specify the topic or question if more clarification is needed.\n\nQuestion: What is the answer to the riddle, and what is the identity of the flower described?',
'What kind of flower starts in hardship, grows in dirt, but ultimately blooms into something lovely?',
'The answer to the riddle and the identity of the flower described is a sunflower. Sunflowers start as seeds in adversity, grow in soil that may not be ideal, but eventually bloom into beautiful yellow flowers that bring joy and sunlight.']}
Apply the ChatML Format and Tokenize the Dataset
- First organize the conversations with the ChatML format, then tokenize them. Setting the dataset up this way is important so that the model learns to answer like the assistant rather than to ask questions like the user. To achieve this, the labels of the tokens to be ignored are set to -100 (the IGNORE_INDEX). A rendered example of the format follows this list; note that the tokenizer used below is the one created later in the "Add the ChatML Tokens" section.
- Note: if you are familiar with the QLoRA library, you will notice that the original guanaco models are trained on both questions and answers (train_on_source=False by default in qlora.py). Those models, trained on the raw conversation text, perform well. For the riddles here, however, training only on the assistant part is the right approach.
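For illustration, this is what the first exchange from the dataset example above looks like once the ChatML templates below are applied (the assistant reply is abridged):
# Illustration only: one user/assistant turn rendered in ChatML (assistant reply abridged)
print("<|im_start|>user\nBorn in despair, fed with mud, soon I'll bloom into beauty for the good.<|im_end|>")
print("<|im_start|>assistant\nThis riddle appears to be describing a flower that grows in adverse conditions. ...<|im_end|>")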
import os
from functools import partial

# ChatML templates
templates = [
    "<|im_start|>assistant\n{msg}<|im_end|>",  # message by assistant
    "<|im_start|>user\n{msg}<|im_end|>",       # message by user
]

# special index used to ignore certain tokens during the loss calculation
IGNORE_INDEX = -100

def tokenize(input, max_length):
    input_ids, attention_mask, labels = [], [], []

    # iterate over every message in the sample
    for i, msg in enumerate(input["messages"]):

        # check whether the message comes from the user or the assistant, apply the ChatML template
        isHuman = i % 2 == 0
        msg_chatml = templates[isHuman].format(msg=msg)

        # tokenize everything, truncate later
        msg_tokenized = tokenizer(
            msg_chatml,
            truncation=False,
            add_special_tokens=False)

        # copy tokens and attention mask without changes
        input_ids += msg_tokenized["input_ids"]
        attention_mask += msg_tokenized["attention_mask"]

        # adapt labels for the loss calculation: user -> IGNORE_INDEX, assistant -> input_ids
        # user messages are ignored; the loss is computed only on assistant messages, since that is what we want to learn
        labels += [IGNORE_INDEX] * len(msg_tokenized["input_ids"]) if isHuman else msg_tokenized["input_ids"]

    # truncate to the maximum length
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }

dataset_tokenized = dataset.map(
    # truncate samples at 1024 tokens
    # enough for the riddle dataset (longest sample is about 1000 tokens)
    # adjust for other datasets; larger values need more VRAM
    partial(tokenize, max_length=1024),
    batched = False,
    # multiprocessing
    num_proc = os.cpu_count(),
    # remove the original columns, no longer needed
    remove_columns = dataset["train"].column_names
)
- If any of the code above is unclear, you can run parts of it on their own, for example to see how assistant and user messages are told apart.
for i, msg in enumerate(dataset['train'][0]['messages']):
    isHuman = i % 2 == 0
    print(i)
    print(isHuman)
    print(msg)
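To verify that the masking works as intended, you can also count how many tokens of a tokenized sample are excluded from the loss (a quick optional check, not part of the original post):
# Count masked (user) vs. unmasked (assistant) tokens in one tokenized sample
sample = dataset_tokenized["train"][0]
n_ignored = sum(1 for label in sample["labels"] if label == IGNORE_INDEX)
print(f"{n_ignored} of {len(sample['labels'])} tokens are ignored in the loss")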
Define the Collator
- The purpose of the collate function is to process and prepare batches of data for training (and evaluation). The key part is padding the inputs correctly: it standardizes the length of every data point in a batch by padding to the length of the longest sample with specific tokens. input_ids are padded with the pad token, labels with IGNORE_INDEX (to indicate that these tokens do not contribute to the loss), and attention_mask with 0 (so the padded tokens are ignored).
# collate function - transforms a list of dicts [{input_ids: [123, ..]}, {..}] into one dict
# forming a batch {input_ids: [..], labels: [..], attention_mask: [..]}
def collate(elements):

    # extract input_ids from every element and find the maximum length among them
    tokens = [e["input_ids"] for e in elements]
    tokens_maxlen = max([len(t) for t in tokens])

    for i, sample in enumerate(elements):
        input_ids = sample["input_ids"]
        labels = sample["labels"]
        attention_mask = sample["attention_mask"]

        # padding length needed to match the maximum token length
        pad_len = tokens_maxlen - len(input_ids)

        # pad 'input_ids' with the pad token id, 'labels' with IGNORE_INDEX, 'attention_mask' with 0
        input_ids.extend( pad_len * [tokenizer.pad_token_id] )
        labels.extend( pad_len * [IGNORE_INDEX] )
        attention_mask.extend( pad_len * [0] )

    # create and return the batch containing all the data from elements
    batch = {
        "input_ids": torch.tensor( [e["input_ids"] for e in elements] ),
        "labels": torch.tensor( [e["labels"] for e in elements] ),
        "attention_mask": torch.tensor( [e["attention_mask"] for e in elements] ),
    }
    return batch
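As a quick sketch of what the collator produces, you can feed it two tokenized samples and inspect the resulting tensor shapes (this assumes the tokenizer from the "Add the ChatML Tokens" section has already been loaded, since collate uses its pad token):
# Sketch: collate two samples of different lengths into one padded batch
example_batch = collate([dataset_tokenized["train"][0], dataset_tokenized["train"][1]])
print({k: v.shape for k, v in example_batch.items()})  # all tensors share the length of the longer sample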
Fine-tuning Phi-2
Load the Quantized Model
- Because GPU memory on the Kaggle platform is limited, only a quantized version of the model can be loaded.
- Load the 4-bit model and the tokenizer.
modelpath = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
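To see how much memory the 4-bit weights actually occupy, you can query the model's memory footprint (an optional check; the exact number depends on library versions):
# Optional: report how much memory the quantized model occupies
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")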
Add the ChatML Tokens
- Add the ChatML special tokens to the model and the tokenizer.
- ChatML is a chat markup format that structures conversations in a way the model can understand.
# the fast tokenizer sometimes ignores added tokens
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)

# add the ChatML special tokens
tokenizer.add_tokens(["<|im_start|>", "<PAD>"])
tokenizer.pad_token = "<PAD>"
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))

# resize the model embeddings
model.resize_token_embeddings(
    new_num_tokens=len(tokenizer),
    pad_to_multiple_of=64)
model.config.eos_token_id = tokenizer.eos_token_id
clear_output()
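A small optional check confirms that the newly added special tokens are each encoded as a single id instead of being split into pieces:
# Each added special token should map to exactly one token id
for tok in ["<|im_start|>", "<|im_end|>", "<PAD>"]:
    print(tok, tokenizer(tok, add_special_tokens=False)["input_ids"])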
Prepare the LoRA Adapter
- LoRA (Low-Rank Adaptation) is an efficient way to fine-tune large models. It only updates selected parts of the model during training, which speeds up the process and saves memory.
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# LoRA fine-tuning configuration
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules = ['fc1', 'fc2', 'Wqkv', 'out_proj'],
    lora_dropout=0.1,
    bias="none",
    modules_to_save = ["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM"
)

# add the adapter to the model
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing = False)
model = get_peft_model(model, lora_config)
model.config.use_cache = False
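With the adapter attached, PEFT can report how many parameters are actually trainable; the percentages quoted in the notes below can be verified this way:
# Print the number and share of trainable parameters (roughly 5% with this configuration)
model.print_trainable_parameters()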
- Notes on the LoRA configuration parameters:
  - rank: the rank in LoRA also determines the number of trainable parameters. A higher rank adds more trainable parameters, which means more flexibility and adaptability for the model, at the cost of higher computational complexity. Conversely, a lower rank means fewer trainable parameters, more efficient training, and a smaller computational burden, but possibly less flexibility. The choice of rank is therefore a trade-off between adaptability and computational efficiency.
  - lora_alpha: a scaling factor that controls how strongly the low-rank updates influence the model's original weights, i.e. how much the model's original behaviour is changed. The LoRA paper notes that "tuning alpha is roughly the same as tuning the learning rate". There is no consensus yet on how to set rank and lora_alpha; one common approach is to set lora_alpha = r, which is what we use here.
  - target_modules: with the parameters above we train only about 5.1% of the model weights (see the print_trainable_parameters() snippet above). With limited resources you can also train only the attention matrices and output projections (['Wqkv', 'out_proj']), which at rank=32 lowers the trainable share to about 4.4%. Training the linear layers as well should improve model quality, since it comes closer to a full fine-tune, but it also increases the adapter size.
- For more parameter descriptions, see the official Hugging Face documentation.
Start Training
- Notes on some of the training hyperparameters:
  - batch_size: a larger batch_size is better, but it is limited by the available VRAM. The longer the training samples (the larger max_length is during tokenization), the more VRAM is needed. In this example, with a max_length of 1024 tokens, a batch_size of 1 is the maximum on a 24 GB VRAM GPU. To increase the effective batch size, gradient_accumulation_steps is set to 16, at the cost of slowing training down (with a per-device batch size of 1, 16 accumulation steps, and 2 GPUs, the effective batch size is 1 × 16 × 2 = 32).
  - learning_rate: a learning rate of 2e-5 works well on this dataset (the script below uses 1e-5); 4e-5 may also work and produce a decent model without overfitting.
  - lr_scheduler_type: following QLoRA author Tim Dettmers' recommendation to use a constant learning-rate schedule, I adopted this approach and found it to work consistently well for Phi-2, Llama 1/2, and Mistral.
- More training hyperparameters are covered in the official documentation. With the training arguments set, start training.
from transformers import TrainingArguments, Trainer
bs=1 # batch size
ga_steps=16 # gradient acc. steps
epochs=15
lr=0.00001
steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=2,
    eval_steps=steps_per_epoch//2,  # eval twice per epoch
    save_steps=1,                   # save every step (save_total_limit below keeps only one checkpoint)
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type='constant',
    optim='paged_adamw_32bit',      # val_loss will go NaN with paged_adamw_8bit
    learning_rate=lr,
    group_by_length=False,
    fp16=True,
    metric_for_best_model='eval_loss',
    save_total_limit=1,
    # bf16=False,
    ddp_find_unused_parameters=False,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)
trainer.train()
Training Analysis
- Training set loss (loss curve figure omitted)
- Validation set loss (loss curve figure omitted)
Merge the Model
- After the LoRA adapter has been trained, it needs to be merged with the original model.
modelpath = "microsoft/phi-2"
adapter_path='/kaggle/input/phi-2-finetune/out/checkpoint-846'
save_to="merged"
base_model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    return_dict=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath)
tokenizer.add_tokens(["<|im_start|>", "<PAD>"])
tokenizer.pad_token = "<PAD>"
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
base_model.resize_token_embeddings(
    new_num_tokens=len(tokenizer),
    pad_to_multiple_of=64)
base_model.config.eos_token_id = tokenizer.eos_token_id
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()
model.save_pretrained(save_to, safe_serialization=True, max_shard_size='4GB')
tokenizer.save_pretrained(save_to)
clear_output()
Before and After Fine-tuning
- First load the original model, feed it a riddle, and see what it answers.
torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
inputs = tokenizer('''What makes a noise like a bell and flies, but cannot be seen? The answer lies in the bright blue sky.''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
clear_output()
print(text)
Output:
In the world of mathematics, we often encounter situations where we need to compare and order numbers. This skill is essential in various fields, including science, engineering, and even everyday life. Let’s explore the concept of comparing and ordering numbers using the tones of science, specifically the principles of physics and the states of matter.
Imagine you are in a science lab, conducting an experiment to study the behavior of different substances. You have a set of test tubes filled with various liquids, each representing a different state of matter. The liquids in the test tubes are like numbers, and we can compare and order them based on their properties.
- A rather poor answer, to put it mildly. Let's see what the fine-tuned model outputs.
model = AutoModelForCausalLM.from_pretrained("/kaggle/working/merged", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("/kaggle/working/merged", trust_remote_code=True)
inputs = tokenizer('''<|im_start|>What makes a noise like a bell and flies, but cannot be seen? The answer lies in the bright blue sky.<|im_end|>''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=300)
text = tokenizer.batch_decode(outputs)[0]
clear_output()
print(text)
Output:
The answer to the riddle is a “bluebird.” Bluebirds make a distinctive bell-like sound with their wings, and they are often seen flying in the sky. However, they cannot be seen with the naked eye as they are small birds. If you need more information, please let me know what specific aspect of the answer you would like to know.
- The fine-tuned model gives a reasonably satisfactory answer. Note that this answer comes from a model fine-tuned in the 4-bit quantized setting; fine-tuning in float32 might yield even better answers.
This concludes the tutorial on fine-tuning the Phi-2 small language model with QLoRA.