Project repository: https://github.com/ymcui/Chinese-LLaMA-Alpaca
Expanding the Model Vocabulary
LLaMA was trained primarily on Latin- and Cyrillic-script languages, so its native Chinese support is limited, unlike ChatGLM and BLOOM, which support Chinese out of the box. Since LLaMA itself performs well on English, a promising approach is to expand its vocabulary with Chinese tokens and then continue pre-training and fine-tuning on Chinese data to improve its Chinese performance.
The original LLaMA vocabulary has 32K tokens, of which only a few hundred cover Chinese. The idea is to train a Chinese tokenizer on a Chinese corpus, then merge it with LLaMA's original tokenizer by combining their vocabularies, yielding a single merged tokenizer.
Here, the original 32K LLaMA vocabulary is extended with 20K Chinese tokens.
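The repository already ships a trained chinese_sp.model, but if you want to train your own, a minimal sketch with sentencepiece might look like the following (the corpus path and the training options are illustrative assumptions, not the authors' exact settings):
import sentencepiece as spm

# Train a 20K-token Chinese subword model; zh_corpus.txt is a hypothetical
# plain-text Chinese corpus with one sentence per line
spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',
    model_prefix='chinese_sp',   # writes chinese_sp.model / chinese_sp.vocab
    vocab_size=20000,            # matches the 20K tokens added below
    character_coverage=0.9995,   # keep rare CJK characters representable
    model_type='bpe',            # subword algorithm; an assumption here
)
The resulting chinese_sp.model is what the merge script below consumes.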
Code: https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py
Before running the script, set up an environment that can load LLaMA and download the LLaMA weights (the Chinese vocabulary model is already provided in the repository).
# Clone the official Chinese-LLaMA-Alpaca repository
git clone https://github.com/ymcui/Chinese-LLaMA-Alpaca.git
# Download the llama-7b-hf model
git lfs clone https://huggingface.co/yahma/llama-7b-hf
Run the vocabulary-merging script:
python merge_tokenizers.py \
--llama_tokenizer_dir '/data/sim_chatgpt/llama-7b-hf' \
--chinese_sp_model_file './chinese_sp.model'
- llama_tokenizer_dir: directory containing the original LLaMA tokenizer.
- chinese_sp_model_file: the Chinese vocabulary file trained with sentencepiece.
Annotated code:
import os
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm
import argparse
# Use the pure-Python protobuf implementation so the model proto can be edited
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
# Create an ArgumentParser object and add the arguments
parser = argparse.ArgumentParser()
parser.add_argument('--llama_tokenizer_dir', default=None, type=str, required=True)
parser.add_argument('--chinese_sp_model_file', default='./chinese_sp.model', type=str)
# Parse the arguments
args = parser.parse_args()
# Path to the LLaMA tokenizer
llama_tokenizer_dir = args.llama_tokenizer_dir
# Path to the Chinese tokenizer
chinese_sp_model_file = args.chinese_sp_model_file
# Load the tokenizers
# Load the LLaMA tokenizer
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
# Create the Chinese tokenizer
chinese_sp_model = spm.SentencePieceProcessor()
# Load the Chinese tokenizer
chinese_sp_model.Load(chinese_sp_model_file)
# The LLaMA tokenizer's underlying sentencepiece model proto
llama_spm = sp_pb2_model.ModelProto()
# Parse the sentencepiece model out of the LLaMA tokenizer
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
# The Chinese tokenizer's sentencepiece model proto
chinese_spm = sp_pb2_model.ModelProto()
# Parse the sentencepiece model out of the Chinese tokenizer
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())
# Print token statistics
# Vocabulary sizes of the two tokenizers; prints 32000 and 20000
print(len(llama_tokenizer), len(chinese_sp_model))
# The LLaMA tokenizer's special tokens; prints ['']
print(llama_tokenizer.all_special_tokens)
# The ids of those special tokens; prints [0]
print(llama_tokenizer.all_special_ids)
# The special-token map; prints {'bos_token': '', 'eos_token': '', 'unk_token': ''}
print(llama_tokenizer.special_tokens_map)
# Add the Chinese tokenizer's vocabulary to the LLaMA tokenizer (the merge step)
# The set of pieces already in the LLaMA vocabulary
llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)
# Vocabulary size before merging; prints 32000
print(len(llama_spm_tokens_set))
# Same size, labeled; prints Before:32000
print(f"Before:{len(llama_spm_tokens_set)}")
# Iterate over the Chinese tokenizer's vocabulary
for p in chinese_spm.pieces:
    # The Chinese piece (token)
    piece = p.piece
    # Only add pieces that are not already in the LLaMA vocabulary
    if piece not in llama_spm_tokens_set:
        # Create a new sentencepiece entry
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        # Set its surface form
        new_p.piece = piece
        # Set its score
        new_p.score = 0
        # Append it to the LLaMA vocabulary
        llama_spm.pieces.append(new_p)
# Merged vocabulary size; prints New model pieces: 49953
print(f"New model pieces: {len(llama_spm.pieces)}")
# Save the merged tokenizer
# Directory for the merged sentencepiece model
output_sp_dir = 'merged_tokenizer_sp'
# Directory for the merged tokenizer in Hugging Face format
output_hf_dir = 'merged_tokenizer_hf'
# Create the output folder
os.makedirs(output_sp_dir, exist_ok=True)
with open(output_sp_dir + '/chinese_llama.model', 'wb') as f:
    f.write(llama_spm.SerializeToString())
# Build a LlamaTokenizer from the merged sentencepiece model
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir + '/chinese_llama.model')
# Save the merged Chinese-LLaMA tokenizer
tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")
# Test the tokenizers
# Reload the original LLaMA tokenizer
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
# Load the merged Chinese-LLaMA tokenizer
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)
# The merged tokenizer's special tokens; prints ['<s>', '</s>', '<unk>']
print(tokenizer.all_special_tokens)
# The ids of those special tokens; prints [0, 1, 2]
print(tokenizer.all_special_ids)
# The special-token map; prints {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
print(tokenizer.special_tokens_map)
text = '''白日依山盡,黃河入海流。欲窮千里目,更上一層樓。
The primary use of LLaMA is research on large language models, including'''
print("Test text:\n", text)
print(f"Tokenized by LLaMA tokenizer:{llama_tokenizer.tokenize(text)}")
# Output:
# Tokenized by LLaMA tokenizer:['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', ',', '黃', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', ',', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']
print(f"Tokenized by Chinese-LLaMA tokenizer:{chinese_llama_tokenizer.tokenize(text)}")
# Output:
# Tokenized by Chinese-LLaMA tokenizer:['▁白', '日', '依', '山', '盡', ',', '黃河', '入', '海', '流', '。', '欲', '窮', '千里', '目', ',', '更', '上', '一層', '樓', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']
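As a quick sanity check after running the script, you can count how many tokens each tokenizer needs for the same sentence; fewer tokens per Chinese character means better compression. A minimal sketch, assuming the paths used in the example invocation above:
from transformers import LlamaTokenizer

# Paths assume the script above was run with the example arguments
llama_tokenizer = LlamaTokenizer.from_pretrained('/data/sim_chatgpt/llama-7b-hf')
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained('merged_tokenizer_hf')

text = '白日依山盡,黃河入海流。欲窮千里目,更上一層樓。'
# The original tokenizer falls back to raw bytes for most of these characters,
# so it should need noticeably more tokens than the merged one
print(len(llama_tokenizer.tokenize(text)), len(chinese_llama_tokenizer.tokenize(text)))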
Fine-tuning Chinese-Alpaca
Model list: based on Chinese data, this project has
- open-sourced Chinese LLaMA models (7B, 13B) pre-trained on Chinese text
- open-sourced Chinese Alpaca models (7B, 13B) further refined with instruction fine-tuning
Building a UI with text-generation-webui
The following walks through local deployment with the text-generation-webui tool, which does not require merging the model weights.
1. Create a new conda environment:
conda create -n textgen python=3.10
conda activate textgen
pip install torch torchvision torchaudio
2. Download the chinese-alpaca-lora-7b weights: https://drive.google.com/file/d/1JvFhBpekYiueWiUL3AF1TtaWDb3clY5D/view?usp=sharing
# Clone text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# Put the downloaded LoRA weights under the loras folder
ls loras/chinese-alpaca-lora-7b
adapter_config.json adapter_model.bin special_tokens_map.json tokenizer_config.json tokenizer.model
Three ways to download the base model:
- Via transformers-cli, downloading the HuggingFace-format llama-7B model files:
transformers-cli download decapoda-research/llama-7b-hf --cache-dir ./llama-7b-hf
- Via snapshot_download:
pip install huggingface_hub
python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="decapoda-research/llama-7b-hf", cache_dir="./llama-7b-hf")
- Via git (git-lfs must be installed first):
git clone https://huggingface.co/decapoda-research/llama-7b-hf
I used the second method here.
# Put the HuggingFace-format llama-7B model files under the models folder
ls models/llama-7b-hf
pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin config.json pytorch_model.bin.index.json generation_config.json
# Copy the LoRA weights' tokenizer files into models/llama-7b-hf
cp loras/chinese-alpaca-lora-7b/tokenizer.model ~/text-generation-webui/models/llama-7b-hf/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/
cp loras/chinese-alpaca-lora-7b/special_tokens_map.json ~/text-generation-webui/models/llama-7b-hf/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/
cp loras/chinese-alpaca-lora-7b/tokenizer_config.json ~/text-generation-webui/models/llama-7b-hf/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/
# Edit modules/LoRA.py, around line 28:
shared.model.resize_token_embeddings(len(shared.tokenizer))
shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_names[0]}"), **params)
# Now it can be run; see https://github.com/oobabooga/text-generation-webui/wiki/Using-LoRAs
# python server.py --model llama-7b-hf --lora chinese-alpaca-lora-7b
# With int8 quantization
python server.py --model llama-7b-hf/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/ --lora chinese-alpaca-lora-7b --load-in-8bit
This raises an error:
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([49954, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([49954, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
Fix: the LoRA checkpoint was trained with a 49954-token vocabulary (the 49953 merged pieces plus an added pad token), so resize the base model's embeddings to match before loading the adapter. Replace the patched lines with:
shared.model.resize_token_embeddings(49954)
assert shared.model.get_input_embeddings().weight.size(0) == 49954
shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_names[0]}"), **params)
To expose the UI publicly, follow the startup hint: To create a public link, set share=True in launch().
Result: the generated Chinese text is rather short.
Example:
Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
我得了流感,請(qǐng)幫我寫一封請(qǐng)假條 (I have the flu; please write me a sick-leave note)
### Response:
Deploying llama.cpp
Download the merged model weights:
- Colab notebook: https://colab.research.google.com/drive/1Eak6azD3MLeb-YsfbP8UZC8wrL1ddIMI?usp=sharing
- Or the ipynb file under the notebooks/ folder: https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/notebooks/convert_and_quantize_chinese_llama.ipynb
Download the merged weights to your local machine, then upload them to the server.
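If you would rather produce the merged weights locally instead of downloading them, a minimal sketch with transformers and peft follows (the paths here are assumptions, and the official repository also provides its own merge script for this purpose):
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Placeholder paths for your local base model and LoRA weights
base = LlamaForCausalLM.from_pretrained('llama-7b-hf', torch_dtype=torch.float16)
# Match the expanded Chinese-Alpaca vocabulary before loading the adapter
base.resize_token_embeddings(49954)
model = PeftModel.from_pretrained(base, 'chinese-alpaca-lora-7b')
# Fold the LoRA deltas into the base weights, then save a standalone model
model = model.merge_and_unload()
model.save_pretrained('alpaca-combined')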
# Clone the project
git clone https://github.com/ggerganov/llama.cpp
# Build
cd llama.cpp && make
# Create the target folder (nested, not two sibling folders)
mkdir -p llama.cpp/zh-models/7B
After putting all the files from alpaca-combined into the 7B directory, run:
mv llama.cpp/zh-models/7B/tokenizer.model llama.cpp/zh-models/
ls llama.cpp/zh-models/
This shows: 7B tokenizer.model
Run the conversion:
python convert.py zh-models/7B/
This produces ggml-model-f16.bin.
Quantizing the FP16 model to 4-bit
We then quantize the FP16 model down to 4-bit; the trailing 2 in the command selects the q4_0 quantization type.
./quantize ./zh-models/7B/ggml-model-f16.bin ./zh-models/7B/ggml-model-q4_0.bin 2
You can then run inference as needed:
./main -m ./zh-models/7B/ggml-model-f16.bin --color -f ./prompts/alpaca.txt -p "詳細(xì)介紹一下北京的名勝古跡:" -n 512
Reference: https://zhuanlan.zhihu.com/p/631360711?utm_id=0