Video link: HuggingFace Concise Tutorial — a hands-on example with a Chinese BERT model; quick start for NLP pre-trained models, the Transformers library, and the datasets library (bilibili)
1. Introduction to Hugging Face and installation
What is Hugging Face? Hugging Face is an open-source community that provides state-of-the-art NLP models, datasets, and other convenient tools.
Datasets: Hugging Face – The AI community building the future.
The datasets can be filtered by task, language, and so on.
Models: Models - Hugging Face
Official documentation: Hugging Face - Documentation
Main model families (a loading sketch follows this list):
        Autoregressive: GPT2, Transformer-XL, XLNet
        Autoencoding: BERT, ALBERT, RoBERTa, ELECTRA
        Seq2Seq: BART, Pegasus, T5
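As a quick orientation, here is a minimal sketch of loading one checkpoint from each family with the generic Auto classes. The checkpoint names are only illustrative; any compatible checkpoint from the Hub works.
#a minimal sketch: one illustrative checkpoint per model family
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM

causal_lm = AutoModelForCausalLM.from_pretrained('gpt2')                #autoregressive
masked_lm = AutoModelForMaskedLM.from_pretrained('bert-base-chinese')   #autoencoding
seq2seq_lm = AutoModelForSeq2SeqLM.from_pretrained('t5-small')          #Seq2Seq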
Environment setup:
        Prerequisites: Python and PyTorch installed
        Install the transformers and datasets packages:
#install transformers
#with pip
pip install transformers
#with conda
conda install -c huggingface transformers
#install datasets
#with pip
pip install datasets
#with conda
conda install -c huggingface -c conda-forge datasets
Installing with pip is recommended.
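To confirm that both packages are importable, a minimal check (the printed version numbers will differ by environment):
#verify the installation
import transformers
import datasets
print(transformers.__version__, datasets.__version__)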
2. Using the vocabulary and tokenizer
Load the tokenizer and prepare the corpus
When loading a tokenizer you pass a name; this name matches the name of the model, so each model has its own corresponding tokenizer.
from transformers import BertTokenizer
#load the pretrained vocabulary and tokenization method
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-chinese',
    cache_dir=None,
    force_download=False,
)
sents = [
    '選擇珠江花園的原因就是方便。',
    '筆記本的鍵盤確實(shí)爽。',
    '房間太小。其他的都一般。',
    '今天才知道這書還有第6卷,真有點(diǎn)郁悶.',
    '機(jī)器背面似乎被撕了張什么標(biāo)簽,殘膠還在。',
]
tokenizer, sents
Simple encoding
Encode two sentences at once; text_pair is optional, and if it is omitted only a single sentence is encoded (a single-sentence sketch follows the example below).
#encode two sentences
out = tokenizer.encode(
    text=sents[0],
    text_pair=sents[1],
    #truncate when the sequence is longer than max_length
    truncation=True,
    #pad to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    return_tensors=None,  #a plain list is returned by default
)
print(out)
tokenizer.decode(out)
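For comparison with the pair example above, a minimal sketch of encoding a single sentence by simply leaving text_pair out:
#encode a single sentence (text_pair omitted)
out_single = tokenizer.encode(
    text=sents[0],
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    return_tensors=None,
)
print(tokenizer.decode(out_single))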
The enhanced encoding function
#enhanced encoding function
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    #truncate when the sequence is longer than max_length
    truncation=True,
    #pad to max_length
    padding='max_length',
    max_length=30,
    add_special_tokens=True,
    #can be tf, pt or np; a plain list is returned by default
    return_tensors=None,
    #return token_type_ids
    return_token_type_ids=True,
    #return attention_mask
    return_attention_mask=True,
    #return special_tokens_mask, which marks the special tokens
    return_special_tokens_mask=True,
    #return offset_mapping, which marks the start and end position of each token; only available with BertTokenizerFast
    #return_offsets_mapping=True,
    #return length
    return_length=True,
)
Result of the enhanced encoding:
#input_ids are the encoded tokens
#token_type_ids: positions of the first sentence and the special tokens are 0, positions of the second sentence are 1
#special_tokens_mask: positions of special tokens are 1, all other positions are 0
#attention_mask: pad positions are 0, all other positions are 1
#length: length of the encoded sequence
for k, v in out.items():
    print(k, ':', v)
tokenizer.decode(out['input_ids'])
Batch encoding of sentences
The examples above encode one sentence or one pair at a time, but in practice sentences need to be encoded in batches. Here each element of the batch is a single sentence, not a pair.
#batch-encode sentences
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=[sents[0], sents[1]],
    add_special_tokens=True,
    #truncate when the sequence is longer than max_length
    truncation=True,
    #pad to max_length
    padding='max_length',
    max_length=15,
    #can be tf, pt or np; a plain list is returned by default
    return_tensors=None,
    #return token_type_ids
    return_token_type_ids=True,
    #return attention_mask
    return_attention_mask=True,
    #return special_tokens_mask, which marks the special tokens
    return_special_tokens_mask=True,
    #return offset_mapping, which marks the start and end position of each token; only available with BertTokenizerFast
    #return_offsets_mapping=True,
    #return length
    return_length=True,
)
Result of batch encoding:
#input_ids are the encoded tokens
#token_type_ids: positions of the first sentence and the special tokens are 0, positions of the second sentence are 1
#special_tokens_mask: positions of special tokens are 1, all other positions are 0
#attention_mask: pad positions are 0, all other positions are 1
#length: length of each encoded sequence
for k, v in out.items():
    print(k, ':', v)
tokenizer.decode(out['input_ids'][0]), tokenizer.decode(out['input_ids'][1])
Batch encoding of sentence pairs
The list passed in contains tuples, and each tuple holds a pair of sentences.
#batch-encode sentence pairs
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=[(sents[0], sents[1]), (sents[2], sents[3])],
    add_special_tokens=True,
    #truncate when the sequence is longer than max_length
    truncation=True,
    #pad to max_length
    padding='max_length',
    max_length=30,
    #can be tf, pt or np; a plain list is returned by default
    return_tensors=None,
    #return token_type_ids
    return_token_type_ids=True,
    #return attention_mask
    return_attention_mask=True,
    #return special_tokens_mask, which marks the special tokens
    return_special_tokens_mask=True,
    #return offset_mapping, which marks the start and end position of each token; only available with BertTokenizerFast
    #return_offsets_mapping=True,
    #return length
    return_length=True,
)
Result of batch pair encoding:
#input_ids are the encoded tokens
#token_type_ids: positions of the first sentence and the special tokens are 0, positions of the second sentence are 1
#special_tokens_mask: positions of special tokens are 1, all other positions are 0
#attention_mask: pad positions are 0, all other positions are 1
#length: length of each encoded sequence
for k, v in out.items():
    print(k, ':', v)
tokenizer.decode(out['input_ids'][0])
Vocabulary operations
Operate on the tokenizer's vocabulary; the current vocabulary treats each Chinese character as a token.
#get the vocabulary
zidian = tokenizer.get_vocab()
type(zidian), len(zidian), '月光' in zidian
#add new tokens
tokenizer.add_tokens(new_tokens=['月光', '希望'])
#add a new special token
tokenizer.add_special_tokens({'eos_token': '[EOS]'})
zidian = tokenizer.get_vocab()
type(zidian), len(zidian), zidian['月光'], zidian['[EOS]']
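Note that a tokenizer with extra tokens produces ids that the original model has no embeddings for. If the new tokens are going to be fed into a model, its embedding matrix usually needs to be resized; a minimal sketch, assuming a BERT model loaded from the same checkpoint:
#sketch: resize the embedding matrix to match the enlarged vocabulary (the model load is an assumption)
from transformers import BertModel
bert_model = BertModel.from_pretrained('bert-base-chinese')
bert_model.resize_token_embeddings(len(tokenizer))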
Encoding the new tokens:
#encode the newly added tokens
out = tokenizer.encode(
    text='月光的新希望[EOS]',
    text_pair=None,
    #truncate when the sequence is longer than max_length
    truncation=True,
    #pad to max_length
    padding='max_length',
    add_special_tokens=True,
    max_length=8,
    return_tensors=None,
)
print(out)
tokenizer.decode(out)
3. Dataset operations
Loading a dataset
Using a sentiment classification dataset as an example
from datasets import load_dataset
#load the data (the train split, so the examples below can index and slice it directly)
dataset = load_dataset(path='seamew/ChnSentiCorp', split='train')
dataset
#look at one example
dataset[0]
Sorting and shuffling
#sort
#before sorting, the labels are in arbitrary order
print(dataset['label'][:10])
#after sorting, the labels are ordered
sorted_dataset = dataset.sort('label')
print(sorted_dataset['label'][:10])
print(sorted_dataset['label'][-10:])
#shuffle
#shuffle the order
shuffled_dataset = sorted_dataset.shuffle(seed=42)
shuffled_dataset['label'][:10]
Selecting and filtering
#select
dataset.select([0, 10, 20, 30, 40, 50])
#filter
def f(data):
    return data['text'].startswith('選擇')

start_with_ar = dataset.filter(f)
len(start_with_ar), start_with_ar['text']
Splitting and sharding
#train_test_split: split into train and test sets
dataset.train_test_split(test_size=0.1)
#shard
#split the data evenly into 4 shards
dataset.shard(num_shards=4, index=0)
Column operations and type conversion
#rename_column
dataset.rename_column('text', 'textA')
#remove_columns
dataset.remove_columns(['text'])
#set_format
dataset.set_format(type='torch', columns=['label'])
dataset[0]
The map function
Apply a function f to every example in the dataset
#map
def f(data):
    data['text'] = 'My sentence: ' + data['text']
    return data

datatset_map = dataset.map(f)
datatset_map['text'][:5]
Saving and loading
#save the dataset to disk
dataset.save_to_disk('./data/ChnSentiCorp')
#load the data from disk
from datasets import load_from_disk
dataset = load_from_disk('./data/ChnSentiCorp')
Exporting to other formats
#export to csv
dataset = load_dataset(path='seamew/ChnSentiCorp', split='train')
dataset.to_csv(path_or_buf='./data/ChnSentiCorp.csv')
#load data in csv format
csv_dataset = load_dataset(path='csv',
                           data_files='./data/ChnSentiCorp.csv',
                           split='train')
#export to json
dataset = load_dataset(path='seamew/ChnSentiCorp', split='train')
dataset.to_json(path_or_buf='./data/ChnSentiCorp.json')
#load data in json format
json_dataset = load_dataset(path='json',
                            data_files='./data/ChnSentiCorp.json',
                            split='train')
4. Using evaluation metrics
List the available metrics
from datasets import list_metrics
#list the evaluation metrics
metrics_list = list_metrics()
len(metrics_list), metrics_list
View a metric's documentation
A metric can be used by following the example code in its documentation.
from datasets import load_metric
#load an evaluation metric
metric = load_metric('glue', 'mrpc')
print(metric.inputs_description)
Compute a metric
#compute a metric
predictions = [0, 1, 0]
references = [0, 1, 1]
final_score = metric.compute(predictions=predictions, references=references)
final_score
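A metric can also accumulate predictions batch by batch and compute the score once at the end, which is how it is typically used inside an evaluation loop. A minimal sketch with made-up predictions:
#sketch: accumulate batches, then compute the score once
for preds, refs in [([0, 1, 0], [0, 1, 1]), ([1, 1], [1, 0])]:
    metric.add_batch(predictions=preds, references=refs)
print(metric.compute())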
5. Using the pipeline function
pipeline provides off-the-shelf models that can perform some NLP tasks without any training; they are handy for quick demos, but their practical value for custom tasks is limited.
Sentiment classification
from transformers import pipeline
#text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I hate you")[0]
print(result)
result = classifier("I love you")[0]
print(result)
Reading comprehension (question answering)
from transformers import pipeline
#question answering
question_answerer = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""
result = question_answerer(question="What is extractive question answering?",
                           context=context)
print(result)
result = question_answerer(
    question="What is a good example of a question answering dataset?",
    context=context)
print(result)
Fill-mask (cloze)
from transformers import pipeline
#fill mask
unmasker = pipeline("fill-mask")
from pprint import pprint
sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'
unmasker(sentence)
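The mask placeholder depends on the pipeline's underlying model ([MASK] for BERT-style models, <mask> for RoBERTa-style models), so a slightly more robust sketch reads it from the pipeline's tokenizer:
#sketch: build the sentence with the model's own mask token
mask = unmasker.tokenizer.mask_token
pprint(unmasker(f'HuggingFace is creating a {mask} that the community uses to solve NLP tasks.'))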
Text generation
from transformers import pipeline
#text generation
text_generator = pipeline("text-generation")
text_generator("As far as I am concerned, I will",
               max_length=50,
               do_sample=False)
Named entity recognition
from transformers import pipeline
#named entity recognition
ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
for entity in ner_pipe(sequence):
    print(entity)
Text summarization
from transformers import pipeline
#summarization
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
Translation
from transformers import pipeline
#translation (English to German)
translator = pipeline("translation_en_to_de")
sentence = "Hugging Face is a technology company based in New York and Paris"
translator(sentence, max_length=40)
6. The Trainer API
Load the tokenizer
from transformers import AutoTokenizer
#load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
Define the dataset
from datasets import load_dataset
from datasets import load_from_disk
#load the dataset
#load from the Hub
#datasets = load_dataset(path='glue', name='sst2')
#load from local disk
datasets = load_from_disk('./data/glue_sst2')

#tokenize
def f(data):
    return tokenizer(
        data['sentence'],
        padding='max_length',
        truncation=True,
        max_length=30,
    )

datasets = datasets.map(f, batched=True, batch_size=1000, num_proc=4)

#take a subset of the data, otherwise there is too much to train on
dataset_train = datasets['train'].shuffle().select(range(1000))
dataset_test = datasets['validation'].shuffle().select(range(200))
del datasets
dataset_train
Load the model
from transformers import AutoModelForSequenceClassification
#load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased',
                                                           num_labels=2)
#number of model parameters, in units of 10,000
print(sum([i.nelement() for i in model.parameters()]) / 10000)
Define the evaluation function
import numpy as np
from datasets import load_metric
from transformers.trainer_utils import EvalPrediction
#load the evaluation metric
metric = load_metric('accuracy')

#define the evaluation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits = logits.argmax(axis=1)
    return metric.compute(predictions=logits, references=labels)

#simulate a model output for testing
eval_pred = EvalPrediction(
    predictions=np.array([[0, 1], [2, 3], [4, 5], [6, 7]]),
    label_ids=np.array([1, 1, 1, 1]),
)
compute_metrics(eval_pred)
Define the trainer and evaluate
from transformers import TrainingArguments, Trainer
#initialize the training arguments
args = TrainingArguments(output_dir='./output_dir', evaluation_strategy='epoch')
args.num_train_epochs = 1
args.learning_rate = 1e-4
args.weight_decay = 1e-2
args.per_device_eval_batch_size = 32
args.per_device_train_batch_size = 16

#initialize the trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    compute_metrics=compute_metrics,
)

#evaluate the model
trainer.evaluate()
Before fine-tuning, the accuracy is about 0.49.
#train
trainer.train()
After one epoch of training, the accuracy is about 0.8.
Save the model weights
#save the model
trainer.save_model(output_dir='./output_dir')
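save_model writes the weights and config into output_dir, so besides loading the state dict manually (as done in the next section), the saved directory can also be reloaded directly with from_pretrained; a minimal sketch:
#sketch: reload the fine-tuned model from the saved directory
from transformers import AutoModelForSequenceClassification
model_reloaded = AutoModelForSequenceClassification.from_pretrained('./output_dir')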
Using the saved model weights
Define the test dataloader
import torch

def collate_fn(data):
    label = [i['label'] for i in data]
    input_ids = [i['input_ids'] for i in data]
    token_type_ids = [i['token_type_ids'] for i in data]
    attention_mask = [i['attention_mask'] for i in data]
    label = torch.LongTensor(label)
    input_ids = torch.LongTensor(input_ids)
    token_type_ids = torch.LongTensor(token_type_ids)
    attention_mask = torch.LongTensor(attention_mask)
    return label, input_ids, token_type_ids, attention_mask

#data loader
loader_test = torch.utils.data.DataLoader(dataset=dataset_test,
                                          batch_size=4,
                                          collate_fn=collate_fn,
                                          shuffle=True,
                                          drop_last=True)

#take one batch for inspection
for i, (label, input_ids, token_type_ids,
        attention_mask) in enumerate(loader_test):
    break
label, input_ids, token_type_ids, attention_mask
Testing
import torch
#test
def test():
    #load the saved weights
    model.load_state_dict(torch.load('./output_dir/pytorch_model.bin'))
    model.eval()
    #forward pass
    out = model(input_ids=input_ids,
                token_type_ids=token_type_ids,
                attention_mask=attention_mask)
    #[4, 2] -> [4]
    out = out['logits'].argmax(dim=1)
    correct = (out == label).sum().item()
    return correct / len(label)
test()