Contents
Part 1: Word Vector Processing
1.01 Bag-of-Words Model
1.02 simtext
1.03 Baidu PaddlePaddle (paddlenlp.embeddings)
1.04 Baidu Qianfan SDK (qianfan.Embedding)
1.2 SentenceTransformers (model files downloadable from within China)
1.2.1 Sentence embedding generation (SentenceTransformer)
1.2.2 Text similarity comparison (util.cos_sim)
1.2.3 Semantic text search (util.semantic_search)
1.2.4 Paraphrase mining (util.paraphrase_mining)
1.2.5 Image-text search
1.3 text2vec
1.3.1 Sentence embedding generation
Word2Vec
SentenceModel
1.3.2 Text similarity comparison (Similarity)
1.3.3 Semantic text search (semantic_search)
1.5 HuggingFace Transformers
Part 2: Fine-tuning a Pre-trained BERT Model for Mainstream NLP Tasks
2.1 Task description
2.2 Data preparation
2.2.1 Loading the data
2.2.2 Converting the data format
2.2.3 Building the DataLoader
2.3 Model construction
2.4 Training configuration
2.5 Model training and evaluation
2.6 Model testing
This article covers the text-vectorization methods commonly used in NLP and how to fine-tune pre-trained NLP models.
Natural Language Processing (NLP) is an important direction within computer science and artificial intelligence. Typical downstream NLP tasks include machine translation, public-opinion monitoring, automatic summarization, opinion extraction, text classification, question answering, semantic text comparison, speech recognition, Chinese OCR, and more.
Natural language is inherently sequential and context dependent. For NLP, researchers have developed a succession of models, including RNNs, LSTMs, Transformers, BERT, and GPT.
BERT is pre-trained with what is essentially a cloze (masked language modeling) task: the model must attend to context on both sides of a position to predict the token at that position. GPT, another strong NLP model, is trained autoregressively: it may only use the tokens before the current position to predict the current token.
As discussed in the previous article, the first step of any NLP task is to turn text into vectors.
Part 1: Word Vector Processing
Text vectorization in NLP means converting raw text into word vectors and sentence vectors, which makes semantic matching and search between words and sentences straightforward. The vectorization tools I have tried and organized are listed below.
1.01 Bag-of-Words Model
The bag-of-words model is a simple document representation used in natural language processing and information retrieval. Under this model, a document is represented by the counts of all of its words, ignoring grammar and word order. It is widely used in document classification, where each word's occurrence count (frequency) serves as a feature for the classifier.
Consider the following two simple documents:
Jane wants to go to Shenzhen.
Bob wants to go to Shanghai.
From these two documents we can build a dictionary:
{'Jane':1, 'wants':2, 'to':4, 'go':2, 'Shenzhen':1, 'Bob':1, 'Shanghai':1}
The two documents can then be represented as the following vectors:
Document 1: [1,1,2,1,1,0,0]
Document 2: [0,1,2,1,0,1,1]
The bag-of-words model therefore represents a document as a vector whose dimensionality equals the number of words in the dictionary; the i-th element is the number of times the i-th dictionary word appears in the document. In other words, it is a simple word-frequency histogram representation of a document.
The bag-of-words idea can also be applied to image classification; see 詞袋模型(Bag-of-words model)-CSDN博客.
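To reproduce these counts in code, here is a minimal sketch using scikit-learn's CountVectorizer (one possible tool among several; note that its vocabulary is sorted alphabetically, so the column order will differ from the hand-built dictionary above):
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Jane wants to go to Shenzhen.",
    "Bob wants to go to Shanghai.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix

print(vectorizer.get_feature_names_out())    # learned vocabulary (alphabetical order)
print(X.toarray())                           # one term-count vector per document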
1.02 simtext
simtext computes four text-similarity metrics between two documents:
- Sim_Cosine: cosine similarity (the most commonly used)
- Sim_Jaccard: Jaccard similarity
- Sim_MinEdit: minimum edit distance
- Sim_Simple: the similarity measure behind Microsoft Word's track changes
Its advantage is that no pre-trained model has to be downloaded; simply install it with pip:
pip install simtext
Chinese text similarity example:
from simtext import similarity
text1 = '在宏觀經(jīng)濟(jì)背景下,為繼續(xù)優(yōu)化貸款結(jié)構(gòu),重點(diǎn)發(fā)展可以抵抗經(jīng)濟(jì)周期不良的貸款'
text2 = '在宏觀經(jīng)濟(jì)背景下,為繼續(xù)優(yōu)化貸款結(jié)構(gòu),重點(diǎn)發(fā)展可三年專業(yè)化、集約化、綜合金融+物聯(lián)網(wǎng)金融四大金融特色的基礎(chǔ)上'
sim = similarity()
res = sim.compute(text1, text2)
print(res)
Output:
{'Sim_Cosine': 0.46475800154489,
'Sim_Jaccard': 0.3333333333333333,
'Sim_MinEdit': 29,
'Sim_Simple': 0.9889595182335229}
English text similarity example:
from simtext import similarity
A = 'We expect demand to increase.'
B = 'We expect worldwide demand to increase.'
C = 'We expect weakness in sales'
sim = similarity()
AB = sim.compute(A, B)
AC = sim.compute(A, C)
print(AB)
print(AC)
Output:
{'Sim_Cosine': 0.9128709291752769,
'Sim_Jaccard': 0.8333333333333334,
'Sim_MinEdit': 2,
'Sim_Simple': 0.9545454545454546}
{'Sim_Cosine': 0.39999999999999997,
'Sim_Jaccard': 0.25,
'Sim_MinEdit': 4,
'Sim_Simple': 0.9315789473684211}
1.03 Baidu PaddlePaddle (paddlenlp.embeddings)
First install the paddlenlp package with pip install -U paddlenlp.
Word vectors
Using the pre-trained models in paddlenlp.embeddings, you can directly obtain the vector of a single word and compare word vectors by similarity. Example:
from paddlenlp.embeddings import TokenEmbedding
# Initialize TokenEmbedding; the pre-trained embedding is downloaded and loaded automatically on first use
token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")
# Inspect the TokenEmbedding
#print(token_embedding)
# Look up a word vector
test_token_embedding = token_embedding.search("中國")
#print(test_token_embedding)
# Compare word vectors
score1 = token_embedding.cosine_sim("女孩", "女皇")
score2 = token_embedding.cosine_sim("女孩", "小女孩")
score3 = token_embedding.cosine_sim("女孩", "中國")
print('score1:', score1)
print('score2:', score2)
print('score3:', score3)
----------------------------------------------------------------------------
score1: 0.32632214
score2: 0.7869123
score3: 0.15649165
Sentence vectors
A crude way to obtain a sentence vector is to add up (or average) the vectors of all the words in the sentence. A vector obtained this way, however, does not capture the meaning of the sentence well, so its accuracy is limited.
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.embeddings import TokenEmbedding
from paddlenlp.data import JiebaTokenizer

# Initialize TokenEmbedding; the pre-trained embedding is downloaded and loaded automatically on first use
token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")
# Inspect the TokenEmbedding
#print(token_embedding)

tokenizer = JiebaTokenizer(vocab=token_embedding.vocab)

def get_sentence_embedding(text):
    # Tokenize the sentence
    words = tokenizer.cut(text)
    print(words)
    # Look up the word vectors
    word_embeddings = token_embedding.search(words)
    #print(word_embeddings)
    # Average the word vectors to get a sentence vector
    sentence_embedding = np.sum(word_embeddings, axis=0) / len(words)
    #print(sentence_embedding)
    return sentence_embedding

text1 = "飛槳是優(yōu)秀的深度學(xué)習(xí)平臺(tái)"
text2 = "我喜歡喝咖啡"
sen_emb1 = get_sentence_embedding(text1)
print("Sentence embedding 1:\n", sen_emb1.shape)
sen_emb2 = get_sentence_embedding(text2)
print("Sentence embedding 2:\n", sen_emb2.shape)
sim = F.cosine_similarity(paddle.to_tensor(sen_emb1).unsqueeze(0), paddle.to_tensor(sen_emb2).unsqueeze(0))
print("Similarity: {:.5f}".format(sim.item()))
1.04 Baidu Qianfan SDK (qianfan.Embedding)
Baidu's Qianfan large-model SDK also provides an embedding API. First install the SDK:
pip install qianfan -U
Usage:
# Basic Embedding usage
import qianfan
# Replace the placeholders: use your application's API Key for your_ak and its Secret Key for your_sk
emb = qianfan.Embedding(ak="your_ak", sk="your_sk")
resp = emb.do(texts=[  # if model is omitted, the default model Embedding-V1 is used
    "世界上最高的山"
])
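Once embeddings are returned, they can be compared like any other vectors. Below is a small cosine-similarity helper; it assumes only that each embedding is available as a plain sequence of floats, and the vectors in the last line are made-up toy values for illustration:
import numpy as np

def cosine_similarity(vec_a, vec_b):
    # Plain cosine similarity between two embedding vectors
    a = np.asarray(vec_a, dtype=np.float32)
    b = np.asarray(vec_b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([0.1, 0.2, 0.3], [0.1, 0.25, 0.28]))  # toy vectors, for illustration only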
1.2 SentenceTransformers (model files downloadable from within China)
SentenceTransformers is a Python library for computing vector representations of text and images.
(Official documentation: SentenceTransformers Documentation — Sentence-Transformers documentation)
First install the sentence_transformers package with pip install -U sentence_transformers.
The library generates embeddings with BERT-based models, so the resulting vectors express sentence meaning fairly accurately. It can be used for embedding generation, similarity comparison, semantic matching, and related tasks.
The package's model files are currently accessible from within China and can be downloaded directly:
https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/
Find the model named paraphrase-multilingual-MiniLM-L12-v2 and click it to download.
1.2.1 Sentence embedding generation (SentenceTransformer)
The SentenceTransformer class in the sentence_transformers package generates sentence embeddings.
Example:
import sys
from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformer as SBert
#model = SBert('paraphrase-multilingual-MiniLM-L12-v2')
model = SBert("C:\\Users\\aric\\.models\\paraphrase-multilingual-MiniLM-L12-v2")
# Two lists of sentences
sentences1 = ['如何更換花唄綁定銀行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']
sentences2 = ['花唄更改綁定銀行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']
# Compute embedding for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)
print(type(embeddings1), embeddings1.shape)
# The result is a list of sentence embeddings as numpy arrays
for sentence, embedding in zip(sentences1, embeddings1):
    print("Sentence:", sentence)
    print("Embedding shape:", embedding.shape)
    print("Embedding head:", embedding[:10])
    print()
-----------------------------------------------------------------------------------
<class 'numpy.ndarray'> (4, 384)
Sentence: 如何更換花唄綁定銀行卡
Embedding shape: (384,)
Embedding head: [-0.08839616 0.29445878 -0.25130653 -0.00759273 -0.0749087 -0.12786895
0.07136863 -0.01503289 -0.19017595 -0.12699445]
Sentence: The cat sits outside
Embedding shape: (384,)
Embedding head: [ 0.45684573 -0.14459176 -0.0388849 0.2711025 0.0222025 0.2317232
0.14208616 0.13658428 -0.27846363 0.05661529]
Sentence: A man is playing guitar
Embedding shape: (384,)
Embedding head: [-0.20837498 0.00522519 -0.23411965 -0.07861497 -0.35490423 -0.27809393
0.24954818 0.15160584 0.01028005 0.1939052 ]
Sentence: The new movie is awesome
Embedding shape: (384,)
Embedding head: [-0.5378314 -0.36144564 -0.5304235 -0.20994733 -0.03825595 0.22604015
0.35931802 0.14547679 0.05396605 -0.08255189]
1.2.2 Text similarity comparison (util.cos_sim)
Example:
import sys
from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformer as SBert
#model = SBert('paraphrase-multilingual-MiniLM-L12-v2')
model = SBert("C:\\Users\\aric\\.models\\paraphrase-multilingual-MiniLM-L12-v2")
# Two lists of sentences
sentences1 = ['如何更換花唄綁定銀行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']
sentences2 = ['花唄更改綁定銀行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']
# Compute embedding for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)
print(type(embeddings1), embeddings1.shape)
# The result is a list of sentence embeddings as numpy arrays
"""
for sentence, embedding in zip(sentences1, embeddings1):
    print("Sentence:", sentence)
    print("Embedding shape:", embedding.shape)
    print("Embedding head:", embedding[:10])
    print()
"""
# Compute cosine similarities
cosine_scores_0 = cos_sim(embeddings1[0], embeddings2[0])
cosine_scores = cos_sim(embeddings1, embeddings2)
print(cosine_scores_0)
print(cosine_scores)
---------------------------------------------------------------------------------------
<class 'numpy.ndarray'> (4, 384)
tensor([[0.9477]])
tensor([[ 0.9477, -0.1748, -0.0839, -0.0044],
[-0.0097, 0.1908, -0.0203, 0.0302],
[-0.0010, 0.1062, 0.0055, 0.0097],
[ 0.0302, -0.0160, 0.1321, 0.9591]])
Note: the values on the diagonal of the final 4x4 matrix are the similarity scores for each aligned pair of sentences.
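If only the aligned pairs matter, the scores can be read off that diagonal directly; this small follow-up reuses the sentences1, sentences2, and cosine_scores variables from the code above:
for i in range(len(sentences1)):
    # cosine_scores[i][i] is the similarity of the i-th aligned pair
    print("{} <-> {} : {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i].item()))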
1.2.3 Semantic text search (util.semantic_search)
Semantic search improves retrieval accuracy by understanding the meaning of the query rather than relying on lexical overlap alone, using the similarity between sentence embeddings. All entries (sentences) in the corpus are embedded into a vector space; at query time, the query is embedded into the same space and the closest corpus vectors are returned.
Example:
from sentence_transformers import SentenceTransformer, util
# Download model
model = SentenceTransformer("C:\\Users\\aric\\.models\\paraphrase-multilingual-MiniLM-L12-v2")
# Corpus of documents and their embeddings
corpus = ['Python is an interpreted high-level general-purpose programming language.',
          'Python is dynamically-typed and garbage-collected.',
          'The quick brown fox jumps over the lazy dog.']
corpus_embeddings = model.encode(corpus)
# Queries and their embeddings
queries = ["What is Python?", "What did the fox do?"]
queries_embeddings = model.encode(queries)
# Find the top-2 corpus documents matching each query
hits = util.semantic_search(queries_embeddings, corpus_embeddings, top_k=2)
# Print results of first query
print(f"Query: {queries[0]}")
for hit in hits[0]:
    print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
# Print results of second query
print(f"Query: {queries[1]}")
for hit in hits[1]:
    print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
---------------------------------------------------------------------------------------
Output:
Query: What is Python?
Python is an interpreted high-level general-purpose programming language. (Score: 0.7616)
Python is dynamically-typed and garbage-collected. (Score: 0.6267)
Query: What did the fox do?
The quick brown fox jumps over the lazy dog. (Score: 0.4893)
Python is dynamically-typed and garbage-collected. (Score: 0.0746)
1.2.4 Paraphrase mining (util.paraphrase_mining)
Paraphrase mining searches a large set of sentences for paraphrases, i.e. pairs of texts with very similar meaning.
This is done with the paraphrase_mining function of the util module.
from sentence_transformers import SentenceTransformer, util
# Download model
model = SentenceTransformer('all-MiniLM-L6-v2')
# List of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?',
             '我喜歡喝咖啡',
             '我愛喝咖啡',
             '我喜歡喝牛奶',]
# Look for paraphrases
paraphrases = util.paraphrase_mining(model, sentences)
# Print paraphrases
print("Top 5 paraphrases")
for paraphrase in paraphrases[0:5]:
    score, i, j = paraphrase
    print("Score {:.4f} ---- {} ---- {}".format(score, sentences[i], sentences[j]))
---------------------------------------------------------------------------------------
Top 5 paraphrases
Score 0.9751 ---- 我喜歡喝咖啡 ---- 我愛喝咖啡
Score 0.9591 ---- The new movie is awesome ---- The new movie is so great
Score 0.6774 ---- The cat sits outside ---- The cat plays in the garden
Score 0.6384 ---- 我喜歡喝咖啡 ---- 我喜歡喝牛奶
Score 0.6007 ---- 我愛喝咖啡 ---- 我喜歡喝牛奶
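Since paraphrase_mining returns pairs sorted by score, you can also keep only pairs above a similarity threshold; a small post-filtering sketch on the paraphrases list above (the 0.7 cutoff is an arbitrary value chosen for illustration):
threshold = 0.7  # arbitrary cutoff, for illustration only
strong_pairs = [(score, i, j) for score, i, j in paraphrases if score >= threshold]
for score, i, j in strong_pairs:
    # Each entry is (similarity score, index of sentence 1, index of sentence 2)
    print("Score {:.4f} ---- {} ---- {}".format(score, sentences[i], sentences[j]))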
1.2.5 Image-text search
SentenceTransformers also provides models that embed images and text into the same vector space. With such a model you can find similar images and build image search, i.e. retrieve images with a text query and vice versa.
(Figure omitted: an illustration of text and images embedded in the same vector space.)
To run image search, load a model such as CLIP and use its encode method to encode both images and text:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')
# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
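Building on the same CLIP model, a minimal text-to-image search can be assembled with util.semantic_search. The image file names below are placeholders, so substitute files that exist on your machine:
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Hypothetical image files; replace them with your own paths
image_paths = ['two_dogs_in_snow.jpg', 'cat_on_table.jpg', 'london_night.jpg']
img_embeddings = model.encode([Image.open(p) for p in image_paths])

# Encode the text query and retrieve the closest image
query_embedding = model.encode(['Two dogs playing in the snow'])
hits = util.semantic_search(query_embedding, img_embeddings, top_k=1)
best = hits[0][0]
print(image_paths[best['corpus_id']], "(Score: {:.4f})".format(best['score']))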
For more details, see: [NLP] SentenceTransformers使用介紹_sentence transformer訓(xùn)練-CSDN博客
1.3 text2vec
text2vec appears to be developed by a Chinese developer (it reportedly wraps sentence-transformers internally). It likewise supports embedding generation, similarity comparison, semantic matching, and related tasks.
Its models are mostly published on Hugging Face, which is currently not reliably accessible from within China.
1.3.1 Sentence embedding generation
Word2Vec
The first option is the Word2Vec class in the text2vec package.
It uses the Tencent word vectors Tencent_AILab_ChineseEmbedding (currently downloadable) to compute word vectors; the sentence vector is the average of the word vectors (so, again, correct understanding of sentence meaning is not guaranteed).
First install the text2vec package with pip install -U text2vec.
from text2vec import Word2Vec

def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '銀行卡',
        '如何更換花唄綁定銀行卡',
        '花唄更改綁定銀行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)
    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()

# Chinese word2vec model; suitable for literal (surface-form) matching and cold-start scenarios
w2v_model = Word2Vec("w2v-light-tencent-chinese")
compute_emb(w2v_model)
------------------------------------------------------------------------------------
Output:
<class 'numpy.ndarray'> (7, 200)
Sentence: 卡
Embedding shape: (200,)
Embedding head: [ 0.06761453 -0.10960816 -0.04829824 0.0156597 -0.09412017 -0.04805465
-0.03369278 -0.07476041 -0.01600934 0.03106228]
Sentence: 銀行卡
Embedding shape: (200,)
Embedding head: [ 0.01032454 -0.13564903 -0.00089282 0.02286329 -0.03501284 0.00987683
0.02884413 -0.03491557 0.02036332 0.04516884]
Sentence: 如何更換花唄綁定銀行卡
Embedding shape: (200,)
Embedding head: [ 0.02396784 -0.13885356 0.00176219 0.02540027 0.00949343 -0.01486312
0.01011733 0.00190828 0.02708069 0.04316072]
Sentence: 花唄更改綁定銀行卡
Embedding shape: (200,)
Embedding head: [ 0.00871027 -0.14244929 -0.00959482 0.03021128 0.01514321 -0.01624702
0.00260827 0.0131352 0.02293272 0.04481505]
Sentence: This framework generates embeddings for each input sentence
Embedding shape: (200,)
Embedding head: [-0.08317478 -0.00601972 -0.06293213 -0.03963032 -0.0145333 -0.0549945
0.05606257 0.02389491 -0.02102496 0.03023159]
Sentence: Sentences are passed as a list of string.
Embedding shape: (200,)
Embedding head: [-0.08008799 -0.01654172 -0.04550576 -0.03715633 0.00133283 -0.04776235
0.04780829 0.01377041 -0.01251951 0.02603387]
Sentence: The quick brown fox jumps over the lazy dog.
Embedding shape: (200,)
Embedding head: [-0.08605123 -0.01434057 -0.06376401 -0.03962022 -0.00724643 -0.05585583
0.05175515 0.02725058 -0.01821304 0.02920807]
w2v-light-tencent-chinese is a Word2Vec model loaded through gensim; it is downloaded automatically to ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin.
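To compare two sentences with these averaged word2vec embeddings, cosine similarity can be computed directly on the vectors; a short sketch that reuses the w2v_model instance created above:
import numpy as np

emb = w2v_model.encode(['如何更換花唄綁定銀行卡', '花唄更改綁定銀行卡'])
# Cosine similarity between the two averaged sentence vectors
score = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print("Cosine similarity: {:.4f}".format(score))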
SentenceModel
The second option is the SentenceModel class in the text2vec package (similar in spirit to SentenceTransformers):
import sys
sys.path.append('..')
from text2vec import SentenceModel

def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '銀行卡',
        '如何更換花唄綁定銀行卡',
        '花唄更改綁定銀行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)
    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()

if __name__ == "__main__":
    # Chinese sentence embedding model (CoSENT); recommended for Chinese semantic matching, supports further fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)
    # Multilingual sentence embedding model (CoSENT); recommended for multilingual (incl. Chinese-English) semantic matching, supports further fine-tuning
    sbert_model = SentenceModel("shibing624/text2vec-base-multilingual")
    compute_emb(sbert_model)
1.3.2 Text similarity comparison (Similarity)
text2vec.Similarity compares two texts directly; by default it uses the "shibing624/text2vec-base-chinese" model to produce the sentence embeddings. The same caveat applies: the model files are hosted on Hugging Face, which may not be accessible from within China.
import sys
sys.path.append('..')
from text2vec import Similarity
# Two lists of sentences
sentences1 = ['如何更換花唄綁定銀行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']
sentences2 = ['花唄更改綁定銀行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']
sim_model = Similarity()
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        score = sim_model.get_score(sentences1[i], sentences2[j])
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], score))
-------------------------------------------------------------------------------------------
如何更換花唄綁定銀行卡 花唄更改綁定銀行卡 Score: 0.9477
如何更換花唄綁定銀行卡 The dog plays in the garden Score: -0.1748
如何更換花唄綁定銀行卡 A woman watches TV Score: -0.0839
如何更換花唄綁定銀行卡 The new movie is so great Score: -0.0044
The cat sits outside 花唄更改綁定銀行卡 Score: -0.0097
The cat sits outside The dog plays in the garden Score: 0.1908
The cat sits outside A woman watches TV Score: -0.0203
The cat sits outside The new movie is so great Score: 0.0302
A man is playing guitar 花唄更改綁定銀行卡 Score: -0.0010
A man is playing guitar The dog plays in the garden Score: 0.1062
A man is playing guitar A woman watches TV Score: 0.0055
A man is playing guitar The new movie is so great Score: 0.0097
The new movie is awesome 花唄更改綁定銀行卡 Score: 0.0302
The new movie is awesome The dog plays in the garden Score: -0.0160
The new movie is awesome A woman watches TV Score: 0.1321
The new movie is awesome The new movie is so great Score: 0.9591
1.3.3 Semantic text search (semantic_search)
Semantic search typically finds the documents in a candidate set that are most similar to a query; it is commonly used for question matching in QA systems and for semantic retrieval. text2vec provides a semantic_search function for this, which by default also uses the "shibing624/text2vec-base-chinese" model.
import sys
sys.path.append('..')
from text2vec import SentenceModel, cos_sim, semantic_search

embedder = SentenceModel()

# Corpus with example sentences
corpus = [
    '花唄更改綁定銀行卡',
    '我什么時(shí)候開通了花唄',
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = [
    '如何更換花唄綁定銀行卡',
    'A man is eating pasta.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=5)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
-------------------------------------------------------------------------------------
Query: 如何更換花唄綁定銀行卡
Top 5 most similar sentences in corpus:
花唄更改綁定銀行卡 (Score: 0.9477)
我什么時(shí)候開通了花唄 (Score: 0.3635)
A man is eating food. (Score: 0.0321)
A man is riding a horse. (Score: 0.0228)
Two men pushed carts through the woods. (Score: 0.0090)
======================
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.6734)
A man is eating a piece of bread. (Score: 0.4269)
A man is riding a horse. (Score: 0.2086)
A man is riding a white horse on an enclosed ground. (Score: 0.1020)
A cheetah is running behind its prey. (Score: 0.0566)
======================
Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.8167)
A cheetah is running behind its prey. (Score: 0.2720)
A woman is playing violin. (Score: 0.1721)
A man is riding a horse. (Score: 0.1291)
A man is riding a white horse on an enclosed ground. (Score: 0.1213)
======================
Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.9147)
A monkey is playing drums. (Score: 0.2655)
A man is riding a horse. (Score: 0.1933)
A man is riding a white horse on an enclosed ground. (Score: 0.1733)
A man is eating food. (Score: 0.0329)
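The cos_sim helper imported above can also be applied to the encoded vectors directly. A small sketch, assuming text2vec's cos_sim accepts single vectors the same way sentence_transformers' util.cos_sim does:
# Compare the first query against the first corpus sentence directly
query_vec = embedder.encode(queries[0])
score = cos_sim(query_vec, corpus_embeddings[0])
print(score)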
1.5 HuggingFace Transformers
Models published on the HuggingFace Hub can be used directly through the AutoModel and AutoTokenizer classes, which locate and download the matching model automatically (unfortunately, the Hub is currently not reliably accessible from within China).
import os
import torch
from transformers import AutoTokenizer, AutoModel
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更換花唄綁定銀行卡', '花唄更改綁定銀行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
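As a small follow-up (not part of the original snippet), the two pooled embeddings can be L2-normalized and compared with cosine similarity:
import torch.nn.functional as F

# After L2 normalization, the dot product equals the cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = (normalized[0] * normalized[1]).sum()
print("Cosine similarity: {:.4f}".format(similarity.item()))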
Part 2: Fine-tuning a Pre-trained BERT Model for Mainstream NLP Tasks
Pre-trained models established a new paradigm for NLP: pre-train, then fine-tune. This has greatly advanced the field.
Under this paradigm, pre-trained models can be applied to a wide range of NLP tasks. The common classic NLP task types include:
- Classification: given a piece of text, predict its class label
- Matching: given two pieces of text, decide whether they are semantically similar
- Question answering: given a question and a document, extract the answer span from the document
- Sequence labeling: given a piece of text, output a label for every token
- Generation: given a piece of text, generate another piece of text
The rest of this article uses the text matching task to walk through using and fine-tuning a pre-trained model.
2.1 Task description
Text matching is one of the fundamental core tasks in natural language processing: deciding whether two given sentences are semantically similar. It has a wide range of applications, such as information retrieval, question answering, and textual entailment.
For example, a text matching system should be able to judge the semantic relationships among the following three sentences:
- 蘋果在什么時(shí)候成熟?(When do apples ripen?)
- 蘋果一般在幾月份成熟?(In which month do apples usually ripen?)
- 蘋果手機(jī)什么時(shí)候可以買?(When will the Apple phone be available to buy?)
We expect the system to automatically determine that sentences 1 and 2 are semantically similar, while sentences 1 and 3, and sentences 2 and 3, are not.
This part builds a text matching model on top of the BERT model in the PaddleNLP library to demonstrate the pre-train + fine-tune paradigm. Since the BERT model in PaddleNLP is already pre-trained, we fine-tune it on the LCQMC dataset for the text matching task.
2.2 Data preparation
LCQMC is a Chinese question matching dataset from the Baidu Knows domain, collected from users across different domains. It contains 238,766 training pairs, 8,802 validation pairs, and 12,500 test pairs. Each example has three columns: the two texts whose semantic similarity is to be judged, followed by a label, where 1 means similar and 0 means not similar. Sample rows from the dataset:
什么花一年四季都開 什么花一年四季都是開的 1
大家覺得她好看嗎 大家覺得跑男好看嗎? 0
2.2.1 Loading the data
The LCQMC dataset ships with PaddleNLP, so we use the built-in copy for the text matching task. The training, validation, and test splits can be loaded as follows; note that the training and validation sets are labeled, while the test set is not.
import os
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.utils.download import get_path_from_url
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Pad, Stack, Tuple, Vocab

# Load the LCQMC train, dev, and test splits
train_set, dev_set, test_set = load_dataset("lcqmc", splits=["train", "dev", "test"])

# Print the first few training samples
for idx, example in enumerate(train_set):
    if idx <= 3:
        #example['query'] = "我愛中國"
        print(example)
Output:
{'query': '喜歡打籃球的男生喜歡什么樣的女生', 'title': '愛打籃球的男生喜歡什么樣的女生', 'label': 1}
{'query': '我手機(jī)丟了,我想換個(gè)手機(jī)', 'title': '我想買個(gè)新手機(jī),求推薦', 'label': 1}
{'query': '大家覺得她好看嗎', 'title': '大家覺得跑男好看嗎?', 'label': 0}
{'query': '求秋色之空漫畫全集', 'title': '求秋色之空全集漫畫', 'label': 1}
2.2.2 Converting the data format
BERT's input encoding is the combination of token embeddings, segment embeddings, and position embeddings (figure omitted).
The loaded data therefore has to be converted into this input format. In general, the text is first tokenized to obtain a token sequence, and that token sequence is then mapped to an ID sequence. Each type of encoding corresponds to its own ID sequence; concretely, the token, segment, and position encodings correspond to input_ids, segment_ids, and position_ids:
- input_ids: obtained by tokenizing the input text and mapping each token to its vocabulary ID
- segment_ids: also called token_type_ids; constructed according to whether the input is a single sentence or a sentence pair
- position_ids: generally you do not need to build these yourself; the model generates them internally
This section uses PaddleNLP's built-in BertTokenizer to process the text. It turns an input text sequence directly into the form BERT expects: it automatically inserts the [CLS] and [SEP] tokens in the right places, tokenizes the text, and converts the sequence into the corresponding IDs.
By default, BertTokenizer returns input_ids and token_type_ids. The example below shows both the single-sentence and the sentence-pair cases.
from paddlenlp.transformers import BertTokenizer
# Load BERT's tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# Single-sentence input
text = "今天天氣很好呀"
# max_seq_len is the maximum sequence length; inputs longer than max_seq_len are truncated to that length
encoded_input = tokenizer(text=text, max_seq_len=512)
print(encoded_input)
# Sentence-pair input
text_a = "今天天氣很好呀"
text_b = "明天天氣會(huì)更好"
encoded_input = tokenizer(text=text_a, text_pair=text_b, max_seq_len=512)
print(encoded_input)
Output:
{'input_ids': [101, 791, 1921, 1921, 3698, 2523, 1962, 1435, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0]}
{'input_ids': [101, 791, 1921, 1921, 3698, 2523, 1962, 1435, 102, 3209, 1921, 1921, 3698, 833, 3291, 1962, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]}
Next we define the convert_example_to_feature function, which converts the loaded text data into the corresponding ID form:
from functools import partial
from paddlenlp.transformers import BertTokenizer

# Convert an input example into the feature form the model expects
def convert_example_to_feature(example, tokenizer, max_seq_len=128, is_test=False):
    encoded_inputs = tokenizer(text=example["query"], text_pair=example["title"], max_seq_len=max_seq_len)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    label = example["label"]
    if not is_test:
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids

# Maximum input sequence length for the model
max_seq_len = 512
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# Use partial to fix some of convert_example_to_feature's arguments
train_trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, max_seq_len=max_seq_len, is_test=False)
test_trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, max_seq_len=max_seq_len, is_test=True)
# Convert the datasets into model-ready features
train_set = train_set.map(train_trans_func, lazy=False)
dev_set = dev_set.map(train_trans_func, lazy=False)
test_set = test_set.map(test_trans_func, lazy=False)
# Print the first few training samples
for idx, example in enumerate(train_set):
    if idx <= 3:
        print(example)
([101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102, 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 1) ([101, 2769, 2797, 3322, 696, 749, 8024, 2769, 2682, 2940, 702, 2797, 3322, 102, 2769, 2682, 743, 702, 3173, 2797, 3322, 8024, 3724, 2972, 5773, 102], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 1) ([101, 1920, 2157, 6230, 2533, 1961, 1962, 4692, 1408, 102, 1920, 2157, 6230, 2533, 6651, 4511, 1962, 4692, 1408, 8043, 102], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 0) ([101, 3724, 4904, 5682, 722, 4958, 4035, 4514, 1059, 7415, 102, 3724, 4904, 5682, 722, 4958
2.2.3 Building the DataLoader
Next we build a DataLoader to iterate over the data in batches for training. One issue remains: the samples within a batch must all have the same length, so we also need a batchify_fn that pads each input field and assembles the fields into batch form.
Figure 2.2 (omitted) illustrates how batchify_fn assembles two samples into a batch: each sample contains input_ids, token_type_ids, and label; the first Pad operation pads input_ids to a common length, the second Pad operation pads token_type_ids, and the final Stack operation stacks the labels. After batchify_fn, every field is a regular, rectangular array.
Because the test set has no labels, we define separate batchify_fn functions for the two data formats: train_batchify_fn handles the training and validation sets, and test_batchify_fn handles the test set.
# batchify_fn for the training (and validation) data
train_batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),         # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),    # token_type_ids
    Stack(dtype="int64")                                 # label
): [data for data in fn(samples)]

# batchify_fn for the test data
test_batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),         # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id)     # token_type_ids
): [data for data in fn(samples)]
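To see what these functions produce, here is a tiny illustration with two made-up samples (the token IDs are arbitrary):
samples = [
    ([101, 200, 300, 102], [0, 0, 0, 0], 1),
    ([101, 400, 102],      [0, 0, 0],    0),
]
input_ids, token_type_ids, labels = train_batchify_fn(samples)
print(input_ids)        # both ID sequences padded to the longest length in the batch
print(token_type_ids)   # token_type_ids padded with pad_token_type_id
print(labels)           # labels stacked into a single array: [1 0]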
With these in place, we can build the DataLoaders that iterate over the data in batches:
batch_size = 32
train_loader = paddle.io.DataLoader(dataset=train_set, batch_size=batch_size, collate_fn=train_batchify_fn, shuffle=True)
dev_loader = paddle.io.DataLoader(dataset=dev_set, batch_size=batch_size, collate_fn=train_batchify_fn, shuffle=False)
test_loader = paddle.io.DataLoader(dataset=test_set, batch_size=batch_size, collate_fn=test_batchify_fn, shuffle=False)
2.3 Model construction
The text matching model used here works as follows: the two sentences to be matched are concatenated into a single text sequence, the sequence is fed into BERT, and the output vector at the [CLS] position is passed to a linear layer that decides whether the two sentences are semantically similar. Since there are only two possible outcomes (similar or not similar), the text matching task is modeled as binary classification.
PaddleNLP already provides this sequence classification setup on top of BERT as BertForSequenceClassification, so we build the matching model by instantiating that class:
Note: running the code downloads the pre-trained BERT parameters; specifying bert-base-chinese loads the base version of BERT, which has roughly 110M parameters.
from paddlenlp.transformers import BertForSequenceClassification
model_name = "bert-base-chinese"
model = BertForSequenceClassification.from_pretrained(model_name, num_classes=2)
2.4 Training configuration
This section defines the components and resources used during training: the hyperparameters, the optimization algorithm that drives the training iterations, the evaluation metric, and so on. Since BERT has a large number of parameters, training on a GPU is recommended.
from paddlenlp.transformers import LinearDecayWithWarmup

# Hyperparameters
n_epochs = 3
batch_size = 128
max_seq_length = 256
n_classes = 2
learning_rate = 5e-5
warmup_proportion = 0.1
weight_decay = 0.01
eval_steps = 500
log_steps = 50
save_dir = "./checkpoints"

# Optimizer
num_training_steps = len(train_loader) * n_epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

# Evaluation metric
metric = paddle.metric.Accuracy()
2.5 Model training and evaluation
With data processing, model loading, and the training configuration in place, we can train the model. During training we evaluate on the validation set every eval_steps steps and keep the checkpoint that performs best so far, using paddle.metric.Accuracy as the evaluation metric. The training and evaluation code is shown below.
def evaluate(model, metric, data_loader):
    model.eval()
    # Reset the metric's accumulated state so this evaluation starts fresh
    metric.reset()
    for batch in data_loader:
        # Unpack the batch
        input_ids, segment_ids, labels = batch
        # Forward pass
        logits = model(input_ids, segment_ids)
        # Accumulate the accuracy metric
        correct = metric.compute(logits, labels.unsqueeze(axis=-1))
        metric.update(correct)
    accuracy = metric.accumulate()
    return accuracy

def train(model):
    global_step = 1
    best_acc = 0.
    for epoch in range(1, n_epochs + 1):
        model.train()
        for step, batch in enumerate(train_loader, start=1):
            # Unpack the batch
            input_ids, token_type_ids, labels = batch
            # Forward pass
            logits = model(input_ids, token_type_ids)
            loss = F.cross_entropy(input=logits, label=labels)
            # Print a training log every log_steps steps
            if global_step % log_steps == 0:
                print("[Train] global step {}/{}, epoch: {}, batch: {}, loss: {}".format(
                    global_step, num_training_steps, epoch, step, loss.item()))
            # Evaluate every eval_steps steps and keep the best checkpoint so far
            if global_step % eval_steps == 0:
                accuracy = evaluate(model, metric, dev_loader)
                print("[Evaluation] accuracy: {}".format(accuracy))
                if best_acc < accuracy:
                    print("best accuracy has been updated: from last best_acc {} --> new acc {}.".format(best_acc, accuracy))
                    best_acc = accuracy
                    if not os.path.exists(save_dir):
                        os.makedirs(save_dir)
                    save_path = os.path.join(save_dir, "best.pdparams")
                    paddle.save(model.state_dict(), save_path)
                model.train()
            # Parameter update
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
            global_step += 1
Now we can start the training; in a GPU environment, training for 3 epochs takes roughly 75 minutes.
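Training is started by calling the train function defined above:
train(model)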
2.6 Model testing
Finally, we run the model that performed best on the validation set against the test set. We first implement the test function; after testing finishes, it writes the results to the file given by test_save_path:
def test(model, ori_examples, data_loader, test_save_path):
    model.eval()
    # Reset the metric's accumulated state so this run starts fresh
    metric.reset()
    test_results = []
    for batch in data_loader:
        input_ids, segment_ids = batch
        logits = model(input_ids, segment_ids)
        predictions = paddle.argmax(logits, axis=-1)
        test_results.extend(predictions.tolist())
    with open(test_save_path, "w", encoding="utf-8") as f:
        for idx, result in enumerate(test_results):
            example = ori_examples[idx]
            example["label"] = result
            msg = str(example) + "\n"
            f.write(msg)
    print("the result of test_set has been saved to: {}.".format(test_save_path))
Next, we load the saved checkpoint and run it on the test set:
# Path of the saved checkpoint
model_path = "./checkpoints/best.pdparams"
test_save_path = "./test_results.txt"
state_dict = paddle.load(model_path)
test_examples = load_dataset("lcqmc", splits=["test"])
print(test_examples[0])
model = BertForSequenceClassification.from_pretrained(model_name, num_classes=2)
model.load_dict(state_dict)
test(model, test_examples, test_loader, test_save_path)
Output:
{'query': '誰有狂三這張高清的', 'title': '這張高清圖,誰有', 'label': ''}
{'query': '近期上映的電影', 'title': '近期上映的電影有哪些', 'label': ''}
The test results have been saved to "./test_results.txt". We can now print a few test samples to inspect the model's predictions directly.
test_ids = range(10)
# Load the saved test results
with open(test_save_path, "r", encoding="utf-8") as f:
    test_results = [line.strip() for line in f.readlines()]
# Print the samples selected by test_ids
for test_id in test_ids:
    print(test_results[test_id])
Output:
{'query': '誰有狂三這張高清的', 'title': '這張高清圖,誰有', 'label': 0}
{'query': '英雄聯(lián)盟什么英雄最好', 'title': '英雄聯(lián)盟最好英雄是什么', 'label': 1}
{'query': '這是什么意思,被蹭網(wǎng)嗎', 'title': '我也是醉了,這是什么意思', 'label': 0}
{'query': '現(xiàn)在有什么動(dòng)畫片好看呢?', 'title': '現(xiàn)在有什么好看的動(dòng)畫片嗎?', 'label': 1}
{'query': '請問晶達(dá)電子廠現(xiàn)在的工資待遇怎么樣要求有哪些', 'title': '三星電子廠工資待遇怎么樣啊', 'label': 0}
{'query': '文章真的愛姚笛嗎', 'title': '姚笛真的被文章干了嗎', 'label': 0}
{'query': '送自己做的閨蜜什么生日禮物好', 'title': '送閨蜜什么生日禮物好', 'label': 1}
{'query': '近期上映的電影', 'title': '近期上映的電影有哪些', 'label': 1}
{'query': '求英雄聯(lián)盟大神帶?', 'title': '英雄聯(lián)盟,求大神帶~', 'label': 1}
{'query': '如加上什么部首', 'title': '給東加上部首是什么字?', 'label': 0}
A label of 1 means the query and title texts are semantically similar, and 0 means they are not. As the results show, the fine-tuned BERT model judges whether two sentences are semantically similar quite accurately.