1. NLP Primer (4): Named Entity Recognition (NER)
This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).
Named entity recognition (NER) is a fundamental tool for information extraction, question answering, syntactic parsing, machine translation, and other applications, and it plays an important role in making NLP technology practical. In general, the NER task is to identify named entities in text belonging to three major categories (entities, times, and numbers) and seven subcategories (person names, organization names, place names, times, dates, currencies, and percentages).
As a simple example, running NER on the sentence “小明早上8點去學(xué)校上課。” (“Xiao Ming goes to school at 8 o'clock in the morning.”) should extract the following information:
Person: 小明 (Xiao Ming); Time: 早上8點 (8 a.m.); Location: 學(xué)校 (school).
This article introduces several tools for performing NER; later on, if the opportunity arises, we will try to implement NER ourselves with HMMs, CRFs, or deep learning.
First, let's look at how NLTK and Stanford NLP categorize named entities, as shown in the figure below:
In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents; LOCATION covers those as well as other places such as famous mountains and rivers. FACILITY usually refers to well-known monuments, man-made structures, and the like.
Below, we use two tools to perform NER: NLTK and Stanford NLP.
First, NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows:
FIFA was founded in 1904 to oversee international competition among
the national associations of Belgium, Denmark, France, Germany, the
Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich,
its membership now comprises 211 national associations. Member
countries must each also be members of one of the six regional
confederations into which the world is divided: Africa, Asia, Europe,
North & Central America and the Caribbean, Oceania, and South America.
The Python code that implements NER is as follows:
import re
import pandas as pd
import nltk

def parse_document(document):
    document = re.sub('\n', ' ', document)
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    document = document.strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences
# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""
# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))
# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)
The output is as follows:
Entity Name Entity Type
0 FIFA ORGANIZATION
1 Central America ORGANIZATION
2 Belgium GPE
3 Caribbean LOCATION
4 Asia GPE
5 France GPE
6 Oceania GPE
7 Germany GPE
8 South America GPE
9 Denmark GPE
10 Zürich GPE
11 Africa PERSON
12 Sweden GPE
13 Netherlands GPE
14 Spain GPE
15 Switzerland GPE
16 North GPE
17 Europe GPE
As you can see, NLTK handles this NER task reasonably well: it recognizes FIFA as an ORGANIZATION and Belgium and Asia as GPEs. There are also some less satisfying results, however: it labels Central America as an ORGANIZATION when it should be a GPE, and it labels Africa as a PERSON when it should also be a GPE.
Next, we try the Stanford NLP tools, specifically the Stanford NER tagger. Before using it, you need Java (typically a JDK) installed and added to your system path, and you need to download the English NER package stanford-ner-2018-10-16.zip (about 172 MB) from https://nlp.stanford.edu/software/CRF-NER.shtml . On my machine, for example, Java is located at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the unzipped Stanford NER folder is at E://stanford-ner-2018-10-16, as shown in the figure below:
The classifiers folder contains the following files:
They correspond to the following tag sets:
- 3 class: Location, Person, Organization
- 4 class: Location, Person, Organization, Misc
- 7 class: Location, Person, Organization, Money, Percent, Date, Time
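Switching between these tag sets only means pointing StanfordNERTagger at a different .ser.gz model file. Here is a minimal sketch, assuming the same E://stanford-ner-2018-10-16 install directory used in the code below; the 3class/4class file names are the ones usually shipped in the distribution, so check your own classifiers folder:

from nltk.tag import StanfordNERTagger

STANFORD_DIR = 'E://stanford-ner-2018-10-16'  # assumed install directory, as in the code below
# the three bundled models and the label sets they produce
MODELS = {
    '3class': 'english.all.3class.distsim.crf.ser.gz',    # Location, Person, Organization
    '4class': 'english.conll.4class.distsim.crf.ser.gz',  # + Misc
    '7class': 'english.muc.7class.distsim.crf.ser.gz',    # + Money, Percent, Date, Time
}

def make_tagger(which='7class'):
    # build a tagger for the chosen model; the jar path is the same for all three
    return StanfordNERTagger('%s/classifiers/%s' % (STANFORD_DIR, MODELS[which]),
                             path_to_jar='%s/stanford-ner.jar' % STANFORD_DIR)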
Stanford NER can be driven from Python; the complete code is as follows:
import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk
def parse_document(document):
    document = re.sub('\n', ' ', document)
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    document = document.strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences
# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')
# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]
# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        # get terms with NE tags
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
            temp_named_entity = (temp_entity_name, tag)  # get NE and its category
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)
The output is as follows:
Entity Name Entity Type
0 1904 DATE
1 Denmark LOCATION
2 Spain LOCATION
3 North & Central America ORGANIZATION
4 South America LOCATION
5 Belgium LOCATION
6 Zürich LOCATION
7 the Netherlands LOCATION
8 France LOCATION
9 Caribbean LOCATION
10 Sweden LOCATION
11 Oceania LOCATION
12 Asia LOCATION
13 FIFA ORGANIZATION
14 Europe LOCATION
15 Africa LOCATION
16 Switzerland LOCATION
17 Germany LOCATION
As you can see, Stanford NER does a fairly good job: it recognizes Africa as a LOCATION and 1904 as a DATE (which NLTK did not pick up), but it still misidentifies North & Central America as an ORGANIZATION.
Note that this does not mean Stanford NER necessarily outperforms NLTK's NER; the two differ in their target domains, training corpora, and algorithms, so choose the tool that fits your needs.
That is all for this part; later on, we will try to implement NER with HMMs, CRFs, or deep learning.
2. NLP Primer (5): Named Entity Recognition (NER) with Deep Learning
Preface
In the article NLP Primer (4): Named Entity Recognition (NER), I introduced two tools for implementing NER: NLTK and Stanford NLP. In this article, we will learn how to implement NER step by step with deep learning tools; stick with it to the end and you will definitely get something out of it.
OK, without further ado, let's get to it.
Almost all NLP work relies on a good corpus. The corpus used in this project is as follows (the file is train.txt, 42,000 lines in total; only the first 15 lines are shown here, and the corpus can be downloaded from the GitHub address at the end of the article):
played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
…
A quick description of the corpus structure: it has 42,000 lines, grouped in threes. In each group, the first line is an English sentence, the second line holds the part-of-speech tag of each word (for English POS tags, see NLP Primer (3): Lemmatization), and the third line holds the NER annotation; the exact meaning of the tags is explained later.
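To make the format concrete, here is a minimal, self-contained sketch (independent of the project code that follows) that reads such a file and turns every group of three lines into (word, pos, tag) triples; 'train.txt' is just a placeholder path, and the groups are assumed to run back to back with no blank lines, as in the sample above:

# minimal sketch of reading the three-lines-per-sentence corpus format
def read_corpus(path='train.txt'):
    with open(path, 'r') as f:
        lines = [line.strip() for line in f]
    sentences = []
    for i in range(0, len(lines) - 2, 3):
        words = lines[i].split()         # first line: the tokens of the sentence
        pos_tags = lines[i + 1].split()  # second line: the POS tag of each token
        ner_tags = lines[i + 2].split()  # third line: the NER tag of each token
        sentences.append(list(zip(words, pos_tags, ner_tags)))
    return sentences

# e.g. the second group above becomes:
# [('American', 'NNP', 'B-MISC'), ('League', 'NNP', 'I-MISC')]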
Our NER project is named DL_4_NER and has the following structure:
Each file in the project plays the following role:
- utils.py: project configuration and data loading
- data_processing.py: data exploration
- Bi_LSTM_Model_training.py: model creation and training
- Bi_LSTM_Model_predict.py: NER prediction on new sentences
Next, I will walk through the project step by step, following the code files; once all the steps are covered the project is complete, and you will know how to implement NER with deep learning.
Let's begin!
Project Setup
The first step is project configuration and data loading, implemented in utils.py; the complete code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# basic settings for DL_4_NER Project
BASE_DIR = "F://NERSystem"
CORPUS_PATH = "%s/train.txt" % BASE_DIR
KERAS_MODEL_SAVE_PATH = '%s/Bi-LSTM-4-NER.h5' % BASE_DIR
WORD_DICTIONARY_PATH = '%s/word_dictionary.pk' % BASE_DIR
INVERSE_WORD_DICTIONARY_PATH = '%s/inverse_word_dictionary.pk' % BASE_DIR
LABEL_DICTIONARY_PATH = '%s/label_dictionary.pk' % BASE_DIR
OUTPUT_DICTIONARY_PATH = '%s/output_dictionary.pk' % BASE_DIR

CONSTANTS = [
    KERAS_MODEL_SAVE_PATH,
    INVERSE_WORD_DICTIONARY_PATH,
    WORD_DICTIONARY_PATH,
    LABEL_DICTIONARY_PATH,
    OUTPUT_DICTIONARY_PATH
]
# load data from the corpus into a pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)

    # transform the data into matrix format for the neural network
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i-1]:index[i]]
        sentence_no = np.array([i]*len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)
    input_data = pd.DataFrame(np.concatenate([item for item in input_data]),
                              columns=['word', 'pos', 'tag', 'sent_no'])

    return input_data
In this code, we first set the corpus path CORPUS_PATH, the Keras model save path KERAS_MODEL_SAVE_PATH, and the save paths (as pickle files) of the four dictionaries used throughout the project: WORD_DICTIONARY_PATH, INVERSE_WORD_DICTIONARY_PATH, LABEL_DICTIONARY_PATH, and OUTPUT_DICTIONARY_PATH. Then the load_data() function loads the corpus into a pandas DataFrame; the first 30 rows of that data frame look like this:
word pos tag sent_no
0 played VBD O 1
1 on IN O 1
2 Monday NNP O 1
3 ( ( O 1
4 home NN O 1
5 team NN O 1
6 in IN O 1
7 CAPS NNP O 1
8 ) ) O 1
9 : : O 1
10 American NNP B-MISC 2
11 League NNP I-MISC 2
12 Cleveland NNP B-ORG 3
13 2 CD O 3
14 DETROIT NNP B-ORG 3
15 1 CD O 3
16 BALTIMORE VB B-ORG 4
17 12 CD O 4
18 Oakland NNP B-ORG 4
19 11 CD O 4
20 ( ( O 4
21 10 CD O 4
22 innings NN O 4
23 ) ) O 4
24 TORONTO TO B-ORG 5
25 5 CD O 5
26 Minnesota NNP B-ORG 5
27 3 CD O 5
28 Milwaukee NNP B-ORG 6
29 3 CD O 6
In this data frame, the word column holds the words of the corpus, pos holds each word's part-of-speech tag, tag holds the NER annotation, and sent_no indicates which sentence the word belongs to.
Data Exploration
The second step is data exploration, i.e., a review of the input data (input_data); the complete code (data_processing.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
import matplotlib as mpl

from utils import BASE_DIR, CONSTANTS, load_data

# set a CJK-capable font for matplotlib (only needed for non-ASCII plot labels)
mpl.rcParams['font.sans-serif'] = ['SimHei']

# data review
def data_review():
    # load the data
    input_data = load_data()

    # basic review of the data
    sent_num = input_data['sent_no'].astype(int).max()
    print("There are %s sentences in total.\n" % sent_num)
    vocabulary = input_data['word'].unique()
    print("There are %d distinct words in total." % len(vocabulary))
    print("The first 10 words are: %s.\n" % vocabulary[:10])
    pos_arr = input_data['pos'].unique()
    print("POS tag list: %s.\n" % pos_arr)
    ner_tag_arr = input_data['tag'].unique()
    print("NER tag list: %s.\n" % ner_tag_arr)
    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print("Sentence length/frequency dictionary:\n%s." % dict(Counter(sent_len_list)))

    # plot the sentence length frequency distribution
    sort_sent_len_dist = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dist]
    sent_count_data = [item[1] for item in sort_sent_len_dist]
    plt.bar(sent_no_data, sent_count_data)
    plt.title("Sentence length frequency distribution")
    plt.xlabel("Sentence length")
    plt.ylabel("Frequency")
    plt.savefig("%s/sentence_length_distribution.png" % BASE_DIR)
    plt.close()

    # cumulative distribution function (CDF) of sentence length
    sent_pentage_list = [(count/sent_num) for count in accumulate(sent_count_data)]

    # find the sentence length at the chosen quantile
    quantile = 0.9992
    #print(list(sent_pentage_list))
    for length, per in zip(sent_no_data, sent_pentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print("\nSentence length at quantile %s: %d." % (quantile, index))

    # plot the CDF
    plt.plot(sent_no_data, sent_pentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("CDF of sentence length")
    plt.xlabel("Sentence length")
    plt.ylabel("Cumulative frequency")
    plt.savefig("%s/sentence_length_cdf.png" % BASE_DIR)
    plt.close()

# data processing
def data_processing():
    # load the data
    input_data = load_data()

    # label set and vocabulary
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    # build the dictionaries
    word_dictionary = {word: i+1 for i, word in enumerate(vocabulary)}
    inverse_word_dictionary = {i+1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i+1 for i, label in enumerate(labels)}
    output_dictionary = {i+1: labels for i, labels in enumerate(labels)}

    dict_list = [word_dictionary, inverse_word_dictionary, label_dictionary, output_dictionary]

    # save them as pickle files
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)

#data_review()
Calling data_review() produces the following output:
There are 13998 sentences in total.

There are 24339 distinct words in total.
The first 10 words are: ['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':'].

POS tag list: ['VBD' 'IN' 'NNP' '(' 'NN' ')' ':' 'CD' 'VB' 'TO' 'NNS' ',' 'VBP' 'VBZ'
 '.' 'VBG' 'PRP$' 'JJ' 'CC' 'JJS' 'RB' 'DT' 'VBN' '"' 'PRP' 'WDT' 'WRB'
 'MD' 'WP' 'POS' 'JJR' 'WP$' 'RP' 'NNPS' 'RBS' 'FW' '$' 'RBR' 'EX' "''"
 'PDT' 'UH' 'SYM' 'LS' 'NN|SYM'].

NER tag list: ['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC'
 'sO'].

Sentence length/frequency dictionary:
{1: 177, 2: 1141, 3: 620, 4: 794, 5: 769, 6: 639, 7: 999, 8: 977, 9: 841, 10: 501, 11: 395, 12: 316, 13: 339, 14: 291, 15: 275, 16: 225, 17: 229, 18: 212, 19: 197, 20: 221, 21: 228, 22: 221, 23: 230, 24: 210, 25: 207, 26: 224, 27: 188, 28: 199, 29: 214, 30: 183, 31: 202, 32: 167, 33: 167, 34: 141, 35: 130, 36: 119, 37: 105, 38: 112, 39: 98, 40: 78, 41: 74, 42: 63, 43: 51, 44: 42, 45: 39, 46: 19, 47: 22, 48: 19, 49: 15, 50: 16, 51: 8, 52: 9, 53: 5, 54: 4, 55: 9, 56: 2, 57: 2, 58: 2, 59: 2, 60: 3, 62: 2, 66: 1, 67: 1, 69: 1, 71: 1, 72: 1, 78: 1, 80: 1, 113: 1, 124: 1}.
Sentence length at quantile 0.9992: 60.
The corpus contains 13,998 sentences, two fewer than the expected 42000/3 = 14000, and 24,339 distinct words, which is a fairly sizeable vocabulary; note that the words are kept exactly as they appear in the corpus, with no preprocessing (something that could be improved later). For the part-of-speech tags, see the article NLP Primer (3): Lemmatization. What matters here is the NER tag list ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'B-LOC', 'I-PER', 'I-LOC', 'sO']: the project distinguishes four entity types, PER (person), LOC (location), ORG (organization), and MISC, where B marks the first token of an entity, I marks a token inside an entity, O marks tokens that are not part of any named entity, and sO marks special single tokens.
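To make the tagging scheme concrete, here is a minimal sketch (not part of the project files) of how a BIO-tagged sequence is grouped back into entities; the example tokens are made up for illustration:

# minimal sketch of decoding BIO tags into (entity, type) pairs
def decode_bio(tokens, tags):
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):                       # a new entity starts here
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current_tokens:  # continuation of the current entity
            current_tokens.append(token)
        else:                                          # 'O' (or anything else) closes the entity
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities

print(decode_bio(['American', 'League', 'beat', 'Cleveland'],
                 ['B-MISC', 'I-MISC', 'O', 'B-ORG']))
# [('American League', 'MISC'), ('Cleveland', 'ORG')]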
Next, let's look at the sentence lengths, which will guide the padding length used later when building the model. The histogram of sentence lengths and their frequencies is shown below:
As you can see, sentence lengths are essentially all below 60, which is also visible in the length/frequency dictionary printed above. So how do we pick a standard padding length for the model? We use a quantile of the cumulative distribution of sentence lengths: here we choose the 0.9992 quantile, which corresponds to a sentence length of 60, as the cumulative distribution function (CDF) plot produced by data_review() shows.
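For reference, roughly the same cutoff can be computed directly from the per-sentence lengths without plotting; a minimal sketch, assuming the load_data() function from utils.py above:

import numpy as np
from utils import load_data  # the project module shown above

input_data = load_data()
sent_lengths = input_data.groupby('sent_no')['word'].count().values
# sentence length at the 0.9992 quantile, rounded up to a whole number of tokens
maxlen = int(np.ceil(np.quantile(sent_lengths, 0.9992)))
print(maxlen)  # about 60 on this corpus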
Next is the data-processing function data_processing(), whose main job is to build the word and label dictionaries and save them as pickle files for later reuse.
Modeling
In the third step, we build and train a Bi-LSTM model; the complete Python code (Bi_LSTM_Model_training.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
import pandas as pd
from utils import BASE_DIR, CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed
# prepare the input data for the model
def input_data_for_model(input_shape):
    # load the data
    input_data = load_data()
    # build and save the dictionaries
    data_processing()

    # load the dictionaries
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)

    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # turn each sentence into a list of (word, pos, label) triples
    aggregate_function = lambda input: [(word, pos, label) for word, pos, label in
                                        zip(input['word'].values.tolist(),
                                            input['pos'].values.tolist(),
                                            input['tag'].values.tolist())]
    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]

    # map words and labels to integer ids and pad to a fixed length
    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary
# define the deep learning model: Bi-LSTM
def create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_units, activation=activation,
                                 return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
# model training
def model_train():
    # split the data into training and test sets with a 9:1 ratio
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x)*0.9)
    train_x, train_y = x[0:train_end], np.array(y[0:train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    # model hyperparameters
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    # train the model
    lstm_model = create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # save the model
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='%s/LSTM_model.png' % BASE_DIR)

    # evaluate on the test set
    N = test_x.shape[0]  # number of test samples
    avg_accuracy = 0     # average prediction accuracy
    for start, end in zip(range(0, N, 1), range(1, N+1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        input_sequences, output_sequences = [], []
        for i in range(0, len(y_predict[0])):
            output_sequences.append(np.argmax(y_predict[0][i]))
            input_sequences.append(np.argmax(test_y[start][i]))
        eval = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (eval[0], eval[1] * 100))
        avg_accuracy += eval[1]
        output_sequences = ' '.join([output_dictionary[key] for key in output_sequences if key != 0]).split()
        input_sequences = ' '.join([output_dictionary[key] for key in input_sequences if key != 0]).split()
        output_input_comparison = pd.DataFrame([sentence, output_sequences, input_sequences]).T
        print(output_input_comparison.dropna())
        print('#' * 80)

    avg_accuracy /= N
    print("Average prediction accuracy on the test set: %.2f%%." % (avg_accuracy * 100))

model_train()
In the code above, input_data_for_model() prepares the data that goes into the model; its parameter input_shape is the length to which sentences are padded. create_Bi_LSTM() then builds the Bi-LSTM model, sketched in the figure below:
Finally, the model is trained on the prepared data: the original data is split 9:1 into training and test sets, and training runs for 10 epochs.
Model Training
Running the training code above for 10 epochs takes roughly 500 s. Accuracy on the training set exceeds 99%, and the average accuracy on the test set is above 95%. Here are the predictions on the last few test samples:
...... (earlier output omitted)
Test Accuracy: loss = 0.000986 accuracy = 100.00%
0 1 2
0 Cardiff B-ORG B-ORG
1 1 O O
2 Brighton B-ORG B-ORG
3 0 O O
################################################################################
1/1 [==============================] - 0s 10ms/step
Test Accuracy: loss = 0.000274 accuracy = 100.00%
0 1 2
0 Carlisle B-ORG B-ORG
1 0 O O
2 Hull B-ORG B-ORG
3 0 O O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.000479 accuracy = 100.00%
0 1 2
0 Chester B-ORG B-ORG
1 1 O O
2 Cambridge B-ORG B-ORG
3 1 O O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.003092 accuracy = 100.00%
0 1 2
0 Darlington B-ORG B-ORG
1 4 O O
2 Swansea B-ORG B-ORG
3 1 O O
################################################################################
1/1 [==============================] - 0s 8ms/step
Test Accuracy: loss = 0.000705 accuracy = 100.00%
0 1 2
0 Exeter B-ORG B-ORG
1 2 O O
2 Scarborough B-ORG B-ORG
3 2 O O
################################################################################
Average prediction accuracy on the test set: 95.55%.
The model's recognition on the original data is quite acceptable.
After training the model, BASE_DIR contains the following files:
Model Prediction
Finally comes what may be the most exciting part of the whole project: testing the model's recognition on new data. The complete Python code for predicting on new sentences (Bi_LSTM_Model_predict.py) is as follows:
# -*- coding: utf-8 -*-
# Named entity recognition for new data

# Import the necessary modules
import pickle
import numpy as np
from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize

# load the dictionaries
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # preprocess the input sentence
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # load the trained model
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # predict
    y_predict = lstm_model.predict(x)
    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))
    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # drop tokens whose NER tag is O
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # print the model's NER results
    print("NER results:")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i+1
                while end <= len(ner_reg_list)-1 and ner_reg_list[end][1].startswith('I'):
                    end += 1
                ner_type = item[1].split('-')[1]
                ner_type_dict = {'PER': 'PERSON: ',
                                 'LOC': 'LOCATION: ',
                                 'ORG': 'ORGANIZATION: ',
                                 'MISC': 'MISC: '
                                 }
                print(ner_type_dict[ner_type],
                      ' '.join([item[0] for item in ner_reg_list[i:end]]))
    else:
        print("The model did not recognize any named entities.")
except KeyError as err:
    print("The sentence contains a word that is not in the vocabulary, please try another sentence!")
    print("Out-of-vocabulary word: %s." % err)
The output is:
['New', 'York', 'is', 'the', 'biggest', 'city', 'in', 'America', '.']
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: New York
LOCATION: America
Next, let's test three more sentences that I made up myself:
Input:
sent = 'James is a world famous actor, whose home is in London.'
Output:
['James', 'is', 'a', 'world', 'famous', 'actor', ',', 'whose', 'home', 'is', 'in', 'London', '.']
['B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
PERSON: James
LOCATION: London
Input:
sent = 'Oxford is in England, Jack is from here.'
Output:
['Oxford', 'is', 'in', 'England', ',', 'Jack', 'is', 'from', 'here', '.']
['B-PER', 'O', 'O', 'B-LOC', 'O', 'B-PER', 'O', 'O', 'O', 'O']
NER results:
PERSON: Oxford
LOCATION: England
PERSON: Jack
Input:
sent = 'I love Shanghai.'
Output:
['I', 'love', 'Shanghai', '.']
['O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: Shanghai
In the examples above, only Oxford comes out wrong: the model tags it as PERSON, whereas it should be a LOCATION (the city) or, if read as the university, an ORGANIZATION.
Next are three sentences taken from CNN and Wikipedia:
Input:
sent = "the US runs the risk of a military defeat by China or Russia"
Output:
['the', 'US', 'runs', 'the', 'risk', 'of', 'a', 'military', 'defeat', 'by', 'China', 'or', 'Russia']
['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC']
NER results:
LOCATION: US
LOCATION: China
LOCATION: Russia
Input:
sent = "Home to the headquarters of the United Nations, New York is an important center for international diplomacy."
Output:
['Home', 'to', 'the', 'headquarters', 'of', 'the', 'United', 'Nations', ',', 'New', 'York', 'is', 'an', 'important', 'center', 'for', 'international', 'diplomacy', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
NER results:
ORGANIZATION: United Nations
LOCATION: New York
Input:
sent = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."
Output:
['The', 'United', 'States', 'is', 'a', 'founding', 'member', 'of', 'the', 'United', 'Nations', ',', 'World', 'Bank', ',', 'International', 'Monetary', 'Fund', '.']
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O']
NER results:
LOCATION: United States
ORGANIZATION: United Nations
ORGANIZATION: World Bank
ORGANIZATION: International Monetary Fund
All three of these examples are recognized correctly.
Summary
That about wraps up the project, so it is worth taking stock.
First, the strengths. The project lets you implement NER step by step: apart from building the corpus, you now know the steps involved in creating an NER system, and you have a much better feel for deep learning models and how to apply them. The benefits are obvious. Of course, in real work, preparing the corpus is what takes the most time, often 90% or more of it; only with a good corpus can you really get started.
Now the weaknesses. First, the corpus is not very large; about 14,000 sentences is workable, but the project does no text preprocessing, so some inflected word forms may never make it into the vocabulary. Second, there is no handling of unseen words: as soon as a sentence contains a word outside the vocabulary, the model cannot process it, which is something to improve later (one common workaround is sketched below). Third, the padding length is 60, so the part of an input sentence beyond 60 tokens cannot be recognized.
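To illustrate the second point, one common workaround (not implemented in this project) is to reserve an extra <UNK> index and map every out-of-vocabulary word to it at encoding time; a minimal sketch, assuming the word_dictionary built in data_processing.py:

# minimal sketch of falling back to a reserved <UNK> id for out-of-vocabulary words;
# word_dictionary is the {word: id} mapping built in data_processing.py
UNK_ID = len(word_dictionary) + 1  # one id past the known vocabulary

def encode_sentence(tokens, word_dictionary, unk_id=UNK_ID):
    # unknown words fall back to unk_id instead of raising a KeyError
    return [word_dictionary.get(token, unk_id) for token in tokens]

# the Embedding layer then needs room for the extra id, e.g.
# Embedding(input_dim=vocab_size + 2, output_dim=output_dim, input_length=input_shape, mask_zero=True)

For this to actually help, the model also has to see the <UNK> id during training, for example by replacing rare words with it, so that the corresponding embedding gets trained.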
So there is still more work to do; building a Chinese NER system is also worth considering.
The project has been uploaded to GitHub at https://github.com/percent4/DL_4_NER . You are welcome to use it as a reference.
References
Book: Applied Natural Language Processing with Python, Taweh Beysolow II
Website: https://github.com/Apress/applied-natural-language-processing-w-python
Website: NLP Primer (4): Named Entity Recognition (NER)