1. NLP Primer (4): Named Entity Recognition (NER)
This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).
Named entity recognition (NER) is a fundamental tool for information extraction, question answering, syntactic parsing, machine translation, and other applications, and it plays an important role in making NLP technology practical. In general, the NER task is to identify named entities in text belonging to three major categories (entities, times, and numbers) and seven subcategories (person names, organization names, place names, times, dates, currencies, and percentages).
As a simple example, running NER on the sentence “小明早上8點去學(xué)校上課。” (“Xiao Ming goes to school at 8 o'clock in the morning.”) should extract the following information:
Person: 小明 (Xiao Ming); Time: 早上8點 (8 a.m.); Location: 學(xué)校 (school).
This article introduces several tools for performing NER; later on, if the opportunity arises, we will try to implement NER ourselves with HMMs, CRFs, or deep learning.
First, let's look at how NLTK and Stanford NLP categorize named entities, as shown in the figure below:
In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents; LOCATION covers those as well as other places such as famous mountains and rivers. FACILITY usually refers to well-known monuments, man-made structures, and the like.
Below, we use two tools to perform NER: NLTK and Stanford NLP.
First, NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows:
FIFA was founded in 1904 to oversee international competition among
the national associations of Belgium, Denmark, France, Germany, the
Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich,
its membership now comprises 211 national associations. Member
countries must each also be members of one of the six regional
confederations into which the world is divided: Africa, Asia, Europe,
North & Central America and the Caribbean, Oceania, and South America.
The Python code that implements NER is as follows:
import re
import pandas as pd
import nltk

def parse_document(document):
    document = re.sub('\n', ' ', document)
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    document = document.strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences
# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""
# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))
# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)
The output is as follows:
Entity Name Entity Type
0 FIFA ORGANIZATION
1 Central America ORGANIZATION
2 Belgium GPE
3 Caribbean LOCATION
4 Asia GPE
5 France GPE
6 Oceania GPE
7 Germany GPE
8 South America GPE
9 Denmark GPE
10 Zürich GPE
11 Africa PERSON
12 Sweden GPE
13 Netherlands GPE
14 Spain GPE
15 Switzerland GPE
16 North GPE
17 Europe GPE
As you can see, NLTK handles this NER task reasonably well: it recognizes FIFA as an ORGANIZATION and Belgium and Asia as GPEs. There are also some less satisfying results, however: it labels Central America as an ORGANIZATION when it should be a GPE, and it labels Africa as a PERSON when it should also be a GPE.
Next, we try the Stanford NLP tools, specifically the Stanford NER tagger. Before using it, you need Java (typically a JDK) installed and added to your system path, and you need to download the English NER package stanford-ner-2018-10-16.zip (about 172 MB) from https://nlp.stanford.edu/software/CRF-NER.shtml . On my machine, for example, Java is located at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the unzipped Stanford NER folder is at E://stanford-ner-2018-10-16, as shown in the figure below:
The classifiers folder contains the following files:
They correspond to the following tag sets:
- 3 class: Location, Person, Organization
- 4 class: Location, Person, Organization, Misc
- 7 class: Location, Person, Organization, Money, Percent, Date, Time
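Switching between these tag sets only means pointing StanfordNERTagger at a different .ser.gz model file. Here is a minimal sketch, assuming the same E://stanford-ner-2018-10-16 install directory used in the code below; the 3class/4class file names are the ones usually shipped in the distribution, so check your own classifiers folder:

from nltk.tag import StanfordNERTagger

STANFORD_DIR = 'E://stanford-ner-2018-10-16'  # assumed install directory, as in the code below
# the three bundled models and the label sets they produce
MODELS = {
    '3class': 'english.all.3class.distsim.crf.ser.gz',    # Location, Person, Organization
    '4class': 'english.conll.4class.distsim.crf.ser.gz',  # + Misc
    '7class': 'english.muc.7class.distsim.crf.ser.gz',    # + Money, Percent, Date, Time
}

def make_tagger(which='7class'):
    # build a tagger for the chosen model; the jar path is the same for all three
    return StanfordNERTagger('%s/classifiers/%s' % (STANFORD_DIR, MODELS[which]),
                             path_to_jar='%s/stanford-ner.jar' % STANFORD_DIR)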
Stanford NER can be driven from Python; the complete code is as follows:
import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk
def parse_document(document):
    document = re.sub('\n', ' ', document)
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    document = document.strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences
# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')
# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]
# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        # get terms with NE tags
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
            temp_named_entity = (temp_entity_name, tag)  # get NE and its category
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)
The output is as follows:
Entity Name Entity Type
0 1904 DATE
1 Denmark LOCATION
2 Spain LOCATION
3 North & Central America ORGANIZATION
4 South America LOCATION
5 Belgium LOCATION
6 Zürich LOCATION
7 the Netherlands LOCATION
8 France LOCATION
9 Caribbean LOCATION
10 Sweden LOCATION
11 Oceania LOCATION
12 Asia LOCATION
13 FIFA ORGANIZATION
14 Europe LOCATION
15 Africa LOCATION
16 Switzerland LOCATION
17 Germany LOCATION
As you can see, Stanford NER does a fairly good job: it recognizes Africa as a LOCATION and 1904 as a DATE (which NLTK did not pick up), but it still misidentifies North & Central America as an ORGANIZATION.
Note that this does not mean Stanford NER necessarily outperforms NLTK's NER; the two differ in their target domains, training corpora, and algorithms, so choose the tool that fits your needs.
That is all for this part; later on, we will try to implement NER with HMMs, CRFs, or deep learning.
2. NLP Primer (5): Named Entity Recognition (NER) with Deep Learning
Preface
In the article NLP Primer (4): Named Entity Recognition (NER), I introduced two tools for implementing NER: NLTK and Stanford NLP. In this article, we will learn how to implement NER step by step with deep learning tools; stick with it to the end and you will definitely get something out of it.
OK, without further ado, let's get to it.
Almost all NLP work relies on a good corpus. The corpus used in this project is as follows (the file is train.txt, 42,000 lines in total; only the first 15 lines are shown here, and the corpus can be downloaded from the GitHub address at the end of the article):
played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
…
A quick description of the corpus structure: it has 42,000 lines, grouped in threes. In each group, the first line is an English sentence, the second line holds the part-of-speech tag of each word (for English POS tags, see NLP Primer (3): Lemmatization), and the third line holds the NER annotation; the exact meaning of the tags is explained later.
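To make the format concrete, here is a minimal, self-contained sketch (independent of the project code that follows) that reads such a file and turns every group of three lines into (word, pos, tag) triples; 'train.txt' is just a placeholder path, and the groups are assumed to run back to back with no blank lines, as in the sample above:

# minimal sketch of reading the three-lines-per-sentence corpus format
def read_corpus(path='train.txt'):
    with open(path, 'r') as f:
        lines = [line.strip() for line in f]
    sentences = []
    for i in range(0, len(lines) - 2, 3):
        words = lines[i].split()         # first line: the tokens of the sentence
        pos_tags = lines[i + 1].split()  # second line: the POS tag of each token
        ner_tags = lines[i + 2].split()  # third line: the NER tag of each token
        sentences.append(list(zip(words, pos_tags, ner_tags)))
    return sentences

# e.g. the second group above becomes:
# [('American', 'NNP', 'B-MISC'), ('League', 'NNP', 'I-MISC')]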
Our NER project is named DL_4_NER and has the following structure:
Each file in the project plays the following role:
- utils.py: project configuration and data loading
- data_processing.py: data exploration
- Bi_LSTM_Model_training.py: model creation and training
- Bi_LSTM_Model_predict.py: NER prediction on new sentences
Next, I will walk through the project step by step, following the code files; once all the steps are covered the project is complete, and you will know how to implement NER with deep learning.
Let's begin!
Project Setup
The first step is project configuration and data loading, implemented in utils.py; the complete code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# basic settings for DL_4_NER Project
BASE_DIR = "F://NERSystem"
CORPUS_PATH = "%s/train.txt" % BASE_DIR
KERAS_MODEL_SAVE_PATH = '%s/Bi-LSTM-4-NER.h5' % BASE_DIR
WORD_DICTIONARY_PATH = '%s/word_dictionary.pk' % BASE_DIR
INVERSE_WORD_DICTIONARY_PATH = '%s/inverse_word_dictionary.pk' % BASE_DIR
LABEL_DICTIONARY_PATH = '%s/label_dictionary.pk' % BASE_DIR
OUTPUT_DICTIONARY_PATH = '%s/output_dictionary.pk' % BASE_DIR

CONSTANTS = [
    KERAS_MODEL_SAVE_PATH,
    INVERSE_WORD_DICTIONARY_PATH,
    WORD_DICTIONARY_PATH,
    LABEL_DICTIONARY_PATH,
    OUTPUT_DICTIONARY_PATH
]
# load data from the corpus into a pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)

    # transform the data into matrix format for the neural network
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i-1]:index[i]]
        sentence_no = np.array([i]*len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)
    input_data = pd.DataFrame(np.concatenate([item for item in input_data]),
                              columns=['word', 'pos', 'tag', 'sent_no'])

    return input_data
In this code, we first set the corpus path CORPUS_PATH, the Keras model save path KERAS_MODEL_SAVE_PATH, and the save paths (as pickle files) of the four dictionaries used throughout the project: WORD_DICTIONARY_PATH, INVERSE_WORD_DICTIONARY_PATH, LABEL_DICTIONARY_PATH, and OUTPUT_DICTIONARY_PATH. Then the load_data() function loads the corpus into a pandas DataFrame; the first 30 rows of that data frame look like this:
word pos tag sent_no
0 played VBD O 1
1 on IN O 1
2 Monday NNP O 1
3 ( ( O 1
4 home NN O 1
5 team NN O 1
6 in IN O 1
7 CAPS NNP O 1
8 ) ) O 1
9 : : O 1
10 American NNP B-MISC 2
11 League NNP I-MISC 2
12 Cleveland NNP B-ORG 3
13 2 CD O 3
14 DETROIT NNP B-ORG 3
15 1 CD O 3
16 BALTIMORE VB B-ORG 4
17 12 CD O 4
18 Oakland NNP B-ORG 4
19 11 CD O 4
20 ( ( O 4
21 10 CD O 4
22 innings NN O 4
23 ) ) O 4
24 TORONTO TO B-ORG 5
25 5 CD O 5
26 Minnesota NNP B-ORG 5
27 3 CD O 5
28 Milwaukee NNP B-ORG 6
29 3 CD O 6
In this data frame, the word column holds the words of the corpus, pos holds each word's part-of-speech tag, tag holds the NER annotation, and sent_no indicates which sentence the word belongs to.
Data Exploration
The second step is data exploration, i.e., a review of the input data (input_data); the complete code (data_processing.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
import matplotlib as mpl

from utils import BASE_DIR, CONSTANTS, load_data

# set a CJK-capable font for matplotlib (only needed for non-ASCII plot labels)
mpl.rcParams['font.sans-serif'] = ['SimHei']

# data review
def data_review():
    # load the data
    input_data = load_data()

    # basic review of the data
    sent_num = input_data['sent_no'].astype(int).max()
    print("There are %s sentences in total.\n" % sent_num)
    vocabulary = input_data['word'].unique()
    print("There are %d distinct words in total." % len(vocabulary))
    print("The first 10 words are: %s.\n" % vocabulary[:10])
    pos_arr = input_data['pos'].unique()
    print("POS tag list: %s.\n" % pos_arr)
    ner_tag_arr = input_data['tag'].unique()
    print("NER tag list: %s.\n" % ner_tag_arr)
    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print("Sentence length/frequency dictionary:\n%s." % dict(Counter(sent_len_list)))

    # plot the sentence length frequency distribution
    sort_sent_len_dist = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dist]
    sent_count_data = [item[1] for item in sort_sent_len_dist]
    plt.bar(sent_no_data, sent_count_data)
    plt.title("Sentence length frequency distribution")
    plt.xlabel("Sentence length")
    plt.ylabel("Frequency")
    plt.savefig("%s/sentence_length_distribution.png" % BASE_DIR)
    plt.close()

    # cumulative distribution function (CDF) of sentence length
    sent_pentage_list = [(count/sent_num) for count in accumulate(sent_count_data)]

    # find the sentence length at the chosen quantile
    quantile = 0.9992
    #print(list(sent_pentage_list))
    for length, per in zip(sent_no_data, sent_pentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print("\nSentence length at quantile %s: %d." % (quantile, index))

    # plot the CDF
    plt.plot(sent_no_data, sent_pentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("CDF of sentence length")
    plt.xlabel("Sentence length")
    plt.ylabel("Cumulative frequency")
    plt.savefig("%s/sentence_length_cdf.png" % BASE_DIR)
    plt.close()

# data processing
def data_processing():
    # load the data
    input_data = load_data()

    # label set and vocabulary
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    # build the dictionaries
    word_dictionary = {word: i+1 for i, word in enumerate(vocabulary)}
    inverse_word_dictionary = {i+1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i+1 for i, label in enumerate(labels)}
    output_dictionary = {i+1: labels for i, labels in enumerate(labels)}

    dict_list = [word_dictionary, inverse_word_dictionary, label_dictionary, output_dictionary]

    # save them as pickle files
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)

#data_review()
Calling data_review() produces the following output:
There are 13998 sentences in total.

There are 24339 distinct words in total.
The first 10 words are: ['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':'].

POS tag list: ['VBD' 'IN' 'NNP' '(' 'NN' ')' ':' 'CD' 'VB' 'TO' 'NNS' ',' 'VBP' 'VBZ'
 '.' 'VBG' 'PRP$' 'JJ' 'CC' 'JJS' 'RB' 'DT' 'VBN' '"' 'PRP' 'WDT' 'WRB'
 'MD' 'WP' 'POS' 'JJR' 'WP$' 'RP' 'NNPS' 'RBS' 'FW' '$' 'RBR' 'EX' "''"
 'PDT' 'UH' 'SYM' 'LS' 'NN|SYM'].

NER tag list: ['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC'
 'sO'].

Sentence length/frequency dictionary:
{1: 177, 2: 1141, 3: 620, 4: 794, 5: 769, 6: 639, 7: 999, 8: 977, 9: 841, 10: 501, 11: 395, 12: 316, 13: 339, 14: 291, 15: 275, 16: 225, 17: 229, 18: 212, 19: 197, 20: 221, 21: 228, 22: 221, 23: 230, 24: 210, 25: 207, 26: 224, 27: 188, 28: 199, 29: 214, 30: 183, 31: 202, 32: 167, 33: 167, 34: 141, 35: 130, 36: 119, 37: 105, 38: 112, 39: 98, 40: 78, 41: 74, 42: 63, 43: 51, 44: 42, 45: 39, 46: 19, 47: 22, 48: 19, 49: 15, 50: 16, 51: 8, 52: 9, 53: 5, 54: 4, 55: 9, 56: 2, 57: 2, 58: 2, 59: 2, 60: 3, 62: 2, 66: 1, 67: 1, 69: 1, 71: 1, 72: 1, 78: 1, 80: 1, 113: 1, 124: 1}.
Sentence length at quantile 0.9992: 60.
The corpus contains 13,998 sentences, two fewer than the expected 42000/3 = 14000, and 24,339 distinct words, which is a fairly sizeable vocabulary; note that the words are kept exactly as they appear in the corpus, with no preprocessing (something that could be improved later). For the part-of-speech tags, see the article NLP Primer (3): Lemmatization. What matters here is the NER tag list ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'B-LOC', 'I-PER', 'I-LOC', 'sO']: the project distinguishes four entity types, PER (person), LOC (location), ORG (organization), and MISC, where B marks the first token of an entity, I marks a token inside an entity, O marks tokens that are not part of any named entity, and sO marks special single tokens.
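To make the tagging scheme concrete, here is a minimal sketch (not part of the project files) of how a BIO-tagged sequence is grouped back into entities; the example tokens are made up for illustration:

# minimal sketch of decoding BIO tags into (entity, type) pairs
def decode_bio(tokens, tags):
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):                       # a new entity starts here
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current_tokens:  # continuation of the current entity
            current_tokens.append(token)
        else:                                          # 'O' (or anything else) closes the entity
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities

print(decode_bio(['American', 'League', 'beat', 'Cleveland'],
                 ['B-MISC', 'I-MISC', 'O', 'B-ORG']))
# [('American League', 'MISC'), ('Cleveland', 'ORG')]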
Next, let's look at the sentence lengths, which will guide the padding length used later when building the model. The histogram of sentence lengths and their frequencies is shown below:
As you can see, sentence lengths are essentially all below 60, which is also visible in the length/frequency dictionary printed above. So how do we pick a standard padding length for the model? We use a quantile of the cumulative distribution of sentence lengths: here we choose the 0.9992 quantile, which corresponds to a sentence length of 60, as the cumulative distribution function (CDF) plot produced by data_review() shows.
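For reference, roughly the same cutoff can be computed directly from the per-sentence lengths without plotting; a minimal sketch, assuming the load_data() function from utils.py above:

import numpy as np
from utils import load_data  # the project module shown above

input_data = load_data()
sent_lengths = input_data.groupby('sent_no')['word'].count().values
# sentence length at the 0.9992 quantile, rounded up to a whole number of tokens
maxlen = int(np.ceil(np.quantile(sent_lengths, 0.9992)))
print(maxlen)  # about 60 on this corpus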
Next is the data-processing function data_processing(), whose main job is to build the word and label dictionaries and save them as pickle files for later reuse.
Modeling
In the third step, we build and train a Bi-LSTM model; the complete Python code (Bi_LSTM_Model_training.py) is as follows:
# -*- coding: utf-8 -*-
import pickle
import numpy as np
import pandas as pd
from utils import BASE_DIR, CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed
# prepare the input data for the model
def input_data_for_model(input_shape):
    # load the data
    input_data = load_data()
    # build and save the dictionaries
    data_processing()

    # load the dictionaries
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)

    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # turn each sentence into a list of (word, pos, label) triples
    aggregate_function = lambda input: [(word, pos, label) for word, pos, label in
                                        zip(input['word'].values.tolist(),
                                            input['pos'].values.tolist(),
                                            input['tag'].values.tolist())]
    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]

    # map words and labels to integer ids and pad to a fixed length
    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary
# define the deep learning model: Bi-LSTM
def create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_units, activation=activation,
                                 return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
# model training
def model_train():
    # split the data into training and test sets with a 9:1 ratio
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x)*0.9)
    train_x, train_y = x[0:train_end], np.array(y[0:train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    # model hyperparameters
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    # train the model
    lstm_model = create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # save the model
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='%s/LSTM_model.png' % BASE_DIR)

    # evaluate on the test set
    N = test_x.shape[0]  # number of test samples
    avg_accuracy = 0     # average prediction accuracy
    for start, end in zip(range(0, N, 1), range(1, N+1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        input_sequences, output_sequences = [], []
        for i in range(0, len(y_predict[0])):
            output_sequences.append(np.argmax(y_predict[0][i]))
            input_sequences.append(np.argmax(test_y[start][i]))
        eval = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (eval[0], eval[1] * 100))
        avg_accuracy += eval[1]
        output_sequences = ' '.join([output_dictionary[key] for key in output_sequences if key != 0]).split()
        input_sequences = ' '.join([output_dictionary[key] for key in input_sequences if key != 0]).split()
        output_input_comparison = pd.DataFrame([sentence, output_sequences, input_sequences]).T
        print(output_input_comparison.dropna())
        print('#' * 80)

    avg_accuracy /= N
    print("Average prediction accuracy on the test set: %.2f%%." % (avg_accuracy * 100))

model_train()
In the code above, input_data_for_model() prepares the data that goes into the model; its parameter input_shape is the length to which sentences are padded. create_Bi_LSTM() then builds the Bi-LSTM model, sketched in the figure below:
Finally, the model is trained on the prepared data: the original data is split 9:1 into training and test sets, and training runs for 10 epochs.
Model Training
Running the training code above for 10 epochs takes roughly 500 s. Accuracy on the training set exceeds 99%, and the average accuracy on the test set is above 95%. Here are the predictions on the last few test samples:
...... (earlier output omitted)
Test Accuracy: loss = 0.000986 accuracy = 100.00%
0 1 2
0 Cardiff B-ORG B-ORG
1 1 O O
2 Brighton B-ORG B-ORG
3 0 O O
################################################################################
1/1 [==============================] - 0s 10ms/step
Test Accuracy: loss = 0.000274 accuracy = 100.00%
0 1 2
0 Carlisle B-ORG B-ORG
1 0 O O
2 Hull B-ORG B-ORG
3 0 O O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.000479 accuracy = 100.00%
0 1 2
0 Chester B-ORG B-ORG
1 1 O O
2 Cambridge B-ORG B-ORG
3 1 O O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.003092 accuracy = 100.00%
0 1 2
0 Darlington B-ORG B-ORG
1 4 O O
2 Swansea B-ORG B-ORG
3 1 O O
################################################################################
1/1 [==============================] - 0s 8ms/step
Test Accuracy: loss = 0.000705 accuracy = 100.00%
0 1 2
0 Exeter B-ORG B-ORG
1 2 O O
2 Scarborough B-ORG B-ORG
3 2 O O
################################################################################
Average prediction accuracy on the test set: 95.55%.
The model's recognition on the original data is quite acceptable.
After training the model, BASE_DIR contains the following files:
Model Prediction
Finally comes what may be the most exciting part of the whole project: testing the model's recognition on new data. The complete Python code for predicting on new sentences (Bi_LSTM_Model_predict.py) is as follows:
# -*- coding: utf-8 -*-
# Named entity recognition for new data

# Import the necessary modules
import pickle
import numpy as np
from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize

# load the dictionaries
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # preprocess the input sentence
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # load the trained model
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # predict
    y_predict = lstm_model.predict(x)
    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))
    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # drop tokens whose NER tag is O
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # print the model's NER results
    print("NER results:")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i+1
                while end <= len(ner_reg_list)-1 and ner_reg_list[end][1].startswith('I'):
                    end += 1
                ner_type = item[1].split('-')[1]
                ner_type_dict = {'PER': 'PERSON: ',
                                 'LOC': 'LOCATION: ',
                                 'ORG': 'ORGANIZATION: ',
                                 'MISC': 'MISC: '
                                 }
                print(ner_type_dict[ner_type],
                      ' '.join([item[0] for item in ner_reg_list[i:end]]))
    else:
        print("The model did not recognize any named entities.")
except KeyError as err:
    print("The sentence contains a word that is not in the vocabulary, please try another sentence!")
    print("Out-of-vocabulary word: %s." % err)
The output is:
['New', 'York', 'is', 'the', 'biggest', 'city', 'in', 'America', '.']
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: New York
LOCATION: America
Next, let's test three more sentences that I made up myself:
Input:
sent = 'James is a world famous actor, whose home is in London.'
Output:
['James', 'is', 'a', 'world', 'famous', 'actor', ',', 'whose', 'home', 'is', 'in', 'London', '.']
['B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
PERSON: James
LOCATION: London
Input:
sent = 'Oxford is in England, Jack is from here.'
Output:
['Oxford', 'is', 'in', 'England', ',', 'Jack', 'is', 'from', 'here', '.']
['B-PER', 'O', 'O', 'B-LOC', 'O', 'B-PER', 'O', 'O', 'O', 'O']
NER results:
PERSON: Oxford
LOCATION: England
PERSON: Jack
Input:
sent = 'I love Shanghai.'
Output:
['I', 'love', 'Shanghai', '.']
['O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: Shanghai
In the examples above, only Oxford comes out wrong: the model tags it as PERSON, whereas it should be a LOCATION (the city) or, if read as the university, an ORGANIZATION.
Next are three sentences taken from CNN and Wikipedia:
Input:
sent = "the US runs the risk of a military defeat by China or Russia"
Output:
['the', 'US', 'runs', 'the', 'risk', 'of', 'a', 'military', 'defeat', 'by', 'China', 'or', 'Russia']
['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC']
NER results:
LOCATION: US
LOCATION: China
LOCATION: Russia
Input:
sent = "Home to the headquarters of the United Nations, New York is an important center for international diplomacy."
Output:
['Home', 'to', 'the', 'headquarters', 'of', 'the', 'United', 'Nations', ',', 'New', 'York', 'is', 'an', 'important', 'center', 'for', 'international', 'diplomacy', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
NER results:
ORGANIZATION: United Nations
LOCATION: New York
Input:
sent = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."
Output:
['The', 'United', 'States', 'is', 'a', 'founding', 'member', 'of', 'the', 'United', 'Nations', ',', 'World', 'Bank', ',', 'International', 'Monetary', 'Fund', '.']
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O']
NER results:
LOCATION: United States
ORGANIZATION: United Nations
ORGANIZATION: World Bank
ORGANIZATION: International Monetary Fund
All three of these examples are recognized correctly.
Summary
That about wraps up the project, so it is worth taking stock.
First, the strengths. The project lets you implement NER step by step: apart from building the corpus, you now know the steps involved in creating an NER system, and you have a much better feel for deep learning models and how to apply them. The benefits are obvious. Of course, in real work, preparing the corpus is what takes the most time, often 90% or more of it; only with a good corpus can you really get started.
Now the weaknesses. First, the corpus is not very large; about 14,000 sentences is workable, but the project does no text preprocessing, so some inflected word forms may never make it into the vocabulary. Second, there is no handling of unseen words: as soon as a sentence contains a word outside the vocabulary, the model cannot process it, which is something to improve later (one common workaround is sketched below). Third, the padding length is 60, so the part of an input sentence beyond 60 tokens cannot be recognized.
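To illustrate the second point, one common workaround (not implemented in this project) is to reserve an extra <UNK> index and map every out-of-vocabulary word to it at encoding time; a minimal sketch, assuming the word_dictionary built in data_processing.py:

# minimal sketch of falling back to a reserved <UNK> id for out-of-vocabulary words;
# word_dictionary is the {word: id} mapping built in data_processing.py
UNK_ID = len(word_dictionary) + 1  # one id past the known vocabulary

def encode_sentence(tokens, word_dictionary, unk_id=UNK_ID):
    # unknown words fall back to unk_id instead of raising a KeyError
    return [word_dictionary.get(token, unk_id) for token in tokens]

# the Embedding layer then needs room for the extra id, e.g.
# Embedding(input_dim=vocab_size + 2, output_dim=output_dim, input_length=input_shape, mask_zero=True)

For this to actually help, the model also has to see the <UNK> id during training, for example by replacing rare words with it, so that the corresponding embedding gets trained.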
So there is still more work to do; building a Chinese NER system is also worth considering.
The project has been uploaded to GitHub at https://github.com/percent4/DL_4_NER . You are welcome to use it as a reference.
References
Book: Applied Natural Language Processing with Python, Taweh Beysolow II
Website: https://github.com/Apress/applied-natural-language-processing-w-python
Website: NLP Primer (4): Named Entity Recognition (NER)