1. Introduction
NLTK (the Natural Language Toolkit) is a suite of open-source Python modules, data sets, and tutorials supporting research and development in natural language processing. NLTK requires Python 3.7, 3.8, 3.9, 3.10, or 3.11.
NLTK is a leading platform for building Python programs that work with human-language data. It offers easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers around industrial-strength NLP libraries, and an active discussion forum.
2. Installation
2.1 Installing the nltk package
pip install nltk
# or
pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
You can test nltk's word tokenization with the following code:
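A minimal check is sketched below; it only confirms that the package imports and tokenizes. Note that word_tokenize needs the punkt tokenizer data, which Section 2.2 explains how to install, so uncomment the download line if you have not set up the data yet.

```python
import nltk

# nltk.download('punkt')   # uncomment if the punkt tokenizer data is not installed yet
print(nltk.word_tokenize("NLTK is installed and tokenizing correctly."))
```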
2.2 Installing the nltk corpora
The NLTK module ships with dozens of complete corpora that you can use for practice, for example:
- Gutenberg corpus: gutenberg, a small selection of texts from the Project Gutenberg electronic-text archive, which contains roughly 36,000 free e-books.
- Web and chat text: webtext, nps_chat
- Brown corpus: brown
- Reuters corpus: reuters
- Movie review corpus: movie_reviews, reviews labeled as positive or negative
- Inaugural address corpus: inaugural, a collection of 55 texts, one for each presidential inaugural address.
- Method 1: download online
import nltk
nltk.download()
Downloading with the command above is very likely to fail.
- Method 2: download manually and install offline
github: https://github.com/nltk/nltk_data/tree/gh-pages
gitee: https://gitee.com/qwererer2/nltk_data/tree/gh-pages
- Check which path the packages folder should be placed in
Rename the downloaded packages folder to nltk_data and place it under one of the directories on NLTK's data search path.
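If you are not sure where those directories are, NLTK exposes its data search path directly; the snippet below simply prints it (nothing here is specific to this post):

```python
import nltk

# NLTK looks for an nltk_data directory in each of these locations, in order
print(nltk.data.path)
```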
- Verify that the installation succeeded
from nltk.book import *
- Tokenization test
import nltk
ret = nltk.word_tokenize("A pivot is the pin or the central point on which something balances or turns")
print(ret)
- WordNet test
WordNet is a large lexical database of English built in the 1980s by a team at Princeton University led by the well-known cognitive psychologist George Miller. Nouns, verbs, adjectives, and adverbs are stored in the database as sets of synonyms (synsets).
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
print(wn.synsets('dog'))   # look up the WordNet synsets for "dog"
from nltk.corpus import brown
print(brown.words())       # the Brown corpus must also be downloaded: nltk.download('brown')
3. Testing
3.1 Sentence and word tokenization
Sentence splitting: nltk.sent_tokenize splits a text into sentences.
Word tokenization: nltk.word_tokenize splits a sentence into words and returns a list.
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))
from nltk.corpus import stopwords
stop_word = set(stopwords.words('english'))   # the set of English stop words
word_tokens = word_tokenize(EXAMPLE_TEXT)     # tokenize the example text
filtered_sentence = [w for w in word_tokens if w not in stop_word]   # keep only the non-stop words
print(filtered_sentence)
3.2 Stop-word filtering
Stop words: nltk.corpus.stopwords gives access to the English stop-word list.
The function below filters English stop words: it lowercases the tokens, keeps only alphabetic ones, builds the English stop-word set from the stopwords corpus, and removes those words from the text.
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
from nltk.corpus import stopwords                         # stop-word lists
def remove_stopwords(text):
    text_lower = [w.lower() for w in text if w.isalpha()]   # lowercase, alphabetic tokens only
    stopword_set = set(stopwords.words('english'))
    result = [w for w in text_lower if w not in stopword_set]
    return result
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
print(remove_stopwords(word_tokens))
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
from nltk.corpus import stopwords
test_words = [word.lower() for word in word_tokens]
test_words_set = set(test_words)
test_words_set.intersection(set(stopwords.words('english')))   # the stop words that occur in the text (result not used below)
filtered = [w for w in test_words_set if(w not in stopwords.words('english'))]
print(filtered)
3.3 Stemming
Stemming removes affixes to obtain the root of a word; for example, fishing and fished share the stem fish. NLTK provides PorterStemmer for stemming.
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
print(example_words)
for w in example_words:
    print(ps.stem(w), end=' ')
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
ps = PorterStemmer()
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
print(example_text)
words = word_tokenize(example_text)
for w in words:
    print(ps.stem(w), end=' ')
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
example_text1 = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
example_text2 = "There little thoughts are the rustle of leaves; they have their whisper of joy in my mind."
example_text3 = "We, the rustling leaves, have a voice that answers the storms,but who are you so silent? I am a mere flower."
example_text4 = "The light that plays, like a naked child, among the green leaves happily knows not that man can lie."
example_text5 = "My heart beats her waves at the shore of the world and writes upon it her signature in tears with the words, I love thee."
example_text_list = [example_text1, example_text2, example_text3, example_text4, example_text5]
for sent in example_text_list:
    words = word_tokenize(sent)
    print("tokenize: ", words)
    stems = [ps.stem(w) for w in words]
    print("stem: ", stems)
3.4 Lemmatization
Lemmatization is similar to stemming, but while stemming may produce stems that are not real words, lemmatization always returns an actual word (the lemma).
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print('cats\t',lemmatizer.lemmatize('cats'))
print('better\t',lemmatizer.lemmatize('better',pos='a'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
The only thing to note is that lemmatize accepts a part-of-speech argument pos; if none is given, the default is noun ('n').
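In practice the pos argument usually comes from pos_tag (Section 3.7), whose Penn Treebank tags first have to be mapped onto WordNet's tag set. The helper below is one common way to do that; the function name get_wordnet_pos is my own choice, not something from the original post, and it assumes the tagger and WordNet data have been downloaded.

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # map Penn Treebank tags (JJ*, VB*, RB*, NN*) onto WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN   # default, same as lemmatize()'s own default

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The striped bats are hanging on their feet")
print([lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tag(tokens)])
```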
- Tense and plural forms
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
tokens = word_tokenize(text="All work and no play makes jack a dull boy, all work and no play,playing,played", language="english")
ps=PorterStemmer()
stems = [ps.stem(word)for word in tokens]
print(stems)
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
ret = snowball_stemmer.stem('presumably')
print(ret)
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
ret = wordnet_lemmatizer.lemmatize('dogs')
print(ret)
3.5 Synonyms and antonyms
NLTK provides access to WordNet, a lexical database that, among other things, records synonyms and antonyms.
- Synonyms
from nltk.corpus import wordnet
# look up the synsets of the word "girl"
syns = wordnet.synsets('girl')
print(syns[0].name())
# just the word itself
print(syns[0].lemmas()[0].name())
# the definition of the first synset
print(syns[0].definition())
# usage examples for the word
print(syns[0].examples())
- Synonyms and antonyms
from nltk.corpus import wordnet
synonyms = []   # collected synonyms
antonyms = []   # collected antonyms
for syn in wordnet.synsets('bad'):
    for i in syn.lemmas():
        synonyms.append(i.name())
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
3.6 Semantic similarity
WordNet's wup_similarity() method measures semantic relatedness between two synsets (Wu-Palmer similarity).
from nltk.corpus import wordnet
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cat.n.01')
print(w1.wup_similarity(w2))
NLTK provides several similarity scorers; a short sketch of how to call them follows the list:
- path_similarity
- lch_similarity
- wup_similarity
- res_similarity
- jcn_similarity
- lin_similarity
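Most of these are methods on a Synset. path_similarity, lch_similarity, and wup_similarity only need the two synsets, while res_similarity, jcn_similarity, and lin_similarity additionally require an information-content dictionary such as the one derived from the Brown corpus (the wordnet_ic data package, downloaded separately). A rough sketch:

```python
from nltk.corpus import wordnet, wordnet_ic
# nltk.download('wordnet'); nltk.download('wordnet_ic')   # required data packages

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')

print(dog.path_similarity(cat))   # shortest-path score in [0, 1]
print(dog.lch_similarity(cat))    # Leacock-Chodorow (both synsets must share a POS)
print(dog.wup_similarity(cat))    # Wu-Palmer

brown_ic = wordnet_ic.ic('ic-brown.dat')      # information content estimated from the Brown corpus
print(dog.res_similarity(cat, brown_ic))      # Resnik
print(dog.jcn_similarity(cat, brown_ic))      # Jiang-Conrath
print(dog.lin_similarity(cat, brown_ic))      # Lin
```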
3.7 Part-of-speech tagging
Part-of-speech tagging labels each word in a sentence as a noun, adjective, verb, and so on.
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
from nltk import pos_tag
tags = pos_tag(word_tokens)
print(tags)
- The tags are interpreted as follows (a fuller list with examples is given after the table):
| POS Tag | Meaning |
| --- | --- |
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential "there" |
| FW | foreign word |
| IN | preposition or subordinating conjunction |
| JJ | adjective |
| JJR | adjective, comparative |
| JJS | adjective, superlative |
| LS | list item marker |
| MD | modal verb |
| NN | noun, singular |
| NNS | noun, plural |
| NNP | proper noun |
| PDT | predeterminer |
| POS | possessive ending |
| PRP | personal pronoun |
| PRP$ | possessive pronoun |
| RB | adverb |
| RBR | adverb, comparative |
| RBS | adverb, superlative |
| RP | particle |
| UH | interjection |
| VB | verb, base form |
| VBD | verb, past tense |
| VBG | gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, present tense, non-3rd person singular |
| VBZ | verb, present tense, 3rd person singular |
| WDT | wh-determiner |
POS tag list:
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
3.8 Named entity recognition
Named entity recognition (NER) is the first step of information extraction: it locates named entities in text and classifies them into predefined categories such as person names, organizations, locations, times, quantities, monetary values, and percentages.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
ex= 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent
# word tokenization and part-of-speech tagging
sent = preprocess(ex)
print(sent)
# noun-phrase chunking
pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp= nltk.RegexpParser(pattern)
cs= cp.parse(sent)
print(cs)
# IOB tags
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged= tree2conlltags(cs)
pprint(iob_tagged)
# a classifier recognizes named entities and assigns category labels such as PERSON, ORGANIZATION and GPE
from nltk import ne_chunk
ne_tree= ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags
def learnAnaphora():
    sentences = [
        "John is a man. He walks",
        "John and Mary are married. They have two kids",
        "In order for Ravi to be successful, he should follow John",
        "John met Mary in Barista. She asked him to order a Pizza"
    ]
    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False)
        stack = []
        print(sent)
        items = tree2conlltags(chunks)
        for item in items:
            if item[1] == 'NNP' and (item[2] == 'B-PERSON' or item[2] == 'O'):
                stack.append(item[0])
            elif item[1] == 'CC':
                stack.append(item[0])
            elif item[1] == 'PRP':
                stack.append(item[0])
        print("\t {}".format(stack))
learnAnaphora()
import nltk
sentence = 'Peterson first suggested the name "open source" at Palo Alto, California'
# preprocess first
words = nltk.word_tokenize(sentence)
pos_tagged = nltk.pos_tag(words)
# run the named-entity tagger
ne_tagged = nltk.ne_chunk(pos_tagged)
print("NE tagged text:")
print(ne_tagged)
# extract only the named entities from this tree
print("Recognized named entities:")
for ne in ne_tagged:
    if hasattr(ne, "label"):
        print(ne.label(), ne[0:])
ne_tagged.draw()
NLTK's built-in named-entity tagger was trained on data from the Automatic Content Extraction (ACE) program (the corpora are distributed through the Linguistic Data Consortium at the University of Pennsylvania). It recognizes common entity types such as ORGANIZATION, PERSON, LOCATION, FACILITY, and GPE (geopolitical entity).
NLTK can also use other taggers, such as the Stanford Named Entity Recognizer. That trained tagger is written in Java, but NLTK provides an interface to it (see nltk.parse.stanford or nltk.tag.stanford for details).
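A rough sketch of calling the Stanford tagger through nltk.tag.stanford is shown below. It assumes Java is installed and that you have downloaded and unpacked the Stanford NER distribution yourself; the two file paths are placeholders to adjust, and recent NLTK versions may emit a deprecation warning for this class, pointing to the CoreNLP server interface instead.

```python
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize

# both paths are placeholders pointing into an unpacked Stanford NER download
st = StanfordNERTagger(
    '/path/to/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    '/path/to/stanford-ner/stanford-ner.jar',
    encoding='utf-8')

tokens = word_tokenize('Peterson first suggested the name "open source" at Palo Alto, California')
print(st.tag(tokens))   # list of (token, entity-label) pairs
```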
3.9 The Text object
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
word_tokens = [word.lower() for word in word_tokens]
from nltk.text import Text
t = Text(word_tokens)
print(t.count('and') )
print(t.index('and') )
t.plot(8)
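Besides count, index, and plot, the Text wrapper offers a few other handy exploration helpers; continuing with the t object created above (the exact output naturally depends on the text):

```python
t.concordance('birds')                    # show each occurrence of "birds" with surrounding context
t.similar('birds')                        # words that appear in similar contexts
t.dispersion_plot(['birds', 'leaves'])    # lexical dispersion plot (requires matplotlib)
```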
3.10 Text classification
import nltk
import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
print(documents[1])
all_words = []
for w in movie_reviews.words():
all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))
print(all_words["stupid"])
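The snippet above only builds a frequency distribution. To actually train a classifier on the movie reviews, one common continuation (roughly as in the NLTK book) turns the most frequent words into boolean features and feeds them to NaiveBayesClassifier; the 3000-word cutoff and the 1900/100 train-test split below are arbitrary choices, not from the original post.

```python
word_features = [w for (w, _) in all_words.most_common(3000)]   # the 3000 most frequent words

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}   # one boolean feature per frequent word

featuresets = [(find_features(doc), category) for (doc, category) in documents]
train_set, test_set = featuresets[:1900], featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(15)
```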
3.11 Other classifiers
- The classifiers bundled with NLTK include the following:
from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,TypedMaxentFeatureEncoding,ConditionalExponentialClassifier)
- Predicting gender from a name
import nltk
from nltk.corpus import names
from nltk import classify
# the feature is the last letter of the name
def gender_features(word):
    return {'last_letter': word[-1]}
# prepare the data
name = [(n, 'male') for n in names.words('male.txt')] + [(n, 'female') for n in names.words('female.txt')]
print(len(name))
# extract features and train the model
features = [(gender_features(n), g) for (n, g) in name]
classifier = nltk.NaiveBayesClassifier.train(features[:6000])
# test
print(classifier.classify(gender_features('Frank')))
print(classify.accuracy(classifier, features[6000:]))
print(classifier.classify(gender_features('Tom')))
print(classify.accuracy(classifier, features[6000:]))
print(classifier.classify(gender_features('Sonya')))
print(classify.accuracy(classifier, features[6000:]))
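To inspect what the model actually learned, NaiveBayesClassifier can list its most informative features. Note also that the accuracy figures above are somewhat misleading: the name list is not shuffled before the 6000-name split, so the held-out slice consists almost entirely of female names; adding random.shuffle(name) before building features is an easy fix.

```python
classifier.show_most_informative_features(5)   # e.g. which last letters most strongly indicate each gender
```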
- Sentiment analysis
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
def word_feats(words):
    return dict([(word, True) for word in words])
# prepare the data
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']
# extract features
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
train_set = negative_features + positive_features + neutral_features
# train
classifier = NaiveBayesClassifier.train(train_set)
# test
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    # wrap the word in a list so the feature is the word itself, not its individual characters
    classResult = classifier.classify(word_feats([word]))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))
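Counting per-word votes is a crude heuristic; the classifier can also score the whole sentence in one call by building a single feature dict from all of its words (continuing with the classifier, words, and word_feats defined above):

```python
print(classifier.classify(word_feats(words)))                    # one label for the whole sentence
print(classifier.prob_classify(word_feats(words)).prob('pos'))   # probability assigned to the 'pos' label
```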
3.12 Data cleaning
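The snippets below operate on a variable named text and use the re module, but the original post never shows how text is defined; the made-up, tweet-like string below (purely an assumed example) lets the steps run end to end.

```python
import re
from nltk.tokenize import word_tokenize

# an assumed sample string; the original post does not define `text`
text = ("RT @user: Check https://example.com &amp; buy $AAPL now!!   "
        "This is    an example tweet \n with HTML leftovers #nlp")
```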
- Remove HTML entities (such as &amp;) along with #hashtags and @mentions
text_no_special_html_label = re.sub(r'\&\w+;|#\w*|\@\w*','',text)
print(text_no_special_html_label)
- Remove links
text_no_link = re.sub(r'http:\/\/.*|https:\/\/.*','',text_no_special_html_label)
print(text_no_link)
- Remove newlines
text_no_next_line = re.sub(r'\n','',text_no_link)
print(text_no_next_line)
- Remove tokens carrying a $ sign
text_no_dollar = re.sub(r'\$\w*\s','',text_no_next_line)
print(text_no_dollar)
- Remove short abbreviations (words of only one or two characters)
text_no_short = re.sub(r'\b\w{1,2}\b','',text_no_dollar)
print(text_no_short)
- Collapse extra whitespace
text_no_more_space = re.sub(r'\s+',' ',text_no_short)
print(text_no_more_space)
- Tokenize with nltk
tokens = word_tokenize(text_no_more_space)
tokens_lower = [s.lower() for s in tokens]
print(tokens_lower)
- Remove stop words
import re
from nltk.corpus import stopwords
cache_english_stopwords = stopwords.words('english')
tokens_stopwords = [s for s in tokens_lower if s not in cache_english_stopwords]
print(tokens_stopwords)
print(" ".join(tokens_stopwords))
Besides NLTK, spaCy has also seen very wide adoption in recent years. Its functionality is similar to nltk's, but it is more powerful in places, is updated more frequently, and has clear strengths for language processing.
Conclusion
If you found this method or code even a little useful, consider giving the author a like or buying them a coffee; if you think the method or code falls short, leave a comment and the author will keep improving it; if you need custom development of related features, send the author a private message. Thanks to everyone for the support!