1. Introduction
NLTK (the Natural Language Toolkit) is a suite of open-source Python modules, data sets, and tutorials supporting research and development in natural language processing. NLTK requires Python 3.7, 3.8, 3.9, 3.10, or 3.11.
NLTK is a leading platform for building Python programs that work with human-language data. It offers easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers around industrial-strength NLP libraries, and an active discussion forum.
2. Installation
2.1 Installing the nltk package
pip install nltk
# or
pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
You can test nltk's word tokenization with the following code:
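A minimal check is sketched below; it only confirms that the package imports and tokenizes. Note that word_tokenize needs the punkt tokenizer data, which Section 2.2 explains how to install, so uncomment the download line if you have not set up the data yet.

```python
import nltk

# nltk.download('punkt')   # uncomment if the punkt tokenizer data is not installed yet
print(nltk.word_tokenize("NLTK is installed and tokenizing correctly."))
```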
2.2 Installing the nltk corpora
The NLTK module ships with dozens of complete corpora that you can use for practice, for example:
- Gutenberg corpus: gutenberg, a small selection of texts from the Project Gutenberg electronic-text archive, which contains roughly 36,000 free e-books.
- Web and chat text: webtext, nps_chat
- Brown corpus: brown
- Reuters corpus: reuters
- Movie review corpus: movie_reviews, reviews labeled as positive or negative
- Inaugural address corpus: inaugural, a collection of 55 texts, one for each presidential inaugural address.
- Method 1: download online
import nltk
nltk.download()
Downloading with the command above is very likely to fail.
- Method 2: download manually and install offline
github: https://github.com/nltk/nltk_data/tree/gh-pages
gitee: https://gitee.com/qwererer2/nltk_data/tree/gh-pages
- Check which path the packages folder should be placed in
Rename the downloaded packages folder to nltk_data and place it under one of the directories on NLTK's data search path.
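If you are not sure where those directories are, NLTK exposes its data search path directly; the snippet below simply prints it (nothing here is specific to this post):

```python
import nltk

# NLTK looks for an nltk_data directory in each of these locations, in order
print(nltk.data.path)
```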
- Verify that the installation succeeded
from nltk.book import *
- Tokenization test
import nltk
ret = nltk.word_tokenize("A pivot is the pin or the central point on which something balances or turns")
print(ret)
- WordNet test
WordNet is a large lexical database of English built in the 1980s by a team at Princeton University led by the well-known cognitive psychologist George Miller. Nouns, verbs, adjectives, and adverbs are stored in the database as sets of synonyms (synsets).
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
print(wn.synsets('dog'))   # look up the WordNet synsets for "dog"
from nltk.corpus import brown
print(brown.words())       # the Brown corpus must also be downloaded: nltk.download('brown')
3. Testing
3.1 Sentence and word tokenization
Sentence splitting: nltk.sent_tokenize splits a text into sentences.
Word tokenization: nltk.word_tokenize splits a sentence into words and returns a list.
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))
from nltk.corpus import stopwords
stop_word = set(stopwords.words('english'))   # the set of English stop words
word_tokens = word_tokenize(EXAMPLE_TEXT)     # tokenize the example text
filtered_sentence = [w for w in word_tokens if w not in stop_word]   # keep only the non-stop words
print(filtered_sentence)
3.2 Stop-word filtering
Stop words: nltk.corpus.stopwords gives access to the English stop-word list.
The function below filters English stop words: it lowercases the tokens, keeps only alphabetic ones, builds the English stop-word set from the stopwords corpus, and removes those words from the text.
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
from nltk.corpus import stopwords                         # stop-word lists
def remove_stopwords(text):
    text_lower = [w.lower() for w in text if w.isalpha()]   # lowercase, alphabetic tokens only
    stopword_set = set(stopwords.words('english'))
    result = [w for w in text_lower if w not in stopword_set]
    return result
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
print(remove_stopwords(word_tokens))
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
from nltk.corpus import stopwords
test_words = [word.lower() for word in word_tokens]
test_words_set = set(test_words)
test_words_set.intersection(set(stopwords.words('english')))   # the stop words that occur in the text (result not used below)
filtered = [w for w in test_words_set if(w not in stopwords.words('english'))]
print(filtered)
3.3 Stemming
Stemming removes affixes to obtain the root of a word; for example, fishing and fished share the stem fish. NLTK provides PorterStemmer for stemming.
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
print(example_words)
for w in example_words:
    print(ps.stem(w), end=' ')
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
ps = PorterStemmer()
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
print(example_text)
words = word_tokenize(example_text)
for w in words:
    print(ps.stem(w), end=' ')
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
example_text1 = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
example_text2 = "There little thoughts are the rustle of leaves; they have their whisper of joy in my mind."
example_text3 = "We, the rustling leaves, have a voice that answers the storms,but who are you so silent? I am a mere flower."
example_text4 = "The light that plays, like a naked child, among the green leaves happily knows not that man can lie."
example_text5 = "My heart beats her waves at the shore of the world and writes upon it her signature in tears with the words, I love thee."
example_text_list = [example_text1, example_text2, example_text3, example_text4, example_text5]
for sent in example_text_list:
    words = word_tokenize(sent)
    print("tokenize: ", words)
    stems = [ps.stem(w) for w in words]
    print("stem: ", stems)
3.4 Lemmatization
Lemmatization is similar to stemming, but while stemming may produce stems that are not real words, lemmatization always returns an actual word (the lemma).
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print('cats\t',lemmatizer.lemmatize('cats'))
print('better\t',lemmatizer.lemmatize('better',pos='a'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
The only thing to note is that lemmatize accepts a part-of-speech argument pos; if none is given, the default is noun ('n').
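In practice the pos argument usually comes from pos_tag (Section 3.7), whose Penn Treebank tags first have to be mapped onto WordNet's tag set. The helper below is one common way to do that; the function name get_wordnet_pos is my own choice, not something from the original post, and it assumes the tagger and WordNet data have been downloaded.

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # map Penn Treebank tags (JJ*, VB*, RB*, NN*) onto WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN   # default, same as lemmatize()'s own default

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The striped bats are hanging on their feet")
print([lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tag(tokens)])
```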
- Tense and plural forms
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
tokens = word_tokenize(text="All work and no play makes jack a dull boy, all work and no play,playing,played", language="english")
ps=PorterStemmer()
stems = [ps.stem(word)for word in tokens]
print(stems)
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
ret = snowball_stemmer.stem('presumably')
print(ret)
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
ret = wordnet_lemmatizer.lemmatize('dogs')
print(ret)
3.5 Synonyms and antonyms
NLTK provides access to WordNet, a lexical database that, among other things, records synonyms and antonyms.
- Synonyms
from nltk.corpus import wordnet
# look up the synsets of the word "girl"
syns = wordnet.synsets('girl')
print(syns[0].name())
# just the word itself
print(syns[0].lemmas()[0].name())
# the definition of the first synset
print(syns[0].definition())
# usage examples for the word
print(syns[0].examples())
- Synonyms and antonyms
from nltk.corpus import wordnet
synonyms = []   # collected synonyms
antonyms = []   # collected antonyms
for syn in wordnet.synsets('bad'):
    for i in syn.lemmas():
        synonyms.append(i.name())
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
3.6 Semantic similarity
WordNet's wup_similarity() method measures semantic relatedness between two synsets (Wu-Palmer similarity).
from nltk.corpus import wordnet
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cat.n.01')
print(w1.wup_similarity(w2))
NLTK provides several similarity scorers; a short sketch of how to call them follows the list:
- path_similarity
- lch_similarity
- wup_similarity
- res_similarity
- jcn_similarity
- lin_similarity
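Most of these are methods on a Synset. path_similarity, lch_similarity, and wup_similarity only need the two synsets, while res_similarity, jcn_similarity, and lin_similarity additionally require an information-content dictionary such as the one derived from the Brown corpus (the wordnet_ic data package, downloaded separately). A rough sketch:

```python
from nltk.corpus import wordnet, wordnet_ic
# nltk.download('wordnet'); nltk.download('wordnet_ic')   # required data packages

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')

print(dog.path_similarity(cat))   # shortest-path score in [0, 1]
print(dog.lch_similarity(cat))    # Leacock-Chodorow (both synsets must share a POS)
print(dog.wup_similarity(cat))    # Wu-Palmer

brown_ic = wordnet_ic.ic('ic-brown.dat')      # information content estimated from the Brown corpus
print(dog.res_similarity(cat, brown_ic))      # Resnik
print(dog.jcn_similarity(cat, brown_ic))      # Jiang-Conrath
print(dog.lin_similarity(cat, brown_ic))      # Lin
```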
3.7 Part-of-speech tagging
Part-of-speech tagging labels each word in a sentence as a noun, adjective, verb, and so on.
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
from nltk import pos_tag
tags = pos_tag(word_tokens)
print(tags)
- The tags are interpreted as follows (a fuller list with examples is given after the table):
| POS Tag | Meaning |
| --- | --- |
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential "there" |
| FW | foreign word |
| IN | preposition or subordinating conjunction |
| JJ | adjective |
| JJR | adjective, comparative |
| JJS | adjective, superlative |
| LS | list item marker |
| MD | modal verb |
| NN | noun, singular |
| NNS | noun, plural |
| NNP | proper noun |
| PDT | predeterminer |
| POS | possessive ending |
| PRP | personal pronoun |
| PRP$ | possessive pronoun |
| RB | adverb |
| RBR | adverb, comparative |
| RBS | adverb, superlative |
| RP | particle |
| UH | interjection |
| VB | verb, base form |
| VBD | verb, past tense |
| VBG | gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, present tense, non-3rd person singular |
| VBZ | verb, present tense, 3rd person singular |
| WDT | wh-determiner |
POS tag list:
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
3.8 Named entity recognition
Named entity recognition (NER) is the first step of information extraction: it locates named entities in text and classifies them into predefined categories such as person names, organizations, locations, times, quantities, monetary values, and percentages.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
ex= 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent
# word tokenization and part-of-speech tagging
sent = preprocess(ex)
print(sent)
# noun-phrase chunking
pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp= nltk.RegexpParser(pattern)
cs= cp.parse(sent)
print(cs)
# IOB tags
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged= tree2conlltags(cs)
pprint(iob_tagged)
# a classifier recognizes named entities and assigns category labels such as PERSON, ORGANIZATION and GPE
from nltk import ne_chunk
ne_tree= ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags
def learnAnaphora():
    sentences = [
        "John is a man. He walks",
        "John and Mary are married. They have two kids",
        "In order for Ravi to be successful, he should follow John",
        "John met Mary in Barista. She asked him to order a Pizza"
    ]
    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False)
        stack = []
        print(sent)
        items = tree2conlltags(chunks)
        for item in items:
            if item[1] == 'NNP' and (item[2] == 'B-PERSON' or item[2] == 'O'):
                stack.append(item[0])
            elif item[1] == 'CC':
                stack.append(item[0])
            elif item[1] == 'PRP':
                stack.append(item[0])
        print("\t {}".format(stack))
learnAnaphora()
import nltk
sentence = 'Peterson first suggested the name "open source" at Palo Alto, California'
# preprocess first
words = nltk.word_tokenize(sentence)
pos_tagged = nltk.pos_tag(words)
# run the named-entity tagger
ne_tagged = nltk.ne_chunk(pos_tagged)
print("NE tagged text:")
print(ne_tagged)
# extract only the named entities from this tree
print("Recognized named entities:")
for ne in ne_tagged:
    if hasattr(ne, "label"):
        print(ne.label(), ne[0:])
ne_tagged.draw()
NLTK's built-in named-entity tagger was trained on data from the Automatic Content Extraction (ACE) program (the corpora are distributed through the Linguistic Data Consortium at the University of Pennsylvania). It recognizes common entity types such as ORGANIZATION, PERSON, LOCATION, FACILITY, and GPE (geopolitical entity).
NLTK can also use other taggers, such as the Stanford Named Entity Recognizer. That trained tagger is written in Java, but NLTK provides an interface to it (see nltk.parse.stanford or nltk.tag.stanford for details).
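A rough sketch of calling the Stanford tagger through nltk.tag.stanford is shown below. It assumes Java is installed and that you have downloaded and unpacked the Stanford NER distribution yourself; the two file paths are placeholders to adjust, and recent NLTK versions may emit a deprecation warning for this class, pointing to the CoreNLP server interface instead.

```python
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize

# both paths are placeholders pointing into an unpacked Stanford NER download
st = StanfordNERTagger(
    '/path/to/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    '/path/to/stanford-ner/stanford-ner.jar',
    encoding='utf-8')

tokens = word_tokenize('Peterson first suggested the name "open source" at Palo Alto, California')
print(st.tag(tokens))   # list of (token, entity-label) pairs
```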
3.9 The Text object
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)
word_tokens = [word.lower() for word in word_tokens]
from nltk.text import Text
t = Text(word_tokens)
print(t.count('and') )
print(t.index('and') )
t.plot(8)
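Besides count, index, and plot, the Text wrapper offers a few other handy exploration helpers; continuing with the t object created above (the exact output naturally depends on the text):

```python
t.concordance('birds')                    # show each occurrence of "birds" with surrounding context
t.similar('birds')                        # words that appear in similar contexts
t.dispersion_plot(['birds', 'leaves'])    # lexical dispersion plot (requires matplotlib)
```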
3.10 Text classification
import nltk
import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
print(documents[1])
all_words = []
for w in movie_reviews.words():
all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))
print(all_words["stupid"])
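The snippet above only builds a frequency distribution. To actually train a classifier on the movie reviews, one common continuation (roughly as in the NLTK book) turns the most frequent words into boolean features and feeds them to NaiveBayesClassifier; the 3000-word cutoff and the 1900/100 train-test split below are arbitrary choices, not from the original post.

```python
word_features = [w for (w, _) in all_words.most_common(3000)]   # the 3000 most frequent words

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}   # one boolean feature per frequent word

featuresets = [(find_features(doc), category) for (doc, category) in documents]
train_set, test_set = featuresets[:1900], featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(15)
```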
3.11 Other classifiers
- The classifiers bundled with NLTK include the following:
from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,TypedMaxentFeatureEncoding,ConditionalExponentialClassifier)
- Predicting gender from a name
import nltk
from nltk.corpus import names
from nltk import classify
# the feature is the last letter of the name
def gender_features(word):
    return {'last_letter': word[-1]}
# prepare the data
name = [(n, 'male') for n in names.words('male.txt')] + [(n, 'female') for n in names.words('female.txt')]
print(len(name))
# extract features and train the model
features = [(gender_features(n), g) for (n, g) in name]
classifier = nltk.NaiveBayesClassifier.train(features[:6000])
# test
print(classifier.classify(gender_features('Frank')))
print(classify.accuracy(classifier, features[6000:]))
print(classifier.classify(gender_features('Tom')))
print(classify.accuracy(classifier, features[6000:]))
print(classifier.classify(gender_features('Sonya')))
print(classify.accuracy(classifier, features[6000:]))
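To inspect what the model actually learned, NaiveBayesClassifier can list its most informative features. Note also that the accuracy figures above are somewhat misleading: the name list is not shuffled before the 6000-name split, so the held-out slice consists almost entirely of female names; adding random.shuffle(name) before building features is an easy fix.

```python
classifier.show_most_informative_features(5)   # e.g. which last letters most strongly indicate each gender
```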
- Sentiment analysis
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
def word_feats(words):
    return dict([(word, True) for word in words])
# prepare the data
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']
# extract features
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
train_set = negative_features + positive_features + neutral_features
# train
classifier = NaiveBayesClassifier.train(train_set)
# test
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    # wrap the word in a list so the feature is the word itself, not its individual characters
    classResult = classifier.classify(word_feats([word]))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))
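Counting per-word votes is a crude heuristic; the classifier can also score the whole sentence in one call by building a single feature dict from all of its words (continuing with the classifier, words, and word_feats defined above):

```python
print(classifier.classify(word_feats(words)))                    # one label for the whole sentence
print(classifier.prob_classify(word_feats(words)).prob('pos'))   # probability assigned to the 'pos' label
```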
3.12 Data cleaning
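The snippets below operate on a variable named text and use the re module, but the original post never shows how text is defined; the made-up, tweet-like string below (purely an assumed example) lets the steps run end to end.

```python
import re
from nltk.tokenize import word_tokenize

# an assumed sample string; the original post does not define `text`
text = ("RT @user: Check https://example.com &amp; buy $AAPL now!!   "
        "This is    an example tweet \n with HTML leftovers #nlp")
```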
- Remove HTML entities (such as &amp;) along with #hashtags and @mentions
text_no_special_html_label = re.sub(r'\&\w+;|#\w*|\@\w*','',text)
print(text_no_special_html_label)
- Remove links
text_no_link = re.sub(r'http:\/\/.*|https:\/\/.*','',text_no_special_html_label)
print(text_no_link)
- Remove newlines
text_no_next_line = re.sub(r'\n','',text_no_link)
print(text_no_next_line)
- Remove tokens carrying a $ sign
text_no_dollar = re.sub(r'\$\w*\s','',text_no_next_line)
print(text_no_dollar)
- Remove short abbreviations (words of only one or two characters)
text_no_short = re.sub(r'\b\w{1,2}\b','',text_no_dollar)
print(text_no_short)
- Collapse extra whitespace
text_no_more_space = re.sub(r'\s+',' ',text_no_short)
print(text_no_more_space)
- Tokenize with nltk
tokens = word_tokenize(text_no_more_space)
tokens_lower = [s.lower() for s in tokens]
print(tokens_lower)
- Remove stop words
import re
from nltk.corpus import stopwords
cache_english_stopwords = stopwords.words('english')
tokens_stopwords = [s for s in tokens_lower if s not in cache_english_stopwords]
print(tokens_stopwords)
print(" ".join(tokens_stopwords))
Besides NLTK, spaCy has also seen very wide adoption in recent years. Its functionality is similar to nltk's, but it is more powerful in places, is updated more frequently, and has clear strengths for language processing.
Conclusion
If you found this method or code even a little useful, consider giving the author a like or buying them a coffee; if you think the method or code falls short, leave a comment and the author will keep improving it; if you need custom development of related features, send the author a private message. Thanks to everyone for the support!