I recently revisited and cleaned up my text-preprocessing approach, so I'm writing it down here.
First, install the required dependencies:
pip install pyenchant
pip install nltk
pyenchant checks whether a word is spelled correctly (misspelled tokens get filtered out; if your text legitimately contains non-dictionary words you may want to skip this step), and nltk handles tokenization.
python -m nltk.downloader punkt
python -m nltk.downloader stopwords
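A quick sanity check that both libraries are wired up correctly (assuming the en_US dictionary is available on your system):

import enchant
import nltk

print(enchant.Dict('en_US').check('exception'))      # True
print(nltk.word_tokenize('An exception occurred.'))  # ['An', 'exception', 'occurred', '.']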
from nltk.corpus import stopwords
import nltk
import enchant
import re

def is_spelled_correctly(word, language='en_US'):
    spell_checker = enchant.Dict(language)
    return spell_checker.check(word)
def preprocess_text(text):
    # drop hyphens, turn underscores into spaces, strip digits, collapse non-word characters
    text = re.sub(r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', '').replace('_', ' ')))
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # split camelCase tokens, then keep only correctly spelled non-stopwords
    words = [item for word in words
             for item in re.findall(r'[A-Z]+[a-z]*|[a-z]+', word)
             if is_spelled_correctly(item) and item.lower() not in stop_words]
    return ' '.join(words).lower()

if __name__ == '__main__':
    print(preprocess_text('ServiceHandlerId caedbe-85432-xssc-dsdabffdddbea An exception of some microservice TargetDownService occurred and was test #@/*-sss '))
    # service handler id exception target service occurred test
Lowercasing happens at the very end on purpose: a glued-together camelCase string like ServiceHandlerId would be rejected by the spell checker as a whole. Only while the camel casing is preserved can re.findall(r'[A-Z]+[a-z]*|[a-z]+', word) split it into individual words, so case normalization is deferred to the last step.
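For illustration, here is what that regex does to a camelCase token versus one that was lowercased too early (a minimal sketch with a made-up token):

import re

print(re.findall(r'[A-Z]+[a-z]*|[a-z]+', 'ServiceHandlerId'))  # ['Service', 'Handler', 'Id']
print(re.findall(r'[A-Z]+[a-z]*|[a-z]+', 'servicehandlerid'))  # ['servicehandlerid'], unsplittable, so it fails the spell check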
Improvement 1:
During later testing I found it got very slow once the data volume grew, so I optimized it and the speed improved dramatically.
from nltk.corpus import stopwords
import nltk
import enchant
import re

# create the dictionary once at module level instead of on every call
spell_checker = enchant.Dict('en_US')

def memoize(func):
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def check_spelling(word):
    return spell_checker.check(word)
def preprocess_text(text):
    text = re.sub(r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', '').replace('_', ' ')))
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    words = [item for word in words
             for item in re.findall(r'[A-Z]+[a-z]*|[a-z]+', word)
             if check_spelling(item) and item.lower() not in stop_words]
    return ' '.join(words).lower()

if __name__ == '__main__':
    print(preprocess_text('ServiceHandlerId caedbe-85432-xssc-dsdabffdddbea An exception of some microservice TargetDownService occurred and was test #@/*-sss '))
    # service handler id exception target service occurred test
This uses memoization, an optimization technique that stores function calls and their results in a dictionary; here it caches the spell-check result for each word. With this in place, the speed stays acceptable even when the data volume is large.
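The hand-rolled memoize decorator can also be swapped for functools.lru_cache from the standard library, which provides the same caching behavior for hashable arguments (a minimal sketch, equivalent to the version above):

import functools

import enchant

spell_checker = enchant.Dict('en_US')

@functools.lru_cache(maxsize=None)
def check_spelling(word):
    return spell_checker.check(word)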
Improvement 2:
Use pyspellchecker instead; it is much faster than enchant:
pip install pyspellchecker
from nltk.corpus import stopwords
from spellchecker import SpellChecker
import nltk
import re

spell = SpellChecker()

def preprocess_text(text):
    text = re.sub(r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', '').replace('_', ' ')))
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    # spell.known() returns the subset of candidates found in its dictionary
    words = [item for word in words
             for item in spell.known(re.findall(r'[A-Z]+[a-z]*|[a-z]+', word))
             if item.lower() not in stop_words]
    return ' '.join(words).lower()
The difference:
SpellChecker is a pure-Python spell-checking library based on edit distance; it loads a word-frequency dictionary into memory and can run fast spell checks over a list of words. enchant is a C-based spell-checking library that can use different backends, such as aspell, hunspell, or ispell, to check whether a word exists in a dictionary. SpellChecker is faster than enchant, especially when the word list is large.
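A rough way to confirm the speed difference on your own machine (a hypothetical micro-benchmark; the word list and repeat count are made up):

import timeit

import enchant
from spellchecker import SpellChecker

words = ['service', 'handler', 'exception', 'occurred', 'xssc'] * 1000

d = enchant.Dict('en_US')
spell = SpellChecker()

print('enchant:       ', timeit.timeit(lambda: [d.check(w) for w in words], number=10))
print('pyspellchecker:', timeit.timeit(lambda: spell.known(words), number=10))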
A further improvement:
After using this for a while, I noticed that some strings are not correctly spelled words yet are genuinely useful for my downstream classification, so I added an exception list:
from nltk.corpus import stopwords
from spellchecker import SpellChecker
import nltk
import re

spell = SpellChecker()
# misspelled-but-meaningful tokens that should survive the spell check
exceptions_words = ['jwt', 'json', ...]

def preprocess_text(text):
    text = re.sub(
        r'\W+', ' ', re.sub(r'[0-9]+', '', text.replace('-', ' ').replace('_', ' ')))
    split_text = re.findall(r'[A-Z]+[a-z]*|[a-z]+', text)
    new_sentence = ' '.join(split_text)
    words = nltk.word_tokenize(new_sentence)
    stop_words = set(stopwords.words('english'))
    # spell.known() lowercases its results by default, so compare in lowercase
    words = [word for word in words
             if (word.lower() in spell.known([word]) and word.lower() not in stop_words)
             or word.lower() in exceptions_words]
    return ' '.join(words).lower()
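A quick check that the exception list behaves as intended ('jwt' is assumed not to be in the spell checker's dictionary, so without the list it would be dropped):

print(preprocess_text('JWT validation failed for TargetService'))
# jwt validation failed target service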
That wraps up this walkthrough of NLP noise preprocessing with spell checking.