1. Background and Goal
Background:
Spam SMS messages keep changing form; related reports can be found at:
360 Internet Security Center (http://zt.#/report/)
Goal:
Build a recognition model on SMS text content that accurately identifies spam messages, in order to solve the spam-filtering problem.
2. Overall Workflow
3. Code Implementation
1. Data Exploration
Import the data (pandas is assumed to be imported as pd throughout):
import pandas as pd

# Load the data; the file has no header row, and the first column is used as the index
message = pd.read_csv("./data/message80W1.csv", encoding="UTF-8", header=None, index_col=0)
Rename the columns:
message.columns = ["label", "message"]
Check the shape of the data:
message.shape
# (800000, 2)
Get an overall view of the data:
message.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800000 entries, 1 to 800000
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 800000 non-null int64
1 message 800000 non-null object
dtypes: int64(1), object(1)
memory usage: 18.3+ MB
"""
Check whether the data contains duplicates (if so, they are not removed yet; they will be handled after sampling):
message.duplicated().sum()
# 13424
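As an optional sanity check (a small assumed addition, not part of the original steps), the duplicates can be split by label to see where they concentrate:
dup_mask = message.duplicated()
print(message[dup_mask]["label"].value_counts())  # duplicate counts per class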
Look at the proportion of spam vs. non-spam messages:
data_group = message.groupby("label").count().reset_index()
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 8))
plt.rcParams["font.sans-serif"] = ["SimHei"]   # use the SimHei font so Chinese labels render correctly
plt.rcParams["axes.unicode_minus"] = False     # keep the minus sign displayable with this font
plt.title("垃圾短信與非垃圾短信的占比餅圖", fontsize=12)
plt.pie(data_group["message"], labels=["非垃圾短信", "垃圾短信"], autopct="%0.2f%%", startangle=90, explode=[0, 0.04])
plt.savefig("垃圾短信與非垃圾短信的占比餅圖.jpg")
plt.show()
2. Data Sampling
First, draw random samples: 1,000 spam and 1,000 non-spam messages, giving a 1:1 ratio.
n = 1000
a = message[message["label"] == 0].sample(n)   # random sample of non-spam messages
b = message[message["label"] == 1].sample(n)   # random sample of spam messages
data_new = pd.concat([a, b], axis=0)           # concatenate the two samples row-wise
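Note that sample() draws a fresh random subset on every run. If reproducible samples are wanted, pandas' random_state parameter can be fixed, e.g.:
a = message[message["label"] == 0].sample(n, random_state=123)  # fixed seed: same sample every run
b = message[message["label"] == 1].sample(n, random_state=123)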
3. Data Preprocessing
Data cleaning: remove duplicates and strip the "x" sequences (here, "x" sequences are the runs of XXX used in the data to mask sensitive content such as phone numbers, names, and salaries).
# Remove duplicates
data_dup = data_new["message"].drop_duplicates()
# Strip the x sequences
import re
data_qumin = data_dup.apply(lambda x: re.sub("x", "", x))
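One caveat worth noting: re.sub("x", "", x) deletes every single letter x, including those inside ordinary English words or URLs a message may contain. A slightly safer variant (an assumed alternative, not the original code) strips only runs of two or more mask characters:
data_qumin = data_dup.apply(lambda s: re.sub("x{2,}", "", s))  # remove only runs of masked characters (xx, xxx, ...)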
Segment the text with jieba:
import jieba
jieba.load_userdict("./data/newdic1.txt")             # add custom words to jieba's dictionary
data_cut = data_qumin.apply(lambda x: jieba.lcut(x))  # word segmentation
# Load the stop-word list
stopWords = pd.read_csv('./data/stopword.txt', encoding='GB18030', sep='hahaha', header=None)  # a separator that never occurs in the file, so each line is read whole
stopWords = ['≮', '≯', '≠', '≮', ' ', '會(huì)', '月', '日', '–'] + list(stopWords.iloc[:, 0])
# Remove stop words
data_after_stop = data_cut.apply(lambda x: [i for i in x if i not in stopWords])
# Extract the corresponding labels
labels = data_new.loc[data_after_stop.index, "label"]
# Join the tokens back together with spaces
adata = data_after_stop.apply(lambda x: " ".join(x))
Wrap the steps above into a function:
def data_process(file='./data/message80W1.csv'):
    data = pd.read_csv(file, header=None, index_col=0)
    data.columns = ['label', 'message']
    n = 10000
    a = data[data['label'] == 0].sample(n)
    b = data[data['label'] == 1].sample(n)
    data_new = pd.concat([a, b], axis=0)
    data_dup = data_new['message'].drop_duplicates()
    data_qumin = data_dup.apply(lambda x: re.sub('x', '', x))
    jieba.load_userdict('./data/newdic1.txt')
    data_cut = data_qumin.apply(lambda x: jieba.lcut(x))
    stopWords = pd.read_csv('./data/stopword.txt', encoding='GB18030', sep='hahaha', header=None)
    stopWords = ['≮', '≯', '≠', '≮', ' ', '會(huì)', '月', '日', '–'] + list(stopWords.iloc[:, 0])
    data_after_stop = data_cut.apply(lambda x: [i for i in x if i not in stopWords])
    labels = data_new.loc[data_after_stop.index, 'label']
    adata = data_after_stop.apply(lambda x: ' '.join(x))
    return adata, data_after_stop, labels
4. Word-Cloud Plots
Call the function to prepare the data:
adata, data_after_stop, labels = data_process()
Word-frequency count for the non-spam messages:
word_fre = {}
for i in data_after_stop[labels == 0]:
    for j in i:
        if j not in word_fre.keys():
            word_fre[j] = 1
        else:
            word_fre[j] += 1
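The same frequency table can be built more compactly with collections.Counter from the standard library; a minimal equivalent sketch:
from collections import Counter
word_fre = Counter()
for words in data_after_stop[labels == 0]:
    word_fre.update(words)   # add the counts of this message's tokens
word_fre = dict(word_fre)    # fit_words() accepts a plain dict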
Plot the word cloud:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 8))
plt.title("非垃圾短信詞云圖", fontsize=20)
mask = plt.imread('./data/duihuakuan.jpg')
wc = WordCloud(mask=mask, background_color='white', font_path=r'D:\2023暑假\基于文本內(nèi)容的垃圾短信分類\基于文本內(nèi)容的垃圾短信識(shí)別-數(shù)據(jù)&代碼\data\simhei.ttf')
wc.fit_words(word_fre)
plt.imshow(wc)
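A small presentation note: the axis ticks add nothing to a word cloud, so they are usually hidden before displaying the figure:
plt.axis("off")
plt.show()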
Spam word-cloud plot (same approach):
word_fre = {}
for i in data_after_stop[labels == 1]:
    for j in i:
        if j not in word_fre.keys():
            word_fre[j] = 1
        else:
            word_fre[j] += 1
from wordcloud import WordCloud
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 8))
plt.title("垃圾短信詞云圖", fontsize=20)
mask = plt.imread('./data/duihuakuan.jpg')
wc = WordCloud(mask=mask, background_color='white', font_path=r'D:\2023暑假\基于文本內(nèi)容的垃圾短信分類\基于文本內(nèi)容的垃圾短信識(shí)別-數(shù)據(jù)&代碼\data\simhei.ttf')
wc.fit_words(word_fre)
plt.imshow(wc)
5. Model Building
The TF-IDF weighting scheme is used.
Weighting idea: a term that occurs frequently within a document should receive a high weight for that document, unless it also occurs across many documents.
TF (term frequency) is how often a term occurs within a single document:
TF = N / M, where N is the number of occurrences of the term in the document and M is the total number of terms in that document.
IDF (inverse document frequency) measures a term's weight across the whole corpus:
IDF = log(D / Dw), where D is the total number of documents and Dw is the number of documents that contain the term.
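A hand-computed toy example to make the formulas concrete (the numbers are made up for illustration): suppose the corpus has D = 4 documents, a term occurs N = 3 times in a document of M = 30 words, and appears in Dw = 2 of the 4 documents. Then TF = 3/30 = 0.1, IDF = log(4/2) ≈ 0.69 (natural log), and the TF-IDF weight is 0.1 × 0.69 ≈ 0.069. Note that sklearn's TfidfTransformer uses a smoothed variant, log((1 + D) / (1 + Dw)) + 1, by default, so its numbers will differ slightly.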
The relevant sklearn modules and methods:
sklearn.feature_extraction.text   # the text feature-extraction module
CountVectorizer                   # converts documents into term-count vectors
fit_transform()                   # performs the count-vector conversion
get_feature_names()               # returns the vocabulary (renamed get_feature_names_out() in sklearn >= 1.0)
toarray()                         # converts the sparse result into a dense ndarray
TfidfTransformer                  # converts count vectors into TF-IDF weight vectors
fit_transform(counts)             # performs the TF-IDF conversion
transformer = TfidfTransformer()               # TF-IDF transformer
vectorizer = CountVectorizer()                 # count vectorizer
word_vec = vectorizer.fit_transform(corpus)    # documents -> term-count vectors (corpus is a list of space-separated documents, e.g. list(adata))
words = vectorizer.get_feature_names()         # the vocabulary
word_cout = word_vec.toarray()                 # dense ndarray of counts
tfidf = transformer.fit_transform(word_cout)   # counts -> TF-IDF weight vectors
tfidf_ma = tfidf.toarray()                     # dense ndarray of TF-IDF weights
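A minimal runnable sketch of the snippet above, using a made-up two-document toy corpus (in this project, corpus would be list(adata)):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
corpus = ["恭喜 您 獲得 大獎(jiǎng)", "明天 開會(huì) 請(qǐng) 準(zhǔn)時(shí)"]   # toy corpus: space-separated, pre-segmented documents
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
counts = vectorizer.fit_transform(corpus)            # term-count matrix of shape (2, vocabulary size)
tfidf_ma = transformer.fit_transform(counts).toarray()
print(vectorizer.get_feature_names_out())            # get_feature_names() in sklearn < 1.0
print(tfidf_ma)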
A naive Bayes classifier is used for the model itself.
Multinomial naive Bayes is the variant suited to text classification. Its constructor:
sklearn.naive_bayes.MultinomialNB(alpha=1.0,        # smoothing parameter
                                  fit_prior=True,   # whether to learn class prior probabilities
                                  class_prior=None) # class prior probabilities, if supplied directly
Model implementation:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
data_train, data_test, labels_train, labels_test = train_test_split(adata, labels, test_size=0.2)
countVectorizer = CountVectorizer()
data_train = countVectorizer.fit_transform(data_train)
X_train = TfidfTransformer().fit_transform(data_train.toarray()).toarray()
data_test = CountVectorizer(vocabulary=countVectorizer.vocabulary_).fit_transform(data_test)  # reuse the training vocabulary on the test set
X_test = TfidfTransformer().fit_transform(data_test.toarray()).toarray()
model = GaussianNB()
model.fit(X_train, labels_train)    # train
model.score(X_test, labels_test)    # accuracy on the test set
# 0.9055374592833876
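The code above follows the original and fits GaussianNB on dense TF-IDF arrays, even though the section introduced MultinomialNB. As a hedged alternative sketch, MultinomialNB works directly on the sparse matrices (no toarray() needed, which matters at 800,000 messages); its score will differ from the figure above:
from sklearn.naive_bayes import MultinomialNB
tfidf = TfidfTransformer().fit(data_train)   # learn IDF weights on the training counts only
X_train_sp = tfidf.transform(data_train)     # sparse TF-IDF matrix
X_test_sp = tfidf.transform(data_test)       # reuse the same IDF weights for the test set
nb = MultinomialNB(alpha=1.0)                # Laplace smoothing, as in the constructor above
nb.fit(X_train_sp, labels_train)
print(nb.score(X_test_sp, labels_test))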