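The examples below use a single Excel workbook and look up fuzzy matches for the words in Sheet2 among the words in Sheet1.
Load the data: read Sheet1 (the target vocabulary to match against) and Sheet2 (the words to be matched) from the workbook.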
import pandas as pd
import jieba
# Words to be matched
attendee = pd.read_excel('path/testnn.xlsx',sheet_name='Sheet2')
# Target vocabulary to match against
account = pd.read_excel('path/testnn.xlsx',sheet_name='Sheet1')
attendee = attendee.values
account = account.values
#print(attendee)
#print(account)
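I. Word-segmentation matching
Segment both the words to be matched and the target words, then judge how closely two names are related by how many segments they share.
1. Import the jieba segmentation package, segment the target words and the words to be matched, and store the results in new dictionaries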
# Segmentation results for the words to be matched (Sheet2)
Sheet2 = {}
for i in attendee:
    # i[0] is the first column of the row; cut_all=False selects jieba's precise mode
    Sheet2[i[0]] = list(jieba.cut(i[0], cut_all=False))
#print(Sheet2)
# Segmentation results for the target vocabulary (Sheet1)
Sheet1 = {}
for i in account:
    Sheet1[i[0]] = list(jieba.cut(i[0], cut_all=False))
#print(Sheet1)
2. Iterate over the segmentation dictionaries, compare the shared keywords, and record the matches
for i in Sheet1:
    a = i
    if i in Sheet2:
        # The name appears verbatim in both sheets (exact match)
        resultstr = i
    # Compare the segments of this Sheet1 name with every Sheet2 name
    for j in Sheet2:
        b = j
        # 'xxxx' stands for a noise segment to exclude after manual review
        # Segments of the Sheet1 name
        origin_l = [k for k in Sheet1[i] if k != 'xxxx']
        # Segments of the Sheet2 name
        target_l = [h for h in Sheet2[j] if h != 'xxxx']
        origin_num = len(origin_l)
        target_num = len(target_l)
        # Number of equal (Sheet1 segment, Sheet2 segment) pairs
        match_num = sum(1 for c in origin_l for d in target_l if c == d)
        # Keep the pair when matched segments outnumber unmatched ones; one record per matching pair
        if match_num > origin_num - match_num:
            data = {'origin_str': a, 'target_str': b, 'origin_l': origin_l, 'target_l': target_l,'origin_num': origin_num, 'target_num':target_num, 'match_num':match_num}
            print(data)
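The selection rule above, match_num > origin_num - match_num, keeps a pair only when more than half of the Sheet1 name's segments also occur in the Sheet2 name. A minimal sketch of that rule on hand-written segment lists follows. The helper names count_matches and is_match and the sample company names are hypothetical, and real jieba output may segment these strings differently.

```python
def count_matches(origin_tokens, target_tokens, excluded=("xxxx",)):
    """Count equal (origin, target) token pairs, skipping excluded tokens."""
    origin = [t for t in origin_tokens if t not in excluded]
    target = [t for t in target_tokens if t not in excluded]
    match_num = sum(1 for o in origin for t in target if o == t)
    return match_num, len(origin)


def is_match(origin_tokens, target_tokens):
    """Apply the rule: matched tokens must outnumber unmatched ones."""
    match_num, origin_num = count_matches(origin_tokens, target_tokens)
    return match_num > origin_num - match_num


# Hypothetical, hand-segmented names (not real jieba output)
origin = ["北京", "星辰", "科技", "有限公司"]
close = ["星辰", "科技", "有限公司"]
far = ["上海", "明月", "貿易", "有限公司"]

print(is_match(origin, close))  # 3 of 4 origin tokens match
print(is_match(origin, far))    # 1 of 4 origin tokens match
```

Because a generic suffix such as "有限公司" counts as a match on its own, adding it to the excluded tokens (the role 'xxxx' plays in the code above) may reduce false positives.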
Full code
import pandas as pd
import jieba
# Words to be matched
attendee = pd.read_excel('path/testnn.xlsx',sheet_name='Sheet2')
# Target vocabulary to match against
account = pd.read_excel('path/testnn.xlsx',sheet_name='Sheet1')
attendee = attendee.values
account = account.values
#print(attendee)
#print(account)
# Segmentation results for the words to be matched (Sheet2)
Sheet2 = {}
for i in attendee:
    Sheet2[i[0]] = list(jieba.cut(i[0], cut_all=False))
#print(Sheet2)
# Segmentation results for the target vocabulary (Sheet1)
Sheet1 = {}
for i in account:
    Sheet1[i[0]] = list(jieba.cut(i[0], cut_all=False))
#print(Sheet1)
for i in Sheet1:
    a = i
    if i in Sheet2:
        # The name appears verbatim in both sheets (exact match)
        resultstr = i
    # Compare the segments of this Sheet1 name with every Sheet2 name
    for j in Sheet2:
        b = j
        # 'xxxx' stands for a noise segment to exclude after manual review
        # Segments of the Sheet1 name
        origin_l = [k for k in Sheet1[i] if k != 'xxxx']
        # Segments of the Sheet2 name
        target_l = [h for h in Sheet2[j] if h != 'xxxx']
        origin_num = len(origin_l)
        target_num = len(target_l)
        # Number of equal (Sheet1 segment, Sheet2 segment) pairs
        match_num = sum(1 for c in origin_l for d in target_l if c == d)
        # Keep the pair when matched segments outnumber unmatched ones
        if match_num > origin_num - match_num:
            data = {'origin_str': a, 'target_str': b, 'origin_l': origin_l, 'target_l': target_l,'origin_num': origin_num, 'target_num':target_num, 'match_num':match_num}
            print(data)
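II. Distance matching
Call the fuzzywuzzy package directly; it scores similarity using an edit-distance approach.
The edit distance between two strings is the minimum number of edit operations needed to turn one into the other.
The edit operations are: substituting one character for another, inserting a character, and deleting a character.
In general, the smaller the edit distance, the more similar the two strings are.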
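To make the definition concrete, here is a minimal pure-Python sketch of edit distance using the standard dynamic-programming recurrence. The function name edit_distance and the sample strings are illustrative assumptions; this is not fuzzywuzzy's own code, and fuzzywuzzy reports a 0-100 similarity score rather than a raw distance.

```python
def edit_distance(s, t):
    """Return the minimum number of substitutions, insertions, and deletions turning s into t."""
    # prev[j] is the distance between the processed prefix of s and the first j characters of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning the first i characters of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))        # 3
print(edit_distance("星辰科技", "星辰科技有限公司"))  # 4 (four insertions)
```

A raw distance depends on string length, which is one reason libraries usually normalize it into a ratio before comparing candidates.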
Full code
import pandas as pd
import jieba
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
attendee = pd.read_excel('path/testnn.xlsx',sheet_name='Sheet2')
account = pd.read_excel('path/testnn.xlsx',sheet_name='Sheet1')
attendee = attendee.values
account = account.values
Sheet2 = {}
for i in attendee:
    Sheet2[i[0]] = list(jieba.cut(i[0], cut_all=False))
print(Sheet2)
Sheet1 = {}
for i in account:
    Sheet1[i[0]] = list(jieba.cut(i[0], cut_all=False))
print(Sheet1)
# Candidate names taken from Sheet2
target_l = list(Sheet2)
data = []
for i in Sheet1:
    # extractOne returns the best candidate and its score (0-100); call it once per name
    best = process.extractOne(i, target_l)
    target = {'Search company': i, 'Target company': best[0], 'Target weight': best[1]}
    data.append(target)
print(data)
df1 = pd.DataFrame(data)
print(df1)
# The with block saves and closes the workbook; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('path/testmm.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Final')
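For readers without fuzzywuzzy installed, the best-candidate lookup can be approximated with the standard library. The sketch below uses difflib.SequenceMatcher and a hypothetical helper best_match with an optional score_cutoff; the sample names are invented. Its scores will generally differ from those of process.extractOne, which uses its own scorer, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
from difflib import SequenceMatcher


def best_match(query, choices, score_cutoff=0):
    """Return (choice, score) for the most similar choice, or None if none reaches the cutoff."""
    best = None
    for choice in choices:
        # ratio() is in [0, 1]; scale it to 0-100 to resemble a percentage score
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


# Hypothetical candidate names
targets = ["星辰科技有限公司", "明月貿易有限公司"]
print(best_match("星辰科技", targets))
print(best_match("完全無關", targets, score_cutoff=60))
```

Applying a cutoff is worth considering in the main script as well, since a best match is always returned even when no candidate is genuinely similar.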