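Contents
1. System environment and software requirements
2. Software overview
3. Define the text-extraction pipeline
4. Create the index and define the document mapping
5. Insert a document
6. Query documents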
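The goal is to process local emails together with their PDF, Excel, Word and other attachments, store the results in Elasticsearch (ES), and make both the message bodies and the attachment contents searchable by full text.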
1. System environment and software requirements
OS: CentOS 7.3
Elasticsearch version: 7.13.3
Kibana version: 7.16.3
ingest-attachment plugin version: 7.13.3
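2. Software overview
Kibana is an open-source analytics and visualization platform designed to work with Elasticsearch. Here it is used mainly to run commands in the Kibana Dev Tools console.
Ingest-Attachment is an off-the-shelf plugin that lets files in common formats be written to an index as attachments. It uses Apache Tika to extract text from the file; supported formats include TXT, DOC, PPT, XLS and PDF. Extraction and indexing happen automatically at ingest time. Note: the source field must contain the binary file encoded as base64.
Limitation: for xls and xlsx files the sheets cannot be indexed separately; the whole file goes into ES as a single document.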
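Since the source field has to be base64, a client needs an encoding step before it sends a file. A minimal sketch in Python follows; the function names are illustrative and not part of the plugin.

```python
import base64
from pathlib import Path


def encode_bytes(data: bytes) -> str:
    """Return raw bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


def encode_file(path: str) -> str:
    """Read a file from disk and return its content as base64."""
    return encode_bytes(Path(path).read_bytes())
```

The sample attachment strings used later in this article are ordinary UTF-8 text encoded this way, which is why Tika reports them as text/plain.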
2.1 Install the plugin
I install Ingest-Attachment offline: first download the offline package whose version matches the Elasticsearch version, using wget.
wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.13.3.zip
Upload it to the server at
/home/es/install/ingest-attachment-7.13.3.zip
Change to the Elasticsearch home directory (ES_HOME) and run the following commands to install it
cd /home/elasticsearch/
./bin/elasticsearch-plugin install file:///home/es/install/ingest-attachment-7.13.3.zip
Restart the Elasticsearch service after the installation finishes.
The plugin is now installed.
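One way to confirm the result after the restart is to look at the output of GET _cat/plugins?format=json. The sketch below only checks such a response once you have it as text; it assumes each row carries the component and version keys, and the helper name is hypothetical.

```python
import json


def plugin_matches(cat_plugins_json: str, name: str, es_version: str) -> bool:
    """True if some node reports the plugin `name` at exactly `es_version`.

    Assumes the JSON rows returned by _cat/plugins?format=json have
    "component" and "version" keys.
    """
    rows = json.loads(cat_plugins_json)
    return any(
        row.get("component") == name and row.get("version") == es_version
        for row in rows
    )
```

For the setup in this article the call would be plugin_matches(text, "ingest-attachment", "7.13.3").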
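3. Define the text-extraction pipeline
Run the following in Kibana Dev Tools.
An email may carry several attachments, so I define a multi-attachment pipeline and configure it to remove the base64 binary data after processing.
Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise they will not resolve to the right fields.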
PUT _ingest/pipeline/multiple_attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach" : {
        "field" : "attachments",
        "processor" : {
          "attachment" : {
            "target_field" : "_ingest._value.attachment",
            "field" : "_ingest._value.content"
          }
        }
      }
    },
    {
      "foreach" : {
        "field" : "attachments",
        "processor" : {
          "remove" : {
            "field" : "_ingest._value.content"
          }
        }
      }
    }
  ]
}
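To make the effect of the two foreach processors concrete, here is a rough Python emulation of what happens to one document. It is only an illustration: it assumes every payload is UTF-8 plain text and simply decodes it, whereas the real processor runs Apache Tika and also fills in fields such as content_type and language.

```python
import base64
import copy


def emulate_multiple_attachment(doc: dict) -> dict:
    """Mimic the pipeline's two foreach steps for plain-text payloads only."""
    out = copy.deepcopy(doc)
    # First foreach: each array element plays the role of _ingest._value.
    # The extracted text is written to <element>.attachment.
    for item in out["attachments"]:
        text = base64.b64decode(item["content"]).decode("utf-8")
        item["attachment"] = {"content": text}
    # Second foreach: the remove processor drops the base64 source field.
    for item in out["attachments"]:
        del item["content"]
    return out
```

The input document is left untouched; the returned copy has attachment.content in place of content for every array element.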
Parameters of the ingest attachment processor
Name | Required | Default | Description |
field | yes | - | The field to read the base64-encoded content from |
target_field | no | attachment | The field that holds the extracted attachment information; mainly needed when there are multiple attachments |
indexed_chars | no | 100000 | Maximum number of characters stored for the field. Use -1 for no limit. |
indexed_chars_field | no | - | A field in the document from which the indexed_chars limit is read. |
properties | no | all properties | The properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language |
ignore_missing | no | false | If true and field does not exist, the attachment is skipped and the document is written as-is; otherwise an error is raised. |
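As an illustration of how these parameters fit together, the sketch below assembles a pipeline body as a Python dict. The helper names are hypothetical; the defaults mirror the table, and only the keys listed there are used.

```python
def attachment_processor(field, target_field="attachment", indexed_chars=100000,
                         properties=None, ignore_missing=False):
    """Build one attachment processor entry from the parameters in the table."""
    body = {
        "field": field,
        "target_field": target_field,
        "indexed_chars": indexed_chars,
        "ignore_missing": ignore_missing,
    }
    # Leaving properties out keeps the default of storing all properties.
    if properties is not None:
        body["properties"] = list(properties)
    return {"attachment": body}


def pipeline(description, *processors):
    """Wrap processors into the body expected by PUT _ingest/pipeline/<name>."""
    return {"description": description, "processors": list(processors)}
```

For example, pipeline("Extract text only", attachment_processor("data", indexed_chars=-1, properties=["content", "content_type"])) keeps just the extracted text and its MIME type, with no character limit.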
4. Create the index and define the document mapping
PUT mail
{
  "settings": {
    "index": {
      "max_result_window": 100000000
    },
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "mfrom": { "type": "keyword" },
      "mto": { "type": "keyword" },
      "mcc": { "type": "keyword" },
      "mbcc": { "type": "keyword" },
      "rcvtime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "subject": { "type": "keyword" },
      "importance": { "type": "keyword" },
      "savepath": { "type": "keyword" },
      "mbody": {
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "ignore_above": 256,
                    "type": "keyword"
                  }
                }
              },
              "filename": { "type": "keyword" },
              "type": { "type": "keyword" }
            }
          }
        }
      }
    }
  }
}
On success the response is
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "mail"
}
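The rcvtime field only accepts strings in the yyyy-MM-dd HH:mm:ss pattern declared in the mapping, so client code has to format timestamps accordingly. A small sketch, assuming the client is written in Python and the timestamps are naive datetime values (the names are illustrative):

```python
from datetime import datetime

# Python equivalent of the mapping's Java-style pattern "yyyy-MM-dd HH:mm:ss".
RCVTIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_rcvtime(moment: datetime) -> str:
    """Render a datetime the way the rcvtime mapping expects it."""
    return moment.strftime(RCVTIME_FORMAT)


def parse_rcvtime(text: str) -> datetime:
    """Parse an rcvtime string read back from the index."""
    return datetime.strptime(text, RCVTIME_FORMAT)
```

A value in any other shape, such as an ISO 8601 string with a T separator, would not match this format and the document would be rejected.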
5. Insert a document
You can use Postman to call the Elasticsearch RESTful API to insert or update documents.
Request method: POST
Request URL: http://192.168.31.200:9200/mail/_doc?pipeline=multiple_attachment
In the URL, mail is the index name, and pipeline=multiple_attachment selects the pipeline to run, here multiple_attachment.
The request body is JSON:
{
  "mfrom": "microsoft.teams@outlook.com",
  "mto": "network@163.com",
  "mcc": "",
  "mbcc": "",
  "rcvtime": "2023-05-18 23:35:29",
  "subject": "神奇的郵件2023066- ",
  "importance": "1",
  "savepath": "d:\\mail\\TEST123.eml",
  "mbody": "這是郵件內容",
  "attachments": [
    {
      "filename": "附件名字1.pdf",
      "type": ".pdf",
      "content": "5oiR54ix5L2g5Lit5Zu9MjAyMw=="
    },
    {
      "filename": "附件名字2.xlsx",
      "type": ".xlsx",
      "content": "Q2hhdEdQVCDniZvpgLwh"
    }
  ]
}
attachments is a JSON array holding the two attachments. filename is the attachment's name and content is the attachment file encoded as a base64 string. At insert time the pipeline processes the document and detects the content automatically; everything else works as with an ordinary index.
The response for a successful insert:
{
  "_index": "mail",
  "_type": "_doc",
  "_id": "eiCNNIgBUc2qXUv978Tg",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
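The same request can be produced from code instead of Postman. The following is a minimal sketch using only the Python standard library: it maps a raw .eml message onto a subset of the index fields and prepares the POST request. The function names are hypothetical, the URL is simply the one used in this article, and mcc, mbcc, rcvtime and importance are left out for brevity.

```python
import base64
import json
import os
import urllib.request
from email import message_from_bytes, policy

# Assumed endpoint: the request URL used in this article.
INDEX_URL = "http://192.168.31.200:9200/mail/_doc?pipeline=multiple_attachment"


def build_mail_doc(eml_bytes: bytes, savepath: str = "") -> dict:
    """Turn a raw .eml message into a document for the mail index."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    attachments = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)
        if payload is None:
            # Nested multipart parts have no single binary payload; skip them.
            continue
        name = part.get_filename() or ""
        attachments.append({
            "filename": name,
            "type": os.path.splitext(name)[1],
            "content": base64.b64encode(payload).decode("ascii"),
        })
    return {
        "mfrom": str(msg.get("From", "")),
        "mto": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "savepath": savepath,
        "mbody": body.get_content() if body is not None else "",
        "attachments": attachments,
    }


def build_index_request(doc: dict, url: str = INDEX_URL) -> urllib.request.Request:
    """Prepare (but do not send) the POST that indexes doc through the pipeline."""
    data = json.dumps(doc, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url, data=data, method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the prepared request to urllib.request.urlopen would send it; the response body is the JSON shown above.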
6. Query documents
6.1 Retrieve a document by _id
GET request URL: http://192.168.31.200:9200/mail/_doc/eiCNNIgBUc2qXUv978Tg
No parameters or request body are needed.
Here eiCNNIgBUc2qXUv978Tg is the document _id and mail is the index to query.
Response:
{
  "_index": "mail",
  "_type": "_doc",
  "_id": "eiCNNIgBUc2qXUv978Tg",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "savepath": "d:\\mail\\TEST123.eml",
    "mbody": "這是郵件內容",
    "attachments": [
      {
        "filename": "附件名字1.pdf",
        "attachment": {
          "content_type": "text/plain; charset=UTF-8",
          "language": "lt",
          "content": "我爱你中国2023",
          "content_length": 10
        },
        "type": ".pdf"
      },
      {
        "filename": "附件名字2.xlsx",
        "attachment": {
          "content_type": "text/plain; charset=UTF-8",
          "language": "lt",
          "content": "ChatGPT 牛逼!",
          "content_length": 12
        },
        "type": ".xlsx"
      }
    ],
    "mbcc": "",
    "subject": "神奇的郵件2023066- ",
    "importance": "1",
    "mfrom": "microsoft.teams@outlook.com",
    "mto": "network@163.com",
    "mcc": "",
    "rcvtime": "2023-05-18 23:35:29"
  }
}
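In the stored document the base64 content field is gone and each array element carries an attachment object instead. A short sketch of how client code might pull the extracted text out of such a response, assuming it has already been parsed into a Python dict (the helper name is illustrative):

```python
def attachment_texts(get_response: dict) -> list:
    """Return (filename, extracted text) pairs from a GET _doc response."""
    source = get_response.get("_source", {})
    return [
        (item.get("filename", ""), item.get("attachment", {}).get("content", ""))
        for item in source.get("attachments", [])
    ]
```

Documents without attachments simply yield an empty list.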
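6.2 Wildcard search on attachment names
POST request URL: http://192.168.31.200:9200/mail/_search
The request body is JSON. attachments.filename is the attachment name; in the index mapping it is a keyword field, so it is not analyzed and the wildcard is matched against the whole name.
{
  "query": {
    "bool": {
      "should": [{
        "wildcard": {
          "attachments.filename": "*附件*"
        }
      }]
    }
  }
}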
Response
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "mail",
        "_type": "_doc",
        "_id": "eiCNNIgBUc2qXUv978Tg",
        "_score": 1.0,
        "_source": {
          "savepath": "d:\\mail\\TEST123.eml",
          "mbody": "這是郵件內容",
          "attachments": [
            {
              "filename": "附件名字1.pdf",
              "attachment": {
                "content_type": "text/plain; charset=UTF-8",
                "language": "lt",
                "content": "我爱你中国2023",
                "content_length": 10
              },
              "type": ".pdf"
            },
            {
              "filename": "附件名字2.xlsx",
              "attachment": {
                "content_type": "text/plain; charset=UTF-8",
                "language": "lt",
                "content": "ChatGPT 牛逼!",
                "content_length": 12
              },
              "type": ".xlsx"
            }
          ],
          "mbcc": "",
          "subject": "神奇的郵件2023066- ",
          "importance": "1",
          "mfrom": "microsoft.teams@outlook.com",
          "mto": "network@163.com",
          "mcc": "",
          "rcvtime": "2023-05-18 23:35:29"
        }
      }
    ]
  }
}
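A wildcard query on a keyword field compares the pattern with the entire stored value, which is why the pattern needs a leading and a trailing asterisk to find a substring. The sketch below builds the same request body in Python and approximates the matching locally with fnmatch. That approximation is an assumption made for illustration only: fnmatch also gives special meaning to square brackets, which the Elasticsearch wildcard syntax does not.

```python
from fnmatch import fnmatchcase


def wildcard_query(field: str, pattern: str) -> dict:
    """Build the bool/should/wildcard body used in section 6.2."""
    return {"query": {"bool": {"should": [{"wildcard": {field: pattern}}]}}}


def matches_keyword(value: str, pattern: str) -> bool:
    """Local approximation: * and ? are matched against the whole value."""
    return fnmatchcase(value, pattern)
```

With these helpers, matches_keyword("附件名字1.pdf", "附件") is false because the pattern has no asterisks and must then equal the whole name.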
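6.3 Full-text search on attachment content
POST request URL: http://192.168.31.200:9200/mail/_search
The request body is JSON. attachments.attachment.content is the extracted attachment text (plain text, no longer base64). A match query analyzes the search text and does not interpret wildcards, so the term is given without asterisks. _source limits which fields are returned; of the names listed, only subject is a field of this index, so only subject comes back.
{
  "size": 10000,
  "_source": [
    "_id",
    "seqnbr",
    "subject",
    "eml"
  ],
  "query": {
    "match": {
      "attachments.attachment.content": "ChatGPT"
    }
  }
}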
Response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "mail",
        "_type": "_doc",
        "_id": "eiCNNIgBUc2qXUv978Tg",
        "_score": 0.2876821,
        "_source": {
          "subject": "神奇的郵件2023066- "
        }
      }
    ]
  }
}
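For code integration, the search body and the handling of the response can be wrapped in two small helpers. This is a sketch under the assumption that the response has already been parsed into a dict; the function names are hypothetical.

```python
def match_query(field: str, text: str, size: int = 10, source=None) -> dict:
    """Build a match query body, optionally restricting the returned fields."""
    body = {"size": size, "query": {"match": {field: text}}}
    if source is not None:
        body["_source"] = list(source)
    return body


def summarize_hits(search_response: dict) -> list:
    """Return (_id, _score, subject) for every hit in a _search response."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        (hit["_id"], hit["_score"], hit.get("_source", {}).get("subject"))
        for hit in hits
    ]
```

The _id is read from the hit metadata rather than from _source, since _id is not part of the stored source.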
7. Other notes
For reference, here is a text-extraction pipeline for a single attachment, single_attachment.
Run it in Kibana Dev Tools.
PUT _ingest/pipeline/single_attachment
{
  "description" : "Extract single attachment information",
  "processors" : [
    {
      "attachment" : {
        "field": "data",
        "indexed_chars" : -1,
        "ignore_missing" : true
      }
    }
  ]
}
What remains is integrating this into application code. Use of the IK Chinese analysis plugin will be covered in detail later.