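Requirement:
Use ES to search the contents of uploaded files and highlight the matches.
I previously worked in the IoT industry. I first learned about ES years ago but had never used it, so this article records how a beginner got the job done with ES.
I won't cover what Elasticsearch is, how to install it, or how to set up the environment; there is plenty of material online.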
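Thinking it through:
For keyword search over text, the text has to be uploaded to Elasticsearch, and files of any format should be supported. Plain-text files should be easy to handle, but what about files that mix images and text?
The ES text extraction plugin can do this for us.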
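Environment:
The environment already existed, so the ES version was fixed: Elasticsearch 8.6.2. According to the official site this is a very recent release (which means that when something goes wrong, causes and fixes are harder to find).
To parse files, ES uses the ingest attachment plugin to extract their text, and the file must first be converted to Base64. See the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/attachment.html. The ES 8.6.2 used here already bundles the plugin, so there is nothing to download or install separately. On older versions, install the attachment plugin by running the following in the installation directory: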
./bin/elasticsearch-plugin install ingest-attachment
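The Base64 step mentioned above can be sketched with the Java standard library alone. This is a minimal illustration, not the article's own helper: the class name `AttachmentEncoder` and its methods are hypothetical, and it reads from a local path rather than from a URL.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper (not from the article): turns raw file bytes into the
// Base64 string that the attachment processor expects in its source field.
final class AttachmentEncoder {

    // Encode raw bytes with the standard (non-URL-safe) Base64 alphabet.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Read a local file fully into memory and encode it.
    // Assumption: the file is small enough to fit in memory.
    static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(encodeFile(tmp)); // prints aGVsbG8=
        Files.delete(tmp);
    }
}
```

The resulting string is what goes into the document field that the pipeline reads from.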
Create the index
PUT /file2
{
  "mappings": {
    "properties": {
      "deptId": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "summary": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart",
            "index_options": "offsets"
          }
        }
      }
    }
  }
}
The attachment pipeline specifies how the text content is extracted and parsed
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "remove_binary": false,
        "indexed_chars": -1
      }
    }
  ]
}
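"field": "content" names the field to extract text from (here, the one holding the Base64-encoded file)
"remove_binary": false keeps the Base64 file content in the document; true discards it
"indexed_chars": -1 removes the limit on how many characters are extracted; if unset, the default is 100000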
Because highlighting is needed, I chose RestHighLevelClient, so the following dependency has to be added
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.17.4</version>
</dependency>
Create the RestHighLevelClient object
RestHighLevelClient restClient= new RestHighLevelClient(RestClient.builder(new HttpHost(elasticsearchServerIp, elasticsearchServerPort, "http")));
Upload the document content
@Async
public void addOrUpdateNew(String fileUrl, String title, String summary) {
    try {
        // NOTE: fileEntity, endpoint (the index name), toByteArray and
        // getFileTypeByDefaultTika are defined elsewhere and not shown in this article.
        // File title
        fileEntity.setTitle(title);
        // File summary
        fileEntity.setSummary(summary);
        // Detect the file type
        String fileType = getFileTypeByDefaultTika(fileUrl);
        if (fileType != null) {
            if (!fileType.contains("video") && !fileType.contains("image") && !"application/zip".equals(fileType)) {
                byte[] bytes = toByteArray(fileUrl);
                String base64 = Base64.getEncoder().encodeToString(bytes);
                fileEntity.setContent(base64);
                fileEntity.setContentType(1);
                String body = JSON.toJSONString(fileEntity);
                IndexRequest indexRequest = new IndexRequest(endpoint)
                        .source(body, XContentType.JSON)
                        // Run the attachment pipeline on upload to extract the file's text
                        .setPipeline("attachment").timeout(TimeValue.timeValueMinutes(10));
                restClient.index(indexRequest, RequestOptions.DEFAULT);
            } else {
                // Videos, images and zip archives: index the metadata only, without the pipeline
                fileEntity.setContentType(2);
                String body = JSON.toJSONString(fileEntity);
                IndexRequest indexRequest = new IndexRequest(endpoint)
                        .source(body, XContentType.JSON);
                restClient.index(indexRequest, RequestOptions.DEFAULT);
            }
        }
    } catch (Exception e) {
        // The exception is swallowed here; log it in real code.
        // e.printStackTrace();
    }
}
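The branching in the upload method (skip when the type is unknown, extract text for document-like files, index metadata only for videos, images and zip archives) can be isolated into a small, testable function. The sketch below is an assumption-laden restatement of that logic: the class `UploadRouting`, the enum `Route`, and the method `routeFor` are hypothetical names, not part of the article's code.

```java
// Hypothetical restatement of the MIME-type branching used in addOrUpdateNew.
final class UploadRouting {

    // EXTRACT_TEXT corresponds to contentType 1 (indexed through the attachment pipeline),
    // METADATA_ONLY to contentType 2 (indexed without the pipeline),
    // SKIP to the case where no type could be detected and nothing is indexed.
    enum Route { SKIP, EXTRACT_TEXT, METADATA_ONLY }

    static Route routeFor(String mimeType) {
        if (mimeType == null) {
            return Route.SKIP;
        }
        boolean binaryMedia = mimeType.contains("video")
                || mimeType.contains("image")
                || "application/zip".equals(mimeType);
        return binaryMedia ? Route.METADATA_ONLY : Route.EXTRACT_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(routeFor("application/pdf")); // prints EXTRACT_TEXT
        System.out.println(routeFor("image/png"));       // prints METADATA_ONLY
        System.out.println(routeFor(null));              // prints SKIP
    }
}
```

Keeping the decision separate from the indexing call makes it easy to add further excluded types later without touching the request-building code.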
Paged keyword search with highlighting
/**
 * @param deptId  department id
 * @param keyword search keyword
 * @param current current page (1-based)
 * @param size    number of results per page
 * @return PageVo wrapper holding one page of results
 */
public PageVo search(Long deptId, String keyword, Integer current, Integer size) {
    PageVo pageVo = new PageVo();
    pageVo.setSize(size);
    pageVo.setCurrent(current);
    try {
        // Build the search request. SearchRequest only takes index names, so pass just the index.
        SearchRequest request = new SearchRequest(endpoint);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        BoolQueryBuilder boolQueryBuilder = new BoolQueryBuilder();
        // Query conditions: filter by department, then require a phrase match in at least one field
        boolQueryBuilder.filter(QueryBuilders.termQuery("deptId", deptId))
                .should(QueryBuilders.matchPhraseQuery("summary", keyword))
                .should(QueryBuilders.matchPhraseQuery("title", keyword))
                .should(QueryBuilders.matchPhraseQuery("attachment.content", keyword))
                .minimumShouldMatch(1);
        // Configure highlighting
        HighlightBuilder hiBuilder = new HighlightBuilder();
        // Fields to highlight
        HighlightBuilder.Field title = new HighlightBuilder.Field("title");
        HighlightBuilder.Field summary = new HighlightBuilder.Field("summary");
        HighlightBuilder.Field content = new HighlightBuilder.Field("attachment.content");
        hiBuilder.field(title).field(summary).field(content);
        // Highlight markup
        hiBuilder.preTags("<span style='color:red'>");
        hiBuilder.postTags("</span>");
        hiBuilder.fragmentSize(800000); // fragment size in characters (ignored when the number of fragments is 0)
        hiBuilder.numOfFragments(0); // 0 = do not split into fragments; return the whole field with highlights
        List<String> list = new ArrayList<>();
        list.add("content");
        searchSourceBuilder.from((current - 1) * size);
        searchSourceBuilder.size(size);
        // searchSourceBuilder.sort("_id", SortOrder.DESC);
        searchSourceBuilder.query(boolQueryBuilder).highlighter(hiBuilder)
                // Source filtering: the content field holds the Base64 payload and slows queries down.
                // First argument: fields to include; second argument: fields to exclude.
                .fetchSource(null, list.toArray(new String[list.size()]));
        // Attach the search source to the request
        request.source(searchSourceBuilder);
        // Arguments of IndicesOptions.fromOptions:
        // ignoreUnavailable: whether to ignore unavailable indices
        // allowNoIndices: whether it is acceptable for no index to match
        // expandToOpenIndices: wildcard expressions expand to open indices
        // expandToClosedIndices: wildcard expressions expand to closed indices
        request.indicesOptions(IndicesOptions.fromOptions(true, true, true, false));
        // Execute the search
        SearchResponse search = restClient.search(request, RequestOptions.DEFAULT);
        // Get the hits from the response
        SearchHits hits1 = search.getHits();
        // Total number of matching documents
        TotalHits totalHits = hits1.getTotalHits();
        // Set the total for pagination
        pageVo.setTotal((int) totalHits.value);
        SearchHit[] hits = search.getHits().getHits();
        List<KnowledgeFile> ret = new ArrayList<>();
        for (SearchHit hit : hits) {
            String sourceAsString = hit.getSourceAsString();
            KnowledgeFile parsedObject = JSONObject.parseObject(sourceAsString, KnowledgeFile.class);
            Map map = JSONObject.parseObject(sourceAsString, Map.class);
            JSONObject attachment = (JSONObject) map.get("attachment");
            // The Base64 content field was excluded above, so use the extracted text instead
            if (attachment != null && parsedObject.getContentType() != 2) {
                Map map2 = JSONObject.parseObject(attachment.toJSONString(), Map.class);
                String content1 = (String) map2.get("content");
                parsedObject.setContent(content1);
            }
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            KnowledgeFile knowledgeFile = new KnowledgeFile();
            if (highlightFields.get("title") != null) {
                String highlightTitle = highlightFields.get("title").getFragments()[0].toString();
                knowledgeFile.setTitle(highlightTitle);
            } else {
                knowledgeFile.setTitle(parsedObject.getTitle());
            }
            if (highlightFields.get("summary") != null) {
                String highlightSummary = highlightFields.get("summary").getFragments()[0].toString();
                knowledgeFile.setSummary(highlightSummary);
            } else {
                knowledgeFile.setSummary(parsedObject.getSummary());
            }
            if (parsedObject.getContentType() != 2) {
                if (highlightFields.get("attachment.content") != null) {
                    String highlightContent = highlightFields.get("attachment.content").getFragments()[0].toString();
                    knowledgeFile.setContent(highlightContent.replaceAll("\\n", "<br/>"));
                } else {
                    if (parsedObject.getContent() != null) {
                        knowledgeFile.setContent(parsedObject.getContent().replaceAll("\\n", "<br/>"));
                    }
                }
                knowledgeFile.setContentType(parsedObject.getContentType());
            } else {
                knowledgeFile.setContentType(parsedObject.getContentType());
            }
            Map fileMap = JSONObject.parseObject(parsedObject.getFile(), Map.class);
            knowledgeFile.setFileName(String.valueOf(fileMap.get("fileName")));
            knowledgeFile.setFileUrl(String.valueOf(fileMap.get("fileUrl")));
            knowledgeFile.setFilePath(String.valueOf(fileMap.get("filePath")));
            ret.add(knowledgeFile);
        }
        pageVo.setResult(ret);
        return pageVo;
    } catch (Exception e) {
        // e.printStackTrace();
        return null;
    }
}
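Three small pieces of the search method recur for every field and are easy to get wrong: the page offset, the "use the highlighted fragment if there is one, otherwise the stored value" fallback, and the newline-to-`<br/>` conversion. The sketch below pulls them out using only the standard library. The class `SearchResultHelpers` and its method names are hypothetical, and the argument validation is an addition that the original code does not perform.

```java
// Hypothetical helpers mirroring three steps of the search method above.
final class SearchResultHelpers {

    // Offset passed to from(): pages are 1-based, so page 1 starts at 0.
    static int offset(int current, int size) {
        if (current < 1 || size < 1) {
            throw new IllegalArgumentException("current and size must be >= 1");
        }
        return Math.multiplyExact(current - 1, size); // fails loudly on int overflow
    }

    // Prefer the first highlighted fragment; fall back to the stored value.
    static String highlightedOrOriginal(String[] fragments, String original) {
        return (fragments != null && fragments.length > 0) ? fragments[0] : original;
    }

    // Convert line feeds to HTML line breaks for display; null stays null.
    static String toHtmlLineBreaks(String text) {
        return text == null ? null : text.replace("\n", "<br/>");
    }

    public static void main(String[] args) {
        System.out.println(offset(3, 10)); // prints 20
        System.out.println(highlightedOrOriginal(new String[] {"<span>a</span>"}, "a"));
        System.out.println(toHtmlLineBreaks("line1\nline2")); // prints line1<br/>line2
    }
}
```

With helpers like these, the three near-identical if/else blocks for title, summary and content reduce to one call each.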
At this point, inserting and querying documents basically work.