
Full-Text Search of PDF, Word, TXT and Other Document Content with ES

Author: nguby | Posted 1 year ago | Category: Toy Blog


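Requirement:
Use ES to search the content of uploaded files and highlight the matches.
I previously worked in the IoT industry. I learned about ES years ago but never actually used it, so this article records how a beginner got the job done with ES.
Introductions to Elasticsearch, its installation, and environment setup are not covered here; plenty of material is available online.
Thoughts:
Keyword search over text requires the text to be uploaded to Elasticsearch, and any file format should be supported. Plain-text files should be easy to handle, but what about files that mix images and text?
The ES text-extraction plugin handles this for us.
Environment:
The environment already existed, so the ES version was fixed: elasticsearch 8.6.2. According to the official site this is a very recent version (which means that when problems come up, causes and fixes are harder to find).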

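To parse text, ES uses the ingest attachment plugin to extract the text from a file, and the file must first be converted to base64. See the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/8.7/attachment.html. The ES 8.6.2 used here already bundles the plugin, so no separate download or installation is needed. To install the attachment plugin on older versions, run the following in the installation directory:
./bin/elasticsearch-plugin install ingest-attachment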
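
The base64 step mentioned above can be sketched in plain Java. This is only a minimal illustration, not the code used in the project: the class name `AttachmentEncoder` and the helper `toJsonBody` are hypothetical, and the JSON field name `content` is an assumption that has to match whatever source field the ingest pipeline is configured to read.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: prepares file bytes for the ingest attachment processor. */
final class AttachmentEncoder {

    /** Base64-encodes raw file bytes, the form the attachment processor expects. */
    static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    /** Builds a minimal JSON body; "content" is assumed to be the pipeline's source field. */
    static String toJsonBody(String title, byte[] fileBytes) {
        // Standard base64 output contains no quotes or backslashes, so only the title is escaped.
        String safeTitle = title.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"title\":\"" + safeTitle + "\",\"content\":\"" + encode(fileBytes) + "\"}";
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        // prints {"title":"demo.txt","content":"aGVsbG8="}
        System.out.println(toJsonBody("demo.txt", bytes));
    }
}
```

In a real project a JSON library would normally build the body; the string concatenation here only keeps the sketch free of dependencies.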

Create the index

PUT /file2
{
  "mappings": {
    "properties": {
      "deptId":{
        "type": "long"
      },
      "title":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "summary": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "ik_smart",
            "index_options" : "offsets"
          }
        }
      }
    }
  }
}

Create the attachment pipeline, which specifies the content to extract and parse

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "content",
        "remove_binary": false,
        "indexed_chars" : -1
      }
    }
  ]
}

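"field": "content" specifies the field that holds the (base64) file content.
"remove_binary": false keeps the base64 file content in the document; true discards it.
"indexed_chars": -1 removes the limit on the number of characters extracted from the file; if it is not set, the default is 100000.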
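
To make the effect of `indexed_chars` concrete, here is a small Java sketch of the truncation rule described above. It only imitates the documented behaviour (a negative value means no limit, otherwise at most N characters are kept); it is not Elasticsearch's implementation, and the class and method names are hypothetical.

```java
/** Hypothetical illustration of the indexed_chars rule; not Elasticsearch code. */
final class IndexedCharsDemo {

    /** Default used by the attachment processor when indexed_chars is not set. */
    static final int DEFAULT_INDEXED_CHARS = 100_000;

    /** Returns the part of the extracted text that would be kept for the given limit. */
    static String limit(String extractedText, int indexedChars) {
        // A negative limit (for example -1) means "no limit".
        if (indexedChars < 0 || extractedText.length() <= indexedChars) {
            return extractedText;
        }
        return extractedText.substring(0, indexedChars);
    }

    public static void main(String[] args) {
        System.out.println(limit("abcdef", 3));   // prints abc
        System.out.println(limit("abcdef", -1));  // prints abcdef
    }
}
```

With the default limit, text beyond the first 100000 characters of a long document would not be searchable, which is why the pipeline above sets -1.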
Because highlighting is needed, RestHighLevelClient was chosen, so the following dependency must be added:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.17.4</version>
</dependency>

Create the RestHighLevelClient object

RestHighLevelClient restClient= new RestHighLevelClient(RestClient.builder(new HttpHost(elasticsearchServerIp, elasticsearchServerPort, "http")));

Upload the document content

    @Async
    public void addOrUpdateNew(String fileUrl ,String title,String summary) {
        try {
            // File title
            fileEntity.setTitle(title);
            // File summary
            fileEntity.setSummary(summary);
            // Detect the file type
            String fileType = getFileTypeByDefaultTika(fileUrl);
            if (fileType != null) {
                if (!fileType.contains("video") && !fileType.contains("image") && !"application/zip".equals(fileType)) {
                    byte[] bytes = toByteArray(fileUrl);
                    String base64 = Base64.getEncoder().encodeToString(bytes);
                    fileEntity.setContent(base64);
                    fileEntity.setContentType(1);
                    String body = JSON.toJSONString(fileEntity);
                    IndexRequest indexRequest = new IndexRequest(endpoint)
                            .source(body, XContentType.JSON)
                            // While indexing, run the attachment pipeline to extract the file text
                            .setPipeline("attachment").timeout(TimeValue.timeValueMinutes(10));
                    restClient.index(indexRequest, RequestOptions.DEFAULT);
                } else {
                    fileEntity.setContentType(2);
                    String body = JSON.toJSONString(fileEntity);
                    IndexRequest indexRequest = new IndexRequest(endpoint)
                            .source(body, XContentType.JSON);
                    restClient.index(indexRequest, RequestOptions.DEFAULT);
                }
            }
        } catch (Exception e) {
            // Report the failure instead of swallowing it silently (this method is @Async and returns void)
            e.printStackTrace();
        }
    }
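
The branching in the upload method can be isolated into a small decision function. The following is a minimal sketch under assumptions: the class `FileRouting` and the enum `Route` are hypothetical names, and the MIME type is assumed to be already detected (the article uses its own `getFileTypeByDefaultTika` helper, which is not shown).

```java
/** Hypothetical sketch of the upload decision: extract text, index metadata only, or skip. */
final class FileRouting {

    enum Route { EXTRACT_TEXT, METADATA_ONLY, SKIP }

    /** Mirrors the conditions in the upload method for an already-detected MIME type. */
    static Route choose(String mimeType) {
        if (mimeType == null) {
            return Route.SKIP; // unknown type: the upload method indexes nothing
        }
        boolean binaryMedia = mimeType.contains("video")
                || mimeType.contains("image")
                || "application/zip".equals(mimeType);
        // Videos, images and zip archives are indexed without the attachment pipeline.
        return binaryMedia ? Route.METADATA_ONLY : Route.EXTRACT_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(choose("application/pdf")); // prints EXTRACT_TEXT
        System.out.println(choose("image/png"));       // prints METADATA_ONLY
        System.out.println(choose(null));              // prints SKIP
    }
}
```

Keeping the rule in one place makes it easier to test and to extend if other archive or media types need to bypass text extraction.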

Paged keyword query with highlighting

    /**
     * @param deptId department id
     * @param keyword search keyword
     * @param current current page (1-based)
     * @param size number of records per page
     * @return PageVo the wrapped paging object
     */
    public PageVo search(Long deptId,String keyword, Integer current, Integer size) {
        PageVo pageVo = new PageVo();
        pageVo.setSize(size);
        pageVo.setCurrent(current);
        try {

            // Create the search request; the constructor arguments are index names, so no HTTP method is passed
            SearchRequest request = new SearchRequest(endpoint);
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
            BoolQueryBuilder boolQueryBuilder = new BoolQueryBuilder();
            // Query conditions: filter by department, then require a phrase match in at least one text field
            boolQueryBuilder.filter(QueryBuilders.termsQuery("deptId",deptId))
            		.should(QueryBuilders.matchPhraseQuery("summary", keyword))
                    .should(QueryBuilders.matchPhraseQuery("title", keyword))
                    .should(QueryBuilders.matchPhraseQuery("attachment.content", keyword))
                    .minimumShouldMatch(1);
            // Configure highlighting
            HighlightBuilder hiBuilder = new HighlightBuilder();
            // Fields to highlight
            HighlightBuilder.Field title = new HighlightBuilder.Field("title");
            HighlightBuilder.Field summary = new HighlightBuilder.Field("summary");
            HighlightBuilder.Field content = new HighlightBuilder.Field("attachment.content");
            hiBuilder.field(title).field(summary).field(content);
            // Highlight markup
            hiBuilder.preTags("<span style='color:red'>");
            hiBuilder.postTags("</span>");
            hiBuilder.fragmentSize(800000); // fragment size in characters (ignored when numOfFragments is 0)
            hiBuilder.numOfFragments(0); // 0 = return the whole field content highlighted instead of fragments
            List<String> list = new ArrayList<>();
            list.add("content");
            searchSourceBuilder.from((current - 1) * size);
            searchSourceBuilder.size(size);
//            searchSourceBuilder.sort("_id", SortOrder.DESC);
            searchSourceBuilder.query(boolQueryBuilder).highlighter(hiBuilder)
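                    // Source filtering: the content field holds base64 and slows the query down.
                    // The first argument lists fields to include, the second lists fields to exclude.
                    .fetchSource(null, list.toArray(new String[list.size()]));
            // Attach the search source to the request
            request.source(searchSourceBuilder);
            // Arguments of IndicesOptions.fromOptions:
            // ignore_unavailable: whether to ignore unavailable indices
            // allow_no_indices: whether it is acceptable for no index to match
            // expandToOpenIndices: wildcard expressions expand to open indices
            // expandToClosedIndices: wildcard expressions expand to closed indices
            request.indicesOptions(IndicesOptions.fromOptions(true, true, true, false));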
            // Execute the search
            SearchResponse search = restClient.search(request, RequestOptions.DEFAULT);
            // Get the hits from the response
            SearchHits hits1 = search.getHits();
            // Total number of matching documents
            TotalHits totalHits = hits1.getTotalHits();
            // Set the total count for paging
            pageVo.setTotal((int) totalHits.value);
            SearchHit[] hits = search.getHits().getHits();
            List<KnowledgeFile> ret = new ArrayList<>();
            for (SearchHit hit : hits) {
                String sourceAsString = hit.getSourceAsString();
                KnowledgeFile parsedObject = JSONObject.parseObject(sourceAsString, KnowledgeFile.class);
                Map map = JSONObject.parseObject(sourceAsString, Map.class);
                JSONObject attachment = (JSONObject) map.get("attachment");
                if (attachment != null && parsedObject.getContentType() != 2) {
                    Map map2 = JSONObject.parseObject(attachment.toJSONString(), Map.class);
                    String content1 = (String) map2.get("content");
                    parsedObject.setContent(content1);
                }
                Map<String, HighlightField> highlightFields = hit.getHighlightFields();
                KnowledgeFile knowledgeFile = new KnowledgeFile();
                if (highlightFields.get("title") != null) {
                    String highlightTitle = highlightFields.get("title").getFragments()[0].toString();
                    knowledgeFile.setTitle(highlightTitle);
                } else {
                    knowledgeFile.setTitle(parsedObject.getTitle());
                }
                if (highlightFields.get("summary") != null) {
                    String highlightSummary = highlightFields.get("summary").getFragments()[0].toString();
                    knowledgeFile.setSummary(highlightSummary);
                } else {
                    knowledgeFile.setSummary(parsedObject.getSummary());
                }
                if (parsedObject.getContentType() != 2) {
                    if (highlightFields.get("attachment.content") != null) {
                        String highlightContent = highlightFields.get("attachment.content").getFragments()[0].toString();
                        knowledgeFile.setContent(highlightContent.replaceAll("\\n", "<br/>"));
                    } else {
                        if (parsedObject.getContent() != null) {
                            knowledgeFile.setContent(parsedObject.getContent().replaceAll("\\n", "<br/>"));
                        }
                    }
                    knowledgeFile.setContentType(parsedObject.getContentType());
                } else {
                    knowledgeFile.setContentType(parsedObject.getContentType());
                }
                Map fileMap = JSONObject.parseObject(parsedObject.getFile(), Map.class);
                knowledgeFile.setFileName(String.valueOf(fileMap.get("fileName")));
                knowledgeFile.setFileUrl(String.valueOf(fileMap.get("fileUrl")));
                knowledgeFile.setFilePath(String.valueOf(fileMap.get("filePath")));
                ret.add(knowledgeFile);
            }

            pageVo.setResult(ret);
            return pageVo;
        } catch (Exception e) {
            // Report the failure; callers must handle the null return value
            e.printStackTrace();
            return null;
        }
    }
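
Two small rules are buried in the search method: the `from` offset is computed from a 1-based page number, and each displayed field falls back to the stored value when no highlight is returned. The sketch below isolates them in plain Java; the class `SearchPaging` and its method names are hypothetical and do not appear in the original code.

```java
/** Hypothetical helpers isolating the paging and highlight-fallback rules of the search method. */
final class SearchPaging {

    /** Converts a 1-based page number and a page size into the "from" offset. */
    static int fromOffset(int current, int size) {
        if (current < 1 || size < 1) {
            throw new IllegalArgumentException("current and size must be >= 1");
        }
        return (current - 1) * size;
    }

    /** Prefers the highlighted text and falls back to the stored value when there is none. */
    static String displayText(String highlighted, String stored) {
        return highlighted != null ? highlighted : stored;
    }

    /** Turns line breaks into <br/> so extracted text keeps its lines in HTML. */
    static String toHtmlLineBreaks(String text) {
        return text == null ? null : text.replace("\n", "<br/>");
    }

    public static void main(String[] args) {
        System.out.println(fromOffset(3, 10));                  // prints 20
        System.out.println(displayText(null, "plain title"));   // prints plain title
        System.out.println(toHtmlLineBreaks("line1\nline2"));   // prints line1<br/>line2
    }
}
```

Validating `current` and `size` up front avoids sending a negative `from` to Elasticsearch, which the original method would do for page 0.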

With this in place, inserting and querying documents basically works.
