What is full-text search
Unlike term-level queries, full text queries are designed to search and match text data by relevance. They analyze the input text first: it is split into terms (individual words) through tokenization, and operations such as stemming and normalization are applied.
Key characteristics of full-text search:
- The query text is analyzed, and searching and matching run against the resulting terms rather than the raw string.
- Matching is relevance-based. A relevance algorithm scores how well each document matches the query, and results are sorted by that score. The score can take term frequency, weights, and other factors into account.
- Full text queries suit fields that hold free text, such as document content, article bodies, or product descriptions.
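The three characteristics above can be sketched with a toy inverted index in plain Java. This is only an illustration under simplifying assumptions: the class `TinyFullTextIndex`, its lowercase-and-split `analyze` step, and its "number of matched terms" score are hypothetical stand-ins, not how Elasticsearch analyzes text or computes relevance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index, not an Elasticsearch API.
final class TinyFullTextIndex {
    // term -> ids of the documents that contain it (the inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Analysis step: lowercase the text and split it on non-word characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Indexes a document under each of its analyzed terms.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns ids of documents matching at least one query term,
    // ordered by how many distinct query terms they contain (most first).
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new TreeMap<>();
        for (String term : new LinkedHashSet<>(analyze(query))) {
            for (int docId : postings.getOrDefault(term, Collections.emptySet())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> scores.get(b) - scores.get(a)); // stable: ties stay in id order
        return ranked;
    }

    public static void main(String[] args) {
        TinyFullTextIndex index = new TinyFullTextIndex();
        index.add(1, "Quick brown fox");
        index.add(2, "Lazy dog");
        index.add(3, "Quick red fox jumps");
        System.out.println(index.search("quick fox jumps")); // [3, 1]
    }
}
```

Document 3 contains all three query terms and document 1 only two, so document 3 ranks first. A real engine replaces both the analyzer and the score with far more refined versions, but the flow (analyze, look up terms, rank) is the same.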
1. Data preparation
PUT full_index
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"age": {
"type": "long"
},
"description" : {
"type" : "text",
"analyzer": "ik_max_word",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
The test data is as follows:
{name=張三, description=北京故宮圓明園, age=11}
{name=王五, description=南京總統府, age=15}
{name=李四, description=北京市天安門廣場, age=18}
{name=富貴, description=南京市中山陵, age=22}
{name=來福, description=山東濟南趵突泉, age=8}
{name=憨憨, description=安徽黃山九華山, age=27}
{name=小七, description=上海東方明珠, age=31}
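2. match query
Match query: match analyzes the search text into terms and then looks up documents by those terms.
match supports the following parameters:
- query: the text to match
- operator: how the analyzed terms are combined
  - and: every term must match
  - or: at least one term must match (default)
- minimum_should_match: the minimum number (or percentage) of analyzed terms that a document must match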
DSL: search for documents whose description field matches “南京總統府”
GET full_index/_search
{
"query": {
"match": {
"description": "南京總統府"
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.2667978,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.2667978,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京總統府"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0751815,
"_source" : {
"name" : "富貴",
"age" : 22,
"description" : "南京市中山陵"
}
}
]
}
}
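Spring Boot implementation:
import javax.annotation.Resource;
import io.swagger.annotations.ApiOperation;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;
// The later handler methods in this article belong to this same class.
@RestController // assumed; the original snippet does not show the class declaration
public class FullTextQuery {
    private static final Logger LOGGER = LoggerFactory.getLogger(FullTextQuery.class);
    private static final String INDEX_NAME = "full_index";
    @Resource
    private RestHighLevelClient client;
    @RequestMapping(value = "/match_query", method = RequestMethod.GET)
    @ApiOperation(value = "DSL - match_query")
    public void match_query() throws Exception {
        // Build a search request against the index
        SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
        // match query on the description field
        searchRequest.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("description", "南京總統府")));
        // Print the hits
        printLog(client.search(searchRequest, RequestOptions.DEFAULT));
    }
    // Prints the hit count and each hit's _source; the label means "length of the returned hits array".
    private void printLog(SearchResponse searchResponse) {
        SearchHits hits = searchResponse.getHits();
        System.out.println("返回hits數組長度:" + hits.getHits().length);
        for (SearchHit hit : hits.getHits()) {
            System.out.println(hit.getSourceAsMap().toString());
        }
    }
}
The output is as follows:
返回hits數組長度:2
{name=王五, description=南京總統府, age=15}
{name=富貴, description=南京市中山陵, age=22}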
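Analysis: searching for “南京總統府” returned two documents. Why was “南京市中山陵” matched as well?
The reason is that a full text query splits the search text into terms. The description field was mapped with the “ik_max_word” analyzer when the index was created, and that analyzer breaks “南京總統府” into the following terms, each of which is then looked up in the inverted index: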
POST _analyze
{
"analyzer": "ik_max_word",
"text": ["南京總統府"]
}
{
"tokens" : [
{
"token" : "南京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "總統府",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "總統",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "府",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
}
]
}
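One of these terms is "南京", which is why searching for “南京總統府” also finds the “南京市中山陵” document. The operator parameter of the match query follows directly from this: with and, a document must contain every one of the analyzed terms.
For example, insert one more document:
POST /full_index/_bulk
{"index":{"_id":8}}
{"name":"張三","age":11,"description":"南京總統"}
Searching for "南京總統" with operator set to and now returns two documents: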
GET full_index/_search
{
"query": {
"match": {
"description": {
"query": "南京總統",
"operator": "and"
}
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.898355,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.898355,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "南京總統"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.35562,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京總統府"
}
}
]
}
}
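However, searching for "南京總統府" with the same and operator returns only one document. The analyzed terms include "府", which document 8 (“南京總統”) does not contain, so that document no longer satisfies every term.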
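The operator behaviour described in this section can be modelled in a few lines of plain Java. This is a hedged sketch, not Elasticsearch code: the class `MatchOperatorDemo` and its methods are hypothetical, and the term lists for documents 4 and 8 are assumptions made for illustration (only the terms of “南京總統府” come from the _analyze output shown earlier).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of how a match query combines analyzed terms; not an Elasticsearch API.
final class MatchOperatorDemo {
    // Terms of the query "南京總統府", as listed in the _analyze output earlier in this section.
    static final Set<String> QUERY = terms("南京", "總統府", "總統", "府");
    // Document 2 holds the same text as the query, so it yields the same terms.
    static final Set<String> DOC_2 = terms("南京", "總統府", "總統", "府");
    // Assumed term lists for the other two documents (not real analyzer output).
    static final Set<String> DOC_4 = terms("南京市", "南京", "中山陵");
    static final Set<String> DOC_8 = terms("南京", "總統");

    static Set<String> terms(String... values) {
        return new LinkedHashSet<>(Arrays.asList(values));
    }

    // Number of query terms that also occur in the document.
    static int matchedTerms(Set<String> query, Set<String> doc) {
        int count = 0;
        for (String term : query) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // operator = or: one shared term is enough.
    static boolean matchesOr(Set<String> query, Set<String> doc) {
        return matchedTerms(query, doc) >= 1;
    }

    // operator = and: every query term must be present.
    static boolean matchesAnd(Set<String> query, Set<String> doc) {
        return matchedTerms(query, doc) == query.size();
    }

    // minimum_should_match given as a plain number of terms.
    static boolean matchesAtLeast(Set<String> query, Set<String> doc, int minimum) {
        return matchedTerms(query, doc) >= minimum;
    }

    public static void main(String[] args) {
        System.out.println("or,  doc 4: " + matchesOr(QUERY, DOC_4));         // true
        System.out.println("and, doc 2: " + matchesAnd(QUERY, DOC_2));        // true
        System.out.println("and, doc 8: " + matchesAnd(QUERY, DOC_8));        // false
        System.out.println("min 2, doc 8: " + matchesAtLeast(QUERY, DOC_8, 2)); // true
    }
}
```

Under these assumptions, document 4 matches with or because it shares the single term "南京", while document 8 fails with and because it lacks "總統府" and "府". A minimum_should_match of 2 sits between the two operators: document 8 passes, document 4 does not.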
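3. multi_match query
Multi-field query: runs the same search text against several fields, and whether the text is analyzed depends on each field's type. The highest-scoring documents come first. Note: for an analyzed (text) field the search text is split into terms before matching; for a field that is not analyzed (such as keyword) the search text is matched as a whole.
DSL: find documents where “name” or “description” matches “北京王五”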
GET full_index/_search
{
"query": {
"multi_match": {
"query": "北京王五",
"fields": ["name","description"]
}
}
}
The response is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 3.583519,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.583519,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京總統府"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.4959542,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "北京故宮圓明園"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.98645234,
"_source" : {
"name" : "李四",
"age" : 18,
"description" : "北京市天安門廣場"
}
}
]
}
}
Spring Boot implementation:
// Add to the FullTextQuery controller shown earlier; it uses that class's client, INDEX_NAME, and printLog.
@RequestMapping(value = "/multi_match", method = RequestMethod.GET)
@ApiOperation(value = "DSL - multi_match")
public void multi_match() throws Exception {
    // Build a search request against the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // multi_match query over the name and description fields
    searchRequest.source(new SearchSourceBuilder().query(
            QueryBuilders.multiMatchQuery("北京王五", new String[]{"name", "description"})));
    // Print the hits
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
The output is as follows:
返回hits數組長度:3
{name=王五, description=南京總統府, age=15}
{name=張三, description=北京故宮圓明園, age=11}
{name=李四, description=北京市天安門廣場, age=18}
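As noted above, an analyzed field has the search text split into terms before matching, whereas a field that is not analyzed is matched against the search text as a whole.
To test this, query the un-analyzed “description.keyword” sub-field instead of “description”: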
GET full_index/_search
{
"query": {
"multi_match": {
"query": "北京王五",
"fields": ["name","description.keyword"]
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.583519,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.583519,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京總統府"
}
}
]
}
}
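As shown, with “description.keyword”, that is, without analyzing “description”, only one document is returned. No description equals “北京王五” as a whole, so the hit comes solely from the “name” field, whose value “王五” matches terms of the analyzed search text.
4. match_phrase query
Phrase search (match_phrase) analyzes the search text, then looks for documents that contain every resulting term, with the terms adjacent to each other. The slop parameter sets the maximum distance allowed between the terms.
DSL: find documents whose “description” field contains the phrase “北京故宮”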
GET full_index/_search
{
"query": {
"match_phrase": {
"description": {
"query": "北京故宮"
}
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.5884824,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.5884824,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "北京故宮圓明園"
}
}
]
}
}
Spring Boot implementation:
// Add to the FullTextQuery controller shown earlier; it uses that class's client, INDEX_NAME, and printLog.
@RequestMapping(value = "/match_phrase", method = RequestMethod.GET)
@ApiOperation(value = "DSL - match_phrase")
public void match_phrase() throws Exception {
    // Build a search request against the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // match_phrase query on the description field
    searchRequest.source(new SearchSourceBuilder().query(
            QueryBuilders.matchPhraseQuery("description", "北京故宮")));
    // Print the hits
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
The output is as follows:
返回hits數組長度:1
{name=張三, description=北京故宮圓明園, age=11}
Question: searching the “description” field for the phrase “北京故宮” returns a document, so why does searching for “北京圓明園” return nothing?
GET full_index/_search
{
"query": {
"match_phrase": {
"description": {
"query": "北京圓明園"
}
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Cause: first look at how “北京故宮圓明園” is analyzed:
POST _analyze
{
"analyzer": "ik_max_word",
"text": ["北京故宮圓明園"]
}
{
"tokens" : [
{
"token" : "北京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "故宮",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "圓明園",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
}
]
}
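“北京” and “圓明園” are not adjacent terms: their positions (0 and 2) are separated by one other term, so the plain phrase query fails. This is where “slop” comes in.
The slop parameter tells match_phrase how far apart the terms may be while the document still counts as a match: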
GET full_index/_search
{
"query": {
"match_phrase": {
"description": {
"query": "北京圓明園",
"slop": 1
}
}
}
}
The response is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.4425511,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.4425511,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "北京故宮圓明園"
}
}
]
}
}
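The effect of slop can be reproduced with a small position-based model in plain Java. This is a sketch under stated assumptions, not Elasticsearch code: the class `PhraseSlopDemo` is hypothetical, each term is assumed to occur at most once in the document, and the search text “北京圓明園” is assumed to analyze into the two terms “北京” and “圓明園”. The document positions are the ones from the _analyze output shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of phrase matching with slop; not an Elasticsearch API.
final class PhraseSlopDemo {
    // Term positions of document 1 ("北京故宮圓明園"), from the _analyze output earlier in this section.
    static final Map<String, Integer> DOC_1 = new LinkedHashMap<>();
    static {
        DOC_1.put("北京", 0);
        DOC_1.put("故宮", 1);
        DOC_1.put("圓明園", 2);
    }

    // Smallest slop at which the phrase matches, or -1 if a term is missing.
    // Simplification: every term occurs at most once in the document.
    static int requiredSlop(Map<String, Integer> docPositions, String... phraseTerms) {
        if (phraseTerms.length == 0) {
            return -1;
        }
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int i = 0; i < phraseTerms.length; i++) {
            Integer position = docPositions.get(phraseTerms[i]);
            if (position == null) {
                return -1;
            }
            int shift = position - i; // document position relative to the term's place in the phrase
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        return maxShift - minShift;
    }

    // True when the phrase matches the document within the given slop.
    static boolean matchesPhrase(Map<String, Integer> docPositions, int slop, String... phraseTerms) {
        int required = requiredSlop(docPositions, phraseTerms);
        return required >= 0 && required <= slop;
    }

    public static void main(String[] args) {
        System.out.println(requiredSlop(DOC_1, "北京", "故宮"));        // 0
        System.out.println(requiredSlop(DOC_1, "北京", "圓明園"));      // 1
        System.out.println(matchesPhrase(DOC_1, 0, "北京", "圓明園"));  // false
        System.out.println(matchesPhrase(DOC_1, 1, "北京", "圓明園"));  // true
    }
}
```

In this model “北京故宮” needs no slop because both terms keep their relative positions, while “北京圓明園” needs a slop of 1 because “圓明園” sits one position further away than the phrase expects. The same arithmetic gives 2 for two adjacent terms in reversed order.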
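5. query_string query
Lets you specify AND | OR | NOT conditions inside a single query string and, like multi_match query, supports searching several fields. It is similar to match, but match requires a field name, whereas query_string searches all fields by default when none is given, so its scope is broader. Note: for an analyzed field the search text is analyzed before matching; for a field that is not analyzed it is matched as is.
DSL: search all fields of the index for documents matching “安徽張三”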
GET full_index/_search
{
"query": {
"query_string": {
"query": "安徽張三"
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.5618675,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.5618675,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "北京故宮圓明園"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.5618675,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "南京總統"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.7342355,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黃山九華山"
}
}
]
}
}
Spring Boot implementation:
// Add to the FullTextQuery controller shown earlier; it uses that class's client, INDEX_NAME, and printLog.
@RequestMapping(value = "/query_string", method = RequestMethod.GET)
@ApiOperation(value = "DSL - query_string")
public void query_string() throws Exception {
    // Build a search request against the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // query_string query with no fields given, so all fields are searched
    searchRequest.source(new SearchSourceBuilder().query(
            QueryBuilders.queryStringQuery("安徽張三")));
    // Print the hits
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
The output is as follows:
返回hits數組長度:3
{name=張三, description=北京故宮圓明園, age=11}
{name=張三, description=南京總統, age=11}
{name=憨憨, description=安徽黃山九華山, age=27}
Searching a specific field: documents whose “description” field matches “安徽張三”
GET full_index/_search
{
"query": {
"query_string": {
"query": "安徽張三",
"fields": ["description"]
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7342355,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.7342355,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黃山九華山"
}
}
]
}
}
Searching several fields: documents that match both “安徽” and “憨憨”
GET full_index/_search
{
"query": {
"query_string": {
"query": "安徽 AND 憨憨",
"fields": ["description","name"]
}
}
}
The response is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 6.6615744,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 6.6615744,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黃山九華山"
}
}
]
}
}
Combining conditions with parentheses: documents that match both “安徽” and “憨憨”, or that match “張三”
GET full_index/_search
{
"query": {
"query_string": {
"query": "(安徽 AND 憨憨)OR 張三",
"fields": ["description","name"]
}
}
}
The response is as follows:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 6.6615744,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 6.6615744,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黃山九華山"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.5618675,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "北京故宮圓明園"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.5618675,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "南京總統"
}
}
]
}
}
In effect, query_string query behaves like match query and multi_match query used together.
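The boolean logic of the query strings in this section can be imitated with `java.util.function.Predicate` in plain Java. This is a rough sketch, not Elasticsearch code: the class `QueryStringBooleanDemo` is hypothetical, and each document is reduced to one assumed set of terms covering both the name and description fields, which is not what the real analyzers would produce.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of boolean operators in a query string; not an Elasticsearch API.
final class QueryStringBooleanDemo {
    // Simplified term sets: name and description merged per document (assumed, not analyzer output).
    static final Set<String> DOC_1 = terms("張三", "北京", "故宮", "圓明園");
    static final Set<String> DOC_2 = terms("王五", "南京", "總統府");
    static final Set<String> DOC_6 = terms("憨憨", "安徽", "黃山", "九華山");
    static final Set<String> DOC_8 = terms("張三", "南京", "總統");

    static Set<String> terms(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // A clause that is true when the document contains the term in any searched field.
    static Predicate<Set<String>> has(String term) {
        return doc -> doc.contains(term);
    }

    // (安徽 AND 憨憨) OR 張三
    static final Predicate<Set<String>> AND_OR = has("安徽").and(has("憨憨")).or(has("張三"));

    // 張三 AND NOT 北京
    static final Predicate<Set<String>> AND_NOT = has("張三").and(has("北京").negate());

    public static void main(String[] args) {
        System.out.println(AND_OR.test(DOC_6));  // true: contains both 安徽 and 憨憨
        System.out.println(AND_OR.test(DOC_8));  // true: contains 張三
        System.out.println(AND_OR.test(DOC_2));  // false: satisfies neither side
        System.out.println(AND_NOT.test(DOC_1)); // false: contains 北京
        System.out.println(AND_NOT.test(DOC_8)); // true
    }
}
```

Under these assumptions the combined condition accepts documents 6, 1, and 8 and rejects document 2, which agrees with the three hits in the response above. The model ignores scoring entirely; it only shows which documents pass the boolean filter.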
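6. simple_query_string
Similar to query_string, but it ignores invalid syntax instead of raising an error and supports only part of the query syntax. The keywords AND, OR, and NOT are not supported and are treated as ordinary text. The following operators provide the same logic:
- “+” in place of “AND”
- “|” in place of “OR”
- “-” in place of “NOT”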
GET full_index/_search
{
"query": {
"simple_query_string": {
"query": "(安徽 + 憨憨) | 張三",
"fields": ["description","name"]
}
}
}
The response is as follows:
{
"took" : 41,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 6.6615744,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 6.6615744,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黃山九華山"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.5618675,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "北京故宮圓明園"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.5618675,
"_source" : {
"name" : "張三",
"age" : 11,
"description" : "南京總統"
}
}
]
}
}