1. Overview of Elasticsearch Query Analyzers
1.1 What Is Elasticsearch
Elasticsearch is an open-source distributed search and analytics engine built on Apache Lucene. It offers powerful query and aggregation capabilities, supports real-time search over large volumes of data, and is designed for high availability and horizontal scalability.
1.2 What Analyzers Do
In Elasticsearch, an analyzer processes text: at index time it breaks field values into tokens that go into the inverted index, and at search time it applies the same processing to the query string so that query terms line up with indexed terms. Analyzers therefore have a direct impact on both the accuracy and the efficiency of search.
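The indexing side of this can be sketched in plain Java: break each document into terms, then record for every term the set of documents that contain it. This is a toy model of an inverted index for intuition only, not Elasticsearch's actual data structure:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // Build a term -> set-of-document-ids map from a list of documents
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // crude "analysis": lowercase and split on non-letter characters
            for (String term : docs.get(docId).toLowerCase().split("[^a-z]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = build(List.of(
                "Elasticsearch is a search engine",
                "Lucene is a search library"));
        System.out.println(index.get("search")); // both documents contain "search"
    }
}
```

A search for a term then becomes a single map lookup, which is why tokenization choices at index time decide what can be found at query time.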
2. Analyzer Types
2.1 Standard Analyzer
The standard analyzer is Elasticsearch's default. It combines the standard tokenizer with a lowercase filter; stop-word removal can be configured but is disabled by default in recent versions.
// 創(chuàng)建查詢分析器
Analyzer analyzer = new StandardAnalyzer();
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
2.2 Simple Analyzer
The simple analyzer splits the input at every non-letter character and lowercases all tokens.
// 創(chuàng)建查詢分析器
Analyzer analyzer = new SimpleAnalyzer();
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
2.3 Whitespace Analyzer
The whitespace analyzer splits the input on whitespace only, with no lowercasing or any other processing.
// 創(chuàng)建查詢分析器
Analyzer analyzer = new WhitespaceAnalyzer();
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
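The difference between the two analyzers is easy to see without a Lucene dependency. This plain-Java sketch mimics their splitting rules (an approximation of the real tokenizers, for illustration only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitSketch {
    // SimpleAnalyzer-style: split on anything that is not a letter, lowercase everything
    static List<String> simpleStyle(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // WhitespaceAnalyzer-style: split on whitespace only, keep case and punctuation
    static List<String> whitespaceStyle(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
        System.out.println(simpleStyle(text));     // "restful" lowercased, comma stripped
        System.out.println(whitespaceStyle(text)); // "distributed," and "RESTful" kept as typed
    }
}
```

Because the whitespace variant preserves punctuation attached to words ("distributed," stays one token), it is usually only appropriate for pre-cleaned or structured text.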
2.4 Stop Analyzer
The stop analyzer lowercases tokens and removes common English stop words such as "a", "the", and "is".
// 創(chuàng)建查詢分析器,指定停用詞列表
Analyzer analyzer = new StopAnalyzer(CharArraySet.EMPTY_SET);
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
2.5 Keyword Analyzer
The keyword analyzer treats the entire input as a single token, with no tokenization or other processing.
// 創(chuàng)建查詢分析器
Analyzer analyzer = new KeywordAnalyzer();
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
2.6 Pattern Analyzer
The pattern analyzer splits the input using a regular expression. The pattern defines the token separators, so "\\W+" here splits on runs of non-word characters.
// 創(chuàng)建查詢分析器,指定正則表達(dá)式
Analyzer analyzer = new PatternAnalyzer(Pattern.compile("\\W+"), true, true);
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
2.7 語(yǔ)言分析器
語(yǔ)言分析器是針對(duì)不同語(yǔ)言的特定分析器,可以提供更好的分詞和處理效果
2.7.1 English Analyzer
英語(yǔ)分析器基于英語(yǔ)特定的分詞規(guī)則和處理方式
// 創(chuàng)建查詢分析器
Analyzer analyzer = new EnglishAnalyzer();
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
2.7.2 Chinese Analyzer
The Chinese analyzer applies Chinese-specific word segmentation; SmartChineseAnalyzer ships in the separate lucene-analyzers-smartcn module.
// 創(chuàng)建查詢分析器
Analyzer analyzer = new SmartChineseAnalyzer();
// 使用查詢分析器進(jìn)行分詞
String text = "Elasticsearch is a distributed, RESTful search and analytics engine.";
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
// 遍歷分詞結(jié)果
tokenStream.reset();
while (tokenStream.incrementToken()) {
System.out.println(termAttribute.toString());
}
tokenStream.end();
tokenStream.close();
3. Custom Query Analyzers
3.1 Custom Tokenizer
Code example (the Elasticsearch plugin base classes shown here vary slightly between versions):
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

class CustomTokenizerFactory extends AbstractTokenizerFactory {
    public CustomTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        // constructor argument order differs between Elasticsearch versions
        super(indexSettings, settings, name);
    }
    @Override
    public Tokenizer create() {
        return new CustomTokenizer();
    }
}

// Tokenizer is an abstract class; tokens are emitted through incrementToken()
// (the old next()/Token API was removed from Lucene long ago)
class CustomTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // custom tokenization logic: read from the input Reader, fill termAtt and
        // return true for each token; return false when the input is exhausted
        return false;
    }
}
3.2 Custom Filter
Code example
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

class CustomFilterFactory extends AbstractTokenFilterFactory {
    public CustomFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }
    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new CustomFilter(tokenStream);
    }
}

// Token filters also follow the incrementToken() contract
class CustomFilter extends TokenFilter {
    public CustomFilter(TokenStream input) {
        super(input);
    }
    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // custom filter logic: modify the current token's attributes in place here
        return true;
    }
}
3.3 Custom Analyzer
Code example
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.index.analysis.TokenizerFactory;

class CustomAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {
    private final Analyzer analyzer;
    public CustomAnalyzerProvider(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
        TokenizerFactory tokenizerFactory = new CustomTokenizerFactory(indexSettings, environment, name, settings);
        TokenFilterFactory tokenFilterFactory = new CustomFilterFactory(indexSettings, environment, name, settings);
        this.analyzer = new CustomAnalyzer(tokenizerFactory, tokenFilterFactory);
    }
    @Override
    public Analyzer get() {
        return this.analyzer;
    }
}

// Note: Analyzer is the Lucene class, not an Elasticsearch interface
class CustomAnalyzer extends Analyzer {
    private final TokenizerFactory tokenizerFactory;
    private final TokenFilterFactory tokenFilterFactory;
    public CustomAnalyzer(TokenizerFactory tokenizerFactory, TokenFilterFactory tokenFilterFactory) {
        this.tokenizerFactory = tokenizerFactory;
        this.tokenFilterFactory = tokenFilterFactory;
    }
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = tokenizerFactory.create();
        TokenStream tokenStream = tokenFilterFactory.create(tokenizer);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
}
3.4 Using the Custom Analyzer
Code example
import java.net.InetAddress;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

// Create a TransportClient (deprecated since 7.0 and removed in 8.x;
// prefer the High Level REST Client in new code)
Settings settings = Settings.builder()
    .put("cluster.name", "your_cluster_name")
    .build();
TransportClient client = new PreBuiltTransportClient(settings);
// Add a node address
client.addTransportAddress(new TransportAddress(InetAddress.getByName("your_host_name"), 9300));
// Build a query that applies the custom analyzer at search time
QueryBuilder query = QueryBuilders.matchQuery("your_field", "your_query_text")
    .analyzer("your_custom_analyzer");
// Execute the search
SearchResponse response = client.prepareSearch("your_index_name")
    .setQuery(query)
    .get();
4. Common Use Cases
4.1 Fuzzy Search
Fuzzy search matches terms approximately rather than exactly, which makes it useful for tolerating typos in user input. In Elasticsearch this is done with a Fuzzy Query.
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder queryBuilder = QueryBuilders.fuzzyQuery("field", "keyword")
.fuzziness(Fuzziness.AUTO)
.prefixLength(3)
.maxExpansions(10);
searchSourceBuilder.query(queryBuilder);
SearchRequest searchRequest = new SearchRequest("index");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
- field is the name of the field to search.
- keyword is the user-supplied search term.
- fuzziness sets the maximum edit distance; Fuzziness.AUTO chooses it based on the term's length.
- prefixLength is the number of leading characters that must match exactly.
- maxExpansions caps how many terms the fuzzy query may expand to.
4.2 細(xì)粒度搜索
細(xì)粒度搜索,是指根據(jù)用戶提供的關(guān)鍵詞進(jìn)行精確匹配,包括前綴匹配、通配符匹配和正則表達(dá)式匹配。
4.2.1 前綴匹配
前綴匹配,是指根據(jù)用戶提供的關(guān)鍵詞匹配字段值的前綴。在 Elasticsearch 中,可以使用 Prefix Query 實(shí)現(xiàn)前綴匹配。
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder queryBuilder = QueryBuilders.prefixQuery("field", "prefix");
searchSourceBuilder.query(queryBuilder);
SearchRequest searchRequest = new SearchRequest("index");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
- field is the name of the field to match against.
- prefix is the user-supplied prefix.
4.2.2 Wildcard Matching
Wildcard matching uses patterns that contain wildcard characters. In Elasticsearch this is done with a Wildcard Query.
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder queryBuilder = QueryBuilders.wildcardQuery("field", "keyword*");
searchSourceBuilder.query(queryBuilder);
SearchRequest searchRequest = new SearchRequest("index");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
- field is the name of the field to match against.
- keyword* is the user-supplied pattern; * matches any number of characters, and ? matches exactly one.
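Under the hood a wildcard pattern is just a restricted pattern language. This plain-Java sketch shows the matching semantics by translating a wildcard pattern into an equivalent regular expression (Elasticsearch actually compiles wildcards to automata, but the results agree):

```java
import java.util.regex.Pattern;

public class WildcardSketch {
    // Translate a wildcard pattern into an equivalent regex and test a value against it
    static boolean wildcardMatch(String pattern, String value) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            switch (c) {
                case '*': regex.append(".*"); break; // any run of characters
                case '?': regex.append('.');  break; // exactly one character
                default:  regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), value);
    }

    public static void main(String[] args) {
        System.out.println(wildcardMatch("keyword*", "keywords")); // true
        System.out.println(wildcardMatch("key?ord", "keyword"));   // true
        System.out.println(wildcardMatch("keyword*", "keywor"));   // false
    }
}
```

Note that a wildcard query runs against individual indexed terms, and a leading * forces a scan over the whole term dictionary, which is why leading wildcards are expensive.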
4.2.3 Regular-Expression Matching
Regular-expression matching applies a regular expression to the indexed terms. In Elasticsearch this is done with a Regexp Query.
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder queryBuilder = QueryBuilders.regexpQuery("field", "regex");
searchSourceBuilder.query(queryBuilder);
SearchRequest searchRequest = new SearchRequest("index");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
- field is the name of the field to match against.
- regex is the user-supplied regular expression.
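One detail worth knowing: an Elasticsearch regexp query is anchored, meaning the pattern must match the entire indexed term, not just a substring of it. In Java terms it behaves like Pattern.matches rather than Matcher.find:

```java
import java.util.regex.Pattern;

public class RegexAnchor {
    public static void main(String[] args) {
        // Anchored, whole-term matching, like an Elasticsearch regexp query
        System.out.println(Pattern.matches("key.*", "keyword")); // true: covers the whole term
        System.out.println(Pattern.matches("word", "keyword"));  // false: only a substring matches
        // Unanchored substring search, which a regexp query does NOT perform
        System.out.println(Pattern.compile("word").matcher("keyword").find()); // true
    }
}
```

So to find terms merely containing "word", the pattern must be written as .*word.* (with the performance cost that implies).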
4.3 多語(yǔ)言搜索
多語(yǔ)言搜索,是指根據(jù)用戶提供的關(guān)鍵詞進(jìn)行跨語(yǔ)言的搜索。在 Elasticsearch 中,可以使用 MultiMatch Query 實(shí)現(xiàn)多語(yǔ)言搜索。
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder queryBuilder = QueryBuilders.multiMatchQuery("keyword", "field1", "field2");
searchSourceBuilder.query(queryBuilder);
SearchRequest searchRequest = new SearchRequest("index");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
- keyword is the user-supplied search term.
- field1 and field2 are the fields to search.
4.4 數(shù)據(jù)清理和標(biāo)準(zhǔn)化
數(shù)據(jù)清理和標(biāo)準(zhǔn)化,是指對(duì)用戶提供的關(guān)鍵詞進(jìn)行處理,使其符合特定規(guī)范,以便更好地匹配。在 Elasticsearch 中,可以使用 Analyzers 和 Tokenizers 實(shí)現(xiàn)數(shù)據(jù)清理和標(biāo)準(zhǔn)化。
String keyword = "original keyword";
// Run an index's analyzer over the keyword. withIndexAnalyzer takes the index
// name, the analyzer name, and the text; note that withGlobalAnalyzer takes an
// analyzer name as its first argument, not an index name.
AnalyzeRequest analyzeRequest = AnalyzeRequest.withIndexAnalyzer("index", "standard", keyword);
AnalyzeResponse analyzeResponse = client.indices().analyze(analyzeRequest, RequestOptions.DEFAULT);
List<AnalyzeResponse.AnalyzeToken> tokens = analyzeResponse.getTokens();
for (AnalyzeResponse.AnalyzeToken token : tokens) {
    System.out.println(token.getTerm());
}
- keyword is the user's original input.
- index is the name of the index whose analyzer configuration is used.
That covers the main Elasticsearch query analyzer concepts and example code for the common use cases. Together, these query types and text-processing features support efficient, accurate search and analysis.
五、調(diào)試和性能優(yōu)化
5.1 分析 Query Parsing 過(guò)程
在 Elasticsearch 中,查詢的解析過(guò)程是將查詢字符串轉(zhuǎn)換為查詢對(duì)象的過(guò)程。為了分析查詢解析過(guò)程,可以使用 Elasticsearch 的 SearchRequest
對(duì)象的 source
方法來(lái)傳遞搜索請(qǐng)求的源代碼。
SearchRequest searchRequest = new SearchRequest("your_index");
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders.matchQuery("field_name", "search_query"));
searchRequest.source(sourceBuilder);
5.2 Using the Explain API to Inspect Match Details
To see why a particular document does or does not match a query, and how its score was computed, use the Explain API. It takes an index, a document id, and a query, and returns a detailed scoring breakdown for that single document.
// Constructor signatures vary across versions; older clients also require a type
ExplainRequest explainRequest = new ExplainRequest("your_index", "your_doc_id");
explainRequest.query(QueryBuilders.matchQuery("field_name", "search_query"));
ExplainResponse explainResponse = client.explain(explainRequest, RequestOptions.DEFAULT);
5.3 性能測(cè)試和優(yōu)化
性能測(cè)試是評(píng)估 Elasticsearch 集群查詢性能的一種方法。可以使用 Elasticsearch 的搜索 Profile 來(lái)幫助識(shí)別潛在的性能問(wèn)題。文章來(lái)源:http://www.zghlxwxcb.cn/news/detail-584701.html
SearchRequest searchRequest = new SearchRequest("your_index");
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders.matchQuery("field_name", "search_query"));
sourceBuilder.profile(true);
searchRequest.source(sourceBuilder);
除了性能測(cè)試,還可以使用 Elasticsearch 的 Warmers 來(lái)優(yōu)化搜索性能。Warmers 是一種預(yù)熱索引的機(jī)制,它可以事先計(jì)算某些搜索的結(jié)果并緩存這些結(jié)果,以提高后續(xù)的搜索性能。文章來(lái)源地址http://www.zghlxwxcb.cn/news/detail-584701.html
PutWarmersRequest putWarmersRequest = new PutWarmersRequest("your_index");
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders.matchQuery("field_name", "search_query"));
SearchRequest searchRequest = new SearchRequest("your_index");
searchRequest.source(sourceBuilder);
putWarmersRequest.addWarmers("your_warmer", searchRequest);