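In this notebook, we walk through examples of using the Reindex API to upgrade an index to the ELSER model .elser_model_2.
Note: Alternatively, you can update an index in place with update_by_query so that it uses ELSER. This notebook covers only the Reindex API approach.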
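For comparison, the update_by_query alternative mentioned in the note could look roughly like the sketch below. It is not part of the original notebook: the function name add_elser_tokens_in_place and its parameters are hypothetical, and it assumes an ELSER ingest pipeline already exists and that `es` is a connected Elasticsearch client.

```python
def add_elser_tokens_in_place(es, index, pipeline, output_field):
    """Run an existing ELSER ingest pipeline over every document of an index in place.

    Hypothetical helper: `es` is a connected Elasticsearch client and `pipeline`
    is the ID of an ingest pipeline that writes tokens into `output_field`.
    """
    # Map the token field first so the generated tokens are stored as a sparse vector.
    es.indices.put_mapping(
        index=index,
        properties={output_field: {"type": "sparse_vector"}},
    )
    # Re-process all documents through the pipeline without copying them elsewhere.
    return es.update_by_query(
        index=index,
        pipeline=pipeline,
        wait_for_completion=True,
        refresh=True,
    )
```

Unlike reindexing, this keeps the original index name, but it also leaves no untouched copy of the data to fall back on.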
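The scenarios covered in this notebook:
- Migrating an index that has no generated text_expansion field to the ELSER model .elser_model_2
- Upgrading an existing index that uses .elser_model_1 to the .elser_model_2 model
- Upgrading an index that uses a different model to ELSER
The demonstrations below use Elastic Stack 8.11.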
Installation
Installing Elasticsearch and Kibana
If you have not yet installed Elasticsearch and Kibana, refer to the following articles:
- How to install Elasticsearch on Linux, macOS, and Windows
- Kibana: How to install Kibana from the Elastic Stack on Linux, macOS, and Windows
When installing, choose Elastic Stack 8.x. Elasticsearch displays its installation information during setup.
Uploading the vector model requires a Platinum subscription or a trial license.
Installing the ELSER model
If you have not yet installed the ELSER model, follow the article "Elasticsearch: Deploy ELSER - Elastic Learned Sparse Encoder"; the steps are not repeated here. Note that the ID of the installed ELSER model is .elser_model_2, not the .elser_model_1 used in that earlier article.
Python
We need to install the Elasticsearch Python package:
$ pwd
/Users/liuxg/python/elser
$ pip3 install elasticsearch -qU
$ pip3 list | grep elasticsearch
elasticsearch 8.11.1
rag-elasticsearch 0.0.1 /Users/liuxg/python/rag-elasticsearch/my-app/packages/rag-elasticsearch
Environment variables
Before starting Jupyter, we set the following environment variables:
export ES_USER="elastic"
export ES_PASSWORD="yarOjyX5CLqTsKVE3v*d"
export ES_ENDPOINT="localhost"
Copying the Elasticsearch certificate
Copy the Elasticsearch certificate into the current directory:
$ pwd
/Users/liuxg/python/elser
$ cp ~/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt .
$ ls
find_books_about_christmas_without_searching_for_christmas.ipynb
Chatbot with LangChain conversational chain and OpenAI.ipynb
ElasticKnnSearch.ipynb
ElasticVectorSearch.ipynb
ElasticsearchStore.ipynb
Mental Health FAQ.ipynb
Multilingual semantic search.ipynb
NLP text search using hugging face transformer model.ipynb
Question Answering with Langchain and OpenAI.ipynb
RAG-langchain-elasticsearch.ipynb
Semantic search - ELSER.ipynb
Semantic search quick start.ipynb
book_summaries_1000_chunked.json
books.json
data.json
http_ca.crt
lib
sample_data.json
upgrading-index-to-use-elser.ipynb
vector_search_implementation_guide_api.ipynb
workplace-docs.json
Above, we copied the Elasticsearch certificate http_ca.crt into the current directory.
Running the application
Connecting to Elasticsearch with the client
from elasticsearch import Elasticsearch
import os
elastic_user=os.getenv('ES_USER')
elastic_password=os.getenv('ES_PASSWORD')
elastic_endpoint=os.getenv("ES_ENDPOINT")
url = f"https://{elastic_user}:{elastic_password}@{elastic_endpoint}:9200"
es = Elasticsearch(url, ca_certs = "./http_ca.crt", verify_certs = True)
print(es.info())
The output of es.info() shows that the connection to Elasticsearch succeeded.
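Embedding the user name and password in the URL can be misparsed when the password contains URL-reserved characters such as @ or /. The following is a minimal sketch that passes the credentials through the client's basic_auth argument instead. It assumes the same ES_USER, ES_PASSWORD, and ES_ENDPOINT variables; the helper names connection_kwargs and connect are illustrative and not part of the original notebook.

```python
import os


def connection_kwargs(env=None, ca_certs="./http_ca.crt", port=9200):
    """Build keyword arguments for Elasticsearch() from the ES_* environment variables."""
    env = os.environ if env is None else env
    required = ("ES_USER", "ES_PASSWORD", "ES_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "hosts": f"https://{env['ES_ENDPOINT']}:{port}",
        # Credentials are passed separately, so no URL escaping is needed.
        "basic_auth": (env["ES_USER"], env["ES_PASSWORD"]),
        "ca_certs": ca_certs,
        "verify_certs": True,
    }


def connect(env=None):
    """Create a client from the environment (hypothetical convenience wrapper)."""
    from elasticsearch import Elasticsearch  # imported lazily; needs the elasticsearch package

    return Elasticsearch(**connection_kwargs(env))
```

Calling `es = connect()` followed by `print(es.info())` would then replace the URL-based setup, and a missing variable is reported before any connection attempt.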
Case 1
In this example, we look at how to upgrade an index that already has an ingest pipeline configured so that it uses the ELSER model .elser_model_2.
Creating an ingest pipeline with lowercase
We create a simple pipeline that converts the value of the title field to lowercase, and use this ingest pipeline on our index.
es.ingest.put_pipeline(
    id="ingest-pipeline-lowercase",
    description="Ingest pipeline to change title to lowercase",
    processors=[
        {
            "lowercase": {
                "field": "title"
            }
        }
    ]
)
Creating the movies index with mappings
Next, we create an index that uses the ingest-pipeline-lowercase pipeline from the previous step as its default pipeline.
es.indices.delete(index="movies", ignore_unavailable=True)
es.indices.create(
    index="movies",
    settings={
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "default_pipeline": "ingest-pipeline-lowercase"
        }
    },
    mappings={
        "properties": {
            "plot": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
        }
    }
)
Ingesting documents
We are now ready to insert a sample dataset of 12 movies into our movies index. Save the following data to a file named movies.json.
movies.json
[
{
"title": "Pulp Fiction",
"runtime": "154",
"plot": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
"keyScene": "John Travolta is forced to inject adrenaline directly into Uma Thurman's heart after she overdoses on heroin.",
"genre": "Crime, Drama",
"released": "1994"
},
{
"title": "The Dark Knight",
"runtime": "152",
"plot": "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
"keyScene": "Batman angrily responds 'I’m Batman' when asked who he is by Falcone.",
"genre": "Action, Crime, Drama, Thriller",
"released": "2008"
},
{
"title": "Fight Club",
"runtime": "139",
"plot": "An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.",
"keyScene": "Brad Pitt explains the rules of Fight Club to Edward Norton. The first rule of Fight Club is: You do not talk about Fight Club. The second rule of Fight Club is: You do not talk about Fight Club.",
"genre": "Drama",
"released": "1999"
},
{
"title": "Inception",
"runtime": "148",
"plot": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
"keyScene": "Leonardo DiCaprio explains the concept of inception to Ellen Page by using a child's spinning top.",
"genre": "Action, Adventure, Sci-Fi, Thriller",
"released": "2010"
},
{
"title": "The Matrix",
"runtime": "136",
"plot": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.",
"keyScene": "Red pill or blue pill? Morpheus offers Neo a choice between the red pill, which will allow him to learn the truth about the Matrix, or the blue pill, which will return him to his former life.",
"genre": "Action, Sci-Fi",
"released": "1999"
},
{
"title": "The Shawshank Redemption",
"runtime": "142",
"plot": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.",
"keyScene": "Andy Dufresne escapes from Shawshank prison by crawling through a sewer pipe.",
"genre": "Drama",
"released": "1994"
},
{
"title": "Goodfellas",
"runtime": "146",
"plot": "The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.",
"keyScene": "Joe Pesci's character Tommy DeVito shoots young Spider in the foot for not getting him a drink.",
"genre": "Biography, Crime, Drama",
"released": "1990"
},
{
"title": "Se7en",
"runtime": "127",
"plot": "Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.",
"keyScene": "Brad Pitt's character David Mills shoots John Doe after he reveals that he murdered Mills' wife.",
"genre": "Crime, Drama, Mystery, Thriller",
"released": "1995"
},
{
"title": "The Silence of the Lambs",
"runtime": "118",
"plot": "A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.",
"keyScene": "Hannibal Lecter explains to Clarice Starling that he ate a census taker's liver with some fava beans and a nice Chianti.",
"genre": "Crime, Drama, Thriller",
"released": "1991"
},
{
"title": "The Godfather",
"runtime": "175",
"plot": "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.",
"keyScene": "James Caan's character Sonny Corleone is shot to death at a toll booth by a number of machine gun toting enemies.",
"genre": "Crime, Drama",
"released": "1972"
},
{
"title": "The Departed",
"runtime": "151",
"plot": "An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.",
"keyScene": "Leonardo DiCaprio's character Billy Costigan is shot to death by Matt Damon's character Colin Sullivan.",
"genre": "Crime, Drama, Thriller",
"released": "2006"
},
{
"title": "The Usual Suspects",
"runtime": "106",
"plot": "A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.",
"keyScene": "Kevin Spacey's character Verbal Kint is revealed to be the mastermind behind the crime, when his limp disappears as he walks away from the police station.",
"genre": "Crime, Mystery, Thriller",
"released": "1995"
}
]
$ pwd
/Users/liuxg/python/elser
$ ls movies.json
movies.json
Next, we run the following code:
import json
from elasticsearch import helpers
import time
with open('movies.json') as f:
    data_json = json.load(f)
# Prepare the documents to be indexed
documents = []
for doc in data_json:
    documents.append({
        "_index": "movies",
        "_source": doc,
    })
# Use helpers.bulk to index
helpers.bulk(es, documents)
print("Done indexing documents into `movies` index!")
time.sleep(5)
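The fixed time.sleep(5) gives Elasticsearch time to make the new documents searchable. As an alternative, the sketch below refreshes the index explicitly and then counts the documents. The function names to_bulk_actions and index_and_count are hypothetical, and `es` is assumed to be the client created earlier.

```python
def to_bulk_actions(docs, index):
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index, "_source": doc}


def index_and_count(es, docs, index):
    """Bulk-index the documents, refresh the index, and return the searchable count."""
    from elasticsearch import helpers  # imported lazily; needs the elasticsearch package

    helpers.bulk(es, to_bulk_actions(docs, index))
    # An explicit refresh makes the new documents visible to search right away.
    es.indices.refresh(index=index)
    return es.count(index=index)["count"]
```

With the data loaded from movies.json, `index_and_count(es, data_json, "movies")` should report 12 once all documents are indexed.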
We can view the 12 newly ingested documents in Kibana.
Updating the movies index to use the ELSER model
We are ready to reindex movies into a new index using the ELSER model .elser_model_2. As a first step, we must create a new ingest pipeline and a new index that use the ELSER model.
Creating a new ingest pipeline that uses the ELSER model
Let's create a new ingest pipeline with the ELSER model .elser_model_2.
es.ingest.put_pipeline(
    id="elser-ingest-pipeline",
    description="Ingest pipeline for ELSER",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_2",
                "input_output": [
                    {
                        "input_field": "plot",
                        "output_field": "plot_embedding"
                    }
                ]
            }
        }
    ]
)
Creating a new index with mappings
Next, create an index with the mappings that ELSER requires.
es.indices.delete(index="elser-movies", ignore_unavailable=True)
es.indices.create(
    index="elser-movies",
    mappings={
        "properties": {
            "plot": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "plot_embedding": {
                "type": "sparse_vector"
            }
        }
    }
)
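Notes:
- plot_embedding is the name of the field that holds the generated tokens; its type is sparse_vector.
- plot is the name of the field from which the sparse vector is created.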
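The pipeline and mappings follow the same pattern for any pair of input and output fields, so they can be generated rather than written by hand each time. The following is a small sketch under that assumption; the names elser_processors and elser_mappings are illustrative helpers, not part of the Elasticsearch client.

```python
ELSER_MODEL_ID = ".elser_model_2"


def elser_processors(input_field, output_field, model_id=ELSER_MODEL_ID):
    """Return the processors list for an ELSER inference ingest pipeline."""
    return [
        {
            "inference": {
                "model_id": model_id,
                "input_output": [
                    {"input_field": input_field, "output_field": output_field}
                ],
            }
        }
    ]


def elser_mappings(input_field, output_field):
    """Return mappings with a text source field and a sparse_vector token field."""
    return {
        "properties": {
            input_field: {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            output_field: {"type": "sparse_vector"},
        }
    }
```

For the movies example, `es.ingest.put_pipeline(id="elser-ingest-pipeline", processors=elser_processors("plot", "plot_embedding"))` and `es.indices.create(index="elser-movies", mappings=elser_mappings("plot", "plot_embedding"))` would build the same pipeline and index.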
Reindexing with the updated ingest pipeline
With the Reindex API, we copy the data from the old index movies to the new index elser-movies, with the ingest pipeline set to elser-ingest-pipeline. On success, elser-movies contains the tokens generated by ELSER inference, which text_expansion queries search against.
es.reindex(
    source={
        "index": "movies"
    },
    dest={
        "index": "elser-movies",
        "pipeline": "elser-ingest-pipeline"
    }
)
time.sleep(7)
After reindexing completes, inspect any document in the elser-movies index and note that it has an additional field, plot_embedding, containing the terms we will use in text_expansion queries.
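This check can also be done in code. The sketch below asks the reindex call to refresh the destination index when it finishes, summarizes the response, and lists any hits that lack the token field. The helper names are hypothetical, and the response keys used (total, created, updated, failures) are assumed to match what the Reindex API returns.

```python
def reindex_with_pipeline(es, source_index, dest_index, pipeline):
    """Reindex through an ingest pipeline and refresh the destination when done."""
    return es.reindex(
        source={"index": source_index},
        dest={"index": dest_index, "pipeline": pipeline},
        wait_for_completion=True,  # block until the reindex has finished
        refresh=True,  # make the copied documents searchable immediately
    )


def reindex_summary(response):
    """Extract the counters and failures reported by a finished reindex."""
    body = getattr(response, "body", response)  # accept a client response or a plain dict
    return {
        "total": body.get("total", 0),
        "created": body.get("created", 0),
        "updated": body.get("updated", 0),
        "failures": body.get("failures", []),
    }


def missing_field_ids(hits, field):
    """Return the IDs of search hits whose _source lacks the given field."""
    return [hit["_id"] for hit in hits if field not in hit.get("_source", {})]
```

For example, `missing_field_ids(es.search(index="elser-movies", size=12)["hits"]["hits"], "plot_embedding")` should come back empty when every document received its tokens.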
Querying documents with ELSER
Let's try a semantic search on the index with the ELSER model .elser_model_2:
response = es.search(
    index='elser-movies',
    size=3,
    query={
        "text_expansion": {
            "plot_embedding": {
                "model_id": ".elser_model_2",
                "model_text": "investigation"
            }
        }
    }
)
for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    plot = hit['_source']['plot']
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")
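The same query shape and print loop recur in the later cases, so they can be factored into two small helpers. This is only a sketch: text_expansion_query and format_hits are hypothetical names, and the default field list matches the movies example.

```python
def text_expansion_query(field, text, model_id=".elser_model_2"):
    """Build a text_expansion query against a field that holds ELSER tokens."""
    return {"text_expansion": {field: {"model_id": model_id, "model_text": text}}}


def format_hits(response, fields=("title", "plot")):
    """Render each hit's score and selected _source fields as printable text."""
    lines = []
    for hit in response["hits"]["hits"]:
        lines.append(f"Score: {hit['_score']}")
        for name in fields:
            # Fields missing from a document are printed as empty strings.
            lines.append(f"{name.capitalize()}: {hit['_source'].get(name, '')}")
        lines.append("")
    return "\n".join(lines)
```

The search above then becomes `print(format_hits(es.search(index="elser-movies", size=3, query=text_expansion_query("plot_embedding", "investigation"))))`.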
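Case 2: Upgrading an index from an ELSER model to .elser_model_2
If you already have an index built with the ELSER model .elser_model_1 and want to upgrade to .elser_model_2, you can combine the Reindex API with an ingest pipeline that uses the .elser_model_2 model.
Note: Before you begin, make sure you are running Elasticsearch 8.11 and have deployed the ELSER model .elser_model_2.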
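Both prerequisites in the note can be checked from Python before reindexing. The sketch below is an illustration rather than an official procedure: the helper names are hypothetical, and the shape of the trained-model statistics (a trained_model_stats list whose entries carry model_id and deployment_stats.state) is assumed from the Elasticsearch ML APIs.

```python
def version_at_least(version, minimum=(8, 11)):
    """Return True if a version string such as '8.11.1' is at least `minimum`."""
    release = version.split("-")[0]  # drop suffixes such as -SNAPSHOT
    major_minor = tuple(int(part) for part in release.split(".")[:2])
    return major_minor >= minimum


def model_is_started(stats_response, model_id=".elser_model_2"):
    """Return True if any deployment of the model reports the state 'started'."""
    return any(
        entry.get("model_id") == model_id
        and entry.get("deployment_stats", {}).get("state") == "started"
        for entry in stats_response.get("trained_model_stats", [])
    )


def check_prerequisites(es, model_id=".elser_model_2"):
    """Check the cluster version and the model deployment with a connected client."""
    version = es.info()["version"]["number"]
    stats = es.ml.get_trained_models_stats(model_id=model_id).body
    return version_at_least(version) and model_is_started(stats, model_id)
```

Running `check_prerequisites(es)` before the steps below gives an early warning if the model has been downloaded but not yet started.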
Creating a new ingest pipeline
We create a pipeline with .elser_model_2 so that we can reindex.
es.ingest.put_pipeline(
    id="elser-pipeline-upgrade-demo",
    description="Ingest pipeline for ELSER upgrade demo",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_2",
                "input_output": [
                    {
                        "input_field": "plot",
                        "output_field": "plot_embedding"
                    }
                ]
            }
        }
    ]
)
Creating a new index with mappings
We create a new index with the mappings required to support ELSER:
es.indices.delete(index="elser-upgrade-index-demo", ignore_unavailable=True)
es.indices.create(
    index="elser-upgrade-index-demo",
    mappings={
        "properties": {
            "plot": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "plot_embedding": {
                "type": "sparse_vector"
            },
        }
    }
)
Using the Reindex API
We use the Reindex API to move data from the old index to the new index elser-upgrade-index-demo. We exclude the old tokens field from the source index and, while reindexing, generate new tokens in the plot_embedding field with .elser_model_2.
Note: Make sure to replace my-index with the name of the index you want to upgrade, and my-tokens-field with the name of the field that holds your previously generated tokens.
es.reindex(
    source={
        "index": "my-index",  # replace with your index name
        "_source": {
            # replace with the field in your index that holds previously generated tokens
            "excludes": ["my-tokens-field"]
        }
    },
    dest={
        "index": "elser-upgrade-index-demo",
        "pipeline": "elser-pipeline-upgrade-demo"
    }
)
time.sleep(5)
For demonstration purposes, we practice on the elser-movies index from the previous step. We pretend that it was generated by .elser_model_1 (although it was actually generated by .elser_model_2). We use the following code:
es.reindex(
    source={
        "index": "elser-movies",  # stands in for an index built with .elser_model_1
        "_source": {
            "excludes": ["plot_embedding"]  # the field holding the previously generated tokens
        }
    },
    dest={
        "index": "elser-upgrade-index-demo",
        "pipeline": "elser-pipeline-upgrade-demo"
    }
)
time.sleep(5)
Querying your data
Once reindexing completes, you can query your data and perform semantic search:
response = es.search(
    index='elser-upgrade-index-demo',
    size=3,
    query={
        "text_expansion": {
            "plot_embedding": {
                "model_id": ".elser_model_2",
                "model_text": "child toy"
            }
        }
    }
)
for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    plot = hit['_source']['plot']
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")
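Case 3: Upgrading an index that uses a different model to ELSER
Now we look at how to migrate an index whose embeddings were generated with a different model.
Consider the index blogs, whose text_embedding field was generated with the NLP model sentence-transformers__all-minilm-l6-v2. To learn more about loading an NLP model into an index, follow the steps in our notebook NLP text search using hugging face transformer model.ipynb.
We follow a process similar to the one above:
- Create an ingest pipeline with the ELSER model .elser_model_2
- Create an index with mappings for use with the pipeline created in the previous step
- Reindex, excluding the embedding field from the blogs index
Before we begin, let's take a look at our blogs index and view its mappings:
es.indices.get(index="blogs")
Note the field text_embedding: we will exclude it from the new index and generate a new embedding field from the title field of the blogs index.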
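When an index has many fields, the embedding fields to exclude can be located programmatically instead of by reading the mapping. The sketch below walks the properties tree and returns the dotted paths of all fields of a given type. The function name fields_of_type is hypothetical, and it assumes the old embeddings are mapped as dense_vector, which may not hold for every index.

```python
def fields_of_type(properties, field_type, prefix=""):
    """Return dotted paths of all mapped fields with the given type."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == field_type:
            found.append(path)
        # Object and nested fields keep their children under "properties".
        found.extend(fields_of_type(spec.get("properties", {}), field_type, f"{path}."))
    return found
```

With the response from the call above, `fields_of_type(es.indices.get(index="blogs")["blogs"]["mappings"]["properties"], "dense_vector")` lists the vector fields, and the top-level part of each path is what goes into the reindex excludes list.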
Creating the ingest pipeline
Next, we create a pipeline with the ELSER model .elser_model_2:
es.ingest.put_pipeline(
    id="elser-pipeline-blogs",
    description="Ingest pipeline for ELSER upgrade",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_2",
                "input_output": [
                    {
                        "input_field": "title",
                        "output_field": "title_embedding"
                    }
                ]
            }
        }
    ]
)
Creating the index with mappings
Let's create an index elser-blogs with mappings:
es.indices.delete(index="elser-blogs", ignore_unavailable=True)
es.indices.create(
    index="elser-blogs",
    mappings={
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "title_embedding": {
                "type": "sparse_vector"
            },
        }
    }
)
Reindex API
We use the Reindex API to copy the data into our new index elser-blogs, generating the text_expansion tokens along the way.
es.reindex(
    source={
        "index": "blogs",
        "_source": {
            "excludes": ["text_embedding"]
        }
    },
    dest={
        "index": "elser-blogs",
        "pipeline": "elser-pipeline-blogs"
    }
)
time.sleep(5)
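For larger indices, running inference on every document can take longer than a single request should stay open. One option is to start the reindex as a background task and poll it until it completes. The sketch below illustrates that idea; wait_for_task and reindex_in_background are hypothetical helpers, and the polling interval and timeout are arbitrary defaults.

```python
import time


def wait_for_task(get_task, task_id, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll get_task(task_id) until it reports completion or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_task(task_id)
        if status["completed"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} did not complete in {timeout_seconds} seconds")
        time.sleep(poll_seconds)


def reindex_in_background(es, source, dest):
    """Start a reindex without waiting, then poll the Tasks API until it is done."""
    task = es.reindex(source=source, dest=dest, wait_for_completion=False)
    return wait_for_task(lambda task_id: es.tasks.get(task_id=task_id), task["task"])
```

Passing the same source and dest dictionaries as in the call above would run the blogs reindex this way; the returned task status can then be inspected for failures.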
Querying your data
Success! Now we can query data on the elser-blogs index.
response = es.search(
    index='elser-blogs',
    size=3,
    query={
        "text_expansion": {
            "title_embedding": {
                "model_id": ".elser_model_2",
                "model_text": "Track network connections"
            }
        }
    }
)
for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    print(f"Score: {score}\nTitle: {title}")
The complete notebook is available for download.