In my article "Elastic: A developer's getting-started guide", in the section "NLP - natural language processing and vector search", I described the vector search capabilities of the Elastic Stack at length. Many of those approaches rely on huggingface.co and on Elastic's machine learning features, which for many developers means paid usage: the inference processor that runs machine learning models requires a paid license, and so does uploading models with eland.
In today's article we introduce another way to do vector search. We bypass eland for uploading a model; instead, we use a Python application to upload pre-computed dense_vector field values. We will first ingest data into Elasticsearch with an ingest script. The script connects to a locally hosted Elasticsearch and uses the SentenceTransformer library to generate the text embeddings.
In the demonstration below I use the latest Elastic Stack 8.8.1, although it should work with other Elastic Stack 8.x releases as well.
ingest.py
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

USERNAME = "elastic"
PASSWORD = "z5nxTriCD4fi7jSS=GFM"
ELASTICSEARCH_ENDPOINT = "https://localhost:9200"
CERT_FINGERPRINT = "783663875df7ae1daf3541ab293d8cd48c068b3dbc2d9dd6fa8a668289986ac2"

# Connect to Elasticsearch
es = Elasticsearch(ELASTICSEARCH_ENDPOINT,
                   ssl_assert_fingerprint=CERT_FINGERPRINT,
                   basic_auth=(USERNAME, PASSWORD),
                   verify_certs=False)
resp = es.info()
print(resp)

# Index name
index_name = "test1"

# Example data
data = [
    {"id": 1, "text": "The sun slowly set behind the mountains, casting a golden glow across the landscape. The air was crisp and cool, a gentle breeze rustling through the leaves of the trees. Birds chirped in the distance, their melodic songs filling the air. As I walked along the winding path, I couldn't help but marvel at the beauty of nature surrounding me. The scent of wildflowers wafted through the air, intoxicating and refreshing. It was a moment of tranquility, a moment to escape from the chaos of everyday life and immerse myself in the serenity of the natural world."},
    {"id": 2, "text": "The bustling city streets were filled with the sound of car horns and chatter. People hurried past, their faces lost in a sea of anonymity. Skyscrapers towered above, their reflective glass windows shimmering in the sunlight. The aroma of street food filled the air, mingling with the scent of exhaust fumes. Neon signs flashed with vibrant colors, advertising the latest products and services. It was a city that never slept, a constant whirlwind of activity and excitement. Amidst the chaos, I navigated through the crowds, searching for moments of connection and inspiration."},
    {"id": 3, "text": "The waves crashed against the shore, each one a powerful force of nature. The sand beneath my feet shifted with every step, as if it was alive. Seagulls soared overhead, their calls echoing through the salty air. The ocean stretched out before me, its vastness both awe-inspiring and humbling. I closed my eyes and listened to the symphony of the sea, the rhythm of the waves lulling me into a state of tranquility. It was a place of solace, a place where the worries of the world melted away and all that remained was the beauty of the natural world."},
    {"id": 4, "text": "The old bookstore was a treasure trove of knowledge and stories. Rows upon rows of bookshelves lined the walls, each one filled with books of every genre and era. The scent of aged paper and ink filled the air, creating an atmosphere of nostalgia and adventure. As I perused the shelves, my fingers lightly grazing the spines of the books, I felt a sense of wonder and curiosity. Each book held the potential to transport me to another world, to introduce me to new ideas and perspectives. It was a sanctuary for the avid reader, a place where imagination flourished and stories came to life."}
]

# Create Elasticsearch index and mapping (8.x client: use the
# mappings keyword instead of the deprecated body/ignore parameters)
if not es.indices.exists(index=index_name):
    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "text": {"type": "text"},
                # dims must match the embedding model's output dimension
                "embedding": {"type": "dense_vector", "dims": 768}
            }
        }
    )

# Upload documents to Elasticsearch with text embeddings
model = SentenceTransformer('quora-distilbert-multilingual')

for doc in data:
    # Calculate text embeddings using the SentenceTransformer model
    embedding = model.encode(doc["text"], show_progress_bar=False)

    # Create document with text and embedding
    document = {
        "text": doc["text"],
        "embedding": embedding.tolist()
    }

    # Index the document in Elasticsearch
    es.index(index=index_name, id=doc["id"], document=document)
To run the application above, we need to install the elasticsearch and sentence_transformers packages:
pip install sentence_transformers elasticsearch
If the way the Python code above connects to Elasticsearch is still unclear, please read my earlier article "Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x" in detail.
In the ingest script we first import the necessary libraries, including Elasticsearch and SentenceTransformer. We establish a connection to Elasticsearch using the Elasticsearch URL, and we define the index_name variable to hold the name of the Elasticsearch index.
Next, we define the sample data as a list of dictionaries, where each dictionary represents a document with an ID and a text. These documents simulate the data we want to search. You can customize the script for your particular data source and metadata extraction requirements.
We check whether the Elasticsearch index exists and, if it does not, create it with the appropriate mapping. The mapping defines the field types of our documents: a text field of type text, and an embedding field of type dense_vector with 768 dimensions.
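As a side note, on Elastic Stack 8.x the same dense_vector field can optionally be indexed for approximate kNN search by adding index and similarity options to the mapping. This is only a sketch of such a mapping, not something the scripts in this article require; the field names mirror the ingest script above, and "cosine" matches the cosine-based scoring we use at query time:

```python
# Sketch of an 8.x dense_vector mapping that also enables approximate kNN.
# "similarity": "cosine" matches the cosine-based scoring used at query time;
# dims must still equal the model's output dimension (768 here).
knn_mapping = {
    "properties": {
        "text": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 768,
            "index": True,
            "similarity": "cosine",
        },
    }
}

embedding_field = knn_mapping["properties"]["embedding"]
print(embedding_field["dims"])        # the dimension expected from the model
print(embedding_field["similarity"])  # the similarity used for kNN scoring
```

With a mapping like this you could use the kNN search endpoint instead of a script_score query; for the four sample documents here, the script_score approach shown later is entirely sufficient.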
We initialize the SentenceTransformer model with the pre-trained quora-distilbert-multilingual text-embedding model. This model encodes text into dense vectors of length 768.
For each document in the sample data, we compute the text embedding with the model.encode() function and store it in the embedding variable. We build a document dictionary with the text and embedding fields, and finally index the document into Elasticsearch with the es.index() function.
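For only four documents, one es.index() call per document is perfectly fine. For larger datasets, the bulk helper in the elasticsearch package is usually much faster. A minimal sketch of how the per-document loop could be restructured for bulk indexing (the fake embeddings below are stand-ins for model.encode() output, and the actual bulk call is only shown in a comment since it needs a live cluster):

```python
# With the elasticsearch package installed, you would import the helper:
#   from elasticsearch.helpers import bulk

def build_actions(docs, embeddings, index_name="test1"):
    """Pair each document with its embedding as a bulk-index action."""
    actions = []
    for doc, emb in zip(docs, embeddings):
        actions.append({
            "_index": index_name,
            "_id": doc["id"],
            "_source": {"text": doc["text"], "embedding": emb},
        })
    return actions

docs = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
fake_embeddings = [[0.0] * 768, [1.0] * 768]  # stand-ins for real embeddings
actions = build_actions(docs, fake_embeddings)
print(len(actions))  # one action per document

# Against a live cluster you would then run:
#   bulk(es, actions)
```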
Now that we have ingested the data into Elasticsearch, let's move on and create the search API with FastAPI.
main.py
from fastapi import FastAPI
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

USERNAME = "elastic"
PASSWORD = "z5nxTriCD4fi7jSS=GFM"
ELASTICSEARCH_ENDPOINT = "https://localhost:9200"
CERT_FINGERPRINT = "783663875df7ae1daf3541ab293d8cd48c068b3dbc2d9dd6fa8a668289986ac2"

# Connect to Elasticsearch
es = Elasticsearch(ELASTICSEARCH_ENDPOINT,
                   ssl_assert_fingerprint=CERT_FINGERPRINT,
                   basic_auth=(USERNAME, PASSWORD),
                   verify_certs=False)

app = FastAPI()

# Load the model once at startup rather than on every request
model = SentenceTransformer('quora-distilbert-multilingual')

@app.get("/search/")
async def search(query: str):
    print("query string is: ", query)
    embedding = model.encode(query, show_progress_bar=False)

    # Build the Elasticsearch script query
    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {"query_vector": embedding.tolist()}
            }
        }
    }

    # Execute the search query (8.x client: query keyword instead of body)
    search_results = es.search(index="test1", query=script_query)

    # Process and return the search results
    results = search_results["hits"]["hits"]
    return {"results": results}

@app.get("/")
async def root():
    return {"message": "Hello World"}
To run the FastAPI application, save the code in a file (for example main.py) and execute the following command in a terminal:
uvicorn main:app --reload
$ pwd
/Users/liuxg/python/fastapi_vector
$ uvicorn main:app --reload
INFO: Will watch for changes in these directories: ['/Users/liuxg/python/fastapi_vector']
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [95339] using WatchFiles
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py:395: SecurityWarning: Connecting to 'https://localhost:9200' using TLS with verify_certs=False is insecure
_transport = transport_class(
INFO: Started server process [95341]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 127.0.0.1:59811 - "GET / HTTP/1.1" 200 OK
This starts the FastAPI development server. You can then access the search endpoint at http://localhost:8000/search/ with a query parameter to perform a search. The results are returned as a JSON response.
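Note that query strings containing spaces or punctuation must be URL-encoded before being sent to the endpoint. A small sketch using only the Python standard library; the host and port assume the uvicorn defaults shown above:

```python
from urllib.parse import urlencode

# Build a properly encoded URL for the /search/ endpoint
base_url = "http://localhost:8000/search/"
params = {"query": "The sun slowly set behind the mountains"}
url = base_url + "?" + urlencode(params)  # spaces become '+'
print(url)

# With the server running you could then fetch the JSON response, e.g.:
#   import urllib.request, json
#   results = json.load(urllib.request.urlopen(url))["results"]
```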
Make sure to customize the code to your own requirements, for example by adding error handling and authentication, or by modifying the response structure. We run the following search:
Clearly, when we search for the sentence "The sun slowly set behind the mountains", the first document is the closest match. The other documents are less similar, but they are still returned to the user as candidate results.
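To see why the first document wins, recall what the script_score query computes: the cosine similarity between the query vector and each stored embedding, plus 1.0, so every score falls in the range [0, 2]. A pure-Python sketch of the same arithmetic, using toy 3-dimensional vectors in place of the real 768-dimensional embeddings:

```python
import math

def script_score(query_vec, doc_vec):
    """Mimic "cosineSimilarity(params.query_vector, 'embedding') + 1.0"."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    return dot / (norm_q * norm_d) + 1.0

query = [1.0, 0.0, 0.0]
doc_close = [0.9, 0.1, 0.0]   # nearly the same direction -> score near 2.0
doc_far = [-1.0, 0.0, 0.0]    # opposite direction -> score near 0.0

print(script_score(query, doc_close) > script_score(query, doc_far))  # True
```

The `+ 1.0` shifts scores to be non-negative, since Elasticsearch does not allow negative scores in script_score queries; it does not change the ranking.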
That concludes this article on building a text search application with Elasticsearch vector search and FastAPI.