openai-cookbook/apps/enterprise-knowledge-retrieval at main · openai/openai-cookbook · GitHub
Finally, some example code that addresses a problem I actually have: integrating a private enterprise knowledge base.
I took another look at the word "retrieval". It derives from the verb "retrieve", whose basic senses are "to get back", "to recover", or "to look up". The word shifts slightly with context; in "Enterprise Knowledge Retrieval" it refers to the process of finding and extracting information from an enterprise's knowledge base.
Many GPT application scenarios come down to Q&A, discovery, summarization, and analysis over an organization's own knowledge base, and OpenAI provides a simple example here. Real enterprise knowledge is messier, though: it is stored in many different ways, owned by many different people, and highly fragmented. To build a good internal knowledge assistant, or a customer-facing Q&A bot, you first have to get internal knowledge management right, which is a lot of work with high ongoing cost. That said, there is no need to be intimidated: if a GPT application focuses on a specific scenario, the knowledge it needs is often small, frequently just a handful of documents, and building a small assistant tool around that is much easier.
Start with the notebook, which documents the full process and steps of integrating a knowledge base. Below is the English walkthrough with my notes. I found its guidance on storage particularly useful.
openai-cookbook/apps/enterprise-knowledge-retrieval/enterprise_knowledge_retrieval.ipynb at main · openai/openai-cookbook · GitHub
Enterprise Knowledge Retrieval
This notebook contains an end-to-end workflow to set up an Enterprise Knowledge Retrieval solution from scratch.
Problem Statement
LLMs have great conversational ability but their knowledge is general and often out of date. Relevant knowledge often exists, but is kept in disparate data stores that are hard to surface with current search solutions.
Objective
We want to deliver an outstanding user experience where the user is presented with the right knowledge when they need it in a clear and conversational way. To accomplish this we need an LLM-powered solution that knows our organizational context and data, that can retrieve the right knowledge when the user needs it.
Solution
We'll build a knowledge retrieval solution that will embed a corpus of knowledge (in our case a database of Wikipedia manuals, wikipedia_articles_2000.csv) and use it to answer user questions.
Learning Path
Walkthrough
You can follow on to this solution walkthrough through either the video recorded here, or the text walkthrough below. We'll build out the solution in the following stages:
- Setup: Initiate variables and connect to a vector database.
- Storage: Configure the database, prepare our data and store embeddings and metadata for retrieval.
- Search: Extract relevant documents back out with a basic search function and use an LLM to summarise results into a concise reply.
- Answer: Add a more sophisticated agent which will process the user's query and maintain a memory for follow-up questions.
- Evaluate: Take a sample of evaluated question/answer pairs using our service and plot them to scope out remedial action.
Storage
We'll initialise our vector database first. Which database you choose and how you store data in it is a key decision point, and we've collated a few principles to aid your decision here:
How much data to store
How much metadata do you want to include in the index? Metadata can be used to filter your queries or to bring back more information upon retrieval for your application to use, but larger indices will be slower, so there is a trade-off.
There are two common design patterns here:
- All-in-one: Store your metadata with the vector embeddings so you perform semantic search and retrieval on the same database. This is easier to set up and run, but can run into scaling issues when your index grows.
- Vectors only: Store just the embeddings and any IDs/references needed to locate the metadata that goes with the vector in a different database or location. In this pattern the vector database is only used to locate the most relevant IDs, then those are looked up from a different database. This can be more scalable if your vector database is going to be extremely large, or if you have large volumes of metadata with each vector. A minimal sketch of this pattern follows below.
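To make the trade-off concrete, here is a minimal sketch of the "vectors only" pattern. The nearest_ids method and the metadata_db lookup are hypothetical placeholders of my own, not part of the cookbook:

import numpy as np

def vectors_only_search(query_embedding, vector_index, metadata_db, top_k=5):
    # Step 1: the vector index stores only embeddings plus IDs, so it stays
    # small and fast; it returns just the IDs of the nearest neighbours.
    # (nearest_ids is a hypothetical method on an abstract vector index.)
    ids = vector_index.nearest_ids(np.asarray(query_embedding), k=top_k)
    # Step 2: full documents and metadata live in a separate store keyed by
    # the same IDs, fetched only for the handful of hits we actually need.
    return [metadata_db.get(doc_id) for doc_id in ids]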
Which vector database to use
The vector database market is wide and varied, so we won't recommend one over the other. For a few options you can review this cookbook and the sub-folders, which have examples supplied by many of the vector database providers in the market.
We're going to use Redis as our database for both document contents and the vector embeddings. You will need the full Redis Stack to enable use of Redisearch, which is the module that allows semantic search - more detail is in the docs for Redis Stack.
To set this up locally, you will need to:
- Install an appropriate version of Docker for your OS
- Ensure Docker is running, e.g. by running docker run hello-world
- Run the following command: docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
The code used here draws heavily on this repo.
After setting up the Docker instance of Redis Stack, you can follow the below instructions to initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.
(The gist of the setup: the example uses Redis Stack, an extended Redis distribution, as the vector database; start it with Docker as instructed above.)
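For reference, here is a minimal sketch of initiating the Redis connection and creating the HNSW index with the redis-py client. The index name, field names, and the 1536-dimension assumption (the size of text-embedding-ada-002 vectors) are illustrative, not copied from the notebook:

from redis import Redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

# Connect to the Redis Stack container started above.
redis_client = Redis(host="localhost", port=6379)

schema = (
    TextField("title"),
    TextField("content"),
    VectorField("content_vector", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 1536,                 # text-embedding-ada-002 output size
        "DISTANCE_METRIC": "COSINE",
    }),
)
redis_client.ft("docs-index").create_index(
    fields=schema,
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)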
Data preparation
The next step is to prepare your data. There are a few decisions to keep in mind here:
Chunking your data
In this context, "chunking" means cutting up the text into reasonable sizes so that the content will fit into the context length of the language model you choose. If your data is small enough or your LLM has a large enough context limit then you can proceed with no chunking, but in many cases you'll need to chunk your data. I'll share two main design patterns here:
- Token-based: Chunking your data based on some common token threshold i.e. 300, 500, 1000 depending on your use case. This approach works best with a grid-search evaluation to decide the optimal chunking logic over a set of evaluation questions. Variables to consider are whether chunks have overlaps, and whether you extend or truncate a section to keep full sentences and paragraphs together (a sketch of this pattern follows after this list).
- Deterministic: Deterministic chunking uses some common delimiter, like a page break, paragraph end, section header etc. to chunk. This can work well if you have data of reasonably uniform structure, or if you can use GPT to help annotate the data first so you can guarantee common delimiters. However, it can be difficult to handle your chunks when you stuff them into the prompt given you need to cater for many different lengths of content, so consider that in your application design.
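A minimal sketch of the token-based pattern using tiktoken; the 1000-token default, zero overlap and the cl100k_base encoding are assumptions for illustration:

import tiktoken

def chunk_by_tokens(text, chunk_size=1000, overlap=0, encoding_name="cl100k_base"):
    # Encode once, then slice the token list into (optionally overlapping) windows.
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
    return chunks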
Which vectors should you store
It is critical to think through the user experience you're building towards because this will inform both the number and content of your vectors. Here are two example use cases that show how these can pan out:
- Tool Manual Knowledge Base: We have a database of manuals that our customers want to search over. For this use case, we want a vector to allow the user to identify the right manual, before searching a different set of vectors to interrogate the content of the manual to avoid any cross-pollination of similar content between different manuals.
  - Title Vector: Could include title, author name, brand and abstract.
  - Content Vector: Includes content only.
- Investor Reports: We have a database of investor reports that contain financial information about public companies. I want relevant snippets pulled out and summarised so I can decide how to invest. In this instance we want one set of content vectors, so that the retrieval can pull multiple entries on a company or industry, and summarise them to form a composite analysis.
  - Content Vector: Includes content only, or content supplemented by other features that improve search quality such as author, industry etc.
For this walkthrough we'll go with 1000-token chunking of text content with no overlap, and embed the chunks with the article title included as a prefix.
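A hedged sketch of that preparation step, reusing the chunk_by_tokens helper sketched above and the pre-v1 openai Embedding API of this notebook's era; the record layout is my own assumption:

import openai

EMBEDDINGS_MODEL = "text-embedding-ada-002"

def prepare_article(title, content):
    # 1000-token chunks with no overlap, each embedded with the article
    # title prefixed so the title's signal travels with every chunk.
    records = []
    for i, chunk in enumerate(chunk_by_tokens(content, chunk_size=1000, overlap=0)):
        text = f"{title}\n\n{chunk}"
        vector = openai.Embedding.create(
            input=text, model=EMBEDDINGS_MODEL,
        )["data"][0]["embedding"]
        records.append({"id": f"{title}-{i}", "text": text, "vector": vector})
    return records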
Search
We can now use our knowledge base to bring back search results. This is one of the areas of highest friction in enterprise knowledge retrieval use cases, with the most common complaint being that the system is not retrieving what you intuitively think are the most relevant documents. There are a few ways of tackling this - I'll share a few options here, as well as some resources to take your research further:
Vector search, keyword search or a hybrid
Despite the strong out-of-the-box capabilities that vector search gives, search is still not a solved problem, and there are well-proven Lucene-based search solutions such as Elasticsearch and Solr that use methods that work well for certain use cases, as well as the sparse vector methods of traditional NLP such as TF-IDF. If your retrieval is poor, the answer may be one of these in particular, or a combination:
- Vector search: Converts your text into vector embeddings which can be searched using KNN, SVM or some other model to return the most relevant results. This is the approach we take in this workbook, using a RediSearch vector DB which employs a KNN search under the hood.
- Keyword search: This method uses any keyword-based search approach to return a score - it could use Elasticsearch/Solr out-of-the-box, or a TF-IDF approach like BM25.
- Hybrid search: This last approach is a mix of the two, where you produce both a vector search and keyword search result, before using an alpha between 0 and 1 to weight the outputs. There is a great example of this explained by the Weaviate team here.
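A small sketch of that alpha weighting, assuming both searches have already been normalised to scores in [0, 1] (the normalisation itself is out of scope here):

def hybrid_score(vector_score, keyword_score, alpha=0.5):
    # alpha = 1.0 -> pure vector search; alpha = 0.0 -> pure keyword search.
    return alpha * vector_score + (1 - alpha) * keyword_score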
Hypothetical Document Embeddings (HyDE)
This is a novel approach from this paper, which states that a hypothetical answer to a question is more semantically similar to the real answer than the question is. In practice this means that your search would use GPT to generate a hypothetical answer, then embed that and use it for search. I've seen success with this both as a pure search, and as a retry step if the initial retrieval fails to retrieve relevant content. A simple example implementation is here:
def answer_question_hyde(question, prompt):
    # COMPLETIONS_MODEL, redis_client and get_redis_results are defined
    # earlier in the notebook.
    hyde_prompt = '''You are OracleGPT, a helpful expert who answers user questions to the best of their ability.
Provide a confident answer to their question. If you don't know the answer, make the best guess you can based on the context of the question.
User question: USER_QUESTION_HERE
Answer:'''
    # Generate a hypothetical answer, then search with *that* instead of the
    # raw question - it tends to sit closer to the real documents in embedding space.
    hypothetical_answer = openai.Completion.create(
        model=COMPLETIONS_MODEL,
        prompt=hyde_prompt.replace('USER_QUESTION_HERE', question),
    )['choices'][0]['text']
    search_results = get_redis_results(redis_client, hypothetical_answer)
    return search_results
Fine-tuning embeddings
This next approach leverages the learning you gain from real question/answer pairs that your users will generate during the evaluation approach. It works by:
- Creating a dataset of positive (and optionally negative) question and answer pairs. Positive examples would be a correct retrieval for a question, while negative would be poor retrievals.
- Calculating the embeddings for both questions and answers and the cosine similarity between them.
- Training a model to optimize the embeddings matrix and testing retrieval, picking the best one.
- Performing a matrix multiplication of the base Ada embeddings by this new best matrix, creating a new fine-tuned embedding to use for retrieval.
There is a great walkthrough of both the approach and the code to perform it in this cookbook. A minimal sketch of the final matrix-multiplication step is below.
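The sketch assumes the matrix has already been trained as described in the linked cookbook; shapes and names are illustrative:

import numpy as np

def customize_embedding(embedding, matrix):
    # Project a base Ada embedding (shape (1536,)) through the learned
    # matrix (shape (1536, d)) to get a retrieval-tuned embedding.
    return np.asarray(embedding) @ matrix

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))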
Reranking
One other well-proven method from traditional search solutions that can be applied to any of the above approaches is reranking, where we over-fetch our search results, and then deterministically rerank based on a modifier or set of modifiers.
An example is investor reports again - it is highly likely that if we have 3 reports on Apple, we'll want to make our investment decisions based on the latest one. In this instance a recency modifier could be applied to the vector scores to sort them, giving us the latest one on top even if it is not the most semantically similar to our search question.
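A minimal sketch of over-fetching and then reranking by recency; the field names and the 0.9-per-year decay factor are assumptions for illustration:

from datetime import date

def rerank_by_recency(results, today=None, decay_per_year=0.9):
    # results: over-fetched hits (e.g. top 20), each a dict with a semantic
    # "score" and a "published" date. Older hits get their score discounted.
    today = today or date.today()
    def adjusted(hit):
        age_years = (today - hit["published"]).days / 365.25
        return hit["score"] * (decay_per_year ** age_years)
    return sorted(results, key=adjusted, reverse=True)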
For this walkthrough we'll stick with a basic semantic search bringing back the top 5 chunks for a user question, and providing a summarised response using GPT.
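As a hedged sketch, a top-5 KNN query against the index sketched earlier might look like this with redis-py; the index and field names match my sketch above rather than the notebook itself:

import numpy as np
from redis.commands.search.query import Query

def top_5_chunks(redis_client, query_embedding):
    # "*" matches all documents; KNN 5 returns the 5 nearest content vectors.
    q = (
        Query("*=>[KNN 5 @content_vector $vec AS score]")
        .sort_by("score")
        .return_fields("title", "content", "score")
        .dialect(2)
    )
    params = {"vec": np.asarray(query_embedding, dtype=np.float32).tobytes()}
    return redis_client.ft("docs-index").search(q, query_params=params).docs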
Some of the notebook's explanations give good high-level direction for building a private knowledge base, but they feel light on hands-on specifics. That is understandable: OpenAI only provides the platform, and using it well is an engineering problem. The second half is a bit dense and rewards several readings alongside the example code; I hope to put it to use on a real project.
The example also includes a chatbot, likewise implemented with Streamlit.
After running the notebook, start the chatbot page with:
streamlit run chatbot.py
Note that you must run the notebook first to prepare the data; otherwise the chatbot has nothing to search and will not work.