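This article introduces the basic concepts of Embeddings and walks through how they are used with a minimal but complete code example, so you can build your own AI chatbot (a smart customer-service assistant). You can take the code and adapt it to your own needs.
What problem do ChatGPT's Embeddings solve?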
Suppose you ask ChatGPT directly: What is langchain? If you do not know please do not answer.
ChatGPT knows nothing about events after September 2021, and langchain appeared after that date, so ChatGPT answers that it does not know:
I’m sorry, but I don’t have any information on “langchain.” It appears to be a term that is not widely recognized or used in general knowledge.
If we use Embeddings and ask the same question, it can give an answer:
LangChain is a framework for developing applications powered by language models.
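With this technique we can ask questions about our own documents, which extends what ChatGPT knows and lets us build a customized AI customer-service assistant. For example, you can integrate ChatGPT into your official website and have it answer users' questions based on the site's documentation.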
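Basic concepts behind Embeddings
What are Embeddings?
Before jumping into the code, here is a brief introduction to Embeddings. To understand Embeddings, we first need the concept of a "vector".
We can describe a thing along multiple dimensions. Sound, for example, can be described in the time domain and in the frequency domain (many readers will have heard of the Fourier transform). The more dimensions we use, the more fully we can describe a thing. Two things that lie close together in vector space usually have more in common, and vectors are easy to compute with, so we can judge how similar two things are by computing with their vectors.
In natural language processing (NLP), Embeddings are a way of converting words or sentences into numeric vectors. These vectors capture the meaning of the words or sentences, which lets us perform mathematical operations on them. For example, we can compute the cosine similarity between two vectors to measure how semantically similar they are.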
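To make the cosine similarity idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration (real embeddings from text-embedding-ada-002 have 1536 dimensions), and the function name `cosine_similarity` is my own, not part of any library used in this article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" with made-up values, purely for illustration.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)

# Vectors pointing in similar directions score higher.
print(sim_cat_kitten > sim_cat_car)
```

Cosine distance, which the code later in this article uses, is simply 1 minus this similarity, so a smaller distance means a closer match.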
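How Embeddings are used: the workflow
How do we get ChatGPT to answer questions about content it was never trained on? The flow is as follows (a picture is worth a thousand words).
Step by step: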
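- First, get the embeddings of the local data. A single embeddings call is limited in the number of tokens it accepts, so split the data into chunks and call the API on each chunk in turn to obtain embeddings for all of the data.
- Next comes the question. The question text is embedded in the same way, which yields one vector.
- Then compute the distance between the question's embedding and the embedding of every data chunk. Distance here means the distance between vectors. The few closest chunks become the "context" for the question (in this example, data2 and data3 are the chunks most relevant to the question).
- With the context in hand, construct the real prompt: the context is attached to the question and both are sent to ChatGPT, which can then answer a question it previously could not.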
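The last two steps can be sketched without calling any API. The following is a rough illustration under stated assumptions: the chunk texts and three-dimensional vectors are made up, and the helper names `cosine_distance`, `pick_context`, and `build_prompt` are hypothetical. In the real flow, the vectors come from the embeddings API.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means the vectors are closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def pick_context(question_vec, chunks, top_k=2):
    """Return the texts of the top_k chunks closest to the question vector."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(question_vec, c["embedding"]))
    return [c["text"] for c in ranked[:top_k]]


def build_prompt(question, context_texts):
    """Attach the selected context to the question."""
    context = "\n\n###\n\n".join(context_texts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )


# Made-up chunks and vectors standing in for data1/data2/data3.
chunks = [
    {"text": "data1: opening hours", "embedding": [1.0, 0.0, 0.0]},
    {"text": "data2: what LangChain is", "embedding": [0.1, 0.9, 0.2]},
    {"text": "data3: LangChain use cases", "embedding": [0.0, 0.8, 0.4]},
]
question_vec = [0.0, 1.0, 0.3]  # pretend embedding of the question

selected = pick_context(question_vec, chunks, top_k=2)
prompt = build_prompt("What is langchain?", selected)
print(prompt)
```

Here data2 and data3 are selected and data1 is left out, mirroring the example in the steps above. The full code later in this article limits the context by a token budget rather than a fixed `top_k`.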
In summary:
ChatGPT can answer questions about content it does not know because we pass it the relevant context, and it draws the answer from that context. Which context to send is decided by computing vector distances.
Embedding in practice: code (Python)
Let's look at the actual code.
Prerequisites
- Python 3.8 or later (tiktoken does not support older versions).
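- An OpenAI API key, or a key from another provider that offers a compatible API service.
- The following Python packages installed: requests, beautifulsoup4, pandas, tiktoken, openai, numpy.
- A private text dataset. This example uses a text file named langchainintro.txt, which contains some documentation from the langchain website. The documentation is recent, so ChatGPT cannot know it, which makes it a good test.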
Code:
The code comes from the OpenAI website; I have modified and trimmed it.
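# Note: openai.embeddings_utils, openai.Embedding and openai.ChatCompletion
# exist only in the pre-1.0 openai package (for example openai==0.28).
import os
import traceback
from ast import literal_eval

import numpy as np
import openai
import pandas as pd
import tiktoken
from openai.embeddings_utils import distances_from_embeddings

# Tokenizer used by text-embedding-ada-002 and gpt-3.5-turbo
tokenizer = tiktoken.get_encoding("cl100k_base")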
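
def get_api_key():
    # Read the API key from the environment
    return os.getenv('OPENAI_API_KEY')

def set_openai_config():
    openai.api_key = get_api_key()
    # Third-party API endpoint; change or remove it to use your own provider
    openai.api_base = "https://openai.api2d.net/v1"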
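
def remove_newlines(serie):
    # Replace real and escaped newlines with spaces
    serie = serie.str.replace('\n', ' ')
    serie = serie.str.replace('\\n', ' ')
    # Collapse double spaces (run twice to shrink longer runs)
    serie = serie.str.replace('  ', ' ')
    serie = serie.str.replace('  ', ' ')
    return serie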

def load_text_files(file_name):
    with open(file_name, "r", encoding="UTF-8") as f:
        text = f.read()
    return text

def prepare_directory(dir_name="processed"):
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)

def split_into_many(text, max_tokens):
    # Split the text into sentences
    sentences = text.split('. ')
    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
    chunks = []
    tokens_so_far = 0
    chunk = []
    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):
        # If the number of tokens so far plus the number of tokens in the current sentence is greater
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0
        # If the number of tokens in the current sentence is greater than the max number of
        # tokens, split the sentence into smaller parts and add them to the chunk.
        # Note: the slice counts characters, which is only a rough proxy for tokens.
        while token > max_tokens:
            part = sentence[:max_tokens]
            chunk.append(part)
            sentence = sentence[max_tokens:]
            token = len(tokenizer.encode(" " + sentence))
        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1
    # Add the last chunk to the list of chunks
    if chunk:
        chunks.append(". ".join(chunk) + ".")
    return chunks

def shorten_texts(df, max_tokens):
    shortened = []
    # Loop through the dataframe
    for row in df.iterrows():
        # If the text is None, go to the next row
        if row[1]['text'] is None:
            continue
        # If the number of tokens is greater than the max number of tokens, split the text into chunks
        if row[1]['n_tokens'] > max_tokens:
            shortened += split_into_many(row[1]['text'], max_tokens)
        # Otherwise, add the text to the list of shortened texts
        else:
            shortened.append(row[1]['text'])
    df = pd.DataFrame(shortened, columns=['text'])
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
    return df

def create_embeddings(df):
    # One API call per chunk; the result is cached to disk
    df['embeddings'] = df.text.apply(
        lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])
    df.to_csv('processed/embeddings.csv')
    return df

def load_embeddings():
    df = pd.read_csv('processed/embeddings.csv', index_col=0)
    # The CSV stores each vector as a string; parse it back into a numpy array
    df['embeddings'] = df['embeddings'].apply(literal_eval).apply(np.array)
    return df

def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """
    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']
    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')
    returns = []
    cur_len = 0
    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        # If the context is too long, break
        if cur_len > max_len:
            break
        # Else add it to the text that is being returned
        returns.append(row["text"])
    # Return the context
    return "\n\n###\n\n".join(returns)

def answer_question(
    df,
    model="gpt-3.5-turbo",  # must be a chat model, since ChatCompletion is used below
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )
    # If debug, print the context that will be sent to the model
    if debug:
        print("Context:\n" + context)
        print("\n\n")
    prompt = f"Answer the question based on the context below, \n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    messages = [
        {
            'role': 'user',
            'content': prompt
        }
    ]
    try:
        # Create a chat completion using the question and context
        response = openai.ChatCompletion.create(
            messages=messages,
            temperature=0,
            max_tokens=max_tokens,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["message"]["content"]
    except Exception as e:
        # Print the stack trace and return an empty answer
        traceback.print_exc()
        print(e)
        return ""

def main():
    # Set the API key
    set_openai_config()
    # Load the local data
    texts = []
    text = load_text_files("langchainintro.txt")
    texts.append(('langchainintro', text))
    prepare_directory("processed")
    # Create a dataframe with two columns: fname and text
    df = pd.DataFrame(texts, columns=['fname', 'text'])
    df['text'] = df.fname + ". " + remove_newlines(df.text)
    df.to_csv('processed/scraped.csv')
    # Count the tokens
    df.columns = ['title', 'text']
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
    df = shorten_texts(df, 500)
    # If processed/embeddings.csv already exists, load it; otherwise create it
    if os.path.exists('processed/embeddings.csv'):
        df = load_embeddings()
    else:
        df = create_embeddings(df)
    question = "What is langchain? If you do not know please do not answer."
    print(question)
    ans = answer_question(df, model='gpt-3.5-turbo', question=question, debug=False)
    print(f'ans:{ans}')

if __name__ == '__main__':
    main()
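The code follows essentially the same flow as the sequence diagram. Note that the API key must be placed in an environment variable (OPENAI_API_KEY); you can change this if you prefer.
If you ask ChatGPT directly: What is langchain? If you do not know please do not answer.
ChatGPT answers that it does not know: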
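Because the code reads the key with `os.getenv`, a missing variable silently yields `None` and the failure only shows up later, at the first API call. A small sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper of my own, not part of the code above or of the openai package.

```python
import os


def require_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in env_var, or raise a clear error if it is missing."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "set it in your shell before running the script."
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("API key found")
    except RuntimeError as err:
        print(err)
```

One option is to call this helper inside `get_api_key()` so the script stops with a readable message instead of an authentication error.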
I’m sorry, but I don’t have any information on “langchain.” It appears to be a term that is not widely recognized or used in general knowledge.
Running the code above, it can give an answer:
LangChain is a framework for developing applications powered by language models.
As you can see, it used the document we provided to answer.
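Further notes
- Watch your token consumption. If you have a lot of local data, the embedding stage will consume a large number of tokens, so use it with care.
- The embedding stage still sends your local data to OpenAI (or whichever API provider you use). Keep this in mind if you have privacy requirements.
- In production, the vectors are usually stored in a "vector database" rather than a local file. A plain text file is used here only for demonstration.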
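For the first note, token usage can be estimated before any embeddings call is made, because the chunk DataFrame already carries an `n_tokens` column. The sketch below is only arithmetic: `estimate_embedding_cost` is a hypothetical helper, and the price is a placeholder assumption, so check your provider's current pricing before relying on it.

```python
def estimate_embedding_cost(token_counts, price_per_1k_tokens):
    """Return (total tokens, estimated cost) for embedding all chunks once."""
    total_tokens = sum(token_counts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens


# Example token counts, standing in for df['n_tokens'].tolist()
n_tokens = [480, 495, 310]

# Placeholder price per 1,000 tokens; NOT a quoted price.
total, cost = estimate_embedding_cost(n_tokens, price_per_1k_tokens=0.0001)
print(f"{total} tokens, estimated cost {cost:.7f}")
```

Caching the result, as the article's code does with processed/embeddings.csv, means this cost is paid once per dataset rather than on every run.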