
The OpenAI ChatGPT Official Embeddings API (Converting Text to Vectors): A Detailed Practical Guide and Tutorial for Beginners, Part 5 (with Source Code)


Preface

ChatGPT embeddings convert text into fixed-length continuous vectors, which makes it possible to run classification, topic clustering, search, recommendation, and similar tasks over text data. Text that used to be difficult to process can now be handled easily.

Using ChatGPT embeddings can greatly improve the user experience: they help a chatbot interpret text more accurately and enable more effective search and recommendation as well as smoother interactive conversations, so that user needs are better met.

Overview

What are embeddings?

OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
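
For intuition, here is a toy sketch of what "distance" means here (the 3-dimensional vectors are made up for illustration; real text-embedding-ada-002 embeddings have 1,536 dimensions):

import numpy as np

# Two made-up "embeddings"; real ones come from the API and are much longer
a = np.array([0.10, 0.30, 0.80])
b = np.array([0.12, 0.29, 0.75])

# Cosine similarity close to 1.0 (i.e., a small angular distance) suggests high relatedness
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)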

Visit our pricing page to learn about Embeddings pricing. Requests are billed based on the number of tokens in the input sent.

To see embeddings in action, check out our code samples (see the Use cases section later in this post):

  • Classification
  • Topic clustering
  • Search
  • Recommendations

How to get embeddings

To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.

Example requests:

Python example:

import openai

# Request an embedding for a single text string
response = openai.Embedding.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)
embeddings = response['data'][0]['embedding']

cURL example:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": "Your text string goes here",
    "model": "text-embedding-ada-002"
  }'

Example response:

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}

See more Python code examples in the OpenAI Cookbook.

When using OpenAI embeddings, please keep in mind their limitations and risks (see the Limitations & risks section at the end of this post).

Embedding models

OpenAI offers one second-generation embedding model (denoted by -002 in the model ID) and 16 first-generation models (denoted by -001 in the model ID).

We recommend using text-embedding-ada-002 for nearly all use cases. It's better, cheaper, and simpler to use. Read the blog post announcement.

MODEL GENERATION   TOKENIZER     MAX INPUT TOKENS   KNOWLEDGE CUTOFF
V2                 cl100k_base   8191               Sep 2021
V1                 GPT-2/GPT-3   2046               Aug 2020

Usage is priced per input token, at a rate of $0.0004 per 1,000 tokens, or about ~3,000 pages per US dollar (assuming ~800 tokens per page: $1 buys 2.5 million tokens, and 2,500,000 / 800 ≈ 3,100 pages):

MODEL                    ROUGH PAGES PER DOLLAR   EXAMPLE PERFORMANCE ON BEIR SEARCH EVAL
text-embedding-ada-002   3000                     53.9
*-davinci-*-001          6                        52.8
*-curie-*-001            60                       50.9
*-babbage-*-001          240                      50.4
*-ada-*-001              300                      49.0

Second-generation models

MODEL NAME               TOKENIZER     MAX INPUT TOKENS   OUTPUT DIMENSIONS
text-embedding-ada-002   cl100k_base   8191               1536

First-generation models (not recommended)

All first-generation models (those ending in -001) use the GPT-3 tokenizer and have a max input of 2046 tokens.

Since these models are not recommended by OpenAI, their examples are not covered in detail here; if you need them, please refer to the official documentation.

Use cases

Here we show some representative use cases. We will use the Amazon fine-food reviews dataset for the following examples.

Obtaining the embeddings

The dataset contains a total of 568,454 food reviews Amazon users left up to October 2012. We will use a subset of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text). For example:

PRODUCT ID   USER ID          SCORE   SUMMARY                 TEXT
B001E4KFG0   A3SGXH7AUHU8GW   5       Good Quality Dog Food   I have bought several of the Vitality canned…
B00813GRG4   A1D87F6ZCVE5NK   1       Not as Advertised       Product arrived labeled as Jumbo Salted Peanut…

We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.
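
The combined column can be built along these lines (a minimal sketch; the Summary and Text column names come from the table above, and the exact file path and formatting in the original notebook may differ):

import pandas as pd

# Load the raw reviews (path assumed) and keep the columns we need
df = pd.read_csv('Reviews.csv')
df = df[['ProductId', 'UserId', 'Score', 'Summary', 'Text']].dropna()

# Concatenate the review title and body into one text field to embed
df['combined'] = "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()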

Obtain_dataset.ipynb

import openai

def get_embedding(text, model="text-embedding-ada-002"):
   # The model performs best when newlines are replaced with spaces
   text = text.replace("\n", " ")
   return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

To load the data from a saved file, you can run the following:

import pandas as pd
import numpy as np

df = pd.read_csv('output/embedded_1k_reviews.csv')
# The embeddings were saved as strings; convert them back to numpy arrays
df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)

Data visualization in 2D

Visualizing_embeddings_in_2D.ipynb

The size of the embeddings varies with the complexity of the underlying model. In order to visualize this high dimensional data we use the t-SNE algorithm to transform the data into two dimensions.

We color the individual reviews based on the star rating which the reviewer has given:

  • 1-star: red
  • 2-star: dark orange
  • 3-star: gold
  • 4-star: turquoise
  • 5-star: dark green

[Figure: Amazon ratings visualized in language using t-SNE]

The visualization seems to have produced roughly 3 clusters, one of which has mostly negative reviews.

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt

df = pd.read_csv('output/embedded_1k_reviews.csv')
matrix = np.array(df.ada_embedding.apply(eval).to_list())

# Create a t-SNE model and transform the data down to 2 dimensions
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)

# One color per star rating, from 1 star (red) to 5 stars (dark green)
colors = ["red", "darkorange", "gold", "turquoise", "darkgreen"]
x = [x for x, y in vis_dims]
y = [y for x, y in vis_dims]
color_indices = df.Score.values - 1

colormap = matplotlib.colors.ListedColormap(colors)
plt.scatter(x, y, c=color_indices, cmap=colormap, alpha=0.3)
plt.title("Amazon ratings visualized in language using t-SNE")
plt.show()

Embedding as a text feature encoder for ML algorithms

Regression_using_embeddings.ipynb

An embedding can be used as a general free-text feature encoder within a machine learning model. Incorporating embeddings will improve the performance of any machine learning model, if some of the relevant inputs are free text. An embedding can also be used as a categorical feature encoder within a ML model. This adds most value if the names of categorical variables are meaningful and numerous, such as job titles. Similarity embeddings generally perform better than search embeddings for this task.

We observed that generally the embedding representation is very rich and information dense. For example, reducing the dimensionality of the inputs using SVD or PCA, even by 10%, generally results in worse downstream performance on specific tasks.
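
As an illustration of what such a reduction looks like in code (a sketch, not part of the original notebook):

from sklearn.decomposition import PCA
import numpy as np

matrix = np.array(df.ada_embedding.to_list())  # shape: (n_reviews, 1536)

# Project the 1,536-dimensional embeddings down to 256 dimensions;
# even mild reductions like this tended to hurt downstream task performance
pca = PCA(n_components=256)
reduced_matrix = pca.fit_transform(matrix)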

This code splits the data into a training set and a testing set, which will be used by the following two use cases, namely regression and classification.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    list(df.ada_embedding.values),
    df.Score,
    test_size = 0.2,
    random_state=42
)

Regression using the embedding features

Embeddings present an elegant way of predicting a numerical value. In this example we predict the reviewer's star rating, based on the text of their review. Because the semantic information contained within embeddings is high, the prediction is decent even with very few reviews.

We assume the score is a continuous variable between 1 and 5, and allow the algorithm to predict any floating point value. The ML algorithm minimizes the distance of the predicted value to the true score, and achieves a mean absolute error of 0.39, which means that on average the prediction is off by less than half a star.

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)
preds = rfr.predict(X_test)
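
To reproduce the mean absolute error quoted above, the predictions can be scored like this (a minimal sketch):

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, preds)
print(f"Mean absolute error: {mae:.2f}")  # roughly 0.39 on this dataset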

Classification using the embedding features

Classification_using_embeddings.ipynb

This time, instead of having the algorithm predict a value anywhere between 1 and 5, we will attempt to classify the exact number of stars for a review into 5 buckets, ranging from 1 to 5 stars.

After the training, the model learns to predict 1 and 5-star reviews much better than the more nuanced reviews (2-4 stars), likely due to more extreme sentiment expression.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

# Per-class precision/recall shows that 1- and 5-star reviews are predicted best
print(classification_report(y_test, preds))
print("Accuracy:", accuracy_score(y_test, preds))

Zero-shot classification

Zero-shot_classification_with_embeddings.ipynb

We can use embeddings for zero shot classification without any labeled training data. For each class, we embed the class name or a short description of the class. To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity.

from openai.embeddings_utils import cosine_similarity, get_embedding

model = 'text-embedding-ada-002'

# Drop neutral 3-star reviews and map scores to sentiment labels
df = df[df.Score != 3]
df['sentiment'] = df.Score.replace({1: 'negative', 2: 'negative', 4: 'positive', 5: 'positive'})

labels = ['negative', 'positive']
label_embeddings = [get_embedding(label, model=model) for label in labels]

def label_score(review_embedding, label_embeddings):
   # A positive score means the review is closer to the "positive" label embedding
   return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])

review_embedding = get_embedding('Sample Review', model=model)
prediction = 'positive' if label_score(review_embedding, label_embeddings) > 0 else 'negative'

Obtaining user and product embeddings for cold-start recommendation

User_and_product_embeddings.ipynb

We can obtain a user embedding by averaging over all of their reviews. Similarly, we can obtain a product embedding by averaging over all the reviews about that product. In order to showcase the usefulness of this approach we use a subset of 50k reviews to cover more reviews per user and per product.

We evaluate the usefulness of these embeddings on a separate test set, where we plot similarity of the user and product embedding as a function of the rating. Interestingly, based on this approach, even before the user receives the product we can predict better than random whether they would like the product.

[Figure: Boxplot of user and product embedding similarity, grouped by Score]

import numpy as np

user_embeddings = df.groupby('UserId').ada_embedding.apply(np.mean)
prod_embeddings = df.groupby('ProductId').ada_embedding.apply(np.mean)
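
A sketch of the evaluation described above: for each review in a held-out set, compare the similarity of the corresponding user and product embeddings with the star rating (here test_df is assumed to be a held-out split with UserId, ProductId and Score columns):

from openai.embeddings_utils import cosine_similarity

def user_product_similarity(row):
    # Similarity between this user's average embedding and this product's average embedding
    return cosine_similarity(user_embeddings[row.UserId], prod_embeddings[row.ProductId])

test_df['similarity'] = test_df.apply(user_product_similarity, axis=1)

# Boxplot of similarity grouped by star rating, as in the figure above
test_df.boxplot(column='similarity', by='Score')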

Clustering

Clustering.ipynb

Clustering is one way of making sense of a large volume of textual data. Embeddings are useful for this task, as they provide semantically meaningful vector representations of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.

In this example, we discover four distinct clusters: one focusing on dog food, one on negative reviews, and two on positive reviews.

[Figure: Clusters identified, visualized in language 2D using t-SNE]

import numpy as np
from sklearn.cluster import KMeans

matrix = np.vstack(df.ada_embedding.values)
n_clusters = 4

kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_
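
To get a feel for what each cluster contains, one simple check is to read a few sample review titles per cluster (a sketch; the linked notebook goes further and asks a model to summarize each cluster):

# Print a few review summaries from each cluster to see what it is about
for cluster in range(n_clusters):
    print(f"Cluster {cluster}:")
    for summary in df[df.Cluster == cluster].Summary.sample(3, random_state=42):
        print("  -", summary)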

Text search using embeddings

Semantic_text_search_using_embeddings.ipynb

To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.

from openai.embeddings_utils import get_embedding, cosine_similarity

def search_reviews(df, product_description, n=3, pprint=True):
   embedding = get_embedding(product_description, model='text-embedding-ada-002')
   df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res

res = search_reviews(df, 'delicious beans', n=3)

Code search using embeddings

Code_search.ipynb

Code search works similarly to embedding-based text search. We provide a method to extract Python functions from all the Python files in a given repository. Each function is then indexed by the text-embedding-ada-002 model.

To perform a code search, we embed the query in natural language using the same model. Then we calculate cosine similarity between the resulting query embedding and each of the function embeddings. The highest cosine similarity results are most relevant.
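
The extraction step lives in the linked notebook; a rough sketch of one way to do it with Python's standard ast module (not the notebook's exact code) is:

import ast
from pathlib import Path

def extract_functions(repo_path):
    """Yield (function_name, source_code) pairs for every function in a repository."""
    for py_file in Path(repo_path).rglob("*.py"):
        source = py_file.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                yield node.name, ast.get_source_segment(source, node)

The extracted functions can then be placed in a DataFrame with a code column and embedded as below.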

from openai.embeddings_utils import get_embedding, cosine_similarity

df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

def search_functions(df, code_query, n=3, pprint=True, n_lines=7):
   embedding = get_embedding(code_query, model='text-embedding-ada-002')
   df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))

   res = df.sort_values('similarities', ascending=False).head(n)
   return res

res = search_functions(df, 'Completions API tests', n=3)

Recommendations using embeddings

Recommendation_using_embeddings.ipynb

Because shorter distances between embedding vectors represent greater similarity, embeddings can be useful for recommendation.

Below, we illustrate a basic recommender. It takes in a list of strings and one 'source' string, computes their embeddings, and then returns a ranking of the strings, ranked from most similar to least similar. As a concrete example, the linked notebook below applies a version of this function to the AG news dataset (sampled down to 2,000 news article descriptions) to return the top 5 most similar articles to any given source article.

from typing import List
from openai.embeddings_utils import distances_from_embeddings, indices_of_nearest_neighbors_from_distances

def recommendations_from_strings(
   strings: List[str],
   index_of_source_string: int,
   model="text-embedding-ada-002",
) -> List[int]:
   """Return nearest neighbors of a given string."""

   # get embeddings for all strings (embedding_from_string is a caching helper defined in the linked notebook)
   embeddings = [embedding_from_string(string, model=model) for string in strings]

   # get the embedding of the source string
   query_embedding = embeddings[index_of_source_string]

   # get distances between the source embedding and other embeddings (function from embeddings_utils.py)
   distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")

   # get indices of nearest neighbors (function from embeddings_utils.py)
   indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
   return indices_of_nearest_neighbors
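
A usage sketch with a hypothetical list of article descriptions (the function returns all indices ranked by similarity to the source, with the source itself first):

articles = [
    "Stocks rallied after the central bank held interest rates steady.",
    "Bond markets steadied as the central bank signaled no further rate hikes.",
    "The national team won the championship final in a penalty shootout.",
    "A new study links regular exercise to improved heart health.",
]

neighbors = recommendations_from_strings(articles, index_of_source_string=0)
print(neighbors)  # the second article is likely the nearest non-trivial neighbor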

Limitations & risks 限制和風險

Our embedding models may be unreliable or pose social risks in certain cases, and may cause harm in the absence of mitigations.
我們的嵌入模型在某些情況下可能不可靠或構成社會風險,并且在缺乏緩解措施的情況下可能造成傷害。

Social bias 社會偏差

Limitation: The models encode social biases, e.g. via stereotypes or negative sentiment towards certain groups.
局限性:這些模型對社會偏差進行了編碼,例如通過對某些群體的刻板印象或負面情緒。

We found evidence of bias in our models via running the SEAT (May et al, 2019) and the Winogender (Rudinger et al, 2018) benchmarks. Together, these benchmarks consist of 7 tests that measure whether models contain implicit biases when applied to gendered names, regional names, and some stereotypes.

For example, we found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with black women.

These benchmarks are limited in several ways: (a) they may not generalize to your particular use case, and (b) they only test for a very small slice of possible social bias.

These tests are preliminary, and we recommend running tests for your specific use cases. These results should be taken as evidence of the existence of the phenomenon, not a definitive characterization of it for your use case. Please see our usage policies for more details and guidance.

Please contact our support team via chat if you have any questions; we are happy to advise on this.

Blindness to recent events

Limitation: Models lack knowledge of events that occurred after August 2020.

Our models are trained on datasets that contain some information about real world events up until 8/2020. If you rely on the models representing recent events, then they may not perform well.

Frequently asked questions

How can I tell how many tokens a string has before I embed it?

In Python, you can split a string into tokens with OpenAI's tokenizer tiktoken.

Example code:

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string("tiktoken is great!", "cl100k_base")

For second-generation embedding models like text-embedding-ada-002, use the cl100k_base encoding.

More details and example code are in the OpenAI Cookbook guide how to count tokens with tiktoken.

How can I retrieve K nearest embedding vectors quickly?

For searching over many vectors quickly, we recommend using a vector database. You can find examples of working with vector databases and the OpenAI API in our Cookbook on GitHub.

Vector database options include:

  • Pinecone, a fully managed vector database
  • Weaviate, an open-source vector search engine
  • Redis as a vector database
  • Qdrant, a vector search engine
  • Milvus, a vector database built for scalable similarity search
  • Chroma, an open-source embeddings store
  • Typesense, fast open source vector search
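
If the dataset is small (say, up to a few hundred thousand vectors), an exact brute-force search is often fast enough and needs no extra infrastructure; a minimal sketch with NumPy, assuming the embeddings are already loaded into a matrix with one row per document:

import numpy as np

def k_nearest(query_embedding, embedding_matrix, k=5):
    # OpenAI embeddings are normalized to length 1, so the dot product
    # equals cosine similarity and ranks neighbors correctly
    similarities = embedding_matrix @ np.asarray(query_embedding)
    return np.argsort(-similarities)[:k]  # indices of the k most similar rows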

Which distance function should I use?

We recommend cosine similarity. The choice of distance function typically doesn't matter much.

OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in the identical rankings
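
A quick sketch of why the dot product suffices for unit-length vectors (toy 2-dimensional unit vectors, for illustration only):

import numpy as np

a = np.array([0.6, 0.8])  # both vectors have length 1
b = np.array([0.8, 0.6])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
print(np.isclose(cosine, dot))  # True: the norms are 1, so the denominator changes nothing

# Squared Euclidean distance equals 2 - 2*cos for unit vectors, a monotone transform,
# so ranking by Euclidean distance gives the same order as ranking by cosine similarity
euclidean_sq = np.sum((a - b) ** 2)
print(np.isclose(euclidean_sq, 2 - 2 * dot))  # True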

Can I share my embeddings online?

Customers own their input and output from our models, including in the case of embeddings. You are responsible for ensuring that the content you input to our API does not violate any applicable law or our Terms of Use.

