Speech to text
Learn how to turn audio into text
ChatGPT is a large language model that combines artificial intelligence and natural language processing, and it can interact with users through text, speech, or images. With speech-to-text, spoken input can be converted to text immediately, analyzed, and answered in written form, which greatly improves the efficiency of interaction between ChatGPT and the user.
Introduction
The speech to text API provides two endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model. They can be used to:
- Transcribe audio into whatever language the audio is in.
- Translate and transcribe the audio into English.
File uploads are currently limited to 25 MB, and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Quickstart
Transcriptions
The transcriptions API takes as input the audio file you want to transcribe and the desired output file format for the transcription of the audio. We currently support multiple input and output file formats.
Python
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

# Open the audio file in binary mode and pass it to the Whisper model
audio_file = open("/path/to/file/audio.mp3", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)
cURL
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: multipart/form-data' \
--form file=@/path/to/file/openai.mp3 \
--form model=whisper-1
By default, the response type will be json with the raw text included.
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger.
...
}
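Since the default response is a JSON body, it can be consumed with Python's standard library. The sketch below uses a hard-coded stand-in string for the body; in practice the "text" value comes back from the API:

```python
import json

# A trimmed-down stand-in for the JSON body returned with the default
# response_format; the real "text" value comes from the API.
raw = '{"text": "Imagine the wildest idea that you have ever had."}'

response = json.loads(raw)
print(response["text"])
# → Imagine the wildest idea that you have ever had.
```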
To set additional parameters in a request, you can add more --form lines with the relevant options. For example, if you want to set the output format as text, you would add the following line:
...
--form file=@openai.mp3 \
--form model=whisper-1 \
--form response_format=text
Translations
The translations API takes as input the audio file in any of the supported languages and transcribes, if necessary, the audio into English. This differs from our /Transcriptions endpoint since the output is not in the original input language and is instead translated to English text.
Python
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

# The input audio can be in any supported language; the output is English
audio_file = open("/path/to/file/german.mp3", "rb")
transcript = openai.Audio.translate("whisper-1", audio_file)
cURL
curl --request POST \
--url https://api.openai.com/v1/audio/translations \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: multipart/form-data' \
--form file=@/path/to/file/german.mp3 \
--form model=whisper-1
In this case, the inputted audio was German and the outputted text looks like:
Hello, my name is Wolfgang and I come from Germany. Where are you heading today?
We only support translation into English at this time.
Supported languages
We currently support the following languages through both the transcriptions and translations endpoints:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the underlying model was trained on 98 languages, we only list the languages whose word error rate (WER) fell below 50%, an industry-standard benchmark for speech to text model accuracy. The model will return results for languages not listed above, but the quality will be low.
Longer inputs
By default, the Whisper API only supports files that are less than 25 MB. If you have an audio file that is larger than that, you will need to break it up into chunks of 25 MB or less, or use a compressed audio format. To get the best performance, we suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost.
One way to handle this is to use the PyDub open source Python package to split the audio:
from pydub import AudioSegment
song = AudioSegment.from_mp3("good_morning.mp3")
# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000
first_10_minutes = song[:ten_minutes]
first_10_minutes.export("good_morning_10.mp3", format="mp3")
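For files longer than a single chunk, the same slicing can be applied in a loop. Below is a sketch of the boundary arithmetic in plain Python; chunk_ranges is a helper name introduced here, not part of PyDub, and the export step from above is indicated in a comment:

```python
def chunk_ranges(duration_ms, chunk_ms=10 * 60 * 1000):
    """Return (start, end) millisecond pairs covering the whole file."""
    return [(start, min(start + chunk_ms, duration_ms))
            for start in range(0, duration_ms, chunk_ms)]

# A 25-minute file yields two full 10-minute chunks plus a 5-minute tail:
print(chunk_ranges(25 * 60 * 1000))
# → [(0, 600000), (600000, 1200000), (1200000, 1500000)]

# Each (start, end) pair can then be exported with PyDub as above:
#   song[start:end].export(f"chunk_{start}.mp3", format="mp3")
```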
OpenAI makes no guarantees about the usability or security of 3rd party software like PyDub.
Prompting
You can use a prompt to improve the quality of the transcripts generated by the Whisper API. The model will try to match the style of the prompt, so it will be more likely to use capitalization and punctuation if the prompt does too. However, the current prompting system is much more limited than our other language models and only provides limited control over the generated transcript. Here are some examples of how prompting can help in different scenarios:
- Prompts can be very helpful for correcting specific words or acronyms that the model often misrecognizes in the audio. For example, the following prompt improves the transcription of the words DALL·E and GPT-3, which were previously written as "GDP 3" and "DALI":

The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity
- To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.
- Sometimes the model might skip punctuation in the transcript. You can avoid this by using a simple prompt that includes punctuation:
Hello, welcome to my lecture.
- The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them:
Umm, let me think like, hmm… Okay, here's what I'm, like, thinking.
- Some languages can be written in different ways, such as simplified or traditional Chinese. The model might not always use the writing style that you want for your transcript by default. You can improve this by using a prompt in your preferred writing style.
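Putting the segment-context tip into code: below is a sketch of transcribing split chunks in sequence, passing each call the previous transcript as the prompt. The transcribe argument stands in for a wrapper around openai.Audio.transcribe(..., prompt=...); a stub is used here so the control flow can be shown without a network call:

```python
def transcribe_with_context(chunk_files, transcribe):
    """Transcribe chunks in order, carrying context between calls."""
    texts = []
    prompt = ""
    for chunk in chunk_files:
        # Whisper only looks at the final 224 tokens of the prompt,
        # so passing the entire previous transcript is fine.
        text = transcribe(chunk, prompt=prompt)
        texts.append(text)
        prompt = text
    return " ".join(texts)

# With a stub transcriber the wiring can be checked directly:
stub = lambda chunk, prompt: f"[{chunk}|ctx={bool(prompt)}]"
print(transcribe_with_context(["a.mp3", "b.mp3"], stub))
# → [a.mp3|ctx=False] [b.mp3|ctx=True]
```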