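Earlier, while studying videos on YouTube, I found that many had no subtitles, and my listening comprehension is not very good, so I decided to build a tool that extracts subtitles from audio or video.
After looking through many open-source repositories, I found whisper, which OpenAI released some time ago.
The original repository is linked below:
openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision (github.com) https://github.com/openai/whisper
First, download this repository and extract it; the extracted folder is shown in the figure below.
In addition, because the audio needs to be processed, we also need to download ffmpeg.
Then extract it and add the path of its bin folder to the PATH environment variable.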
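A quick way to confirm that the ffmpeg step worked is to check whether the executable is visible on PATH. The sketch below uses only the Python standard library; the helper name find_ffmpeg is my own and is not part of whisper or ffmpeg.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found; check that its bin folder is on PATH.")
    else:
        # Show the first line of `ffmpeg -version` as a sanity check.
        out = subprocess.run([path, "-version"], capture_output=True, text=True)
        lines = out.stdout.splitlines()
        print(lines[0] if lines else path)
```

If this prints "ffmpeg not found", reopen the terminal after editing the environment variable, since already-open terminals keep the old PATH.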
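I set up the environment with Anaconda.
For a one-step environment setup, you can refer to the resource I uploaded (1 point):
Python configuration for whisper, containing an environment.yaml file that helps downloaders deploy the environment quickly (resource, CSDN Library)
Running conda env create -f environment.yaml then quickly creates a conda virtual environment.
You can also set up the environment as follows.
First, install the released package:
pip install -U openai-whisper
Then, if you want the latest commit from the repository instead of the released package, install:
pip install git+https://github.com/openai/whisper.git
I hope this helps. The resource also includes a Python script to run; the code is as follows: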
import whisper
import io
import time
import os
import json
import pathlib
import torch
# Choose model to use by uncommenting
#modelName = "tiny.en"
#modelName = "base.en"
#modelName = "small.en"
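#modelName = "medium.en"
# Edit the line below to select the model (a model name or a path to a .pt checkpoint file)
modelName = "model/large-v2.pt"
# device=torch.device('cuda:0'if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()
# TODO: this runs on the CPU; use the commented-out line above to run on a GPU when one is available
device=torch.device("cpu")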
# Other Variables
exportTimestampData =False # (bool) Whether to export the segment data to a json file. Will include word level timestamps if word_timestamps is True.
outputFolder = "Output"
exportTimevtt=True
# ----- Select variables for transcribe method -----
# audio: path to audio file
verbose = False # (bool): Whether to display the text being decoded to the console. If True, displays all the details, If False, displays minimal details. If None, does not display anything
language="Chinese" # Language of audio file
word_timestamps=False # (bool): Extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment.
#initial_prompt="" # (optional str): Optional text to provide as a prompt for the first window. This can be used to provide, or "prompt-engineer" a context for transcription, e.g. custom vocabularies or proper nouns to make it more likely to predict those word correctly.
# -------------------------------------------------------------------------
print(f"Using Model: {modelName}")
# filePath = input("Path to File Being Transcribed: ")
# filePath = filePath.strip("\"")
filePath = r"F:\CloudMusic\1.mp3"
if not os.path.exists(filePath):
    print("Problem Getting File...")
    input("Press Enter to Exit...")
    exit()
# If output folder does not exist, create it
if not os.path.exists(outputFolder):
    os.makedirs(outputFolder)
    print("Created Output Folder.\n")
# Get filename stem using pathlib (filename without extension)
fileNameStem = pathlib.Path(filePath).stem
vttFileName=f"{fileNameStem}.vtt"
resultFileName = f"{fileNameStem}.txt"
jsonFileName = f"{fileNameStem}.json"
model = whisper.load_model(modelName,device)
start = time.time()
# ---------------------------------------------------
# Whisper processes the audio internally in 30-second windows
result = model.transcribe(audio=filePath, language=language, word_timestamps=word_timestamps, verbose=verbose, fp16=False)
# ---------------------------------------------------
end = time.time()
elapsed = float(end - start)  # total transcription time in seconds
print(result["segments"])  # print the timestamped segments (the data a subtitle file is built from)
# Save transcription text to file
print("\nWriting transcription to file...")
with open(os.path.join(outputFolder, resultFileName), "w", encoding="utf-8") as file:
    file.write(result["text"])
print("Finished writing transcription file.")
# Save the segments data to json file
#if word_timestamps == True:
if exportTimestampData == True:
    print("\nWriting segment data to file...")
    with open(os.path.join(outputFolder, jsonFileName), "w", encoding="utf-8") as file:
        segmentsData = result["segments"]
        json.dump(segmentsData, file, indent=4)
    print("Finished writing segment data file.")
if exportTimevtt==True:
    print("\nWriting segment data to vtt file...")
    with open(os.path.join(outputFolder, vttFileName), "w", encoding="utf-8") as f:
        # Write the header line; a WebVTT file must begin with "WEBVTT"
        f.write("WEBVTT\n\n")
        # Iterate over every cue (segment) in the result
        for cue in result["segments"]:
            # Get the start and end times and convert them to the vtt format
            start = cue["start"]
            end = cue["end"]
            start_h = int(start // 3600)
            start_m = int((start % 3600) // 60)
            start_s = int(start % 60)
            start_ms = int((start % 1) * 1000)
            end_h = int(end // 3600)
            end_m = int((end % 3600) // 60)
            end_s = int(end % 60)
            end_ms = int((end % 1) * 1000)
            start_str = f"{start_h:02}:{start_m:02}:{start_s:02}.{start_ms:03}"
            end_str = f"{end_h:02}:{end_m:02}:{end_s:02}.{end_ms:03}"
            # Get the text, trimming surrounding whitespace and replacing line breaks
            text = cue["text"].strip().replace("\n", " ")
            # Write the timing line and the text, followed by a blank line
            f.write(f"{start_str} --> {end_str}\n")
            f.write(f"{text}\n\n")
    print("Finished writing segment vtt data file.")
elapsedMinutes = str(round(elapsed/60, 2))
print(f"\nElapsed Time With {modelName} Model: {elapsedMinutes} Minutes")
# input("Press Enter to exit...")
exit()
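The timestamp arithmetic in the vtt export can also be written as a small reusable helper. The sketch below is one possible refactoring, not part of the original script: the names format_timestamp and segments_to_vtt are my own, and it assumes each segment is a dict with "start", "end" and "text" keys, as in result["segments"]. It rounds to whole milliseconds before splitting, and the decimal_marker argument lets the same function produce SRT-style timestamps (which use a comma).

```python
def format_timestamp(seconds: float, decimal_marker: str = ".") -> str:
    """Format seconds as HH:MM:SS.mmm (WebVTT) or HH:MM:SS,mmm (SRT)."""
    total_ms = round(seconds * 1000)           # work in whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)   # 3,600,000 ms per hour
    minutes, rem = divmod(rem, 60_000)         # 60,000 ms per minute
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def segments_to_vtt(segments) -> str:
    """Build the text of a WebVTT file from a list of segment dicts."""
    lines = ["WEBVTT", ""]                     # required header plus a blank line
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        text = seg["text"].strip().replace("\n", " ")
        lines += [f"{start} --> {end}", text, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [{"start": 0.0, "end": 2.5, "text": " hello "}]
    print(segments_to_vtt(demo))
```

With this helper, the vtt branch of the script could write segments_to_vtt(result["segments"]) to the file in a single call.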
The script above can be switched between the CPU and a GPU as needed.
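One way to make that switch automatic is to pick the device at run time. The following is a minimal sketch, assuming PyTorch is installed as in the script; choose_device is a hypothetical helper name, and pairing fp16 with the GPU is my own choice (the original script always passes fp16=False).

```python
def choose_device(prefer_gpu: bool = True):
    """Return (device_name, use_fp16): CUDA with fp16 if available, else CPU with fp32."""
    try:
        import torch  # imported lazily so the helper degrades to CPU without PyTorch
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if prefer_gpu and has_cuda:
        return "cuda", True
    return "cpu", False


if __name__ == "__main__":
    device, use_fp16 = choose_device()
    print(f"device={device}, fp16={use_fp16}")
    # In the script, the two values would then be passed on as:
    #   model = whisper.load_model(modelName, device=device)
    #   result = model.transcribe(audio=filePath, language=language, fp16=use_fp16)
```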
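You also need to download a model; the models can be found through the repository link.
Option 1: change modelName in the code above to a model name such as large-v2 (a name, without the .pt suffix) and the model is downloaded the first time it is loaded. By default it is saved in the .cache\whisper folder under your user directory (C:\Users\Lenovo\.cache\whisper on my machine).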
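To see where the models ended up, the default cache location can be computed and listed from Python. This sketch mirrors what I understand the library's default to be (the user's home directory plus .cache/whisper, unless XDG_CACHE_HOME is set); treat that as an assumption and check your own machine. Both function names are hypothetical helpers, not whisper APIs.

```python
import os


def default_whisper_cache_dir() -> str:
    """Assumed default download folder: <home>/.cache/whisper (or under XDG_CACHE_HOME)."""
    default = os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(os.getenv("XDG_CACHE_HOME", default), "whisper")


def list_downloaded_models(cache_dir=None):
    """Return the sorted .pt checkpoint file names in cache_dir (empty if it does not exist)."""
    cache_dir = cache_dir or default_whisper_cache_dir()
    if not os.path.isdir(cache_dir):
        return []
    return sorted(name for name in os.listdir(cache_dir) if name.endswith(".pt"))


if __name__ == "__main__":
    print(default_whisper_cache_dir())
    print(list_downloaded_models())
```

A checkpoint found this way can be copied elsewhere and loaded by path, which is what modelName = "model/large-v2.pt" does in the script.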
Option 2: use the command line (open the conda Python environment in the current directory),
then enter the following command:
whisper audio.mp3 audio.wav --model base --model_dir <model download path>
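The same command can be assembled from Python if you want to batch several files. The sketch below only builds the argument list and leaves the actual call to you; build_whisper_command is a hypothetical helper, and the file and folder names are placeholders. It relies on the --model, --model_dir, --language and --output_format options of the whisper command-line tool.

```python
import subprocess


def build_whisper_command(audio_files, model="base", model_dir=None,
                          language=None, output_format=None):
    """Build the argument list for the whisper CLI without running it."""
    cmd = ["whisper", *audio_files, "--model", model]
    if model_dir is not None:
        cmd += ["--model_dir", model_dir]          # where models are downloaded and loaded from
    if language is not None:
        cmd += ["--language", language]            # e.g. "Chinese"
    if output_format is not None:
        cmd += ["--output_format", output_format]  # e.g. "vtt" or "srt"
    return cmd


if __name__ == "__main__":
    cmd = build_whisper_command(["audio.mp3", "audio.wav"], model="base", model_dir="models")
    print(" ".join(cmd))
    # To actually run it inside the conda environment:
    #   subprocess.run(cmd, check=True)
```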
In my tests it recognized both Chinese and English speech; I also tested it on both mp4 and mp3 files.
An exe built on top of whisper (not my own work) looks like this:
The initialization screen, where the model location is configured
The audio-to-text screen
The microphone-input-to-text screen
I have uploaded this exe to CSDN; help yourself if you need it.
whisper exe file (resource, CSDN Library)
It needs a model file to be loaded (download one from the repository link below):
whisper.cpp/models at master · ggerganov/whisper.cpp (github.com)