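Preface
1. The source code for building a 3D digital human from real-person video is RAD-NeRF, an improved variant of NeRF. NeRF (Neural Radiance Fields) was first presented at ECCV 2020, where it received a Best Paper Honorable Mention. It pushed implicit representations to a new level: using only posed 2D images as supervision, it can represent complex 3D scenes.
NeRF takes a sparse set of multi-view images with known camera poses as input and trains a neural radiance field model, from which sharp images can be rendered from arbitrary viewpoints. In short, an MLP is used to learn a 3D scene implicitly.
NeRF was first applied to novel view synthesis. Because of its strong ability to represent 3D information implicitly, it then developed rapidly in the field of 3D reconstruction.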
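To make the idea of rendering a view from a radiance field more concrete, here is a minimal sketch of the volume-rendering quadrature that NeRF uses: densities and colors sampled along one camera ray are alpha-composited into a single pixel color. The function name `composite_ray` and the toy sample values are illustrative assumptions. In a real NeRF the densities and colors come from the MLP.

```python
import math

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite samples along one ray (NeRF-style quadrature).

    sigmas: density at each sample
    colors: (r, g, b) tuple at each sample
    deltas: distance covered by each sample
    Returns (pixel_rgb, accumulated_opacity).
    """
    transmittance = 1.0  # fraction of light that still reaches the camera
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return tuple(pixel), 1.0 - transmittance

# Toy ray: empty space, then a dense red surface, then a green surface behind it.
rgb, opacity = composite_ray(
    sigmas=[0.0, 50.0, 50.0],
    colors=[(0, 0, 1), (1, 0, 0), (0, 1, 0)],
    deltas=[0.1, 0.1, 0.1],
)
print(rgb, opacity)
```

The empty first sample contributes nothing, the red surface dominates the pixel, and the green surface behind it is almost fully occluded.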
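2. NeRF has several mainstream application areas:
Novel view synthesis
Fine-grained object reconstruction
City-scale reconstruction
Human body reconstruction
3. Real-person video synthesis
Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition (RAD-NeRF)
4. Discussion group (QQ): 787501969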
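I. Training Environment
1. System requirements
I trained on Windows. My setup: Windows 10, an RTX 3080 GPU with 12 GB of VRAM, CUDA 11.7, cuDNN 8.5, Anaconda 3, and Visual Studio 2019.
2. Environment dependencies
Install everything inside a conda environment with Python 3.10.
# Download the source code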
git clone https://github.com/ashawkey/RAD-NeRF.git
cd RAD-NeRF
# Create and activate a virtual environment
conda create --name vrh python=3.10
conda activate vrh
# PyTorch must be installed separately, matched to your CUDA version; otherwise training cannot use the GPU
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
# Install the remaining dependencies
pip install -r requirements.txt
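Since a mismatched PyTorch/CUDA install silently falls back to the CPU, it can be worth checking the environment before going further. The following is a small sketch, not part of RAD-NeRF; the helper name `describe_torch_install` is hypothetical, and it only uses standard PyTorch calls.

```python
import importlib.util

def describe_torch_install():
    """Return a one-line summary of the PyTorch/CUDA state of this environment."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} is installed, but CUDA is NOT available"
    return (f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
            f"GPU: {torch.cuda.get_device_name(0)}")

print(describe_torch_install())
```

Run it inside the activated `vrh` environment. If it reports that CUDA is not available, reinstall PyTorch with the matching `pytorch-cuda` version before training.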
3. Install pytorch3d on Windows. This dependency must also be installed inside the conda environment created above.
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
python setup.py install
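Installing pytorch3d is slow and may abort with an error partway through. Installing the Visual Studio build tools first is recommended. Microsoft C++ Build Tools (Visual Studio): https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/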
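II. Data Preparation
1. Download from the internet, or shoot yourself, a video no longer than 5 minutes. It should show a single person facing the camera against as simple a background as possible, which makes the later portrait matting and face segmentation easier. I downloaded a video of about 5 minutes and used a video editor to crop it to the head and part of the upper body at a 1:1 aspect ratio. The crop size itself does not matter as long as the ratio is 1:1.
2. Set the video project's aspect ratio to 1:1 and its resolution to 512×512.
3. Export the video at 512×512, 25 fps, in MP4 format.
4. Put the exported video in the project directory. I created a directory under data with the same name as the video file and put the cropped video inside it (for example, data/vrhm/vrhm.mp4).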
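The author did the cropping and export in a video editor. As an alternative, the same export settings (square crop, 512×512, 25 fps, MP4) can be sketched with ffmpeg, assuming it is installed and on the PATH. The helper name `build_ffmpeg_cmd` and the input name `raw.mp4` are hypothetical. Note that a centered square crop will not necessarily frame the head and upper body the way a manual crop does.

```python
import subprocess

def build_ffmpeg_cmd(src, dst, size=512, fps=25):
    """Build an ffmpeg command: centered square crop, scale, fixed frame rate."""
    # crop defaults to a centered window; then scale the square to size x size.
    vf = f"crop='min(iw,ih)':'min(iw,ih)',scale={size}:{size}"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-r", str(fps), dst]

cmd = build_ffmpeg_cmd("raw.mp4", "data/vrhm/vrhm.mp4")
print(" ".join(cmd))
# To actually run it (requires ffmpeg and an existing raw.mp4):
# subprocess.run(cmd, check=True)
```

Building the command as a list and passing it to `subprocess.run` avoids shell quoting problems with the filter expression.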
III. Face Model Preparation
1. Face parsing model
The model comes from the AD-NeRF project. Clone AD-NeRF:
git clone https://github.com/YudongGuo/AD-NeRF.git
Copy data_util/face_parsing/79999_iter.pth from the AD-NeRF project to RAD-NeRF/data_utils/face_parsing/79999_iter.pth.
Alternatively, download it directly from within the RAD-NeRF directory. This download may fail.
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_parsing/79999_iter.pth?raw=true -O data_utils/face_parsing/79999_iter.pth
2. Basel face model
Copy the 3DMM directory from AD-NeRF/data_util/face_tracking into RAD-NeRF/data_utils/face_tracking.
Alternatively, download the files directly from within the RAD-NeRF project with the following commands:
## prepare basel face model
# 1. download `01_MorphableModel.mat` from https://faces.dmi.unibas.ch/bfm/main.php?nav=1-2&id=downloads and put it under `data_utils/face_tracking/3DMM/`
# 2. download other necessary files from AD-NeRF's repository:
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/exp_info.npy?raw=true -O data_utils/face_tracking/3DMM/exp_info.npy
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/keys_info.npy?raw=true -O data_utils/face_tracking/3DMM/keys_info.npy
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/sub_mesh.obj?raw=true -O data_utils/face_tracking/3DMM/sub_mesh.obj
wget https://github.com/YudongGuo/AD-NeRF/blob/master/data_util/face_tracking/3DMM/topology_info.npy?raw=true -O data_utils/face_tracking/3DMM/topology_info.npy
Download 01_MorphableModel.mat from https://faces.dmi.unibas.ch/bfm/main.php?nav=1-2&id=downloads and put it in RAD-NeRF/data_utils/face_tracking/3DMM.
Then run the conversion script:
cd xx/xx/Rad-NeRF/data_utils/face_tracking
python convert_BFM.py
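Because several files have to be copied or downloaded by hand, a quick existence check before running the conversion and the data processing can save a failed run. This is a minimal sketch: the file list is taken from the steps above, and the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# Files named in the steps above, relative to the RAD-NeRF root directory.
REQUIRED_FILES = [
    "data_utils/face_parsing/79999_iter.pth",
    "data_utils/face_tracking/3DMM/01_MorphableModel.mat",
    "data_utils/face_tracking/3DMM/exp_info.npy",
    "data_utils/face_tracking/3DMM/keys_info.npy",
    "data_utils/face_tracking/3DMM/sub_mesh.obj",
    "data_utils/face_tracking/3DMM/topology_info.npy",
]

def missing_model_files(repo_root="."):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]

missing = missing_model_files(".")  # run from the RAD-NeRF root
if missing:
    print("Missing files:")
    for rel in missing:
        print("  " + rel)
else:
    print("All face model files are in place.")
```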
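IV. Data Processing
Processing time depends on the length of the video and usually exceeds one hour. There are two ways to process the data. The first runs all steps in one go, but errors can occur along the way, so the second, step-by-step approach is recommended.
1. Process everything in one run
# Adjust the path to match your own data and directory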
python data_utils/process.py data/vrhm/vrhm.mp4
2. Step-by-step processing
python data_utils/process.py data/vrhm/vrhm.mp4 --task 1
--task 1: extract the audio track
--task 2: generate aud_eo.npy (audio features)
--task 3: split the video into frames
--task 4: segment the portrait
--task 5: extract the background image
--task 6: extract the torso and ground-truth images
--task 7: extract face landmarks (generates the lms files)
--task 8: perform face tracking
--task 9: save all data
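The step-by-step approach can be scripted so that the tasks run in order and stop at the first failure, which makes it easy to resume from the failed task. This is a sketch under the assumption that `data_utils/process.py` exits with a non-zero status when a task fails; the helper names `build_task_commands` and `run_tasks` are hypothetical.

```python
import subprocess
import sys

def build_task_commands(video_path, first=1, last=9):
    """One process.py invocation per task, in order."""
    return [
        [sys.executable, "data_utils/process.py", video_path, "--task", str(t)]
        for t in range(first, last + 1)
    ]

def run_tasks(video_path, first=1, last=9):
    """Run tasks in order; return the number of the first failing task, or None."""
    commands = build_task_commands(video_path, first, last)
    for t, cmd in enumerate(commands, start=first):
        print(f"[task {t}] {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            return t
    return None

commands = build_task_commands("data/vrhm/vrhm.mp4")
print(len(commands), "commands prepared")
# failed = run_tasks("data/vrhm/vrhm.mp4")           # run from the RAD-NeRF root
# failed = run_tasks("data/vrhm/vrhm.mp4", first=4)  # resume from task 4
```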
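During processing, four models are downloaded. Without a proxy, these downloads are very slow or fail partway through.
You can also download the models in advance and put them in the expected directory, so that they are not downloaded again during processing. The download URLs are: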
https://download.pytorch.org/models/resnet18-5c106cde.pth
https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth
https://www.adrianbulat.com/downloads/python-fan/2DFAN4-cd938726ad.zip
https://download.pytorch.org/models/alexnet-owt-7be5be79.pth
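Once they are downloaded, put the four models in the target directory, creating it first if it does not exist.
[Figure: target directory for the four models]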
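The pre-download step can be sketched in Python as follows. The four URLs are the ones listed above. The target directory is an assumption: the sketch uses the default torch.hub checkpoint cache (`$TORCH_HOME/hub/checkpoints`, falling back to `~/.cache/torch/hub/checkpoints`), which may differ from the directory the author used, so verify it against the download path printed in your own logs. The helper names are hypothetical.

```python
import os
import urllib.request
from pathlib import Path

MODEL_URLS = [
    "https://download.pytorch.org/models/resnet18-5c106cde.pth",
    "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth",
    "https://www.adrianbulat.com/downloads/python-fan/2DFAN4-cd938726ad.zip",
    "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth",
]

def default_checkpoint_dir():
    """ASSUMED cache location: the default torch.hub checkpoint directory."""
    torch_home = os.environ.get(
        "TORCH_HOME", os.path.join(Path.home(), ".cache", "torch"))
    return Path(torch_home) / "hub" / "checkpoints"

def plan_downloads(target_dir):
    """Map each URL to its destination path, keeping the original file name."""
    target = Path(target_dir)
    return [(url, target / url.rsplit("/", 1)[-1]) for url in MODEL_URLS]

def download_missing(target_dir):
    """Create the directory if needed and fetch only the files not yet present."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for url, dest in plan_downloads(target_dir):
        if dest.exists():
            continue
        print(f"downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, dest)

for _, dest in plan_downloads(default_checkpoint_dir()):
    print(dest)
# download_missing(default_checkpoint_dir())  # uncomment to actually download
```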
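3. During processing, the following directories are created under the video's directory in data:
[Figure: directories generated during processing]
Pay particular attention to the parsing directory, which holds the segmentation results.
Check the segmentation quality. If it is poor, segment the portrait with another tool first; otherwise the trained character will show see-through patches of background or visible breaks. For example, in data I processed later:
[Figure: parsing result with a white patch below the neck]
There is a white patch below the person's neck. Only after training finished and the digital human was generated did I discover that the segmentation model had treated this region as background. In the synthesized video the patch shows the green background, which ruined the result.
When preparing data, also try to avoid subjects whose hair hangs down loosely, since this easily causes misaligned seams.
When I first trained on this data I did not yet understand these key factors. In the result of the first training run, the connection between the head and the body looks unnatural.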
V. Model Training
First, look at the arguments defined in the training script. For training you only need to pay attention to a few main ones.
import torch
import argparse
from nerf.provider import NeRFDataset
from nerf.gui import NeRFGUI
from nerf.utils import *
# torch.autograd.set_detect_anomaly(True)
if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('path', type=str)
    parser.add_argument('-O', action='store_true', help="equals --fp16 --cuda_ray --exp_eye")
    parser.add_argument('--test', action='store_true', help="test mode (load model and test dataset)")
    parser.add_argument('--test_train', action='store_true', help="test mode (load model and train dataset)")
    parser.add_argument('--data_range', type=int, nargs='*', default=[0, -1], help="data range to use")
    parser.add_argument('--workspace', type=str, default='workspace')
    parser.add_argument('--seed', type=int, default=0)

    ### training options
    parser.add_argument('--iters', type=int, default=200000, help="training iters")
    parser.add_argument('--lr', type=float, default=5e-3, help="initial learning rate")
    parser.add_argument('--lr_net', type=float, default=5e-4, help="initial learning rate")
    parser.add_argument('--ckpt', type=str, default='latest')
    parser.add_argument('--num_rays', type=int, default=4096 * 16, help="num rays sampled per image for each training step")
    parser.add_argument('--cuda_ray', action='store_true', help="use CUDA raymarching instead of pytorch")
    parser.add_argument('--max_steps', type=int, default=16, help="max num steps sampled per ray (only valid when using --cuda_ray)")
    parser.add_argument('--num_steps', type=int, default=16, help="num steps sampled per ray (only valid when NOT using --cuda_ray)")
    parser.add_argument('--upsample_steps', type=int, default=0, help="num steps up-sampled per ray (only valid when NOT using --cuda_ray)")
    parser.add_argument('--update_extra_interval', type=int, default=16, help="iter interval to update extra status (only valid when using --cuda_ray)")
    parser.add_argument('--max_ray_batch', type=int, default=4096, help="batch size of rays at inference to avoid OOM (only valid when NOT using --cuda_ray)")

    ### network backbone options
    parser.add_argument('--fp16', action='store_true', help="use amp mixed precision training")
    parser.add_argument('--lambda_amb', type=float, default=0.1, help="lambda for ambient loss")
    parser.add_argument('--bg_img', type=str, default='', help="background image")
    parser.add_argument('--fbg', action='store_true', help="frame-wise bg")
    parser.add_argument('--exp_eye', action='store_true', help="explicitly control the eyes")
    parser.add_argument('--fix_eye', type=float, default=-1, help="fixed eye area, negative to disable, set to 0-0.3 for a reasonable eye")
    parser.add_argument('--smooth_eye', action='store_true', help="smooth the eye area sequence")
    parser.add_argument('--torso_shrink', type=float, default=0.8, help="shrink bg coords to allow more flexibility in deform")

    ### dataset options
    parser.add_argument('--color_space', type=str, default='srgb', help="Color space, supports (linear, srgb)")
    parser.add_argument('--preload', type=int, default=0, help="0 means load data from disk on-the-fly, 1 means preload to CPU, 2 means GPU.")
    # (the default value is for the fox dataset)
    parser.add_argument('--bound', type=float, default=1, help="assume the scene is bounded in box[-bound, bound]^3, if > 1, will invoke adaptive ray marching.")
    parser.add_argument('--scale', type=float, default=4, help="scale camera location into box[-bound, bound]^3")
    parser.add_argument('--offset', type=float, nargs='*', default=[0, 0, 0], help="offset of camera location")
    parser.add_argument('--dt_gamma', type=float, default=1/256, help="dt_gamma (>=0) for adaptive ray marching. set to 0 to disable, >0 to accelerate rendering (but usually with worse quality)")
    parser.add_argument('--min_near', type=float, default=0.05, help="minimum near distance for camera")
    parser.add_argument('--density_thresh', type=float, default=10, help="threshold for density grid to be occupied (sigma)")
    parser.add_argument('--density_thresh_torso', type=float, default=0.01, help="threshold for density grid to be occupied (alpha)")
    parser.add_argument('--patch_size', type=int, default=1, help="[experimental] render patches in training, so as to apply LPIPS loss. 1 means disabled, use [64, 32, 16] to enable")
    parser.add_argument('--finetune_lips', action='store_true', help="use LPIPS and landmarks to fine tune lips region")
    parser.add_argument('--smooth_lips', action='store_true', help="smooth the enc_a in a exponential decay way...")
    parser.add_argument('--torso', action='store_true', help="fix head and train torso")
    parser.add_argument('--head_ckpt', type=str, default='', help="head model")

    ### GUI options
    parser.add_argument('--gui', action='store_true', help="start a GUI")
    parser.add_argument('--W', type=int, default=450, help="GUI width")
    parser.add_argument('--H', type=int, default=450, help="GUI height")
    parser.add_argument('--radius', type=float, default=3.35, help="default GUI camera radius from center")
    parser.add_argument('--fovy', type=float, default=21.24, help="default GUI camera fovy")
    parser.add_argument('--max_spp', type=int, default=1, help="GUI rendering max sample per pixel")

    ### else
    parser.add_argument('--att', type=int, default=2, help="audio attention mode (0 = turn off, 1 = left-direction, 2 = bi-direction)")
    parser.add_argument('--aud', type=str, default='', help="audio source (empty will load the default, else should be a path to a npy file)")
    parser.add_argument('--emb', action='store_true', help="use audio class + embedding instead of logits")
    parser.add_argument('--ind_dim', type=int, default=4, help="individual code dim, 0 to turn off")
    parser.add_argument('--ind_num', type=int, default=10000, help="number of individual codes, should be larger than training dataset size")
    parser.add_argument('--ind_dim_torso', type=int, default=8, help="individual code dim, 0 to turn off")
    parser.add_argument('--amb_dim', type=int, default=2, help="ambient dimension")
    parser.add_argument('--part', action='store_true', help="use partial training data (1/10)")
    parser.add_argument('--part2', action='store_true', help="use partial training data (first 15s)")
    parser.add_argument('--train_camera', action='store_true', help="optimize camera pose")
    parser.add_argument('--smooth_path', action='store_true', help="brute-force smooth camera pose trajectory with a window size")
    parser.add_argument('--smooth_path_window', type=int, default=7, help="smoothing window size")

    # asr
    parser.add_argument('--asr', action='store_true', help="load asr for real-time app")
    parser.add_argument('--asr_wav', type=str, default='', help="load the wav and use as input")
    parser.add_argument('--asr_play', action='store_true', help="play out the audio")
    parser.add_argument('--asr_model', type=str, default='cpierse/wav2vec2-large-xlsr-53-esperanto')
    # parser.add_argument('--asr_model', type=str, default='facebook/wav2vec2-large-960h-lv60-self')
    parser.add_argument('--asr_save_feats', action='store_true')
    # audio FPS
    parser.add_argument('--fps', type=int, default=50)
    # sliding window left-middle-right length (unit: 20ms)
    parser.add_argument('-l', type=int, default=10)
    parser.add_argument('-m', type=int, default=50)
    parser.add_argument('-r', type=int, default=10)

    opt = parser.parse_args()

    if opt.O:
        opt.fp16 = True
        opt.exp_eye = True

    if opt.test:
        opt.smooth_path = True
        opt.smooth_eye = True
        opt.smooth_lips = True

    opt.cuda_ray = True
    # assert opt.cuda_ray, "Only support CUDA ray mode."

    if opt.patch_size > 1:
        # assert opt.patch_size > 16, "patch_size should > 16 to run LPIPS loss."
        assert opt.num_rays % (opt.patch_size ** 2) == 0, "patch_size ** 2 should be dividable by num_rays."

    if opt.finetune_lips:
        # do not update density grid in finetune stage
        opt.update_extra_interval = 1e9

    from nerf.network import NeRFNetwork

    print(opt)
    seed_everything(opt.seed)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = NeRFNetwork(opt)

    # manually load state dict for head
    if opt.torso and opt.head_ckpt != '':
        model_dict = torch.load(opt.head_ckpt, map_location='cpu')['model']
        missing_keys, unexpected_keys = model.load_state_dict(model_dict, strict=False)
        if len(missing_keys) > 0:
            print(f"[WARN] missing keys: {missing_keys}")
        if len(unexpected_keys) > 0:
            print(f"[WARN] unexpected keys: {unexpected_keys}")
        # freeze these keys
        for k, v in model.named_parameters():
            if k in model_dict:
                # print(f'[INFO] freeze {k}, {v.shape}')
                v.requires_grad = False

    # print(model)
    criterion = torch.nn.MSELoss(reduction='none')

    if opt.test:
        if opt.gui:
            metrics = [] # use no metric in GUI for faster initialization...
        else:
            # metrics = [PSNRMeter(), LPIPSMeter(device=device)]
            metrics = [PSNRMeter(), LPIPSMeter(device=device), LMDMeter(backend='fan')]

        trainer = Trainer('ngp', opt, model, device=device, workspace=opt.workspace, criterion=criterion, fp16=opt.fp16, metrics=metrics, use_checkpoint=opt.ckpt)

        if opt.test_train:
            test_set = NeRFDataset(opt, device=device, type='train')
            # a manual fix to test on the training dataset
            test_set.training = False
            test_set.num_rays = -1
            test_loader = test_set.dataloader()
        else:
            test_loader = NeRFDataset(opt, device=device, type='test').dataloader()

        # temp fix: for update_extra_states
        model.aud_features = test_loader._data.auds
        model.eye_areas = test_loader._data.eye_area

        if opt.gui:
            # we still need test_loader to provide audio features for testing.
            with NeRFGUI(opt, trainer, test_loader) as gui:
                gui.render()
        else:
            ### evaluate metrics (slow)
            if test_loader.has_gt:
                trainer.evaluate(test_loader)
            ### test and save video (fast)
            trainer.test(test_loader)

    else:
        optimizer = lambda model: torch.optim.Adam(model.get_params(opt.lr, opt.lr_net), betas=(0.9, 0.99), eps=1e-15)
        train_loader = NeRFDataset(opt, device=device, type='train').dataloader()
        assert len(train_loader) < opt.ind_num, f"[ERROR] dataset too many frames: {len(train_loader)}, please increase --ind_num to this number!"

        # temp fix: for update_extra_states
        model.aud_features = train_loader._data.auds
        model.eye_area = train_loader._data.eye_area
        model.poses = train_loader._data.poses

        # decay to 0.1 * init_lr at last iter step
        if opt.finetune_lips:
            scheduler = lambda optimizer: optim.lr_scheduler.LambdaLR(optimizer, lambda iter: 0.05 ** (iter / opt.iters))
        else:
            scheduler = lambda optimizer: optim.lr_scheduler.LambdaLR(optimizer, lambda iter: 0.1 ** (iter / opt.iters))

        metrics = [PSNRMeter(), LPIPSMeter(device=device)]
        eval_interval = max(1, int(5000 / len(train_loader)))
        trainer = Trainer('ngp', opt, model, device=device, workspace=opt.workspace, optimizer=optimizer, criterion=criterion, ema_decay=0.95, fp16=opt.fp16, lr_scheduler=scheduler, scheduler_update_every_step=True, metrics=metrics, use_checkpoint=opt.ckpt, eval_interval=eval_interval)

        if opt.gui:
            with NeRFGUI(opt, trainer, train_loader) as gui:
                gui.render()
        else:
            valid_loader = NeRFDataset(opt, device=device, type='val', downscale=1).dataloader()
            max_epoch = np.ceil(opt.iters / len(train_loader)).astype(np.int32)
            print(f'[INFO] max_epoch = {max_epoch}')
            trainer.train(train_loader, valid_loader, max_epoch)

            # free some mem
            del train_loader, valid_loader
            torch.cuda.empty_cache()

            # also test
            test_loader = NeRFDataset(opt, device=device, type='test').dataloader()
            if test_loader.has_gt:
                trainer.evaluate(test_loader) # blender has gt, so evaluate it.
            trainer.test(test_loader)
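Two pieces of arithmetic in the script above are worth spelling out: the learning-rate schedule `0.1 ** (iter / opt.iters)` (or `0.05 ** ...` when fine-tuning lips) and the conversion from `--iters` to `max_epoch`. The sketch below reproduces both in plain Python. The helper names and the 7,000-frame dataset size are illustrative assumptions, and it assumes one training frame per iteration so that `len(train_loader)` equals the number of frames.

```python
import math

def lr_at(step, total_iters, init_lr, final_factor=0.1):
    """Learning rate under the LambdaLR rule final_factor ** (step / total_iters)."""
    return init_lr * final_factor ** (step / total_iters)

def max_epochs(total_iters, frames_per_epoch):
    """Same computation as np.ceil(opt.iters / len(train_loader)) in the script."""
    return math.ceil(total_iters / frames_per_epoch)

# Default --lr 5e-3 decays smoothly to 10% of its initial value at the last step.
for step in (0, 100_000, 200_000):
    print(step, lr_at(step, 200_000, 5e-3))

# With a hypothetical 7,000-frame training set, 200,000 iterations need 29 epochs.
print(max_epochs(200_000, 7_000))
```

The decay is exponential rather than stepwise, so halfway through training the rate has already dropped to about 32% of its initial value.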
Parameters:
--preload 0: load data from disk on the fly
--preload 1: preload data into CPU memory (about 70 GB of RAM)
--preload 2: preload data into GPU memory (about 24 GB of VRAM)
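To get a feel for why preloading needs so much memory, the following back-of-the-envelope sketch computes the size of one dense copy of all frames of a 512×512, 25 fps clip. It is only a rough estimate under the stated assumptions (one RGB array per frame, float32 or uint8 storage) and is not how RAD-NeRF itself accounts for memory, which also has to hold torso images, backgrounds, and other per-frame data. The helper name `preload_bytes` is hypothetical.

```python
def preload_bytes(duration_s, fps=25, width=512, height=512,
                  channels=3, bytes_per_value=4):
    """Bytes needed to hold every frame as one dense array (float32 by default)."""
    frames = int(duration_s * fps)
    return frames * width * height * channels * bytes_per_value

five_min_f32 = preload_bytes(5 * 60)
five_min_u8 = preload_bytes(5 * 60, bytes_per_value=1)
print(f"float32: {five_min_f32 / 1e9:.1f} GB")
print(f"uint8:   {five_min_u8 / 1e9:.1f} GB")
```

For a 5-minute clip (7,500 frames) this gives roughly 23.6 GB per float32 copy, which is why shorter clips or `--preload 0` are the practical choice on a 12 GB GPU.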
1. Head training
python main.py data/vrhm/ --workspace trial_vrhm/ -O --iters 200000
2. Lip fine-tuning
python main.py data/vrhm/ --workspace trial_vrhm/ -O --iters 500000 --finetune_lips
3. Torso training
python main.py data/vrhm/ --workspace trial_vrhm_torso/ -O --torso --head_ckpt <trial_ID>/checkpoints/ngp_xxx.pth --iters 200000 --preload 2
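The three stages have to run in this order, with the torso stage pointing at a head checkpoint produced by the first two. The sketch below only assembles the three command lines from the flags shown above, so the sequence can be reviewed or scripted. The helper name `build_training_commands` and the checkpoint path `path/to/head_checkpoint.pth` are hypothetical placeholders; substitute the real checkpoint file from your head workspace.

```python
import sys

def build_training_commands(data_dir, head_workspace, torso_workspace, head_ckpt,
                            head_iters=200_000, lips_iters=500_000,
                            torso_iters=200_000):
    """Return the head, lip fine-tuning, and torso commands, in execution order."""
    base = [sys.executable, "main.py", data_dir]
    head = base + ["--workspace", head_workspace, "-O", "--iters", str(head_iters)]
    lips = base + ["--workspace", head_workspace, "-O",
                   "--iters", str(lips_iters), "--finetune_lips"]
    torso = base + ["--workspace", torso_workspace, "-O", "--torso",
                    "--head_ckpt", head_ckpt,
                    "--iters", str(torso_iters), "--preload", "2"]
    return [head, lips, torso]

commands = build_training_commands(
    "data/vrhm/", "trial_vrhm/", "trial_vrhm_torso/",
    head_ckpt="path/to/head_checkpoint.pth",  # placeholder, not a real file
)
for cmd in commands:
    print(" ".join(cmd))
```

The lip fine-tuning stage reuses the head workspace so that it resumes from the latest head checkpoint, and its `--iters` value is the total iteration count rather than an additional amount.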
VI. Troubleshooting
Error: No module named 'sklearn'
pip install -U scikit-learn
Note: if you are interested in this project or run into errors during installation, you can join my QQ group, 487350510, to discuss.