diffusers-Understanding models and schedulers


https://huggingface.co/docs/diffusers/using-diffusers/write_own_pipeline

diffusers has three modules: diffusion pipelines, noise schedulers, and models. The library is well designed, and its design philosophy is on par with the MMLab series. The MM-series generative algorithms live in MMagic, but MMagic is less comprehensive than diffusers, and nearly all new algorithms now release their training and inference code in the standard diffusers form.

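Here is a standard diffusers-style load and forward pass for a Stable Diffusion model. Paired with the Hugging Face Hub, this workflow is hard to beat. The model is the text-to-image algorithm from SkyPaint (天工巧繪).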

from diffusers import StableDiffusionPipeline

device = 'cuda'
pipe = StableDiffusionPipeline.from_pretrained("path_to_our_model").to(device)

prompts = [
    '機械狗',
    '城堡 大海 夕陽 宮崎駿動畫',
    '花落知多少',
    '雞你太美',
]

for prompt in prompts:
    prompt = 'sai-v1 art, ' + prompt
    image = pipe(prompt).images[0]  
    image.save("%s.jpg" % prompt)

1.pipelines

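A pipeline wraps the necessary components (several independently trained models, a scheduler, and processors) in a single end-to-end class. All pipelines are built on DiffusionPipeline, which provides the basic functionality for loading, downloading, and saving all components. Pipelines do not provide training; UNet2DModel and UNet2DConditionModel are trained separately.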

Below are the pipelines supported as of v0.21.0; more are added continually.

[Figure: pipelines supported in diffusers v0.21.0]

Example:

from diffusers import DDPMPipeline

ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
image = ddpm(num_inference_steps=25).images[0]
image

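In the example above, the pipeline contains a UNet2DModel and a DDPMScheduler. The pipeline denoises an image by taking random noise of the desired output size and passing it through the model several times. At each timestep, the model predicts the noise residual, and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the specified number of inference steps.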

To recreate the pipeline from the model and scheduler used separately, write the denoising process by hand:

1. Load the model and scheduler

from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")

2. Set the number of timesteps for the denoising process

scheduler.set_timesteps(50)

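3. Setting the scheduler timesteps creates a tensor with evenly spaced elements, 50 in this example. Each element corresponds to a timestep at which the model denoises the image. When you create the denoising loop later, you iterate over this tensor to denoise the image: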

scheduler.timesteps
tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
    700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
    420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
    140, 120, 100,  80,  60,  40,  20,   0])
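The spacing in the tensor above can be reproduced with a few lines of plain Python. This is only a sketch: it assumes the scheduler was trained with 1000 timesteps (consistent with the step of 20 seen above) and uses a hypothetical helper, evenly_spaced_timesteps, rather than the scheduler's own code, which supports other spacing strategies as well.

```python
def evenly_spaced_timesteps(num_train_timesteps, num_inference_steps):
    """Return descending, evenly spaced timesteps (hypothetical helper)."""
    # Distance between two consecutive inference timesteps.
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio for i in range(num_inference_steps)]
    # Denoising runs from the noisiest timestep down to 0.
    return ascending[::-1]


# Assumption: 1000 training timesteps, 50 inference steps as in the text.
timesteps = evenly_spaced_timesteps(1000, 50)
print(timesteps[:3], timesteps[-3:])  # [980, 960, 940] [40, 20, 0]
```

Fewer inference steps simply widen the gap between consecutive timesteps, which is why the step count trades speed against quality.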

4. Create some random noise with the same shape as the desired output

import torch

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda")

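5. Write a loop to iterate over the timesteps. At each timestep, the model runs a UNet2DModel.forward() pass and returns the noisy residual. The scheduler's step() method takes the noisy residual, the timestep, and the input, and predicts the image at the previous timestep. That output becomes the next input to the model in the denoising loop, and this repeats until the end of the timesteps array is reached. This is the entire denoising process.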

input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = previous_noisy_sample
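To see the control flow in isolation, the loop can be run with toy stand-ins that need no GPU or model weights. ToyModel and ToyScheduler below are hypothetical classes that only mimic the call pattern (model(sample, t) returns a residual, scheduler.step(residual, t, sample) returns the previous sample); the arithmetic is deliberately trivial and is not the DDPM update rule.

```python
class ToyModel:
    """Hypothetical stand-in for UNet2DModel."""

    def __call__(self, sample, t):
        # Pretend the clean value is 0.0, so the whole sample is "noise".
        return sample


class ToyScheduler:
    """Hypothetical stand-in for a scheduler (not the real DDPM math)."""

    def __init__(self, num_inference_steps):
        # Timesteps count down, as in scheduler.timesteps.
        self.timesteps = list(range(num_inference_steps - 1, -1, -1))

    def step(self, noise_pred, t, sample):
        # Remove half of the predicted noise at every step.
        return sample - 0.5 * noise_pred


model = ToyModel()
scheduler = ToyScheduler(num_inference_steps=3)

sample = 8.0  # stands in for the initial random noise
for t in scheduler.timesteps:
    noisy_residual = model(sample, t)
    sample = scheduler.step(noisy_residual, t, sample)

print(sample)  # 1.0 (8.0 -> 4.0 -> 2.0 -> 1.0)
```

The point is the data flow: each step's output is the next step's input, and only the scheduler decides how much of the predicted noise to remove.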

6. Finally, convert the denoised output into an image

from PIL import Image

image = (input / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image
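The conversion maps model outputs, nominally in [-1, 1], to 8-bit pixel values. A scalar sketch of the same arithmetic, using a hypothetical helper to_uint8, makes the clamping behaviour easy to check:

```python
def to_uint8(value):
    """Map a value nominally in [-1, 1] to an integer pixel in [0, 255]."""
    scaled = value / 2 + 0.5              # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)  # out-of-range values are clipped
    return round(clamped * 255)           # [0, 1] -> [0, 255]


for value in (-1.0, 1.0, 3.0, -3.0):
    print(value, to_uint8(value))
```

Values outside [-1, 1] saturate at 0 or 255 instead of wrapping around, which is why the clamp comes before the cast to uint8.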

2.stable diffusion pipeline

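Stable Diffusion is a text-to-image latent diffusion model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image rather than the actual pixel space, which makes it more memory efficient. The encoder compresses the image into a smaller representation, and the decoder converts the compressed representation back into an image. For text-to-image models, you also need a tokenizer and a text encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.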

from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
)
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
)

Instead of the default PNDMScheduler, use UniPCMultistepScheduler:

from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

Move the models to the GPU to speed up inference. The scheduler has no trainable weights, so it does not matter whether it is on the GPU.

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

2.1 create text embeddings

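Tokenize the text to generate embeddings. The text is used to condition the UNet and steer the diffusion process toward something that resembles the input prompt. The guidance_scale parameter determines how much weight the prompt is given when generating the image.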

prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise
batch_size = len(prompt)

Tokenize the text and generate the text embeddings:

text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

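You also need to generate the unconditional text embeddings, which are the embeddings for the padding token (an empty prompt). They must have the same shape (batch_size and seq_length) as the conditional text embeddings.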

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

Concatenate the unconditional and conditional embeddings into a single batch to avoid running two forward passes:

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

2.2 create random noise

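Next, generate some initial random noise as the starting point for the diffusion process. This is the latent representation of the image, and it will be gradually denoised. At this point the latent image is smaller than the final image, but that is fine because the model will later transform it into the final 512x512 image.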

The height and width are divided by 8 because the VAE has 3 downsampling layers. You can check this with:

2 ** (len(vae.config.block_out_channels) - 1) == 8

latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)
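The shape arithmetic can be worked through without loading any weights. The sketch below uses a hypothetical helper, latent_shape, and assumes a VAE config with four entries in block_out_channels and a UNet with 4 input channels, which is what the Stable Diffusion v1 checkpoints are generally configured with; read the real values from vae.config and unet.config when in doubt.

```python
def latent_shape(batch_size, latent_channels, height, width, block_out_channels):
    """Compute the latent tensor shape for a given image size (hypothetical helper)."""
    # Each VAE block after the first halves the spatial resolution.
    factor = 2 ** (len(block_out_channels) - 1)
    if height % factor or width % factor:
        raise ValueError(f"height and width must be divisible by {factor}")
    return (batch_size, latent_channels, height // factor, width // factor)


# Assumed config values, not read from a checkpoint.
shape = latent_shape(1, 4, 512, 512, block_out_channels=[128, 256, 512, 512])
print(shape)  # (1, 4, 64, 64)
```

Under these assumptions the UNet denoises a 4x64x64 tensor instead of a 3x512x512 image, which is where the memory saving of latent diffusion comes from.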

2.3 denoise the image

First, scale the input by the initial noise distribution's sigma, the noise scale value. This is required for improved schedulers such as UniPCMultistepScheduler.

latents = latents * scheduler.init_noise_sigma

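The last step is to create the denoising loop that progressively transforms the pure noise in the latents into an image described by the prompt. Remember, the denoising loop needs to do three things: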

1. Set the scheduler's timesteps to use during denoising.
2. Iterate over the timesteps.
3. At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.

from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample

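Classifier-free guidance: the UNet is run on both the unconditional (empty-prompt) embeddings and the text-conditioned embeddings, producing two noise predictions at every timestep. The final prediction starts from the unconditional one and moves along the difference between the two, scaled by guidance_scale: noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond). A larger guidance_scale pushes the result harder toward the prompt. No separate classifier and no class-label input are needed, which is where the name comes from. In the code above, torch.cat([latents] * 2) and noise_pred.chunk(2) are what implement it: the latents are duplicated so that both predictions come from one batched forward pass, then the result is split apart again.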
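The guidance formula is easy to check on plain numbers. The sketch below applies it element-wise to two short lists standing in for the two noise predictions; apply_guidance is a hypothetical helper and the values are made up for illustration.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Element-wise uncond + scale * (text - uncond) (hypothetical helper)."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


# Made-up stand-ins for the two halves returned by noise_pred.chunk(2).
noise_uncond = [0.0, 1.0, -2.0]
noise_text = [1.0, 1.0, 0.0]

no_guidance = apply_guidance(noise_uncond, noise_text, 1.0)
strong = apply_guidance(noise_uncond, noise_text, 7.5)

print(no_guidance)  # [1.0, 1.0, 0.0], identical to noise_text
print(strong)       # [7.5, 1.0, 13.0]
```

With a scale of 1.0 the result is just the text-conditioned prediction; larger scales extrapolate past it, and elements where the two predictions agree are left unchanged.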

2.4 decode the image

Use the VAE to decode the latent representation into an image:

# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

# map from [-1, 1] to [0, 255] once, then convert to a PIL image
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image
