
DALLE2 Paper Walkthrough and Implementation (Part 1)

Author: 晚點吧 · Category: Toy博客 · 2 years ago


DALLE2: Hierarchical Text-Conditional Image Generation with CLIP Latents
paper: https://cdn.openai.com/papers/dall-e-2.pdf
github: https://github.com/lucidrains/DALLE2-pytorch
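DALLE2 overview:
- CLIP model:
Produces the text embedding zt and the image embedding zi.
- prior model:
1) Input: the encoded text, the CLIP text embedding, time_embed, image_embed, learned_queries (the per-token text encoding, the pooled text embedding, the timestep embedding, the noised image embedding at the current step t, and a manually constructed learnable embedding whose transformer output is read off as the prediction).
2) Model: a diffusion model that uses a transformer (not a U-Net) to predict x0 directly, then applies the diffusion recurrence to obtain the image embedding of the previous step.
3) Final output: an image embedding (distinct from the image embedding produced by CLIP above).
- decoder model
1) Input: the image embedding produced by the prior.
2) Model: a diffusion model with a U-Net that predicts the noise z (unlike the prior, which predicts x0 directly).
3) Output: after T denoising steps, the final x0 is the model output.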

0 Abstract

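Building on contrastive learning (CLIP), the paper proposes a two-stage model:
① A prior model:

generates a CLIP image embedding conditioned on the text.

② A decoder model:

generates an image conditioned on the image embedding.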

We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher quality samples.
A diffusion model is used for the decoder. For the prior, both autoregressive and diffusion models were tried, and the diffusion prior turned out to be computationally cheaper and to give better samples.

1 Introduction

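Above the dashed line is the CLIP model, which learns the text and image embeddings.
Below the dashed line is the text-to-image generation process:
① The CLIP text embedding is fed to an autoregressive or diffusion model (the prior), which generates an image embedding.
② That image embedding is then fed to the decoder, which generates the final image.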

[Figure: DALLE2 overview. Above the dashed line: CLIP; below the dashed line: text-to-image generation (prior, then decoder).]

2 Method

Our training dataset consists of pairs (x, y) of images x and their corresponding captions y. Given an image x, let zi and zt be its CLIP image and text embeddings, respectively. We design our generative stack to produce images from captions using two components:
The training set consists of pairs (x, y), where x is an image and y is its caption. Given x and y, the CLIP model produces the image embedding zi and the text embedding zt, respectively.

A prior P(zi|y) that produces CLIP image embeddings zi conditioned on captions y.
The prior generates the image embedding zi given the caption.
A decoder P(x|zi, y) that produces images x conditioned on CLIP image embeddings zi (and optionally text captions y).
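The decoder generates the final image x conditioned on zi.
Stacking the two components gives the full generative model of an image x given a caption y:
$P(x|y) = P(x, z_i|y) = P(x|z_i, y)\,P(z_i|y)$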
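The two-stage factorization can be read as ancestral sampling: first draw zi from the prior, then draw x from the decoder. The sketch below only illustrates that control flow. `sample_prior`, `sample_decoder` and `generate` are hypothetical stand-ins built from toy Gaussians; they are not the models from the paper or from DALLE2-pytorch.

```python
import random

def sample_prior(caption, rng, dim=4):
    """Toy stand-in for the prior P(zi | y): a Gaussian whose mean depends on the caption."""
    mean = (sum(ord(c) for c in caption) % 10) / 10.0
    return [rng.gauss(mean, 0.1) for _ in range(dim)]

def sample_decoder(z_i, rng, num_pixels=6):
    """Toy stand-in for the decoder P(x | zi, y): 'pixels' scattered around the embedding mean."""
    center = sum(z_i) / len(z_i)
    return [rng.gauss(center, 0.05) for _ in range(num_pixels)]

def generate(caption, seed=0):
    """Sample zi ~ P(zi | y), then x ~ P(x | zi, y)."""
    rng = random.Random(seed)
    z_i = sample_prior(caption, rng)
    x = sample_decoder(z_i, rng)
    return z_i, x

z_i, image = generate("a corgi playing a trumpet")
```

In the real system each stand-in is replaced by a full diffusion sampling loop; only the order of the two calls carries over.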

2.1 Decoder

We use diffusion models to produce images conditioned on CLIP image embeddings (and optionally text captions).
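Starting from the image embedding produced by the prior, a diffusion model generates the image.
The image embedding is added directly to the timestep embedding as conditioning (the text embedding can optionally be added too, but experiments found it of little use). A U-Net then predicts the noise, and the diffusion denoising formula below is applied step by step to produce the final image x.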
$\bar\mu_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}} z_t \right)$
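A minimal sketch of one denoising step with this formula, assuming the usual DDPM definitions $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s \le t} \alpha_s$. The linear beta schedule and all function names are illustrative assumptions, and `z_pred` stands in for the U-Net output.

```python
import math

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule; index 0 is a placeholder so that t runs from 1 to T."""
    step = (beta_end - beta_start) / (T - 1)
    return [0.0] + [beta_start + step * i for i in range(T)]

def cumulative_alphas(betas):
    """alpha_t = 1 - beta_t and alpha_bar_t = product of alpha_s for s <= t."""
    alphas = [1.0 - b for b in betas]
    alpha_bars = [1.0]
    for a in alphas[1:]:
        alpha_bars.append(alpha_bars[-1] * a)
    return alphas, alpha_bars

def posterior_mean_from_noise(x_t, z_pred, t, alphas, alpha_bars):
    """Mean of x_{t-1} given x_t and the predicted noise z_pred (valid for t >= 1)."""
    coef = (1.0 - alphas[t]) / math.sqrt(1.0 - alpha_bars[t])
    return [(x - coef * z) / math.sqrt(alphas[t]) for x, z in zip(x_t, z_pred)]

betas = linear_beta_schedule(T=1000)
alphas, alpha_bars = cumulative_alphas(betas)
x_t = [0.5, -1.0, 2.0]
z_pred = [0.0, 0.0, 0.0]  # pretend the U-Net predicted zero noise
mean = posterior_mean_from_noise(x_t, z_pred, 10, alphas, alpha_bars)
```

With zero predicted noise the step reduces to rescaling x_t by $1/\sqrt{\alpha_t}$. A full sampler would also add Gaussian noise with the posterior variance at every step except the last.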

2.2 Prior

- While a decoder can invert CLIP image embeddings zi to produce images x, we need a prior model that produces zi from captions y to enable image generations from text captions.
The decoder takes an image embedding zi and generates an image x, so a prior model is needed to produce zi from the caption.

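- Diffusion prior: The continuous vector zi is directly modelled using a Gaussian diffusion model conditioned on the caption y.
Diffusion prior: given the caption y (as the text vector produced by CLIP), a Gaussian diffusion model generates zi directly. To improve sample quality, the text conditioning is randomly masked out 10% of the time during training, which is what allows classifier-free guidance at sampling time.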
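The 10% masking can be sketched as a per-example coin flip that swaps the text conditioning for a "null" embedding. The post only describes the masking itself; the second function shows the standard classifier-free guidance combination that such training makes possible at sampling time. Function names, the all-zero null embedding and the guidance scale are illustrative assumptions.

```python
import random

def maybe_drop_text(text_emb, null_emb, p_drop, rng):
    """Return null_emb with probability p_drop, otherwise the real text embedding."""
    return null_emb if rng.random() < p_drop else text_emb

def guided_prediction(pred_cond, pred_uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]

rng = random.Random(0)
text_emb = [0.3, -0.7, 1.2]
null_emb = [0.0, 0.0, 0.0]  # assumption: zeros as the "no text" placeholder
trials = 10000
dropped = sum(maybe_drop_text(text_emb, null_emb, 0.1, rng) is null_emb for _ in range(trials))
drop_rate = dropped / trials  # close to 0.1 over many trials
```

A scale of 1 recovers the conditional prediction and a scale of 0 the unconditional one; larger scales trade diversity for closer agreement with the text.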


For the diffusion prior, a decoder-only Transformer is trained with a causal attention mask over the input sequence. It predicts x0 (key point: not the noise zt).
Transformer input: the encoded text, the CLIP text embedding, time_embed, image_embed, learned_queries (the per-token text encoding, the pooled text embedding, the timestep embedding, the noised image embedding at the current step t, and a manually constructed learnable embedding whose transformer output is read off as the prediction).
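A small sketch of how this input sequence and the causal mask fit together. The token labels are illustrative only and are not identifiers from DALLE2-pytorch. The point is that, under a causal mask, the final learned query is the only position that can attend to every other token, which is why its output is used as the x0 prediction.

```python
def build_prior_sequence(num_text_tokens):
    """Token labels in the order listed above (labels are hypothetical)."""
    text_tokens = [f"text_token_{i}" for i in range(num_text_tokens)]
    return text_tokens + ["clip_text_embed", "time_embed", "noised_image_embed", "learned_query"]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = build_prior_sequence(num_text_tokens=3)
mask = causal_mask(len(tokens))

# Tokens visible from the last position (the learned query): all of them.
visible_to_query = [tok for tok, allowed in zip(tokens, mask[-1]) if allowed]
```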
Diffusion process: xt is initialized randomly. At each step the transformer predicts x0 directly, and the reverse-diffusion formula below turns xt and the predicted x0 into x(t-1). This repeats until the final step, x0.
$\bar\mu_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\,(1-\alpha_t)}{1-\bar\alpha_t} x_0$
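The x0-based mean used by the prior and the noise-based mean used by the decoder are two parameterizations of the same posterior: substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\, z_t)/\sqrt{\bar\alpha_t}$ into the formula above gives the decoder formula. The sketch below checks this numerically. The linear beta schedule and the function names are assumptions made for illustration.

```python
import math

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; index 0 is a placeholder so that alpha_bar[0] = 1."""
    betas = [0.0] + [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alphas = [1.0 - b for b in betas]
    alpha_bars = [1.0]
    for a in alphas[1:]:
        alpha_bars.append(alpha_bars[-1] * a)
    return alphas, alpha_bars

def posterior_mean_from_x0(x_t, x0_pred, t, alphas, alpha_bars):
    """Prior-style step: mean of x_{t-1} from x_t and the predicted x0."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_xt = math.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    coef_x0 = math.sqrt(ab_prev) * (1.0 - alphas[t]) / (1.0 - ab_t)
    return [coef_xt * xt + coef_x0 * x0 for xt, x0 in zip(x_t, x0_pred)]

def posterior_mean_from_noise(x_t, z, t, alphas, alpha_bars):
    """Decoder-style step: mean of x_{t-1} from x_t and the predicted noise."""
    coef = (1.0 - alphas[t]) / math.sqrt(1.0 - alpha_bars[t])
    return [(xt - coef * n) / math.sqrt(alphas[t]) for xt, n in zip(x_t, z)]

alphas, alpha_bars = make_schedule(T=1000)
t = 500
x0 = [0.2, -0.4, 1.5]
noise = [0.7, -1.1, 0.3]
# Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
x_t = [math.sqrt(alpha_bars[t]) * a + math.sqrt(1.0 - alpha_bars[t]) * n
       for a, n in zip(x0, noise)]

mean_a = posterior_mean_from_x0(x_t, x0, t, alphas, alpha_bars)
mean_b = posterior_mean_from_noise(x_t, noise, t, alphas, alpha_bars)
max_gap = max(abs(a - b) for a, b in zip(mean_a, mean_b))  # differs only by rounding error
```

At t = 1 the coefficient on x_t vanishes, so the last step returns the predicted x0 itself.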
