Paper and code
A Google Research paper, published at ICLR 2022.
https://arxiv.org/abs/2202.00512
Official TensorFlow code: https://github.com/google-research/google-research/tree/master/diffusion_distillation
Unofficial PyTorch implementation: https://github.com/lucidrains/imagen-pytorch
Quick overview
The main problem addressed: slow sampling in diffusion models
- 1. Diffusion models achieve excellent generation quality, but sampling from them is slow.
- 2. The authors propose a progressive distillation procedure, as illustrated in Figure 1.
0. Abstract
0.1 Sentence-by-sentence translation
Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. Their remaining drawback is slow sampling: producing high-quality samples takes hundreds or thousands of model evaluations. The paper makes two contributions toward removing this drawback. First, it introduces new parameterizations of diffusion models that are more stable when only a few sampling steps are used. Second, it presents a method for distilling a trained deterministic diffusion sampler that uses many steps into a new diffusion model that needs only half as many sampling steps. Applying this distillation procedure repeatedly halves the number of required steps each time. On standard image generation benchmarks such as CIFAR-10, ImageNet, and LSUN, the authors start from state-of-the-art samplers using up to 8192 steps and distill them down to models using as few as 4 steps without losing much perceptual quality, reaching, for example, an FID of 3.0 on CIFAR-10 in 4 steps. Finally, the full progressive distillation procedure takes no more time than training the original model, making diffusion an efficient choice for generative modeling at both training and test time.
Summary
There are two main contributions:
- 1. New parameterizations of diffusion models that remain stable when only a few sampling steps are used.
- 2. A knowledge-distillation procedure that turns a sampler with many iterations into one with far fewer, by repeatedly halving the number of sampling steps.
1. INTRODUCTION
1.1 Sentence-by-sentence translation
Paragraph 1 (diffusion models achieve strong results across many tasks)
Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerging class of generative models that has recently delivered impressive results on many standard generative modeling benchmarks. These models have achieved ImageNet generation results outperforming BigGAN-deep and VQ-VAE-2 in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and they have achieved likelihoods outperforming autoregressive image models (Kingma et al., 2021; Song et al., 2021b). They have also succeeded in image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), and there have been promising results in shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).
Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerging family of generative models that has recently delivered impressive results on many standard generative modeling benchmarks. They have achieved ImageNet generation results that outperform BigGAN-deep and VQ-VAE-2 in terms of FID and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and likelihoods that beat autoregressive image models (Kingma et al., 2021; Song et al., 2021b). They have also succeeded at image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), with promising results in shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).
Paragraph 2 (the problem: diffusion model sampling is slow)
A major barrier remains to practical adoption of diffusion models: sampling speed. While sampling can be accomplished in relatively few steps in strongly conditioned settings, such as text-to-speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when guiding the sampler using an auxiliary classifier (Dhariwal & Nichol, 2021), the situation is substantially different in settings in which there is less conditioning information available. Examples of such settings are unconditional and standard class-conditional image generation, which currently require hundreds or thousands of steps using network evaluations that are not amenable to the caching optimizations of other types of generative models (Ramachandran et al., 2017).
However, a major barrier to the practical adoption of diffusion models remains: sampling speed. In strongly conditioned settings such as text-to-speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when the sampler is guided by an auxiliary classifier (Dhariwal & Nichol, 2021), sampling can be done in relatively few steps; the situation is very different when less conditioning information is available. Unconditional and standard class-conditional image generation are examples of such settings, and they currently require hundreds or thousands of network evaluations, which cannot benefit from the caching optimizations available to other types of generative models (Ramachandran et al., 2017).
Paragraph 3 (the authors present their idea)
In this paper, we reduce the sampling time of diffusion models by orders of magnitude in unconditional and class-conditional image generation, which represent the setting in which diffusion models have been slowest in previous work. We present a procedure to distill the behavior of a N-step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model with N/2 steps, with little degradation in sample quality. In what we call progressive distillation, we repeat this distillation procedure to produce models that generate in as few as 4 steps, still maintaining sample quality competitive with state-of-the-art models using thousands of steps.
The paper reduces the sampling time of diffusion models by orders of magnitude for unconditional and class-conditional image generation, the settings in which diffusion models have previously been slowest. The authors present a procedure that distills the behavior of an N-step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model that uses only N/2 steps, with little loss in sample quality. Repeating this procedure, which they call progressive distillation, yields models that generate in as few as 4 steps while remaining competitive with state-of-the-art models that use thousands of steps.
Figure caption
Figure 1: A visualization of two iterations of our proposed progressive distillation algorithm. A sampler f(z;η), mapping random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) taking only a single step. The original sampler is derived by approximately integrating the probability flow ODE for a learned diffusion model, and distillation can thus be understood as learning to integrate in fewer steps, or amortizing this integration into the new sampler.
Figure 1: A visualization of two iterations of the proposed progressive distillation algorithm. A sampler f(z;η), which maps random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) that takes only a single step. The original sampler is obtained by approximately integrating the probability flow ODE of a learned diffusion model, so distillation can be understood as learning to perform this integration in fewer steps, or as amortizing the integration into the new sampler.
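To make the figure concrete, below is a minimal sketch of such a deterministic DDIM sampler f(z;η) in Python/PyTorch. It assumes the variance-preserving formulation with α_t² + σ_t² = 1, a cosine schedule α_t = cos(πt/2), and a network that predicts the clean sample x directly; `x_pred_fn` is a hypothetical stand-in for that trained network, not code from the paper's repository.

```python
# Minimal sketch of a deterministic DDIM sampler f(z; eta), assuming the
# variance-preserving setting (alpha_t^2 + sigma_t^2 = 1), a cosine schedule
# alpha_t = cos(pi*t/2), and a network that predicts the clean sample x.
# `x_pred_fn` is a hypothetical stand-in for such a trained network.
import math
import torch

def alpha_sigma(t):
    """Cosine schedule: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2)."""
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def ddim_step(x_pred_fn, z_t, t, s):
    """Deterministic DDIM update from time t to an earlier time s (0 <= s < t <= 1)."""
    a_t, sig_t = alpha_sigma(t)
    a_s, sig_s = alpha_sigma(s)
    x_hat = x_pred_fn(z_t, t)                      # model's estimate of the clean sample
    eps_hat = (z_t - a_t * x_hat) / sig_t          # implied noise estimate
    return a_s * x_hat + sig_s * eps_hat           # move to the less-noisy latent z_s

def sample(x_pred_fn, shape, num_steps=4):
    """Map pure noise z_1 ~ N(0, I) to a sample x in `num_steps` deterministic steps."""
    z = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)   # uniform grid from t = 1 down to t = 0
    for i in range(num_steps):
        z = ddim_step(x_pred_fn, z, ts[i], ts[i + 1])
    return z
```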
1.2 Summary
- 1. Diffusion models achieve excellent generation quality, but sampling from them is slow.
- 2. The authors propose a progressive distillation procedure, as illustrated in Figure 1 above.
3 PROGRESSIVE DISTILLATION
Paragraph 1 (a brief overview of how distillation reduces the number of steps)
To make diffusion models more efficient at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Our implementation of progressive distillation stays very close to the implementation for training the original diffusion model, as described by e.g. Ho et al. (2020). Algorithm 1 and Algorithm 2 present diffusion model training and progressive distillation side-by-side, with the relative changes in progressive distillation highlighted in green.
To make diffusion models more efficient at sampling time, the authors propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Their implementation of progressive distillation stays very close to the implementation used to train the original diffusion model, as described by e.g. Ho et al. (2020). Algorithm 1 and Algorithm 2 in the paper show diffusion model training and progressive distillation side by side, with the changes introduced by progressive distillation highlighted in green.
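For comparison with the distillation loop further below, here is a rough sketch of one step of standard diffusion training (the Algorithm 1 side), under the same assumptions as the sampler sketch above; the function names and the unweighted loss are illustrative choices, not the paper's released implementation.

```python
# Rough sketch of one standard diffusion training step: cosine schedule, a network
# `model(z_t, t)` predicting x, image-shaped data, and a simple unweighted squared
# error. Names and loss weighting are illustrative, not the paper's released code.
import math
import torch

def diffusion_train_step(model, optimizer, x):
    """x: a batch of training images with shape (B, C, H, W)."""
    t = torch.rand(x.shape[0], device=x.device)              # t ~ U[0, 1], one per example
    a_t = torch.cos(0.5 * math.pi * t).view(-1, 1, 1, 1)
    sig_t = torch.sin(0.5 * math.pi * t).view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    z_t = a_t * x + sig_t * eps                              # add noise to the clean data
    x_hat = model(z_t, t)                                    # denoise
    loss = ((x_hat - x) ** 2).mean()                         # regress onto the original x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```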
Paragraph 2 (how the distillation target x̃ is constructed)
We start the progressive distillation procedure with a teacher diffusion model that is obtained by training in the standard way. At every iteration of progressive distillation, we then initialize the student model with a copy of the teacher, using both the same parameters and same model definition. Like in standard training, we then sample data from the training set and add noise to it, before forming the training loss by applying the student denoising model to this noisy data z_t. The main difference in progressive distillation is in how we set the target for the denoising model: instead of the original data x, we have the student model denoise towards a target x̃ that makes a single student DDIM step match 2 teacher DDIM steps. We calculate this target value by running 2 DDIM sampling steps using the teacher, starting from z_t and ending at z_{t−1/N}, with N being the number of student sampling steps. By inverting a single step of DDIM, we then calculate the value the student model would need to predict in order to move from z_t to z_{t−1/N} in a single step, as we show in detail in Appendix G. The resulting target value x̃(z_t) is fully determined given the teacher model and starting point z_t, which allows the student model to make a sharp prediction when evaluated at z_t. In contrast, the original data point x is not fully determined given z_t, since multiple different data points x can produce the same noisy data z_t: this means that the original denoising model is predicting a weighted average of possible x values, which produces a blurry prediction. By making sharper predictions, the student model can make faster progress during sampling.
The progressive distillation procedure starts from a teacher diffusion model obtained by standard training. At each iteration, the student model is initialized as a copy of the teacher, with the same parameters and the same model definition. As in standard training, data is sampled from the training set and noise is added to it, and the training loss is formed by applying the student denoising model to this noisy data z_t. The main difference lies in how the target for the denoising model is set: instead of the original data x, the student is trained to denoise towards a target x̃ chosen so that a single student DDIM step matches 2 teacher DDIM steps. This target is computed by running 2 DDIM sampling steps with the teacher, starting from z_t and ending at z_{t−1/N}, where N is the number of student sampling steps. By inverting a single DDIM step, one can then solve for the value the student model would need to predict in order to move from z_t to z_{t−1/N} in one step (details in Appendix G). The resulting target x̃(z_t) is fully determined given the teacher model and the starting point z_t, which allows the student to make a sharp prediction at z_t. In contrast, the original data point x is not fully determined given z_t, since many different data points x can produce the same noisy data z_t: the original denoising model therefore predicts a weighted average of possible x values, which gives a blurry prediction. By making sharper predictions, the student can make faster progress during sampling.
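A sketch of how the target x̃ described above could be computed is given below: two teacher DDIM steps from z_t down to z_{t−1/N}, followed by inverting a single student DDIM step to solve for the value the student must predict. It keeps the assumed cosine schedule and x-prediction teacher from the earlier sketches and treats t as a scalar tensor for simplicity; the exact expressions used by the authors are in Algorithm 2 and Appendix G of the paper.

```python
# Sketch of the distillation target x_tilde: run two teacher DDIM steps from z_t
# down to z_{t-1/N}, then invert a single student DDIM step to solve for the x the
# student would have to predict to get there in one step. Assumes the same cosine
# schedule and x-prediction teacher as in the earlier sketches.
import math
import torch

def alpha_sigma(t):
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

@torch.no_grad()
def distillation_target(teacher, z_t, t, N):
    """Target for the student at time t (a scalar tensor), with N student sampling steps."""
    t_mid, t_next = t - 0.5 / N, t - 1.0 / N
    a_t, s_t = alpha_sigma(t)
    a_mid, s_mid = alpha_sigma(t_mid)
    a_next, s_next = alpha_sigma(t_next)

    # Two teacher DDIM steps: z_t -> z_mid -> z_next.
    x1 = teacher(z_t, t)
    z_mid = a_mid * x1 + (s_mid / s_t) * (z_t - a_t * x1)
    x2 = teacher(z_mid, t_mid)
    z_next = a_next * x2 + (s_next / s_mid) * (z_mid - a_mid * x2)

    # Invert one student DDIM step z_t -> z_next:
    #   z_next = a_next * x + (s_next / s_t) * (z_t - a_t * x)
    # and solve for x, which becomes the regression target x_tilde.
    ratio = s_next / s_t
    x_tilde = (z_next - ratio * z_t) / (a_next - ratio * a_t)
    return x_tilde
```

The student is then trained to regress onto this target, exactly as in ordinary denoising training but with x̃(z_t) replacing the original data x.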
Paragraph 3 (the procedure is applied recursively: the student becomes the new teacher)
After running distillation to learn a student model taking N sampling steps, we can repeat the procedure with N/2 steps: The student model then becomes the new teacher, and a new student model is initialized by making a copy of this model.
After distillation has produced a student model that samples in N steps, the procedure can be repeated with N/2 steps: the student becomes the new teacher, and a new student is initialized as a copy of this model.
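This outer loop is simple enough to sketch directly; `train_student_round` below is a hypothetical helper standing in for the inner distillation training at a given number of student steps.

```python
# Compact sketch of the outer progressive-distillation loop described above.
# `train_student_round` is a hypothetical helper that runs the inner training loop
# (matching one student DDIM step to two teacher DDIM steps at N student steps).
import copy

def progressive_distillation(teacher, train_student_round, start_steps=8192, end_steps=4):
    N = start_steps
    while N > end_steps:
        N //= 2                                    # the student uses half the sampling steps
        student = copy.deepcopy(teacher)           # same parameters, same model definition
        train_student_round(student, teacher, N)   # distill: 1 student step ~ 2 teacher steps
        teacher = student                          # the student becomes the new teacher
    return teacher, N
```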
Paragraph 4 (I did not really understand why α1 is set to 0 here; need to look at the code)
Unlike our procedure for training the original model, we always run progressive distillation in discrete time: we sample this discrete time such that the highest time index corresponds to a signal-to-noise ratio of zero, i.e. α1 = 0, which exactly matches the distribution of input noise z_1 ~ N(0, I) that is used at test time. We found this to work slightly better than starting from a non-zero signal-to-noise ratio as used by e.g. Ho et al. (2020), both for training the original model as well as when performing progressive distillation.
Unlike the procedure used to train the original model, progressive distillation is always run in discrete time: the discrete times are sampled so that the highest time index corresponds to a signal-to-noise ratio of zero, i.e. α1 = 0, which exactly matches the distribution of the input noise z_1 ~ N(0, I) used at test time. The authors found this to work slightly better than starting from a non-zero signal-to-noise ratio, as used by e.g. Ho et al. (2020), both for training the original model and when performing progressive distillation.
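To see what α1 = 0 means in practice, the small sketch below builds the discrete time grid t = i/N together with a cosine schedule (the schedule itself is an assumption of this sketch): at the highest time index t = 1 the signal-to-noise ratio is numerically zero, so the model input there is distributed like pure noise, matching what the sampler receives at test time.

```python
# Small sketch of the discrete time grid used during distillation: t = i/N for
# i = 1..N, combined with an (assumed) cosine schedule alpha_t = cos(pi*t/2). At the
# highest time index t = 1, alpha_1 is (numerically) zero, so the signal-to-noise
# ratio is zero and the model input matches the pure noise z_1 ~ N(0, I) used at test time.
import math
import torch

def discrete_schedule(N: int):
    t = torch.arange(1, N + 1, dtype=torch.float32) / N   # t = 1/N, 2/N, ..., 1
    alpha = torch.cos(0.5 * math.pi * t)
    sigma = torch.sin(0.5 * math.pi * t)
    snr = (alpha / sigma) ** 2                             # signal-to-noise ratio per step
    return t, alpha, sigma, snr

t, alpha, sigma, snr = discrete_schedule(4)
print(alpha[-1].item(), snr[-1].item())                    # both ~0 at the highest time index
```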