References:
Corresponding GitHub and Hugging Face repos:
LDM [github]
StableDiffusion v1.1 ~ v1.4 [github] [huggingface]
StableDiffusion v1.5 [github] [huggingface]
StableDiffusion v2 / v2.1 [github] [huggingface]
First, a word about purpose: this article aims to give you a clear picture of how the StableDiffusion model evolved. Since essentially all open-source AIGC models today are built on SD, understanding its history matters; it is the foundation for any derivative work. Blindly fine-tuning without knowing the base may still produce results, but you risk twice the effort for half the outcome.
1. LDM
3. StableDiffusion v1.5
As mentioned above, the Compvis team collaborated not only with Stability-AI but also with Runway, and SD 1.5, the model that took the community by storm, was published on Hugging Face by the RunwayML team. Note that this release no longer came from Compvis (presumably a matter of commercial interests). Here is how it was trained:
The Stable-Diffusion-v1-5 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and subsequently fine-tuned on 595k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
Nothing fancy here, it seems... just relentless training on a LAION subset with higher aesthetic scores, for more steps than v1-4. One detail in the recipe worth pausing on is the "10% dropping of the text-conditioning": randomly dropping the prompt during training is exactly what makes classifier-free guidance work at sampling time, as sketched below.
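The following is a minimal sketch of classifier-free guidance at inference, not SD's actual sampling code; the `unet` call signature and the `guidance_scale` value are illustrative. Because the prompt was dropped for 10% of training steps, the same UNet can be queried both with and without the text condition, and the two predictions are blended:

```python
import torch

def cfg_noise_pred(unet, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Run the denoiser twice and extrapolate away from the unconditional
    prediction, toward the prompt-conditioned one."""
    eps_cond = unet(x_t, t, cond_emb)      # prompt embedding
    eps_uncond = unet(x_t, t, uncond_emb)  # "empty prompt" learned via dropout
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```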
Beyond the recipe, there are two things to note about the checkpoints Runway published on Hugging Face: v1-5-pruned-emaonly.ckpt and v1-5-pruned.ckpt. One is "pruned". What does it mean to prune a model? There is a good explanation on Reddit: https://www.reddit.com/r/StableDiffusion/comments/xymibu/what_does_it_mean_to_prune_a_model/
A neural network is just a bunch of math operations. The "neurons" are connected by various "weights," which is to say, the output of a neuron is multiplied by a weight (just a number) and gets added into another neuron, along with lots of other connections to that other neuron.
When the neural network learns, these weights get modified. Often, many of them become zero (or real close to it). And since anything times zero is zero, we can skip this part of the math when using the network to predict something. Also, when a set of data has a lot of zeros, it can be compressed to be much smaller.
Pruning finds the nearly zero connections, makes them exactly zero, and then lets you save a smaller, compressed network. Moreover, when you use the network to predict/create something, an optimized neural network solution (i.e. the code that does all of the math specified by the network) can do so faster by intelligently skipping the unneeded calculations involving zero.
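To make the idea concrete, here is a toy sketch of magnitude-based pruning over a checkpoint's state dict; the threshold is hypothetical, and this is not Runway's actual pruning procedure:

```python
import torch

def prune_state_dict(state_dict, threshold=1e-4):
    """Zero out weights whose magnitude falls below the threshold.
    Shapes are unchanged; the gain is that mostly-zero tensors compress
    well on disk and zero multiplications can be skipped at inference."""
    pruned = {}
    for name, w in state_dict.items():
        if torch.is_floating_point(w):
            w = torch.where(w.abs() < threshold, torch.zeros_like(w), w)
        pruned[name] = w
    return pruned
```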
So now we know: pruning a model means cutting away the parts it does not need. With "pruned" covered, what does "ema" mean? This one I can explain myself: EMA stands for Exponential Moving Average, a technique for smoothing out the noise of stochastic training. Concretely, during training there is a main model, e.g. the UNet, and alongside it a saved copy, the EMA_Unet, which can be viewed as a weight-averaged version of the UNet; keeping it makes training more stable. The EMA weights are generally considered less noisy, so for inference it is enough to load the EMA version; but if you want to keep fine-tuning, load both the EMA_Unet and the real Unet and continue training with the EMA strategy. A minimal sketch of the EMA update follows.
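This is a minimal sketch of the EMA bookkeeping, assuming a PyTorch training loop; the function name and the decay value are illustrative, not SD's exact settings:

```python
import copy
import torch

def ema_update(ema_model, model, decay=0.9999):
    """Blend the live weights into the EMA copy:
    ema_w <- decay * ema_w + (1 - decay) * w."""
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical use: ema_unet = copy.deepcopy(unet) at the start of training,
# then ema_update(ema_unet, unet) once after every optimizer step.
```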
The Hugging Face model card spells out the difference between the two checkpoints: v1-5-pruned-emaonly.ckpt keeps only the EMA weights (smaller, sufficient for inference), while v1-5-pruned.ckpt keeps both the EMA and the non-EMA weights (what you want for further fine-tuning). In other words, v1-5-pruned.ckpt strictly contains more information than v1-5-pruned-emaonly.ckpt; just download whichever suits your needs.
4. StableDiffusion v2 / v2.1
All the releases above came from Compvis and Runway; at this point the big player, Stability-AI, could no longer sit still. My guess is that it wants to make money, so one important move in the StableDiffusion v2 release was removing NSFW material from the training data, which is understandable: shipping a product means managing risk. SD v2 likewise has a one-sentence definition: Stable Diffusion v2 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The SD 2-v model produces 768x768 px outputs. There are three changes. First, the text encoder changed, and what does that imply? It means a clean break with StableDiffusion v1: the model must be retrained from scratch. Second, the resolution went up, which hardly seems like a technical barrier, since convolution is naturally compatible with square images of different resolutions (a toy check of this follows). The third change is the subject of the next paragraph.
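Here is a quick demonstration of that resolution claim; the channel count of 4 matches SD's latent space at downsampling factor 8, but the layer itself is illustrative:

```python
import torch
import torch.nn as nn

# One conv layer, one set of weights, three input resolutions:
conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)
for size in (32, 64, 96):  # latent sizes for 256/512/768 px images at f=8
    x = torch.randn(1, 4, size, size)
    print(conv(x).shape)   # torch.Size([1, 4, size, size]); no shape mismatch
```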
Then comes the third change: StableDiffusion v2 introduces a concept called v-prediction, which is why four models exist: v2, v2-base, v2.1, and v2.1-base. v2 and v2.1 are SD's flagship offerings (as I understand it), while v2-base and v2.1-base are the original noise-prediction models. The lineage goes like this: v2-base is trained from scratch with NSFW content filtered out; v2 is fine-tuned from v2-base; v2.1-base is fine-tuned from v2-base; and v2.1 is fine-tuned from v2.1-base. I will only single out the v2-base training description here; the training details of the other models can be looked up on Hugging Face. A sketch of the v-prediction target is given below.
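Here is a minimal sketch of the v-prediction training target, following the parameterization v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0 from Salimans & Ho's progressive distillation paper; the function and argument names are mine:

```python
import torch

def v_target(x0, noise, alphas_cumprod, t):
    """v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0; an epsilon-prediction
    model would instead regress directly to `noise`."""
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return abar.sqrt() * noise - (1.0 - abar).sqrt() * x0
```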
That concludes this overview of how the StableDiffusion models evolved.