END-TO-END OPTIMIZED IMAGE COMPRESSION
Vocabulary
- image compression
- quantizer
- rate–distortion performance
- a variant of
- construct
- entropy
- discrete value
Abstract:
We describe an image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. The transforms are constructed in three successive stages of convolutional linear filters and nonlinear activation functions. Unlike most convolutional neural networks, the joint nonlinearity is chosen to implement a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the entire model for rate–distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. Under certain conditions, the relaxed loss function may be interpreted as the log likelihood of a generative model, as implemented by a variational autoencoder. Unlike these models, however, the compression model must operate at any given point along the rate–distortion curve, as specified by a trade-off parameter. Across an independent set of test images, we find that the optimized method generally exhibits better rate–distortion performance than the standard JPEG and JPEG 2000 compression methods. More importantly, we observe a dramatic improvement in visual quality for all images at all bit rates, which is supported by objective quality estimates using MS-SSIM.
1. INTRODUCTION
Data compression is a fundamental and well-studied problem in engineering, and is commonly formulated with the goal of designing codes for a given discrete data ensemble with minimal entropy (Shannon, 1948). The solution relies heavily on knowledge of the probabilistic structure of the data, and thus the problem is closely related to probabilistic source modeling. However, since all practical codes must have finite entropy, continuous-valued data (such as vectors of image pixel intensities) must be quantized to a finite set of discrete values, which introduces error. In this context, known as the lossy compression problem, one must trade off two competing costs: the entropy of the discretized representation (rate) and the error arising from the quantization (distortion). Different compression applications, such as data storage or transmission over limited-capacity channels, demand different rate–distortion trade-offs.
Joint optimization of rate and distortion is difficult. Without further constraints, the general problem of optimal quantization in high-dimensional spaces is intractable (Gersho and Gray, 1992). For this reason, most existing image compression methods operate by linearly transforming the data vector into a suitable continuous-valued representation, quantizing its elements independently, and then encoding the resulting discrete representation using a lossless entropy code (Wintz, 1972; Netravali and Limb, 1980).
This scheme is called transform coding due to the central role of the transformation. For example, JPEG uses a discrete cosine transform on blocks of pixels, and JPEG 2000 uses a multi-scale orthogonal wavelet decomposition. Typically, the three components of transform coding methods – transform, quantizer, and entropy code – are separately optimized (often through manual parameter adjustment).
We have developed a framework for end-to-end optimization of an image compression model based on nonlinear transforms (figure 1). Previously, we demonstrated that a model consisting of linear–nonlinear block transformations, optimized for a measure of perceptual distortion, exhibited visually superior performance compared to a model optimized for mean squared error (MSE) (Ballé, Laparra, and Simoncelli, 2016). Here, we optimize for MSE, but use more flexible transforms built from cascades of linear convolutions and nonlinearities. Specifically, we use a generalized divisive normalization (GDN) joint nonlinearity that is inspired by models of neurons in biological visual systems, and has proven effective in Gaussianizing image densities (Ballé, Laparra, and Simoncelli, 2015). This cascaded transformation is followed by uniform scalar quantization (i.e., each element is rounded to the nearest integer), which effectively implements a parametric form of vector quantization on the original image space. The compressed image is reconstructed from these quantized values using an approximate parametric nonlinear inverse transform.
For any desired point along the rate–distortion curve, the parameters of both analysis and synthesis transforms are jointly optimized using stochastic gradient descent. To achieve this in the presence of quantization (which produces zero gradients almost everywhere), we use a proxy loss function based on a continuous relaxation of the probability model, replacing the quantization step with additive uniform noise. The relaxed rate–distortion optimization problem bears some resemblance to those used to fit generative image models, and in particular variational autoencoders (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014), but differs in the constraints we impose to ensure that it approximates the discrete problem all along the rate–distortion curve. Finally, rather than reporting differential or discrete entropy estimates, we implement an entropy code and report performance using actual bit rates, thus demonstrating the feasibility of our solution as a complete lossy compression method.
2. CHOICE OF FORWARD, INVERSE, AND PERCEPTUAL TRANSFORMS
Most compression methods are based on orthogonal linear transforms, chosen to reduce correlations in the data, and thus to simplify entropy coding. But the joint statistics of linear filter responses exhibit strong higher-order dependencies. These may be significantly reduced through the use of joint local nonlinear gain control operations (Schwartz and Simoncelli, 2001; Lyu, 2010; Sinz and Bethge, 2013), inspired by models of visual neurons (Heeger, 1992; Carandini and Heeger, 2012). Cascaded versions of such models have been used to capture multiple stages of visual transformation (Simoncelli and Heeger, 1998; Mante, Bonin, and Carandini, 2008). Some earlier results suggest that incorporating local normalization in linear block transform coding methods can improve coding performance (Malo et al., 2006), and can improve object recognition performance of cascaded convolutional neural networks (Jarrett et al., 2009). However, the normalization parameters in these cases were not optimized for the task. Here, we make use of a generalized divisive normalization (GDN) transform with optimized parameters, which we have previously shown to be highly efficient in Gaussianizing the local joint statistics of natural images, much more so than cascades of linear transforms followed by pointwise nonlinearities (Ballé, Laparra, and Simoncelli, 2015).
Note that some training algorithms for deep convolutional networks incorporate “batch normalization”, rescaling the responses of linear filters in the network so as to keep it in a reasonable operating range (Ioffe and Szegedy, 2015). This type of normalization is different from local gain control in that the rescaling factor is identical across all spatial locations. Moreover, once the training is completed, the scaling parameters are typically fixed, which turns the normalization into an affine transformation with respect to the data – unlike GDN, which is spatially adaptive and can be highly nonlinear.
Specifically, our analysis transform $g_a$ consists of three stages of convolution, subsampling, and divisive normalization. We represent the $i$th input channel of the $k$th stage at spatial location $(m, n)$ as $u_i^{(k)}(m, n)$. The input image vector $x$ corresponds to $u_i^{(0)}(m, n)$, and the output vector $y$ is $u_i^{(3)}(m, n)$. Each stage then begins with an affine convolution:
$$v_i^{(k)}(m,n) \;=\; \sum_j \bigl(h_{k,ij} * u_j^{(k)}\bigr)(m,n) \;+\; c_{k,i} \tag{1}$$
where $*$ denotes 2D convolution. This is followed by downsampling:
$$w_i^{(k)}(m,n) \;=\; v_i^{(k)}(s_k m,\, s_k n) \tag{2}$$
where $s_k$ is the downsampling factor for stage $k$. Each stage concludes with a GDN operation:
$$u_i^{(k+1)}(m,n) \;=\; \frac{w_i^{(k)}(m,n)}{\Bigl(\beta_{k,i} + \sum_j \gamma_{k,ij}\,\bigl(w_j^{(k)}(m,n)\bigr)^2\Bigr)^{1/2}} \tag{3}$$
The full set of $h$, $c$, $\beta$, and $\gamma$ parameters (across all three stages) constitutes the parameter vector $\phi$ to be optimized.
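As a concreteness check, here is a minimal NumPy sketch of the GDN operation in (3) for a single stage; the shapes and parameter names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gdn(w, beta, gamma):
    # w: stage output of shape (H, W, C); beta: (C,); gamma: (C, C).
    # Each channel is divided by the square root of a weighted sum of
    # the squared responses of all channels at the same location.
    energy = beta + np.einsum('ij,hwj->hwi', gamma, w ** 2)
    return w / np.sqrt(energy)
```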
Analogously, the synthesis transform $g_s$ consists of three stages, with the order of operations reversed within each stage, downsampling replaced by upsampling, and GDN replaced by an approximate inverse we call IGDN (more details in the appendix). We define $\hat u_i^{(k)}(m, n)$ as the input to the $k$th synthesis stage, such that $\hat y$ corresponds to $\hat u_i^{(0)}(m, n)$, and $\hat x$ to $\hat u_i^{(3)}(m, n)$. Each stage then consists of the IGDN operation:
$$\hat w_i^{(k)}(m,n) \;=\; \hat u_i^{(k)}(m,n)\cdot\Bigl(\hat\beta_{k,i} + \sum_j \hat\gamma_{k,ij}\,\bigl(\hat u_j^{(k)}(m,n)\bigr)^2\Bigr)^{1/2} \tag{4}$$
which is followed by upsampling
$$\hat v_i^{(k)}(m,n) \;=\; \hat w_i^{(k)}(m/\hat s_k,\, n/\hat s_k) \tag{5}$$
where $\hat s_k$ is the upsampling factor for stage $k$. Finally, this is followed by an affine convolution:
$$\hat u_i^{(k+1)}(m,n) \;=\; \sum_j \bigl(\hat h_{k,ij} * \hat v_j^{(k)}\bigr)(m,n) \;+\; \hat c_{k,i} \tag{6}$$
Analogously, the set of $\hat h$, $\hat c$, $\hat\beta$, and $\hat\gamma$ parameters makes up the parameter vector $\theta$. Note that the down-/upsampling operations can be implemented jointly with their adjacent convolution, improving computational efficiency.
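In a framework such as TensorFlow, this fusion is exactly what strided and transposed convolutions provide; a small sketch (layer sizes here are illustrative):

```python
import tensorflow as tf

# Convolution followed by factor-2 downsampling, fused into one strided
# convolution; upsampling followed by convolution, fused into one
# transposed convolution.
analysis_stage = tf.keras.layers.Conv2D(128, 5, strides=2, padding="same")
synthesis_stage = tf.keras.layers.Conv2DTranspose(128, 5, strides=2, padding="same")
```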
Figure 2: Left: The rate–distortion trade-off. The gray region represents the set of all rate–distortion values that can be achieved (over all possible parameter settings). Optimal performance for a given choice of $\lambda$ corresponds to a point on the convex hull of this set with slope $-1/\lambda$. Right: One-dimensional illustration of the relationship between the densities of $y_i$ (elements of code space), $\hat y_i$ (quantized elements), and $\tilde y_i$ (elements perturbed by uniform noise). Each discrete probability in $P_{\hat y_i}$ equals the probability mass of the density $p_{y_i}$ within the corresponding quantization bin (indicated by shading). The density $p_{\tilde y_i}$ provides a continuous function that interpolates the discrete probability values $P_{\hat y_i}$ at integer positions.
In previous work, we used a perceptual transform $g_p$, separately optimized to mimic human judgements of grayscale image distortions (Laparra et al., 2016), and showed that a set of one-stage transforms optimized for this distortion measure led to visually improved results (Ballé, Laparra, and Simoncelli, 2016). Here, we set the perceptual transform $g_p$ to the identity, and use mean squared error (MSE) as the metric (i.e., $d(z, \hat z) = \|z - \hat z\|_2^2$). This allows a more interpretable comparison to existing methods, which are generally optimized for MSE, and also allows optimization for color images, for which we do not currently have a reliable perceptual metric.
3. OPTIMIZATION OF NONLINEAR TRANSFORM CODING MODEL
Our objective is to minimize a weighted sum of the rate and distortion, $R + \lambda D$, over the parameters of the analysis and synthesis transforms and the entropy code, where $\lambda$ governs the trade-off between the two terms (figure 2, left panel). Rather than attempting optimal quantization directly in the image space, which is intractable due to the high dimensionality, we instead assume a fixed uniform scalar quantizer in the code space, and aim to have the nonlinear transformations warp the space in an appropriate way, effectively implementing a parametric form of vector quantization (figure 1). The actual rates achieved by a properly designed entropy code are only slightly larger than the entropy (Rissanen and Langdon, 1981), and thus we define the objective functional directly in terms of entropy:
$$L[g_a, g_s, P_q] \;=\; -\,\mathbb{E}\bigl[\log_2 P_q\bigr] \;+\; \lambda\,\mathbb{E}\bigl[d(z, \hat z)\bigr] \tag{7}$$
where both expectations will be approximated by averages over a training set of images. Given a powerful enough set of transformations, we can assume without loss of generality that the quantization bin size is always one and the representing values are at the centers of the bins. That is,
$$\hat y_i \;=\; q_i \;=\; \operatorname{round}(y_i) \tag{8}$$
where index $i$ runs over all elements of the vectors, including channels and spatial locations. The marginal density of $\hat y_i$ is then given by a train of discrete probability masses (Dirac delta functions, figure 2, right panel) with weights equal to the probability mass function of $q_i$:
$$p_{\hat y_i}(t) \;=\; \sum_{n} P_{q_i}(n)\,\delta(t - n) \tag{9}$$
Note that both terms in (7) depend on the quantized values, and the derivatives of the quantization function (8) are zero almost everywhere, rendering gradient descent ineffective. To allow optimization via stochastic gradient descent, we replace the quantizer with an additive i.i.d. uniform noise source $\Delta y$, which has the same width as the quantization bins (one). This relaxed formulation has two desirable properties. First, the density function of $\tilde y = y + \Delta y$ is a continuous relaxation of the probability mass function of $q$ (figure 2, right panel):
$$p_{\tilde y}(n) \;=\; P_q(n) \quad \text{for all } n \in \mathbb{Z}^N \tag{10}$$
which implies that the differential entropy of $\tilde y$ can be used as an approximation of the entropy of $q$. Second, independent uniform noise approximates quantization error in terms of its marginal moments, and is frequently used as a model of quantization error (Gray and Neuhoff, 1998). We can thus use the same approximation for our measure of distortion. We examine the empirical quality of these rate and distortion approximations in section 4.
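A minimal sketch of the two regimes, hard rounding as used at test time versus the additive-noise proxy used for training (TensorFlow is used purely for illustration):

```python
import tensorflow as tf

y = tf.constant([0.2, 1.7, -2.4])

# Eq. (8): hard quantization; its gradient is zero almost everywhere.
y_hat = tf.round(y)

# Training-time relaxation: i.i.d. uniform noise with the width of one
# quantization bin, which is continuous in y and therefore differentiable.
y_tilde = y + tf.random.uniform(tf.shape(y), -0.5, 0.5)
```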
We assume independent marginals in the code space for both the relaxed probability model of $\tilde y$ and the entropy code, and model the marginals $p_{\tilde y_i}$ non-parametrically to reduce model error. Specifically, we use finely sampled piecewise linear functions which we update similarly to one-dimensional histograms (see appendix). Since $p_{\tilde y_i} = p_{y_i} * \mathcal U(0, 1)$ is effectively smoothed by a box-car filter – the uniform density on the unit interval, $\mathcal U(0, 1)$ – the model error can be made arbitrarily small by decreasing the sampling interval.
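As an illustration of such a non-parametric marginal model, here is a small sketch (class and parameter names are hypothetical) of a finely sampled piecewise linear density updated like a one-dimensional histogram:

```python
import numpy as np

class PiecewiseLinearDensity:
    """Finely sampled piecewise linear density, updated like a 1-D histogram."""

    def __init__(self, lo=-30.0, hi=30.0, step=0.1):
        self.step = step
        self.grid = np.arange(lo, hi + step, step)  # fine sampling grid
        self.counts = np.ones_like(self.grid)       # start from a flat density

    def update(self, samples):
        # Distribute each sample's count linearly between its two nearest
        # grid points (a soft histogram update).
        pos = np.clip((samples - self.grid[0]) / self.step, 0, len(self.grid) - 2)
        idx = pos.astype(int)
        frac = pos - idx
        np.add.at(self.counts, idx, 1.0 - frac)
        np.add.at(self.counts, idx + 1, frac)

    def pdf(self, t):
        # Linearly interpolate between grid points; normalize to unit area.
        density = self.counts / (self.counts.sum() * self.step)
        return np.interp(t, self.grid, density)
```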
Given this continuous approximation of the quantized coefficient distribution, the loss function for parameters $\theta$ and $\phi$ can be written as:
$$L(\theta, \phi) \;=\; \mathbb{E}_{x, \Delta y}\Bigl[\, -\sum_i \log_2 p_{\tilde y_i}\bigl(g_a(x;\phi) + \Delta y;\, \psi^{(i)}\bigr) \;+\; \lambda\, d\bigl(g_p\bigl(g_s(g_a(x;\phi) + \Delta y;\,\theta)\bigr),\, g_p(x)\bigr) \Bigr] \tag{11}$$
where vector $\psi^{(i)}$ parameterizes the piecewise linear approximation of $p_{\tilde y_i}$ (trained jointly with $\theta$ and $\phi$). This is continuous and differentiable, and thus well-suited for stochastic optimization.
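Concretely, one evaluation of this objective looks like the following sketch, written against a tfc-style entropy model of the kind used in the tutorial later in this post (`lmbda` and the model objects are assumptions here, not the authors' code):

```python
import tensorflow as tf

def relaxed_loss(x, analysis, synthesis, entropy_model, lmbda):
    # Rate term: bits under the noise-relaxed density model; distortion
    # term: MSE between the input and its reconstruction.
    y = analysis(x)
    y_tilde, bits = entropy_model(y, training=True)  # adds uniform noise
    x_tilde = synthesis(y_tilde)
    rate = tf.reduce_mean(bits)
    distortion = tf.reduce_mean(tf.square(x - x_tilde))
    return rate + lmbda * distortion
```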
Figure 3: Representation of the relaxed rate–distortion optimization problem as the encoder and decoder graphs of a variational autoencoder. Nodes represent random variables, and gray shading indicates observed data; small filled nodes represent parameters; arrows indicate dependency; and nodes within boxes are per-image.
3.1 RELATIONSHIP TO VARIATIONAL GENERATIVE IMAGE MODELS
We derived our formulation directly from the classical rate–distortion optimization problem. However, once the transition to a continuous loss function is made, the optimization problem resembles those encountered in fitting generative models of images, and can more specifically be cast in the context of variational autoencoders (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014). In Bayesian variational inference, we are given an ensemble of observations of a random variable $x$ along with a generative model $p_{x|y}(x|y)$. We seek to find a posterior $p_{y|x}(y|x)$, which generally cannot be expressed in closed form. The approach followed by Kingma and Welling (2014) consists of approximating this posterior with a density $q(y|x)$, by minimizing the Kullback–Leibler divergence between the two:
$$D_{\mathrm{KL}}\bigl[q \,\big\|\, p_{\tilde y|x}\bigr] \;=\; \mathbb{E}_{\tilde y \sim q}\bigl[\log q(\tilde y \mid x) - \log p_{x|\tilde y}(x \mid \tilde y) - \log p_{\tilde y}(\tilde y)\bigr] + \mathrm{const} \tag{12}$$
This objective function is equivalent to our relaxed rate–distortion optimization problem, with distortion measured as MSE, if we define the generative model as follows (figure 3):
$$p_{x|\tilde y}(x \mid \tilde y;\, \lambda, \theta) \;=\; \mathcal N\bigl(x;\, g_s(\tilde y;\theta),\, (2\lambda)^{-1}\mathbf 1\bigr) \tag{13}$$
$$p_{\tilde y}(\tilde y;\, \psi^{(0)}, \psi^{(1)}, \ldots) \;=\; \prod_i p_{\tilde y_i}\bigl(\tilde y_i;\, \psi^{(i)}\bigr) \tag{14}$$
and the approximate posterior as follows:
$$q(\tilde y \mid x;\, \phi) \;=\; \prod_i \mathcal U\bigl(\tilde y_i;\, y_i, 1\bigr), \quad \text{with } y = g_a(x;\phi) \tag{15}$$
where $\mathcal U(\tilde y_i;\, y_i, 1)$ is the uniform density on the unit interval centered on $y_i$. With this, the first term in the Kullback–Leibler divergence is constant; the second term corresponds to the distortion, and the third term corresponds to the rate (both up to additive constants). Note that if a perceptual transform $g_p$ is used, or the metric $d$ is not Euclidean, $p_{x|\tilde y}$ is no longer Gaussian, and equivalence to variational autoencoders cannot be guaranteed, since the distortion term may not correspond to a normalizable density. For any affine and invertible perceptual transform and any translation-invariant metric, it can be shown to correspond to the density
$$p_{x|\tilde y}(x \mid \tilde y) \;=\; \frac{1}{Z(\lambda)}\,\exp\Bigl(-\lambda\, d\bigl(g_p(g_s(\tilde y)),\, g_p(x)\bigr)\Bigr) \tag{16}$$
where $Z(\lambda)$ normalizes the density (but need not be computed to fit the model).
Despite the similarity between our nonlinear transform coding framework and that of variational autoencoders, it is worth noting several fundamental differences. First, variational autoencoders are continuous-valued, and digital compression operates in the discrete domain. Comparing differential entropy with (discrete) entropy, or entropy with an actual bit rate, can potentially lead to misleading results. In this paper, we use the continuous domain strictly for optimization, and perform the evaluation on actual bit rates, which allows comparison to existing image coding methods. We assess the quality of the rate and distortion approximations empirically.
Second, generative models aim to minimize differential entropy of the data ensemble under the model, i.e., explaining fluctuations in the data. This often means minimizing the variance of a “slack” term like (13), which in turn maximizes $\lambda$. Transform coding methods, on the other hand, are optimized to achieve the best trade-off between having the model explain the data (which increases rate and decreases distortion), and having the slack term explain the data (which decreases rate and increases distortion). The overall performance of a compression model is determined by the shape of the convex hull of attainable model distortions and rates, over all possible values of the model parameters. Finding this convex hull is equivalent to optimizing the model for particular values of $\lambda$ (see figure 2). In contrast, generative models operate in a regime where $\lambda$ is inferred and ideally approaches infinity for noiseless data, which corresponds to the regime of lossless compression. Even so, lossless compression methods still need to operate in a discretized space, typically directly on quantized luminance values. For generative models, the discretization of luminance values is usually considered a nuisance (Theis, van den Oord, and Bethge, 2015), although there are examples of generative models that operate on quantized pixel values (van den Oord, Kalchbrenner, and Kavukcuoglu, 2016).
Finally, although correspondence between the typical slack term (13) of a generative model (figure 3, left panel) and the distortion metric in rate–distortion optimization holds for simple metrics (e.g., Euclidean distance), a more general perceptual measure would be considered a peculiar choice from a generative modeling perspective, if it corresponds to a density at all.
4. EXPERIMENTAL RESULTS
We jointly optimized the full set of parameters $\phi$, $\theta$, and all $\psi$ over a subset of the ImageNet database (Deng et al., 2009) consisting of 6507 images, using stochastic gradient descent. This optimization was performed separately for each $\lambda$, yielding separate transforms and marginal probability models for each value.
For the grayscale analysis transform, we used 128 filters (size 9 × 9) in the first stage, each subsampled by a factor of 4 vertically and horizontally. The remaining two stages retain the number of channels, but use filters operating across all input channels (5 × 5 × 128), with outputs subsampled by a factor of 2 in each dimension. The net output thus has half the dimensionality of the input. The synthesis transform is structured analogously. For RGB images, we trained a separate set of models, with the first stage augmented to operate across three (color) input channels. For the two largest values of $\lambda$, and for RGB models, we increased the network capacity by increasing the number of channels in each stage to 256 and 192, respectively. Further details about the parameterization of the transforms and their training can be found in the appendix.
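For reference, the three-stage grayscale analysis transform described above could be sketched with the GDN layer from tensorflow_compression roughly as follows; this is a sketch under the stated hyperparameters, not the authors' released implementation:

```python
import tensorflow as tf
import tensorflow_compression as tfc

def grayscale_analysis_transform():
    # 9x9 filters with stride 4, then two 5x5 stride-2 stages, each ending
    # in GDN; 128 channels throughout, so the output has half the
    # dimensionality of the input (128 channels / 16^2 spatial reduction).
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(128, 9, strides=4, padding="same", name="conv_0"),
        tfc.GDN(name="gdn_0"),
        tf.keras.layers.Conv2D(128, 5, strides=2, padding="same", name="conv_1"),
        tfc.GDN(name="gdn_1"),
        tf.keras.layers.Conv2D(128, 5, strides=2, padding="same", name="conv_2"),
        tfc.GDN(name="gdn_2"),
    ])
```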
We first verified that the continuously-relaxed loss function given in section 3 provides a good approximation to the actual rate–distortion values obtained with quantization (figure 4). The relaxed distortion term appears to be mostly unbiased, and exhibits a relatively small variance. The relaxed (differential) entropy provides a somewhat positively biased estimate of the discrete entropy for the coarser quantization regime, but the bias disappears for finer quantization, as expected. Note that since the values of $\lambda$ do not have any intrinsic meaning, but serve only to map out the convex hull of optimal points in the rate–distortion plane (figure 2, left panel), a constant bias in either of the terms would simply alter the effective value of $\lambda$, with no effect on the compression performance.
We compare the rate–distortion performance of our method to two standard methods: JPEG and JPEG 2000. For our method, all images were compressed using uniform quantization (the continuous relaxation using additive noise was used only for training purposes). To make the comparisons more fair, we implemented a simple entropy code based on the context-based adaptive binary arithmetic coding framework (CABAC; Marpe, Schwarz, and Wiegand, 2003). All sideband information needed by the decoder (size of images, value of $\lambda$, etc.) was included in the bit stream (see appendix). Note that although the computational costs for training our models are quite high, encoding or decoding an image with the trained models is efficient, requiring only execution of the optimized analysis transformation and quantizer, or the synthesis transformation, respectively. Evaluations were performed on the Kodak image dataset, an uncompressed set of images commonly used to evaluate image compression methods. We also examined a set of relatively standard (if outdated) images used by the compression community (known by the names “Lena”, “Barbara”, “Peppers”, and “Mandrill”) as well as a set of our own digital photographs. None of these test images was included in the training set. All test images, compressed at a variety of bit rates using all three methods, along with their associated rate–distortion curves, are available online at http://www.cns.nyu.edu/~lcv/iclr2017
Although we used MSE as a distortion metric for training, the appearance of compressed images is both qualitatively different and substantially improved, compared to JPEG and JPEG 2000. As an example, figure 5 shows an image compressed using our method optimized for a low value of $\lambda$ (and thus, a low bit rate), compared to JPEG/JPEG 2000 images compressed at equal or greater bit rates. The image compressed with our method has less detail than the original (not shown, but available online), with fine texture and other patterns often eliminated altogether, but this is accomplished in a way that preserves the smoothness of contours and sharpness of many of the edges, giving them a natural appearance. By comparison, the JPEG and JPEG 2000 images exhibit artifacts that are common to all linear transform coding methods: since local features (edges, contours, texture elements, etc.) are represented using particular combinations of localized linear basis functions, independent scalar quantization of the transform coefficients causes imbalances in these combinations, and leads to visually disturbing blocking, aliasing, and ringing artifacts that reflect the underlying basis functions.
Remarkably, we find that the perceptual advantages of our method hold for all images tested, and at all bit rates. The progression from high to low bit rates is shown for an example image in figure 6 (additional examples provided in appendix and online). As bit rate is reduced, JPEG and JPEG 2000 degrade their approximation of the original image by coarsening the precision of the coefficients of linear basis functions, thus exposing the visual appearance of those basis functions. On the other hand, our method appears to progressively simplify contours and other image features, effectively concealing the underlying quantization of the representation. Consistent with the appearance of these example images, we find that distortion measured with a perceptual metric (MS-SSIM; Wang, Simoncelli, and Bovik, 2003) indicates substantial improvements across all tested images and bit rates (figure 7; additional examples provided in the appendix and online). Finally, when quantified with PSNR, we find that our method exhibits better rate–distortion performance than both JPEG and JPEG 2000 for most (but not all) test images, especially at the lower bit rates.
5. DISCUSSION
We have presented a complete image compression method based on nonlinear transform coding, and a framework to optimize it end-to-end for rate–distortion performance. Our compression method offers improvements in rate–distortion performance over JPEG and JPEG 2000 for most images and bit rates. More remarkably, although the method was optimized using mean squared error as a distortion metric, the compressed images are much more natural in appearance than those compressed with JPEG or JPEG 2000, both of which suffer from the severe artifacts commonly seen in linear transform coding methods. Consistent with this, perceptual quality (as estimated with the MS-SSIM index) exhibits substantial improvement across all test images and bit rates. We believe this visual improvement arises because the cascade of biologically-inspired nonlinear transformations in the model have been optimized to capture the features and attributes of images that are represented in the statistics of the data, parallel to the processes of evolution and development that are believed to have shaped visual representations within the human brain (Simoncelli and Olshausen, 2001). Nevertheless, additional visual improvements might be possible if the method were optimized using a perceptual metric in place of MSE (Ballé, Laparra, and Simoncelli, 2016).
For comparison to linear transform coding methods, we can interpret our analysis transform as a single-stage linear transform followed by a complex vector quantizer. As in many other optimized representations – e.g., sparse coding (Lewicki and Olshausen, 1998) – as well as many engineered representations – e.g., the steerable pyramid (Simoncelli, Freeman, et al., 1992), curvelets (Candès and Donoho, 2002), and dual-tree complex wavelets (Selesnick, Baraniuk, and Kingsbury, 2005) – the filters in this first stage are localized and oriented and the representation is overcomplete. Whereas most transform coding methods use complete (often orthogonal) linear transforms with spatially separable filters, the overcompleteness and orientation tuning of our initial transform may explain the ability of the model to better represent features and contours with continuously varying orientation, position and scale (Simoncelli, Freeman, et al., 1992).
Our work is related to two previous publications that optimize image representations with the goal of image compression. Gregor, Besse, et al. (2016) introduce an interesting hierarchical representation of images, in which degradations are more natural looking than those of linear representations. However, rather than optimizing directly for rate–distortion performance, their modeling is generative. Due to the differences between these approaches (as outlined in section 3.1), their procedure of obtaining coding representations from the generative model (scalar quantization, and elimination of hierarchical levels of the representation) is less systematic than our approach and unlikely to be optimal. Further, no entropy code is provided, and the authors therefore resort to comparing entropy estimates to bit rates of established compression methods, which can be unreliable. The model developed by Toderici et al. (2016) is optimized to provide various rate–distortion trade-offs and directly output a binary representation, making it more easily comparable to other image compression methods. Moreover, their formulation has the advantage over ours that a single representation is sought for all rate points. However, it is not clear whether their formulation necessarily leads to rate–distortion optimality (and their empirical results suggest that this is not the case).
We are currently testing models that use simpler rectified-linear or sigmoidal nonlinearities, to determine how much of the performance and visual quality of our results is due to use of biologically-inspired joint nonlinearities. Preliminary results indicate that qualitatively similar results are achievable with other activation functions we tested, but that rectified linear units generally require a substantially larger number of model parameters/stages to achieve the same rate–distortion performance as the GDN/IGDN nonlinearities. This suggests that GDN/IGDN transforms are more efficient for compression, producing better models with fewer stages of processing (as we previously found for density estimation; Ballé, Laparra, and Simoncelli, 2015), which might be an advantage for deployment of our method, say, in embedded systems. However, such conclusions are based on a somewhat limited set of experiments and should at this point be considered provisional. More generally, GDN represents a multivariate generalization of a particular type of sigmoidal function. As such, the observed efficiency advantage relative to pointwise nonlinearities is expected, and a variant of a universal function approximation theorem (e.g., Leshno et al., 1993) should hold.
The rate–distortion objective can be seen as a particular instantiation of the general unsupervised learning or density estimation problems. Since the transformation to a discrete representation may be viewed as a form of classification, it is worth considering whether our framework offers any insights that might be transferred to more specific supervised learning problems, such as object recognition. For example, the additive noise used in the objective function as a relaxation of quantization might also serve the purpose of making supervised classification networks more robust to small perturbations, and thus allow them to avoid catastrophic “adversarial” failures that have been demonstrated in previous work (Szegedy et al., 2013). In any case, our results provide a strong example of the power of end-to-end optimization in achieving a new solution to a classical problem.
Reference: 端到端的圖像壓縮——《End-to-end optimized image compression》筆記, CSDN blog by 葉笙簫.
The overall algorithm consists of three parts: a nonlinear analysis transform (encoder), a uniform quantizer, and a nonlinear synthesis transform (decoder).
$x$ and $\hat{x}$ denote the original input image and the image reconstructed after the encoder–decoder pipeline, respectively.
$g_a$ denotes the nonlinear analysis transform implemented by the encoder: the input image is passed through the encoder network to obtain a latent representation, which then goes through the quantizer $q$ to produce the quantized result $\hat{y}$.
The decoder $g_s$ then reconstructs the image from $\hat{y}$.
Practice
TensorFlow Compression (TFC) is a library for data compression and decompression in TensorFlow. It provides a collection of compression algorithms and tools for compressing model parameters, features, and data in machine learning and deep learning tasks.
With TFC, you can:
- Model compression: compress model parameters to reduce a model's storage footprint and memory usage.
- Feature compression: compress input features to reduce their dimensionality and representation size.
- Data compression: compress training or test datasets to reduce storage and transmission overhead.
TFC provides a variety of compression algorithms, both lossless and lossy. Some commonly used components include:
- GDN (Generalized Divisive Normalization): a nonlinear transform for features of image and video data.
- Ballé's method: an algorithm built around lossless entropy coding, usable for compressing model parameters.
- Entropy coders: various entropy coders, such as arithmetic coding and Huffman coding, for data compression.
By using TFC, you can minimize the storage footprint of models and data while preserving the quality and accuracy of the compressed results. This is useful for deploying machine learning models in resource-constrained environments or for large-scale data processing.
Tutorial: Learned data compression | TensorFlow Core
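To follow along, install the library first (the PyPI package name is tensorflow-compression) and check that it imports:

```python
# pip install tensorflow-compression
import tensorflow_compression as tfc

print(tfc.__version__)
```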
Overview
This notebook shows how to do lossy data compression using neural networks and TensorFlow Compression. Lossy compression involves making a trade-off between rate, the expected number of bits needed to encode a sample, and distortion, the expected error in the reconstruction of the sample. The examples below use an autoencoder-like model to compress images from the MNIST dataset. The method is based on the paper End-to-end Optimized Image Compression. More background on learned data compression can be found in this paper targeted at people familiar with classical data compression, or this survey targeted at a machine learning audience.
Define the trainer model
Because the model resembles an autoencoder, and we need to perform a different set of functions during training and inference, the setup is a little different from, say, a classifier.
The training model consists of three parts:
- the analysis (or encoder) transform, converting from the image into a latent space,
- the synthesis (or decoder) transform, converting from the latent space back into image space, and
- a prior and entropy model, modeling the marginal probabilities of the latents.
First, define the transforms:
```python
# Imports used throughout this tutorial.
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_compression as tfc
import tensorflow_datasets as tfds


def make_analysis_transform(latent_dims):
  """Creates the analysis (encoder) transform."""
  return tf.keras.Sequential([
      tf.keras.layers.Conv2D(
          20, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_1"),
      tf.keras.layers.Conv2D(
          50, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_2"),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(
          500, use_bias=True, activation="leaky_relu", name="fc_1"),
      tf.keras.layers.Dense(
          latent_dims, use_bias=True, activation=None, name="fc_2"),
  ], name="analysis_transform")
```
```python
def make_synthesis_transform():
  """Creates the synthesis (decoder) transform."""
  return tf.keras.Sequential([
      tf.keras.layers.Dense(
          500, use_bias=True, activation="leaky_relu", name="fc_1"),
      tf.keras.layers.Dense(
          2450, use_bias=True, activation="leaky_relu", name="fc_2"),
      tf.keras.layers.Reshape((7, 7, 50)),
      tf.keras.layers.Conv2DTranspose(
          20, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_1"),
      tf.keras.layers.Conv2DTranspose(
          1, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_2"),
  ], name="synthesis_transform")
```
The trainer holds an instance of both transforms, as well as the parameters of the prior. Its `call` method is set up to compute:
- rate, an estimate of the number of bits needed to represent the batch of digits, and
- distortion, the mean absolute difference between the pixels of the original digits and their reconstructions.
My idea: the `call` referred to here is the `call` method in the code below; its job is to compute the rate and the distortion.
```python
class MNISTCompressionTrainer(tf.keras.Model):
  """Model that trains a compressor/decompressor for MNIST."""

  def __init__(self, latent_dims):
    super().__init__()
    self.analysis_transform = make_analysis_transform(latent_dims)
    self.synthesis_transform = make_synthesis_transform()
    self.prior_log_scales = tf.Variable(tf.zeros((latent_dims,)))

  @property
  def prior(self):
    return tfc.NoisyLogistic(loc=0., scale=tf.exp(self.prior_log_scales))

  def call(self, x, training):
    """Computes rate and distortion losses."""
    # Ensure inputs are floats in the range (0, 1).
    x = tf.cast(x, self.compute_dtype) / 255.
    x = tf.reshape(x, (-1, 28, 28, 1))

    # Compute latent space representation y, perturb it and model its entropy,
    # then compute the reconstructed pixel-level representation x_hat.
    y = self.analysis_transform(x)
    entropy_model = tfc.ContinuousBatchedEntropyModel(
        self.prior, coding_rank=1, compression=False)
    y_tilde, rate = entropy_model(y, training=training)
    x_tilde = self.synthesis_transform(y_tilde)

    # Average number of bits per MNIST digit.
    rate = tf.reduce_mean(rate)

    # Mean absolute difference across pixels.
    distortion = tf.reduce_mean(abs(x - x_tilde))

    return dict(rate=rate, distortion=distortion)
```
My idea: compute the latent representation y, perturb it and model its entropy, then compute the reconstructed pixel-level representation x_hat.
Load the MNIST dataset for training and validation:
```python
training_dataset, validation_dataset = tfds.load(
    "mnist",
    split=["train", "test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=False,
)
```
Extract one image x:
```python
(x, _), = validation_dataset.take(1)

plt.imshow(tf.squeeze(x))
print(f"Data type: {x.dtype}")
print(f"Shape: {x.shape}")
```
To get the latent representation y, we need to cast it to float32, add a batch dimension, and pass it through the analysis transform.
```python
x = tf.cast(x, tf.float32) / 255.
x = tf.reshape(x, (-1, 28, 28, 1))
y = make_analysis_transform(10)(x)

print("y:", y)
```
My idea: this is where the input x (the image) is passed through the encoder to obtain y.
The latents will be quantized at test time. To model this in a differentiable way during training, we add uniform noise in the interval $(-0.5, 0.5)$ and call the result $\tilde y$. This is the same terminology as used in the paper End-to-end Optimized Image Compression.
```python
y_tilde = y + tf.random.uniform(y.shape, -.5, .5)
print("y_tilde:", y_tilde)
```
My idea: uniform noise is added so that the (simulated) quantization stays differentiable during training.
The “prior” is a probability density that we train to model the marginal distribution of the noisy latents. For example, it could be a set of independent logistic distributions with different scales for each latent dimension. `tfc.NoisyLogistic` accounts for the fact that the latents have additive noise. As the scale approaches zero, a logistic distribution approaches a Dirac delta (spike), but the added noise causes the “noisy” distribution to approach the uniform distribution instead.
```python
prior = tfc.NoisyLogistic(loc=0., scale=tf.linspace(.01, 2., 10))

_ = tf.linspace(-6., 6., 501)[:, None]
plt.plot(_, prior.prob(_));
```
During training, `tfc.ContinuousBatchedEntropyModel` adds uniform noise, and uses the noise and the prior to compute a (differentiable) upper bound on the rate (the average number of bits necessary to encode the latent representation). That bound can be minimized as a loss.
```python
entropy_model = tfc.ContinuousBatchedEntropyModel(
    prior, coding_rank=1, compression=False)
y_tilde, rate = entropy_model(y, training=True)

print("rate:", rate)
print("y_tilde:", y_tilde)
```
Lastly, the noisy latents are passed back through the synthesis transform to produce an image reconstruction $\tilde x$. Distortion is the error between original image and reconstruction. Obviously, with the transforms untrained, the reconstruction is not very useful.
```python
x_tilde = make_synthesis_transform()(y_tilde)

# Mean absolute difference across pixels.
distortion = tf.reduce_mean(abs(x - x_tilde))
print("distortion:", distortion)

x_tilde = tf.saturate_cast(x_tilde[0] * 255, tf.uint8)
plt.imshow(tf.squeeze(x_tilde))
print(f"Data type: {x_tilde.dtype}")
print(f"Shape: {x_tilde.shape}")
```
For every batch of digits, calling the MNISTCompressionTrainer produces the rate and distortion as an average over that batch:
```python
(example_batch, _), = validation_dataset.batch(32).take(1)
trainer = MNISTCompressionTrainer(10)
example_output = trainer(example_batch)

print("rate: ", example_output["rate"])
print("distortion: ", example_output["distortion"])
```
In the next section, we set up the model to do gradient descent on these two losses.
My idea: is there no quantization? From the code so far it looks like an image just goes into the encoder, then into the entropy model, and then through the decoder to get the reconstruction. (The test-time section below answers this: during training only uniform noise is added; hard quantization and range coding happen at compression time.)
Train the model
We compile the trainer in a way that it optimizes the rate–distortion Lagrangian, that is, a sum of rate and distortion, where one of the terms is weighted by the Lagrange parameter $\lambda$.
This loss function affects the different parts of the model differently:
- The analysis transform is trained to produce a latent representation that achieves the desired trade-off between rate and distortion.
- The synthesis transform is trained to minimize distortion, given the latent representation.
- The parameters of the prior are trained to minimize the rate given the latent representation. This is identical to fitting the prior to the marginal distribution of latents in a maximum likelihood sense.
My idea: compiling the trainer means optimizing the rate–distortion Lagrangian directly. One point of confusion: what does "fitting the prior to the marginal distribution of latents in a maximum likelihood sense" mean? Minimizing the expected rate $-\log_2 p_{\tilde y}(\tilde y)$ over latents produced from the training data is, up to a constant factor, exactly maximizing the likelihood of those latents under the prior — i.e., maximum-likelihood fitting of the prior's parameters to the latents' marginal distribution.
```python
def pass_through_loss(_, x):
  # Since rate and distortion are unsupervised, the loss doesn't need a target.
  return x


def make_mnist_compression_trainer(lmbda, latent_dims=50):
  trainer = MNISTCompressionTrainer(latent_dims)
  trainer.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
      # Just pass through rate and distortion as losses/metrics.
      loss=dict(rate=pass_through_loss, distortion=pass_through_loss),
      metrics=dict(rate=pass_through_loss, distortion=pass_through_loss),
      loss_weights=dict(rate=1., distortion=lmbda),
  )
  return trainer
```
Next, train the model. The human annotations are not necessary here, since we just want to compress the images, so we drop them using a map and instead add “dummy” targets for rate and distortion.
```python
def add_rd_targets(image, label):
  # Training is unsupervised, so labels aren't necessary here. However, we
  # need to add "dummy" targets for rate and distortion.
  return image, dict(rate=0., distortion=0.)


def train_mnist_model(lmbda):
  trainer = make_mnist_compression_trainer(lmbda)
  trainer.fit(
      training_dataset.map(add_rd_targets).batch(128).prefetch(8),
      epochs=15,
      validation_data=validation_dataset.map(add_rd_targets).batch(128).cache(),
      validation_freq=1,
      verbose=1,
  )
  return trainer

trainer = train_mnist_model(lmbda=2000)
```
Compress some MNIST images
For compression and decompression at test time, we split the trained model in two parts:
- The encoder side consists of the analysis transform and the entropy model.
- The decoder side consists of the synthesis transform and the same entropy model.
At test time, the latents will not have additive noise, but they will be quantized and then losslessly compressed, so we give them new names. We call the quantized latents $\hat y$ and the image reconstruction $\hat x$ (following End-to-end Optimized Image Compression).
```python
class MNISTCompressor(tf.keras.Model):
  """Compresses MNIST images to strings."""

  def __init__(self, analysis_transform, entropy_model):
    super().__init__()
    self.analysis_transform = analysis_transform
    self.entropy_model = entropy_model

  def call(self, x):
    # Ensure inputs are floats in the range (0, 1).
    x = tf.cast(x, self.compute_dtype) / 255.
    y = self.analysis_transform(x)
    # Also return the exact information content of each digit.
    _, bits = self.entropy_model(y, training=False)
    return self.entropy_model.compress(y), bits


class MNISTDecompressor(tf.keras.Model):
  """Decompresses MNIST images from strings."""

  def __init__(self, entropy_model, synthesis_transform):
    super().__init__()
    self.entropy_model = entropy_model
    self.synthesis_transform = synthesis_transform

  def call(self, string):
    y_hat = self.entropy_model.decompress(string, ())
    x_hat = self.synthesis_transform(y_hat)
    # Scale and cast back to 8-bit integer.
    return tf.saturate_cast(tf.round(x_hat * 255.), tf.uint8)
```
When instantiated with `compression=True`, the entropy model converts the learned prior into tables for a range coding algorithm. When calling `compress()`, this algorithm is invoked to convert the latent space vector into bit sequences. The length of each binary string approximates the information content of the latent (the negative log likelihood of the latent under the prior).

The entropy model for compression and decompression must be the same instance, because the range coding tables need to be exactly identical on both sides. Otherwise, decoding errors can occur.
```python
def make_mnist_codec(trainer, **kwargs):
  # The entropy model must be created with `compression=True` and the same
  # instance must be shared between compressor and decompressor.
  entropy_model = tfc.ContinuousBatchedEntropyModel(
      trainer.prior, coding_rank=1, compression=True, **kwargs)
  compressor = MNISTCompressor(trainer.analysis_transform, entropy_model)
  decompressor = MNISTDecompressor(entropy_model, trainer.synthesis_transform)
  return compressor, decompressor

compressor, decompressor = make_mnist_codec(trainer)
```
Grab 16 images from the validation dataset. You can select a different subset by changing the argument to `skip`:
```python
(originals, _), = validation_dataset.batch(16).skip(3).take(1)
```
Compress them to strings, and keep track of each of their information content in bits.
```python
strings, entropies = compressor(originals)

print(f"String representation of first digit in hexadecimal: 0x{strings[0].numpy().hex()}")
print(f"Number of bits actually needed to represent it: {entropies[0]:0.2f}")
```
Decompress the images back from the strings.
```python
reconstructions = decompressor(strings)
```
Display each of the 16 original digits together with its compressed binary representation, and the reconstructed digit.
```python
def display_digits(originals, strings, entropies, reconstructions):
  """Visualizes 16 digits together with their reconstructions."""
  fig, axes = plt.subplots(4, 4, sharex=True, sharey=True, figsize=(12.5, 5))
  axes = axes.ravel()
  for i in range(len(axes)):
    image = tf.concat([
        tf.squeeze(originals[i]),
        tf.zeros((28, 14), tf.uint8),
        tf.squeeze(reconstructions[i]),
    ], 1)
    axes[i].imshow(image)
    axes[i].text(
        .5, .5, f"→ 0x{strings[i].numpy().hex()} →\n{entropies[i]:0.2f} bits",
        ha="center", va="top", color="white", fontsize="small",
        transform=axes[i].transAxes)
    axes[i].axis("off")
  plt.subplots_adjust(wspace=0, hspace=0, left=0, right=1, bottom=0, top=1)

display_digits(originals, strings, entropies, reconstructions)
```
Note that the length of the encoded string differs from the information content of each digit. This is because the range coding process works with discrete probabilities, and has a small amount of overhead. So, especially for short strings, the correspondence is only approximate. However, range coding is asymptotically optimal: in the limit, the expected bit count will approach the cross entropy (the expected information content), for which the rate term in the training model is an upper bound.
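To see this concretely, you can compare the length of one range-coded string against the model's information-content estimate for the same digit, reusing the `strings` and `entropies` computed above (a small illustrative check, not part of the original tutorial):

```python
coded_bits = 8 * len(strings[0].numpy())  # actual length of the coded string
print(f"coded length: {coded_bits} bits, "
      f"information content: {entropies[0]:0.2f} bits")
```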
The rate–distortion trade-off
Above, the model was trained for a specific trade-off (given by `lmbda=2000`) between the average number of bits used to represent each digit and the incurred error in the reconstruction.
What happens when we repeat the experiment with different values?
Let’s start by reducing $\lambda$ to 500.
```python
def train_and_visualize_model(lmbda):
  trainer = train_mnist_model(lmbda=lmbda)
  compressor, decompressor = make_mnist_codec(trainer)
  strings, entropies = compressor(originals)
  reconstructions = decompressor(strings)
  display_digits(originals, strings, entropies, reconstructions)

train_and_visualize_model(lmbda=500)
```
The strings begin to get much shorter now, on the order of one byte per digit. However, this comes at a cost: more digits are becoming unrecognizable. This demonstrates that this model is agnostic to human perception of error; it just measures the absolute deviation in pixel values. To achieve a better perceived image quality, we would need to replace the pixel loss with a perceptual loss.
Use the decoder as a generative model
If we feed the decoder random bits, this will effectively sample from the distribution that the model learned to represent digits.
First, re-instantiate the compressor/decompressor without a sanity check that would detect if the input string isn’t completely decoded.
```python
compressor, decompressor = make_mnist_codec(trainer, decode_sanity_check=False)
```
Now, feed long enough random strings into the decompressor so that it can decode/sample digits from them.
```python
import os

strings = tf.constant([os.urandom(8) for _ in range(16)])
samples = decompressor(strings)

fig, axes = plt.subplots(4, 4, sharex=True, sharey=True, figsize=(5, 5))
axes = axes.ravel()
for i in range(len(axes)):
  axes[i].imshow(tf.squeeze(samples[i]))
  axes[i].axis("off")
plt.subplots_adjust(wspace=0, hspace=0, left=0, right=1, bottom=0, top=1)
```
Troubleshooting: if importing the library fails with an error like

```
tensorflow.python.framework.errors_impl.NotFoundError: /home/jjh/compression-master/tensorflow_compression/python/ops/../../cc/libtensorflow_compression.so: cannot open shared object file
```

update the build tooling and reinstall from the package directory:

```
cd tensorflow_compression
python -m pip install -U pip setuptools wheel
```
Variational autoencoders
Using tf.keras.Sequential to connect the generative and inference networks
In this VAE example, two small ConvNets are used for the encoder and decoder networks. In the literature, these networks are also called the inference/recognition model and the generative model, respectively. `tf.keras.Sequential` is used to simplify the implementation. In the descriptions below, let x and z denote the observation and the latent variable, respectively.
Inference network (encoder)
This defines the approximate posterior distribution q(z|x): it takes an observation as input and outputs a set of parameters specifying the conditional distribution of the latent representation z. Here the distribution is modeled simply as a diagonal Gaussian, so the network outputs the mean and log-variance parameters of a factorized Gaussian. Outputting the log-variance rather than the variance directly improves numerical stability.
Generative network (decoder)
This defines the conditional distribution of the observation, p(x|z): it takes a latent sample z as input and outputs the parameters of the observation's conditional distribution. The latent prior p(z) is modeled as a unit Gaussian.
The reparameterization trick
To generate a sample z for the decoder during training, you can sample from the latent distribution defined by the parameters the encoder outputs for a given input observation x. However, this sampling operation creates a bottleneck, because backpropagation cannot flow through a random node.
To work around this, use the reparameterization trick. In our example, z is approximated using the encoder outputs $\mu$ and $\sigma$ together with a noise term $\varepsilon$, as follows:

$$z = \mu + \sigma \odot \varepsilon$$
where $\mu$ and $\sigma$ denote the mean and standard deviation of the Gaussian, respectively; both are derived from the encoder output. $\varepsilon$ can be thought of as random noise that preserves the stochasticity of z, and is drawn from a standard normal distribution. The latent z is now generated as a function of $\mu$, $\sigma$, and $\varepsilon$, which lets the model backpropagate gradients through $\mu$ and $\sigma$ while keeping the stochasticity in $\varepsilon$.
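A minimal sketch of this trick, following the convention above that the encoder outputs the mean and log-variance:

```python
import tensorflow as tf

def reparameterize(mean, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); gradients flow through
    # mean and logvar, while eps carries the randomness.
    eps = tf.random.normal(shape=tf.shape(mean))
    return mean + tf.exp(0.5 * logvar) * eps
```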
Network architecture
For the encoder network, two convolutional layers followed by a fully connected layer are used. In the decoder network, this architecture is mirrored using a fully connected layer followed by three transposed convolutional layers (also called deconvolutional layers in some contexts). Note that it is common practice to avoid batch normalization when training VAEs, since the extra stochasticity from mini-batches can aggravate instability on top of the stochasticity from sampling.
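A sketch of that architecture in Keras for 28×28 MNIST-sized inputs (the layer sizes are assumptions in the spirit of the standard MNIST VAE tutorial, not prescribed by the text above):

```python
import tensorflow as tf

latent_dim = 2

# Encoder: two strided convolutions, then a fully connected layer that
# outputs the mean and log-variance of q(z|x).
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(latent_dim + latent_dim),  # mean and log-variance
])

# Decoder: mirror image, a fully connected layer followed by three
# transposed convolutions; no batch normalization, as noted above.
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(7 * 7 * 32, activation="relu"),
    tf.keras.layers.Reshape((7, 7, 32)),
    tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(1, 3, strides=1, padding="same"),
])
```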