[2021.04] Rethinking and Improving the Robustness of Image Style Transfer

In neural style transfer, a pretrained VGG network has proven quite effective at capturing the visual style of an image. Style transfer with features extracted from ResNet-family networks, however, often fails to work well. Through experiments, this paper finds that residual connections tend to produce feature maps with large peaks and low entropy, and that this property is a poor fit for the Gram matrix, which is built on correlations between features, and therefore for style transfer based on it. To address this, the authors propose a simple softmax transformation that raises the entropy of the feature maps, apply it to the style loss, and show that the method works well even with randomly weighted networks. Ultimately, this suggests that for the neural style transfer task the network architecture itself (a structure without skip connections) matters more than the learned weights.

Personally, I find this paper interesting less for its application to neural style transfer than for showing experimentally that residual-connection-based architectures tend toward highly correlated, large-peak, low-entropy distributions as the layers get deeper. My guess is that this may follow from the recursive structure in which new features keep accumulating onto the previous layer's activations (because of the ReLU, the part added through the skip connection is only ever positive). Beyond that, it seems worth thinking further about what effect adding a regularization that encourages high entropy, in the same spirit as knowledge distillation, ultimately has.

Conference: CVPR 2021 (Best Paper Candidate)
URL: https://arxiv.org/abs/2104.05623

1. Introduction

Figure 1 shows an example of style transfer using different models. The VGG transfers style (color, texture, strokes) more faithfully than the ResNet, for both pre-trained and random weights. There is still no clear answer as to why style transfer is much more effective for the VGG. The fact that the VGG with random weights can generate comparable results [13, 3] suggests that this property is somewhat inherent to the architecture.

Figure 1: Neural style transfer by different architectures. (‘p-’ and ‘r-’ denote pre-trained and randomly initialized networks, respectively.)

We seek the architectural properties that explain the differences between these activations from different architectures, and how these could explain the discrepancy between stylization results. Taking the ResNet, we study the statistics of both activations and the derived Gram matrices, and the striking observation is that, when normalized into a probability distribution, the ResNet activations of deeper layers have large peaks and small entropy. This shows that they are dominated by a few feature channels and have nearly deterministic correlation patterns. It suggests that the optimization used to synthesize the stylized images is biased into replicating a few dominant patterns of the style image and ignoring the majority. This explains why the ResNet is unable to transfer high-level style patterns, usually captured in deeper layers of the network. 

In contrast, VGG activations are approximately uniform for all layers, capturing a much larger diversity of style patterns. We then show that peaky activations are explained by the existence of residual or shortcut connections between layers. The fact that these connections are prevalent in most modern architectures explains why the robustness problem is so widespread.

2. Methods

2.1. Preliminaries (Optional)

A convolutional neural network (CNN) maps an image $\mathbf{x}_{0}$ into a set of feature maps $\left\{F^{l}\left(\mathbf{x}_{0}\right)\right\}_{l=1}^{L}$, where $F^{l}: \mathbb{R}^{W_{0} \times H_{0} \times 3} \rightarrow \mathbb{R}^{W_{l} \times H_{l} \times D_{l}}$ is the mapping from the image to the tensor of activations of the $l^{\text{th}}$ layer. The activation tensor $F^{l}\left(\mathbf{x}_{0}\right)$ can also be reshaped into a matrix $F^{l}\left(\mathbf{x}_{0}\right) \in \mathbb{R}^{D_{l} \times M_{l}}$, where $M_{l}=W_{l} H_{l}$. Image style is frequently assumed to be encoded by a set of Gram matrices $\left\{G^{l}\right\}_{l=1}^{L}$, where $G^{l} \in \mathbb{R}^{D_{l} \times D_{l}}$ is derived from the activations $F^{l}$ of layer $l$ by computing the correlation between activation channels, i.e.

$\left[G^{l}\left(F^{l}\right)\right]_{i j}=\sum_{k} F_{i k}^{l} F_{j k}^{l} \quad (1)$
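
As a minimal sketch of Eq. (1), the snippet below reshapes an activation tensor into a $D_{l} \times M_{l}$ matrix and computes its Gram matrix with NumPy. The tensor shape, the random values, and the function name `gram_matrix` are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Eq. (1): G_ij = sum_k F_ik F_jk for a D x M activation matrix."""
    return features @ features.T

# Hypothetical activation tensor of shape (W, H, D) = (4, 5, 3).
rng = np.random.default_rng(0)
activation = rng.standard_normal((4, 5, 3))

# Flatten the spatial positions, then transpose to D x M with M = W * H.
F = activation.reshape(-1, activation.shape[-1]).T
G = gram_matrix(F)
print(F.shape, G.shape)
```

The Gram matrix discards the spatial index $k$, so it only records how strongly pairs of channels fire together across the image.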

We consider the image stylization framework of $[7]$, where given a content image $\mathbf{x}_{0}^{c}$ and a style image $\mathbf{x}_{0}^{s}$, an image $\mathbf{x}^{*}$ that presents the content of $\mathbf{x}_{0}^{c}$ under the style of $\mathbf{x}_{0}^{s}$ is synthesized by solving

$\mathbf{x}^{*}=\underset{\mathbf{x} \in \mathbb{R}^{W_{0} \times H_{0} \times 3}}{\operatorname{argmin}} \alpha \mathcal{L}_{\text {content }}\left(\mathbf{x}_{0}^{c}, \mathbf{x}\right)+\beta \mathcal{L}_{\text {style }}\left(\mathbf{x}_{0}^{s}, \mathbf{x}\right) \quad (2)$
with
$\begin{aligned} \mathcal{L}_{\text {content }}\left(\mathbf{x}_{0}^{c}, \mathbf{x}\right) &=\frac{1}{2}\left\|F^{l}(\mathbf{x})-F^{l}\left(\mathbf{x}_{0}^{c}\right)\right\|_{2}^{2} \\
 \mathcal{L}_{\text {style }}\left(\mathbf{x}_{0}^{s}, \mathbf{x}\right) &=\sum_{l=1}^{L} \frac{w_{l}}{4 D_{l}^{2} M_{l}^{2}}\left\|G^{l}\left(F^{l}(\mathbf{x})\right)-G^{l}\left(F^{l}\left(\mathbf{x}_{0}^{s}\right)\right)\right\|_{2}^{2} \end{aligned}$

where $w_{l} \in\{0,1\}$ are weighting factors of the contribution of each layer to the total loss. 
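
The two losses in Eq. (2) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the feature matrices are random stand-ins for real network activations, the layer shapes and the values of `alpha` and `beta` are hypothetical, and the content loss is evaluated on the last layer only.

```python
import numpy as np

def gram_matrix(F):
    # Eq. (1) for a D x M activation matrix.
    return F @ F.T

def content_loss(F_x, F_c):
    # 1/2 * squared distance between activations of one layer.
    return 0.5 * np.sum((F_x - F_c) ** 2)

def style_loss(feats_x, feats_s, weights):
    # Weighted sum over layers of squared Gram-matrix differences.
    total = 0.0
    for F_x, F_s, w in zip(feats_x, feats_s, weights):
        D, M = F_x.shape
        diff = gram_matrix(F_x) - gram_matrix(F_s)
        total += w / (4.0 * D**2 * M**2) * np.sum(diff ** 2)
    return total

# Hypothetical (D_l, M_l) shapes for two layers; random stand-in features.
rng = np.random.default_rng(0)
shapes = [(8, 64), (16, 16)]
feats_x = [rng.standard_normal(s) for s in shapes]  # synthesized image
feats_c = [rng.standard_normal(s) for s in shapes]  # content image
feats_s = [rng.standard_normal(s) for s in shapes]  # style image

alpha, beta = 1.0, 1e3  # hypothetical trade-off weights
total = (alpha * content_loss(feats_x[-1], feats_c[-1])
         + beta * style_loss(feats_x, feats_s, weights=[1, 1]))
print(total)
```

In the actual framework the features depend on the image $\mathbf{x}$, and the objective is minimized over $\mathbf{x}$ by gradient descent; the sketch only evaluates the objective once.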

※ [7] Image style transfer using convolutional neural networks. CVPR 2016.

2.2. Residual Connection and Low Entropy Activations

In a residual bottleneck module, the feature obtained by passing the previous feature $F^{l-1}$ through a function $R(\cdot)$, composed of convolution and batch normalization, is added back to $F^{l-1}$, as shown below. The sum then passes through a ReLU, and this architectural property, under which negative values do not survive, ultimately leads to low-entropy activations in deeper layers.

$F^{l}(\mathbf{x})=\operatorname{ReLU}\left(R\left(F^{l-1}(\mathbf{x})\right)+F^{l-1}(\mathbf{x})\right)$

Because only positive values are ever added in later layers, a large activation that appears midway becomes more and more prominent with depth. The paper describes this as a "whack-a-mole" phenomenon, that is, activations with large peaks and low entropy. By Eq. (1), the Gram matrix inherits this property of the activations and shows the same low-entropy behavior, which hurts neural style transfer.
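
The residual update above can be mimicked with a toy simulation. The sketch below is not the paper's experiment: `residual_branch` is a hypothetical stand-in for $R(\cdot)$ (a random channel-mixing matrix followed by per-channel standardization instead of real convolution and batch normalization), and the sizes are arbitrary. It only prints the maximum activation per layer with and without the shortcut, so the two settings can be compared.

```python
import numpy as np

def residual_branch(F, W):
    # Hypothetical stand-in for R(.): channel mixing + per-channel standardization.
    Z = W @ F
    mean = Z.mean(axis=1, keepdims=True)
    std = Z.std(axis=1, keepdims=True)
    return (Z - mean) / (std + 1e-5)

def residual_layer(F_prev, W):
    # F^l = ReLU(R(F^{l-1}) + F^{l-1})
    return np.maximum(residual_branch(F_prev, W) + F_prev, 0.0)

def plain_layer(F_prev, W):
    # Same layer without the shortcut: F^l = ReLU(R(F^{l-1}))
    return np.maximum(residual_branch(F_prev, W), 0.0)

rng = np.random.default_rng(0)
D, M, depth = 16, 64, 8  # arbitrary toy sizes
F_res = np.maximum(rng.standard_normal((D, M)), 0.0)
F_plain = F_res.copy()

for layer in range(1, depth + 1):
    W = rng.standard_normal((D, D)) / np.sqrt(D)
    F_res = residual_layer(F_res, W)
    F_plain = plain_layer(F_plain, W)
    print(f"layer {layer}: max with shortcut = {F_res.max():.2f}, "
          f"without = {F_plain.max():.2f}")
```

When the branch outputs zero, `residual_layer` returns its non-negative input unchanged, which is the sense in which earlier activations are carried forward through the shortcut.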

2.3. Residual Connection and Statistics

Figure 3 presents the average of maximum values $\max _{i, k} F_{i, k}^{l}$ and $\max _{i, j} G_{i, j}^{l}$, as well as normalized entropies of the activations and Gram matrices, respectively.

Figure 3: Activation statistics of different random architectures.

The figure shows that activations and Gram values have similar behavior. In both cases, the maximum value increases and the entropy decreases gradually with layer depth for the architectures with residual connections (ResNet and pseudo-ResVGG). This is unlike the networks without shortcuts (NoResNet and pseudo-VGG), where activations tend to decrease and entropies remain fairly constant and much higher. 
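
The statistics discussed above can be computed along the following lines. The exact normalization used in the paper is not given in this summary, so the sketch assumes the simplest choice: divide the non-negative values by their sum to obtain a probability distribution, then divide the entropy by $\log N$ so that it lies in $[0, 1]$. The two input matrices are hand-made examples, not network activations.

```python
import numpy as np

def normalized_entropy(values, eps=1e-12):
    # Assumed definition: entropy of values / sum(values), divided by log(N).
    p = values.ravel() / (values.sum() + eps)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(values.size))

def layer_statistics(F):
    # Maximum and normalized entropy of activations F and Gram matrix G.
    G = F @ F.T
    return {
        "max_F": float(F.max()), "entropy_F": normalized_entropy(F),
        "max_G": float(G.max()), "entropy_G": normalized_entropy(G),
    }

# Hand-made examples: a flat activation matrix and one with a dominant channel.
flat = np.ones((8, 32))
peaky = flat.copy()
peaky[0, :] = 50.0

stats_flat = layer_statistics(flat)
stats_peaky = layer_statistics(peaky)
print(stats_flat)
print(stats_peaky)
```

A single dominant channel lowers the entropy of the activations and, because the Gram matrix multiplies activations pairwise, lowers the entropy of the Gram matrix as well.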

2.4. Activation Smoothing

We propose a very simple solution, inspired by the interpretation of stylization as knowledge distillation [15], where significant gains are observed by smoothing teaching distributions. In the same vein, we propose to avoid peaky activations of low entropy, by smoothing all activations with a softmax-based smoothing transformation

$\sigma\left(F_{i k}^{l}(\mathbf{x})\right)=\frac{e^{F_{i k}^{l}(\mathbf{x})}}{\sum_{m, n} e^{F_{m n}^{l}(\mathbf{x})}}$

Note that the softmax layer is not inserted in the network, which continues to be the original model, but only used to redefine the style and content loss functions of (2). The softmax transformation reduces large peaks and increases small values, creating a more uniform distribution. Since this can be seen as a form of smoothing, we denote this approach to stylization as Stylization With Activation smoothinG (SWAG). SWAG successfully suppresses the maxima and increases entropies, especially on deeper layers.
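
A minimal sketch of how the smoothing enters the losses, assuming the softmax is taken over all entries of a layer's $D_{l} \times M_{l}$ activation matrix, as in the formula above. The function names, the random activations, and the per-layer normalization constant (copied from the style loss of Section 2.1) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax_smooth(F):
    # Softmax over all entries of a D x M activation matrix.
    z = F - F.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gram_matrix(F):
    return F @ F.T

def swag_content_loss(F_x, F_c):
    # Content loss evaluated on smoothed activations.
    return 0.5 * np.sum((softmax_smooth(F_x) - softmax_smooth(F_c)) ** 2)

def swag_style_term(F_x, F_s):
    # Style loss of one layer, with Gram matrices of smoothed activations.
    D, M = F_x.shape
    diff = gram_matrix(softmax_smooth(F_x)) - gram_matrix(softmax_smooth(F_s))
    return np.sum(diff ** 2) / (4.0 * D**2 * M**2)

# Random post-ReLU stand-ins for one layer of the synthesized and style images.
rng = np.random.default_rng(0)
F_x = np.maximum(rng.standard_normal((8, 32)), 0.0)
F_s = np.maximum(rng.standard_normal((8, 32)), 0.0)

S = softmax_smooth(F_x)
print(S.sum(), S.min() > 0, swag_style_term(F_x, F_s))
```

The network itself is untouched: the same activations are read out, and only the quantities compared inside the losses change. Note that entries zeroed by the ReLU receive a strictly positive value after the transformation.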

3. Experiments

Figure 6: Comparison of neural style transfer performance between standard and SWAG (denoted with ∗ ) models, for different pre-trained architectures with shortcut connections (R: ResNet, W: WRN, I: Inception). The results of the standard VGG model are also shown for comparison.

 

Figure 7: Comparison of neural style transfer performance between standard and SWAG (denoted with ∗ ) implementations of different stylization algorithms. Top: algorithm of [22]; Bottom: algorithm of [24]. The results of the VGG model are also shown for comparison.