

[2020.11] GIRAFFE: Representing Scenes As Compositional Generative Neural Feature Fields


Photorealistic image synthesis from 2D images has recently reached the point where even high-resolution images can be generated that look real. More recently, with the notions of disentangled representation learning and controllability added on top, models can generate images while varying the attributes of a specific object or the background (e.g., changing hair color continuously). However, such models only produce synthetic images by making small moves in a latent space learned from the distribution of the 2D training images (each move mapping to a generated image in image space), so the characteristics of images projected from the real 3D world into 2D tend not to be reproduced well.

This paper combines controllable GAN-based image synthesis, which previously operated purely in 2D, with a 3D scene representation based on neural radiance fields, proposing an image synthesis method that can also reproduce genuinely 3D properties (e.g., how the projected image should be generated as the camera pose changes). Whereas earlier methods built on neural radiance fields (NeRF) required multi-view images with camera poses given as supervision for view synthesis, the proposed method needs no such per-object supervision: starting from disentangled shape and pose distributions, it uses a 3D geometric model together with a learned implicit 3D scene representation to output 2D images that take the camera pose into account. Because no separate supervision is required, the approach seems highly extensible going forward.

Personally, the fact that the projected image can account for per-object density and thus reproduce real-world occlusion was already surprising in NeRF, and it is still striking when looking at the images generated in this paper. Since object-level transformations and exact camera poses cannot yet be learned from the data, the rendering cannot be pinpointed to exactly the desired pose, but the paper is a nice application that combines two recently introduced techniques and seems to have plenty of room for further development.

Figure 1: We represent scenes as compositional generative neural feature fields. For a randomly sampled camera, we volume render a feature image of the scene based on individual feature fields. While training only on raw image collections, at test time we are able to control the image formation process w.r.t. camera pose, object poses, as well as the objects’ shapes and appearances.

Conference: CVPR 2021 (Best Paper Candidate)
URL: https://arxiv.org/pdf/2011.12100.pdf
Code: https://github.com/autonomousvision/giraffe

1. Introduction

In recent years, the computer vision community has made great strides towards highly realistic image generation based on Generative Adversarial Networks (GANs). Despite these successes, synthesizing realistic 2D images is not the only aspect required in applications of generative models. The generation process should also be controllable in a simple and consistent manner.

To this end, many works investigate how disentangled representations (e.g., shape, pose) can be learned from data without explicit supervision. Most approaches, however, do not consider the compositional nature of scenes and operate in the 2D domain, ignoring that our world is three-dimensional. 

Figure 2: Controllable Image Generation. While most generative models operate in 2D, we incorporate a compositional 3D scene representation into the generative model. This leads to more consistent image synthesis results, e.g. note how, in contrast to our method, translating one object might change the other when operating in 2D (Fig. 2a and 2b).

2. Methods

Our goal is a controllable image synthesis pipeline that can be trained from raw image collections without additional supervision. First, we model individual objects as neural feature fields (Sec. 2.1). Next, we exploit the additive property of feature fields to composite scenes from multiple individual objects (Sec. 2.2). For rendering, we explore an efficient combination of volume and neural rendering techniques (Sec. 2.3).

Figure 3. Our generator G takes a camera pose ξ and N shape and appearance codes z and affine transformations T as input and synthesizes an image of the generated scene which consists of N−1 objects and background. The discriminator D takes the generated image I_hat and the real image I as input and our full model is trained with an adversarial loss. At test time, we can control the camera pose, the shape and appearance codes of the objects, and the objects’ poses in the scene. Orange indicates learnable and blue non-learnable operations.

2.1. Objects as Neural Feature Fields

Neural Radiance Fields: A radiance field is a continuous function $f$ which maps a 3D point $\mathbf{x} \in \mathbb{R}^{3}$ and a viewing direction $\mathbf{d} \in \mathbb{S}^{2}$ to a volume density $\sigma \in \mathbb{R}^{+}$ and an RGB color value $\mathbf{c} \in \mathbb{R}^{3}$. Rather than feeding the low-dimensional $\mathbf{x}$ and $\mathbf{d}$ directly into the network, [61] showed that training works better when they are first mapped to a higher-dimensional space via a positional encoding $\gamma$. Mildenhall et al. [61] propose to learn Neural Radiance Fields (NeRFs) by parameterizing $f$ with a multi-layer perceptron (MLP):

$\begin{aligned} f_{\theta}: \mathbb{R}^{L_{\mathbf{x}}} \times \mathbb{R}^{L_{\mathrm{d}}} & \rightarrow \mathbb{R}^{+} \times \mathbb{R}^{3} \\(\gamma(\mathbf{x}), \gamma(\mathbf{d})) & \mapsto(\sigma, \mathbf{c}) \end{aligned}$

where $\theta$ indicates the network parameters and $L_x$, $L_d$ the output dimensionalities of the positional encodings.

* [61] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV, 2020.
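To make the positional encoding $\gamma$ concrete, here is a minimal PyTorch sketch; the function name and exact tensor layout are my own choices (NeRF's defaults are 10 frequency bands for $\mathbf{x}$ and 4 for $\mathbf{d}$):

```python
import math
import torch

def positional_encoding(p, num_freqs):
    """Map coordinates p of shape (..., D) to sin/cos features at
    frequencies 2^0 ... 2^(num_freqs - 1), scaled by pi.

    The output has shape (..., D * 2 * num_freqs), i.e. L_x or L_d in the text.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=p.dtype, device=p.device) * math.pi
    angles = p[..., None] * freqs                                 # (..., D, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                              # (..., D * 2 * num_freqs)
```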

Generative Neural Feature Fields: While [61] fits θ to multiple posed images of a single scene, Schwarz et al. [77] propose a generative model for Neural Radiance Fields (GRAF) that is trained from unposed image collections. To learn a latent space of NeRFs, they condition the MLP on shape and appearance codes $\mathbf{z}_{s}, \mathbf{z}_{a} \sim \mathcal{N}(\mathbf{0}, I)$:

$\begin{aligned} g_{\theta}: \mathbb{R}^{L_{\mathbf{x}}} \times \mathbb{R}^{L_{\mathbf{d}}} \times \mathbb{R}^{M_{s}} \times \mathbb{R}^{M_{a}} & \rightarrow \mathbb{R}^{+} \times \mathbb{R}^{3} \\\left(\gamma(\mathbf{x}), \gamma(\mathbf{d}), \mathbf{z}_{s}, \mathbf{z}_{a}\right) & \mapsto(\sigma, \mathbf{c}) \end{aligned}$

where $M_{s}, M_{a}$ are the dimensionalities of the latent codes. In this work we explore a more efficient combination of volume and neural rendering. We replace GRAF’s formulation for the three-dimensional color output c with a more generic $M_f$-dimensional feature f and represent objects as Generative Neural Feature Fields:

$\begin{aligned} h_{\theta}: \mathbb{R}^{L_{\mathbf{x}}} \times \mathbb{R}^{L_{\mathrm{d}}} \times \mathbb{R}^{M_{s}} \times \mathbb{R}^{M_{a}} & \rightarrow \mathbb{R}^{+} \times \mathbb{R}^{M_{f}} \\\left(\gamma(\mathbf{x}), \gamma(\mathbf{d}), \mathbf{z}_{s}, \mathbf{z}_{a}\right) & \mapsto(\sigma, \mathbf{f}) \end{aligned}$

* [77] GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. NeurIPS, 2020.
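A minimal sketch of how such a conditioned feature field $h_{\theta}$ could look in PyTorch; the layer widths and the split into a density head and a view/appearance-dependent feature head are illustrative assumptions, not the exact GIRAFFE architecture:

```python
import torch
import torch.nn as nn

class GenerativeFeatureField(nn.Module):
    """Sketch of h_theta: (gamma(x), gamma(d), z_s, z_a) -> (sigma, f)."""

    def __init__(self, L_x, L_d, M_s, M_a, M_f, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(L_x + M_s, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # density: view-independent
        self.feat_head = nn.Sequential(                   # feature: conditioned on d and z_a
            nn.Linear(hidden + L_d + M_a, hidden), nn.ReLU(),
            nn.Linear(hidden, M_f),
        )

    def forward(self, x_enc, d_enc, z_s, z_a):
        # x_enc: (..., L_x), d_enc: (..., L_d), z_s: (..., M_s), z_a: (..., M_a)
        h = self.trunk(torch.cat([x_enc, z_s], dim=-1))
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)  # sigma >= 0
        f = self.feat_head(torch.cat([h, d_enc, z_a], dim=-1))
        return sigma, f
```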

Object Representation for GIRAFFE: A key limitation of NeRF and GRAF is that the entire scene is represented by a single model. We need control over the pose, shape, background and appearance of individual objects and therefore, represent each object using a separate feature field in combination with an affine transformation $\mathbf{T}=\{\mathbf{s}, \mathbf{t}, \mathbf{R}\}$, where $\mathrm{s}, \mathrm{t} \in \mathbb{R}^{3}$ indicate scale and translation parameters, and $\mathbf{R} \in S O(3)$ a rotation matrix. Using this representation, we transform points from object to scene space as follows: 

$$
k(\mathbf{x}) = \mathbf{R} \cdot \operatorname{diag}(s_{1}, s_{2}, s_{3}) \cdot \mathbf{x} + \mathbf{t}
$$

In practice, we volume render in scene space and evaluate the feature field in its canonical object space (see Fig. 1), which allows us to arrange multiple objects in a scene. (Here, $k^{-1}$ maps a point from scene space back into the object's canonical coordinate frame, i.e., it undoes the object-to-scene transformation before the feature field is evaluated.)

$$
(\sigma, \mathbf{f})=h_{\theta}\left(\gamma\left(k^{-1}(\mathbf{x})\right), \gamma\left(k^{-1}(\mathbf{d})\right), \mathbf{z}_{s}, \mathbf{z}_{a}\right)
$$
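A small sketch of the object-to-scene transformation $k$ and its inverse, which is what the feature field is evaluated on; the helper names are hypothetical, and directions $\mathbf{d}$ would be handled analogously (without the translation):

```python
import torch

def object_to_scene(x, s, t, R):
    """k(x) = R @ diag(s) @ x + t, with x: (..., 3), s, t: (3,), R: (3, 3)."""
    return (R @ (s * x)[..., None]).squeeze(-1) + t

def scene_to_object(x_scene, s, t, R):
    """k^{-1}(x): undo translation, then rotation, then scale, so the feature
    field h_theta is always evaluated in the object's canonical space."""
    return (R.transpose(-1, -2) @ (x_scene - t)[..., None]).squeeze(-1) / s
```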

2.2. Scene Compositions

The scene (i.e., the 2D image) is formed by compositing $N$ entities and rendering them from the current camera pose, where the $N$ entities comprise the objects together with the background. Overlapping objects could be an issue here; this is handled with the density-weighted formulation from NeRF [61]: at each point, the entities' densities are summed and their features are combined as a density-weighted sum.
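Concretely, the composition operator in the paper sums the densities of the $N$ entities at a point and takes the density-weighted mean of their features:

$$
C(\mathbf{x}, \mathbf{d})=\left(\sigma, \; \frac{1}{\sigma} \sum_{i=1}^{N} \sigma_{i} \mathbf{f}_{i}\right), \quad \text{where} \quad \sigma=\sum_{i=1}^{N} \sigma_{i}
$$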

2.3. Scene Rendering

3D Volume Rendering: While previous works volume render an RGB color value, we extend this formulation to rendering an $M_f$-dimensional feature vector $\mathbf{f}$. For given camera extrinsics $\boldsymbol{\xi}$, let $\left\{\mathbf{x}_{j}\right\}_{j=1}^{N_{s}}$ be sample points along the camera ray $\mathbf{d}$ for a given pixel, and $\left(\sigma_{j}, \mathbf{f}_{j}\right)=C\left(\mathbf{x}_{j}, \mathbf{d}\right)$ the corresponding densities and feature vectors of the field. The volume rendering operator $\pi_{\mathrm{vol}}$ [37] maps these evaluations to the pixel's final feature vector $\mathbf{f}$:

$$
\pi_{\mathrm{vol}}:\left(\mathbb{R}^{+} \times \mathbb{R}^{M_{f}}\right)^{N_{s}} \rightarrow \mathbb{R}^{M_{f}}, \quad\left\{\sigma_{j}, \mathbf{f}_{j}\right\}_{j=1}^{N_{s}} \mapsto \mathbf{f}
$$
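Concretely, $\pi_{\mathrm{vol}}$ uses the standard NeRF-style alpha-compositing weights along the ray:

$$
\mathbf{f}=\sum_{j=1}^{N_{s}} \tau_{j} \alpha_{j} \mathbf{f}_{j}, \quad \tau_{j}=\prod_{k=1}^{j-1}\left(1-\alpha_{k}\right), \quad \alpha_{j}=1-e^{-\sigma_{j} \delta_{j}}
$$

where $\delta_{j}=\left\|\mathbf{x}_{j+1}-\mathbf{x}_{j}\right\|_{2}$ is the distance between neighboring sample points.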

To obtain $\mathbf{f}$, the $N_s$ feature samples along the ray are combined into a single pixel value via the weighted sum above, just as in NeRF [61]. Doing this for every pixel yields one feature image of the scene; for computational efficiency, the 3D volume rendering is performed at a lower resolution, and an additional step called 2D neural rendering then produces the higher-resolution RGB image.

2D Neural Rendering: The neural rendering operator $\pi_{\theta}^{\text {neural}}$ maps the feature image $\mathbf{I}_{V} \in \mathbb{R}^{H_{V} \times W_{V} \times M_{f}}$ to the final synthesized image $\hat{\mathbf{I}} \in \mathbb{R}^{H \times W \times 3}$. Inspired by [40], we map the feature image to an RGB image at every spatial resolution and add the previous output to the next via bilinear upsampling. These skip connections ensure a strong gradient flow to the feature fields. We obtain our final image prediction $\hat{\mathbf{I}}$ by applying a sigmoid activation to the last RGB layer.
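A rough sketch of such a neural rendering head; the number of upsampling stages, channel widths, and the use of bilinear upsampling for the feature maps are guesses on my part, and only the skip-to-RGB structure and the final sigmoid follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralRenderer(nn.Module):
    """Sketch of pi_theta^neural: feature image (B, M_f, H_V, W_V) -> RGB (B, 3, H, W)."""

    def __init__(self, M_f=128, n_up=2, hidden=64):
        super().__init__()
        ch = [M_f] + [hidden // (2 ** i) for i in range(n_up)]
        self.feat_blocks = nn.ModuleList(
            nn.Conv2d(ch[i], ch[i + 1], 3, padding=1) for i in range(n_up)
        )
        # skip connections: map features to RGB at every spatial resolution
        self.to_rgb = nn.ModuleList(nn.Conv2d(c, 3, 3, padding=1) for c in ch[1:])

    def forward(self, feat):
        rgb = None
        for block, head in zip(self.feat_blocks, self.to_rgb):
            feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
            feat = F.leaky_relu(block(feat), 0.2)
            rgb_cur = head(feat)
            # add the bilinearly upsampled previous RGB output to the current one
            if rgb is not None:
                rgb_cur = rgb_cur + F.interpolate(rgb, scale_factor=2,
                                                  mode="bilinear", align_corners=False)
            rgb = rgb_cur
        return torch.sigmoid(rgb)  # final sigmoid activation, as described above
```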

3. Experiments

Figure 5: Scene Disentanglement. From top to bottom, we show only backgrounds, only objects, color-coded object alpha maps, and the final synthesized images at 64x64 pixel resolution. Disentanglement emerges without supervision, and the model learns to generate plausible backgrounds although the training data only contains images with objects.

It works better than expected. With disentangled learning the model trains well without supervision, and it shows reasonably good rendering quality under 3D control.

Figure 7: Controllable Scene Generation at 256x256 Pixel Resolution. Controlling the generated scenes during image synthesis: Here we rotate or translate objects, change their appearances, and perform complex operations like circular translations.
Figure 8: Generalization Beyond Training Data. As individual objects are correctly disentangled, our model allows for generating out of distribution samples at test time. For example, we can increase the translation ranges or add more objects than there were present in the training data.