[Generation] OASIS (You Only Need Adversarial Supervision for Semantic Image Synthesis) Paper Review

Title

You Only Need Adversarial Supervision for Semantic Image Synthesis (ICLR'21)

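I picked this paper, one of the follow-ups to Pix2pixHD, because I expected its way of injecting noise for diversity to work better than pix2pixHD's.
The paper emphasizes diversity and the limitations of the perceptual loss.
This post focuses on how OASIS performs multi-modal synthesis.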

Existing GAN models for semantic image synthesis rely too heavily on a VGG-based perceptual loss.
- VGG-based perceptual loss: a loss that matches the features of synthetic and real images, which helps a great deal in improving image quality.
- According to the authors, however, the dominance of the perceptual loss adds computational overhead and can ultimately have a negative impact on diversity and quality.
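For reference, a feature-matching loss of this kind is commonly written in the following form in the pix2pixHD line of work. The notation is my own assumption for illustration: $F^{(i)}$ is the $i$-th selected VGG layer with $M_i$ elements, $x$ is the real image, $s$ is the label map, and $G(s)$ is the synthesized image. Individual models may add per-layer weights.

$$
\mathcal{L}_{\text{VGG}} = \sum_{i=1}^{N} \frac{1}{M_i} \left\lVert F^{(i)}(x) - F^{(i)}\big(G(s)\big) \right\rVert_1
$$

Each training step therefore needs an extra forward pass of both images through a pretrained VGG network, which is where the computational overhead comes from.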

To address this, the authors propose two main methods:
- Segmentation-Based Discriminator
- 3D noise tensor for multi-modal synthesis
  - This is the part this post focuses on.

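OASIS differs from earlier Pix2pixHD-family generators such as SPADE in how it injects noise for diversity. (This is the part I personally paid attention to.)
- SPADE uses a 1D noise vector, the output of an image encoder, for multi-modal synthesis. (I would expect this to be somewhat deterministic.)
- OASIS uses no encoder. Instead, it concatenates a 3D random noise tensor (64xHxW) channel-wise with the input label map and feeds the result to the model. (In effect, the encoder output above is replaced by a noise tensor.)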
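The channel-wise concatenation can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' code: the function name `build_generator_input`, the one-hot encoding of the label map, the standard normal noise, and the noise-first channel order are all my assumptions.

```python
import numpy as np

def build_generator_input(label_map, num_classes, noise_channels=64, rng=None):
    """Concatenate a 3D noise tensor with a one-hot label map along channels.

    label_map: integer array of shape (H, W) with values in [0, num_classes).
    Returns an array of shape (noise_channels + num_classes, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = label_map.shape
    # One-hot encode the labels: (H, W) -> (num_classes, H, W).
    classes = np.arange(num_classes)[:, None, None]
    one_hot = (classes == label_map[None]).astype(np.float32)
    # 3D noise (assumed standard normal): one 64-dim sample per spatial position.
    noise = rng.standard_normal((noise_channels, h, w)).astype(np.float32)
    # Channel-wise concat (the noise-first order is an assumption).
    return np.concatenate([noise, one_hot], axis=0)

label_map = np.array([[0, 0, 1],
                      [2, 2, 1]])
x = build_generator_input(label_map, num_classes=3, rng=np.random.default_rng(0))
print(x.shape)  # (67, 2, 3)
```

With 64 noise channels and 3 classes, the input has 67 channels at the same spatial resolution as the label map.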
The noise applied to the input can be controlled pixel-wise or region-wise, so the model is sensitive to it both locally and globally, and the generated images depend on the noise.
Noise sampling during training can be varied as follows. The noise also keeps influencing the model at every layer. (Appendix.7)
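- Image-level: sample one global 1D noise vector and copy it to every pixel
- Region-level: sample one 1D noise vector per label and copy it across that label's region
- Pixel-level: sample a different noise vector at every spatial position (i.e. per pixel)
- Mix: randomly mix the schemes above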
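The four sampling schemes above can be sketched as follows. This is a NumPy illustration under my own assumptions: the function name `sample_noise` is hypothetical, the noise is standard normal, and "mix" is interpreted as picking one of the three schemes at random per call, which may differ from what the paper does.

```python
import numpy as np

def sample_noise(label_map, mode, noise_channels=64, rng=None):
    """Return a (noise_channels, H, W) noise tensor for an (H, W) label map."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = label_map.shape
    if mode == "image":
        # One global vector, copied to every pixel.
        z = rng.standard_normal(noise_channels)
        return np.broadcast_to(z[:, None, None], (noise_channels, h, w)).copy()
    if mode == "region":
        # One vector per label, copied to every pixel carrying that label.
        noise = np.empty((noise_channels, h, w))
        for label in np.unique(label_map):
            z = rng.standard_normal(noise_channels)
            noise[:, label_map == label] = z[:, None]
        return noise
    if mode == "pixel":
        # An independent vector at every spatial position.
        return rng.standard_normal((noise_channels, h, w))
    if mode == "mix":
        # Assumption: choose one of the schemes above at random.
        choice = str(rng.choice(["image", "region", "pixel"]))
        return sample_noise(label_map, choice, noise_channels, rng)
    raise ValueError(f"unknown mode: {mode}")

label_map = np.array([[0, 0, 1],
                      [2, 2, 1]])
z_region = sample_noise(label_map, "region", rng=np.random.default_rng(0))
# Pixels (0, 0) and (0, 1) share label 0, so they share a noise vector.
print(np.array_equal(z_region[:, 0, 0], z_region[:, 0, 1]))  # True
```

At test time the same idea allows resampling the noise for a single region only, changing that region while the rest of the tensor stays fixed.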
Results

Metrics: the following two metrics are used to evaluate the diversity of generated images. For the same label map, 20 images are generated and the mean pairwise score is computed.
- MS-SSIM: lower means more diverse
- LPIPS: higher means more diverse
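The "mean pairwise score" protocol can be sketched as below. The `toy_distance` function is a hypothetical stand-in (mean absolute difference), not MS-SSIM or LPIPS. In practice you would pass a real implementation of either metric as `metric`, and the random arrays would be replaced by 20 images generated from the same label map.

```python
import itertools
import numpy as np

def mean_pairwise_score(images, metric):
    """Average metric(a, b) over all unordered pairs of images."""
    pairs = itertools.combinations(images, 2)
    return float(np.mean([metric(a, b) for a, b in pairs]))

def toy_distance(a, b):
    # Hypothetical stand-in for a real metric: mean absolute difference.
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
# Placeholder for 20 images generated from one label map with different noise.
samples = [rng.random((3, 8, 8)) for _ in range(20)]
score = mean_pairwise_score(samples, toy_distance)
```

With 20 images there are 190 unordered pairs per label map. For a dataset-level number, this score would then be averaged over all label maps.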
Quality and diversity are somewhat in a trade-off relationship. In particular, using the VGG perceptual loss, which improves image quality, lowers diversity, as the results below show.
Results
