[Daily] Unified Reward Model for Multimodal Understanding and Generation — 다은이의 컴퓨터 공부

TLDR;

First unified (generation+ understanding) reward model을 제안 (현재까지는 specific task에 대한 reward model만 존재)

Motivation

현재까지 reward model들은 specific task에만 한정되어 있었음

하지만 task들은 서로 연결되어 있고, 상호작용할 때 효과가 강해진다고 믿음. (e.g., image evaluation이 video evaluation에 도움)

Method

먼저 (1) Large-scale human preference dataset을 만들고
(2) Preference pair dataset을 위한 reward model을 학습
- specific baseline (VLM, Diffusion model)에서 multi-stage filtering (e.g., pair ranking, point sifting)을 통해 pair data를 선택
(3) 이 preference pair dataset으로 DPO를 통해 model들을 학습

1. Unified Preference Dataset Construction (for reward model)

기존에 존재하는 image/video generation + understanding dataset을 모아 합침

2. Unified Preference Learning

VLM을 위 데이터로 fine-tuning함 (assessment ability를 가질 수 있도록 함)

3. Preference Data Construction

3.1. Data generation: image/video pair (or generation prompt)가 존재할 때 VLM (or diffusion model)이 multiple candidate output을 생성
3.2. Pair ranking: N개의 output을 N/2 pair (chosen, rejected pair)로 나눔
3.3. Point sifting: model에게 pointwise score를 내놓으라고 시킴
3.4. Preference pair: 이 score에 따라 preference pair를 생성

4. DPO

Generation과 Understanding에 맞는 loss를 각각 사용

Result

Reward model: LLaVA-OneVision 7B
Multimodal Understanding DPO: apply DPO to LLaVA-OneVision 7B

Multimodal Generation DPO: apply DPO to T2V-turbo / SDXL-turbo

저작자표시

티스토리툴바