[Daily] MM-EUREKA: Exploring Visual Aha Moment with Rule-Based Large-Scale Reinforcement Learning — 다은이의 컴퓨터 공부

TLDR;

The first open-source model to reproduce DeepSeek-R1 (rule-based reinforcement learning) in a multimodal setting
Proposes the multimodal reasoning model 'MM-Eureka'

Motivation

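There have been many attempts to reproduce DeepSeek-R1 in a multimodal setting, but most are either closed-source models or fail to reproduce the 'aha moment'
Here, the aha moment means the model going back to re-check or verify the image partway through its reasoning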

Method

Basic setting: uses InternVL2.5 (8B, 38B) together with DeepSeek-R1's rule-based reward
Data cleaning: open-source datasets such as GeoQA are filtered before use (the authors report that this step was important)

Reward function (rule-based reward)

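Uses the accuracy reward and format reward proposed in DeepSeek-R1 as-is (the two are combined into a weighted sum)
- Accuracy reward: the answer is extracted with the math-verify library; 1 if correct, 0 if wrong
- Format reward: 1 if the response follows the ...... format, 0 otherwise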
- $r = r_{acc} + \lambda r_{format}$
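A minimal sketch of how the two rewards could be combined. Several things here are assumptions for illustration only: the format template is elided in these notes, so the `<think>...</think>` pattern is a placeholder in the style of DeepSeek-R1 outputs; `extract_answer` is a naive string-based stand-in for math-verify, not its API; and `lam=0.5` is an arbitrary value for $\lambda$.

```python
import re

# Assumed template (placeholder): reasoning inside <think>...</think>,
# followed by a non-empty final answer. The real template is not given here.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*\S.*$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response matches the assumed template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(response.strip()) else 0.0


def extract_answer(response: str) -> str:
    """Naive stand-in for math-verify: text after the last closing think tag."""
    return response.split("</think>")[-1].strip()


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer equals the ground truth, else 0.0."""
    return 1.0 if extract_answer(response) == ground_truth.strip() else 0.0


def total_reward(response: str, ground_truth: str, lam: float = 0.5) -> float:
    """r = r_acc + lam * r_format, with lam a hypothetical weight."""
    return accuracy_reward(response, ground_truth) + lam * format_reward(response)
```

With these assumptions, a well-formatted correct response scores `1 + lam`, a correct but unformatted one scores `1`, and a well-formatted wrong one scores `lam`.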

Advantage estimation + Policy update

Uses the REINFORCE Leave-One-Out (RLOO) algorithm (like GRPO, and unlike PPO, it reportedly needs no critic model)
Samples K responses per query (K query-response pairs) and computes the advantage estimator from their rewards
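As a rough illustration, the standard RLOO estimator uses the mean reward of the other K-1 responses as the baseline for each response: $A_i = r_i - \frac{1}{K-1}\sum_{j \neq i} r_j$. The sketch below assumes this standard form (these notes do not spell out the exact equation) and the function name `rloo_advantages` is hypothetical.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out advantages for K responses sampled for the same query.

    Each response's baseline is the mean reward of the other K-1 responses.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per query")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of all responses except this one.
    return [r - (total - r) / (k - 1) for r in rewards]
```

Because the baseline comes from sibling samples rather than a learned value function, no critic model is involved; the advantages for one query always sum to zero.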

The actor loss is the PPO-clip loss, used unchanged
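A minimal sketch of the PPO-clip objective in plain Python, written per sample for readability. The clip range `eps=0.2` and the function name are assumptions, and a real implementation would operate on token-level tensors rather than lists.

```python
import math


def ppo_clip_loss(
    logp_new: list[float],
    logp_old: list[float],
    advantages: list[float],
    eps: float = 0.2,  # hypothetical clip range
) -> float:
    """Mean of -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the current policy and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Negate because the clipped surrogate objective is maximized.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

The clipping removes the incentive to push the ratio beyond `1 + eps` when the advantage is positive, or below `1 - eps` when it is negative.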

For the KL divergence loss between the policy and the reference policy, the same method as GRPO is used, and the term is added to the PPO loss
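One way to read "the same method as GRPO" is the unbiased, non-negative KL estimator used in GRPO, $\frac{\pi_{ref}}{\pi_\theta} - \log\frac{\pi_{ref}}{\pi_\theta} - 1$, added to the actor loss with a coefficient. The sketch below assumes that reading; `beta=0.01` and both function names are hypothetical.

```python
import math


def grpo_style_kl(logp_policy: list[float], logp_ref: list[float]) -> float:
    """Mean of exp(d) - d - 1 with d = log pi_ref - log pi_theta (always >= 0)."""
    values = []
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp
        values.append(math.exp(d) - d - 1.0)
    return sum(values) / len(values)


def total_loss(actor_loss: float, kl: float, beta: float = 0.01) -> float:
    """Actor (PPO-clip) loss plus a KL penalty weighted by a hypothetical beta."""
    return actor_loss + beta * kl
```

The estimator is zero when the policy and reference assign the same log-probability and grows as they diverge, so it penalizes drift away from the reference policy.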

Result
