Computer Vision💖
![[Daily] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FciYCfJ%2FbtsNrznq5jt%2FAAAAAAAAAAAAAAAAAAAAACOs_s8tSkNQCOI2rgY0IOhtMmoBwcgaaNqhOKXlg96_%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3DsXCB4MP3Tx%252BYAc%252BhNe4iZPDhcN4%253D)
[Daily] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
This is close to the research I'm working on right now, so I read it fairly carefully. OOD is fun, as always.
TLDR; Web data is collected from the web, so it usually has no explicit domain labels, which makes identifying the optimal pre-training data mixture for domain-specific training a hard problem. The paper proposes a cluster-based framework that derives the optimal data mixture weights -> efficient domain-specific pre-training.
Motivation: The final pre-training phase is said to be important for improving performance on domain-specific tasks. For general/domain-specific tasks, a pre-train..
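A rough sketch of the cluster-then-reweight idea as I read it from the preview: embed unlabeled web documents, cluster them into pseudo-domains, then search for per-cluster mixture weights using some proxy evaluation. The k-means step, the random-search loop, and `evaluate_proxy_model` below are all my own placeholders, not the paper's actual procedure.

```python
# Sketch only: cluster unlabeled documents into pseudo-domains, then search for
# mixture weights over clusters. evaluate_proxy_model is a stand-in for "train a
# small proxy model on data sampled with these weights and score it".
import numpy as np
from sklearn.cluster import KMeans

def cluster_documents(doc_embeddings: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Assign each unlabeled web document to a pseudo-domain cluster."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

def evaluate_proxy_model(mixture_weights: np.ndarray) -> float:
    """Placeholder score: in reality this would train/evaluate a proxy model."""
    target = np.linspace(0.05, 0.3, num=len(mixture_weights))
    return -float(np.abs(mixture_weights - target / target.sum()).sum())

def search_mixture(n_clusters: int = 8, n_trials: int = 50, seed: int = 0) -> np.ndarray:
    """Sample candidate mixture weights and keep the best-scoring one."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_clusters))  # candidate per-cluster weights
        score = evaluate_proxy_model(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w

if __name__ == "__main__":
    docs = np.random.default_rng(0).normal(size=(1000, 64))  # stand-in embeddings
    labels = cluster_documents(docs)                          # pseudo-domain labels
    weights = search_mixture()                                # decoupled toy search
    print("cluster sizes:", np.bincount(labels))
    print("best mixture weights:", np.round(weights, 3))
```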
![[Daily] InternVL3: Exploring Advanced Training and Test-TimeRecipes for Open-Source Multimodal Models](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FnTcce%2FbtsNkNFZCL2%2FAAAAAAAAAAAAAAAAAAAAAPKdPC4sUPGXrK2noBkISbNYOWoP51-MbESlt7FFKDQ-%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3DA02IZwBhvGLpZWl1mFuIMY4eomY%253D)
[Daily] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3 came out today. Honestly, keeping up every single day is exhausting, but checking the daily papers makes it much easier to follow the trends and it helps my research a lot! Staying positive.
TLDR; Multimodal and linguistic capabilities are trained jointly in the pre-training stage; previously, models went through text-only pre-training first and then learned multimodal alignment for visual processing afterward. Proposes variable visual position encoding. Post-training: SFT + mixed preference optimization. Test-time scaling: out of N answers, a ver..
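The preview cuts off mid-word, so the exact test-time scaling recipe isn't shown here; below is only a generic best-of-N sketch, where `generate_answer` and `score_answer` are hypothetical stand-ins for a sampler and a separate scoring model, not InternVL3's actual components.

```python
# Generic best-of-N test-time scaling sketch: sample N candidate answers and
# keep the one a separate scoring model rates highest.
import random
from typing import Callable

def best_of_n(question: str,
              generate_answer: Callable[[str], str],
              score_answer: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n answers and return the highest-scoring one."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda ans: score_answer(question, ans))

if __name__ == "__main__":
    random.seed(0)
    # Toy stand-ins: the "model" guesses a number, the "scorer" prefers 42.
    gen = lambda q: str(random.randint(0, 100))
    score = lambda q, a: -abs(int(a) - 42)
    print(best_of_n("What is the answer?", gen, score))
```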
![[Daily] VideoChat-R1: Enhancing Spatio-TemporalPerception via Reinforcement Fine-Tuning](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FcRrup2%2FbtsNgqkbAuG%2FAAAAAAAAAAAAAAAAAAAAAD94Ux05UnobV1nYGFCAEgkwNchMKf5nuBvTXgCIKiws%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3D3u3TqLqKzABsrH2hFnCpHFJaZfk%253D)
[Daily] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
TLDR; Yet another version that applies GRPO to a VideoLLM, this time aiming to improve spatio-temporal perception. It reportedly raises task-specific performance while preserving the VideoLLM's general capability.
Motivation: Video understanding lacks training/evaluation corpora for reasoning ability and remains underexplored.
Method: 1. GRPO reduces PPO's dependency on a critic model: it generates a group of responses (multiple response candidates) and measures their quality as below. GRPO makes better answers within the group ..
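As a reminder of the group-relative part: instead of a learned critic, GRPO normalizes each response's reward against the statistics of its own sampled group. A minimal sketch with made-up reward values:

```python
# Group-relative advantage at the core of GRPO: reward each response in a
# sampled group, then normalize within the group instead of using a critic.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each response = (reward - group mean) / group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

if __name__ == "__main__":
    group_rewards = np.array([1.0, 0.0, 0.5, 1.0])  # e.g. rule-based scores for 4 sampled answers
    print(group_relative_advantages(group_rewards))
```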
![[Daily] Video-R1: Reinforcing Video Reasoning in MLLMs](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FMxWjz%2FbtsNfQIy7rf%2FAAAAAAAAAAAAAAAAAAAAAGEow9UGbMVQM4Au2UPIza3YN3vJetHOGSUyV1dSOJW7%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3Dd0oAF4GokiQ07xC4OntA5cXnbgM%253D)
[Daily] Video-R1: Reinforcing Video Reasoning in MLLMs
TLDR; Video reasoning using DeepSeek-R1.
Motivation: Existing video reasoning with GRPO has the following drawbacks. Temporal reasoning is crucial for video reasoning, and without it the model tends to take a 'shortcut' and answer from a single frame. There is also no high-quality video reasoning dataset.
Method: Proposes T-GRPO, an extension of GRPO that encourages temporal reasoning, and proposes image-based reasoning data (a dataset for CoT + RL training). T-GRPO (Temporal Group Relative ..
![[Daily] Token-Efficient Long Video Understanding for Multimodal LLMs](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FdAiFly%2FbtsMLrdeRvA%2FAAAAAAAAAAAAAAAAAAAAAOW4pXEGAc0veNrQCLgwpGIZ_39IA272_Ut3jRUGb08S%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3D89xVN2lKND4CPB%252FsF7mClcN7YIA%253D)
[Daily] Token-Efficient Long Video Understanding for Multimodal LLMs
TLDR; Existing video LLMs process each frame independently. This paper introduces a temporal encoder between the image encoder and the LLM -> efficient long video understanding.
Motivation: Existing video LLMs treat a video as a sequence of frame images and process each frame independently through the image encoder and visual-language projector, which causes a heavy computational burden. They also apply naive frame subsampling to reduce the number of tokens fed to the LLM, but this causes information loss or overlapping information across frames.
Method..
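As a toy illustration of "a temporal encoder between the image encoder and the LLM": a simple attention-pooling module that compresses all frame tokens into a fixed, much smaller token budget. This is my own sketch (the Method part of the preview is truncated), not the paper's architecture; the module name and hyperparameters are made up.

```python
# Toy token-compression module: learned query tokens attend over all frame
# tokens and become a fixed-size video representation passed to the LLM.
import torch
import torch.nn as nn

class TemporalTokenPooler(nn.Module):
    def __init__(self, dim: int, n_queries: int = 32, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, n_frames * tokens_per_frame, dim)
        b = frame_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)
        return pooled  # (batch, n_queries, dim): fixed token budget for the LLM

if __name__ == "__main__":
    tokens = torch.randn(2, 64 * 256, 768)  # 64 frames x 256 tokens per frame
    print(TemporalTokenPooler(dim=768)(tokens).shape)  # torch.Size([2, 32, 768])
```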
![[Daily] MM-EUREKA: Exploring Visual Aha Moment with Rule-Based Large-Scale Reinforcement Learning](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FVSEBv%2FbtsMGOMwnxN%2FAAAAAAAAAAAAAAAAAAAAANF87OdtEEV2uU93XtMlC_In_c6jH0x1DDXUhbr7YNW9%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3DNtdPPCtxS8yaIRTHSO22BlFxaVA%253D)
[Daily] MM-EUREKA: Exploring Visual Aha Moment with Rule-Based Large-Scale Reinforcement Learning
TLDR; The first open-source model to reproduce DeepSeek-R1 (rule-based reinforcement learning) in a multimodal setting; proposes the multimodal reasoning model 'MM-Eureka'.
Motivation: There have been many attempts to reproduce DeepSeek-R1 in a multimodal setting, but most are closed-source or fail to reproduce the 'aha moment', which here means re-checking or verifying the image in the middle of reasoning.
Method: Basic setting: uses InternVL2.5 (8B, 32B) together with DeepSeek-R1's rule-based reward. Data cleaning..
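For reference, a DeepSeek-R1-style rule-based reward typically combines a format check with an exact-match accuracy check. The sketch below is illustrative only; the tag names and weights are my assumptions, not MM-Eureka's actual reward definition.

```python
# Illustrative rule-based reward: a format check for <think>/<answer> tags
# plus an exact-match accuracy check on the extracted answer.
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    accuracy_ok = answer == ground_truth.strip()
    # Illustrative weighting: correctness dominates, formatting gives a small bonus.
    return 1.0 * accuracy_ok + 0.1 * format_ok

if __name__ == "__main__":
    resp = "<think>The shaded region is half of 8.</think><answer>4</answer>"
    print(rule_based_reward(resp, "4"))  # 1.1
```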
![[TIL] Video Diffusion Model과 시뮬레이터](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2Fcoajsm%2FbtsJE0CC0dh%2FAAAAAAAAAAAAAAAAAAAAAEJ7H4LfWEk4st7M8YWRcZFcagUAB0tFjbh6ljsIckUV%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3DxzgscgdTmICFJkAHkrM7zHKoB98%253D)
[TIL] Video Diffusion Models and Simulators
Today's seminar topic: can video diffusion models serve as simulators that capture real-world dynamics? ☑️ Learning Interactive Real-World Simulators (Jan 2024) - ICLR24 Outstanding Paper. With a good world simulator, humans would be able to interact much more with diverse scenes. "We explore the possibility of learning a universal simulator of real-world interaction through generative modeling." In this paper, an action-in-video-out con..
![[TIL] Long Video Understanding](https://img1.daumcdn.net/thumb/R750x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2FbQe3A5%2FbtsJsmevcnR%2FAAAAAAAAAAAAAAAAAAAAAF-Mk1OS7IsHi7lh3H0dNJ_WqJpbjD3OhU6yZuVXeboo%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1759244399%26allow_ip%3D%26allow_referer%3D%26signature%3DKUc2O4YcLHWgCXLJO4IwrF0FS54%253D)
[TIL] Long Video Understanding
Recent Trends in Long Video Understanding. Contents: LLM context length; compressing visual tokens with streaming models. ☑️ Long Context Transfer from Language to Vision (Jul 2024). Background: Video data is too dense and lengthy (it is usually handled by sampling 8/32 frames), and there is still no high-quality long-video pretraining dataset (a community-level problem). Idea: Train on longer text data to extend the context length, then use the context-extended LM to lo..