[TIL] Long Video Understanding

Background
- Video data is both dense and long, so models usually work from a sparse sample of frames (typically 8 or 32 per video).
- No high-quality long-video pretraining dataset exists yet; this is an open problem for the whole community.
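
To make the 8/32-frame sampling concrete, here is a minimal sketch of uniform frame sampling. The function name `sample_frame_indices` and the choice to take the centre of each segment are assumptions for illustration, not something the notes specify.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread uniformly over a video.

    Hypothetical helper: the video is split into num_samples equal-width
    segments and the centre frame of each segment is returned.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # Short video: every frame is kept.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(sample_frame_indices(80, 8))
```

For a 5-minute video at 30 fps (9000 frames), 32 samples leave about 281 frames between neighbours, roughly 9 seconds of video that the model never sees. That gap is why sparse sampling is a poor fit for long videos.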
Idea
- Train on longer text data to extend the LM's context length.
- With the context-extended LM, modality alignment becomes possible even without long video-text pairs.
Method: Unified encoding for image and video
- Train the LLM's multimodal capability through alignment: high-resolution images and long videos are treated as similar domains.
- The goal was to mimic having long video sequences, so the key question was how to learn the alignment.
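
One way to read "similar domains" is that a high-resolution image cut into tiles and a video cut into frames both become a sequence of equally sized units. The sketch below illustrates that reading only. The tile size of 336, the 144 tokens per unit, and all function names are hypothetical values chosen for the example.

```python
Box = tuple[int, int, int, int]  # (left, top, right, bottom)


def image_to_tiles(width: int, height: int, tile: int = 336) -> list[Box]:
    """Cover an image with tile-sized crop boxes in row-major order.

    Edge tiles are clipped to the image boundary.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append(
                (left, top, min(left + tile, width), min(top + tile, height))
            )
    return boxes


def video_to_units(num_frames: int, width: int, height: int) -> list[Box]:
    """Treat every frame as one unit covering the full frame."""
    return [(0, 0, width, height) for _ in range(num_frames)]


def sequence_length(units: list[Box], tokens_per_unit: int = 144) -> int:
    """Number of visual tokens if every unit is encoded to a fixed size."""
    return len(units) * tokens_per_unit


if __name__ == "__main__":
    image_units = image_to_tiles(672, 672)
    video_units = video_to_units(4, 336, 336)
    print(sequence_length(image_units), sequence_length(video_units))
```

Under these assumptions a 672x672 image yields 4 tiles, the same number of units as a 4-frame clip, so both produce token sequences of equal length. Training on large images can then stand in for training on longer frame sequences.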

Motivation
- Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity.
Idea
- Process video in the streaming setting
- Propose a Memory-Consolidated ViT.
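
The sketch below shows why streaming with a small memory avoids the quadratic cost. It only counts query-key pairs and does not run a real model. The notes do not say how the memory is consolidated, so random subsampling is used here purely as a stand-in, and all names are hypothetical.

```python
import random


def full_attention_pairs(num_segments: int, tokens_per_segment: int) -> int:
    """Query-key pairs when all tokens of the video attend to each other."""
    n = num_segments * tokens_per_segment
    return n * n


def consolidate(tokens: list, k: int, rng: random.Random) -> list:
    """Stand-in consolidation: keep a random subset of at most k tokens."""
    if len(tokens) <= k:
        return list(tokens)
    return rng.sample(tokens, k)


def stream(segments: list[list], k_per_segment: int, seed: int = 0):
    """Process segments in order, attending to the memory plus the segment.

    Returns the final memory and the total number of query-key pairs.
    """
    rng = random.Random(seed)
    memory: list = []
    pairs = 0
    for segment in segments:
        # Queries come from the current segment only; keys come from the
        # consolidated memory of past segments plus the current segment.
        pairs += len(segment) * (len(memory) + len(segment))
        memory.extend(consolidate(segment, k_per_segment, rng))
    return memory, pairs


if __name__ == "__main__":
    segments = [[(s, t) for t in range(100)] for s in range(10)]
    memory, pairs = stream(segments, k_per_segment=10)
    print(len(memory), pairs, full_attention_pairs(10, 100))
```

With 10 segments of 100 tokens and 10 memory tokens kept per segment, the streaming pass needs 145,000 query-key pairs against 1,000,000 for full attention, and the memory grows by only 10 tokens per segment.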

