[Daily] Token-Efficient Long Video Understanding for Multimodal LLMs — 다은이의 컴퓨터 공부

TLDR;

Existing video LLMs process each frame independently
Proposes a temporal encoder between the image encoder and the LLM -> efficient long video understanding

Motivation

Existing video LLMs treat a video as a sequence of frame images, and process each frame independently in the image encoder and the visual-language projector
- However, this causes a heavy computational burden
In addition, naive frame subsampling is applied to reduce the number of tokens fed to the LLM
- However, this causes information loss, or leaves overlapping information across the sampled frames

Method

Mamba-based temporal projector + token compression techniques
- A temporal encoder placed between the image encoder and the LLM captures temporal dynamics earlier in the pipeline (temporal information is injected directly into the visual tokens)

Mamba-based Temporal Projector

Integrates temporal information across video frames
The L Mamba layers fuse temporal information into the visual tokens
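
To make "fusing temporal information into the visual tokens" concrete, here is a minimal sketch. It is not Mamba itself: it swaps in a plain, non-selective linear recurrence over the time axis, which is the simplest relative of a state-space scan. The function name `temporal_scan`, the `decay` parameter, and the `(T, N, D)` token layout are all assumptions made for illustration, not details from the paper.

```python
import numpy as np

def temporal_scan(tokens, decay=0.9):
    """Causal linear recurrence over the time axis (simplified stand-in for a Mamba layer).

    tokens: array of shape (T, N, D) = (frames, tokens per frame, channels).
    Returns an array of the same shape in which the tokens of frame t
    are an exponentially decayed summary of frames 0..t.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    out = np.empty_like(tokens)
    state = np.zeros_like(tokens[0])  # hidden state, shape (N, D)
    for t in range(tokens.shape[0]):
        # h_t = decay * h_{t-1} + (1 - decay) * x_t  (hypothetical fixed-gain update)
        state = decay * state + (1.0 - decay) * tokens[t]
        out[t] = state
    return out

# A single "event" in frame 0 leaks into the tokens of later frames.
x = np.zeros((3, 2, 4))
x[0] = 1.0
y = temporal_scan(x, decay=0.5)
print(y[:, 0, 0])
```

The point of the sketch is only that, after the scan, each frame's tokens already carry information from earlier frames, so later compression along time discards less. A real Mamba layer makes the recurrence input-dependent (selective) and learns its parameters.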

Training-time Token Compression

Handling every frame is expensive and the LLM's token length is limited, so token compression is important for long-video processing
1) Temporal pooling: average the temporal projector's output over every k consecutive frames

2) Spatial pooling: pool the tokens within each frame according to a spatial pooling ratio
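
The two pooling steps above can be sketched with NumPy reshapes. This is a sketch under stated assumptions: the tokens are laid out as `(T, H, W, D)` (frames, token-grid height, token-grid width, channels), both poolings are plain averages over non-overlapping windows, and the names `temporal_pool`, `spatial_pool`, `k`, and `r` are illustrative choices rather than the paper's code.

```python
import numpy as np

def temporal_pool(tokens, k):
    """Average every k consecutive frames. tokens: (T, H, W, D) -> (T // k, H, W, D)."""
    T, H, W, D = tokens.shape
    if T % k != 0:
        raise ValueError("this sketch assumes T is divisible by k")
    return tokens.reshape(T // k, k, H, W, D).mean(axis=1)

def spatial_pool(tokens, r):
    """Average non-overlapping r x r windows of each frame's token grid.

    tokens: (T, H, W, D) -> (T, H // r, W // r, D).
    """
    T, H, W, D = tokens.shape
    if H % r != 0 or W % r != 0:
        raise ValueError("this sketch assumes H and W are divisible by r")
    return tokens.reshape(T, H // r, r, W // r, r, D).mean(axis=(2, 4))

# 8 frames with a 4x4 token grid: 8 * 16 = 128 tokens before pooling.
x = np.arange(8 * 4 * 4 * 2, dtype=np.float64).reshape(8, 4, 4, 2)
pooled = spatial_pool(temporal_pool(x, k=4), r=2)
print(pooled.shape)  # (2, 2, 2, 2): 2 * 4 = 8 tokens after pooling
```

With these assumptions the token count drops by a factor of k * r^2, here 4 * 2^2 = 16.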

Test-time (Training-free) temporal token sampling

At test time, subsample the visual tokens along the temporal dimension
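
A minimal sketch of this step, assuming uniform sampling along time. The note above only says that visual tokens are subsampled along the temporal dimension at test time, so the uniform strategy, the name `temporal_sample`, and the `(T, N, D)` layout are assumptions for illustration.

```python
import numpy as np

def temporal_sample(tokens, num_keep):
    """Keep num_keep evenly spaced time steps. tokens: (T, N, D) -> (num_keep, N, D)."""
    T = tokens.shape[0]
    if not 1 <= num_keep <= T:
        raise ValueError("num_keep must be between 1 and T")
    # Evenly spaced positions from the first to the last time step, rounded to indices.
    idx = np.linspace(0, T - 1, num_keep).round().astype(int)
    return tokens[idx]

# Label each time step with its own index so the kept steps are visible.
x = np.arange(8).reshape(8, 1, 1)
print(temporal_sample(x, 4)[:, 0, 0])  # [0 2 5 7]
```

Because the tokens have already passed through the temporal projector, each kept time step still carries information from its neighbors, which is what makes this kind of training-free subsampling less lossy than dropping raw frames.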

Result

Evaluated mainly on long video understanding tasks
