no code implementations • 8 Mar 2025 • Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
However, we discover that a few attention heads in frozen LVLMs demonstrate strong visual grounding capabilities.
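The claim above suggests that a single head's attention from a referring text token over the image patches can already act as a localization map. Below is a minimal PyTorch sketch of that idea; the head index, token index, tensor shapes, and patch-grid size are hypothetical stand-ins for illustration, not values from the paper.

```python
import torch

# All names, shapes, and indices below are hypothetical illustrations,
# not values taken from the paper.
num_heads, num_text, num_patches = 32, 8, 576     # e.g. a 24x24 patch grid
attn = torch.rand(num_heads, num_text, num_patches)
attn = attn / attn.sum(dim=-1, keepdim=True)      # rows sum to 1, like softmax output

head_idx = 14    # a head assumed to exhibit grounding behavior
query_idx = 3    # the text token referring to the target object
heatmap = attn[head_idx, query_idx].reshape(24, 24)  # attention as a spatial map

# The peak of the map gives a coarse localization of the referred object.
y, x = divmod(int(heatmap.argmax()), heatmap.shape[1])
print(f"peak attention at patch ({y}, {x})")
```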
no code implementations • 5 Mar 2025 • Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang
Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder.
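As a concrete illustration of how text tokens "see" visual tokens, here is a self-contained PyTorch sketch of scaled dot-product attention between text-token queries and visual-token keys and values; the single-head setup and all dimensions are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

# Single-head toy example; real LMMs use many heads and many decoder layers.
d, num_text, num_vis = 64, 4, 9
torch.manual_seed(0)
text_q = torch.randn(num_text, d)   # queries from text tokens
vis_k = torch.randn(num_vis, d)     # keys from visual (image patch) tokens
vis_v = torch.randn(num_vis, d)     # values from visual tokens

# Scaled dot-product attention: each text token distributes probability
# mass over image patches, pulling visual information into the text stream.
weights = F.softmax(text_q @ vis_k.T / d ** 0.5, dim=-1)  # (num_text, num_vis)
attended = weights @ vis_v                                 # per-text-token visual summary
print(weights.sum(dim=-1))  # each row sums to 1
```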