Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

14 Nov 2023 · Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a unified vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.
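
The idea of representing an image or video with a small, variable set of merged visual tokens at several scales can be illustrated with a toy sketch. The abstract does not specify how tokens are merged, so the greedy cosine-similarity merging below is an assumption chosen for brevity, and `cosine`, `merge_tokens`, and `multi_scale` are hypothetical helper names, not functions from the Chat-UniVi code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, target):
    """Greedily merge the two most similar clusters until `target` remain.

    Assumption: a simple stand-in for a token-merging step, not the
    procedure used in the paper.
    """
    # Each cluster stores (sum of member vectors, member count), so the
    # final token is the mean of every original token in the cluster.
    clusters = [(list(t), 1) for t in tokens]
    while len(clusters) > max(target, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cosine is scale-invariant, so sums can be compared directly.
                s = cosine(clusters[i][0], clusters[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = (
            [a + b for a, b in zip(clusters[i][0], clusters[j][0])],
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [[v / n for v in total] for total, n in clusters]


def multi_scale(tokens, scales):
    """Concatenate merged token sets, one per requested scale."""
    out = []
    for k in scales:
        out.extend(merge_tokens(tokens, k))
    return out


# Four toy 2-D "patch" tokens forming two obvious groups.
patch_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coarse = merge_tokens(patch_tokens, 2)
pyramid = multi_scale(patch_tokens, [4, 2, 1])
```

Here `coarse` holds two tokens, one per group, and `pyramid` stacks the fine, medium, and coarse sets, so a downstream model would see both detailed and summarized tokens within a fixed budget.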

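| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi-13B | Confidence Score | 3.6 | #1 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi-13B | Accuracy | 46.4 | #4 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi | Confidence Score | 3.2 | #6 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi | Accuracy | 45.8 | #7 |
| Video Question Answering | ActivityNet-QA | Chat-UniVi-13B | Accuracy | 46.4 | #4 |
| Video Question Answering | ActivityNet-QA | Chat-UniVi-13B | Confidence Score | 3.3 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | Conversation | 84.1 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | Detail description | 79.4 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | Complex reasoning | 94.7 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | All | 86.1 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | Conversation | 84.1 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | Detail description | 74.2 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | Complex reasoning | 93.7 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | All | 84.2 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | Conversation | 70.3 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | Detail description | 56.6 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | Complex reasoning | 83.3 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | All | 70.1 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | Conversation | 83.1 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | Detail description | 75.3 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | Complex reasoning | 96.5 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | All | 85.1 | #2 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Chat-UniVi-7B | Accuracy | 54.6 | #7 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Chat-UniVi-7B | Confidence Score | 3.1 | #7 |
| Zero-Shot Video Question Answer | MSVD-QA | Chat-UniVi-7B | Accuracy | 65 | #8 |
| Zero-Shot Video Question Answer | MSVD-QA | Chat-UniVi-7B | Confidence Score | 3.6 | #5 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Natural Science | 90.41 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Social Science | 95.05 | #2 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Language Science | 88.91 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Text Context | 89.64 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Image Context | 88.05 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | No Context | 90.94 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Grades 1-6 | 91.19 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Grades 7-12 | 90.64 | #2 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Avg. Accuracy | 90.99 | #4 |
| Zero-Shot Video Question Answer | TGIF-QA | Chat-UniVi-7B | Accuracy | 60.3 | #2 |
| Zero-Shot Video Question Answer | TGIF-QA | Chat-UniVi-7B | Confidence Score | 3.4 | #2 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Chat-UniVi | gpt-score | 2.91 | #3 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Correctness of Information | 2.89 | #4 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Detail Orientation | 2.91 | #4 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Contextual Understanding | 3.46 | #4 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Temporal Understanding | 2.89 | #1 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Consistency | 2.81 | #1 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | mean | 2.99 | #1 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Chat-UniVi | gpt-score | 2.89 | #1 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Chat-UniVi | gpt-score | 2.81 | #1 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Chat-UniVi | gpt-score | 2.89 | #2 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Chat-UniVi | gpt-score | 3.46 | #2 |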
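
As a quick sanity check on the VideoInstruct rows, the short sketch below recomputes the "mean" entry from the five per-aspect gpt-scores listed for Chat-UniVi. It assumes that "mean" is the plain arithmetic average of those five values, which the listing does not state explicitly. The variable and function names are illustrative only.

```python
# gpt-scores copied from the VideoInstruct rows for Chat-UniVi.
scores = {
    "Correctness of Information": 2.89,
    "Detail Orientation": 2.91,
    "Contextual Understanding": 3.46,
    "Temporal Understanding": 2.89,
    "Consistency": 2.81,
}


def mean_score(values):
    """Arithmetic mean of the per-aspect scores."""
    values = list(values)
    return sum(values) / len(values)


video_mean = mean_score(scores.values())
print(f"{video_mean:.2f}")
```

Under that assumption the average comes to 2.992, which rounds to the 2.99 reported in the "mean" row.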
