CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

8 Feb 2024  ·  Shoubin Yu, Jaehong Yoon, Mohit Bansal ·

Despite impressive advancements in multimodal compositional reasoning approaches, they are still limited in their flexibility and efficiency by processing fixed modality inputs while updating a lot of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio) from given videos without extra human annotation by leveraging existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities. We validate our method on video-3D, video-audio, and video-language reasoning tasks and achieve better/equivalent performance against strong multimodal LLMs, including BLIP-2, 3D-LLM, and SeViLA while using 96% fewer trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.

PDF Abstract

Results from the Paper

Task Dataset Model Metric Name Metric Value Global Rank Benchmark
Video Question Answering NExT-QA CREMA Accuracy 73.5 # 5
Question Answering SQA3D CREMA AnswerExactMatch (Question Answering) 53.0 # 1


No methods listed for this paper. Add relevant methods here