Multimodal Interaction
33 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
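ViLT's central design choice is to drop the CNN backbone and region detector entirely: the image is embedded as linearly projected patches and processed together with the text tokens by a single shared transformer. Below is a minimal sketch of that convolution-free pipeline; all layer sizes, names, and depths are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of ViLT-style joint encoding: linear patch projection
# (no CNN, no region proposals), token-type embeddings to distinguish
# modalities, and one transformer over the concatenated sequence.
class TinyViLT(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, patch=32, img=224, max_len=40):
        super().__init__()
        n_patch = (img // patch) ** 2
        self.patch = patch
        self.patch_embed = nn.Linear(3 * patch * patch, dim)  # linear, not convolutional
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.type_embed = nn.Embedding(2, dim)                # 0 = text, 1 = image
        self.pos_text = nn.Parameter(torch.zeros(1, max_len, dim))
        self.pos_img = nn.Parameter(torch.zeros(1, n_patch, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens, image):
        # tokens: (B, L) token ids; image: (B, 3, 224, 224)
        B = tokens.size(0)
        # cut the image into non-overlapping patches and flatten each one
        p = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * self.patch * self.patch)
        img_tok = self.patch_embed(p) + self.pos_img + self.type_embed.weight[1]
        txt_tok = self.tok_embed(tokens) + self.pos_text[:, :tokens.size(1)] + self.type_embed.weight[0]
        # a single transformer handles both intra- and cross-modal interaction
        return self.encoder(torch.cat([txt_tok, img_tok], dim=1))
```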
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks
Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs.
Recurrent Multimodal Interaction for Referring Image Segmentation
In this paper, we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions.
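The paper's key idea is to fuse the expression with visual features recurrently, word by word, rather than encoding the whole sentence first. The sketch below only illustrates that recurrence with globally pooled image features; the actual model uses a convolutional multimodal LSTM over the spatial feature map and predicts a per-pixel mask, so treat every name and size here as an assumption.

```python
import torch
import torch.nn as nn

# Simplified sketch of recurrent multimodal interaction: each word
# embedding is fused with (pooled) visual features inside the LSTM
# recurrence, so segmentation evidence accumulates as the sentence
# is read. Illustrative only; the paper operates on spatial maps.
class RecurrentFusion(nn.Module):
    def __init__(self, word_dim=300, vis_dim=256, hidden=256):
        super().__init__()
        self.cell = nn.LSTMCell(word_dim + vis_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, word_embs, vis_feat):
        # word_embs: (B, T, word_dim); vis_feat: (B, vis_dim) pooled image features
        B = word_embs.size(0)
        h = word_embs.new_zeros(B, self.cell.hidden_size)
        c = word_embs.new_zeros(B, self.cell.hidden_size)
        for t in range(word_embs.size(1)):
            h, c = self.cell(torch.cat([word_embs[:, t], vis_feat], dim=-1), (h, c))
        return self.score(h)  # a relevance score here; the paper outputs a mask
```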
Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
To tackle the first issue, we propose a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations.
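The interaction module described here amounts to cross-modal attention run in both directions: words attend over visual regions, and regions attend over words. A minimal sketch of that bidirectional block follows, with illustrative dimensions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Sketch of a bidirectional cross-modal attention module: queries from
# one modality attend over keys/values from the other, yielding
# image-aware word representations and word-aware visual representations.
class CrossModalInteraction(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, regions):
        # words: (B, L, dim) token features; regions: (B, R, dim) visual features
        img_aware_words, _ = self.txt2img(query=words, key=regions, value=regions)
        word_aware_regions, _ = self.img2txt(query=regions, key=words, value=words)
        return img_aware_words, word_aware_regions
```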
Dynamic Modality Interaction Modeling for Image-Text Retrieval
To address these issues, we develop a novel modality interaction modeling network based upon the routing mechanism, which is the first unified and dynamic multimodal interaction framework towards image-text retrieval.
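A routing mechanism in this setting means the interaction path is chosen per input rather than fixed: a small router predicts a weight for each candidate interaction operation and mixes their outputs. The sketch below uses two stand-in cells (intra-modal self-attention and cross-modal attention) with soft gating; the cells, router, and sizes are simplified assumptions, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of input-dependent soft routing over interaction
# "cells": the router scores each candidate operation per sample, and
# the cell outputs are combined by those weights.
class RoutedInteraction(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # intra-modal cell
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # inter-modal cell
        self.router = nn.Linear(dim, 2)  # one logit per cell

    def forward(self, x, other):
        # x: (B, N, dim) one modality; other: (B, M, dim) the other modality
        gate = F.softmax(self.router(x.mean(dim=1)), dim=-1)  # (B, 2) path weights
        intra, _ = self.self_attn(x, x, x)
        inter, _ = self.cross_attn(x, other, other)
        return gate[:, 0:1, None] * intra + gate[:, 1:2, None] * inter
```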
Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering
Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA.
ML-PersRef: A Machine Learning-based Personalized Multimodal Fusion Approach for Referencing Outside Objects From a Moving Vehicle
This allows for novel approaches to interaction with the vehicle that go beyond traditional touch-based and voice command approaches, such as emotion recognition, head rotation, eye gaze, and pointing gestures.
Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering
With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations.
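One simple way to realize question-guided visual representations at several temporal scales is to pool the frame features to coarser resolutions and let the question attend over each scale separately. The sketch below shows that pattern; the pooling scheme, module names, and sizes are assumptions for illustration, not the paper's RMI design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of multiscale question-guided attention: video
# features are average-pooled to coarser temporal scales, and the
# question representation attends over each scale in turn.
class MultiscaleQuestionGuide(nn.Module):
    def __init__(self, dim=256, heads=8, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, question):
        # video: (B, T, dim) frame features; question: (B, dim) sentence embedding
        q = question.unsqueeze(1)  # (B, 1, dim) query
        outs = []
        for s in self.scales:
            v = video if s == 1 else F.avg_pool1d(
                video.transpose(1, 2), kernel_size=s, stride=s).transpose(1, 2)
            guided, _ = self.attn(q, v, v)  # question attends over this scale
            outs.append(guided.squeeze(1))
        return torch.stack(outs, dim=1)  # (B, n_scales, dim) multilevel representations
```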
Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions
To overcome the aforementioned issues, we propose Multimodal Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP) problem, concentrating on the mutual information between input modalities to explicitly model cross-modal relations.
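A standard way to maximize mutual information between paired modalities is an InfoNCE-style contrastive loss over in-batch pairs: matched representations are pulled together and mismatched ones pushed apart. The function below is a generic sketch of that objective, not the paper's exact loss; the temperature and symmetric formulation are common defaults.

```python
import torch
import torch.nn.functional as F

# Generic InfoNCE-style cross-modal contrastive loss: row i of each
# embedding matrix encodes the same item, all other rows are negatives.
def cross_modal_infonce(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: (B, dim) paired modality embeddings
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # symmetric loss: text-to-image and image-to-text directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```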
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace.