Multi-modal Classification
9 papers with code • 2 benchmarks • 3 datasets
Most implemented papers
What Makes Training Multi-Modal Classification Networks Hard?
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.
Image and Encoded Text Fusion for Multi-Modal Classification
Standard Convolutional Neural Networks (CNNs) are then employed to learn feature representations of the resulting fused images for the classification task.
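The fusion idea above can be sketched at the data level: one simple way (an assumption for illustration, not necessarily the paper's exact encoding) is to pack the text bytes into a 2D grid and stack it as an extra channel next to the RGB image, so a standard CNN can consume both modalities as one tensor.

```python
import numpy as np

def encode_text_as_channel(text, size=32):
    """Hypothetical sketch: pack UTF-8 text bytes into a size x size grid."""
    buf = np.zeros(size * size, dtype=np.uint8)
    data = text.encode("utf-8")[: size * size]  # truncate to fit the grid
    buf[: len(data)] = list(data)
    return buf.reshape(size, size)

def fuse(image, text, size=32):
    """Stack the encoded-text grid as a fourth channel alongside RGB."""
    channel = encode_text_as_channel(text, size)
    return np.concatenate([image, channel[..., None]], axis=-1)

img = np.zeros((32, 32, 3), dtype=np.uint8)   # dummy RGB image
fused = fuse(img, "a product title")
# fused has shape (32, 32, 4): three image channels plus one text channel
```

The fused tensor can then be fed to any off-the-shelf image classifier without architectural changes, which is the appeal of encoding text into the image domain.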
Look, Read and Enrich. Learning from Scientific Figures and their Captions
Compared to natural images, understanding scientific figures is particularly hard for machines.
Multi-modal Sarcasm Detection and Humor Classification in Code-mixed Conversations
In this work, we make two major contributions considering the above limitations: (1) we develop a Hindi-English code-mixed dataset, MaSaC, for multi-modal sarcasm detection and humor classification in conversational dialog, which to our knowledge is the first dataset of its kind; (2) we propose MSH-COMICS, a novel attention-rich neural architecture for utterance classification.
Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification
To the best of our knowledge, this is the first work to jointly model both feature and modality variation for different samples to provide trustworthy fusion in multi-modal classification.
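The general idea behind trustworthy dynamical fusion (a minimal sketch of sample-adaptive weighting, not the paper's exact method) is to weight each modality per sample by how confident its prediction is, e.g. via the negative entropy of its class distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_fusion(logits_per_modality):
    """Fuse per-modality predictions, weighted by per-sample confidence.

    Confidence is measured here as negative entropy of each modality's
    softmax distribution (an illustrative choice, not the paper's exact one).
    """
    confidences = []
    for logits in logits_per_modality:
        p = softmax(logits)
        entropy = -(p * np.log(p + 1e-12)).sum()
        confidences.append(-entropy)  # lower entropy -> higher confidence
    weights = softmax(np.array(confidences))
    fused = sum(w * softmax(l) for w, l in zip(weights, logits_per_modality))
    return fused, weights

# A sharply peaked modality should dominate a near-uniform one:
fused, weights = dynamic_fusion([np.array([5.0, 0.0, 0.0]),
                                 np.array([0.1, 0.0, 0.0])])
```

Because the weights are recomputed per sample, an uninformative modality (e.g. a blurry image) is down-weighted only for the samples where it is actually unreliable.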
On Modality Bias Recognition and Reduction
From the results on four datasets across the above three tasks, our method yields remarkable performance improvements compared with the baselines, demonstrating its superiority in reducing the modality bias problem.
UAVM: Towards Unifying Audio and Visual Models
Conventional audio-visual models have independent audio and video branches.
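A unified model instead shares most parameters across modalities. As a minimal sketch (hypothetical shapes and layers, not UAVM's actual architecture): small modality-specific front-ends project audio and video features into one shared space, and a single shared head classifies both.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # width of the shared embedding space

# Modality-specific front-ends (hypothetical input dims for illustration).
W_audio = rng.normal(size=(128, D))   # e.g. spectrogram features -> shared space
W_video = rng.normal(size=(512, D))   # e.g. frame features -> shared space
W_shared = rng.normal(size=(D, 10))   # one shared classifier head, 10 classes

def classify(features, W_modality):
    shared = np.tanh(features @ W_modality)  # embed into the shared space
    return shared @ W_shared                 # same head, regardless of modality

audio_logits = classify(rng.normal(size=128), W_audio)
video_logits = classify(rng.normal(size=512), W_video)
```

Only the thin projections differ per modality; everything after them is shared, which is the opposite of maintaining two independent branches.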
Contrastive Audio-Visual Masked Autoencoder
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.
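The MAE-style pretraining step underlying this line of work can be sketched as follows (a generic random-masking sketch over token sequences, with dummy data; the paper's actual masking and reconstruction details differ): the encoder only sees a small random subset of audio and visual patches, and the rest must be reconstructed.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of tokens, as in MAE-style pretraining (sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = max(1, int(n * (1 - mask_ratio)))
    keep = np.sort(rng.permutation(n)[:n_keep])  # indices of visible tokens
    return tokens[keep], keep

audio_tokens = np.arange(64, dtype=float).reshape(64, 1)  # 64 dummy patches
visible, idx = random_mask(audio_tokens)
# at a 75% mask ratio, 16 of the 64 tokens remain visible
```

Extending this to audio-visual multi-modality amounts to masking both token streams and adding a contrastive objective across the two encoders' outputs.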
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning.