However, one issue that often arises in MTL is that convergence speed varies across tasks due to differences in task difficulty, so it can be challenging to achieve the best performance on all tasks simultaneously with a single model checkpoint.
Recent vision-language understanding approaches adopt a multi-modal transformer pre-training and fine-tuning paradigm.
ForeSeer transfers reviews from similar products on a large product graph and exploits these reviews to predict aspects that might emerge in future reviews.
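For illustration only, the following is a minimal sketch of the general idea of transferring reviews across a product graph; the toy graph, similarity scores, and pre-extracted aspect keywords are hypothetical placeholders, not ForeSeer's actual implementation.

```python
# Hypothetical sketch: score candidate aspects for a product by
# similarity-weighted votes from reviews of neighboring products.
# Graph, similarities, and aspect extraction are illustrative placeholders.
from collections import Counter

# toy product graph: product id -> list of (similar product id, similarity)
product_graph = {
    "p1": [("p2", 0.9), ("p3", 0.6)],
}

# aspect keywords already observed in neighbors' reviews
product_aspects = {
    "p2": ["battery life", "screen"],
    "p3": ["battery life", "price"],
}

def predict_future_aspects(product_id, top_k=2):
    """Aggregate neighbors' review aspects, weighted by graph similarity."""
    votes = Counter()
    for neighbor, sim in product_graph.get(product_id, []):
        for aspect in product_aspects.get(neighbor, []):
            votes[aspect] += sim
    return [aspect for aspect, _ in votes.most_common(top_k)]

print(predict_future_aspects("p1"))  # e.g. ['battery life', 'screen']
```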
Model pre-training on large text corpora has proven effective for various downstream applications in the NLP domain.
Hence, we advocate that the key to better performance lies in meaningful latent modality structures rather than perfect modality alignment.
The effectiveness of our framework comes from stage-wise fine-tuning of the BERT model, first with heterogeneous graph information and then with a GNN model.
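A rough sketch of such stage-wise training is below; it assumes the graph information enters stage one as serialized neighbor text and uses a toy mean-aggregation GNN layer, both of which are our assumptions rather than the paper's stated setup.

```python
# Minimal two-stage sketch: (1) fine-tune BERT on text augmented with
# graph-derived context, (2) freeze BERT and train a GNN over its node
# embeddings. Data, model names, and the GNN layer are placeholders.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# ----- Stage 1: fine-tune BERT with heterogeneous graph information -----
node_text = ["paper about GNNs [SEP] cited by: survey on transformers"]
labels = torch.tensor([1])
classifier = nn.Linear(bert.config.hidden_size, 2)
opt = torch.optim.AdamW(list(bert.parameters()) + list(classifier.parameters()), lr=2e-5)

enc = tokenizer(node_text, return_tensors="pt", padding=True, truncation=True)
cls = bert(**enc).last_hidden_state[:, 0]            # [CLS] embedding per node
loss = nn.functional.cross_entropy(classifier(cls), labels)
loss.backward()
opt.step()
opt.zero_grad()

# ----- Stage 2: freeze BERT, train a simple GNN over its node embeddings -----
class MeanGNNLayer(nn.Module):
    """One message-passing step: average neighbor features, then transform."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
    def forward(self, x, adj):                        # adj: dense (N, N) adjacency
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

with torch.no_grad():
    node_feats = bert(**enc).last_hidden_state[:, 0]  # frozen BERT features
gnn = MeanGNNLayer(bert.config.hidden_size)
gnn_head = nn.Linear(bert.config.hidden_size, 2)
gnn_opt = torch.optim.AdamW(list(gnn.parameters()) + list(gnn_head.parameters()), lr=1e-3)

adj = torch.eye(node_feats.size(0))                   # toy single-node graph
logits = gnn_head(gnn(node_feats, adj))
nn.functional.cross_entropy(logits, labels).backward()
gnn_opt.step()
```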
In this paper, we propose an improvement to prompt-based fine-tuning that addresses these two issues.
Aligning signals from different modalities is an important step in vision-language representation learning as it affects the performance of later stages such as cross-modality fusion.
Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning.
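A minimal sketch of this kind of objective follows, combining a symmetric InfoNCE-style cross-modal alignment loss with intra-modal contrastive terms over augmented views; it illustrates the general recipe, not TCL's exact formulation (which additionally uses components such as momentum encoders).

```python
# Hedged sketch: symmetric cross-modal alignment (CMA) loss plus intra-modal
# contrastive terms over augmented views. Projection heads and augmentations
# are assumed to happen upstream; shapes and weights are illustrative.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """InfoNCE with in-batch negatives: a[i] should match b[i]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

def joint_contrastive_loss(img, txt, img_aug, txt_aug):
    # cross-modal alignment: image <-> paired caption, in both directions
    cma = 0.5 * (info_nce(img, txt) + info_nce(txt, img))
    # intra-modal terms: each sample <-> its own augmented view
    intra = 0.5 * (info_nce(img, img_aug) + info_nce(txt, txt_aug))
    return cma + intra

# toy usage with random projected embeddings
img, txt = torch.randn(8, 256), torch.randn(8, 256)
loss = joint_contrastive_loss(img, txt,
                              img + 0.01 * torch.randn_like(img),
                              txt + 0.01 * torch.randn_like(txt))
print(loss.item())
```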
To achieve this, we propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT.
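The sketch below shows the general combination of token pruning (width-wise) and early exiting (depth-wise) in a toy transformer encoder; the token-scoring rule, confidence-based exit criterion, and layer sizes are illustrative assumptions, not Magic Pyramid's actual design.

```python
# Toy encoder that drops low-scoring tokens after each layer (width-wise)
# and returns early once an intermediate classifier is confident (depth-wise).
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, dim=64, layers=6, num_classes=2,
                 keep_ratio=0.7, exit_threshold=0.95):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(layers))
        # one lightweight classifier per layer enables early exiting
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(layers))
        self.keep_ratio = keep_ratio
        self.exit_threshold = exit_threshold

    def forward(self, x):                        # x: (1, seq_len, dim)
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)
            # width-wise: keep only the highest-norm tokens (toy scoring rule)
            scores = x.norm(dim=-1)              # (1, seq_len)
            k = max(1, int(self.keep_ratio * x.size(1)))
            keep = scores.topk(k, dim=1).indices.sort(dim=1).values
            x = x.gather(1, keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))
            # depth-wise: exit if this layer's classifier is already confident
            probs = exit_head(x.mean(dim=1)).softmax(-1)
            if probs.max() >= self.exit_threshold:
                return probs
        return probs

print(ToyEncoder()(torch.randn(1, 32, 64)))
```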
Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs.
InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years.
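For reference, here is a minimal NT-Xent loss, the InfoNCE variant used by SimCLR, computed over two augmented views of a batch; encoders, projection heads, and augmentations are assumed to happen upstream, and the temperature is an illustrative default.

```python
# Minimal NT-Xent (SimCLR-style InfoNCE) over two views of the same batch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, d) projections of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2N, d)
    sim = z @ z.t() / temperature                          # (2N, 2N) similarities
    sim.fill_diagonal_(float("-inf"))                      # never contrast with self
    n = z1.size(0)
    # positives: i-th sample in one view pairs with i-th sample in the other view
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```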