This paper focuses on the data-insufficiency problem in multi-task learning within an episodic training setup.
To alleviate this problem, data augmentation coupled with consistency regularization is commonly adopted to make the model less sensitive to domain-specific attributes.
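A minimal sketch of the consistency-regularization idea (a generic illustration, not the paper's specific method): the model is penalized when its prediction on an input diverges from its prediction on an augmented view of the same input.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_clean, logits_aug):
    """Mean squared error between the predictive distributions for an
    input and its augmented view; a common consistency regularizer."""
    p = softmax(logits_clean)
    q = softmax(logits_aug)
    return float(np.mean((p - q) ** 2))
```

Identical predictions incur zero penalty, while predictions that shift under augmentation are penalized, pushing the model toward augmentation-invariant (and hence less domain-specific) features.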
Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene.
We demonstrate the use of this embedding on two tasks, namely disease classification of X-ray reports and image classification.
Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity for transfer learning.
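For context, zero-shot classification with CLIP-style models typically builds such manual prompts by filling class names into a hand-written template; a minimal sketch of the prompt construction only (the class names are illustrative, and no model is called):

```python
# Hand-written template commonly used for CLIP zero-shot classification.
template = "a photo of a {}."

# Illustrative class names; in practice these come from the target dataset.
classes = ["cat", "dog", "car"]

prompts = [template.format(c) for c in classes]
```

Each prompt is then encoded by the text tower, and an image is assigned the class whose prompt embedding it matches best.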
However, the majority of media images on the internet remain in the 8-bit standard dynamic range (SDR) format.
We formulate generalization at test time as a variational inference problem: by modeling pseudo labels as distributions, we account for the uncertainty that arises during generalization and alleviate the misleading signal from inaccurate pseudo labels.
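One generic way to read this (a sketch of the underlying idea, not the authors' formulation): rather than committing to a hard argmax pseudo label, keep the full predictive distribution and down-weight high-entropy (uncertain) predictions so that unreliable pseudo labels contribute less to the adaptation signal.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_pseudo_label(logits):
    """Return the full predictive distribution as the pseudo label,
    plus a confidence weight of 1 - normalized entropy (1 for a peaked
    prediction, 0 for a uniform one)."""
    p = softmax(logits)
    ent = -np.sum(p * np.log(p + 1e-12), axis=-1)
    weight = 1.0 - ent / np.log(p.shape[-1])
    return p, weight
```

A confident (peaked) prediction receives a weight near 1, while a near-uniform prediction receives a weight near 0 and is effectively ignored.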
By learning to retain and recall the learning process of past training tasks, EMO nudges parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative.
To account for the uncertainty caused by the limited training tasks, we propose a variational MetaModulation where the modulation parameters are treated as latent variables.
Recently, Transformers have emerged as the go-to architecture for both vision and language modeling tasks, but their computational cost scales quadratically with the length of the input sequence.
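The quadratic scaling comes from the attention map itself: each of the n tokens attends to all n tokens. A rough multiply count for one self-attention layer makes this concrete (a back-of-the-envelope estimate counting only the QKᵀ scores and the weighted value sum):

```python
def attention_mults(seq_len, dim):
    """Approximate multiplications in one self-attention layer:
    seq_len * seq_len * dim for the Q @ K^T scores, and the same
    again for the attention-weighted sum over V."""
    return 2 * seq_len * seq_len * dim
```

Doubling the sequence length roughly quadruples the cost, which is why long inputs are the bottleneck.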