This process enables the model to be improved incrementally by processing multiple learning episodes, each representing a different learning task, even with few training examples.
We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of ViT architectures across different tasks.
Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality.
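A minimal sketch of this idea is shown below: the masked part of each modality is reconstructed with cross-attention to the other modality, and the MIM and MLM losses are optimized jointly. The layer names, model sizes, and masking ratios are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of joint masked vision-and-language modeling (illustrative only).
import torch
import torch.nn as nn

class JointMaskedVLM(nn.Module):
    def __init__(self, dim=256, vocab=30522, patch_dim=768):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, dim)
        self.img_proj = nn.Linear(patch_dim, dim)
        self.img_enc = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.txt_enc = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        # Each decoder attends to the *other* modality to fill in masked positions.
        self.img_dec = nn.TransformerDecoderLayer(dim, 4, batch_first=True)
        self.txt_dec = nn.TransformerDecoderLayer(dim, 4, batch_first=True)
        self.img_head = nn.Linear(dim, patch_dim)   # regress raw patch values
        self.txt_head = nn.Linear(dim, vocab)       # predict masked token ids

    def forward(self, patches, tokens, img_mask, txt_mask):
        # Encode each modality with masked positions zeroed out.
        img = self.img_enc(self.img_proj(patches) * (~img_mask).unsqueeze(-1))
        txt = self.txt_enc(self.txt_emb(tokens) * (~txt_mask).unsqueeze(-1))
        # Cross-modal reconstruction: image decoder conditions on text, and vice versa.
        img_rec = self.img_head(self.img_dec(img, memory=txt))
        txt_rec = self.txt_head(self.txt_dec(txt, memory=img))
        mim = nn.functional.mse_loss(img_rec[img_mask], patches[img_mask])
        mlm = nn.functional.cross_entropy(txt_rec[txt_mask], tokens[txt_mask])
        return mim + mlm

# Toy usage: 2 images of 16 patches, 2 captions of 12 tokens.
model = JointMaskedVLM()
patches = torch.randn(2, 16, 768)
tokens = torch.randint(0, 30522, (2, 12))
img_mask = torch.rand(2, 16) < 0.75   # mask 75% of patches
txt_mask = torch.rand(2, 12) < 0.15   # mask 15% of tokens
loss = model(patches, tokens, img_mask, txt_mask)
```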
In this paper, we study the challenging instance-wise vision-language tasks, where free-form language is required to align with individual objects rather than the whole image.
We hypothesize that a strong base model can provide a good representation for novel classes and incremental learning can be done with small adaptations.
TAPS solves a joint optimization problem that determines both which layers to share with the base model and the values of the task-specific weights.
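One way such a joint selection can be sketched is with a learnable per-layer gate on a task-specific residual, trained with a sparsity penalty so that most layers remain shared with the frozen base model. The straight-through gate, threshold, and penalty weight below are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of per-layer task adaptation with jointly learned layer selection.
import torch
import torch.nn as nn

class TaskAdaptiveLinear(nn.Module):
    def __init__(self, base_layer: nn.Linear, threshold: float = 0.1):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():            # base weights stay frozen/shared
            p.requires_grad_(False)
        self.delta = nn.Parameter(torch.zeros_like(base_layer.weight))  # task-specific residual
        self.score = nn.Parameter(torch.tensor(0.5))                    # layer-selection variable
        self.threshold = threshold

    def forward(self, x):
        # Hard gate in the forward pass, soft score in the backward pass
        # (straight-through estimator), so layer selection is optimized jointly
        # with the task-specific weights.
        hard = (self.score > self.threshold).float()
        gate = hard + self.score - self.score.detach()
        weight = self.base.weight + gate * self.delta
        return nn.functional.linear(x, weight, self.base.bias)

def sparsity_penalty(model, lam=1e-2):
    # Penalizes the number of task-specific (non-shared) layers.
    scores = [m.score.abs() for m in model.modules() if isinstance(m, TaskAdaptiveLinear)]
    return lam * torch.stack(scores).sum()
```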
Indeed, we observe experimentally that standard distillation of task-specific teachers, or using these teacher representations directly, reduces downstream transferability compared to a task-agnostic generalist model.
Traditionally, distillation has been used to train a student model to emulate the input/output functionality of a teacher.
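In this input/output-matching sense, the standard distillation objective (in the style of Hinton et al.) trains the student to match the teacher's temperature-softened output distribution alongside the ground-truth labels; the temperature and mixing weight below are typical but arbitrary choices.

```python
# Standard knowledge-distillation loss: student emulates the teacher's outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # KL between temperature-softened teacher and student distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    hard = F.cross_entropy(student_logits, labels)   # supervised term
    return alpha * soft + (1.0 - alpha) * hard
```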
Since all model selection algorithms in the literature have been tested on different use cases and never compared directly, we introduce a new comprehensive benchmark for model selection comprising: i) a model zoo of single- and multi-domain models, and ii) many target tasks.
We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights.
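One natural way to make these two quantities concrete, offered here as a hedged reading rather than the paper's exact definitions, is via leave-one-out comparisons of the training outcome with and without the sample.

```latex
% Illustrative leave-one-out formalization (an assumption, not the paper's exact definitions).
% D is the training set, D_{-i} = D \setminus \{z_i\}, w the final weights, f_w the learned function.
\begin{align}
  \mathrm{Info}_{\text{weights}}(z_i)  &= \mathrm{KL}\!\big(p(w \mid D)\,\big\|\,p(w \mid D_{-i})\big), \\
  \mathrm{Info}_{\text{function}}(z_i) &= \mathbb{E}_{x}\,
      \mathrm{KL}\!\big(p(f_w(x) \mid D)\,\big\|\,p(f_w(x) \mid D_{-i})\big).
\end{align}
```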
Document unwarping attempts to undo the physical deformation of the paper and recover a 'flatbed-scanned' document image for downstream tasks such as OCR.
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
Our findings challenge common practices of fine-tuning and encourage deep learning practitioners to rethink the hyperparameters for fine-tuning.
For the difficult cases, where the domain gaps and especially the category differences are large, we explore three different exemplar sampling methods and show that the proposed adaptive sampling method is effective at selecting diverse and informative samples from the entire dataset, further preventing forgetting; a generic illustration of diversity-aware selection follows below.
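As one common stand-in for diversity-aware exemplar selection, a greedy k-center pass over feature embeddings picks samples that cover the dataset well. This generic routine is offered only for illustration and is not the adaptive sampling method proposed in the paper.

```python
# Illustrative diversity-aware exemplar selection (k-center greedy), not the paper's method.
import numpy as np

def k_center_greedy(features: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` rows of `features` that cover the set well."""
    n = features.shape[0]
    chosen = [int(np.random.randint(n))]                # arbitrary first center
    dist = np.linalg.norm(features - features[chosen[0]], axis=1)
    for _ in range(budget - 1):
        idx = int(dist.argmax())                        # farthest point from current centers
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(features - features[idx], axis=1))
    return chosen

# Usage: select 20 exemplars from 1,000 embedded samples.
feats = np.random.randn(1000, 128)
exemplar_ids = k_center_greedy(feats, budget=20)
```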
The majority of modern meta-learning methods for few-shot classification operate in two phases: a meta-training phase, where the meta-learner learns a generic representation by solving multiple few-shot tasks sampled from a large dataset, and a testing phase, where the meta-learner leverages its learned internal representation for a specific few-shot task involving classes that were not seen during meta-training.
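The episodic structure described above can be sketched as follows: each episode is an N-way, K-shot task drawn from a larger labeled dataset, and at test time the same sampling procedure is applied to classes held out from meta-training. The dataset layout and helper names are assumptions for illustration.

```python
# Minimal sketch of N-way, K-shot episode sampling for meta-training and meta-testing.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    """dataset: list of (example, label) pairs. Returns (support, query) sets."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for new_label, c in enumerate(classes):             # relabel episode classes 0..N-1
        picks = random.sample(by_class[c], k_shot + q_queries)
        support += [(x, new_label) for x in picks[:k_shot]]
        query += [(x, new_label) for x in picks[k_shot:]]
    return support, query

# Meta-training: repeatedly adapt and evaluate the meta-learner on such episodes.
# Meta-testing: call sample_episode on a dataset of classes unseen during meta-training.
```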
We propose a method for learning embeddings for few-shot learning that is suitable for use with any number of ways and any number of shots (shot-free).
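One way an embedding can remain agnostic to the number of ways and shots is to classify each query by its distance to per-class representatives computed from the support set, so that neither N nor K is baked into the architecture. The prototype-style head below is an assumption used for illustration, not necessarily the paper's exact class model.

```python
# Hedged illustration of a way/shot-agnostic classification head over a learned embedding.
import torch

def classify_by_prototypes(embed, support_x, support_y, query_x):
    """embed: any embedding network; support_y: integer class labels."""
    z_s = embed(support_x)                              # [n_support, d]
    z_q = embed(query_x)                                # [n_query, d]
    classes = support_y.unique()
    # One representative per class: the mean of its support embeddings.
    protos = torch.stack([z_s[support_y == c].mean(0) for c in classes])
    dists = torch.cdist(z_q, protos)                    # [n_query, n_classes]
    return classes[dists.argmin(dim=1)]                 # predicted labels
```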