Sufficient knowledge extraction from the teacher network plays a critical role in the knowledge distillation task to improve the performance of the student network.
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and could support various downstream applications.
HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning.
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning are also proposed to consider the data imbalance and data scarcity, respectively, which are common in real-world implementations and further exacerbate the well-known problem of catastrophic forgetting.
Existing methods mostly focus on analyzing video content, neglecting users' social influence and tag relations.
In the meantime, we make full use of the structured information in the hierarchical labels to learn an accurate affinity graph for contrastive learning.
Several studies have recently pointed out that existing Visual Question Answering (VQA) models suffer heavily from the language prior problem, which refers to capturing superficial statistical correlations between the question type and the answer while ignoring the image contents.
Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions are simply answered by decomposing them into modular sub-problems.
We observe that the core difficulty for heterogeneous KD (hetero-KD) is the significant semantic gap between the backbone features of heterogeneous detectors due to the different optimization manners.
Scene Graph Generation, which generally follows a regular encoder-decoder pipeline, aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
Ranked #1 on Unbiased Scene Graph Generation on Visual Genome (mR@20 metric)
Segmenting 4K or 6K ultra-high-resolution images requires extra consideration of computational cost.
To address these issues, we develop a novel modality interaction modeling network based upon the routing mechanism, which is the first unified and dynamic multimodal interaction framework towards image-text retrieval.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
First of all, to perform matrix inversion, we provide a differentiable yet efficient method, named LD-Minv, which is a learnable deep neural network (DNN) with each layer being an $L$-th order matrix polynomial.
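To make the idea of stacking matrix-polynomial layers concrete, here is a minimal sketch of the classical hyperpower iteration for approximating a matrix inverse. It is a non-learned stand-in, not LD-Minv itself: every polynomial coefficient is fixed to 1, whereas the snippet describes a learnable network. The names `polynomial_layer`, `approx_inverse`, `order`, and `num_layers` are hypothetical, and reading `order` as the polynomial degree of each layer is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def polynomial_layer(a, x, order):
    """One layer: X <- X * (I + R + ... + R^order), with residual R = I - A X.

    All coefficients are fixed to 1 here (assumption); a learned variant
    would replace them with trainable parameters.
    """
    n = len(a)
    ax = matmul(a, x)
    residual = [[(1.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                for i in range(n)]
    term = identity(n)   # current power of the residual, starting at R^0
    acc = identity(n)    # running sum of residual powers
    for _ in range(order):
        term = matmul(term, residual)
        acc = [[acc[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return matmul(x, acc)


def approx_inverse(a, num_layers=10, order=2):
    """Approximate the inverse of a nonsingular square matrix `a`."""
    n = len(a)
    # Standard initialization X0 = A^T / (||A||_1 * ||A||_inf), which keeps
    # the spectral norm of the initial residual below 1.
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = 1.0 / (norm_1 * norm_inf)
    x = [[a[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(num_layers):
        x = polynomial_layer(a, x, order)
    return x


if __name__ == "__main__":
    for row in approx_inverse([[4.0, 1.0], [2.0, 3.0]]):
        print(row)
```

With unit coefficients, each layer maps the residual R to R^(order+1), so a higher-degree polynomial needs fewer layers at the cost of more matrix products per layer.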
In this paper, we examine the diversity of teacher models in the gradient space and regard ensemble knowledge distillation as a multi-objective optimization problem so that we can determine a better optimization direction for training the student network.
Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing its temporal workflow.
While successful in many fields, deep neural networks (DNNs) still suffer from some open problems such as bad local minima and unsatisfactory generalization performance.
We establish a stability condition for ResNets with step sizes and weight parameters, and point out the effects of step sizes on the stability and performance.
In order to overcome the lack of supervision, we introduce a differentiable module to resolve the overlap between any pair of instances.
Ranked #8 on Panoptic Segmentation on Cityscapes test
It is designed to compute the representation of each position by a weighted sum of the features at all positions.
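The weighted sum over all positions can be sketched as follows. The snippet does not say how the weights are obtained, so this sketch assumes dot-product similarity followed by a softmax; the function name `attend_all_positions` is hypothetical, and learned projections are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_all_positions(features):
    """Return one output vector per position.

    Each output is a weighted sum of the features at ALL positions.
    Weights come from softmax-normalized dot products (assumption).
    """
    outputs = []
    for query in features:
        weights = softmax([dot(query, key) for key in features])
        dim = len(query)
        outputs.append([
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    for row in attend_all_positions([[1.0, 0.0], [0.0, 1.0]]):
        print(row)
```

Because every position attends to every other position, the cost grows quadratically with the number of positions.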
Ranked #11 on Semantic Segmentation on COCO-Stuff test
Recently, a number of learning-based optimization methods that combine data-driven architectures with the classical optimization algorithms have been proposed and explored, showing superior empirical performance in solving various ill-posed inverse problems, but there is still a scarcity of rigorous analysis about the convergence behaviors of learning-based optimization.
Recently developed deep unsupervised methods allow us to jointly learn representations and cluster unlabelled data.
Ranked #7 on Image Clustering on Tiny-ImageNet
To tackle this issue, we propose a novel method for matrix recovery that can handle the case where the target matrix is low-rank in an implicit feature space but high-rank or even full-rank in its original form.
In heavy rain, rain streaks have various directions and shapes, which can be regarded as the accumulation of multiple rain streak layers.
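One way to write the accumulation described above is as an additive layer model. The symbols are assumptions introduced here for illustration: $\mathbf{O}$ is the observed rainy image, $\mathbf{B}$ the clean background, $\mathbf{S}_i$ the $i$-th rain streak layer, and $n$ the number of layers.

```latex
\mathbf{O} = \mathbf{B} + \sum_{i=1}^{n} \mathbf{S}_i
```

Under this reading, each $\mathbf{S}_i$ can group streaks of similar direction and shape, and deraining amounts to estimating $\mathbf{B}$ from $\mathbf{O}$.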
Ranked #7 on Single Image Deraining on Test2800
In this paper, we focus on the Markov chain based spectral clustering method and propose a novel essential tensor learning method to explore the high order correlations for multi-view representation.