Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different retrieval tasks often entail distinct search intents.
To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM.
The PT is developed to reduce redundant instances in bags by integrating prototypical learning into the Transformer architecture.
To address this issue, a novel Two-Stage Detection and Diagnosis Network (TSDDNet) is proposed based on weakly supervised learning to enhance diagnostic accuracy of the ultrasound-based CAD for breast cancers.
The key idea of MEGT is to adopt two independent Efficient Graph-based Transformer (EGT) branches to process the low-resolution and high-resolution patch embeddings (i. e., tokens in a Transformer) of WSIs, respectively, and then fuse these tokens via a multi-scale feature fusion module (MFFM).
This paper presents Sim-Suction, a robust object-aware suction grasp policy for mobile manipulation platforms with dynamic camera viewpoints, designed to pick up unknown objects from cluttered environments.
To alleviate such limitations, we propose a GlObal Structure knowledgeguided relation Extraction (GOSE) framework, which captures dependencies between entity pairs in an iterative manner.
Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images.
To improve the consistency between adjacent frames of generated videos, we propose the Frame Difference Loss, which is incorporated during the training process.
Our dataset generation process combines analytic models and dynamic simulations of the entire cluttered environment to provide accurate grasp labels.
To solve this problem, we propose a lightweight and accurate Edge Attention MRI Reconstruction Network (EAMRI) to reconstruct images with edge guidance.
Stereo image super-resolution aims to boost the performance of image super-resolution by exploiting the supplementary information provided by binocular systems.
Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner.
Though effective, prompt tuning under few-shot settings on the one hand heavily relies on a good initialization of soft prompts.
Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter -- and data -- efficient way, by learning the ``soft prompts'' to condition frozen pre-training models.
In the past decade, convolutional neural networks (CNNs) have shown prominence for semantic segmentation.
To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i. e., Charades-CG and ActivityNet-CG.
Moreover, we treat the uncertainty score of frames in a video as a whole, and estimate the difficulty of each video, which can further relieve the burden of video selection.
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology.
Furthermore, we show that the hidden state dimension can be approximated by extending the Johnson-Lindenstrauss lemma, optimizing the attention in bilinear form.
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
In this paper, we introduce a new task, named Temporal Emotion Localization in videos~(TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles.
This paper is concerned with the quantized consensus problem for uncertain nonlinear multi-agent systems under data-rate constraints and Denial-of-Service (DoS) attacks.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)
Recently, some methods have been proposed for snow removing, and most methods deal with snow images directly as the optimization object.
In this scenario, the input image serves as an intuitive context and background for the search, while the corresponding language expressly requests new traits on how specific characteristics of the query image should be modified in order to get the intended target image.
Recently, Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks due to the ability of global feature extraction.
Single image denoising (SID) has achieved significant breakthroughs with the development of deep learning.
Single-image super-resolution (SISR) has achieved significant breakthroughs with the development of deep learning.
In this paper, we summarize the 1st NTIRE challenge on stereo image super-resolution (restoration of rich details in a pair of low-resolution stereo images) with a focus on new solutions and results.
Then, we design an efficient Feature Refinement Module (FRM) to enhance the encoded features.
To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i. e., Charades-CG and ActivityNet-CG.
Convolutional neural networks based single-image super-resolution (SISR) has made great progress in recent years.
We propose the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework for diverse and unpaired TextCap.
In this survey, we give an overview of DL-based SISR methods and group them according to their targets, such as reconstruction efficiency, reconstruction accuracy, and perceptual accuracy.
Specifically, FBSNet employs a symmetrical encoder-decoder structure with two branches, semantic information branch and spatial detail branch.
LTB is composed of a series of Efficient Transformers (ET), which occupies a small GPU memory occupation, thanks to the specially designed Efficient Multi-Head Attention (EMHA).
Although these methods can remove part of the rain streaks, it is difficult for them to adapt to real-world scenarios and restore high-quality rain-free images with clear and accurate structures.
Secondly, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network from three hierarchies.
Besides the cross-view information exploitation in the low-resolution (LR) space, HR representations produced by the SR process are utilized to perform HR disparity estimation with higher accuracy, through which the HR features can be aggregated to generate a finer SR result.
Recently, the single image super-resolution (SISR) approaches with deep and complex convolutional neural network structures have achieved promising performance.
To achieve this, we propose a Multi-scale Topological Network (MSTN) to fully explore the features at different scales.
Although the emergence of deep learning has greatly promoted the development of this field, crowd counting under cluttered background is still a serious challenge.
In particular, we first cast the meta-overfitting problem (overfitting on sampling and label noise) as a gradient noise problem since few available samples cause meta-learner to overfit on existing examples (clean or corrupted) of an individual task at every gradient step.
In this paper, we provide a theoretical explanation that low total correlation of sampled representation cannot guarantee low total correlation of the mean representation.
Among them, MDCB aims to detect multi-scale features and maximize the use of image features flow at different scales, HFDB focuses on adaptively recalibrate channel-wise feature responses to achieve feature distillation, and DRB attempts to reconstruct SR images with different upsampling factors in a single model.
In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting.
Perceptual learning approaches like perceptual loss are empirically powerful for such tasks but they usually rely on the pre-trained classification network to provide features, which are not necessarily optimal in terms of visual perception of image transformation.
The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes.
1 code implementation • 26 Feb 2020 • Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W. black, Florian Metze
Multilingual models can improve language processing, particularly for low resource situations, by sharing parameters across languages.
Our method is compared with the state-of-the-art loop closure detection methods and the results show that it outperforms the traditional methods at both recall rate and speed.
However, plenty of studies have shown that global information is crucial for image restoration tasks like image demosaicing and enhancing.
Visual navigation is a task of training an embodied agent by intelligently navigating to a target object (e. g., television) using only visual observations.
The experimental results and further analysis prove that the agent with the MIND module is superior to its counterparts not only in EQA performance but in many other aspects such as route planning, behavioral interpretation, and the ability to generalize from a few examples.
In this work, we consider an alternative question: is it possible to fool deep classifiers, over all perceived objects of a certain type, by physically manipulating the camera itself?
This paper compares five types of pooling functions both theoretically and experimentally, with special focus on their performance of localization.
Sound Audio and Speech Processing
Most of them follow a classic atmospheric scattering model which is an elegant simplified physical model based on the assumption of single-scattering and homogeneous atmospheric medium.
Recent works on single-image super-resolution are concentrated on improving performance through enhancing spatial encoding between convolutional layers.
Meanwhile, we let these features interact with each other to get the most efficacious image information, we call this structure Multi-scale Residual Block (MSRB).
Constructing a joint representation invariant across different modalities (e. g., video, language) is of significant importance in many multimedia applications.
Ranked #30 on Video Retrieval on MSR-VTT
On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Net- work (CNN) and i-vector.
Our CNNs, with up to 34 weight layers, are efficient to optimize over very long sequences (e. g., vector of size 32000), necessary for processing acoustic waveforms.