For MoV, we use text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the best-matching one based on the user-provided text description.
A fundamental limitation of object detectors is that they suffer from "spatial bias", and in particular perform less satisfactorily when detecting objects near image borders.
We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks.
In this work, we identify that a crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning between the prompt and the output image.
We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS.
Such a distillation manner relieves the student's head from receiving contradictory supervision signals from the ground-truth annotations and the teacher's predictions, greatly improving the student's detection performance.
We consider the problem of referring camouflaged object detection (Ref-COD), a new task that aims to segment specified camouflaged objects based on a small set of referring images with salient target objects.
In this paper, we present a simple but performant semi-supervised semantic segmentation approach, termed CorrMatch.
The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical.
A significant research effort has focused on exploiting the remarkable capabilities of pretrained diffusion models for image editing.
Previous works have shown that increasing the window size for Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve model performance, but the computation overhead is also considerable.
To the best of our knowledge, this is the first time that large and selective kernel mechanisms have been explored in the field of remote sensing object detection.
Contrastive Masked Autoencoder (CMAE), as a new self-supervised framework, has shown its potential of learning expressive feature representations in visual image recognition.
In this paper, we study the spatial disequilibrium problem of modern object detectors and propose to quantify this ``spatial bias'' by measuring the detection performance over zones.
Ensemble learning serves as a straightforward way to improve the performance of almost any machine learning algorithm.
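The idea above can be sketched with a minimal majority-voting ensemble; the three threshold-based classifiers below are hypothetical stand-ins for real base learners:

```python
# A minimal sketch of ensemble learning by majority voting, assuming three
# hypothetical base classifiers represented as simple callables.
from collections import Counter

def clf_a(x):  # hypothetical base classifier
    return 1 if x > 0.5 else 0

def clf_b(x):  # hypothetical base classifier
    return 1 if x > 0.4 else 0

def clf_c(x):  # hypothetical base classifier
    return 1 if x > 0.7 else 0

def ensemble_predict(classifiers, x):
    """Return the majority vote of the base classifiers on input x."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

print(ensemble_predict([clf_a, clf_b, clf_c], 0.6))  # two of three vote 1, so prints 1
```

Even this trivial vote illustrates why ensembling helps: the combined prediction can be correct even when one base classifier errs.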
Identifying and segmenting camouflaged objects from the background is challenging.
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 of its parameters.
The target encoder, fed with the full images, enhances the feature discriminability via contrastive learning with its online counterpart.
Combining these two new components, for the first time, we show that logit mimicking can outperform feature imitation and the absence of localization distillation is a critical reason for why logit mimicking underperforms for years.
Our framework guides the global network to learn rich object-detail knowledge captured from a global view, and thereby produces high-quality attention maps that can be directly used as pseudo annotations for semantic segmentation networks.
Though the recently prevailing vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification, their performance is still inferior to that of the latest SOTA CNNs when no extra data are provided.
Recognizing the importance of the positional information carried by 2D feature representations, Vision Permutator, unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions, separately encodes the feature representations along the height and width dimensions with linear projections.
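The height/width-separated encoding can be illustrated with a stdlib-only toy: a linear projection is applied along the height axis and then along the width axis of an H x W x C map. The weight matrices here are hypothetical toy values (identity, for verifiability), not the paper's trained parameters:

```python
# A minimal sketch of separately encoding a 2-D feature map along its
# height and width dimensions with linear projections, in the spirit of
# Vision Permutator. Identity weights are used as hypothetical placeholders.

H, W, C = 2, 3, 2

def zeros(h, w, c):
    return [[[0.0] * c for _ in range(w)] for _ in range(h)]

def mix_height(x, weight):
    """out[h][w][c] = sum over h2 of weight[h][h2] * x[h2][w][c]"""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    out = zeros(h, w, c)
    for i in range(h):
        for j in range(w):
            for k in range(c):
                out[i][j][k] = sum(weight[i][h2] * x[h2][j][k] for h2 in range(h))
    return out

def mix_width(x, weight):
    """out[h][w][c] = sum over w2 of weight[w][w2] * x[h][w2][c]"""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    out = zeros(h, w, c)
    for i in range(h):
        for j in range(w):
            for k in range(c):
                out[i][j][k] = sum(weight[j][w2] * x[i][w2][k] for w2 in range(w))
    return out

x = [[[float(i + j + k) for k in range(C)] for j in range(W)] for i in range(H)]
wh = [[1.0 if a == b else 0.0 for b in range(H)] for a in range(H)]  # identity
ww = [[1.0 if a == b else 0.0 for b in range(W)] for a in range(W)]  # identity
y = mix_width(mix_height(x, wh), ww)
print(y == x)  # identity projections leave the map unchanged: True
```

The key point the sketch preserves is that each projection mixes positions along only one spatial axis, so height-wise and width-wise positional structure are encoded independently rather than flattened together.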
In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order.
To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation.
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
However, these methods usually encounter a boundary-related imbalance problem, leading to limited generation capability.
In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper.
Current neural architecture search (NAS) algorithms still require expert knowledge and effort to design a search space for network construction.
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps.
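The intuition of keeping positional information in the attention weights can be sketched with direction-aware pooling: instead of squeezing all spatial positions into one scalar per channel, pool separately along the height and width axes. This is a simplified, stdlib-only toy of that idea, not the exact published formulation (which also uses shared 1x1 convolutions):

```python
# A minimal sketch of position-aware channel attention: pool along each
# spatial axis separately so the resulting weights vary with position.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def coordinate_attention(x):
    """x: H x W x C nested lists; returns a re-weighted map of the same shape."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    # Direction-aware pooling: average over width for each (h, c) pair,
    # and over height for each (w, c) pair.
    pool_h = [[sum(x[i][j][k] for j in range(w)) / w for k in range(c)] for i in range(h)]
    pool_w = [[sum(x[i][j][k] for i in range(h)) / h for k in range(c)] for j in range(w)]
    return [[[x[i][j][k] * sigmoid(pool_h[i][k]) * sigmoid(pool_w[j][k])
              for k in range(c)] for j in range(w)] for i in range(h)]

x = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]  # constant 2x3x4 map
y = coordinate_attention(x)
print(len(y), len(y[0]), len(y[0][0]))  # shape is preserved: 2 3 4
```

Because the pooled descriptors are indexed by row and column, the attention weight at each location depends on where it is, which a single global squeeze per channel cannot express.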
Previous KD methods for object detection mostly focus on imitating deep features within the imitation regions rather than mimicking classification logits, owing to the inefficiency of logits in distilling localization information and the trivial improvements they bring.
Experiments demonstrate that based on the same classification models, the proposed approach can effectively improve the classification performance on CIFAR-100, ImageNet, and fine-grained datasets.
In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure.
In this paper, we rethink the necessity of such design changes and find it may bring risks of information loss and gradient confusion.
In this paper, we propose a novel multi-miner framework to perform a region mining process that adapts to diverse object sizes and is thus able to mine more integral and finer object regions.
To evaluate the performance of our proposed network on these tasks, we conduct exhaustive experiments on multiple representative datasets.
Spatial pooling has been proven highly effective in capturing long-range contextual information for pixel-wise prediction tasks, such as scene parsing.
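How pooling propagates long-range context can be shown with a stdlib-only strip-pooling toy: each position is augmented with the average of its entire row and entire column, so distant pixels in the same row or column directly influence it. This is an illustrative sketch, not a specific published architecture:

```python
# A minimal sketch of strip pooling for long-range context: each pixel
# receives the mean of its full row plus the mean of its full column.

def strip_pool(x):
    """x: H x W map of scalars; returns per-pixel (row_mean + col_mean) context."""
    h, w = len(x), len(x[0])
    row_mean = [sum(row) / w for row in x]
    col_mean = [sum(x[i][j] for i in range(h)) / h for j in range(w)]
    return [[row_mean[i] + col_mean[j] for j in range(w)] for i in range(h)]

x = [[0.0, 0.0, 9.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
ctx = strip_pool(x)
print(ctx[0][0])  # top-left pixel receives context from the far-away 9.0: 3.0
```

A local convolution would need several stacked layers before the 9.0 could reach the top-left corner; a single pooled strip delivers it in one step, which is the long-range property the sentence above refers to.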
To successfully align the multi-modal data structures across domains, subsequent works exploit discriminative information in the adversarial training process, e.g., using multiple class-wise discriminators and introducing conditional information in the input or output of the domain discriminator.
Feature pyramid network (FPN) based models, which fuse the semantics and salient details in a progressive manner, have been proven highly effective in salient object detection.
Previous adversarial learning methods condition domain alignment only on pseudo labels, but noisy and inaccurate pseudo labels may perturb the multi-class distribution embedded in probabilistic predictions, hence bringing insufficient alleviation to the latent mismatch problem.
The use of RGB-D information for salient object detection has been extensively explored in recent years.
The recent WSNet is a new model compression method that samples filter weights from a compact set and has been shown to be effective for 1D convolutional neural networks (CNNs).
We further design a feature aggregation module (FAM) to make the coarse-level semantic information well fused with the fine-level features from the top-down pathway.
To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models.
We also combine our method with Mask R-CNN for instance segmentation, and demonstrate for the first time the feasibility of weakly supervised instance segmentation using only keyword annotations.
Although these tasks are inherently very different, we show that our unified approach performs very well on all of them and works far better than current single-purpose state-of-the-art methods.
In this paper, we improve semantic segmentation by automatically learning from Flickr images associated with a particular keyword, without relying on any explicit user annotations, thus substantially alleviating the dependence on accurate annotations when compared to previous weakly supervised methods.
Our analysis identifies a serious design bias of existing SOD datasets which assumes that each image contains at least one clearly outstanding salient object in low clutter.
Taking into account the category-independent property of each target, we design a single stage salient instance segmentation framework, with a novel segmentation branch.
It also suffers from a slow convergence rate as a result of both the fixed search region and performing the assignment step and the update step separately.
Recent progress on saliency detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs).
Detecting and segmenting salient objects from natural scenes, often referred to as salient object detection, has attracted great interest in computer vision.