Although the recently prevailing vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification, their performance is still inferior to that of the latest SOTA CNNs when no extra data are provided.
Ranked #1 on Semantic Segmentation on Graz-02
Recognizing the importance of the positional information carried by 2D feature representations, Vision Permutator, unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions, separately encodes the feature representations along the height and width dimensions with linear projections.
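The separate height/width encoding described above can be sketched in NumPy. The function name, the use of full mixing matrices, and the summation fusion are all illustrative assumptions, not the paper's actual implementation (which splits channels into segments and uses learned linear layers):

```python
import numpy as np

def permute_mlp(x, w_h, w_w, w_c):
    """Sketch of separate height/width/channel encoding (Vision Permutator style).

    x: feature map of shape (H, W, C); w_h, w_w, w_c: projection matrices.
    Illustrative only -- the real model is more elaborate.
    """
    # Encode along height: for each (width, channel) column, mix the H positions.
    x_h = np.einsum('hwc,gh->gwc', x, w_h)   # w_h: (H, H)
    # Encode along width: mix the W positions.
    x_w = np.einsum('hwc,gw->hgc', x, w_w)   # w_w: (W, W)
    # Encode along channels: ordinary pointwise projection.
    x_c = x @ w_c                            # w_c: (C, C)
    return x_h + x_w + x_c                   # fused by summation in this sketch

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
x = rng.standard_normal((H, W, C))
out = permute_mlp(x,
                  rng.standard_normal((H, H)),
                  rng.standard_normal((W, W)),
                  rng.standard_normal((C, C)))
print(out.shape)  # (4, 4, 8)
```

Note that, unlike models operating on flattened tokens, each branch mixes positions along exactly one spatial axis, so height and width information is never entangled within a single projection.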
To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation.
In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order.
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
Ranked #59 on Image Classification on ImageNet
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
Ranked #16 on Semantic Segmentation on ADE20K
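The token-labeling objective described above can be sketched as a standard classification loss on the class token plus a dense cross-entropy over every patch token. The weighting `beta`, the hard per-token labels, and the function names are illustrative assumptions; the paper's recipe (e.g., how token labels are generated) is not reproduced here:

```python
import numpy as np

def token_labeling_loss(cls_logits, token_logits, cls_label, token_labels, beta=0.5):
    """Sketch of a token-labeling objective: class-token loss plus a dense
    per-token cross-entropy against machine-generated token labels."""
    def ce(logits, label):
        # Numerically stable cross-entropy for a single example.
        z = logits - logits.max()
        return -(z[label] - np.log(np.exp(z).sum()))

    cls_loss = ce(cls_logits, cls_label)
    tok_loss = np.mean([ce(l, y) for l, y in zip(token_logits, token_labels)])
    return cls_loss + beta * tok_loss

rng = np.random.default_rng(5)
num_classes, num_tokens = 10, 4
cls_logits = rng.standard_normal(num_classes)
token_logits = rng.standard_normal((num_tokens, num_classes))
loss = token_labeling_loss(cls_logits, token_logits,
                           cls_label=3, token_labels=[3, 3, 7, 7])
print(loss > 0)  # True
```

The key point is that every patch token receives its own supervision signal instead of only the single image-level label.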
However, these methods usually encounter a boundary-related imbalance problem, leading to limited generation capability.
Current neural architecture search (NAS) algorithms still require expert knowledge and effort to design a search space for network construction.
In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper.
Ranked #174 on Image Classification on ImageNet
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect positional information, which is important for generating spatially selective attention maps.
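The contrast drawn above can be sketched in NumPy: instead of a single global average pool (as in Squeeze-and-Excitation), pool along each spatial axis separately so the attention retains positional information. The learned projections are omitted, so this is an illustrative reduction rather than the paper's exact module:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coord_attention(x):
    """Sketch of position-aware channel attention for a (C, H, W) feature map.

    Directional pooling preserves where activations occur along each axis;
    the outer product of the two gates is spatially selective. Illustrative
    only: learned transforms between pooling and gating are omitted.
    """
    pool_h = x.mean(axis=2, keepdims=True)   # (C, H, 1): aggregate over width
    pool_w = x.mean(axis=1, keepdims=True)   # (C, 1, W): aggregate over height
    return x * sigmoid(pool_h) * sigmoid(pool_w)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 6, 6))
y = coord_attention(x)
print(y.shape)  # (8, 6, 6)
```

A plain Squeeze-and-Excitation gate would instead reduce `x` to a single `(C, 1, 1)` vector, discarding the H and W locations entirely.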
Previous KD methods for object detection mostly focus on imitating deep features within the imitation regions rather than mimicking classification logits, because the latter is inefficient at distilling localization information.
Experiments demonstrate that based on the same classification models, the proposed approach can effectively improve the classification performance on CIFAR-100, ImageNet, and fine-grained datasets.
In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure.
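The three-branch structure described above can be sketched in NumPy: each branch rotates the tensor so a different pair of dimensions interacts, pools the remaining one, and gates the input. The paper's 7x7 convolution on the pooled pair is replaced here by a simple mean, so this is an illustrative sketch, not the exact module:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def z_pool(x):
    """Stack max- and mean-pooling over the first axis (the 'Z-pool' idea)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)  # (2, A, B)

def triplet_attention(x):
    """Sketch of three-branch cross-dimension attention for x of shape (C, H, W)."""
    # Branch 1: (H, W) interaction, pool over C.
    g1 = sigmoid(z_pool(x).mean(axis=0))                      # (H, W)
    b1 = x * g1[None, :, :]
    # Branch 2: (C, W) interaction, pool over H.
    g2 = sigmoid(z_pool(x.transpose(1, 0, 2)).mean(axis=0))   # (C, W)
    b2 = x * g2[:, None, :]
    # Branch 3: (C, H) interaction, pool over W.
    g3 = sigmoid(z_pool(x.transpose(2, 0, 1)).mean(axis=0))   # (C, H)
    b3 = x * g3[:, :, None]
    # Average the three gated tensors.
    return (b1 + b2 + b3) / 3.0

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 5, 6))
y = triplet_attention(x)
print(y.shape)  # (4, 5, 6)
```

Because each branch captures a different channel-spatial interaction, the module stays light-weight: no branch introduces a full (C x H x W)-sized attention map.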
In this paper, we rethink the necessity of such design changes and find that they may bring risks of information loss and gradient confusion.
In this paper, we propose a novel multi-miner framework to perform a region mining process that adapts to diverse object sizes and is thus able to mine more integral and finer object regions.
To evaluate the performance of our proposed network on these tasks, we conduct exhaustive experiments on multiple representative datasets.
To successfully align the multi-modal data structures across domains, subsequent works exploit discriminative information in the adversarial training process, e.g., using multiple class-wise discriminators or introducing conditional information into the input or output of the domain discriminator.
Spatial pooling has been proven highly effective in capturing long-range contextual information for pixel-wise prediction tasks, such as scene parsing.
Ranked #24 on Semantic Segmentation on Cityscapes test
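One long-range spatial-pooling scheme consistent with the description above replaces square pooling windows with whole-row and whole-column strips, so every pixel aggregates context from its entire row and entire column. This NumPy sketch is an illustrative assumption (learned 1D convolutions and fusion layers are omitted):

```python
import numpy as np

def strip_pool(x):
    """Sketch of strip pooling over a (C, H, W) feature map.

    Average over entire rows and entire columns, then broadcast the two
    strips back and add them. Illustrative only.
    """
    row = x.mean(axis=2, keepdims=True)  # (C, H, 1): horizontal strip
    col = x.mean(axis=1, keepdims=True)  # (C, 1, W): vertical strip
    return row + col                     # broadcasts to (C, H, W)

rng = np.random.default_rng(3)
x = rng.standard_normal((2, 4, 5))
ctx = strip_pool(x)
print(ctx.shape)  # (2, 4, 5)
```

Each output position thus depends on H + W - 1 input positions rather than a small k x k neighbourhood, which is what makes the operation effective for long-range context in scene parsing.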
Feature pyramid network (FPN) based models, which fuse the semantics and salient details in a progressive manner, have been proven highly effective in salient object detection.
Previous adversarial learning methods condition domain alignment only on pseudo labels, but noisy and inaccurate pseudo labels may perturb the multi-class distribution embedded in probabilistic predictions, thus providing insufficient alleviation of the latent mismatch problem.
The use of RGB-D information for salient object detection has been extensively explored in recent years.
Ranked #4 on RGB-D Salient Object Detection on RGBD135
The recent WSNet is a new model compression method that samples filter weights from a compact set, and it has been demonstrated to be effective for 1D convolutional neural networks (CNNs).
We further design a feature aggregation module (FAM) to make the coarse-level semantic information well fused with the fine-level features from the top-down pathway.
Ranked #1 on RGB Salient Object Detection on SOD
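The fusion step described above can be sketched in NumPy: upsample the coarse semantic map, add the fine top-down features, then smooth the sum with multi-scale average pooling. The kernel sizes, nearest-neighbour upsampling, and uniform averaging are illustrative assumptions; the actual FAM also applies learned convolutions:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling for a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def avgpool(x, k):
    """Average-pool a (C, H, W) map with kernel/stride k (H, W divisible by k)."""
    C, H, W = x.shape
    return x.reshape(C, H // k, k, W // k, k).mean(axis=(2, 4))

def fuse_fam(coarse, fine):
    """Sketch of a feature aggregation module: merge coarse semantics with
    fine details, then aggregate context at multiple pooling scales."""
    merged = upsample2x(coarse) + fine
    out = merged.copy()
    for k in (2, 4):  # illustrative pooling scales
        p = avgpool(merged, k)
        out += p.repeat(k, axis=1).repeat(k, axis=2)  # broadcast back to (C, H, W)
    return out / 3.0

rng = np.random.default_rng(4)
coarse = rng.standard_normal((3, 4, 4))  # low-resolution semantic features
fine = rng.standard_normal((3, 8, 8))    # high-resolution detail features
out = fuse_fam(coarse, fine)
print(out.shape)  # (3, 8, 8)
```

The multi-scale pooling after the merge is what lets the coarse semantics spread smoothly into the fine-level map instead of being injected at a single scale.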
To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models.
We also combine our method with Mask R-CNN for instance segmentation, and demonstrate for the first time the ability to perform weakly supervised instance segmentation using only keyword annotations.
Although these tasks are inherently very different, we show that our unified approach performs very well on all of them and works far better than current single-purpose state-of-the-art methods.
In this paper, we improve semantic segmentation by automatically learning from Flickr images associated with a particular keyword, without relying on any explicit user annotations, thus substantially alleviating the dependence on accurate annotations when compared to previous weakly supervised methods.
Our analysis identifies a serious design bias of existing SOD datasets which assumes that each image contains at least one clearly outstanding salient object in low clutter.
Taking into account the category-independent property of each target, we design a single stage salient instance segmentation framework, with a novel segmentation branch.
It also suffers in convergence rate, as a result of both the fixed search region and the separation of the assignment step from the update step.
Recent progress on saliency detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs).
Ranked #4 on RGB Salient Object Detection on SBU
Detecting and segmenting salient objects from natural scenes, often referred to as salient object detection, has attracted great interest in computer vision.