In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection.
To address these issues, we propose a transformer network with multi-stage CNN (Convolutional Neural Network) feature injection for surface defect segmentation, which is a UNet-like structure named CINFormer.
To this end, we develop a Global Context Aggregation Network (GCANet), built on an encoder-decoder structure, for lightweight saliency detection of surface defects.
Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set.
In this paper, we explore the model design of vision Transformers in stereo 3D object detection, focusing particularly on extracting and encoding the task-specific image correspondence information.
Given a set of sparse and learnable proposals, LEAPS employs a dynamic person search head to directly perform person detection and corresponding re-id feature generation without non-maximum suppression post-processing.
We hope that our simple intra-image contrastive learning can inspire further paradigms for weakly supervised person search.
The success of the transformer architecture in natural language processing has recently triggered attention in the computer vision field.
We propose a novel one-step transformer-based person search framework, PSTR, that jointly performs person detection and re-identification (re-id) in a single architecture.
When using the ResNet50 backbone, our MS-STS achieves a mask AP of 50.1%, outperforming the best reported results in the literature by 2.7% and by 4.8% at the higher overlap threshold of AP_75, while being comparable in model size and speed on YouTube-VIS 2019 val.
The key in our ESGN is an efficient geometry-aware feature generation (EGFG) module.
Compared with the baseline RTS3D, our proposed method achieves a 2.57% improvement on AP3d with almost no extra network parameters.
Most online multi-object trackers perform object detection stand-alone in a neural net without any input from tracking.
Ranked #1 on Instance Segmentation on nuScenes
Object detectors usually achieve promising results with the supervision of complete instance annotations.
In addition to single-spectral pedestrian detection, we also review multi-spectral pedestrian detection, which provides features that are more robust to illumination variation.
In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp.
Ranked #12 on Real-time Instance Segmentation on MSCOCO
For precise localization, we introduce a dense local regression that predicts multiple dense box offsets for an object proposal.
Ranked #66 on Instance Segmentation on COCO test-dev
With this observation, we propose a new Neighbor Erasing and Transferring (NET) mechanism to reconfigure the pyramid features and explore scale-aware features.
To further solve the second problem, a hierarchical shot detector (HSD) is proposed, which stacks two ROC modules and one feature enhanced module.
Ranked #3 on Object Detection on PASCAL VOC 2007
Experimental results on the VOC2007 and VOC2012 datasets demonstrate that the proposed TripleNet is able to improve both the detection and segmentation accuracies without adding extra computational costs.
Ranked #18 on Semantic Segmentation on PASCAL VOC 2012 test
In this paper, we propose a multi-branch and high-level semantic network by gradually splitting a base network into multiple different branches.
For example, a CNN classifies these proposals using the fully connected layer features, while the proposal scores and the features in the inner layers of the CNN are ignored.
Ranked #25 on Pedestrian Detection on Caltech
Finally, we propose to combine both non-neighboring and neighboring features for pedestrian detection.
Ranked #28 on Pedestrian Detection on Caltech
Multistage particle windows (MPW), proposed by Gualdi et al., is a fast and accurate object detection algorithm.
iCascade searches for the optimal number r_i of weak classifiers in each stage i by directly minimizing the computation cost of the cascade.
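
To make the notion of "computation cost of the cascade" concrete, the following is a minimal sketch of one common way to model it, not the iCascade search procedure itself. It assumes that a window reaching stage i costs r_i weak-classifier evaluations and that each stage passes a fixed fraction of the windows it sees; the function name `expected_cascade_cost` and the `pass_rate` parameter are hypothetical and do not come from the excerpt above.

```python
from typing import Sequence


def expected_cascade_cost(r: Sequence[int], pass_rate: Sequence[float]) -> float:
    """Expected number of weak-classifier evaluations per window.

    r[i]         -- number of weak classifiers in stage i (the r_i above)
    pass_rate[i] -- assumed fraction of windows that stage i lets through
                    (a hypothetical modelling input, not from the source)
    """
    cost = 0.0
    reach = 1.0  # probability that a window reaches the current stage
    for r_i, p_i in zip(r, pass_rate):
        cost += reach * r_i  # only windows that reach stage i pay for it
        reach *= p_i         # fewer windows survive to the next stage
    return cost


# A short first stage that rejects half the windows keeps the average cost low.
cheap_first = expected_cascade_cost([2, 10], [0.5, 0.1])
costly_first = expected_cascade_cost([10, 2], [0.5, 0.1])
print(cheap_first, costly_first)
```

Under this simplified model, putting few weak classifiers in the early stages lowers the average cost because most windows are rejected before the larger stages are evaluated; how the stage sizes interact with the pass rates is exactly what a search over r_i has to account for.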