The core element of CAP-Net is a module named Correspondence-Aware Fusion (CAF) which integrates the local features of the two modalities based on their correspondence scores.
This paper presents a neural network built upon Transformers, namely PlaneTR, to simultaneously detect and reconstruct planes from a single image.
To model the representations of the two levels, we first encode the information from the whole into part vectors through an attention mechanism, then decode the global information within the part vectors back into the whole representation.
Ranked #98 on Image Classification on ImageNet
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Leveraging the advances of natural language processing, most recent scene text recognizers adopt an encoder-decoder architecture where text images are first converted to representative features and then to a sequence of characters via "direct decoding".
Object detection, instance segmentation, and pose estimation are popular visual recognition tasks which require localizing the object by internal or boundary landmarks.
Ranked #28 on Object Detection on COCO test-dev
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images, which can be regarded as the unified task of pedestrian detection and person re-identification (re-id).
Ranked #4 on Person Search on CUHK-SYSU
In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% J&F and 70 FPS on the DAVIS 2017 validation dataset, leading all present solutions in overall accuracy and speed performance.
Can our video understanding systems perceive objects when a heavy occlusion exists in a scene?
Ranked #2 on Video Instance Segmentation on OVIS validation
Current developments in temporal event or action localization usually target actions captured by a single camera.
Ranked #2 on Temporal Action Localization on THUMOS’14 (using extra training data)
However, the cross-entropy loss cannot take into account the different importance of each class in a self-driving system.
We also propose two novel modules, i.e., position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention Module (CAM), to capture semantic structure attention in spatial and channel dimensions, respectively.
The ground metric of Wasserstein distance can be pre-defined following the experience on a specific task.
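As a rough illustration of how a pre-defined ground metric can encode class importance: when the target distribution is one-hot, all predicted probability mass has to be transported to the target class, so the Wasserstein distance reduces to the expected ground cost. The sketch below assumes this one-hot setting; the class names and the cost matrix are hypothetical values chosen for illustration and do not come from the paper.

```python
def wasserstein_to_one_hot(probs, target, ground_metric):
    """Wasserstein distance between a predicted distribution and a one-hot target.

    With a one-hot target, every unit of mass at class i must move to `target`,
    so the optimal transport cost is sum_i probs[i] * ground_metric[i][target].
    """
    return sum(p * ground_metric[i][target] for i, p in enumerate(probs))


# Hypothetical classes: 0 = road, 1 = car, 2 = pedestrian.
# Hypothetical ground metric: confusing a pedestrian with road costs the most.
ground_metric = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]

# Two predictions with the same probability (0.6) on the true class "pedestrian".
mostly_car = [0.1, 0.3, 0.6]
mostly_road = [0.3, 0.1, 0.6]

loss_car = wasserstein_to_one_hot(mostly_car, 2, ground_metric)
loss_road = wasserstein_to_one_hot(mostly_road, 2, ground_metric)
```

Cross-entropy would score both predictions identically because it only looks at the probability of the true class, whereas this loss penalizes the prediction that leaks more mass to the costlier class.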
We present a novel Bipartite Graph Reasoning GAN (BiGraphGAN) for the challenging person image generation task.
Ranked #1 on Pose Transfer on Deep-Fashion
On the MS-COCO dataset, CPN achieves an AP of 49.2%, which is competitive among state-of-the-art object detection methods.
Ranked #55 on Object Detection on COCO test-dev
In this paper, we study how to make use of decentralized datasets for training a robust scene text recognizer while keeping them on local devices.
We propose a novel Generative Adversarial Network (XingGAN or CrossingGAN) for person image generation tasks, i.e., translating the pose of a given person to a desired one.
Ranked #1 on Pose Transfer on Deep-Fashion
However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to the following challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply where computational resources are limited, and 2) it is an open problem to discover an optimal configuration for embedding NL blocks into mobile neural networks.
Ranked #32 on Neural Architecture Search on ImageNet
For computing line segment proposals, a novel exact dual representation is proposed which exploits a parsimonious geometric reparameterization for line segments and forms a holistic 4-dimensional attraction field map for an input image.
Ranked #2 on Line Segment Detection on York Urban Dataset
A major issue is that the density map on dense regions usually accumulates density values from a number of nearby Gaussian blobs, yielding different large density values on a small set of pixels.
Given a line segment map, the proposed regional attraction first establishes the relationship between line segments and regions in the image lattice.
Unsupervised video object segmentation has often been tackled by methods based on recurrent neural networks and optical flow.
Ranked #4 on Unsupervised Video Object Segmentation on DAVIS 2016 (using extra training data)
The non-local module is a particularly useful technique for semantic segmentation, but it is criticized for its prohibitive computation and GPU memory occupation.
Ranked #10 on Semantic Segmentation on COCO-Stuff test
By doing so, spatial information across multiple views is captured, which helps to learn a discriminative global embedding for each 3D object.
Reading text in the wild is a very challenging task due to the diversity of text instances and the complexity of natural scenes.
Dense crowd counting aims to predict thousands of human instances from an image, by calculating integrals of a density map over image pixels.
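The counting-by-integration idea can be sketched in a few lines: each annotated person contributes one Gaussian blob that sums to 1, so summing the density map over all pixels recovers the count. This is a minimal pure-Python sketch; the function names, grid size, `sigma`, and point coordinates are illustrative assumptions, not details of any specific method listed here.

```python
import math


def gaussian_density_map(points, height, width, sigma=2.0):
    """Build a density map with one normalized Gaussian blob per annotated point."""
    density = [[0.0] * width for _ in range(height)]
    for (py, px) in points:
        # Unnormalized Gaussian centered on the annotation.
        blob = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, blob))
        # Normalize on the grid so each person contributes exactly 1 to the sum.
        for y in range(height):
            for x in range(width):
                density[y][x] += blob[y][x] / total
    return density


def count_from_density(density):
    """Discrete integral of the density map over all pixels."""
    return sum(map(sum, density))


# Two people standing close together and one isolated person (hypothetical data).
points = [(5, 5), (5, 6), (20, 20)]
density = gaussian_density_map(points, height=32, width=32)

estimated_count = count_from_density(density)
peak_dense = density[5][5]     # pixel where two nearby blobs overlap
peak_sparse = density[20][20]  # pixel under a single isolated blob
```

The sum stays equal to the number of annotated points, while the pixel values under overlapping blobs grow larger than those under an isolated blob, which is the accumulation effect that arises in dense regions.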
This work studies the unsupervised re-ranking procedure for object retrieval and person re-identification with a specific concentration on an ensemble of multiple metrics (or similarities).
In object detection, keypoint-based approaches often suffer from a large number of incorrect object bounding boxes, arguably due to the lack of an additional look into the cropped regions.
Ranked #3 on Object Detection on UA-DETRAC
Accurate multi-organ abdominal CT segmentation is essential to many clinical applications such as computer-aided intervention.
Ranked #3 on Medical Image Segmentation on Synapse multi-organ CT
We observe the property of regional homogeneity in adversarial perturbations and suggest that the defenses are less robust to regionally homogeneous perturbations.
However, our work observes the extreme vulnerability of existing distance metrics to adversarial examples, generated by simply adding human-imperceptible perturbations to person images.
To efficiently learn deep embeddings on the high-order graph-structured data, we introduce two end-to-end trainable operators to the family of graph neural networks, i.e., hypergraph convolution and hypergraph attention.
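To give a feel for what propagation on a hypergraph looks like, the sketch below performs one simplified vertex-to-hyperedge-to-vertex averaging step over an incidence matrix. It is an assumption-laden illustration only: it omits the learnable projection, hyperedge weights, nonlinearity, and attention that a trainable operator would include, and the function name and toy data are hypothetical.

```python
def hypergraph_propagate(incidence, features):
    """One simplified propagation step on a hypergraph.

    incidence[v][e] is 1 if vertex v belongs to hyperedge e, else 0.
    features[v] is the feature vector of vertex v.
    Assumes every hyperedge has at least one vertex and every vertex
    belongs to at least one hyperedge.
    """
    n_vertices = len(incidence)
    n_edges = len(incidence[0])
    dim = len(features[0])

    # Step 1: each hyperedge averages the features of its member vertices.
    edge_feats = []
    for e in range(n_edges):
        members = [v for v in range(n_vertices) if incidence[v][e]]
        edge_feats.append(
            [sum(features[v][d] for v in members) / len(members) for d in range(dim)]
        )

    # Step 2: each vertex averages the features of the hyperedges it belongs to.
    out = []
    for v in range(n_vertices):
        edges = [e for e in range(n_edges) if incidence[v][e]]
        out.append(
            [sum(edge_feats[e][d] for e in edges) / len(edges) for d in range(dim)]
        )
    return out


# Toy hypergraph: hyperedge 0 = {v0, v1, v2}, hyperedge 1 = {v2, v3}.
incidence = [
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
]
features = [[1.0], [2.0], [3.0], [4.0]]

propagated = hypergraph_propagate(incidence, features)
```

Unlike an ordinary graph edge, a hyperedge can join more than two vertices, so vertex v2 here mixes information from both groups in a single step.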
In contrast to previous a-posteriori methods of visualizing DeepRL policies, we propose an end-to-end trainable framework based on Rainbow, a representative Deep Q-Network (DQN) agent.
The critical principle of ghost networks is to apply feature-level perturbations to an existing model to potentially create a huge set of diverse models.
In experiments, our method is tested on the WireFrame dataset and the YorkUrban dataset with state-of-the-art performance obtained.
Ranked #4 on Line Segment Detection on York Urban Dataset (using extra training data)
Person re-identification (re-ID) is a highly challenging task due to large variations of pose, viewpoint, illumination, and occlusion.
The iterative instance classifier refinement is implemented online using multiple streams in convolutional neural networks, where the first is an MIL network and the others are for instance classifier refinement supervised by the preceding one.
Ranked #1 on Weakly Supervised Object Detection on ImageNet
In multi-organ segmentation of abdominal CT scans, most existing fully supervised deep learning algorithms require lots of voxel-wise annotations, which are usually difficult, expensive, and slow to obtain.
We hope that our proposed attack strategy can serve as a strong benchmark baseline for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future.
Most existing 3D object recognition algorithms focus on leveraging the strong discriminative power of deep learning models with softmax loss for the classification of 3D data, while learning discriminative features with deep metric learning for 3D object retrieval is more or less neglected.
This stimulates great research interest in considering similarity fusion in the framework of diffusion process (i.e., fusion with diffusion) for robust retrieval.
Most existing person re-identification algorithms either extract robust visual features or learn discriminative metrics for person images.
Ranked #71 on Person Re-Identification on Market-1501
We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File Twice, as GIFT.
By combining the global deep learning representation and the local descriptor representation, our method obtains state-of-the-art performance on 3D shape retrieval benchmarks.