Images taken in dynamic scenes may contain unwanted motion blur, which significantly degrades visual quality.
We address weakly supervised point cloud segmentation by proposing a new model, MIL-derived transformer, to mine additional supervisory signals.
The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories.
Image motion blur usually results from moving objects or camera shakes.
We formulate this task as an object point sampling problem, and develop two techniques, the mutual attention module and co-contrastive learning, to enable it.
Estimating 3D hand poses from RGB images is essential to a wide range of potential applications, but is challenging owing to substantial ambiguity in the inference of depth information from RGB images.
Experiments show that our model achieves surprisingly good results, with 3D estimation accuracy on par with the state-of-the-art models trained with 3D annotations, highlighting the benefit of temporal consistency in constraining 3D prediction models.
Based on the matching algorithm, we propose an efficient pipeline to generate a large-scale multi-view hand mesh (MVHM) dataset with accurate 3D hand mesh and joint labels.
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
A domain adaptive object detector aims to adapt itself to unseen domains that may contain variations of object appearance, viewpoints or backgrounds.
With the growing attention on learning-to-learn new tasks using only a few examples, meta-learning has been widely used in numerous problems such as few-shot classification, reinforcement learning, and domain generalization.
Establishing dense semantic correspondences between object instances remains a challenging problem due to background clutter, significant scale and pose differences, and large intra-class variations.
Person re-identification (re-ID) aims at matching images of the same person across camera views.
Unsupervised domain adaptation algorithms aim to transfer the knowledge learned from one domain to another (e.g., from synthetic to real images).
This paper presents a weakly supervised instance segmentation method that consumes training data with tight bounding box annotations.
To this end, we propose an end-to-end trainable comprehension network that consists of the language and visual encoders to extract feature representations from both domains.
Person re-identification (re-ID) aims at matching images of the same identity across camera views.
In contrast to existing algorithms that tackle the tasks of semantic matching and object co-segmentation in isolation, our method exploits the complementary nature of the two tasks.
In addition to the cycle consistency loss, we propose two extensions: motion linearity loss and edge-guided training.
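The motion linearity loss named above can be illustrated with a minimal sketch. Under a linear-motion assumption, the optical flow from frame 0 to an intermediate time t should equal t times the full flow from frame 0 to frame 1; the function name and exact penalty form below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def motion_linearity_loss(flow_0_to_t, flow_0_to_1, t):
    """Hypothetical motion-linearity penalty: under linear motion, the
    flow to the intermediate time t should be t times the full flow."""
    return float(np.mean(np.abs(flow_0_to_t - t * flow_0_to_1)))

# Toy example: perfectly linear motion yields zero loss.
full_flow = np.full((4, 4, 2), 2.0)  # 2-pixel displacement everywhere
mid_flow = 0.5 * full_flow           # flow to the midpoint t = 0.5
print(motion_linearity_loss(mid_flow, full_flow, 0.5))  # → 0.0
```

Deviations from linear motion (acceleration, occlusion boundaries) would produce a nonzero penalty, which is what makes such a term a useful regularizer for frame interpolation.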
The entire process is decomposed into two tasks: 1) solving a submodular function for selecting object-like segments, and 2) learning a CNN model with a transferable module for adapting seen categories in the source domain to the unseen target video.
This paper aims at recognizing partially observed human actions in videos.
Hand pose estimation from a monocular RGB image is an important but challenging task.
In this paper, we address co-saliency detection in a set of images jointly covering objects of a specific class by an unsupervised convolutional neural network (CNN).
We propose a method for semi-supervised semantic segmentation using an adversarial network.
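One common way such an adversarial setup exploits unlabeled images is to let the discriminator's per-pixel confidence gate which predictions become pseudo-labels. The sketch below assumes this confidence-masking form; the function name, threshold, and shapes are illustrative, not the paper's actual code.

```python
import numpy as np

def select_pseudo_labels(seg_probs, disc_confidence, tau=0.2):
    """Assumed semi-supervised step: pixels whose discriminator
    confidence exceeds tau contribute pseudo-labels taken from the
    segmentation network's own class predictions."""
    pseudo = np.argmax(seg_probs, axis=-1)  # predicted class per pixel
    mask = disc_confidence > tau            # trusted-region mask
    return pseudo, mask

# Toy 1x2 "image" with 2 classes: only the first pixel is trusted.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]]])
conf = np.array([[0.8, 0.1]])
labels, mask = select_pseudo_labels(probs, conf)
print(labels)  # [[0 1]]
print(mask)    # [[ True False]]
```

Only the masked pixels would then enter the segmentation loss on unlabeled data, so unreliable predictions do not reinforce themselves.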
This paper presents the DeepCD framework which learns a pair of complementary descriptors jointly for a patch by employing deep learning techniques.
We tackle the three issues by introducing a new network layer, called the co-occurrence layer.
Experiments on popular benchmarks demonstrate the effectiveness of our descriptors and their superiority to the state-of-the-art descriptors.
First, the performance of descriptor-based approaches to image alignment relies on the chosen descriptor, but the optimal descriptor typically varies from image to image, or even pixel to pixel.
Inspired by the observation that the homographies of correct feature correspondences vary smoothly over the spatial domain, our approach builds on the unsupervised nature of feature matching and can select a good descriptor for matching each feature point.
By treating a bounding box as a bag with its segment hypotheses as structured instances, MSIL-CRF selects the most likely segment hypotheses by leveraging the knowledge derived from both the labeled and uncertain training data.
Our approach aims to enhance action recognition in RGB videos by leveraging the extra database.
Inspired by the fact that nearby features on the same object share coherent homographies in matching, we cast the task of feature matching as a density estimation problem in the Hough space spanned by the hypotheses of homographies.
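The density-estimation idea can be sketched in a heavily simplified form: replace homographies with pure translations, let each putative correspondence vote for a motion hypothesis in a discretized space, and keep the correspondences that land in the densest bin. The function below is a toy illustration under that simplification, not the paper's algorithm.

```python
import numpy as np

def hough_match_filter(pts_a, pts_b, bin_size=2.0):
    """Toy Hough-voting filter (translations instead of homographies):
    each correspondence votes for a displacement bin; matches in the
    most popular bin are kept as geometrically consistent."""
    votes = (pts_b - pts_a) / bin_size        # displacement hypotheses
    bins = np.floor(votes).astype(int)        # discretize the Hough space
    keys = [tuple(b) for b in bins]
    best = max(set(keys), key=keys.count)     # densest bin wins
    return [i for i, k in enumerate(keys) if k == best]

a = np.array([[0, 0], [1, 1], [5, 5]])
b = np.array([[10, 0], [11, 1], [2, 9]])  # first two share translation (10, 0)
print(hough_match_filter(a, b))  # → [0, 1]
```

The outlier correspondence votes in a different bin and is discarded, which mirrors how density peaks in the hypothesis space separate coherent matches from mismatches.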
In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been an effective way to improve performance.