In this paper, we take advantage of previous pre-trained models (PTMs) and propose a novel Chinese Pre-trained Unbalanced Transformer (CPT).
In this paper, we present a novel neural scene rendering system, which learns an object-compositional neural radiance field and produces realistic rendering with editing capability for a clustered and real-world scene.
Panorama images have a much larger field-of-view and thus naturally encode richer scene context than standard perspective images; however, this information is not well exploited by previous scene understanding methods.
With the rapid development of data-driven techniques, data has played an essential role in various computer vision tasks.
Experimental results show that our method can generate high-quality alpha mattes for various videos featuring appearance change, occlusion, and fast motion.
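For reference, an alpha matte is defined by the standard compositing equation I = α F + (1 − α) B, where I is the observed pixel color, F the foreground color, B the background color, and α ∈ [0, 1] the per-pixel opacity; video matting amounts to predicting α (and usually F) for every frame.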
To address this problem, we propose a novel visual localization framework that establishes 2D-to-3D correspondences between the query image and the 3D map with a series of learnable scene-specific landmarks.
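The landmark detector itself is learned and scene-specific; purely as an illustration of the downstream step, once 2D-to-3D correspondences are available the query pose can be recovered with a standard PnP-plus-RANSAC solver. A minimal sketch using OpenCV, with placeholder correspondences and assumed intrinsics (not the paper's pipeline):

    import numpy as np
    import cv2

    # Placeholder correspondences: N scene landmarks (3D, map frame) and their
    # detected 2D locations in the query image. In the actual system these would
    # come from the learned scene-specific landmark detector.
    points_3d = np.random.rand(50, 3).astype(np.float32)          # hypothetical
    points_2d = np.random.rand(50, 2).astype(np.float32) * 640.0  # hypothetical

    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]], dtype=np.float32)  # assumed intrinsics
    dist = np.zeros(5, dtype=np.float32)               # assume no lens distortion

    # Robust pose estimation from 2D-3D correspondences (standard PnP + RANSAC).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, K, dist)
    if ok:
        R, _ = cv2.Rodrigues(rvec)  # rotation of the query camera
        print("rotation:\n", R, "\ntranslation:\n", tvec.ravel())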
Moreover, the learned blend weight fields can be combined with input skeletal motions to generate new deformation fields to animate the human model.
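Blend weight fields generalize the per-vertex weights of standard linear blend skinning (LBS); as background only, a minimal numpy sketch of LBS is given below, in which each vertex is deformed by a weighted combination of rigid per-bone transforms. The learned weight field would replace the fixed weights assumed here.

    import numpy as np

    def linear_blend_skinning(verts, weights, transforms):
        """verts: (V, 3) rest-pose vertices; weights: (V, K) per-vertex blend
        weights summing to 1 over K bones; transforms: (K, 4, 4) bone transforms."""
        verts_h = np.concatenate([verts, np.ones((verts.shape[0], 1))], axis=1)
        blended = np.einsum('vk,kij->vij', weights, transforms)   # (V, 4, 4)
        deformed = np.einsum('vij,vj->vi', blended, verts_h)      # (V, 4)
        return deformed[:, :3]

    # Tiny example with 2 bones and 3 vertices (all values hypothetical).
    verts = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
    weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
    T0 = np.eye(4)
    T1 = np.eye(4); T1[:3, 3] = [0.0, 0.2, 0.0]   # second bone moved upward
    print(linear_blend_skinning(verts, weights, np.stack([T0, T1])))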
In this paper, we propose StereoPIFu, which integrates the geometric constraints of stereo vision with implicit function representation of PIFu, to recover the 3D shape of the clothed human from a pair of low-cost rectified images.
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video.
In this paper, we introduce the new task of reconstructing 3D human pose from a single image in which we can see the person and the person's image through a mirror.
Generating a high-fidelity talking-head video that matches an input audio sequence is a challenging problem that has received considerable attention recently.
This paper proposes a novel active boundary loss for semantic segmentation.
In this work, we propose a novel system for integrated 3D object detection and tracking, which uses a dynamic object occupancy map and previous object states as spatial-temporal memory to assist object detection in future frames.
Different from traditional video cameras, event cameras capture an asynchronous event stream in which each event encodes pixel location, trigger time, and the polarity of the brightness change.
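Concretely, each event is just a tuple (x, y, t, p). One common way to feed such a stream to a conventional network is to accumulate polarities into an event frame; the sketch below shows that simple representation only as an assumed example (voxel grids and time surfaces are common alternatives) and is not tied to any particular method.

    import numpy as np

    # An event stream as a structured array: pixel location, trigger time, polarity.
    events = np.array([(12, 30, 0.0010, 1),
                       (12, 31, 0.0012, -1),
                       (40, 5, 0.0015, 1)],
                      dtype=[('x', np.int32), ('y', np.int32),
                             ('t', np.float64), ('p', np.int8)])

    def accumulate_event_frame(events, height, width):
        """Sum signed polarities per pixel over a time window."""
        frame = np.zeros((height, width), dtype=np.float32)
        np.add.at(frame, (events['y'], events['x']), events['p'])
        return frame

    frame = accumulate_event_frame(events, height=64, width=64)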
To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated.
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
To suit our network to self-supervised learning, we design several novel loss functions that utilize the inherent properties of LiDAR point clouds.
Recovering multi-person 3D poses with absolute scales from a single RGB image is a challenging problem due to the inherent depth and scale ambiguity from a single view.
Therefore, we propose to capture human motion by jointly analyzing these Internet videos instead of using single videos separately.
In this paper, we propose a novel system named Disp R-CNN for 3D object detection from stereo images.
In this paper, we consider the problem of automatically reconstructing garment and body shapes from a single near-front-view RGB image.
In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V).
Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in object localization.
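The second stage moves each contour vertex by a regressed 2D offset and repeats this deform-and-refine loop a few times. The sketch below only illustrates that generic loop with a hypothetical offset predictor and an elliptical initial contour; it is not deep snake's actual network or initialization scheme.

    import numpy as np

    def init_contour(box, n_vertices=128):
        """Sample an initial contour (here: an ellipse inscribed in the proposal box)."""
        x0, y0, x1, y1 = box
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        a, b = (x1 - x0) / 2.0, (y1 - y0) / 2.0
        theta = np.linspace(0.0, 2.0 * np.pi, n_vertices, endpoint=False)
        return np.stack([cx + a * np.cos(theta), cy + b * np.sin(theta)], axis=1)

    def refine_contour(contour, predict_offsets, n_iters=3):
        """Iteratively move each vertex by a predicted 2D offset (deform-and-refine)."""
        for _ in range(n_iters):
            contour = contour + predict_offsets(contour)  # (N, 2) offsets per vertex
        return contour

    # predict_offsets would be a learned network reading image features at each
    # vertex; a zero predictor keeps this sketch self-contained.
    contour = init_contour((10, 20, 110, 90))
    contour = refine_contour(contour, predict_offsets=lambda c: np.zeros_like(c))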
Instead of feature pooling, we use group convolutions to exploit underlying structures of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations.
Most existing methods directly train a network to learn a mapping from sparse depth inputs to dense depth maps, which makes it difficult to exploit 3D geometric constraints and to handle practical sensor noise.
This paper addresses the problem of 3D pose estimation for multiple people in a few calibrated camera views.
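Once 2D joint detections have been associated across the calibrated views, lifting each joint to 3D reduces to standard multi-view triangulation. A minimal direct linear transform (DLT) sketch for one joint is given below; the cross-view association and person grouping, which are the substantive parts of the problem, are not shown.

    import numpy as np

    def triangulate_point(projections, points_2d):
        """DLT triangulation of one joint.
        projections: list of (3, 4) camera projection matrices;
        points_2d: list of (u, v) detections of the same joint in each view."""
        A = []
        for P, (u, v) in zip(projections, points_2d):
            A.append(u * P[2] - P[0])
            A.append(v * P[2] - P[1])
        _, _, vh = np.linalg.svd(np.asarray(A))
        X = vh[-1]
        return X[:3] / X[3]  # homogeneous -> Euclidean

    # Example with two assumed cameras: identity view and a camera shifted along x.
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
    X_true = np.array([0.2, 0.1, 4.0])
    uv1 = P1 @ np.append(X_true, 1.0); uv1 = uv1[:2] / uv1[2]
    uv2 = P2 @ np.append(X_true, 1.0); uv2 = uv2[:2] / uv2[2]
    print(triangulate_point([P1, P2], [uv1, uv2]))  # ~ [0.2, 0.1, 4.0]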
We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation.
However, jointly using visual and inertial measurements to optimize SLAM objective functions is a problem of high computational complexity.
In this paper, we present RKD-SLAM, a robust keyframe-based dense SLAM approach for an RGB-D camera that handles fast motion and dense loop closure robustly, and runs without time limitation in a moderately sized scene.
Our framework consists of two steps: solving the feature `dropout' problem that arises when indistinctive structures, noise, or large image distortion are present, and rapidly recognizing and joining common features located in different subsequences.