The crucial problem in vehicle re-identification is to find the same vehicle identity when reviewing this object from cross-view cameras, which sets a higher demand for learning viewpoint-invariant representations.
To construct our framework as well as achieving accurate salient detection results, we propose a Ubiquitous Target Awareness (UTA) network to solve three important challenges in RGB-D SOD task: 1) a depth awareness module to excavate depth information and to mine ambiguous regions via adaptive depth-error weights, 2) a spatial-aware cross-modal interaction and a channel-aware cross-level interaction, exploiting the low-level boundary cues and amplifying high-level salient channels, and 3) a gated multi-scale predictor module to perceive the object saliency in different contextual scales.
Therefore, we propose a Pose-guided inter-and intra-part relational transformer (Pirt) for occluded person Re-Id, which builds part-aware long-term correlations by introducing transformers.
In this paper, we propose a talking face generation method that takes an audio signal as input and a short target video clip as reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal.
To achieve this, we exploit the consistency between the input sparse point cloud and generated dense point cloud for the shapes and rendered images.
Fine-grained object recognition aims to learn effective features that can identify the subtle differences between visually similar objects.
Ranked #13 on Fine-Grained Image Classification on CUB-200-2011
Dealing with multi-agent control in networked systems is one of the biggest challenges in Reinforcement Learning (RL) and limited success has been presented compared to recent deep reinforcement learning in single-agent domain.
APIRec-CST is a deep learning model that combines the API usage with the text information in the source code based on an API Context Graph Network and a Code Token Network that simultaneously learn structural and textual features for API recommendation.
To explore the relationship between music and dance movements, we propose a cross-modal alignment module that focuses on dancing video clips, accompanied on pre-designed music, to learn a system that can judge the consistency between the visual features of pose sequences and the acoustic features of music.
Given base classes with sufficient labeled samples, the target of few-shot classification is to recognize unlabeled samples of novel classes with only a few labeled samples.
To solve this, many recent RGBD-based networks are proposed by adopting the depth map as an independent input and fuse the features with RGB information.
Ranked #1 on RGB-D Salient Object Detection on SSD
By modeling the undesired layer-to-layer misalignments in 3D as continuous random variables in the optical forward model, diffractive networks are trained to maintain their inference accuracy over a large range of misalignments; we term this diffractive network design as vaccinated D2NN (v-D2NN).
Recent years have witnessed increasing attention in cartoon media, powered by the strong demands of industrial applications.
Along with the development of virtual reality (VR), omnidirectional images play an important role in producing multimedia content with immersive experience.