Then we present a supervised segmentation network and an unsupervised reconstruction network to learn the characteristics of 3D point clouds.
Entity-aware image captioning aims to describe named entities and events related to the image by utilizing the background knowledge in the associated article.
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
In this paper, we study the Salient Object Ranking (SOR) task, which aims to assign a rank to each detected object according to its visual saliency.
While several cross-domain sequential recommendation models have been proposed to leverage information from a source domain to improve CTR predictions in a target domain, they do not take into account the bidirectional latent relations of user preferences across source-target domain pairs.
Classical recommender system methods typically face the filter bubble problem, in which users receive recommendations only for items they are already familiar with, leaving them bored and dissatisfied.
In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% J&F and 70 FPS on the DAVIS 2017 validation dataset, leading all present solutions in overall accuracy and speed performance.
Can our video understanding systems perceive objects when heavy occlusion exists in a scene?
Ranked #2 on Video Instance Segmentation on OVIS validation
Alongside the prevalence of mobile videos, the general public leans towards consuming vertical videos on hand-held devices.
Current developments in temporal event or action localization usually target actions captured by a single camera.
Ranked #2 on Temporal Action Localization on THUMOS’14 (using extra training data)
Meanwhile, such applications usually require modeling the intrinsic clusters in high-dimensional data, which often displays heterogeneous statistical patterns, as the patterns of different clusters may appear in different dimensions.
Finally, we combine the SRG algorithm with our improved CNN using a refinement method called SRG-Net to conduct the segmentation tasks on the terracotta warriors.
On the other hand, there is a large semantic gap between seen and unseen classes in the existing multi-label classification datasets.
Click-through rate (CTR) prediction is an essential task in industrial applications such as video recommendation.
To tackle this critical problem, we propose an attribute-aware pedestrian detector to explicitly model people's semantic attributes in a high-level feature detection fashion.
Inspired by the widely-used structural similarity (SSIM) index in image quality assessment, we use the linear correlation between two images to quantify their structural similarity.
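As a rough illustration of the idea above, the linear correlation between two images can be computed as the Pearson correlation of their pixel values. This is a minimal sketch, not the authors' implementation: the function name `linear_correlation`, the nested-list grayscale image format, and the choice to return 0.0 for a constant image are all assumptions made for the example.

```python
import math


def linear_correlation(img_a, img_b):
    """Pearson correlation between two equally sized grayscale images.

    Images are given as nested lists of pixel intensities (an assumption
    made for this sketch; any flat sequence of numbers would do).
    """
    a = [float(p) for row in img_a for p in row]
    b = [float(p) for row in img_b for p in row]
    if not a or len(a) != len(b):
        raise ValueError("images must be non-empty and have the same size")

    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)

    # Unnormalized covariance and variances (the 1/N factors cancel).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)

    # Assumption: correlation is undefined for a constant image; return 0.0.
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


image = [[0, 1], [2, 3]]
brighter = [[2 * p + 10 for p in row] for row in image]  # linear change
inverted = [[3 - p for p in row] for row in image]       # reversed contrast

same_structure = linear_correlation(image, brighter)
opposite_structure = linear_correlation(image, inverted)
```

Because the measure is invariant to a linear change in brightness and contrast, `image` and `brighter` score close to 1, while `inverted` scores close to -1.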
An increasing number of well-trained deep networks have been released online by researchers and developers, enabling the community to reuse them in a plug-and-play way without accessing the training annotations.
In this paper, we study the multi-objective bandits (MOB) problem, where a learner repeatedly selects one arm to play and then receives a reward vector consisting of multiple objectives.
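The interaction protocol described above can be sketched as a simple loop. This is only an illustration of the problem setting, not the paper's algorithm: the function `play_mob`, the Bernoulli reward model, and the uniformly random arm choice are all assumptions standing in for a real learner and environment.

```python
import random


def play_mob(arm_means, horizon, seed=0):
    """Simulate the multi-objective bandit interaction loop.

    arm_means[i][j] is the (hypothetical) mean reward of arm i on
    objective j; rewards are drawn as independent Bernoulli variables.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    n_objectives = len(arm_means[0])

    pulls = [0] * n_arms
    totals = [[0.0] * n_objectives for _ in range(n_arms)]

    for _ in range(horizon):
        # Placeholder policy: choose an arm uniformly at random.
        arm = rng.randrange(n_arms)

        # The learner receives a reward vector, one entry per objective.
        reward = [1.0 if rng.random() < p else 0.0 for p in arm_means[arm]]

        pulls[arm] += 1
        for j, r in enumerate(reward):
            totals[arm][j] += r

    return pulls, totals


# Two arms, two objectives: each arm is better on a different objective.
pulls, totals = play_mob([[0.9, 0.1], [0.2, 0.8]], horizon=100)
```

The example arms are chosen so that neither is best on both objectives, which is what distinguishes this setting from a single-objective bandit with a scalar reward.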