Monocular 3D object detection is a challenging task due to unreliable depth, resulting in a distinct performance gap between monocular and LiDAR-based approaches.
Modern smart sensor-based energy management systems leverage non-intrusive load monitoring (NILM) to predict and optimize appliance load distribution in real-time.
When people deliver a speech, they naturally move heads, and this rhythmic head motion conveys prosodic information.
However, existing GCN-based ride-hailing demand prediction methods only assign the same importance to different neighbor regions, and maintain a fixed graph structure with static spatial relationships throughout the timeline when extracting the irregular non-Euclidean spatial correlations.
As an economical and healthy mode of shared transportation, Bike Sharing System (BSS) develops quickly in many big cities.
In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies.
In this paper, we propose a confidence segmentation (ConfSeg) module that builds confidence score for each pixel in CAM without introducing additional hyper-parameters.
Weakly Supervised Object Localization (WSOL) methodsusually rely on fully convolutional networks in order to ob-tain class activation maps(CAMs) of targeted labels.
Furthermore, we study multiple modalities including description and transcripts for the purpose of boosting video understanding.