Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables, enabling the accumulation of emotion-related information throughout the conversation.
Unlike most previous HOI methods that focus on learning better human-object features, we propose a novel and complementary approach called category query learning.
Ranked #8 on Human-Object Interaction Detection on HICO-DET
Moreover, the joint learning of unified query representation can greatly improve the detection performance of DETR.
Ranked #4 on Object Detection on COCO minival (AP75 metric)
Temporal modeling of objects is a key challenge in multiple object tracking (MOT).
Ranked #12 on Multi-Object Tracking on MOT16
We argue that such flexibility is also important for deep metric learning, because different visual concepts indeed correspond to different semantic scales.
Ranked #2 on Metric Learning on DyML-Animal
We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner.
Ranked #29 on Human-Object Interaction Detection on HICO-DET (using extra training data)
With ten teacher-student combinations on six datasets, PAD promotes the performance of existing distillation methods and outperforms recent state-of-the-art methods.
To tackle these three naturally different dimensions, we propose a general framework that defines pruning as seeking the best pruning vector (i.e., the numerical values of layer-wise channel number, spatial size, and depth) and constructs a unique mapping from the pruning vector to the pruned network structures.
Comprehensive experiments show that ABS can dramatically enhance existing NAS approaches by providing a promising shrunk search space.
This work applies data uncertainty learning to face recognition, such that the feature (mean) and uncertainty (variance) are learnt simultaneously, for the first time.
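As a rough illustration of what learning a feature (mean) together with its uncertainty (variance) can look like, the sketch below draws a stochastic embedding from a per-dimension Gaussian. The reparameterized form `mean + sigma * eps`, the log-variance parameterization, and the function name `sample_embedding` are assumptions made for this example; the snippet above does not state how the paper combines the two quantities.

```python
import math
import random


def sample_embedding(mean, log_var, rng):
    """Draw one stochastic embedding from N(mean, diag(exp(log_var))).

    Assumption: the variance is stored as a log-variance so that it stays
    positive, and the sample is written as mean + sigma * eps with
    eps ~ N(0, 1). This is an illustrative choice, not the paper's code.
    """
    sample = []
    for m, lv in zip(mean, log_var):
        sigma = math.exp(0.5 * lv)   # standard deviation from log-variance
        eps = rng.gauss(0.0, 1.0)    # unit Gaussian noise
        sample.append(m + sigma * eps)
    return sample


rng = random.Random(0)
mean = [0.2, -0.5, 1.0]
confident = sample_embedding(mean, [-80.0, -80.0, -80.0], rng)  # near-zero variance
uncertain = sample_embedding(mean, [0.0, 0.0, 0.0], rng)        # unit variance
```

With a very small variance the sample collapses onto the mean, while a larger variance spreads samples around it, which is one way a model could express low confidence in a noisy input.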
This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity $s_p$ and minimize the between-class similarity $s_n$.
Ranked #1 on Face Verification on IJB-C (training dataset metric)
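The pair similarity viewpoint can be made concrete with a small sketch: collect within-class similarities $s_p$ and between-class similarities $s_n$ from labeled features, then penalize every pair where $s_n$ is not well below $s_p$. The use of cosine similarity, the softplus penalty on $(s_n - s_p)$, and the helper names below are assumptions chosen for illustration; this is a plain baseline objective, not the specific loss proposed in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similarities(features, labels):
    """Split all pairwise similarities into within-class (s_p) and between-class (s_n)."""
    s_p, s_n = [], []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            sim = cosine(features[i], features[j])
            if labels[i] == labels[j]:
                s_p.append(sim)
            else:
                s_n.append(sim)
    return s_p, s_n


def pair_loss(s_p, s_n):
    """Mean softplus of (s_n - s_p) over all (s_n, s_p) combinations.

    Assumed baseline: the loss shrinks as s_p grows and as s_n falls.
    """
    terms = [math.log1p(math.exp(sn - sp)) for sn in s_n for sp in s_p]
    return sum(terms) / len(terms) if terms else 0.0


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1]
s_p, s_n = pair_similarities(features, labels)
loss = pair_loss(s_p, s_n)
```

Here the two class-0 features are identical and orthogonal to the class-1 feature, so the only within-class similarity is 1 and both between-class similarities are 0.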
Therefore, many modified normalization techniques have been proposed, which either fail to fully restore the performance of BN or introduce additional nonlinear operations in the inference procedure and greatly increase resource consumption.
Recently, the 3D face reconstruction and face alignment tasks have gradually been combined into one task: 3D dense face alignment.
The estimation of 3D human body pose and shape from a single image has been extensively studied in recent years.
It is easy to train and fast to search.
Ranked #88 on Neural Architecture Search on ImageNet (Accuracy metric)
Existing pose estimation approaches fall into two categories: single-stage and multi-stage methods.
Ranked #1 on Pose Estimation on COCO minival
There has been significant progress in pose estimation and increasing interest in pose tracking in recent years.
Ranked #2 on 2D Human Pose Estimation on JHMDB (2D poses only)
In this paper, we present a lightweight network architecture for video object detection on mobile devices.
While most steps in the modern object detection methods are learnable, the region feature extraction step remains largely hand-crafted, featured by RoI pooling methods.
In this work, we present a novel and effective framework to facilitate object detection with the instance-level segmentation information that is only supervised by bounding box annotation.
Although it has long been believed that modeling relations between objects would help object recognition, there has been no evidence that the idea works in the deep learning era.
State-of-the-art human pose estimation methods are based on heat map representation.
Ranked #23 on Pose Estimation on MPII Human Pose
We propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels in a unified deep neural network with a two-stage cascaded structure.
A central problem is that the structural information in the pose is not well exploited in the previous regression methods.
Ranked #36 on Pose Estimation on MPII Human Pose
The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, and rare poses.
Ranked #22 on Video Object Detection on ImageNet VID
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules.
Ranked #3 on Vessel Detection on Vessel detection Dateset
Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable.
Ranked #9 on Video Semantic Segmentation on Cityscapes val
It inherits all the merits of FCNs for semantic segmentation and instance mask proposal.
Ranked #95 on Instance Segmentation on COCO test-dev
In this work, we propose to directly embed a kinematic object model into deep neural network learning for general articulated object pose estimation.
Ranked #301 on 3D Human Pose Estimation on Human3.6M
For the first time, we show that embedding such a non-linear generative process in deep learning is feasible for hand pose estimation.
We extend the previous 2D cascaded object pose regression work in two aspects so that it works better for 3D articulated objects.
Hierarchical segmentation-based object proposal methods have become an important step in the modern object detection paradigm.
The locality principle guides us to learn a set of highly discriminative local binary features for each facial landmark independently.
However, their use of the boundary prior is simple and fragile, and its integration with other cues is mostly heuristic.
We present a very efficient, highly accurate, “Explicit Shape Regression” approach for face alignment.
Ranked #35 on Face Alignment on WFLW