To ensure high discrimination, we propose a Feature Restoration (FR) operation to distill task-relevant features from the residual information and use them to compensate for the aligned features.
Ranked #70 on Domain Generalization on PACS
There is a lack of loss designs that enable the joint optimization of multiple instances (of multiple classes) within per-query optimization for person ReID.
To address this problem, we introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view.
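A GDS-style constraint can be illustrated with a minimal sketch (hypothetical form and names; the paper's exact loss may differ): push the mean of the negative-pair distance distribution above the mean of the positive-pair distribution by a margin, while shrinking the variance of each distribution so the two separate globally.

```python
import numpy as np

def gds_constraint(pos_dists, neg_dists, margin=1.0):
    """Hedged sketch of a global distance-distributions separation
    (GDS) style penalty. pos_dists/neg_dists are the distances of
    positive and negative pairs gathered over the whole batch/dataset."""
    mu_p, mu_n = np.mean(pos_dists), np.mean(neg_dists)
    var_p, var_n = np.var(pos_dists), np.var(neg_dists)
    # hinge on the separation between the two distribution means
    separation = max(0.0, margin - (mu_n - mu_p))
    # also shrink each distribution so they do not overlap
    return separation + var_p + var_n
```

When the negative distances already exceed the positive ones by more than the margin, only the variance terms remain, so the penalty goes to zero as both distributions become tight and well separated.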
Existing fully-supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps.
Ranked #7 on Unsupervised Domain Adaptation on Market to Duke
In contrast to previous efforts, which require establishing cross-view correspondences based on noisy and incomplete 2D pose estimations, we present an end-to-end solution that operates directly in 3D space and therefore avoids making incorrect decisions in the 2D space.
Ranked #5 on 3D Multi-Person Pose Estimation on Panoptic (using extra training data)
Based on the probability space, we further generate new fusion strategies which achieve state-of-the-art performance on four well-known action recognition datasets.
Formulating MOT as multi-task learning of object detection and re-ID in a single network is appealing since it allows joint optimization of the two tasks and offers high computational efficiency.
Ranked #1 on Multi-Object Tracking on 2DMOT15 (using extra training data)
In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), to delicately aggregate spatio-temporal features into a discriminative video-level feature representation.
Then we lift the multi-view 2D poses to the 3D space by an Orientation Regularized Pictorial Structure Model (ORPSM) which jointly minimizes the projection error between the 3D and 2D poses, along with the discrepancy between the 3D pose and IMU orientations.
Ranked #1 on 3D Absolute Human Pose Estimation on Total Capture
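The ORPSM objective described above can be written in a plausible form (notation hypothetical, for illustration only): let $\pi_v$ project a 3D joint into view $v$, $x_{v,j}$ denote the detected 2D location of joint $j$ in view $v$, and $\mathbf{o}_l$ the orientation of limb $l$.

```latex
\min_{J}\;
\underbrace{\sum_{v}\sum_{j}\bigl\lVert \pi_v(J_j) - x_{v,j} \bigr\rVert^2}_{\text{3D--2D projection error}}
\;+\;
\lambda \underbrace{\sum_{l}\Bigl(1 - \bigl\langle \mathbf{o}_l(J),\, \mathbf{o}_l^{\mathrm{IMU}} \bigr\rangle\Bigr)}_{\text{discrepancy with IMU orientations}}
```

The weight $\lambda$ balancing the two terms is an assumption; the paper's exact parameterization may differ.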
To the best of our knowledge, we are the first to make use of multiple shots of an object in a teacher-student learning manner to effectively boost single-image-based re-ID.
Unsupervised depth learning takes the appearance difference between a target view and a view synthesized from its adjacent frame as supervisory signal.
Our experimental evaluation demonstrates that our method achieves results comparable to fully supervised methods on the NYU Depth V2 benchmark.
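The supervisory signal described above can be sketched as a simple photometric loss (a minimal sketch, assuming a mean absolute difference; real systems typically add SSIM and smoothness terms):

```python
import numpy as np

def photometric_loss(target, synthesized):
    """Appearance difference between the target view and the view
    synthesized (warped) from an adjacent frame, used as the
    self-supervised training signal for unsupervised depth learning."""
    target = np.asarray(target, dtype=np.float64)
    synthesized = np.asarray(synthesized, dtype=np.float64)
    return np.mean(np.abs(target - synthesized))
```

If the predicted depth (and ego-motion) is correct, the synthesized view matches the target and the loss vanishes, which is why minimizing it trains depth without ground-truth labels.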
In this paper, we thoroughly investigate the attention mechanisms in an SR model and shed light on how simple and effective improvements on these ideas advance the state of the art.
It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses.
Ranked #6 on 3D Human Pose Estimation on Total Capture
For an RNN block, an EleAttG is used for adaptively modulating the input by assigning different levels of importance, i.e., attention, to each element/dimension of the input.
Ranked #3 on Skeleton Based Action Recognition on SYSU 3D
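The gating described above can be sketched in a few lines (a hedged sketch with hypothetical parameter names `W` and `b`; the paper's exact gate may also condition on the hidden state):

```python
import numpy as np

def ele_att_gate(x, W, b):
    """Element-wise Attention Gate (EleAttG) sketch: a sigmoid gate
    produces one attention value per input dimension, and the gated
    input is what the RNN block actually consumes."""
    a = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # per-element attention in (0, 1)
    return a * x                             # modulated input fed to the RNN
```

With zero-initialized parameters the gate outputs 0.5 everywhere, i.e., a uniform down-weighting; training then learns which dimensions to emphasize or suppress.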
Accordingly, a hybrid network representation is presented which enables us to leverage Variational Dropout so that the approximation of the posterior distribution becomes fully gradient-based and highly efficient.
Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantically aligned full texture image.
An efficient and effective person re-identification (ReID) system relieves users from tedious video watching and accelerates the process of video analysis.
The two stages are connected in series as the input proposals of the FM stage are generated by the CM stage.
For person re-identification (re-id), attention mechanisms have become attractive as they aim at strengthening discriminative features and suppressing irrelevant ones, which matches well the key of re-id, i.e., discriminative feature learning.
We further explore more powerful representations by integrating language prior with the visual context in the transformation for the scene graph generation.
Skeleton-based human action recognition has attracted great interest thanks to the easy accessibility of the human skeleton data.
Ranked #1 on Skeleton Based Action Recognition on SYSU 3D
The diversity of capture viewpoints and the flexibility of human poses, however, remain significant challenges.
Trackers are in general more efficient than detectors but bear the risk of drifting.
We propose a video level 2D feature representation by transforming the convolutional features of all frames to a 2D feature map, referred to as VideoMap.
Ranked #49 on Action Recognition on UCF101
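The VideoMap construction can be sketched as stacking per-frame feature vectors along time (a minimal sketch; the paper's exact layout and any frame reordering may differ):

```python
import numpy as np

def video_map(frame_features):
    """Stack the convolutional feature vector of each frame (one
    D-dim vector per frame) into a single T x D 2D feature map,
    which can then be processed with ordinary 2D convolutions."""
    return np.stack(frame_features, axis=0)  # shape (T, D)
```

The appeal of this representation is that temporal evolution becomes one spatial axis of an image-like map, so standard 2D CNN machinery applies to the whole video at once.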
Recently, Siamese network based trackers have received tremendous interest for their fast tracking speed and high performance.
Ranked #9 on Visual Object Tracking on VOT2017/18
The archetypes generally correspond to the extremal points in the dataset and are learned by requiring them to be convex combinations of the training data.
We propose adding a simple yet effective Element-wise Attention Gate (EleAttG) to an RNN block (e.g., all RNN neurons in a network layer) to empower the RNN neurons with an attention capability.
Ranked #92 on Skeleton Based Action Recognition on NTU RGB+D
Generally, model update is formulated as an online learning problem where a target model is learned over the online training set.
Ranked #1 on Visual Tracking on OTB-2013
Recent attempts use 3D convolutional neural networks (CNNs) to explore spatio-temporal information for human action recognition.
In order to alleviate the effects of view variations, this paper introduces a novel view adaptation scheme, which automatically determines the virtual observation viewpoints in a learning-based, data-driven manner.
Ranked #1 on Skeleton Based Action Recognition on UWA3D
In particular, our method improves results by 8.8% over the static image detector for fast-moving objects.
We present a comprehensive study and evaluation of existing single image dehazing algorithms, using a new large-scale benchmark consisting of both synthetic and real-world hazy images, called REalistic Single Image DEhazing (RESIDE).
We present a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation.
Rather than re-positioning the skeletons based on a human defined prior criterion, we design a view adaptive recurrent neural network (RNN) with LSTM architecture, which enables the network itself to adapt to the most suitable observation viewpoints from end to end.
Ranked #6 on Skeleton Based Action Recognition on SYSU 3D
A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow.
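The identity-mapping structure described above reduces to a one-line computation (a minimal sketch; real residual blocks use convolutions, normalization, and nonlinearities inside the branch):

```python
import numpy as np

def residual_block(x, branch):
    """Residual block sketch: the identity mapping (skip connection)
    bypasses the residual branch F, so the block computes x + F(x)
    and gradients flow unimpeded through the skip path."""
    return x + branch(x)
```

Because the skip path is the identity, stacking many such blocks keeps a direct signal path from input to output, which is what makes very deep residual networks trainable.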
In this work, we propose an end-to-end spatial and temporal attention model for human action recognition from skeleton data.
Ranked #104 on Skeleton Based Action Recognition on NTU RGB+D
With the rapid development of social networks and multimedia technology, customized image and video stylization has been widely used for various social-media applications.
Second, in our suggested fused net, formed by one deep and one shallow base network, the flows of information from the earlier intermediate layer of the deep base network to the output, and from the input to the later intermediate layer of the deep base network, are both improved.
In this paper, we study the problem of online action detection from streaming skeleton data.
Skeleton based action recognition distinguishes human actions using the trajectories of skeleton joints, which provide a very good representation for describing actions.
We propose a novel dual-camera design to acquire 4D high-speed hyperspectral (HSHS) videos with high spatial and spectral resolution.