Most existing methods for this task rely heavily on convolutional neural networks, which, however, struggle to capture long-range dependencies between entities in the language expression and are not flexible enough to model interactions between the two modalities.
Improving the generalization capability of Deep Neural Networks (DNNs) is critical for their practical use and has been a longstanding challenge.
This paper presents ActiveMLP, a general MLP-like backbone for computer vision.
Ranked #33 on Object Detection on COCO minival
For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance.
In this paper, to address more practical scenarios, we propose a new task, Lifelong Unsupervised Domain Adaptive (LUDA) person ReID.
In this paper, we propose a novel Confounder Identification-free Causal Visual Feature Learning (CICF) method, which obviates the need for identifying confounders.
Skeleton data is low-dimensional, carries valuable motion information, and is widely explored in human action recognition.
We also present a method which injects the style representation of the web-crawled data into the source domain on-the-fly during training, which enables the network to experience images of diverse styles with reliable labels for effective training.
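The on-the-fly style injection described above can be sketched as an AdaIN-style re-normalization, in which a content feature is shifted to the mean and standard deviation of a style feature while its source-domain label stays attached. The function name and the AdaIN formulation are illustrative assumptions, not the paper's exact module:

```python
def inject_style(content, style, eps=1e-5):
    """Re-normalize a content feature vector to the mean/std of a
    web-crawled style feature (AdaIN-style sketch; an assumption about
    the mechanism, not the paper's exact formulation).  The source
    label of `content` is unchanged by the transform."""
    def stats(v):
        m = sum(v) / len(v)
        var = sum((x - m) ** 2 for x in v) / len(v)
        return m, (var + eps) ** 0.5
    mc, sc = stats(content)
    ms, ss = stats(style)
    # whiten with content statistics, then color with style statistics
    return [(x - mc) / sc * ss + ms for x in content]
```

Because only first- and second-order statistics are swapped, the content's identity-relevant structure is preserved while its "style" (channel statistics) follows the crawled image.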
Occluded person re-identification (ReID) aims to match person images under occlusion.
Unsupervised domain adaptive classification aims to improve the classification performance on an unlabeled target domain.
In this work, we propose a novel method, dubbed PlayVirtual, which augments cycle-consistent virtual trajectories to enhance the data efficiency for RL feature representation learning.
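The cycle-consistency idea behind virtual trajectories can be sketched as follows: roll a latent state forward through a (learned) dynamics model along a sequence of actions, roll it back with a backward model, and penalize the distance to the starting state. `forward` and `backward` here are placeholders for learned models, and the squared-error form is an illustrative assumption:

```python
def cycle_consistency_loss(s0, actions, forward, backward):
    """Cycle-consistent virtual trajectory (sketch): predict forward
    through `actions`, then backward through them in reverse, and
    measure how far the state drifts from where it started."""
    s = s0
    for a in actions:
        s = forward(s, a)      # forward dynamics prediction
    for a in reversed(actions):
        s = backward(s, a)     # backward dynamics prediction
    return sum((x - y) ** 2 for x, y in zip(s, s0))
```

A pair of exactly inverse models yields zero loss; training pushes the learned forward/backward models toward this consistency, which regularizes the feature representation without extra environment interaction.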
For unsupervised domain adaptation (UDA), to alleviate the effect of domain shift, many approaches align the source and target domains in the feature space by adversarial learning or by explicitly aligning their statistics.
Each recomposed feature, obtained from the domain-invariant feature (which enables a reliable inheritance of identity) and an enhancement from a domain-specific feature (which enables the approximation of real distributions), is thus an "ideal" augmentation.
Many unsupervised domain adaptation (UDA) methods exploit domain adversarial training to align the features to reduce domain gap, where a feature extractor is trained to fool a domain discriminator in order to have aligned feature distributions.
Domain generalization deals with a challenging setting where one or several different but related domain(s) are given, and the goal is to learn a model that can generalize to an unseen test domain.
Vehicle Re-Identification (V-ReID) is a critical task that associates the same vehicle across images from different camera viewpoints.
Ranked #1 on Vehicle Re-Identification on VeRi-Wild Large
In this paper, we design a novel Style Normalization and Restitution module (SNR) to simultaneously ensure both high generalization and discrimination capability of the networks.
Based on this finding, we propose to exploit the uncertainty (measured by consistency levels) to evaluate the reliability of the pseudo-label of a sample and incorporate the uncertainty to re-weight its contribution within various ReID losses, including the identity (ID) classification loss per sample, the triplet loss, and the contrastive loss.
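The uncertainty-based re-weighting above can be sketched as scaling each sample's loss by a certainty weight derived from its pseudo-label consistency. The exponential weighting and the `temperature` parameter are illustrative assumptions, not the paper's exact formulation:

```python
import math

def uncertainty_weighted_loss(losses, consistencies, temperature=1.0):
    """Re-weight per-sample ReID losses by pseudo-label reliability
    (sketch).  `consistencies` lie in [0, 1]: higher means the sample's
    pseudo-label is more stable, so it gets a larger weight; uncertain
    samples contribute less to the total loss."""
    weights = [math.exp((c - 1.0) / temperature) for c in consistencies]
    total = sum(w * l for w, l in zip(weights, losses))
    return total / len(losses)
```

Fully consistent samples (`c = 1`) keep weight 1, so the scheme reduces to the plain loss when all pseudo-labels are reliable; the same weights can multiply the ID classification, triplet, and contrastive terms.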
In this work, we propose Uncertainty-Aware Few-Shot framework for image classification by modeling uncertainty of the similarities of query-support pairs and performing uncertainty-aware optimization.
To ensure high discrimination, we propose a Feature Restoration (FR) operation to distill task-relevant features from the residual information and use them to compensate for the aligned features.
Ranked #49 on Domain Generalization on PACS
There is a lack of loss designs that enable the joint optimization of multiple instances (of multiple classes) within per-query optimization for person ReID.
To address this problem, we introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view.
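A simplified form of the GDS constraint can be written as a hinge on the gap between the means of the two distance distributions; the full constraint also involves their variances, so this mean-only version is a sketch:

```python
def gds_loss(pos_dists, neg_dists, margin=1.0):
    """Global distance-distributions separation (mean-only sketch):
    push the mean of the negative-pair distance distribution above the
    mean of the positive-pair distribution by at least `margin`,
    encouraging a clear global separation of the two distributions."""
    mu_p = sum(pos_dists) / len(pos_dists)
    mu_n = sum(neg_dists) / len(neg_dists)
    return max(0.0, margin - (mu_n - mu_p))
```

Unlike a per-triplet margin, the penalty here is computed over distribution statistics, so it enforces separation from a global view rather than sample by sample.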
Existing fully-supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps.
Ranked #8 on Unsupervised Domain Adaptation on Market to Duke
In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), to delicately aggregate spatio-temporal features into a discriminative video-level feature representation.
To exploit such flexible and comprehensive information, we propose a semi-supervised Feature Pyramidal Correlation and Residual Reconstruction Network (FPCR-Net) for optical flow estimation from frame pairs.
To the best of our knowledge, we are the first to make use of multiple shots of an object in a teacher-student learning manner to effectively boost single-image-based re-id.
For an RNN block, an EleAttG is used to adaptively modulate the input by assigning a different level of importance, i.e., attention, to each element/dimension of the input.
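The element-wise modulation can be sketched as a sigmoid gate producing one attention value per input dimension, which then scales the input. The gate parameters `W`, `b` are illustrative; in the paper the gate is also conditioned on the RNN's previous hidden state, which this sketch omits:

```python
import math

def eleattg(x, W, b):
    """Element-wise Attention Gate (sketch): a = sigmoid(W x + b),
    then modulate the RNN input element-wise as x' = a * x, so each
    dimension of the input receives its own attention level."""
    z = [sum(wi * xi for wi, xi in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    a = [1.0 / (1.0 + math.exp(-zi)) for zi in z]   # per-dimension attention
    return [ai * xi for ai, xi in zip(a, x)]
```

With zero weights and biases the gate outputs 0.5 everywhere, i.e., a uniform down-scaling; training moves the gate toward emphasizing informative dimensions and suppressing irrelevant ones.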
Ranked #3 on Skeleton Based Action Recognition on SYSU 3D
Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantically aligned full texture image.
For person re-identification (re-id), attention mechanisms have become attractive as they aim at strengthening discriminative features and suppressing irrelevant ones, which matches well with the key of re-id, i.e., discriminative feature learning.
We further explore more powerful representations by integrating language prior with the visual context in the transformation for the scene graph generation.
Skeleton-based human action recognition has attracted great interest thanks to the easy accessibility of the human skeleton data.
Ranked #1 on Skeleton Based Action Recognition on SYSU 3D
The diversity of capture viewpoints and the flexibility of human poses, however, pose significant challenges.
We propose a video level 2D feature representation by transforming the convolutional features of all frames to a 2D feature map, referred to as VideoMap.
Ranked #46 on Action Recognition on UCF101
We propose adding a simple yet effective Element-wise Attention Gate (EleAttG) to an RNN block (e.g., all RNN neurons in a network layer), empowering the RNN neurons with an attention capability.
Ranked #70 on Skeleton Based Action Recognition on NTU RGB+D
In order to alleviate the effects of view variations, this paper introduces a novel view adaptation scheme, which automatically determines the virtual observation viewpoints in a learning-based, data-driven manner.
Ranked #1 on Skeleton Based Action Recognition on UWA3D
We present a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation.
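The first of the two stages can be sketched as expressing joints relative to a root joint and dividing by an overall body scale; the function name, root-joint choice, and max-extent scale here are illustrative assumptions, and the limb-normalization stage is not shown:

```python
def normalize_body(joints, root_idx=0):
    """Human body normalization (sketch): translate 2-D joints so the
    root joint sits at the origin, then divide by the body's extent so
    relative joint locations fall in a compact range, making the
    spatial model easier to learn."""
    rx, ry = joints[root_idx]
    rel = [(x - rx, y - ry) for x, y in joints]
    scale = max(max(abs(x), abs(y)) for x, y in rel) or 1.0
    return [(x / scale, y / scale) for x, y in rel]
```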
Rather than re-positioning the skeletons based on a human-defined prior criterion, we design a view adaptive recurrent neural network (RNN) with LSTM architecture, which enables the network itself to adapt to the most suitable observation viewpoints in an end-to-end manner.
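Re-observing a skeleton from a virtual viewpoint amounts to applying a rotation to the 3-D joints; in the view adaptive network the rotation angles are regressed by an LSTM branch rather than fixed. This sketch only applies the rotation for given angles (around the X and Y axes, an assumed parameterization):

```python
import math

def view_adapt(joints, alpha, beta):
    """Re-observe 3-D skeleton joints from a virtual viewpoint given by
    rotation angles `alpha` (around X) and `beta` (around Y); in the
    view adaptive RNN these angles are predicted per frame by a
    learned subnetwork (sketch of the transform only)."""
    sa, ca = math.sin(alpha), math.cos(alpha)
    sb, cb = math.sin(beta), math.cos(beta)
    out = []
    for x, y, z in joints:
        y, z = ca * y - sa * z, sa * y + ca * z   # rotate around X axis
        x, z = cb * x + sb * z, -sb * x + cb * z  # rotate around Y axis
        out.append((x, y, z))
    return out
```

Because the transform is differentiable in the angles, the viewpoint-prediction branch can be trained end to end with the recognition loss.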
Ranked #6 on Skeleton Based Action Recognition on SYSU 3D
In this work, we propose an end-to-end spatial and temporal attention model for human action recognition from skeleton data.
Ranked #80 on Skeleton Based Action Recognition on NTU RGB+D
In this paper, we study the problem of online action detection from streaming skeleton data.
Skeleton based action recognition distinguishes human actions using the trajectories of skeleton joints, which provide a very good representation for describing actions.