This paper introduces Ranking Info Noise Contrastive Estimation (RINCE), a new member in the family of InfoNCE losses that preserves a ranked ordering of positive samples.
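To make the idea concrete, here is a minimal toy sketch of a contrastive loss that preserves a ranked ordering of positives. This is an illustration of the general principle only, not the paper's RINCE formulation; the function name, the rank-wise denominator construction, and the temperature value are assumptions.

```python
import numpy as np

def ranked_info_nce(sim_pos, sim_neg, tau=0.1):
    """Toy ranked contrastive loss for a single anchor.

    sim_pos: similarities to positives, ordered from highest rank
             (most relevant) to lowest.
    sim_neg: similarities to negatives.

    For each rank i, the positive of rank i is contrasted against all
    negatives plus all *lower-ranked* positives, so minimizing the loss
    encourages sim_pos[0] > sim_pos[1] > ... > every negative.
    """
    loss = 0.0
    for i, s in enumerate(sim_pos):
        # denominator: this positive, lower-ranked positives, negatives
        denom = np.exp(np.concatenate(([s], sim_pos[i + 1:], sim_neg)) / tau).sum()
        loss += -np.log(np.exp(s / tau) / denom)
    return loss / len(sim_pos)
```

A correctly ordered set of positives yields a lower loss than the same similarities in reversed rank order, which is the ranking behavior a plain InfoNCE loss (one undifferentiated positive set) cannot express.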
Our proposed module improves the SOTA by reducing the computational cost (GFLOPs) by 2x while preserving the accuracy of SOTA models on ImageNet, Kinetics-400, and Kinetics-600 datasets.
Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras.
While recurrent neural networks (RNNs) demonstrate outstanding capabilities in future video frame prediction, they model dynamics in a discrete time space and sequentially go through all frames until the desired future temporal step is reached.
We argue that a single representation to capture both types of features is sub-optimal, and propose to decompose the representation space into stationary and non-stationary features via contrastive learning from long and short views, i.e., long video sequences and their shorter sub-sequences.
We apply FIFA on top of state-of-the-art approaches for weakly supervised action segmentation and alignment as well as fully supervised action segmentation.
Our method learns to predict the motions that occur during the nominal execution of a task, including camera and robot body motion.
This poses a problem for domains such as autonomous driving, where the reaction time is crucial.
Ranked #5 on Action Anticipation on EPIC-KITCHENS-100 (test)
We show that SIV-GAN successfully deals with a new challenging task of learning from a single video, for which prior GAN models fail to achieve synthesis of both high quality and diversity.
To demonstrate the effectiveness of timestamp supervision, we propose an approach to train a segmentation model using only timestamp annotations.
Ranked #2 on Weakly Supervised Action Localization on GTEA
In this work we propose an approach for estimating 3D human poses of multiple people from a set of calibrated cameras.
In this paper, we propose an approach that spatially localizes the activities in a video frame where each person can perform multiple activities at the same time.
Since computing the probabilities for the full power set becomes intractable as the number of action classes increases, we assign an action set to each detected person under the constraint that the assignment is consistent with the annotation of the video clip.
Trajectory forecasting is a crucial step for autonomous vehicles and mobile robots in order to navigate and interact safely.
By providing stronger supervision to the discriminator as well as to the generator through spatially- and semantically-aware discriminator feedback, we are able to synthesize images of higher fidelity with better alignment to their input label maps, making the use of the perceptual loss superfluous.
While the GFLOPs of a 3D CNN can be decreased by reducing the temporal feature resolution within the network, there is no setting that is optimal for all input clips.
For that reason, we present PoseTrackReID, a large-scale dataset for multi-person pose tracking and video-based person re-ID.
This paper introduces a novel method for self-supervised video representation learning via feature prediction.
To this end, the network first refines the poses before they are further processed to recognize the action.
With the success of deep learning methods in analyzing activities in videos, attention has recently shifted towards anticipating future activities.
Real-time semantic segmentation of LiDAR data is crucial for autonomously driving vehicles, which are usually equipped with an embedded platform and have limited computational resources.
Ranked #2 on Real-Time 3D Semantic Segmentation on SemanticKITTI
Codec Avatars are a recent class of learned, photorealistic face models that accurately represent the geometry and texture of a person in 3D (i.e., for virtual reality), and are almost indistinguishable from video.
Many point-based semantic segmentation methods have been designed for indoor scenarios, but they struggle if they are applied to point clouds that are captured by a LiDAR sensor in an outdoor environment.
Despite the capabilities of these approaches in capturing temporal dependencies, their predictions suffer from over-segmentation errors.
Ranked #12 on Action Segmentation on GTEA
Instead of training the network for estimating keypoint correspondences on video data, it is trained on large-scale image datasets for human pose estimation using self-supervision.
On unlabeled images, we predict a probability map for latent classes and use it as a supervision signal to learn semantic segmentation.
no code implementations • 13 Dec 2019 • Julian Tanke, Oh-Hun Kwon, Patrick Stotko, Radu Alexandru Rosu, Michael Weinmann, Hassan Errami, Sven Behnke, Maren Bennewitz, Reinhard Klein, Andreas Weber, Angela Yao, Juergen Gall
The key prerequisite for accessing the huge potential of current machine learning techniques is the availability of large databases that capture the complex relations of interest.
The estimation of viewpoints and keypoints effectively enhances object detection methods by extracting valuable traits of the object instances.
We exploit this connection by first anticipating symbolic labels and then generate human motion, conditioned on the human motion input sequence as well as on the forecast labels.
To this end, we extract the knowledge of the trained teacher network for the source modality and transfer it to a small ensemble of student networks for the target modality.
Since this assumption is violated under real-world conditions, we propose an approach for open set domain adaptation where the target domain contains instances of categories that are not present in the source domain.
Action recognition has so far mainly focused on classifying hand-selected, pre-clipped actions, reaching impressive results in this setting.
Action recognition has become a rapidly developing research field within the last decade.
They do not require additional curation, as is the case for the clean class tags used by current weakly supervised approaches, and they provide textual context for the classes present in an image.
We address the task of 3D semantic scene completion, i.e., given a single depth image, we predict the semantic labels and occupancy of voxels in a 3D grid representing the scene.
We propose a joint model of human joint detection and association for 2D multi-person pose estimation (MPPE).
Action segmentation is the task of predicting the actions for each frame of a video.
Despite the relevance of semantic scene understanding for this application, there is a lack of a large dataset for this task which is based on an automotive LiDAR.
Ranked #20 on 3D Semantic Segmentation on SemanticKITTI
Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics.
Ranked #15 on Action Segmentation on GTEA
First, we represent the data using a spatio-temporal tensor of 3D skeleton coordinates which allows formulating the prediction problem as an inpainting one, for which GANs work particularly well.
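The tensor representation described above can be sketched in a few lines. The shapes, axis order, and masking scheme below are illustrative assumptions, not the paper's exact layout; the point is how zeroing the future block of a spatio-temporal skeleton tensor turns prediction into an inpainting problem.

```python
import numpy as np

# Toy shapes (assumptions): joints, observed frames, future frames
J, T_OBS, T_PRED = 17, 40, 10
seq = np.random.randn(3, J, T_OBS + T_PRED)  # (xyz, joints, time) tensor

# Inpainting formulation: zero out the future block and keep a mask
# marking the region a generator would have to fill in.
mask = np.ones_like(seq)
mask[:, :, T_OBS:] = 0.0
inp = seq * mask          # observed past intact, future "hole" to inpaint
```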
Weakly supervised semantic segmentation has been a subject of increased interest due to the scarcity of fully annotated images.
The idea of compressed sensing is to exploit representations in suitable (overcomplete) dictionaries that allow recovering signals far beyond the Nyquist rate, provided that they admit a sparse representation in the respective dictionary.
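The recovery problem sketched above is commonly written as a convex program (generic symbols: $y$ the measurements, $D$ the dictionary, $x$ the coefficient vector):

```latex
\min_{x} \; \|x\|_1 \quad \text{subject to} \quad y = Dx,
```

where the $\ell_1$ norm serves as the standard convex surrogate for the sparsity measure $\|x\|_0$, making exact recovery tractable when $x$ is sufficiently sparse in $D$.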
Our experiments show that adding STC blocks to current state-of-the-art architectures outperforms the state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets.
Video learning is an important task in computer vision and has experienced increasing interest over the recent years.
The general formulation of our temporal network allows it to rely on any multi-person pose estimation approach as the spatial network.
In this work, we propose a two-stream approach that leverages depth information and semantic information, which is inferred from the RGB image, for this task.
Ranked #6 on 3D Semantic Scene Completion on SemanticKITTI
In this paper, we propose a structural recurrent neural network (SRNN) that uses a series of interconnected RNNs to jointly capture the actions of individuals, their interactions, as well as the group activity.
We question the dominant role of real-world training images in the field of material classification by investigating whether synthesized data can generalise more effectively than real-world data.
In this work, we aim to further advance the state of the art by establishing "PoseTrack", a new large-scale benchmark for video-based human pose estimation and articulated tracking, and bringing together the community of researchers working on visual human analysis.
Ranked #3 on Multi-Person Pose Estimation on PoseTrack2017
The approach learns a mapping from the source to the target domain by jointly solving an assignment problem that labels those target instances that potentially belong to the categories of interest present in the source dataset.
Localizing functional regions of objects or affordances is an important aspect of scene understanding and relevant for many robotics applications.
In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition.
Action detection and temporal segmentation of actions in videos are topics of increasing interest.
To this end, we first convert the motion capture data into a normalized 2D pose space, and separately learn a 2D pose estimation model from the image data.
Ranked #28 on Monocular 3D Human Pose Estimation on Human3.6M
In this work, we propose a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera.
In this work, we propose a recurrent neural network that is equivalent to the traditional bag-of-words approach but enables discriminative training.
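To illustrate why a bag-of-words encoding can be expressed as a trainable network, here is a minimal soft bag-of-words layer. This is a generic differentiable relaxation, not the paper's recurrent formulation; the function name and the softmax-over-distances assignment are assumptions.

```python
import numpy as np

def soft_bow(features, codebook, beta=1.0):
    """Soft bag-of-words encoding of a feature sequence.

    Each frame feature is softly assigned to the codebook entries
    (a softmax over negative squared distances), and the assignments
    are averaged over time.  The result is a differentiable histogram,
    so a classifier on top can be trained discriminatively end-to-end,
    unlike the hard-assignment histogram of the classic pipeline.
    """
    # squared distances between every frame (T) and every codeword (K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    d2 -= d2.min(axis=1, keepdims=True)   # stabilize the exponentials
    a = np.exp(-beta * d2)
    a /= a.sum(axis=1, keepdims=True)     # soft assignment per frame
    return a.mean(axis=0)                 # temporal pooling -> histogram
```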
We conduct a rigorous evaluation on a common ground by combining this dataset with different state-of-the-art deep convolutional architectures in order to achieve recognition of human rights violations.
Determining the material category of a surface from an image is a demanding task in perception that is drawing increasing attention.
In this work, we introduce the challenging problem of joint multi-person pose estimation and tracking of an unknown number of persons in unconstrained videos.
Ranked #1 on Pose Tracking on Multi-Person PoseTrack
Our system is based on the idea that, given a sequence of input data and a transcript, i.e., a list of the actions in the order they occur in the video, it is possible to infer the actions within the video stream, and thus learn the related action models without the need for any frame-based annotation.
Although commercial and open-source software exist to reconstruct a static object from a sequence recorded with an RGB-D sensor, there is a lack of tools that build rigged models of articulated objects that deform realistically and can be used for tracking or animation.
To this end, we consider multi-person pose estimation as a joint-to-person association problem.
Ranked #8 on Multi-Person Pose Estimation on MPII Multi-Person
While current approaches to action recognition on pre-segmented video clips already achieve high accuracies, temporal action detection still falls far short of comparable results.
In this work we propose to utilize information about human actions to improve pose estimation in monocular videos.
Ranked #5 on Pose Estimation on UPenn Action
To integrate both sources, we propose a dual-source approach that combines 2D pose estimation with efficient and robust 3D pose retrieval.
Ranked #9 on 3D Human Pose Estimation on HumanEva-I
We describe an end-to-end generative approach for the segmentation and recognition of human activities.
Through extensive system evaluations, we demonstrate that combining compact video representations based on Fisher Vectors with HMM-based modeling yields very significant gains in accuracy, and that, when properly trained with sufficient training samples, structured temporal models outperform unstructured bag-of-words models by a large margin on the tested performance metric.
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors.
Compared to approaches that disregard the extra coarse labeled data, we achieve a relative improvement in subcategory classification accuracy of up to 22% in our large-scale image classification experiments.
NCMFs not only outperform conventional random forests, but are also well suited for integrating new classes.
The second layer takes the estimated class distributions of the first one into account and is thereby able to predict joint locations by modeling the interdependence and co-occurrence of the parts.
A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data.