This lets us show that the decay in generalization performance of adversarial training is a result of the model's attempt to fit hard adversarial instances.
The Skinned Multi-Person Linear (SMPL) model can represent a human body by mapping pose and shape parameters to body meshes.
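The SMPL mapping from pose and shape parameters to a mesh can be sketched in miniature: shape blend shapes deform a rest-pose template, and linear blend skinning then poses the result. This is an illustrative toy, not the real SMPL model (which adds pose blend shapes and a kinematic tree); all array sizes here are made up.

```python
import numpy as np

def smpl_like_mesh(template, shape_dirs, betas, joint_transforms, skin_weights):
    """Minimal SMPL-style forward pass (illustrative only).

    template:         (V, 3) rest-pose vertices
    shape_dirs:       (V, 3, S) shape blend-shape basis
    betas:            (S,) shape coefficients
    joint_transforms: (J, 4, 4) rigid transform per joint (the pose)
    skin_weights:     (V, J) linear-blend-skinning weights, rows sum to 1
    """
    # 1. Shape blend shapes deform the template in the rest pose.
    shaped = template + shape_dirs @ betas                               # (V, 3)
    # 2. Linear blend skinning: blend per-joint transforms per vertex.
    homo = np.concatenate([shaped, np.ones((len(shaped), 1))], axis=1)   # (V, 4)
    per_vertex_T = np.einsum('vj,jab->vab', skin_weights, joint_transforms)
    posed = np.einsum('vab,vb->va', per_vertex_T, homo)[:, :3]
    return posed
```

With identity joint transforms and zero shape coefficients, the function returns the template unchanged, which is a quick sanity check on the skinning step.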
To evaluate this, and because no existing motion prediction datasets depict two closely-interacting subjects, we introduce the LindyHop600K dance dataset.
We summarise our findings into a set of guidelines and demonstrate their effectiveness by applying them to different baseline methods, DCP and IDAM.
The key to making these correspondences semantically meaningful is to guarantee that the metric tensors computed at corresponding points are as similar as possible.
Estimating the depth of comics images is challenging as such images a) are monocular; b) lack ground-truth depth annotations; c) differ across different artistic styles; d) are sparse and noisy.
Recent progress in stochastic motion prediction, i.e., predicting multiple possible future human motions given a single past pose sequence, has led to producing truly diverse future motions and even providing control over the motion of some body parts.
Whether based on recurrent or feed-forward neural networks, existing learning based methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities.
While domain adaptation has been used to improve the performance of object detectors when the training and test data follow different distributions, previous work has mostly focused on two-stage detectors.
Knowledge distillation constitutes a simple yet effective way to improve the performance of a compact student network by exploiting the knowledge of a more powerful teacher.
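The soft-target form of knowledge distillation can be written down compactly: the student is trained to match the teacher's temperature-softened class distribution via a KL divergence. A hedged numpy sketch (the temperature value and the epsilon are arbitrary choices for illustration):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T gives softer distributions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened teacher and student distributions,
    the classic soft-target distillation loss (sketch only)."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

The loss is zero exactly when the student reproduces the teacher's logits up to a constant shift, and positive otherwise.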
We demonstrate the effectiveness of our approach using an ESPNet trained on the Cityscapes dataset as the segmentation model, an affine Normalizing Flow as the density estimator, and blue noise to ensure homogeneous sampling.
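The density-estimation ingredient above can be illustrated with the simplest possible affine normalizing flow: a single elementwise shift-and-scale layer with a standard-normal base distribution. This is a generic one-layer stand-in, not the paper's actual estimator.

```python
import numpy as np

def affine_flow_logpdf(x, shift, log_scale):
    """Log-density of x under a one-layer affine normalizing flow:
    z = (x - shift) * exp(-log_scale), with z ~ N(0, I).
    (Generic sketch of an affine flow used as a density estimator.)"""
    z = (x - shift) * np.exp(-log_scale)
    d = x.shape[-1]
    log_base = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)
    # Change of variables: log|det dz/dx| = -sum(log_scale).
    return log_base - np.sum(log_scale)
```

With zero shift and zero log-scale the flow is the identity, so the log-density reduces to that of a standard normal.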
State-of-the-art semantic or instance segmentation deep neural networks (DNNs) are usually trained on a closed set of semantic classes.
We propose a method for the unsupervised reconstruction of a temporally-coherent sequence of surfaces from a sequence of time-evolving point clouds, yielding dense, semantically meaningful correspondences between all keyframes.
Weight sharing has become a de facto standard in neural architecture search because it enables the search to be done on commodity hardware.
Saliency prediction has made great strides over the past two decades, with current techniques modeling low-level information, such as color, intensity and size contrasts, and high-level information, such as attention and gaze direction for entire objects.
6D pose estimation in space poses unique challenges that are not commonly encountered in the terrestrial setting.
Correspondence selection aims to correctly select the consistent matches (inliers) from an initial set of putative correspondences.
In this setting, existing techniques focus on the challenging task of isolating the unknown target samples, so as to avoid the negative transfer resulting from aligning the source feature distributions with the broader target one that encompasses the additional unknown classes.
While these methods were shown to be vulnerable to adversarial attacks, as most deep networks for visual recognition tasks, the existing attacks for VOT trackers all require perturbing the search region of every input frame to be effective, which comes at a non-negligible cost, considering that VOT is a real-time task.
Deep learning solutions for hand-object 3D pose and shape estimation are now very effective when an annotated dataset is available to train them to handle the scenarios and lighting conditions they will encounter at test time.
Self-supervised detection and segmentation of foreground objects aims for accuracy without annotated training data.
Long term human motion prediction is essential in safety-critical applications such as human-robot interaction and autonomous driving.
Despite the recent advances in multiple object tracking (MOT), achieved by joint detection and tracking, dealing with long occlusions remains a challenge.
In the presence of annotated data, deep human pose estimation networks yield impressive performance.
Our conclusion is that it is important to utilize camera calibration information when available, for classical and deep-learning-based computer vision alike.
A recent trend in Non-Rigid Structure-from-Motion (NRSfM) is to express local, differential constraints between pairs of images, from which the surface normal at any point can be obtained by solving a system of polynomial equations.
While much progress has been made on the task of 3D point cloud registration, there still exists no learning-based method able to estimate the 6D pose of an object observed by a 2.5D sensor in a scene.
While supervised object detection and segmentation methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on.
We argue that the diverse temporal scales are important as they allow us to look at the past frames with different receptive fields, which can lead to better predictions.
We achieve state-of-the-art performance on LINEMOD and Occluded-LINEMOD in the setting without real poses, even outperforming methods that rely on real annotations during training on Occluded-LINEMOD.
We introduce a two-stream deep network model that produces a visually plausible draping of a template cloth on virtual 3D bodies by extracting features from both the body and garment shapes.
We design our VTN as an encoder-decoder network, with modules dedicated to letting the information flow across the feature channels, to account for the dependencies between the semantic parts.
We analyze the influence of adversarial training on the loss landscape of machine learning models.
In this paper, we identify the proximity of the latent representations of different classes in fine-grained recognition networks as a key factor to the success of adversarial attacks.
While 3D-3D registration is traditionally tackled by optimization-based methods, recent work has shown that learning-based techniques could achieve faster and more robust results.
We tackle unsupervised domain adaptation by accounting for the fact that different domains may need to be processed differently to arrive at a common feature representation effective for recognition.
One of the core components in online multiple object tracking (MOT) frameworks is associating new detections with existing tracklets, typically done via a scoring function.
In this paper, we introduce an eigendecomposition-free approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix predicted by the network.
Weight sharing promises to make neural architecture search (NAS) tractable even on commodity hardware.
We tackle the task of diverse 3D human motion prediction, that is, forecasting multiple plausible future 3D poses given a sequence of observed 3D poses.
The accuracy of monocular 3D human pose estimation depends on the viewpoint from which the image is captured.
Recently, deep networks have achieved impressive semantic segmentation performance, in particular thanks to their use of larger contextual information.
State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density.
In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows, instead of directly regressing them.
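The flow-to-density relationship can be sketched with a toy flow matrix: summing incoming flows at a grid cell gives its current density, summing outgoing flows gives the previous one, and total people count is conserved. The dense flow matrix here is a hypothetical simplification (real methods restrict flows to neighboring cells).

```python
import numpy as np

def density_from_flows(flows):
    """Recover per-location people densities from people flows.

    flows[i, j] = estimated number of people moving from grid cell i
    (previous frame) to cell j (current frame). Incoming flows sum to
    the current density, outgoing flows to the previous one; this is
    the people-conservation constraint. (Toy dense-matrix sketch.)"""
    density_prev = flows.sum(axis=1)   # outgoing flows per cell
    density_curr = flows.sum(axis=0)   # incoming flows per cell
    return density_prev, density_curr
```

Because every person leaving some cell must arrive in another, the two density maps always sum to the same total count.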
Generative models that produce point clouds have emerged as a powerful tool to represent 3D surfaces, and the best current ones rely on learning an ensemble of parametric representations.
Second, training the deep network relies on a surrogate loss that does not directly reflect the final 6D pose estimation task.
In this paper, we propose a simple feed-forward deep network for motion prediction, which takes into account both temporal smoothness and spatial dependencies among human body joints.
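One common way to encode temporal smoothness in this family of motion-prediction methods is to represent each joint's trajectory by its low-frequency DCT coefficients; a network then regresses coefficients rather than raw poses. Whether this matches the paper's exact design is an assumption; the sketch below only shows the encoding itself.

```python
import numpy as np
from scipy.fft import dct, idct

def smooth_trajectory_code(traj, k):
    """Encode a 1-D joint trajectory with its first k DCT coefficients.

    Truncating the high frequencies enforces temporal smoothness, so a
    predictor operating on the kept coefficients cannot produce jittery
    motion. (Illustrative encoding, not the paper's full pipeline.)"""
    coeffs = dct(traj, norm='ortho')
    coeffs[k:] = 0.0                    # drop high-frequency components
    return idct(coeffs, norm='ortho')   # smoothed reconstruction
```

A constant trajectory is represented exactly by a single coefficient, and smooth trajectories are reconstructed almost losslessly from a handful of them.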
In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account.
While supervised object detection methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on.
State-of-the-art segmentation methods rely on very deep networks that are not always easy to train without very large training datasets and tend to be relatively slow to run on standard GPUs.
In this paper, we tackle the more realistic scenario where unexpected objects of unknown classes can appear at test time.
To this end, we introduce a self-supervised approach to learning what we call a neural scene decomposition (NSD) that can be exploited for 3D pose estimation.
We identify a phenomenon, which we refer to as multi-model forgetting, that occurs when sequentially training multiple deep networks with partially-shared parameters; the performance of previously-trained models degrades as one optimizes a subsequent one, due to the overwriting of shared parameters.
The reason behind the prediction for a new sample can then be interpreted by looking at the visual representation of the most highly activated codeword.
The most recent trend in estimating the 6D pose of rigid objects has been to train deep networks to either directly regress the pose from the image or to predict the 2D locations of 3D keypoints, from which the pose can be obtained using a PnP algorithm.
We fuse these features with those extracted in parallel from the 3D body, so as to model the cloth-body interactions.
As evidenced by our results on standard hand segmentation benchmarks and on our own dataset, our approach outperforms these other, simpler recurrent segmentation techniques, as well as the state-of-the-art hand segmentation method.
As evidenced by our experiments, our approach outperforms both training the compact network from scratch and performing knowledge distillation from a teacher.
The difficulty of obtaining annotations to build training databases still slows down the adoption of recent deep learning approaches for biomedical image analysis.
Action anticipation is critical in scenarios where one needs to react before the action is finalized.
Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently.
To this end, we rely on the intuition that the source and target samples depicting the known classes can be generated by a shared subspace, whereas the target samples from unknown classes come from a different, private subspace.
The presented algorithms can be applied to any labelling problem using a dense CRF with sparse higher-order potentials.
Structured representations, such as Bags of Words, VLAD and Fisher Vectors, have proven highly effective to tackle complex visual recognition tasks.
In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations.
In this paper, we explicitly model the scale changes and reason in terms of people per square-meter.
Recent years have seen the development of mature solutions for reconstructing deformable surfaces from a single image, provided that they are relatively well-textured.
Many classical Computer Vision problems, such as essential matrix computation and pose estimation from 3D to 2D correspondences, can be solved by finding the eigenvector corresponding to the smallest, or zero, eigenvalue of a matrix representing a linear system.
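The classical recipe mentioned here can be stated in a few lines: for a homogeneous system A x ≈ 0 with ||x|| = 1, the least-squares solution is the eigenvector of AᵀA associated with its smallest eigenvalue. A minimal numpy sketch:

```python
import numpy as np

def smallest_eigenvector(A):
    """Least-squares solution of the homogeneous system A x ≈ 0, ||x|| = 1:
    the eigenvector of A^T A with the smallest eigenvalue. This is the
    classical closed form used, e.g., for essential-matrix estimation."""
    w, V = np.linalg.eigh(A.T @ A)   # eigenvalues returned in ascending order
    return V[:, 0]                   # eigenvector of the smallest eigenvalue
```

If A is built so that some unit vector x0 lies exactly in its null space, the routine recovers x0 up to sign.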
Accurate 3D human pose estimation from single images is possible with sophisticated deep-net architectures that have been trained on very large datasets.
Our approach is motivated by a statistical analysis of the network's activations, relying on operations that lead to a Gaussian-distributed final representation, as inherently used by first-order deep networks.
The goal of Deep Domain Adaptation is to make it possible to use Deep Nets trained in one domain where there is enough annotated training data in another where there is little or none.
We propose to address this issue, by formulating multimodal semantic labeling as inference in a CRF and introducing latent nodes to explicitly model inconsistencies between two modalities.
Our experiments demonstrate the benefits of our classifier heatmaps and of our two-stream architecture on challenging urban scene datasets and on the YouTube-Objects benchmark, where we obtain state-of-the-art results.
To be tractable and robust to data noise, existing metric learning algorithms commonly rely on PCA as a pre-processing step.
In particular, while some of them aim at segmenting the image into regions, such as object or surface instances, others aim at inferring the semantic labels of given regions, or their support relationships.
Imposing constraints on the output of a Deep Neural Net is one way to improve the quality of its predictions while loosening the requirements for labeled training data.
We then show how to obtain multi-class masks by the fusion of foreground/background ones with information extracted from a weakly-supervised localization network.
In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos.
By performing linear combinations and element-wise nonlinear operations, these networks can be thought of as extracting solely first-order information from an input image.
Multi-label submodular Markov Random Fields (MRFs) have been shown to be solvable using max-flow based on an encoding of the labels proposed by Ishikawa, in which each variable $X_i$ is represented by $\ell$ nodes (where $\ell$ is the number of labels) arranged in a column.
In this context, existing methods typically propose candidate objects, usually as bounding boxes, and directly predict a binary mask within each such proposal.
To this end, we develop a proximal minimization framework, where the dual of each proximal problem is optimized via block coordinate descent.
We outperform the state-of-the-art methods that, like ours, rely only on RGB frames as input for both action recognition and anticipation.
Hence, weak supervision using only image tags could have a significant impact in semantic segmentation.
In particular, we introduce a deep structured network that jointly predicts the objectness scores and the bounding box locations of multiple object candidates.
Despite much progress, state-of-the-art techniques suffer from two drawbacks: (i) they rely on the assumption that intensity edges coincide with depth discontinuities, which, unfortunately, is only true in controlled environments; and (ii) they typically exploit the availability of high-resolution training depth maps, which can often not be acquired in practice due to the sensors' limitations.
This lets us formulate dimensionality reduction as the problem of finding a projection that yields a low-dimensional manifold either with maximum discriminative power in the supervised scenario, or with maximum variance of the data in the unsupervised one.
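For the unsupervised case mentioned above, the maximum-variance projection is classical PCA: the top eigenvectors of the data covariance. A minimal sketch of that baseline (the supervised, discriminative variant the paper describes is more involved):

```python
import numpy as np

def pca_projection(X, d):
    """Unsupervised maximum-variance projection (plain PCA): return the
    top-d eigenvectors of the covariance of X as columns. (Sketch of the
    classical baseline, not the paper's supervised formulation.)"""
    Xc = X - X.mean(axis=0)            # center the data
    cov = Xc.T @ Xc / len(X)
    w, V = np.linalg.eigh(cov)         # eigenvalues in ascending order
    return V[:, ::-1][:, :d]           # top-d variance directions
```

On strongly anisotropic data the first projection direction aligns with the high-variance axis, which is the property the dimensionality-reduction formulation exploits.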
Most recent approaches to monocular 3D pose estimation rely on Deep Learning.
To this end, we introduce a two-stream architecture, where one operates in the source domain and the other in the target domain.
Feature tracking is a fundamental problem in computer vision, with applications in many computer vision tasks, such as visual SLAM and action recognition.
In this paper, we address the problem of data misalignment and label inconsistencies, e.g., due to moving objects, in semantic labeling, which violate the assumption of existing techniques.
To exploit the correlations between objects, we build a fully-connected CRF on the candidates, which explicitly incorporates both geometric layout relations across object classes and similarity relations across multiple images.
By contrast, nonparametric approaches, which bypass any learning phase and directly transfer the labels from the training data to the query images, can readily exploit new labeled samples as they become available.
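The nonparametric label-transfer idea can be illustrated with the simplest instance, k-nearest-neighbor voting: there is no training phase, and new labeled samples are exploited by simply appending them to the reference set. A toy sketch (feature dimensions and k are arbitrary):

```python
import numpy as np

def knn_label_transfer(train_feats, train_labels, query_feats, k=3):
    """Transfer labels from the k nearest training samples by majority
    vote, with no learning phase (toy nonparametric baseline)."""
    # Squared Euclidean distances between every query and training sample.
    d2 = ((query_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]          # indices of k nearest
    votes = train_labels[nn]
    return np.array([np.bincount(v).argmax() for v in votes])
```

Adding freshly labeled data amounts to concatenating rows onto `train_feats` and `train_labels`, with no retraining, which is exactly the flexibility the sentence above refers to.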
The Shape Interaction Matrix (SIM) is one of the earliest approaches to performing subspace clustering (i.e., separating points drawn from a union of subspaces).
State-of-the-art image-set matching techniques typically implicitly model each image-set with a Gaussian distribution.
Vectors of Locally Aggregated Descriptors (VLAD) have emerged as powerful image/video representations that compete with or even outperform state-of-the-art approaches on many challenging visual recognition tasks.
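The VLAD construction is simple enough to state in full: assign each local descriptor to its nearest codeword, accumulate the residuals per codeword, and concatenate and L2-normalize the result. A compact numpy sketch:

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """VLAD encoding: sum, per codeword, the residuals of the descriptors
    assigned to it; concatenate and L2-normalize. (Basic formulation,
    without the intra-normalization refinements of later variants.)"""
    # Hard-assign each descriptor to its nearest codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    K, D = codebook.shape
    v = np.zeros((K, D))
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            v[k] = (sel - codebook[k]).sum(axis=0)   # residual sum
    v = v.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

The representation has fixed length K·D regardless of how many local descriptors the image produced, which is what makes it directly usable with standard classifiers.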
We tackle the problem of single image depth estimation, which, without additional knowledge, suffers from many ambiguities.
While sparse coding on non-flat Riemannian manifolds has recently become increasingly popular, existing solutions either are dedicated to specific manifolds, or rely on optimization problems that are difficult to solve, especially when it comes to dictionary learning.
We propose a framework for 2D shape analysis using positive definite kernels defined on Kendall's shape manifold.
To encode the geometry of the manifold in the mapping, we introduce a family of provably positive definite kernels on the Riemannian manifold of SPD matrices.
We tackle the problem of optimizing over all possible positive definite radial kernels on Riemannian manifolds for classification.
We then use the proposed framework to identify positive definite kernels on two specific manifolds commonly encountered in computer vision: the Riemannian manifold of symmetric positive definite matrices and the Grassmann manifold, i.e., the Riemannian manifold of linear subspaces of a Euclidean space.
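One well-known example of such a kernel on the SPD manifold is the log-Euclidean Gaussian kernel, k(X, Y) = exp(−‖log X − log Y‖²_F / 2σ²), which is provably positive definite for any σ. Whether this is the exact kernel identified in the paper is an assumption; the sketch just computes the Gram matrix.

```python
import numpy as np
from scipy.linalg import logm

def log_euclidean_kernel(spd_mats, sigma=1.0):
    """Gaussian kernel on SPD matrices under the log-Euclidean metric:
    k(X, Y) = exp(-||log X - log Y||_F^2 / (2 sigma^2)).
    Provably positive definite on the SPD manifold (sketch)."""
    logs = [logm(X).real for X in spd_mats]   # matrix logarithms
    n = len(logs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d2 = np.linalg.norm(logs[i] - logs[j], 'fro') ** 2
            K[i, j] = np.exp(-d2 / (2.0 * sigma ** 2))
    return K
```

The resulting Gram matrix is symmetric with unit diagonal and non-negative eigenvalues, which is the positive-definiteness property the framework is designed to guarantee.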
While widely acknowledged as highly effective in computer vision, multi-label MRFs with non-convex priors are difficult to optimize.
In contrast, here, we study the problem of performing coding in a high-dimensional Hilbert space, where the classes are expected to be more easily separable.
In particular, we search for a projection that yields a low-dimensional manifold with maximum discriminative power encoded via an affinity-weighted similarity measure based on metrics on the manifold.
Modeling videos and image-sets as linear subspaces has proven beneficial for many visual recognition tasks.
Here, we propose to make better use of the structure of this manifold and rely on the distance on the manifold to compare the source and target distributions.
We introduce an approach to computing and comparing Covariance Descriptors (CovDs) in infinite-dimensional spaces.
In such conditions, our differential geometry analysis provides a theoretical proof that the shape of the mirror surface can be uniquely recovered if the pose of the reference target is known.
In this paper, we tackle the problem of performing inference in graphical models whose energy is a polynomial function of continuous variables.
Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities.