Deep unsupervised learning for optical flow has been proposed, where the loss measures image similarity with the warping function parameterized by estimated flow.
Monocular 3D object detection plays an important role in autonomous driving and still remains challenging.
This is done by using proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared.
Recently, hierarchical networks that consist of both a global network dealing with the whole image and multiple local networks focusing on local parts are showing success.
To enable KD across the unpaired modalities, we first propose a bidirectional modality reconstruction (BMR) module to bridge both modalities and simultaneously exploit them to distill knowledge via the crafted pairs, causing no extra computation in the inference.
Whilst both branches are required during training, the RGB branch is our primary network and the semantic branch is not needed for inference.
This paper studies a theoretical framework for value factorisation with interpretability via Shapley value theory.
Multi-domain image-to-image translation with conditional Generative Adversarial Networks (GANs) can generate highly photo realistic images with desired target classes, yet these synthetic images have not always been helpful to improve downstream supervised tasks such as image classification.
The experimental results show that our method achieves the state-of-the-art performance on the monocular 3D Object Detection and Birds Eye View tasks of the KITTI dataset, and can generalize to images with different camera intrinsics.
Ranked #10 on Monocular 3D Object Detection on KITTI Cars Moderate
Deep Imitation Learning requires a large number of expert demonstrations, which are not always easy to obtain, especially for complex tasks.
We utilise this dataset to minimise the novel depth consistency loss via adversarial learning (note we do not have ground truth depth maps for generated face images) and the depth categorical loss of synthetic data on the discriminator.
Dexterous manipulation of objects in virtual environments with our bare hands, by using only a depth sensor and a state-of-the-art 3D hand pose estimator (HPE), is challenging.
We flip the label of newly queried nodes from unlabelled to labelled, re-train the learner to optimise the downstream task and the graph to minimise its modified objective.
Ranked #3 on Active Learning on CIFAR10 (10,000)
We perform augmentation by randomly sampling sensible labels from the label space of the few labelled examples available and assigning them as target labels to the abundant unlabelled examples from the same distribution as that of the labelled ones.
We test HDNO on MultiWoz 2. 0 and MultiWoz 2. 1, the datasets on multi-domain dialogues, in comparison with word-level E2E model trained with RL, LaRL and HDSA, showing improvements on the performance evaluated by automatic evaluation metrics and human evaluation.
no code implementations • • Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, Mingxiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, Sijia Mei, Yun-hui Liu, Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Philippe Weinzaepfel, Romain Brégier, Grégory Rogez, Vincent Lepetit, Tae-Kyun Kim
To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set.
Most successful approaches to estimate the 6D pose of an object typically train a neural network by supervising the learning with annotated poses in real world images.
In this work, we propose to use a few-shot learning approach that requires less training data and can be used to predict label classes of test samples from an unseen dataset.
While each phase is mainly for one of the three tasks, the networks in earlier phases are fine-tuned by respective loss functions in an end-to-end manner.
In this paper, we present the first comprehensive and most recent review of the methods on object pose recovery, from 3D bounding box detectors to full 6D pose estimators.
Current 6D object pose methods consist of deep CNN models fully optimized for a single object but with its architecture standardized among objects with different shapes.
In this work, we explore how a strategic selection of camera movements can facilitate the task of 6D multi-object pose estimation in cluttered scenarios while respecting real-world constraints important in robotics and augmented reality applications, such as time and distance traveled.
Unlike previous studies of randomly augmenting the synthetic data with real data, we present our simple, effective and easy to implement synthetic data sampling methods to train deep CNNs more efficiently and accurately.
In this work, we present a modified fuzzy decision forest for real-time 3D object pose estimation based on typical template representation.
To deal with this problem, we i) introduce a cooperative-game theoretical framework called extended convex game (ECG) that is a superset of global reward game, and ii) propose a local reward approach called Shapley Q-value.
Once the model is successfully fitted to input RGB images, its meshes i. e. shapes and articulations, are realistic, and we augment view-points on top of estimated dense hand poses.
6D object pose estimation is an important task that determines the 3D position and 3D rotation of an object in camera-centred coordinates.
We explore different ways of using this privileged information: (1) using depth data to initially train a depth-based network, (2) using the features from the depth-based network of the paired depth images to constrain mid-level RGB network weights, and (3) using the foreground mask, obtained from the depth data, to suppress the responses from the background area.
The fourth instantiation of this workshop attracted significant interest from both academia and the industry.
no code implementations • 9 Oct 2018 • Tomas Hodan, Rigas Kouskouridas, Tae-Kyun Kim, Federico Tombari, Kostas Bekris, Bertram Drost, Thibault Groueix, Krzysztof Walas, Vincent Lepetit, Ales Leonardis, Carsten Steger, Frank Michel, Caner Sahin, Carsten Rother, Jiri Matas
The workshop featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation.
In this work, we capture the hand information by using a state-of-the-art hand pose estimator.
1 code implementation • • Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Manhardt, Federico Tombari, Tae-Kyun Kim, Jiri Matas, Carsten Rother
We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image.
The semantic segmentation network assigns semantic labels for each point in the point set.
Ranked #6 on Hand Pose Estimation on MSRA Hands
Intra-class variations, distribution shifts among source and target domains are the major challenges of category-level tasks.
Our architecture jointly learns multiple sub-tasks: 2D detection, depth, and 3D pose estimation of individual objects; and joint registration of multiple objects.
By training the HPG and HPE in a single unified optimization framework enforcing that 1) the HPE agrees with the paired depth and skeleton entries; and 2) the HPG-HPE combination satisfies the cyclic consistency (both the input and the output of HPG-HPE are skeletons) observed via the newly generated unpaired skeletons, our algorithm constructs a HPE which is robust to variations that go beyond the coverage of the existing database.
We propose a novel end-to-end semi-supervised adversarial framework to generate photorealistic face images of new identities with wide ranges of expressions, poses, and illuminations conditioned by a 3D morphable model.
Ranked #16 on Face Verification on IJB-A
2 code implementations • • Shanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, Shile Li, Dongheui Lee, Iason Oikonomidis, Antonis Argyros, Tae-Kyun Kim
Official Torch7 implementation of "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map", CVPR 2018
Ranked #4 on Hand Pose Estimation on HANDS 2017
The proposed method leverages the state-of-the-art hand pose estimators based on Convolutional Neural Networks to facilitate feature learning, while it models the multiple modes in a two-level hierarchy to reconcile single-valued and multi-valued mapping in its output.
In this paper we examine the effects of using object poses as guidance to learning robust features for 3D object pose estimation.
In this work, we investigate several methods and strategies to learn deep embeddings for face recognition, using joint sample- and set-based optimization.
We present the 2017 Hands in the Million Challenge, a public competition designed for the evaluation of the task of 3D hand pose estimation.
A large number of studies analyse object detection and pose estimation at visual level in 2D, discussing the effects of challenges such as occlusion, clutter, texture, etc., on the performances of the methods, which work in the context of RGB modality.
Each response map-or node-in both the convolutional and fully-connected layers selectively respond to class labels s. t.
Ranked #139 on Image Classification on CIFAR-100
We also show significant improvements in egocentric hand pose estimation with a CNN trained on the new dataset.
Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.
Our PI-based classification loss maintains a consistency between latent PI and predicted distribution.
The iterative refinement is accomplished based on finer (smaller) parts that are represented with more discriminative control point descriptors by using our Iterative Hough Forest.
Online action detection (OAD) is challenging since 1) robust yet computationally expensive features cannot be straightforwardly used due to the real-time processing requirements and 2) the localization and classification of actions have to be performed even before they are fully observed.
In this paper, we tackle the problem of 24 hours-monitoring patient actions in a ward such as "stretching an arm out of the bed", "falling out of the bed", where temporal movements are subtle or significant.
A human action can be seen as transitions between one's body poses over time, where the transition depicts a temporal relation between two poses.
Furthermore, we argue that our pose-guided feature learning using our Siamese Regression Network generates more discriminative features that outperform the state of the art.
LBSVM is based on Structured-Output SVM, but extends it to handle noisy video data and ensure consistency of the output decision throughout time.
Backscatter corresponds to a complex term with several unknown variables, and makes the problem of normal estimation hard.
In this paper, a hybrid hand pose estimation method is proposed by applying the kinematic hierarchy strategy to the input space (as well as the output space) of the discriminative method by a spatial attention mechanism and to the optimization of the generative method by hierarchical Particle Swarm Optimization (PSO).
State-of-the-art techniques proposed for 6D object pose recovery depend on occlusion-free point clouds to accurately register objects in 3D space.
In this paper we present Latent-Class Hough Forests, a method for object detection and 6 DoF pose estimation in heavily cluttered and occluded scenarios.
In this work, we present a complete framework for both single shot-based 6D object pose estimation and next-best-view prediction based on Hough Forests, the state of the art object pose estimator that performs classification and regression jointly.
Faces in the wild are usually captured with various poses, illuminations and occlusions, and thus inherently multimodally distributed in many tasks.
In this paper, we show that we can significantly improving upon black box optimization by exploiting high-level knowledge of the structure of the parameters and using a local surrogate energy function.
Image understanding is an important research domain in the computer vision due to its wide real-world applications.
We inspect the recent advances in various aspects and propose some interesting directions for future research.
In this paper, we propose a label propagation framework to handle the multiple object tracking (MOT) problem for a generic object type (cf.
In this paper, we present a unified method for joint face image analysis, i. e., simultaneously estimating head pose, facial expression and landmark positions in real-world face images.
We compare our method with previous approaches through extensive experimental results, where a variety of objects are imaged in a big water tank whose turbidity is systematically increased, and show reconstruction quality which degrades little relative to clean water results even with a very significant scattering level.
In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards; our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints.
This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective.