We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization.
In this work, we present DRaCoN, a framework for learning full-body volumetric avatars that exploits the advantages of both 2D and 3D neural rendering techniques.
Federated learning (FL) allows the collaborative training of AI models without needing to share raw data.
A-ViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
Since the joint reconstruction of human motions and camera poses is underconstrained, we propose a global trajectory predictor that generates global human trajectories based on local body movements.
Through extensive experiments on ImageNet, we show that EPI enables quick identification of early training epochs suitable for pruning, offering the same efficacy as an otherwise "oracle" grid search that scans through epochs and requires orders of magnitude more compute.
We propose Hardware-Aware Latency Pruning (HALP), which formulates structural pruning as a global resource allocation optimization problem that aims to maximize accuracy while keeping latency under a predefined budget.
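As a rough illustration of "maximize accuracy under a latency budget", the allocation can be viewed as a knapsack-style selection. The sketch below is a plain 0/1 knapsack solved by dynamic programming, not the HALP algorithm itself; the function name, the per-group importance scores, and the integer latency costs are all hypothetical.

```python
# Toy sketch: keep the subset of prunable groups that maximizes total
# importance while total latency stays within a budget.
# This is a standard 0/1 knapsack DP, used here only as an illustration.
# Names and numbers are hypothetical and not taken from the paper.

def select_under_latency_budget(importance, latency, budget):
    """Return (kept_indices, total_importance).

    importance: list of floats, one score per group.
    latency:    list of non-negative ints, cost of keeping each group.
    budget:     non-negative int, maximum total latency allowed.
    """
    n = len(importance)
    # best[i][b] = best total importance using the first i groups with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        imp, lat = importance[i - 1], latency[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: prune group i-1
            if lat <= b:
                candidate = best[i - 1][b - lat] + imp  # option 2: keep it
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk back through the table to recover which groups were kept.
    kept = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= latency[i - 1]
    kept.reverse()
    return kept, best[n][budget]


importance = [6.0, 10.0, 12.0]   # hypothetical importance scores
latency = [1, 2, 3]              # hypothetical latency costs (integer units)
kept, total = select_under_latency_budget(importance, latency, budget=5)
print(kept, total)
```

With these toy numbers, keeping the second and third groups uses the full budget of 5 and gives the highest total importance.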
On ImageNet-1K, we prune the DEIT-Base (Touvron et al., 2021) model to a 2.6x FLOPs reduction, 5.1x parameter reduction, and 1.9x run-time speedup with only 0.07% loss in accuracy.
In the second phase, it solves the combinatorial selection of efficient operations using a novel constrained integer linear optimization approach.
Understanding the behavior and vulnerability of pre-trained deep neural networks (DNNs) can help to improve them.
We analyze three popular network architectures: EfficientNetV1, EfficientNetV2 and ResNeST, and achieve accuracy improvement for all models (up to $3.0\%$) when compressing larger models to the latency level of smaller models.
We study the problem of quantizing N sorted, scalar datapoints with a fixed codebook containing K entries that are allowed to be rescaled.
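To make the problem statement concrete, here is a minimal sketch of quantizing with a rescaled fixed codebook: each datapoint is mapped to the nearest entry of `scale * codebook`, and the scale is chosen to minimize squared error. The grid search over candidate scales is a naive baseline assumed for illustration, not the method studied in the paper; all function names and values are hypothetical.

```python
# Naive baseline sketch: pick the codebook scale by brute-force search.
# All names and the toy data are hypothetical.

def quantize(data, codebook, scale):
    """Map each datapoint to the nearest entry of the rescaled codebook."""
    return [min((scale * c for c in codebook), key=lambda q: abs(x - q))
            for x in data]

def squared_error(data, codebook, scale):
    """Total squared quantization error for a given scale."""
    quantized = quantize(data, codebook, scale)
    return sum((x - q) ** 2 for x, q in zip(data, quantized))

def best_scale_grid(data, codebook, candidates):
    """Return the candidate scale with the smallest squared error."""
    return min(candidates, key=lambda s: squared_error(data, codebook, s))


data = [-2.0, -1.0, 0.0, 1.0, 2.0]   # N = 5 sorted scalar datapoints
codebook = [-1.0, 0.0, 1.0]          # K = 3 fixed entries
candidates = [0.5, 1.0, 1.5, 2.0]    # scales to try

best = best_scale_grid(data, codebook, candidates)
print(best, quantize(data, codebook, best))
```

The search costs one full quantization pass per candidate scale, which is why a dedicated algorithm is of interest for large N.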
Hand pose estimation is difficult due to different environmental conditions, object- and self-occlusion as well as diversity in hand shape and appearance.
We present KAMA, a 3D Keypoint Aware Mesh Articulation approach that allows us to estimate a human body mesh from the positions of 3D body keypoints.
Ranked #16 on 3D Human Pose Estimation on 3DPW
In this work, we introduce GradInversion, with which input images from a larger batch (8 to 48 images) can also be recovered for large networks such as ResNets (50 layers) on complex datasets such as ImageNet (1000 classes, 224x224 px).
We introduce DexYCB, a new dataset for capturing hand grasping of objects.
no code implementations • • Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, Mingxiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, Sijia Mei, Yun-hui Liu, Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Philippe Weinzaepfel, Romain Brégier, Grégory Rogez, Vincent Lepetit, Tae-Kyun Kim
To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set.
Estimating 3D hand pose from 2D images is a difficult, inverse problem due to the inherent scale and depth ambiguities.
One major challenge for monocular 3D human pose estimation in-the-wild is the acquisition of training data that contains unconstrained images annotated with accurate 3D poses.
no code implementations • 22 Feb 2020 • Abdulrahman Mahmoud, Siva Kumar Sastry Hari, Christopher W. Fletcher, Sarita V. Adve, Charbel Sakr, Naresh Shanbhag, Pavlo Molchanov, Michael B. Sullivan, Timothy Tsai, Stephen W. Keckler
As Convolutional Neural Networks (CNNs) are increasingly being employed in safety-critical applications, it is important that they behave reliably in the face of hardware errors.
We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network.
On ResNet-101, we achieve a 40% FLOPs reduction by removing 30% of the parameters, with a loss of 0.02% in the top-1 accuracy on ImageNet.
Inter-personal anatomical differences limit the accuracy of person-independent gaze estimation networks.
Ranked #1 on Gaze Estimation on MPII Gaze (using extra training data)
Parts provide a good intermediate representation of objects that is robust to camera, pose, and appearance variations.
Specifically, we propose a semi-supervised framework that employs unpaired image-to-image translation between two domains, presence vs. absence of cancer, as the unsupervised objective.
Recurrent neural networks (RNNs) have emerged as a powerful model for a broad range of machine learning problems that involve sequential data.
Deep residual networks (ResNets) made a recent breakthrough in deep learning.
2 code implementations • • Shanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, Shile Li, Dongheui Lee, Iason Oikonomidis, Antonis Argyros, Tae-Kyun Kim
Official Torch7 implementation of "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map", CVPR 2018
Ranked #4 on Hand Pose Estimation on HANDS 2017
In this paper, we address the challenging problem of efficient temporal activity detection in untrimmed long videos.
First, we propose the framework of sequential multitasking and explore it here through an architecture for landmark localization where training with class labels acts as an auxiliary signal to guide the landmark localization on unlabeled data.
In addition, we have created a large synthetic dataset, SynBRDF, which comprises a total of $500$K RGBD images rendered with a physically-based ray tracer under a variety of natural illumination, covering $5000$ materials and $5000$ shapes.
We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters.
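The idea can be sketched on a toy model. Assuming a first-order Taylor expansion, the change in cost from setting a parameter `w_i` to zero is approximately `|g_i * w_i|`, where `g_i` is the gradient of the cost with respect to `w_i`. The linear model, squared loss, and function names below are hypothetical choices made only to keep the example self-contained.

```python
# Toy sketch of a first-order Taylor pruning score on a linear model.
# Score for parameter i: |dL/dw_i * w_i|, an estimate of how much the loss
# changes if w_i is set to zero. Model and data are hypothetical.

def loss(w, x, y):
    """Squared error of a linear prediction sum(w_i * x_i) against target y."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2

def grad(w, x, y):
    """Analytic gradient of the loss with respect to each weight."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [2.0 * (pred - y) * xi for xi in x]

def taylor_importance(w, x, y):
    """First-order estimate of the loss change from zeroing each weight."""
    return [abs(gi * wi) for gi, wi in zip(grad(w, x, y), w)]

def true_change(w, x, y, i):
    """Exact absolute loss change from zeroing weight i, for comparison."""
    pruned = list(w)
    pruned[i] = 0.0
    return abs(loss(pruned, x, y) - loss(w, x, y))


w = [0.5, -0.01, 2.0]
x = [1.0, 1.0, 1.0]
y = 2.0

scores = taylor_importance(w, x, y)
prune_index = scores.index(min(scores))  # weight with the smallest estimate
print(prune_index, scores[prune_index], true_change(w, x, y, prune_index))
```

For the smallest weight, the estimate and the exact change agree closely; the approximation degrades as the removed weight grows, since higher-order terms are ignored.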
Automatic detection and classification of dynamic hand gestures in real-world systems intended for human-computer interaction is challenging because: 1) there is a large diversity in how people perform gestures, making detection and classification difficult; 2) the system must work online in order to avoid noticeable lag between performing a gesture and its classification; in fact, a negative lag (classification before the gesture is finished) is desirable, as feedback to the user can then be truly instantaneous.