Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images.
Ranked #1 on Semantic Segmentation on SUN-RGBD
For instance, our approach achieves 66.4% mAP at the 0.5 IoU threshold on the ScanNetV2 test set, which is 1.9% higher than the state-of-the-art method.
Ranked #1 on 3D Instance Segmentation on S3DIS
Adder neural networks (AdderNets) have shown impressive performance on image classification with only addition operations, which are more energy efficient than traditional convolutional neural networks built with multiplications.
Adder neural networks (ANNs) are designed for low energy cost: they replace the expensive multiplications in convolutional neural networks (CNNs) with cheaper additions to yield energy-efficient neural networks and hardware accelerators.
Adder neural network (AdderNet) replaces the massive multiplications in the original convolutions with cheap additions while achieving comparable performance, thus yielding a series of energy-efficient neural networks.
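The core idea the AdderNet snippets above describe can be sketched in a few lines: instead of the multiply-accumulate of a convolution, an adder filter responds with the negative L1 distance between the input patch and the filter. This is a minimal illustrative sketch, not the paper's implementation; the function names are my own.

```python
import numpy as np

def adder_filter(patch, kernel):
    """Adder-style response: negative L1 distance between input patch and
    filter, replacing the multiply-accumulate of an ordinary convolution."""
    return -np.abs(patch - kernel).sum()

def adder_layer_1d(signal, kernel):
    """Slide a 1-D adder filter over a signal (valid padding, stride 1)."""
    k = len(kernel)
    return np.array([adder_filter(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])
```

When a patch matches the filter exactly, the response reaches its maximum of 0, so larger (less negative) outputs still indicate stronger similarity, as with a convolutional response.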
Previous vision MLPs such as MLP-Mixer and ResMLP accept linearly flattened image patches as input, making them inflexible to different input sizes and limiting their ability to capture spatial information.
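To see why flattened patch tokens tie such models to one input size, consider a minimal sketch of the patch-embedding step (my own simplification, not any paper's code): the number of tokens depends on the image resolution, while a token-mixing MLP's weights are sized to a fixed token count.

```python
import numpy as np

def to_patch_tokens(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches and
    flatten each into a token, as MLP-Mixer-style models do."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "resolution must be divisible by p"
    tokens = []
    for i in range(0, H, p):
        for j in range(0, W, p):
            tokens.append(img[i:i + p, j:j + p].reshape(-1))
    return np.stack(tokens)  # shape: (num_patches, p*p*C)

# A token-mixing layer holds a (num_patches, num_patches) weight matrix,
# so a different input resolution produces a shape mismatch.
```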
Experiments on various datasets and architectures demonstrate that the proposed method can effectively learn portable student networks without the original data, e.g., with only a 0.16 dB PSNR drop on Set5 for x2 super-resolution.
In this paper, we present a positive-unlabeled learning based scheme to expand training data by purifying valuable images from massive unlabeled ones, where the original training data are viewed as positive data and images collected in the wild as unlabeled data.
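The selection step of such a positive-unlabeled scheme can be sketched as follows (a hypothetical simplification: scores from a PU classifier are assumed given, and the function name is my own): unlabeled images with the highest positive scores are "purified" into the training set.

```python
import numpy as np

def purify_unlabeled(positive_scores, budget):
    """Pick the unlabeled images whose positive scores (e.g. from a
    positive-unlabeled classifier) are highest, up to a budget.
    Returns the selected indices in ascending order."""
    order = np.argsort(positive_scores)[::-1]  # highest score first
    return np.sort(order[:budget])
```

In practice the scoring model would be trained on the positive set against the unlabeled pool; this sketch only shows the purification by ranking.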
Adder neural network (AdderNet) is a new kind of deep model that replaces the original massive multiplications in convolutions by additions while preserving the high performance.
To this end, we present a novel distillation algorithm via decoupled features (DeFeat) for learning a better student detector.
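A decoupled-feature distillation loss of the kind described above can be sketched like this (an illustrative approximation with hypothetical weights, not the DeFeat implementation): the student imitates the teacher's feature maps separately on foreground and background regions, with a different weight for each.

```python
import numpy as np

def decoupled_feat_loss(f_student, f_teacher, fg_mask, w_fg=2.0, w_bg=0.5):
    """Distill (H, W, C) feature maps with foreground/background regions
    decoupled; fg_mask is a boolean (H, W) object-region mask.
    The weights w_fg and w_bg are hypothetical illustrative values."""
    diff = (f_student - f_teacher) ** 2       # per-element squared error
    fg = fg_mask[..., None].astype(float)     # broadcast mask over channels
    fg_term = (diff * fg).sum() / max(fg.sum(), 1.0)
    bg_term = (diff * (1 - fg)).sum() / max((1 - fg).sum(), 1.0)
    return w_fg * fg_term + w_bg * bg_term
```

Normalizing each term by its own region size keeps small objects from being drowned out by the much larger background area.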
Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism.
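The self-attention mechanism at the heart of the Transformer is standard and can be shown in a compact numpy sketch: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence x
    of shape (seq_len, d_model); wq/wk/wv are projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax rows sum to 1
    return weights @ v
```

Each output token is thus a convex combination of the value vectors of all tokens, which is what lets stacked self-attention modules model global dependencies in images as well as text.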
1 code implementation • 3 Nov 2020 • Bochao Wang, Hang Xu, Jiajin Zhang, Chen Chen, Xiaozhi Fang, Yixing Xu, Ning Kang, Lanqing Hong, Chenhan Jiang, Xinyue Cai, Jiawei Li, Fengwei Zhou, Yong Li, Zhicheng Liu, Xinghao Chen, Kai Han, Han Shu, Dehua Song, Yunhe Wang, Wei zhang, Chunjing Xu, Zhenguo Li, Wenzhi Liu, Tong Zhang
Automated Machine Learning (AutoML) is an important industrial solution for automatic discovery and deployment of machine learning models.
A convolutional neural network (CNN) with the same architecture is simultaneously initialized and trained as a teacher network; the features and weights of the ANN and CNN are then transformed into a new space to eliminate the accuracy drop.
To identify the redundancy in segmentation networks, we present a multi-task channel pruning approach.
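A common way to identify redundant channels, which a multi-task variant would extend by aggregating importance across task losses, is to rank a convolution's output channels by weight magnitude and keep the strongest. This is a generic L1-norm sketch, not the paper's criterion; the function names are my own.

```python
import numpy as np

def select_channels(weight, keep_ratio=0.5):
    """Rank output channels of a conv weight (out, in, kH, kW) by L1 norm
    and keep the strongest fraction; returns sorted kept indices."""
    norms = np.abs(weight).sum(axis=(1, 2, 3))
    n_keep = max(1, int(round(len(norms) * keep_ratio)))
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])
    return keep

def prune(weight, keep):
    """Drop the pruned output channels from the weight tensor."""
    return weight[keep]
```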
This paper proposes to learn a lightweight video style transfer network via knowledge distillation paradigm.
To achieve an extremely fast NAS while preserving the high accuracy, we propose to identify the vital blocks and make them the priority in the architecture search.
To this end, we propose a hierarchical trinity search framework to simultaneously discover efficient architectures for all components (i.e., backbone, neck, and head) of an object detector in an end-to-end manner.
Architectures in the population that share parameters within one SuperNet in the latest generation are tuned on the training dataset for a few epochs.
To mitigate these limitations and promote further research on hand pose estimation from stereo images, we propose a new large-scale binocular hand pose dataset called THU-Bi-Hand, offering a new perspective for fingertip localization.
Dynamic hand gesture recognition has attracted increasing attention because of its importance for human–computer interaction.
The semantic segmentation network assigns semantic labels for each point in the point set.
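Per-point labeling of this kind reduces to applying a shared classifier to every point's feature vector and taking the argmax class. A minimal sketch with a single linear layer (a simplification of any real point-segmentation head):

```python
import numpy as np

def per_point_labels(point_feats, w, b):
    """Apply the same linear classifier to every point feature
    (N, d) @ (d, num_classes) + (num_classes,), then take the
    argmax to get one semantic label per point."""
    logits = point_feats @ w + b      # shape: (N, num_classes)
    return logits.argmax(axis=-1)     # shape: (N,)
```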
Ranked #5 on Hand Pose Estimation on MSRA Hands
Different from previous works, we propose a new framework, named Two-Stream Binocular Network (TSBnet) to detect fingertips from binocular images directly.
In real-world HMI, the joints of outstretched fingers, and especially the corresponding fingertips, are much more important than other joints.
2 code implementations • • Shanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, Shile Li, Dongheui Lee, Iason Oikonomidis, Antonis Argyros, Tae-Kyun Kim
Official Torch7 implementation of "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map", CVPR 2018
Ranked #4 on Hand Pose Estimation on HANDS 2017
The proposed method extracts regions from the feature maps of a convolutional neural network under the guidance of an initially estimated pose, generating more optimal and representative features for hand pose estimation.
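The pose-guided region extraction above can be sketched as cropping a fixed-size window of the feature map around each initially estimated joint (an illustrative simplification with my own function name, clamping crops to the map borders):

```python
import numpy as np

def joint_regions(fmap, joints, r):
    """Crop a (2r x 2r) feature-map region around each initially
    estimated 2-D joint (cx, cy), clamped to the map borders."""
    H, W = fmap.shape[:2]
    regions = []
    for cx, cy in joints:
        x0 = min(max(int(cx) - r, 0), W - 2 * r)
        y0 = min(max(int(cy) - r, 0), H - 2 * r)
        regions.append(fmap[y0:y0 + 2 * r, x0:x0 + 2 * r])
    return np.stack(regions)  # shape: (num_joints, 2r, 2r, ...)
```

Each cropped region is then fed to a joint-specific regressor, so refinement focuses on features near the current pose estimate rather than the whole map.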
Ranked #6 on Hand Pose Estimation on ICVL Hands
Dynamic hand gesture recognition has attracted increasing interest because of its importance for human-computer interaction.
Ranked #3 on Hand Gesture Recognition on DHG-28
3D hand pose estimation from a single depth image is an important and challenging problem for human-computer interaction.
Ranked #2 on Pose Estimation on ITOP front-view
Hand pose estimation from monocular depth images is an important and challenging problem for human-computer interaction.
Ranked #9 on Hand Pose Estimation on MSRA Hands
Accurate detection of fingertips in depth images is critical for human-computer interaction.