Theoretically, we show that small networks pruned with our method achieve provably lower loss than small networks of the same size trained from scratch.
Introducing such multi-label examples at the cost of annotating fewer examples brings clear gains on natural language inference and entity typing tasks, even when we simply first train with single-label data and then fine-tune with multi-label examples.
The idea is to generate a set of augmented data with some random perturbations or transforms, and minimize the maximum, or worst case loss over the augmented data.
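This min-max recipe can be sketched in a few lines; here is a minimal NumPy version, where `predict` and `augment` are hypothetical stand-ins for the model's forward pass and the random perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, y):
    # per-sample cross-entropy from raw logits (stable log-softmax)
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]

def worst_case_loss(predict, x, y, augment, n_aug=4):
    """Max-over-augmentations loss: perturb x several times, score each
    copy, and keep the worst (largest) per-sample loss before averaging
    over the batch."""
    losses = np.stack([cross_entropy(predict(augment(x)), y)
                       for _ in range(n_aug)])
    return losses.max(axis=0).mean()
```

Minimizing this quantity trains the model against the hardest of the sampled perturbations rather than the average one.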
We study calibration in question answering, estimating whether the model correctly predicts the answer to each question.
To alleviate this problem, in this work we introduce novel loss functions into vision transformer training that explicitly encourage diversity across patch representations, yielding more discriminative feature extraction.
Ranked #5 on Semantic Segmentation on Cityscapes val
Weight-sharing NAS builds a supernet that assembles all the architectures as its sub-networks and jointly trains the supernet with the sub-networks.
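A toy illustration of the weight-sharing idea (names are assumptions; a single linear layer stands in for the supernet, and a sub-network is a slice of its output channels):

```python
import numpy as np

rng = np.random.default_rng(0)

def subnet_forward(W_super, x, width):
    """Weight-sharing sketch: a sub-network reuses the first `width`
    output channels of the supernet's weight matrix, so all candidate
    widths share one set of parameters."""
    return x @ W_super[:, :width]

# one joint training step samples a random sub-network width and would
# update only the shared slice it touched (illustrative only)
W = rng.standard_normal((8, 16))   # supernet weights: 8 -> up to 16 channels
x = rng.standard_normal((4, 8))    # a mini-batch of 4 inputs
width = int(rng.integers(4, 17))   # sampled sub-network width in [4, 16]
out = subnet_forward(W, x, width)
```

Because every sub-network indexes into the same `W`, training any one of them moves the shared parameters used by all the others.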
Ranked #2 on Neural Architecture Search on ImageNet
We depart from the standard practice of collecting a single reference for each training example, and find that collecting multiple references can achieve better accuracy under a fixed annotation budget.
We apply our method to the recently proposed MoCo, SimCLR, and SwAV, and find that we can reduce the computational cost with little loss in performance on ImageNet linear classification and other downstream tasks.
Data augmentation (DA) is an essential technique for training state-of-the-art deep learning systems.
Semi-supervised learning (SSL) is a key approach toward more data-efficient machine learning by jointly leveraging both labeled and unlabeled data.
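One common way to jointly leverage both kinds of data is a pseudo-labeling objective; the following is a sketch of that idea in NumPy, not any particular paper's loss, and names like `ssl_loss` and the confidence `threshold` are assumptions:

```python
import numpy as np

def ssl_loss(logits_l, y_l, logits_u, threshold=0.95):
    """Joint supervised + pseudo-label loss: labeled examples get
    ordinary cross-entropy; unlabeled examples whose max predicted
    probability clears `threshold` are trained toward their own
    argmax label, and the rest are ignored."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def ce(logits, y):
        p = softmax(logits)
        return -np.log(p[np.arange(len(y)), y])

    sup = ce(logits_l, y_l).mean()
    probs = softmax(logits_u)
    conf = probs.max(axis=1)          # model confidence per unlabeled example
    pseudo = probs.argmax(axis=1)     # its own prediction as the target
    mask = conf >= threshold
    unsup = ce(logits_u[mask], pseudo[mask]).mean() if mask.any() else 0.0
    return sup + unsup
```

The confidence mask is what lets unlabeled data contribute signal without flooding training with noisy self-labels.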
Our discovered model family, AttentiveNAS models, achieves top-1 accuracy from 77.3% to 80.7% on ImageNet, and outperforms SOTA models, including BigNAS and Once-for-All networks.
Ranked #6 on Neural Architecture Search on ImageNet
For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction cannot be altered by any possible synonymous word substitution.
This differs from existing methods based on backward elimination, which remove redundant neurons from the large network.
Randomized classifiers have been shown to provide a promising approach for achieving certified robustness against adversarial attacks in deep learning.
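A minimal sketch of how a randomized classifier makes a prediction, in the spirit of randomized smoothing (the `base_classifier` interface, `sigma`, and the majority-vote rule are illustrative assumptions, not a certification procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=100):
    """Classify many Gaussian-noised copies of x and return the
    majority class plus its empirical vote fraction. `base_classifier`
    maps an input vector to an integer class index."""
    noisy = x + sigma * rng.standard_normal((n_samples,) + x.shape)
    labels = np.array([base_classifier(xi) for xi in noisy])
    counts = np.bincount(labels)
    return counts.argmax(), counts.max() / n_samples
```

The vote fraction is the quantity that certification procedures turn into a provable robustness radius; this sketch stops at the prediction step.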
Ranked #52 on Image Classification on ImageNet
Theoretically, we show that our adversarial mechanism effectively encourages the diversity of the embedding vectors, helping to increase the robustness of models.
Ranked #1 on Machine Translation on IWSLT2015 German-English
Maximum-likelihood estimation (MLE) is widely used in sequence-to-sequence tasks for model training.
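Concretely, MLE training minimizes the token-level negative log-likelihood of the reference sequence; a minimal NumPy sketch (the `(T, V)` logits shape and `pad_id` convention are assumptions):

```python
import numpy as np

def sequence_nll(step_logits, target, pad_id=0):
    """MLE objective for one sequence: sum of -log p(y_t | y_<t, x)
    over non-padding target positions. `step_logits` holds the raw
    per-step vocabulary scores with shape (T, V)."""
    z = step_logits - step_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    mask = target != pad_id  # ignore padding positions
    return -(log_probs[np.arange(len(target)), target] * mask).sum()
```

With uniform logits over a vocabulary of size V, each target token contributes exactly log V to the loss, which is a handy sanity check.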
Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks.
Ranked #3 on Machine Translation on IWSLT2015 German-English