no code implementations • 21 Feb 2022 • Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le
We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences.
Ranked #1 on Language Modelling on Wiki-40B
no code implementations • 19 Nov 2021 • Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le
Second, while increasing the dataset size and the model size has been the de facto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastive-trained image-text models is not well understood.
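As context for why the contrastive batch size matters: in image-text contrastive training of this kind, every other example in the batch serves as a negative, so the batch size directly controls how many (and how hard) negatives each example sees. A minimal numpy sketch of a symmetric contrastive loss; the function names and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired image/text embeddings.

    Every non-matching pair in the batch acts as a negative, so a larger
    batch means more negatives per example.
    """
    # L2-normalize embeddings, shape (batch, dim)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))             # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# toy usage: 8 paired embeddings of dimension 16
rng = np.random.default_rng(0)
print(symmetric_contrastive_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16))))
```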
3 code implementations • 17 Sep 2021 • David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le
For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X.
Ranked #1 on Language Modelling on C4
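The snippet above reports the efficiency result without the architectural change behind it. The modification most commonly credited to Primer (reported in the full paper, not in the text quoted here) is replacing the feed-forward ReLU with a squared ReLU; a minimal sketch of that activation, with illustrative names and shapes:

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU: relu(x)**2, the activation change most associated with Primer."""
    return np.maximum(x, 0.0) ** 2

def feed_forward(x, w_in, w_out):
    """Transformer feed-forward block using the squared-ReLU activation.

    Shapes are illustrative: x is (seq_len, d_model), w_in is (d_model, d_ff),
    w_out is (d_ff, d_model).
    """
    return squared_relu(x @ w_in) @ w_out
```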
2 code implementations • ICLR 2022 • Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.
Ranked #4 on Visual Entailment on SNLI-VE val
2 code implementations • NeurIPS 2021 • Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai
However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the sequence length in attention layers, which restricts their application to extremely long sequences.
Ranked #2 on Language Modelling on Wiki-40B
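The $\mathcal{O}(L^2)$ cost comes from materializing an $L \times L$ score matrix per attention head. A small sketch makes the scaling concrete; this is illustrative code for the baseline being improved, not the paper's method:

```python
import numpy as np

def full_attention(q, k, v):
    """Vanilla attention: the (L, L) score matrix is the quadratic bottleneck."""
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (L, L): O(L^2) memory and time
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                            # (L, d)

# doubling L quadruples the number of score-matrix entries
rng = np.random.default_rng(0)
for L in (1024, 2048):
    q = k = v = rng.normal(size=(L, 64))
    _ = full_attention(q, k, v)
    print(L, "score-matrix entries:", L * L)
```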
14 code implementations • NeurIPS 2021 • Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks.
Ranked #1 on Image Classification on GasHisSDB
18 code implementations • NeurIPS 2021 • Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years.
Ranked #22 on Natural Language Inference on MultiNLI
no code implementations • 18 Sep 2020 • Guokun Lai, Zihang Dai, Yiming Yang
In contrast, large-scale parallel corpora created by humans are available on the Internet.
3 code implementations • NeurIPS 2020 • Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le
With the success of language pretraining, it is highly desirable to develop more efficient architectures with good scalability that can exploit the abundant unlabeled data at a lower cost.
Ranked #6 on Reading Comprehension on RACE
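For context on how the paper pursues "lower cost" (described in the full Funnel-Transformer paper, not in the snippet above): the sequence of hidden states is progressively pooled so that deeper layers operate on fewer positions. A rough sketch of such pooling, with hypothetical shapes:

```python
import numpy as np

def pool_sequence(hidden, stride=2):
    """Mean-pool along the sequence axis, shrinking the number of positions.

    hidden: (seq_len, d_model). Deeper blocks then attend over the shorter
    sequence, which is where the compute savings come from.
    """
    seq_len, d = hidden.shape
    trimmed = hidden[: seq_len - seq_len % stride]       # drop a ragged tail if any
    return trimmed.reshape(-1, stride, d).mean(axis=1)   # (seq_len // stride, d_model)

h = np.random.default_rng(0).normal(size=(512, 768))
print(pool_sequence(h).shape)   # (256, 768)
```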
no code implementations • LREC 2020 • Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou
We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.
8 code implementations • CVPR 2021 • Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, Quoc V. Le
We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art.
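The snippet states the result; the mechanism (per the full paper, not quoted here) is a teacher that produces pseudo labels for a student and is itself updated using the student's performance on labeled data. A highly simplified sketch of one plain pseudo-labeling round, with illustrative names; the actual method additionally back-propagates the feedback signal into the teacher:

```python
import numpy as np

def pseudo_label_round(teacher_predict, student_fit, student_eval,
                       unlabeled_x, labeled_x, labeled_y):
    """One simplified teacher-student round.

    teacher_predict(x) -> class probabilities; student_fit(x, y) trains the
    student; student_eval(x, y) -> labeled-set loss. In Meta Pseudo Labels the
    teacher is then updated from this student feedback (omitted here).
    """
    pseudo_y = np.argmax(teacher_predict(unlabeled_x), axis=1)  # hard pseudo labels
    student_fit(unlabeled_x, pseudo_y)                          # student learns from them
    feedback = student_eval(labeled_x, labeled_y)               # signal for the teacher update
    return feedback

# toy usage with stand-in callables
probs = lambda x: np.tile([0.2, 0.8], (len(x), 1))
print(pseudo_label_round(probs, lambda x, y: None, lambda x, y: 0.42,
                         np.zeros((4, 3)), np.zeros((2, 3)), np.array([0, 1])))
```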
no code implementations • ICLR 2020 • Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama
We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence).
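The bound in question is typically of the InfoNCE/contrastive form; one standard statement (notation mine, not quoted from the paper) for two parts $A$ and $B$ of a sentence is $I(A; B) \ge \log N + \mathbb{E}\big[\log \frac{f(a, b)}{\sum_{b'} f(a, b')}\big]$, where $f$ is a learned scoring function and the sum runs over the positive $b$ plus $N-1$ negative samples.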
23 code implementations • NeurIPS 2019 • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
With the capability of modeling bidirectional contexts, denoising-autoencoding-based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling.
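For reference (notation mine, following the standard formulations): autoregressive pretraining maximizes $\sum_t \log p_\theta(x_t \mid x_{<t})$ with a left-to-right factorization, while denoising autoencoding (BERT-style) corrupts the input to $\hat{x}$ by masking and maximizes $\sum_t m_t \log p_\theta(x_t \mid \hat{x})$ over the masked positions ($m_t = 1$), which is what allows conditioning on context from both directions, at the cost of an independence assumption among the masked tokens.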
20 code implementations • NeurIPS 2020 • Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le
In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
Ranked #1 on Sentiment Analysis on Amazon Review Full
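The unsupervised training signal described here is a consistency loss between the model's prediction on an unlabeled example and on its augmented version. A minimal sketch of such a consistency term; this is illustrative, not the released implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Row-wise KL(p || q) for class-probability vectors."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=1)

def consistency_loss(probs_clean, probs_augmented):
    """Predictions on the original example (treated as a fixed target) should
    match predictions on the augmented version."""
    return kl_divergence(probs_clean, probs_augmented).mean()

# toy usage: 4 unlabeled examples, 3 classes
clean = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]])
augmented = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]])
print(consistency_loss(clean, augmented))
```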
1 code implementation • NeurIPS 2019 • Zihang Dai, Guokun Lai, Yiming Yang, Shinjae Yoo
With latent variables, stochastic recurrent models have achieved state-of-the-art performance in modeling sound-wave sequences.
34 code implementations • ACL 2019 • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling.
Ranked #3 on Language Modelling on One Billion Word
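Transformer-XL's remedy for the fixed-length context (described in the full paper, not in the snippet above) is segment-level recurrence: hidden states from the previous segment are cached and prepended, without gradients, to the keys and values of the current segment. A schematic sketch with illustrative shapes:

```python
import numpy as np

def extend_context(memory, current_hidden):
    """Concatenate cached states from the previous segment with the current one.

    memory: (mem_len, d_model) hidden states cached from the previous segment
            (treated as constants, i.e. no gradient flows into them).
    current_hidden: (seg_len, d_model) states of the current segment.
    Keys/values are computed from the concatenation; queries come only from the
    current segment, so the effective context grows beyond a single segment.
    """
    return np.concatenate([memory, current_hidden], axis=0)  # (mem_len + seg_len, d_model)

mem = np.zeros((128, 512))
cur = np.random.default_rng(0).normal(size=(128, 512))
print(extend_context(mem, cur).shape)  # (256, 512)
```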
no code implementations • CVPR 2019 • Zirui Wang, Zihang Dai, Barnabás Póczos, Jaime Carbonell
When labeled data is scarce for a specific target task, transfer learning often offers an effective solution by utilizing data from a related source task.
1 code implementation • 25 Sep 2018 • Xiang Kong, Qizhe Xie, Zihang Dai, Eduard Hovy
Mixture of Softmaxes (MoS) has been shown to be effective at addressing the expressiveness limitation of Softmax-based models.
Ranked #18 on Machine Translation on WMT2014 English-French
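For reference, the MoS output layer replaces the single softmax with a context-dependent mixture of $K$ softmaxes, roughly $P(y \mid c) = \sum_{k=1}^{K} \pi_{c,k} \, \frac{\exp(h_{c,k}^{\top} w_y)}{\sum_{y'} \exp(h_{c,k}^{\top} w_{y'})}$, where the mixture weights $\pi_{c,k}$ and component contexts $h_{c,k}$ are computed from the decoder state (notation follows the original MoS formulation; the details of this paper's machine-translation variant are not quoted here).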
no code implementations • EMNLP 2018 • Xinyi Wang, Hieu Pham, Zihang Dai, Graham Neubig
In this work, we examine methods for data augmentation for text-based tasks such as neural machine translation (NMT).
1 code implementation • ACL 2018 • Zihang Dai, Qizhe Xie, Eduard Hovy
In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and entropy-regularized reinforcement learning.
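For reference, the sequence-level RAML objective being analyzed maximizes $\mathbb{E}_{y \sim q(y \mid y^*; \tau)}[\log p_\theta(y \mid x)]$, where $q(y \mid y^*; \tau) \propto \exp\{r(y, y^*)/\tau\}$ is the exponentiated-payoff distribution around the reference $y^*$; the paper's contribution is relating its token-level counterpart to entropy-regularized RL (notation mine, following the original RAML formulation rather than this paper's text).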
no code implementations • ICLR 2018 • Qizhe Xie, Guokun Lai, Zihang Dai, Eduard Hovy
The cloze test is widely adopted in language exams to evaluate students' language proficiency.
9 code implementations • ICLR 2018 • Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck.
Ranked #10 on Language Modelling on Penn Treebank (Word Level)
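The matrix-factorization view, roughly (notation mine, following the paper's argument): stack the true conditional log-probabilities into a matrix $A$ with $A_{c,y} = \log P^*(y \mid c)$. A single softmax over $d$-dimensional context vectors can only realize matrices of the form $H W^{\top}$ up to an additive row-wise constant, and every such matrix has rank at most $d+1$; so if the true $A$ has higher rank, no $d$-dimensional softmax model can match it. This is the "Softmax bottleneck".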
2 code implementations • EMNLP 2018 • Qizhe Xie, Guokun Lai, Zihang Dai, Eduard Hovy
Cloze tests are widely adopted in language exams to evaluate students' language proficiency.
1 code implementation • NeurIPS 2017 • Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig
Learning meaningful representations that maintain the content necessary for a particular task while filtering away detrimental variations is a problem of great interest in machine learning.
1 code implementation • NeurIPS 2017 • Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, Ruslan Salakhutdinov
Semi-supervised learning methods based on generative adversarial networks (GANs) have obtained strong empirical results, but it is not clear (1) how the discriminator benefits from joint training with a generator, and (2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time.
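For context, the semi-supervised GAN setup being analyzed (the standard formulation, not quoted from this paper) gives the discriminator $K+1$ outputs: classes $1, \dots, K$ for real data and class $K+1$ for generated samples, trained with $\log p_D(y \mid x)$ on labeled data, $\log\big(1 - p_D(K{+}1 \mid x)\big)$ on unlabeled real data, and $\log p_D(K{+}1 \mid G(z))$ on generated samples; the paper asks when joint training with $G$ actually helps this classifier.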
no code implementations • ACL 2017 • Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard Hovy
Knowledge bases are important resources for a variety of natural language processing tasks but suffer from incompleteness.
1 code implementation • 6 Feb 2017 • Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, Aaron Courville
In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples. Specifically, we propose a flexible adversarial training framework, and prove that this framework not only ensures that the generator converges to the true data distribution, but also enables the discriminator to retain density information at the global optimum.
Ranked #17 on Conditional Image Generation on CIFAR-10 (Inception score metric)
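Stated loosely (my paraphrase of the claim, with the exact regularizers omitted): at the global optimum the generator distribution matches $p_{\mathrm{data}}$ and the discriminator converges to an energy $c^*(x)$ with $p_{\mathrm{data}}(x) \propto \exp\{-c^*(x)\}$ on the data support, i.e. the discriminator's output behaves as a calibrated (unnormalized) log-density rather than collapsing to a constant.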
no code implementations • ACL 2016 • Zihang Dai, Lei Li, Wei Xu
We propose CFO, a Conditional Focused neural-network-based approach to answering factoid questions with knowledge bases.