Transformer Quality in Linear Time

no code implementations 21 Feb 2022 Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences.

Language Modelling Masked Language Modeling

Searching for Efficient Transformers for Language Modeling

no code implementations NeurIPS 2021 David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc Le

For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X.

Language Modelling

Combined Scaling for Open-Vocabulary Image Classification

no code implementations 19 Nov 2021 Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le

Second, while increasing the dataset size and the model size has been the de facto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastive-trained image-text models is not well understood.

Ranked #2 on Zero-Shot Transfer Image Classification on ImageNet (using extra training data)

Classification Contrastive Learning +3

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

no code implementations ICLR 2022 Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.

Image Captioning Language Modelling +3

Combiner: Full Attention Transformer with Sparse Computation Cost

1 code implementation NeurIPS 2021 Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai

However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the sequence length in attention layers, which restricts application in extremely long sequences.

Image Generation Language Modelling
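
The quadratic cost mentioned in the Combiner abstract above can be illustrated with a little arithmetic. The following is a minimal sketch, not code from the paper: it assumes one full self-attention head that stores a score for every query-key pair, and a hypothetical 4 bytes per score (float32). The function names are illustrative only.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of pairwise query-key scores in one full self-attention head."""
    return seq_len * seq_len


def attention_score_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Memory for the score matrix, assuming 4-byte (float32) scores."""
    return attention_score_entries(seq_len) * bytes_per_score


# Doubling the sequence length quadruples the size of the score matrix.
lengths = [1024, 2048, 4096]
for n in lengths:
    print(n, attention_score_entries(n), attention_score_bytes(n))
```

Under these assumptions, each doubling of the sequence length multiplies the score-matrix size by four, which is the growth that sparse or factorized attention schemes try to avoid.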

CoAtNet: Marrying Convolution and Attention for All Data Sizes

8 code implementations NeurIPS 2021 Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan

Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks.

Ranked #3 on Image Classification on ImageNet (using extra training data)

Image Classification

Pay Attention to MLPs

20 code implementations NeurIPS 2021 Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le

Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years.

Image Classification Natural Language Inference +2

Unsupervised Parallel Corpus Mining on Web Data

no code implementations 18 Sep 2020 Guokun Lai, Zihang Dai, Yiming Yang

In contrast, there is a large-scale parallel corpus created by humans on the Internet.

Machine Translation +2

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

3 code implementations NeurIPS 2020 Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le

With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost.

Reading Comprehension Text Classification

Wiki-40B: Multilingual Language Model Dataset

no code implementations LREC 2020 Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou

We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.

Causal Language Modeling Language Modelling

Meta Pseudo Labels

7 code implementations CVPR 2021 Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, Quoc V. Le

We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art.

Meta-Learning Semi-Supervised Image Classification

A Mutual Information Maximization Perspective of Language Representation Learning

no code implementations ICLR 2020 Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama

We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence).

Representation Learning

XLNet: Generalized Autoregressive Pretraining for Language Understanding

23 code implementations NeurIPS 2019 Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling.

Audio Question Answering Chinese Reading Comprehension +9

Unsupervised Data Augmentation for Consistency Training

18 code implementations NeurIPS 2020 Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le

In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically that of noise produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.

Image Augmentation Semi-Supervised Image Classification +2

Re-examination of the Role of Latent Variables in Sequence Modeling

1 code implementation NeurIPS 2019 Zihang Dai, Guokun Lai, Yiming Yang, Shinjae Yoo

With latent variables, stochastic recurrent models have achieved state-of-the-art performance in modeling sound-wave sequences.

Density Estimation Frame

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

27 code implementations ACL 2019 Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling.

Language Modelling

Characterizing and Avoiding Negative Transfer

no code implementations CVPR 2019 Zirui Wang, Zihang Dai, Barnabás Póczos, Jaime Carbonell

When labeled data is scarce for a specific target task, transfer learning often offers an effective solution by utilizing data from a related source task.

Transfer Learning

From Credit Assignment to Entropy Regularization: Two New Algorithms for Neural Sequence Prediction

1 code implementation ACL 2018 Zihang Dai, Qizhe Xie, Eduard Hovy

In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and the entropy regularized reinforcement learning.


Large-scale Cloze Test Dataset Designed by Teachers

no code implementations ICLR 2018 Qizhe Xie, Guokun Lai, Zihang Dai, Eduard Hovy

The cloze test is widely adopted in language exams to evaluate students' language proficiency.

Cloze Test

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

9 code implementations ICLR 2018 Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck.

Language Modelling Word Embeddings
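
The matrix factorization view in the abstract above can be made concrete with a toy example. This is a minimal sketch under stated assumptions, not the paper's code: it assumes the logits are dot products between context vectors and word embeddings, both of a hypothetical dimension 2, so the logit matrix over all (context, word) pairs has rank at most 2 no matter how many contexts or words there are. All values and helper names are made up for illustration.

```python
from fractions import Fraction


def matmul_transposed(h, w):
    """Return H @ W^T for lists of row vectors: one logit per (context, word)."""
    return [[sum(a * b for a, b in zip(h_row, w_row)) for w_row in w] for h_row in h]


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Hypothetical toy data: 4 contexts and 5 words, each embedded in 2 dimensions.
contexts = [[1, 0], [0, 1], [1, 1], [2, -1]]
embeddings = [[1, 2], [3, 1], [0, 1], [2, 2], [-1, 1]]

logits = matmul_transposed(contexts, embeddings)  # 4 x 5 logit matrix
logit_rank = matrix_rank(logits)
print(logit_rank)
```

Although the logit matrix here is 4 x 5, its rank cannot exceed the embedding dimension, which is the sense in which a single softmax over dot products limits expressiveness.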

Controllable Invariance through Adversarial Feature Learning

1 code implementation NeurIPS 2017 Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig

Learning meaningful representations that maintain the content necessary for a particular task while filtering away detrimental variations is a problem of great interest in machine learning.

General Classification Image Classification +1

Good Semi-supervised Learning that Requires a Bad GAN

1 code implementation NeurIPS 2017 Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, Ruslan Salakhutdinov

Semi-supervised learning methods based on generative adversarial networks (GANs) obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time.

General Classification Semi-Supervised Image Classification

An Interpretable Knowledge Transfer Model for Knowledge Base Completion

no code implementations ACL 2017 Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard Hovy

Knowledge bases are important resources for a variety of natural language processing tasks but suffer from incompleteness.

Knowledge Base Completion Transfer Learning

Calibrating Energy-based Generative Adversarial Networks

1 code implementation 6 Feb 2017 Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, Aaron Courville

In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples. Specifically, we propose a flexible adversarial training framework, and prove that this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimum.

Image Generation

CFO: Conditional Focused Neural Question Answering with Large-scale Knowledge Bases

no code implementations ACL 2016 Zihang Dai, Lei Li, Wei Xu

We propose CFO, a Conditional Focused neural-network-based approach to answering factoid questions with knowledge bases.

Question Answering

