Search Results for author: Shawn Tan

Found 18 papers, 12 papers with code

Unsupervised Dependency Graph Network

1 code implementation · ACL 2022 · Yikang Shen, Shawn Tan, Alessandro Sordoni, Peng Li, Jie Zhou, Aaron Courville

We introduce a new model, the Unsupervised Dependency Graph Network (UDGN), that can induce dependency structures from raw corpora and the masked language modeling task.

Language Modeling · Language Modelling · +4

Stick-breaking Attention

2 code implementations · 23 Oct 2024 · Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda

We propose an alternative attention mechanism based on the stick-breaking process: For each token before the current, we determine a break point $\beta_{i, j}$, which represents the proportion of the remaining stick to allocate to the current token.
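Below is a minimal, illustrative sketch of how stick-breaking weights could be computed from per-pair break-point logits. It is not the paper's implementation (which operates on attention logits inside a Transformer and is optimised for GPUs); the function name and the explicit double loop are assumptions made for clarity.

```python
import torch

def stick_breaking_weights(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: turn break-point logits into attention weights.

    logits[i, j] scores how much of the remaining stick query token i
    gives to an earlier token j (j < i). beta[i, j] = sigmoid(logits[i, j])
    is that proportion; tokens closer to i claim their share first, and
    earlier tokens only receive a fraction of what is left over.
    """
    T = logits.size(0)
    beta = torch.sigmoid(logits)
    weights = torch.zeros_like(beta)
    for i in range(T):
        remaining = 1.0
        for j in range(i - 1, -1, -1):  # walk backwards from the most recent token
            weights[i, j] = beta[i, j] * remaining
            remaining = remaining * (1.0 - beta[i, j])
    return weights
```

In this sketch the weights for a query need not sum to one: whatever remains of the stick after the earliest previous token is simply left unallocated.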

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

1 code implementation · 23 Aug 2024 · Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda

This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters, but also because it is prohibitively expensive to perform a hyperparameter search for large language models with billions or trillions of parameters.

Scattered Mixture-of-Experts Implementation

3 code implementations · 13 Mar 2024 · Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville

We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs.

Sparse Universal Transformer

2 code implementations · 11 Oct 2023 · Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
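As a rough illustration of the parameter sharing that the UT is built on, the sketch below applies one shared encoder layer for several depth steps. The class name, dimensions, and step count are placeholders, and the SUT's sparse mixture-of-experts layers are not shown.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Illustrative UT-style encoder: one layer's parameters reused at every depth."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_steps: int = 6):
        super().__init__()
        # A single layer instead of n_steps distinct layers.
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_steps = n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = self.layer(x)  # same weights applied at every depth step
        return x
```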

ModuleFormer: Modularity Emerges from Mixture-of-Experts

1 code implementation · 7 Jun 2023 · Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan

In our experiments, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency: since ModuleFormer only activates a subset of its modules for each input token, it can match the performance of dense LLMs with more than twice the throughput; 2) Extendability: ModuleFormer is more robust to catastrophic forgetting than dense LLMs and can easily be extended with new modules to learn knowledge that is not included in the training data; 3) Specialisation: finetuning ModuleFormer can specialise a subset of modules to the finetuning task, and the task-unrelated modules can easily be pruned for lightweight deployment.
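The per-token sparse activation mentioned above can be pictured with a generic top-k router, as in the sketch below. This is a standard mixture-of-experts routing pattern, not ModuleFormer's actual architecture; all names, sizes, and the choice of linear modules are assumptions.

```python
import torch
import torch.nn as nn

class TopKModuleLayer(nn.Module):
    """Generic sketch of per-token sparse module activation via top-k routing."""

    def __init__(self, d_model: int = 256, n_modules: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_modules)
        self.modules_list = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_modules)]
        )
        self.k = k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_modules)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        gates = torch.softmax(topk_val, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k modules run per token
            idx = topk_idx[:, slot]
            for m, module in enumerate(self.modules_list):
                mask = idx == m
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * module(x[mask])
        return out
```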

Language Modelling

Learning to Dequantise with Truncated Flows

no code implementations · ICLR 2022 · Shawn Tan, Chin-wei Huang, Alessandro Sordoni, Aaron Courville

Additionally, since the support of the marginal $q(z)$ is bounded and the support of the prior $p(z)$ is not, we propose renormalising the prior distribution over the support of $q(z)$.
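In symbols, writing $S$ for the (bounded) support of $q(z)$, such a renormalised prior takes the form $\tilde{p}(z) = p(z)\,\mathbf{1}[z \in S] \big/ \int_{S} p(z')\,dz'$; this is the generic form of the renormalisation, not necessarily the paper's exact parameterisation.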

Variational Inference

Ordered Memory

1 code implementation · NeurIPS 2019 · Yikang Shen, Shawn Tan, Arian Hosseini, Zhouhan Lin, Alessandro Sordoni, Aaron Courville

Inspired by Ordered Neurons (Shen et al., 2018), we introduce a new attention-based mechanism and use its cumulative probability to control the writing and erasing operations of the memory.
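A very rough sketch of gating with a cumulative probability (in the spirit of the cumax gate of Ordered Neurons) is given below; it is not the Ordered Memory architecture, and the gate names and their use are assumptions for illustration only.

```python
import torch

def cumulative_gates(slot_logits: torch.Tensor):
    """Sketch: derive monotone gates over memory slots from one attention distribution.

    p is a probability distribution over memory slots; its cumulative sum rises
    monotonically from 0 to 1, so it can serve as a soft mask that switches on
    for slots at or above (respectively below) the attended position.
    """
    p = torch.softmax(slot_logits, dim=-1)
    upper = torch.cumsum(p, dim=-1)   # close to 1 for slots at/above the attended position
    lower = 1.0 - upper + p           # close to 1 for slots at/below the attended position
    return p, upper, lower
```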

ListOps

Icentia11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery

1 code implementation · 21 Oct 2019 · Shawn Tan, Guillaume Androz, Ahmad Chamseddine, Pierre Fecteau, Aaron Courville, Yoshua Bengio, Joseph Paul Cohen

We release the largest public ECG dataset of continuous raw signals for representation learning containing 11 thousand patients and 2 billion labelled beats.

Clustering · Representation Learning

Investigating Biases in Textual Entailment Datasets

no code implementations · 23 Jun 2019 · Shawn Tan, Yikang Shen, Chin-wei Huang, Aaron Courville

Understanding logical relationships between sentences is an important task in language understanding.

BIG-bench Machine Learning · Natural Language Inference · +2

Improving Explorability in Variational Inference with Annealed Variational Objectives

1 code implementation · NeurIPS 2018 · Chin-wei Huang, Shawn Tan, Alexandre Lacoste, Aaron Courville

Despite the advances in the representational capacity of approximate distributions for variational inference, the optimization process can still limit the density that is ultimately learned.

Variational Inference

Generating Contradictory, Neutral, and Entailing Sentences

no code implementations · 7 Mar 2018 · Yikang Shen, Shawn Tan, Chin-wei Huang, Aaron Courville

Learning distributed sentence representations remains an interesting problem in the field of Natural Language Processing (NLP).

Diversity · Natural Language Inference · +2

Self-organized Hierarchical Softmax

no code implementations · 26 Jul 2017 · Yikang Shen, Shawn Tan, Christopher Pal, Aaron Courville

We propose a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies.
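For context, a plain two-level hierarchical softmax factorises the word probability into a cluster term and a within-cluster term, as in the sketch below. The self-organising part of the paper (learning the cluster assignments) is not shown, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class TwoLevelSoftmax(nn.Module):
    """Generic two-level hierarchical softmax: p(w|h) = p(c(w)|h) * p(w|c(w), h)."""

    def __init__(self, d_model: int, n_clusters: int, cluster_size: int):
        super().__init__()
        self.cluster_head = nn.Linear(d_model, n_clusters)   # scores over clusters
        self.word_heads = nn.ModuleList(
            [nn.Linear(d_model, cluster_size) for _ in range(n_clusters)]
        )

    def log_prob(self, h, cluster_id: int, word_id: int):
        # h: (d_model,) hidden state; cluster_id/word_id index the target word.
        log_p_cluster = torch.log_softmax(self.cluster_head(h), dim=-1)[cluster_id]
        log_p_word = torch.log_softmax(self.word_heads[cluster_id](h), dim=-1)[word_id]
        return log_p_cluster + log_p_word     # log p(w|h)
```

Only the scores for one cluster's words are computed per target word, which is what makes this cheaper than a full softmax over a large vocabulary.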

Language Modeling · Language Modelling · +2
