Search Results for author: Yikang Shen

Found 65 papers, 28 papers with code

Unsupervised Dependency Graph Network

1 code implementation · ACL 2022 · Yikang Shen, Shawn Tan, Alessandro Sordoni, Peng Li, Jie Zhou, Aaron Courville

We introduce a new model, the Unsupervised Dependency Graph Network (UDGN), that can induce dependency structures from raw corpora and the masked language modeling task.

Language Modelling · Masked Language Modeling +3

Phrase-aware Unsupervised Constituency Parsing

no code implementations · ACL 2022 · Xiaotao Gu, Yikang Shen, Jiaming Shen, Jingbo Shang, Jiawei Han

Recent studies have achieved inspiring success in unsupervised grammar induction using masked language modeling (MLM) as the proxy task.

Constituency Parsing · Language Modelling +1

Stick-breaking Attention

1 code implementation · 23 Oct 2024 · Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda

We propose an alternative attention mechanism based on the stick-breaking process: For each token $j$ before the current token $i$, we determine a break point $\beta_{i, j}$, which represents the proportion of the remaining stick to allocate to token $j$.
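To make the stick-breaking idea above concrete, here is a minimal single-head NumPy sketch. The sigmoid parameterization of $\beta_{i,j}$ and the nearest-token-first breaking order are assumptions for illustration, not the paper's exact formulation or kernel.

```python
import numpy as np

def stick_breaking_attention(q, k, v):
    """Single-head sketch: for query i, walk over earlier tokens (nearest
    first); each beta[i, j] takes a fraction of the still-unallocated stick."""
    T, d = q.shape
    beta = 1.0 / (1.0 + np.exp(-(q @ k.T) / np.sqrt(d)))  # break points in (0, 1)
    A = np.zeros((T, T))
    for i in range(T):
        remaining = 1.0
        for j in range(i - 1, -1, -1):        # strictly earlier tokens, nearest first
            A[i, j] = beta[i, j] * remaining  # share of the remaining stick
            remaining *= 1.0 - beta[i, j]
    return A @ v                              # token 0 has no context, so row 0 is zero
```

Because each weight is carved sequentially out of a unit stick, the weights for a query sum to at most one without a softmax, which is the property the stick-breaking construction provides.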

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

no code implementations · 23 Aug 2024 · Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda

This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters, but also because it is prohibitively expensive to perform a hyperparameter search for large language models with billions or trillions of parameters.

FlexAttention for Efficient High-Resolution Vision-Language Models

no code implementations · 29 Jul 2024 · Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan

Specifically, a high-resolution image is encoded as both high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized to calculate the attention map, which greatly reduces the computational cost.
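As a rough illustration of the selection step described above, the sketch below attends over all low-resolution tokens plus only the top-k high-resolution tokens scored by the current query. The scoring rule, the value of k, and the single-query view are assumptions; the paper's actual selection module and attention kernel may differ.

```python
import numpy as np

def flex_attention_sketch(query, lo_tokens, hi_tokens, k_select=64):
    """Attend over all low-res tokens plus only k_select high-res tokens
    (hypothetical selection by dot-product score with the current query)."""
    d = query.shape[0]
    top = np.argsort(-(hi_tokens @ query))[:k_select]       # a few selected hi-res tokens
    keys = np.concatenate([lo_tokens, hi_tokens[top]], axis=0)
    scores = keys @ query / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                             # softmax over the reduced set
    return w @ keys                                          # attention output
```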

Getting the Agent to Wait

no code implementations · 26 Jul 2024 · Maryam Saeedi, Yikang Shen, Ali Shourideh

We examine the strategic interaction between an expert (principal) maximizing engagement and an agent seeking swift information.

LaMAGIC: Language-Model-based Topology Generation for Analog Integrated Circuits

no code implementations · 19 Jul 2024 · Chen-Chia Chang, Yikang Shen, Shaoze Fan, Jing Li, Shun Zhang, Ningyuan Cao, Yiran Chen, Xin Zhang

To this end, we introduce LaMAGIC, a pioneering language model-based topology generation model that leverages supervised finetuning for automated analog circuit design.

Electrical Engineering · Graph Generation +1

The infrastructure powering IBM's Gen AI model development

no code implementations · 7 Jul 2024 · Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla, Lan Hoang, Danny Barnett, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim, Ali Sydney, Pavlos Maniotis, Laurent Schares, Bernard Metzler, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, Sunyanan Choochotkaew, Takeshi Yoshimura, Claudia Misale, Tonia Elengikal, Kevin O Connor, Zhuoran Liu, Richard Molina, Lars Schneidenbach, James Caden, Christopher Laibinis, Carlos Fonseca, Vasily Tarasov, Swaminathan Sundararaman, Frank Schmuck, Scott Guthridge, Jeremy Cohn, Marc Eshel, Paul Muench, Runyu Liu, William Pointer, Drew Wyskida, Bob Krull, Ray Rose, Brent Wolfe, William Cornejo, John Walter, Colm Malone, Clifford Perucci, Frank Franco, Nigel Hinds, Bob Calio, Pavel Druyan, Robert Kilduff, John Kienle, Connor McStay, Andrew Figueroa, Matthew Connolly, Edie Fost, Gina Roma, Jake Fonseca, Ido Levy, Michele Payne, Ryan Schenkel, Amir Malki, Lion Schneider, Aniruddha Narkhede, Shekeba Moshref, Alexandra Kisin, Olga Dodin, Bill Rippon, Henry Wrieth, John Ganci, Johnny Colino, Donna Habeger-Rose, Rakesh Pandey, Aditya Gidh, Dennis Patterson, Samsuddin Salmani, Rambilas Varma, Rumana Rumana, Shubham Sharma, Aditya Gaur, Mayank Mishra, Rameswar Panda, Aditya Prasad, Matt Stallone, Gaoyuan Zhang, Yikang Shen, David Cox, Ruchir Puri, Dakshi Agrawal, Drew Thorstensen, Joel Belog, Brent Tang, Saurabh Kumar Gupta, Amitabha Biswas, Anup Maheshwari, Eran Gampel, Jason Van Patten, Matthew Runion, Sai Kaki, Yigal Bogin, Brian Reitz, Steve Pritko, Shahan Najam, Surya Nambala, Radhika Chirra, Rick Welp, Frank DiMitri, Felipe Telles, Amilcar Arvelo, King Chu, Ed Seminaro, Andrew Schram, Felix Eickhoff, William Hanson, Eric Mckeever, Dinakaran Joseph, Piyush Chaudhary, Piyush Shivam, Puneet Chaudhary, Wesley Jones, Robert Guthrie, Chris Bostic, Rezaul Islam, Steve Duersch, Wayne Sawdon, John Lewars, Matthew Klos, Michael Spriggs, Bill McMillan, George Gao, Ashish Kamra, Gaurav Singh, Marc Curry, Tushar Katarki, Joe Talerico, Zenghui Shi, Sai Sindhur Malleni, Erwan Gallen

This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks.

Octo-planner: On-device Language Model for Planner-Action Agents

no code implementations · 26 Jun 2024 · Wei Chen, Zhiyuan Li, Zhen Guo, Yikang Shen

In this paper, we present an efficient on-device Planner-Action framework that separates planning and action execution into two distinct components: a planner agent based on Phi-3 Mini, a 3.8-billion-parameter LLM optimized for edge devices, and an action agent using the Octopus model for function execution.

Computational Efficiency · In-Context Learning +1

Efficient Continual Pre-training by Mitigating the Stability Gap

no code implementations · 21 Jun 2024 · Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen

This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution.

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

1 code implementation · 10 Jun 2024 · Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention.

Language Modelling · Mamba +1

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

no code implementations · 24 May 2024 · Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu

For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6% speedup.

JetMoE: Reaching Llama2 Performance with 0.1M Dollars

4 code implementations · 11 Apr 2024 · Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin

Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence.

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

no code implementations · 8 Apr 2024 · Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios.

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

1 code implementation · 14 Mar 2024 · Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan

This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term easy-to-hard generalization.

Math · Reinforcement Learning (RL) +1

Scattered Mixture-of-Experts Implementation

2 code implementations · 13 Mar 2024 · Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville

We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs.

API Pack: A Massive Multi-Programming Language Dataset for API Call Generation

1 code implementation · 14 Feb 2024 · Zhen Guo, Adriana Meza Soria, Wei Sun, Yikang Shen, Rameswar Panda

We introduce API Pack, a massive multi-programming language dataset containing more than 1 million instruction-API call pairs to improve the API call generation capabilities of large language models.

Diversity Measurement and Subset Selection for Instruction Tuning Datasets

no code implementations · 4 Feb 2024 · Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda

Our experiments demonstrate that the proposed diversity measure in the normalized weight gradient space is correlated with downstream instruction-following performance.

Diversity · Instruction Following +1

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

no code implementations · 30 Jan 2024 · Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.

Language Modelling · Large Language Model +1

Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

no code implementations · 19 Jan 2024 · Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen

In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures.

Gated Linear Attention Transformers with Hardware-Efficient Training

4 code implementations · 11 Dec 2023 · Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim

When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer (Touvron et al., 2023) as well as recent linear-time-inference baselines such as RetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale language modeling experiments.

2k · Language Modelling +1

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

no code implementations · 6 Nov 2023 · Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan

A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far.

CoLA · Question Answering +5

The Consensus Game: Language Model Generation via Equilibrium Search

no code implementations · 13 Oct 2023 · Athul Paul Jacob, Yikang Shen, Gabriele Farina, Jacob Andreas

When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate outputs).

Language Modelling · Mathematical Problem-Solving +3

Sparse Universal Transformer

2 code implementations · 11 Oct 2023 · Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

no code implementations · ICCV 2023 · Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques.

Graph Generation · Panoptic Scene Graph Generation +1

SALMON: Self-Alignment with Instructable Reward Models

1 code implementation · 9 Oct 2023 · Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents.

In-Context Learning · Language Modelling

GraphText: Graph Reasoning in Text Space

1 code implementation · 2 Oct 2023 · Jianan Zhao, Le Zhuo, Yikang Shen, Meng Qu, Kai Liu, Michael Bronstein, Zhaocheng Zhu, Jian Tang

Furthermore, GraphText paves the way for interactive graph reasoning, allowing both humans and LLMs to communicate with the model seamlessly using natural language.

In-Context Learning · Text Generation

Aligning Large Multimodal Models with Factually Augmented RLHF

no code implementations · 25 Sep 2023 · Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context.

Hallucination · Image Captioning +1

ModuleFormer: Modularity Emerges from Mixture-of-Experts

1 code implementation · 7 Jun 2023 · Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan

In our experiment, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency: since ModuleFormer only activates a subset of its modules for each input token, it can match the performance of dense LLMs with more than twice the throughput; 2) Extendability: ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can easily be extended with new modules to learn knowledge that is not included in the training data; 3) Specialisation: finetuning ModuleFormer can specialize a subset of modules to the finetuning task, and the task-unrelated modules can easily be pruned for lightweight deployment.

Language Modelling

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

1 code implementation · NeurIPS 2023 · Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable.

Diversity · In-Context Learning +1

Hyper-Decision Transformer for Efficient Online Policy Adaptation

no code implementations · 17 Apr 2023 · Mengdi Xu, Yuchen Lu, Yikang Shen, Shun Zhang, Ding Zhao, Chuang Gan

To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner.

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

1 code implementation · CVPR 2023 · Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.

Planning with Large Language Models for Code Generation

no code implementations · 9 Mar 2023 · Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process.

Code Generation · Language Modelling +1

Transformer-Patcher: One Mistake worth One Neuron

1 code implementation · 24 Jan 2023 · Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, Zhang Xiong

Our method outperforms previous fine-tuning and HyperNetwork-based methods and achieves state-of-the-art performance for Sequential Model Editing (SME).

Model Editing

Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners

no code implementations · 15 Dec 2022 · Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik Learned-Miller, Chuang Gan

To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad').

Multi-Task Learning

Mixture of Attention Heads: Selecting Attention Heads Per Token

2 code implementations · 11 Oct 2022 · Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong

This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head attention with the MoE mechanism.

Computational Efficiency · Language Modelling +2
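The MoA snippet above describes a routing idea: a lightweight router scores the attention heads for each token, and only a few heads are evaluated per token. The sketch below shows just that routing step under assumed details (a linear router, top-k selection, gates renormalized over the chosen heads); the head computation itself and MoA's exact gating are omitted.

```python
import numpy as np

def route_heads(x, w_router, k=2):
    """Per-token head selection: x is (seq, d), w_router is (d, num_heads).
    Returns the indices of the k chosen heads and their mixing weights."""
    logits = x @ w_router                                    # (seq, num_heads)
    chosen = np.argsort(-logits, axis=-1)[:, :k]             # top-k heads per token
    picked = np.take_along_axis(logits, chosen, axis=-1)
    gates = np.exp(picked - picked.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)               # renormalize over chosen heads
    return chosen, gates
```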

Syntactic Inductive Biases for Deep Learning Methods

no code implementations · 8 Jun 2022 · Yikang Shen

We propose two families of inductive biases, one for constituency structure and another one for dependency structure.

Deep Learning · Inductive Bias

Self-Instantiated Recurrent Units with Dynamic Soft Recursion

no code implementations · NeurIPS 2021 · Aston Zhang, Yi Tay, Yikang Shen, Alvin Chan Guo Wei, Shuai Zhang

On the other hand, the extent of the Self-IRU recursion is controlled by gates whose values are between 0 and 1 and may vary across the temporal dimension of sequences, enabling dynamic soft recursion depth at each time step.

Inductive Bias

Inducing Reusable Skills From Demonstrations with Option-Controller Network

no code implementations · 29 Sep 2021 · Siyuan Zhou, Yikang Shen, Yuchen Lu, Aaron Courville, Joshua B. Tenenbaum, Chuang Gan

With the isolation of information and the synchronous calling mechanism, we can impose a division of work between the controller and the options in an end-to-end training regime.

Learning Task Decomposition with Ordered Memory Policy Network

no code implementations · 19 Mar 2021 · Yuchen Lu, Yikang Shen, Siyuan Zhou, Aaron Courville, Joshua B. Tenenbaum, Chuang Gan

The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstructured demonstration.

Inductive Bias

StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling

2 code implementations · ACL 2021 · Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, Aaron Courville

There are two major classes of natural language grammar -- the dependency grammar that models one-to-one correspondences between words and the constituency grammar that models the assembly of one or several corresponded words.

Constituency Parsing · Language Modelling +2

Long Range Arena: A Benchmark for Efficient Transformers

5 code implementations · 8 Nov 2020 · Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler

In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming model quality superior or comparable to vanilla Transformer models.

16k · Benchmarking +3

Ordered Memory

1 code implementation · NeurIPS 2019 · Yikang Shen, Shawn Tan, Arian Hosseini, Zhouhan Lin, Alessandro Sordoni, Aaron Courville

Inspired by Ordered Neurons (Shen et al., 2018), we introduce a new attention-based mechanism and use its cumulative probability to control the writing and erasing operations of the memory.

ListOps
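To illustrate the cumulative-probability idea in the Ordered Memory snippet above, here is a minimal sketch of one write step: attention over memory slots is turned into a monotone gate via a cumulative sum, which softly erases slots on one side of the attended position and keeps the other. The gating direction and the form of the write are assumptions, not the paper's equations.

```python
import numpy as np

def cumulative_write(memory, new_value, scores):
    """memory: (slots, d); new_value: (d,); scores: (slots,) attention logits."""
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # attention distribution over slots
    g = np.cumsum(p)[:, None]             # cumulative probability, monotone in slot index
    return (1.0 - g) * memory + g * new_value[None, :]   # soft erase-and-write
```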

Investigating Biases in Textual Entailment Datasets

no code implementations · 23 Jun 2019 · Shawn Tan, Yikang Shen, Chin-wei Huang, Aaron Courville

The ability to understand logical relationships between sentences is an important task in language understanding.

BIG-bench Machine Learning · Natural Language Inference +2

BanditSum: Extractive Summarization as a Contextual Bandit

1 code implementation · EMNLP 2018 · Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, Jackie Chi Kit Cheung

In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels.

Extractive Summarization · Extractive Text Summarization +1

Generating Contradictory, Neutral, and Entailing Sentences

no code implementations · 7 Mar 2018 · Yikang Shen, Shawn Tan, Chin-wei Huang, Aaron Courville

Learning distributed sentence representations remains an interesting problem in the field of Natural Language Processing (NLP).

Diversity · Natural Language Inference +2

Neural Language Modeling by Jointly Learning Syntax and Lexicon

1 code implementation · ICLR 2018 · Yikang Shen, Zhouhan Lin, Chin-wei Huang, Aaron Courville

In this paper, we propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model.

Constituency Grammar Induction · Language Modelling

Self-organized Hierarchical Softmax

no code implementations · 26 Jul 2017 · Yikang Shen, Shawn Tan, Christopher Pal, Aaron Courville

We propose a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies.

Language Modelling · Sentence +1

Word Embedding based Correlation Model for Question/Answer Matching

no code implementations · 15 Nov 2015 · Yikang Shen, Wenge Rong, Nan Jiang, Baolin Peng, Jie Tang, Zhang Xiong

With the development of community-based question answering (Q&A) services, large-scale Q&A archives have accumulated and have become an important information and knowledge resource on the web.

Question Answering · Translation
