Search Results for author: Xiaoxia Wu

Found 26 papers, 16 papers with code

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

1 code implementation • 5 Mar 2024 • Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang Wang

To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach that enhances the capacity of LLMs to handle relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead.

Language Modelling
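
As a rough illustration of the multi-scale idea, the hedged sketch below rescales RoPE position indices with a different ratio per attention head before the rotary angles are computed; the ratio range and the linear head-to-ratio assignment are illustrative assumptions, not Ms-PoE's exact recipe.

```python
# Hypothetical sketch: compress position indices per attention head, then feed
# the rescaled (possibly fractional) positions into standard RoPE angles.
import numpy as np

def multiscale_positions(seq_len, num_heads, min_ratio=1.2, max_ratio=1.8):
    """Return a (num_heads, seq_len) array of per-head rescaled position indices."""
    ratios = np.linspace(min_ratio, max_ratio, num_heads)   # one compression ratio per head (assumption)
    positions = np.arange(seq_len, dtype=np.float64)
    return positions[None, :] / ratios[:, None]

def rotary_angles(positions, head_dim, base=10000.0):
    """Standard RoPE angles evaluated at the given positions."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return positions[..., None] * inv_freq                   # (..., seq_len, head_dim // 2)

angles = rotary_angles(multiscale_positions(seq_len=4096, num_heads=32), head_dim=128)
print(angles.shape)  # (32, 4096, 64): one set of rotary angles per head
```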

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

2 code implementations • 25 Jan 2024 • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference.

Quantization

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies

no code implementations • 6 Oct 2023 • Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov, Mohammed AlQuraishi, Gustaf Ahdritz, Christina Floristean, Cristina Negri, Rao Kotamarthi, Venkatram Vishwanath, Arvind Ramanathan, Sam Foreman, Kyle Hippe, Troy Arcomano, Romit Maulik, Maxim Zvyagin, Alexander Brace, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, J. Gregory Pauloski, Logan Ward, Valerie Hayot, Murali Emani, Zhen Xie, Diangen Lin, Maulik Shukla, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Prasanna Balaprakash, Gina Tourassi, John Gounley, Heidi Hanson, Thomas E Potok, Massimiliano Lupo Pasini, Kate Evans, Dan Lu, Dalton Lunga, Junqi Yin, Sajal Dash, Feiyi Wang, Mallikarjun Shankar, Isaac Lyngaas, Xiao Wang, Guojing Cong, Pei Zhang, Ming Fan, Siyan Liu, Adolfy Hoisie, Shinjae Yoo, Yihui Ren, William Tang, Kyle Felker, Alexey Svyatkovskiy, Hang Liu, Ashwin Aji, Angela Dalton, Michael Schulte, Karl Schulz, Yuntian Deng, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Anima Anandkumar, Rick Stevens

In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences.

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

2 code implementations • 25 Sep 2023 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He

Most existing multi-modal models are hindered by their inability to handle interleaved image-and-text inputs in multi-image, multi-round dialogues, and face substantial constraints in training resources and data accessibility, which limits their adaptability and scalability across varied interaction settings.

Language Modelling

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

1 code implementation • 19 Jul 2023 • Xiaoxia Wu, Zhewei Yao, Yuxiong He

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge.

Computational Efficiency, Quantization

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

2 code implementations • 15 Mar 2023 • Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He

Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).

Quantization
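
The "Low Rank Compensation" in the title refers to approximating the quantization error of a weight matrix with a low-rank factorization. A minimal numpy sketch of that idea, assuming a simple symmetric INT8 quantizer and an illustrative rank:

```python
# Sketch of low-rank compensation: quantize the weight, then approximate the
# quantization error with a truncated SVD so that W ≈ W_q + U @ V.
import numpy as np

def quantize_int8_symmetric(w):
    """Round-trip symmetric per-tensor INT8 quantization (returns dequantized values)."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127) * scale

def lorc_compensate(w, rank=8):
    """Return (w_q, u, v) such that w ≈ w_q + u @ v."""
    w_q = quantize_int8_symmetric(w)
    error = w - w_q                                   # quantization error to compensate
    u_full, s, vt = np.linalg.svd(error, full_matrices=False)
    u = u_full[:, :rank] * s[:rank]                   # absorb singular values into U
    v = vt[:rank, :]
    return w_q, u, v

w = np.random.default_rng(0).standard_normal((512, 512)) * 0.02
w_q, u, v = lorc_compensate(w, rank=8)
print(np.linalg.norm(w - w_q), np.linalg.norm(w - (w_q + u @ v)))  # compensated error norm is smaller
```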

Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

1 code implementation • 27 Jan 2023 • Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost.

Quantization
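
For context, here is a small sketch of symmetric per-row INT4 weight quantization, the kind of scheme such work studies; the exact granularity and rounding used in the paper may differ.

```python
# Symmetric per-row (per-output-channel) 4-bit weight quantization sketch.
import numpy as np

def quantize_int4_per_row(w):
    """Map each row of w to integers in [-7, 7] with a per-row scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric 4-bit range
    scale = np.where(scale == 0.0, 1.0, scale)           # guard against all-zero rows
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int4_per_row(w)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case rounding error is about scale / 2
```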

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

1 code implementation • 7 Dec 2022 • Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He

Compared to rapidly evolving model architectures, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize, due to the lack of a convenient framework that focuses on data efficiency capabilities.

Language Modelling, Large Language Model

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

1 code implementation • 17 Nov 2022 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He

Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP.
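
A hedged PyTorch-style sketch of the random, layerwise token dropping named in the title: each instrumented layer processes only a random subset of tokens while the dropped tokens bypass it unchanged. The keep ratio, schedule, and choice of exempt layers are simplified assumptions here.

```python
# Apply a transformer layer to a random token subset; dropped tokens skip it.
import torch

def layer_with_random_ltd(layer, hidden, keep_ratio=0.5):
    """hidden: (batch, seq_len, dim). Apply `layer` to a random subset of tokens only."""
    batch, seq_len, dim = hidden.shape
    num_keep = max(1, int(seq_len * keep_ratio))
    keep_idx = torch.rand(batch, seq_len, device=hidden.device).argsort(dim=1)[:, :num_keep]
    keep_idx, _ = keep_idx.sort(dim=1)                      # keep the original token order
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    processed = layer(hidden.gather(1, gather_idx))         # only the kept tokens go through the layer
    out = hidden.clone()                                    # dropped tokens bypass the layer
    out.scatter_(1, gather_idx, processed)
    return out

dummy_layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
x = torch.randn(2, 16, 64)
print(layer_with_random_ltd(dummy_layer, x).shape)  # torch.Size([2, 16, 64])
```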

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

1 code implementation • 4 Jun 2022 • Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.

Knowledge Distillation, Quantization
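
As a concrete illustration of ternary quantization, the sketch below uses a common threshold-based scheme (in the spirit of ternary weight networks); the paper pairs such ultra-low-bit quantization with knowledge distillation, which is omitted here.

```python
# Threshold-based ternary quantization: map weights to {-alpha, 0, +alpha}.
import numpy as np

def ternarize(w):
    """Quantize w to {-alpha, 0, +alpha} using the common 0.7 * mean|w| threshold."""
    delta = 0.7 * np.abs(w).mean()
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0   # scale of the surviving weights
    return alpha * np.sign(w) * mask, alpha

w = np.random.default_rng(2).standard_normal((4, 6)) * 0.1
w_ternary, alpha = ternarize(w)
print(np.unique(np.round(w_ternary, 6)))  # roughly {-alpha, 0, +alpha}
```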

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

3 code implementations • 4 Jun 2022 • Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.

Knowledge Distillation, Quantization

Adaptive Differentially Private Empirical Risk Minimization

no code implementations • 14 Oct 2021 • Xiaoxia Wu, Lingxiao Wang, Irina Cristali, Quanquan Gu, Rebecca Willett

We propose an adaptive (stochastic) gradient perturbation method for differentially private empirical risk minimization.
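
A hedged sketch of gradient perturbation for differentially private ERM: clip the stochastic gradient, add Gaussian noise, and take a step. The adaptive stepsize shown (an AdaGrad-Norm-style accumulator) is an illustrative stand-in for the paper's adaptive scheme, and privacy accounting is omitted entirely.

```python
# Clipped, noised gradient step with an adaptive (accumulator-based) stepsize.
import numpy as np

def dp_gradient_step(w, grad, b_sq, clip=1.0, noise_multiplier=1.0, eta=0.1, rng=None):
    rng = rng or np.random.default_rng()
    grad = grad * min(1.0, clip / (np.linalg.norm(grad) + 1e-12))        # bound sensitivity
    noisy = grad + rng.normal(0.0, noise_multiplier * clip, grad.shape)  # Gaussian mechanism
    b_sq = b_sq + float(np.dot(noisy, noisy))                            # adaptive accumulator (assumption)
    return w - (eta / np.sqrt(b_sq)) * noisy, b_sq

rng = np.random.default_rng(5)
w, b_sq = np.zeros(5), 1e-8
for _ in range(500):
    grad = 2.0 * (w - np.ones(5))          # toy quadratic objective with optimum at w = 1
    w, b_sq = dp_gradient_step(w, grad, b_sq, rng=rng)
print(w)                                   # noisy progress toward the optimum at w = 1
```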

AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

no code implementations • 17 Sep 2021 • Xiaoxia Wu, Yuege Xie, Simon Du, Rachel Ward

We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods.

regression
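
One plausible minimal reading of the AdaLoss idea is to accumulate loss values, rather than gradient norms, in the stepsize denominator; the exact accumulator and constants in the paper may differ from this sketch.

```python
# Loss-driven adaptive stepsize: the denominator grows with the observed losses.
import numpy as np

def adaloss_descent(x0, loss_fn, grad_fn, eta=0.5, b0=1e-2, steps=300):
    x, b_sq = np.asarray(x0, dtype=float), b0 ** 2
    for _ in range(steps):
        b_sq += loss_fn(x)                          # stepsize shrinks as losses accumulate
        x = x - (eta / np.sqrt(b_sq)) * grad_fn(x)
    return x

# toy least-squares problem: minimize 0.5 * ||A x - y||^2
rng = np.random.default_rng(3)
A, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
loss_fn = lambda x: 0.5 * np.sum((A @ x - y) ** 2)
grad_fn = lambda x: A.T @ (A @ x - y)
x_hat = adaloss_descent(np.zeros(5), loss_fn, grad_fn)
print(loss_fn(x_hat))  # approaches the least-squares residual without hand-tuning the stepsize
```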

LEAP: Learnable Pruning for Transformer-based Models

1 code implementation • 30 May 2021 • Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He

Moreover, to reduce hyperparameter tuning, a novel adaptive regularization coefficient is deployed to control the regularization penalty.

QQP

Hierarchical Learning for Generation with Long Source Sequences

no code implementations • 15 Apr 2021 • Tobias Rohde, Xiaoxia Wu, Yinhan Liu

One of the challenges for current sequence-to-sequence (seq2seq) models is processing long sequences, such as those in summarization and document-level machine translation tasks.

Document Level Machine Translation, Document Summarization, +6

When Do Curricula Work?

1 code implementation • ICLR 2021 • Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

Inspired by common use cases of curriculum learning in practice, we investigate the role of limited training time budget and noisy data in the success of curriculum learning.

Choosing the Sample with Lowest Loss makes SGD Robust

1 code implementation • 10 Jan 2020 • Vatsal Shah, Xiaoxia Wu, Sujay Sanghavi

The presence of outliers can potentially significantly skew the parameters of machine learning models trained via stochastic gradient descent (SGD).

regression
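
A minimal sketch of the selection rule suggested by the title (sometimes called MKL-SGD): draw a few samples each iteration, keep the one with the lowest current loss, and take the SGD step on it, so that high-loss outliers are rarely used. The batch size and learning rate below are illustrative.

```python
# SGD that steps only on the lowest-loss sample from each small draw.
import numpy as np

def min_loss_sgd_step(w, X, y, rng, k=4, lr=0.05):
    idx = rng.choice(len(X), size=k, replace=False)
    losses = (X[idx] @ w - y[idx]) ** 2                 # per-sample squared errors
    i = idx[np.argmin(losses)]                          # sample with the lowest loss
    return w - lr * 2.0 * (X[i] @ w - y[i]) * X[i]      # SGD step on that sample only

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
y[:20] += 50.0                                          # corrupt 10% of the labels with outliers
w = np.zeros(3)
for _ in range(3000):
    w = min_loss_sgd_step(w, X, y, rng)
print(w)                                                # approximately [1, -2, 0.5] despite the outliers
```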

Linear Convergence of Adaptive Stochastic Gradient Descent

no code implementations • 28 Aug 2019 • Yuege Xie, Xiaoxia Wu, Rachel Ward

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality.
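
For reference, the AdaGrad-Norm update analyzed here can be written as follows (deterministic form; the stochastic version replaces the gradient with a stochastic estimate), together with the PL inequality assumed for the linear rate:

```latex
\[
  b_{t+1}^{2} = b_{t}^{2} + \lVert \nabla f(x_t) \rVert^{2},
  \qquad
  x_{t+1} = x_t - \frac{\eta}{b_{t+1}}\, \nabla f(x_t),
\]
% PL inequality with constant \mu > 0:
\[
  \tfrac{1}{2}\, \lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \bigl( f(x) - f^{\ast} \bigr)
  \quad \text{for all } x .
\]
```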

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

no code implementations • 19 Feb 2019 • Xiaoxia Wu, Simon S. Du, Rachel Ward

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks.

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

1 code implementation • 5 Jun 2018 • Rachel Ward, Xiaoxia Wu, Leon Bottou

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule.

Stochastic Optimization

WNGrad: Learn the Learning Rate in Gradient Descent

no code implementations • 7 Mar 2018 • Xiaoxia Wu, Rachel Ward, Léon Bottou

Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice.
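
Up to constants, the WNGrad learning-rate update takes a form along these lines (a hedged reconstruction rather than a verbatim statement of the paper): the reciprocal stepsize b_t grows only in proportion to the gradients actually observed, so the schedule is learned from the optimization trajectory itself.

```latex
\[
  b_{t+1} = b_{t} + \frac{\lVert \nabla f(x_t) \rVert^{2}}{b_{t}},
  \qquad
  x_{t+1} = x_t - \frac{1}{b_{t+1}}\, \nabla f(x_t).
\]
```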
