Search Results for author: Zhuoming Chen

Found 13 papers, 10 papers with code

GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?

1 code implementation • 7 Feb 2025 • Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen

Our GSM-Infinite benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.

8k Information Retrieval +1
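
The benchmark's selling point is that context length and reasoning depth are generated programmatically rather than hand-written. As a purely illustrative sketch (a hypothetical generator, not the paper's actual procedure), a chained arithmetic problem with controllable depth could be produced like this:

```python
import random

def make_chain_problem(depth, seed=0):
    """Toy generator for an arithmetic-chain question of controllable reasoning depth.

    Hypothetical illustration of a 'scalable and controllable' reasoning testbed;
    GSM-Infinite's real generation procedure is considerably richer than this.
    """
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    steps = [f"x0 = {value}"]
    for i in range(1, depth + 1):
        delta = rng.randint(1, 9)
        op = rng.choice(["+", "-"])
        value = value + delta if op == "+" else value - delta
        steps.append(f"x{i} = x{i-1} {op} {delta}")
    question = "; ".join(steps) + f". What is x{depth}?"
    return question, value
```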

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

no code implementations • 21 Jan 2025 • Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia

This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding.

Sirius: Contextual Sparsity with Correction for Efficient LLMs

1 code implementation • 5 Sep 2024 • Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen

However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, CS significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks.

Math
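
For context, contextual sparsity (CS) activates only an input-dependent subset of neurons per token. A minimal sketch of that general idea follows; this is not Sirius's correction mechanism or its sparsity predictor, just the kind of baseline CS behavior the paper evaluates:

```python
import torch

def contextual_sparse_ffn(x, w_up, w_down, keep_ratio=0.1):
    """Toy contextual-sparsity FFN: keep only the neurons with the largest
    activations for *this* input, zero out the rest.

    Illustrative sketch of generic contextual sparsity; the paper's actual
    method (and its correction mechanism) differs.
    """
    h = torch.relu(x @ w_up)                   # (d_ff,) neuron activations for this token
    k = max(1, int(keep_ratio * h.numel()))
    topk = torch.topk(h.abs(), k).indices      # context-dependent neuron subset
    mask = torch.zeros_like(h)
    mask[topk] = 1.0
    return (h * mask) @ w_down                 # only the selected neurons contribute
```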

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

1 code implementation • 20 Aug 2024 • Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen

MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy SD more effectively for high throughput inference.

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

1 code implementation • 22 Jul 2024 • Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Anima Anandkumar

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences.
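
The title points at intermediate memory as the bottleneck for long-sequence training. A common way to bound intermediate memory is to process the sequence dimension in chunks so that large per-position tensors (e.g. the full logits matrix) are never materialized at once. The sketch below, assuming a standard linear LM head, illustrates that general saving rather than MsT's exact partitioning:

```python
import torch

def chunked_lm_head_loss(hidden, targets, lm_head, num_chunks=4):
    """Cross-entropy over the LM head computed in sequence chunks.

    Splitting the sequence dimension avoids materializing the full
    (seq_len x vocab_size) logits tensor at once -- the kind of
    intermediate-memory saving MsT targets (sketch, not MsT itself).
    hidden: (seq_len, d_model), targets: (seq_len,), lm_head: nn.Linear.
    """
    losses = []
    for h, t in zip(hidden.chunk(num_chunks, dim=0), targets.chunk(num_chunks, dim=0)):
        logits = lm_head(h)                          # (chunk_len, vocab_size) only
        losses.append(torch.nn.functional.cross_entropy(logits, t, reduction="sum"))
    return torch.stack(losses).sum() / targets.numel()
```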

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

1 code implementation • 4 Jun 2024 • Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.

Text Generation
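
For readers unfamiliar with speculative decoding, the baseline idea is a draft-and-verify loop: a cheap draft model proposes several tokens, and one forward pass of the target model verifies them. The sketch below shows only that greedy baseline with hypothetical callables; SpecExec itself builds a large speculation tree per iteration rather than a single chain:

```python
def speculative_step(prefix, draft_next, target_argmax, k=8):
    """One greedy draft-and-verify step of generic speculative decoding.

    prefix        : non-empty list of token ids generated so far
    draft_next    : hypothetical callable, next token id from a small draft model
    target_argmax : hypothetical callable, greedy next-token id at every position
                    of the input, from a single target-model forward pass
    Returns the tokens to append this iteration.
    """
    drafted, tokens = [], list(prefix)
    for _ in range(k):                              # draft k tokens autoregressively (cheap)
        nxt = draft_next(tokens)
        drafted.append(nxt)
        tokens.append(nxt)
    preds = target_argmax(tokens)                   # one expensive target pass scores all positions
    accepted = []
    for i, tok in enumerate(drafted):
        target_tok = preds[len(prefix) + i - 1]     # target's prediction for this position
        if tok == target_tok:
            accepted.append(tok)                    # draft token verified
        else:
            accepted.append(target_tok)             # mismatch: take the target's token and stop
            break
    return accepted
```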

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

1 code implementation • 18 Apr 2024 • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen

However, key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length.
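
A back-of-envelope estimate makes the linear growth concrete. The model configuration below is an assumption chosen for illustration (a 7B-class model), not a figure from the paper:

```python
# Assumed 7B-class config: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes), one sequence.
layers, heads, head_dim, bytes_per_elem = 32, 32, 128, 2
seq_len = 128_000
kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem   # 2 = keys + values
print(f"{kv_bytes / 2**30:.1f} GiB")                                  # ~62.5 GiB, linear in seq_len
```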

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

1 code implementation • 19 Feb 2024 • Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding.

GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

no code implementations • 19 Aug 2023 • Jingji Chen, Zhuoming Chen, Xuehai Qian

Communication is a key bottleneck for distributed graph neural network (GNN) training.

Graph Neural Network

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

3 code implementations • 16 May 2023 • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance.

Decoder Language Modeling +2

Quark: A Gradient-Free Quantum Learning Framework for Classification Tasks

no code implementations • 2 Oct 2022 • Zhihao Zhang, Zhuoming Chen, Heyang Huang, Zhihao Jia

To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework that optimizes quantum ML models using quantum optimization.

Edge Detection

Quantized Training of Gradient Boosting Decision Trees

2 code implementations • 20 Jul 2022 • Yu Shi, Guolin Ke, Zhuoming Chen, Shuxin Zheng, Tie-Yan Liu

Recent years have witnessed significant success in Gradient Boosting Decision Trees (GBDT) for a wide range of machine learning applications.

Quantization
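
The title and tag indicate the contribution is quantized GBDT training. As a hedged sketch of the general low-bitwidth-gradient idea (integer histogram accumulation instead of float), not necessarily the paper's exact scheme or bit widths:

```python
import numpy as np

def quantize_gradients(grad, n_bits=3, rng=None):
    """Quantize per-sample gradients to low-bit integers with stochastic rounding.

    Sketch of the general idea behind quantized GBDT training: low-bit gradients
    allow histogram bins to be accumulated in integer arithmetic. Assumed scheme
    for illustration only; the paper's method may differ in detail.
    """
    rng = rng or np.random.default_rng(0)
    levels = 2 ** (n_bits - 1) - 1          # symmetric integer range, e.g. [-3, 3] for 3 bits
    scale = np.abs(grad).max() / levels
    scaled = grad / scale
    low = np.floor(scaled)
    prob_up = scaled - low                  # stochastic rounding keeps E[q * scale] = grad
    q = (low + (rng.random(grad.shape) < prob_up)).astype(np.int8)
    return q, scale
```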
