Search Results for author: Lianmin Zheng

Found 23 papers, 18 papers with code

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

1 code implementation • 7 Mar 2024 • Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica

To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences.

Chatbot

33,630

Paper
Code

Efficiently Programming Large Language Models using SGLang

1 code implementation • 12 Dec 2023 • Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng

SGLang is designed for the efficient programming of LLMs and incorporates primitives for common LLM programming patterns.

2,241

Paper
Code

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

1 code implementation • 8 Nov 2023 • Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica

Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.

171

Paper
Code

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

1 code implementation • 6 Nov 2023 • Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters.

1,444

Paper
Code

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

1 code implementation • 21 Sep 2023 • Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications.

Chatbot Instruction Following

33,630

Paper
Code

Efficient Memory Management for Large Language Model Serving with PagedAttention

4 code implementations • 12 Sep 2023 • Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage.

Language Modelling Large Language Model +1

17,911

Paper
Code

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

1 code implementation • 24 Jun 2023 • Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen

Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.

271

Paper
Code

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

5 code implementations • NeurIPS 2023 • Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences.

Ranked #3 on Long-Context Understanding on Ada-LEval (TSort)

Chatbot Language Modelling +2

33,630

Paper
Code

On Optimal Caching and Model Multiplexing for Large Model Inference

1 code implementation • 3 Jun 2023 • Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, Jiantao Jiao

Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings.

Paper
Code

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

1 code implementation • 13 Mar 2023 • Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.

Language Modelling Large Language Model

8,987

Paper
Code

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

2 code implementations • 22 Feb 2023 • Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device.

2,980

Paper
Code

On Optimizing the Communication of Model Parallelism

no code implementations • 10 Nov 2022 • Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters.

Paper
Add Code

TensorIR: An Abstraction for Automatic Tensorized Program Optimization

2 code implementations • 9 Jul 2022 • Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, Tianqi Chen

Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives.

BIG-bench Machine Learning

11,152

Paper
Code

NumS: Scalable Array Programming for the Cloud

1 code implementation • 28 Jun 2022 • Melih Elibol, Vinamra Benara, Samyu Yagati, Lianmin Zheng, Alvin Cheung, Michael I. Jordan, Ion Stoica

LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system.

regression Scheduling

131

Paper
Code

GACT: Activation Compressed Training for Generic Network Architectures

1 code implementation • 22 Jun 2022 • Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, Michael Mahoney, Alvin Cheung

Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint.

Paper
Code

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

1 code implementation • 28 Jan 2022 • Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica

Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations.

2,980

Paper
Code

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

4 code implementations • 29 Apr 2021 • Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael W. Mahoney, Joseph E. Gonzalez

On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss.

Quantization

192

Paper
Code

Ansor: Generating High-Performance Tensor Programs for Deep Learning

no code implementations • 11 Jun 2020 • Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, Ion Stoica

Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches.

Vocal Bursts Intensity Prediction

Paper
Add Code

A Hardware-Software Blueprint for Flexible Deep Learning Specialization

no code implementations • 11 Jul 2018 • Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility.

Code Generation Style Transfer

Paper
Add Code

Learning to Optimize Tensor Programs

no code implementations • NeurIPS 2018 • Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems.

Paper
Add Code

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

1 code implementation • 12 Feb 2018 • Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs.

Paper
Code

Size-to-depth: A New Perspective for Single Image Depth Estimation

no code implementations • 13 Jan 2018 • Yiran Wu, Sihao Ying, Lianmin Zheng

To overcome these problems, we propose a new perspective for single monocular image depth estimation problem: size to depth.

Depth Estimation

Paper
Add Code

MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence

3 code implementations • 2 Dec 2017 • Lianmin Zheng, Jiacheng Yang, Han Cai, Wei-Nan Zhang, Jun Wang, Yong Yu

Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents.

Multi-agent Reinforcement Learning reinforcement-learning +1

1,667

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.