Search Results for author: Ying Sheng

Found 22 papers, 17 papers with code

Locality-aware Fair Scheduling in LLM Serving

no code implementations 24 Jan 2025 Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica

We also introduce a novel algorithm, Double Deficit LPM (D$^2$LPM), which extends DLPM (Deficit Longest Prefix Match) to the distributed setting and finds a balance point among fairness, locality, and load balancing.

Fairness Language Modeling +3
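For intuition, here is a minimal Python sketch of a deficit-based longest-prefix-match scheduler in the spirit of DLPM: each client accrues service credit per round, and among clients with credit the scheduler picks the queued request with the best cache locality. The class, the quantum mechanics, and the tie-breaking rule are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of a deficit-based longest-prefix-match scheduler.
from collections import defaultdict

def shared_prefix_len(tokens, cached_prefixes):
    """Length of the longest cached prefix matching this request's tokens."""
    best = 0
    for prefix in cached_prefixes:
        n = 0
        for a, b in zip(tokens, prefix):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

class DeficitLPMScheduler:
    def __init__(self, quantum=100):
        self.quantum = quantum              # tokens of service credit per round
        self.deficit = defaultdict(float)   # per-client credit (fairness)
        self.queues = defaultdict(list)     # client -> pending token lists
        self.cached_prefixes = []           # prefixes resident in the KV cache

    def submit(self, client, tokens):
        self.queues[client].append(tokens)

    def next_request(self):
        eligible = [c for c, q in self.queues.items() if q and self.deficit[c] >= 0]
        while not eligible:
            if not any(self.queues.values()):
                return None                 # nothing queued anywhere
            for c in self.queues:           # grant one round of credit
                self.deficit[c] += self.quantum
            eligible = [c for c, q in self.queues.items() if q and self.deficit[c] >= 0]
        # Among eligible clients, prefer the request with the longest shared
        # prefix (locality); deficits keep any one client from monopolizing.
        client, req = max(
            ((c, r) for c in eligible for r in self.queues[c]),
            key=lambda cr: shared_prefix_len(cr[1], self.cached_prefixes),
        )
        self.queues[client].remove(req)
        self.deficit[client] -= len(req)    # charge the client for its tokens
        self.cached_prefixes.append(req)    # request's tokens are now cached
        return client, req
```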

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

no code implementations 18 Nov 2024 Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica

MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16 GB).

Computational Efficiency

AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution

no code implementations 5 Nov 2024 Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis

With more advanced natural language understanding and reasoning capabilities, large language model (LLM)-powered agents are increasingly developed in simulated environments to perform complex tasks, interact with other agents, and exhibit emergent behaviors relevant to social science and gaming.

Language Modeling Language Modelling +3

Inference-Friendly Models With MixAttention

1 code implementation 23 Sep 2024 Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley

The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs.
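As a back-of-envelope illustration of that growth, the sketch below computes KV cache size from layer count, KV-head count, head dimension, and tokens processed; the example configuration is an assumption for illustration, not a figure from the paper.

```python
# KV cache footprint: 2 (K and V) x layers x kv_heads x head_dim x tokens
# x bytes per element. All numbers below are illustrative.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * dtype_bytes

# Example: a 32-layer model with 32 KV heads of dimension 128, serving a
# 4096-token context in fp16, needs ~2 GiB of KV cache per sequence.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=4096)
print(f"{size / 2**30:.1f} GiB per sequence")
```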

Post-Training Sparse Attention with Double Sparsity

1 code implementation 11 Aug 2024 Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng

Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens.
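A minimal sketch of that idea, assuming a simple magnitude-based calibration: rank key channels offline on calibration data, then use only those channels at runtime to cheaply score token importance before running exact attention on the top tokens. The scoring rule and function names are illustrative, not the paper's implementation.

```python
import numpy as np

def calibrate_channels(key_samples, keep_ratio=0.125):
    """Offline: pick the key channels with the largest average magnitude.
    key_samples: (num_tokens, head_dim) keys gathered from calibration runs."""
    importance = np.abs(key_samples).mean(axis=0)
    k = max(1, int(keep_ratio * key_samples.shape[1]))
    return np.argsort(importance)[-k:]          # indices of the top channels

def approx_token_scores(query, keys, channels):
    """Online: approximate attention logits using only calibrated channels."""
    return keys[:, channels] @ query[channels]

# Usage: score all cached tokens cheaply, keep the top 64, then run exact
# attention over just those tokens.
rng = np.random.default_rng(0)
channels = calibrate_channels(rng.standard_normal((1024, 128)))
q, K = rng.standard_normal(128), rng.standard_normal((1024, 128))
top_tokens = np.argsort(approx_token_scores(q, K, channels))[-64:]
```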

DafnyBench: A Benchmark for Formal Software Verification

1 code implementation 12 Jun 2024 Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification.

Fairness in Serving Large Language Models

2 code implementations 31 Dec 2023 Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica

High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading.

Fairness Scheduling

Clover: Closed-Loop Verifiable Code Generation

2 code implementations 26 Oct 2023 Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett

In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code.

Code Generation
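Schematically, the filter is a pairwise consistency gate over code, formal specification, and docstring: a candidate survives only if all three directions agree. The sketch below leaves the three checks as placeholder callables (e.g., a verifier invocation and LLM-based equivalence tests); it is a schematic, not Clover's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    code: str
    spec: str       # formal annotation, e.g. a Dafny pre/postcondition
    docstring: str  # natural-language description

def clover_filter(candidates: List[Candidate],
                  code_vs_spec: Callable[[str, str], bool],
                  spec_vs_doc: Callable[[str, str], bool],
                  doc_vs_code: Callable[[str, str], bool]) -> List[Candidate]:
    """Keep only candidates where every pairwise consistency check passes."""
    return [c for c in candidates
            if code_vs_spec(c.code, c.spec)
            and spec_vs_doc(c.spec, c.docstring)
            and doc_vs_code(c.docstring, c.code)]
```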

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

5 code implementations 21 Sep 2023 Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications.

Chatbot Diversity +1

Efficient Memory Management for Large Language Model Serving with PagedAttention

6 code implementations 12 Sep 2023 Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage.

Language Modeling Language Modelling +2
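To make the paging idea concrete, here is a toy sketch of the bookkeeping: fixed-size physical blocks, a per-sequence block table, and reference counts that let sequences share prefix blocks. It mirrors the concept only and is not vLLM's implementation; in particular, copy-on-write for shared partially filled blocks is omitted.

```python
class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.refcount = [0] * num_blocks
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # last block is full: allocate
            block = self.free.pop()
            self.refcount[block] += 1
            table.append(block)
        self.lengths[seq_id] = n + 1

    def fork(self, parent, child):
        """Share all of parent's blocks with child (e.g., a common prompt)."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refcount[block] += 1

    def release(self, seq_id):
        for block in self.tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:     # fully unreferenced: reclaim
                self.free.append(block)
        self.lengths.pop(seq_id, None)
```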

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

2 code implementations 24 Jun 2023 Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen

Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
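A minimal sketch of such a policy, assuming accumulated attention mass as the heavy-hitter score and an even split of the cache budget between recent tokens and heavy hitters; both choices are illustrative stand-ins, not the paper's exact policy.

```python
import numpy as np

def h2o_keep_set(attn_history, cache_budget, recent_frac=0.5):
    """Indices of tokens to keep: the most recent ones plus the heavy hitters.
    attn_history: (num_steps, num_tokens) attention weights over past tokens."""
    num_tokens = attn_history.shape[1]
    if num_tokens <= cache_budget:
        return np.arange(num_tokens)          # everything fits; evict nothing
    num_recent = int(cache_budget * recent_frac)
    recent = np.arange(num_tokens - num_recent, num_tokens)
    scores = attn_history.sum(axis=0)         # accumulated attention per token
    scores[recent] = -np.inf                  # recent tokens are already kept
    heavy = np.argsort(scores)[-(cache_budget - num_recent):]
    return np.sort(np.concatenate([heavy, recent]))
```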

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

11 code implementations NeurIPS 2023 Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences.

Chatbot Language Modelling +2

On Optimal Caching and Model Multiplexing for Large Model Inference

1 code implementation 3 Jun 2023 Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, Jiantao Jiao

Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings.

model
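As a toy illustration of the two levers, the sketch below serves from a response cache when possible and otherwise routes each query between a small and a large model by expected cost. The cost model, the success estimator, and the routing rule are stand-ins, not the paper's optimal algorithm.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    run: Callable[[str], str]
    cost: float           # cost per query, in arbitrary units

def serve(query, cache, small: Model, large: Model,
          p_small_ok: Callable[[str], float]):
    if query in cache:
        return cache[query]                   # cache hit: zero model cost
    # Routing to the small model risks paying for a large-model retry.
    expected_small = small.cost + (1 - p_small_ok(query)) * large.cost
    out = small.run(query) if expected_small < large.cost else large.run(query)
    cache[query] = out                        # cache the response for reuse
    return out
```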

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

1 code implementation 13 Mar 2023 Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.

Language Modelling Large Language Model

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

2 code implementations 22 Feb 2023 Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device.

Deep Learning

Simplified DOM Trees for Transferable Attribute Extraction from the Web

2 code implementations 7 Jan 2021 Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, Sandeep Tata

There has been a steady need to precisely extract structured knowledge from the web (i.e., HTML documents).

Attribute Attribute Extraction +2

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

no code implementations 21 Oct 2020 Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata

By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page.

Subspace Embedding and Linear Regression with Orlicz Norm

no code implementations ICML 2018 Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, Ruiqi Zhong

An Orlicz norm is parameterized by a non-negative convex function $G:\mathbb{R}_+\rightarrow\mathbb{R}_+$ with $G(0)=0$: the Orlicz norm of a vector $x\in\mathbb{R}^n$ is defined as $\|x\|_G=\inf\left\{\alpha>0 \,\middle|\, \sum_{i=1}^n G(|x_i|/\alpha)\leq 1\right\}$.

regression
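As a quick sanity check on the definition (not part of the abstract), taking $G(t)=t^p$ with $p\geq 1$ recovers the usual $\ell_p$ norm:

```latex
% Illustrative worked example: G(t) = t^p reduces the Orlicz norm to \ell_p.
\[
\|x\|_G
= \inf\Bigl\{\alpha > 0 \;\Bigm|\; \textstyle\sum_{i=1}^n (|x_i|/\alpha)^p \le 1\Bigr\}
= \inf\bigl\{\alpha > 0 \;\big|\; \|x\|_p \le \alpha\bigr\}
= \|x\|_p.
\]
```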
