Search Results for author: Xiaoxuan Liu

Found 12 papers, 4 papers with code

MoE-Lightning: High-Throughput MoE Inference on Memory-Constrained GPUs

no code implementations • 18 Nov 2024 • Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica

MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).

Computational Efficiency

UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images

1 code implementation • 21 Aug 2024 • Enze Zhu, Zhan Chen, Dingkai Wang, Hanru Shi, Xiaoxuan Liu, Lei Wang

Semantic segmentation of high-resolution remote sensing images is vital in downstream applications such as land-cover mapping, urban planning and disaster assessment. Existing Transformer-based methods suffer from a trade-off between accuracy and efficiency, while the recently proposed Mamba is renowned for its efficiency.

Mamba • Segmentation +2

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

no code implementations • 20 Jun 2024 • Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens) -- and hence the associated speculative execution cost -- based on a new metric called goodput, which characterizes both the current observed load of the entire system and the speculation accuracy.
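
To make the goodput idea concrete, here is a minimal sketch of how a goodput-style metric could drive the speculation-length choice. The per-token draft cost, the single verification pass, and the i.i.d. acceptance-rate model are simplifying assumptions for illustration; this is not SmartSpec's implementation.

```python
# Minimal sketch of a goodput-driven choice of speculation length.
# The cost model below is illustrative, not SmartSpec's.

def expected_tokens(k: int, alpha: float) -> float:
    """Expected tokens produced per verification step when each of the k
    drafted tokens is accepted i.i.d. with probability alpha; the target
    model always contributes at least one token."""
    return sum(alpha ** i for i in range(k + 1))

def goodput(k: int, alpha: float, draft_cost: float, verify_cost: float) -> float:
    """Tokens generated per unit time at speculation length k."""
    return expected_tokens(k, alpha) / (k * draft_cost + verify_cost)

def best_speculation_length(alpha: float, draft_cost: float,
                            verify_cost: float, max_k: int = 8) -> int:
    """k = 0 disables speculation; higher k spends more draft compute per
    target-model pass in the hope of more accepted tokens."""
    return max(range(max_k + 1),
               key=lambda k: goodput(k, alpha, draft_cost, verify_cost))

# Under heavier load, the batched verification pass takes longer, which
# pushes the optimal k down -- consistent with goodput reflecting system load.
print(best_speculation_length(alpha=0.8, draft_cost=0.1, verify_cost=1.0))
```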

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

1 code implementation • 22 Apr 2024 • Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

Based on this analysis, we introduce Mélange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service.

Language Modeling • Language Modelling +1
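
As a toy illustration of the optimization Mélange performs, the sketch below brute-forces the minimal-cost mix of GPU types that meets a throughput target. The GPU prices and per-GPU throughputs are invented placeholders, and the paper's actual formulation of the allocation problem is more sophisticated than this exhaustive search.

```python
from itertools import product

# Invented placeholder numbers; a real deployment would profile these
# per workload (request sizes and rates matter).
GPUS = {
    # name: (dollars/hour, requests/second sustained for this workload)
    "A10G": (1.0, 4.0),
    "A100": (4.0, 20.0),
    "H100": (8.0, 45.0),
}

def cheapest_allocation(demand_rps: float, max_per_type: int = 8):
    """Exhaustively search GPU counts meeting demand at minimal cost."""
    best, best_cost = None, float("inf")
    names = list(GPUS)
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        tput = sum(c * GPUS[n][1] for c, n in zip(counts, names))
        cost = sum(c * GPUS[n][0] for c, n in zip(counts, names))
        if tput >= demand_rps and cost < best_cost:
            best, best_cost = dict(zip(names, counts)), cost
    return best, best_cost

alloc, cost = cheapest_allocation(demand_rps=52.0)
print(alloc, f"${cost:.2f}/hr")
```

Which GPU type wins depends on the service's request sizes and rates; that workload-dependence is exactly the heterogeneity the framework exploits.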

Learned Best-Effort LLM Serving

no code implementations • 15 Jan 2024 • Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

Many applications must provide low-latency LLM service to users or risk unacceptable user experience.

Deep Reinforcement Learning

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

no code implementations • 11 Oct 2023 • Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer

Large Language Models (LLMs) have demonstrated remarkable impact across a wide spectrum of natural language processing tasks.

parameter-efficient fine-tuning • Quantization

Online Speculative Decoding

1 code implementation • 11 Oct 2023 • Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

Adapting to the query distribution mitigates the shift between the draft model's training distribution and the query distribution, enabling the draft model to more accurately predict the target model's outputs.

Knowledge Distillation
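
A minimal sketch of the online-distillation idea behind Online Speculative Decoding: each update pulls the draft model's next-token distribution toward the target model's distribution on recent queries. It assumes a Hugging-Face-style model whose forward pass returns `.logits`, and the temperature-softened KL loss is a standard distillation choice used here for illustration, not necessarily the paper's exact objective.

```python
import torch.nn.functional as F

def distill_step(draft_model, target_logits, input_ids, optimizer, T=2.0):
    """One online-distillation update on a batch of live queries.
    target_logits come from running the (frozen) target model on the
    same input_ids; only the draft model is updated."""
    draft_logits = draft_model(input_ids).logits
    loss = F.kl_div(
        F.log_softmax(draft_logits / T, dim=-1),  # student log-probs
        F.softmax(target_logits / T, dim=-1),     # teacher probs
        reduction="batchmean",
    ) * (T * T)  # standard temperature scaling of the distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```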

An Evaluation of Memory Optimization Methods for Training Neural Networks

no code implementations • 26 Mar 2023 • Xiaoxuan Liu, Siddharth Jha, Alvin Cheung

To address this challenge, this paper summarizes the scenarios in which memory optimization methods (MOMs) prove advantageous for model training.

Quantization

GACT: Activation Compressed Training for Generic Network Architectures

1 code implementation • 22 Jun 2022 • Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, Michael Mahoney, Alvin Cheung

Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint.
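
A self-contained sketch of the core ACT mechanism for a single linear layer: the forward pass saves an 8-bit quantized copy of the activation, and the backward pass dequantizes it to compute gradients. GACT itself handles generic architectures with adaptive per-layer compression; the `CompressedLinear` function below only illustrates the principle.

```python
import torch

class CompressedLinear(torch.autograd.Function):
    """Linear op that saves its input activation as int8 for backward."""

    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()
        # Compress: keep an 8-bit copy of x instead of the fp32 tensor.
        scale = x.abs().amax().clamp(min=1e-8) / 127.0
        x_q = (x / scale).round().clamp(-127, 127).to(torch.int8)
        ctx.save_for_backward(x_q, weight, scale)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_q, weight, scale = ctx.saved_tensors
        x_hat = x_q.to(grad_out.dtype) * scale  # decompress for the weight grad
        grad_x = grad_out @ weight
        grad_w = grad_out.transpose(-2, -1) @ x_hat
        return grad_x, grad_w

# Usage: behaves like a linear layer, but the saved activation is 4x smaller.
x = torch.randn(4, 16, requires_grad=True)
w = torch.randn(32, 16, requires_grad=True)
CompressedLinear.apply(x, w).sum().backward()
```

The memory saving comes at the price of a small quantization error in the weight gradient, which ACT-style methods analyze and control.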

Long-run User Value Optimization in Recommender Systems through Content Creation Modeling

no code implementations • 25 Apr 2022 • Akos Lada, Xiaoxuan Liu, Jens Rischbieth, Yi Wang, Yuwen Zhang

Content recommender systems are generally adept at maximizing immediate user satisfaction, but optimizing for long-run user value requires more statistically sophisticated solutions than simple off-the-shelf recommender algorithms.

BIG-bench Machine Learning • Recommendation Systems
