no code implementations • 4 Apr 2025 • Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, Ion Stoica
However, existing solutions are agnostic to the performance characteristics of the different MoE model components (i.e., attention and experts) and do not fully utilize each GPU's compute capability.
no code implementations • 5 Jul 2024 • Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, Danyang Zhuo
The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress can be lost because training has to restart from a checkpoint.
no code implementations • 29 Jun 2024 • Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills
As the parameter size of large language models (LLMs) continues to expand, the large memory footprint and high communication bandwidth they require have become significant bottlenecks for LLM training and inference.
1 code implementation • 29 May 2024 • Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo
In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding.
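As a rough illustration of this opportunity, here is a minimal asyncio sketch of the general idea of overlapping tool execution with decoding; the token stream, tool, and trigger condition are all illustrative stand-ins, not the paper's system:

import asyncio

async def decode_tokens():
    # Stand-in for an LLM decode loop that streams a tool call token by token.
    for tok in ['search(', '"weather', ' in', ' NYC"', ')', ' The', ' answer', ' is', ' ...']:
        await asyncio.sleep(0.05)  # one decoding step
        yield tok

async def run_tool(call_text):
    # Stand-in for an external tool (e.g., a web search) with non-trivial latency.
    await asyncio.sleep(0.3)
    return f"<result of {call_text}>"

async def serve():
    # Launch the tool as soon as the call is parsed mid-stream, so the tool
    # runs concurrently with the remaining decoding instead of strictly after it.
    decoded, tool_task = "", None
    async for tok in decode_tokens():
        decoded += tok
        if tool_task is None and decoded.endswith(')'):
            tool_task = asyncio.create_task(run_tool(decoded))
    tool_result = await tool_task if tool_task else None
    return decoded, tool_result

# asyncio.run(serve())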
no code implementations • 19 Feb 2024 • Shuowei Jin, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Matthew Lentz, Z. Morley Mao, Atul Prakash, Feng Qian, Danyang Zhuo
Large language models (LLMs) have seen significant adoption for natural language tasks, owing their success to massive parameter counts (e.g., 70B+); however, LLM inference incurs significant computation and memory costs.
no code implementations • 17 Jan 2024 • Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo
In this paper, we investigate the intersection of large generative AI models and cloud-native computing architectures.
no code implementations • 13 Jan 2024 • Yicheng Jin, Yongji Wu, WenJun Hu, Bruce M. Maggs, Xiao Zhang, Danyang Zhuo
Vector databases have emerged as key enablers for bridging intelligent applications with unstructured data, providing generic search and management support for embedding vectors extracted from the raw unstructured data.
2 code implementations • 31 Dec 2023 • Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests, from short chat conversations to long document reading.
1 code implementation • 28 Oct 2023 • Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy
Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster.
no code implementations • 14 Aug 2023 • Lequn Chen, Weixin Deng, Anirudh Canumalla, Yu Xin, Danyang Zhuo, Matthai Philipose, Arvind Krishnamurthy
However, existing model serving systems cannot achieve adequate batch sizes while meeting latency objectives, because they eagerly dispatch requests to accelerators to minimize accelerator idle time.
no code implementations • 6 Jun 2023 • Xiang Chen, Zhao Song, Baocheng Sun, Junze Yin, Danyang Zhuo
Many machine learning algorithms require large amounts of labeled data to deliver state-of-the-art results.
no code implementations • 21 Dec 2022 • Lianke Qin, Aravind Reddy, Zhao Song, Zhaozhuo Xu, Danyang Zhuo
In this paper, we propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data structure for fast pairwise summation estimation.
no code implementations • 28 Nov 2022 • Jiehao Liang, Somdeb Sarkhel, Zhao Song, Chenbo Yin, Junze Yin, Danyang Zhuo
We propose a new algorithm, \textsc{FastKmeans++}, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
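For context, the classical k-means++ seeding baseline that such results improve on takes $O(ndk)$ time; a minimal NumPy sketch of that baseline (illustrative only, not the paper's \textsc{FastKmeans++} procedure):

import numpy as np

def kmeans_pp_seeding(X, k, seed=0):
    # Classical k-means++ seeding: O(ndk) time via D^2 sampling.
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    centers = [X[rng.integers(n)]]                    # first center uniformly at random
    d2 = np.sum((X - centers[0]) ** 2, axis=1)        # squared distance to nearest center
    for _ in range(1, k):
        idx = rng.choice(n, p=d2 / d2.sum())          # sample proportional to D^2
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.stack(centers)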
no code implementations • 9 Aug 2022 • Yichuan Deng, Hang Hu, Zhao Song, Omri Weinstein, Danyang Zhuo
The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI).
no code implementations • 8 Aug 2022 • Jiehao Liang, Zhao Song, Zhaozhuo Xu, Junze Yin, Danyang Zhuo
In this work, we focus on the dynamic maintenance of KDE data structures with robustness to adversarial queries.
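For reference, the query that such data structures answer is the kernel density estimate (notation ours):
\[
\mathrm{KDE}_X(q) \;=\; \frac{1}{|X|} \sum_{x \in X} k(x, q),
\]
and dynamic maintenance means supporting insertions into and deletions from $X$ while continuing to answer these queries accurately, even when the query points $q$ are chosen adversarially based on earlier answers.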
no code implementations • 5 Aug 2022 • Hang Hu, Zhao Song, Runzhou Tao, Zhaozhuo Xu, Junze Yin, Danyang Zhuo
Online bipartite matching is a fundamental problem in online algorithms.
no code implementations • 10 May 2022 • Yongji Wu, Matthew Lentz, Danyang Zhuo, Yao Lu
With the ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network.
1 code implementation • 28 Jan 2022 • Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations.
no code implementations • 4 Dec 2021 • Shunhua Jiang, Yunze Man, Zhao Song, Zheng Yu, Danyang Zhuo
Given a kernel matrix of $n$ graphs, using sketching to solve the kernel regression can reduce the running time to $o(n^3)$.
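As a generic illustration of how sketching sidesteps the cubic cost (this uses random Fourier features for a Gaussian kernel, not the paper's graph-kernel algorithm; all names and parameters are ours):

import numpy as np

def rff_kernel_ridge(X, y, lam=1e-2, m=256, gamma=1.0, seed=0):
    # Kernel ridge regression with an m-dimensional random Fourier feature sketch
    # of the Gaussian kernel exp(-gamma * ||x - y||^2): the n x n kernel matrix is
    # never formed, so the solve costs O(n m^2 + m^3) rather than O(n^3).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    Z = np.sqrt(2.0 / m) * np.cos(X @ W + b)                      # n x m feature map
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)   # m x m system
    predict = lambda Xq: (np.sqrt(2.0 / m) * np.cos(Xq @ W + b)) @ theta
    return theta, predict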
no code implementations • 29 Sep 2021 • Baihe Huang, Zhao Song, Runzhou Tao, Ruizhe Zhang, Danyang Zhuo
Inspired by the InstaHide challenge [Huang, Song, Li and Arora'20], [Chen, Song and Zhuo'20] recently provided a mathematical formulation of the InstaHide attack problem under a Gaussian image distribution.
no code implementations • 29 Sep 2021 • Zhao Song, Baocheng Sun, Danyang Zhuo
In this paper, we present the first deep active learning algorithm with a provable sample complexity.
1 code implementation • 16 Feb 2021 • Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
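The key idea referenced here is, roughly, that causal masking makes the layer-$\ell$ activation of token $t$ depend only on layer-$(\ell-1)$ activations of tokens up to $t$:
\[
h_t^{(\ell)} = f^{(\ell)}\bigl(h_1^{(\ell-1)}, \ldots, h_t^{(\ell-1)}\bigr),
\]
so a later pipeline stage can start working on a prefix of the sequence before the earlier stage has finished the remaining tokens (notation ours; see the paper for the exact formulation).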
no code implementations • ICLR 2021 • Sitan Chen, Xiaoxiao Li, Zhao Song, Danyang Zhuo
In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning.
no code implementations • 1 Jan 2021 • Shunhua Jiang, Yunze Man, Zhao Song, Danyang Zhuo
Theoretically, we present two techniques to speed up GNTK training while preserving the generalization error: (1) we use a novel matrix decoupling method to reduce matrix dimensions during kernel solving.
no code implementations • 24 Nov 2020 • Baihe Huang, Zhao Song, Runzhou Tao, Junze Yin, Ruizhe Zhang, Danyang Zhuo
On the current InstaHide challenge setup, where each InstaHide image is a mixture of two private images, we present a new algorithm to recover all the private images with a provable guarantee and optimal sample complexity.
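For context, an InstaHide encryption in this challenge setting mixes the two private images (together with public ones) using random coefficients and then applies a random pixel-wise sign flip; schematically (notation ours):
\[
\tilde{x} \;=\; \sigma \odot \Bigl( \lambda_1 x_1 + \lambda_2 x_2 + \sum_j \mu_j p_j \Bigr),
\qquad \sigma \in \{\pm 1\}^d,
\]
where $x_1, x_2$ are private images, the $p_j$ are public images, the coefficients are nonnegative and sum to one, and a fresh sign mask $\sigma$ is drawn per encryption; the recovery task is to invert many such mixtures back to the private images.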
no code implementations • 23 Nov 2020 • Sitan Chen, Xiaoxiao Li, Zhao Song, Danyang Zhuo
In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning.
no code implementations • 11 Jun 2020 • Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, Ion Stoica
Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches.
1 code implementation • 13 Feb 2020 • Siyuan Zhuang, Zhuohan Li, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving.