1 code implementation • 21 Oct 2024 • Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen
MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy.
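To make the sampling idea concrete, below is a minimal NumPy sketch of LSH-based attention approximation: keys are hashed once with sign-random-projection (SimHash), and each query attends only to keys that land in its bucket. This is an illustrative simplification under assumed names and a SimHash scheme, not MagicPIG's implementation or its sampling estimator.

```python
# A minimal, illustrative sketch of LSH-based attention approximation in NumPy.
# This is NOT MagicPIG's implementation; the SimHash scheme, bucket-matching rule,
# and all function names here are simplifying assumptions for exposition.
import numpy as np

def simhash(x, planes):
    """Sign-based LSH: project onto random hyperplanes and pack the sign bits into a code."""
    bits = (x @ planes.T) > 0                      # (n, num_bits) boolean
    return (bits * (1 << np.arange(planes.shape[0]))).sum(axis=-1)

def lsh_attention(q, K, V, num_bits=8, seed=0):
    """Approximate softmax(q K^T) V by attending only to keys that collide with q."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, q.shape[-1]))
    key_codes = simhash(K, planes)                 # hash every cached key once
    q_code = simhash(q[None, :], planes)[0]
    idx = np.nonzero(key_codes == q_code)[0]       # candidate keys in q's bucket
    if idx.size == 0:                              # fall back to full attention
        idx = np.arange(K.shape[0])
    scores = q @ K[idx].T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

# Usage: one query against a long key/value cache.
d, n = 64, 4096
rng = np.random.default_rng(1)
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = lsh_attention(q, K, V)
```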
1 code implementation • 7 Oct 2024 • Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection.
no code implementations • 24 Jun 2024 • Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia
GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies.
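For a rough sense of what a micro-batch schedule looks like, the toy sketch below generates a GPipe-style forward schedule over a linear chain of stages. GraphPipe's GPP strategies schedule general graphs of stages and also optimize the partitioning itself, so this linear-chain version is only an illustrative simplification.

```python
# A toy sketch of pipeline-style micro-batch scheduling over a chain of stages.
# GraphPipe schedules general *graphs* of stages; this linear-chain, GPipe-like
# forward schedule is only an illustrative simplification.
def pipeline_schedule(num_stages, num_microbatches):
    """Return, per time step, which (stage, microbatch) pairs run concurrently."""
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        step = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_microbatches]
        steps.append(step)
    return steps

# With 3 stages and 4 micro-batches, the pipeline fills, runs fully overlapped, then drains.
for t, step in enumerate(pipeline_schedule(3, 4)):
    print(f"t={t}: " + ", ".join(f"stage {s} runs mb {m}" for s, m in step))
```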
1 code implementation • 4 Jun 2024 • Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.
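The sketch below shows the generic draft-and-verify loop that speculative execution builds on: a small draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. The greedy acceptance rule and the model interfaces are illustrative assumptions, not SpecExec's actual algorithm.

```python
# A schematic draft-and-verify loop illustrating the general speculative-decoding idea.
# The greedy acceptance rule and the draft/target callables below are illustrative
# assumptions, not SpecExec's algorithm.
def speculative_step(prefix, draft_model, target_model, k=8):
    """Draft k tokens cheaply, then keep the longest prefix the target model agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # cheap autoregressive drafting
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # The target model verifies the drafted positions (in practice, in one batched pass;
    # here a loop for clarity).
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_model(ctx) == t:           # greedy verification
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_model(ctx))  # replace the first mismatch with the target's token
            break
    return accepted                          # at least one target-quality token per target iteration
```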
no code implementations • 3 Jun 2024 • Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak
This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters.
1 code implementation • 9 May 2024 • Mengdi Wu, Xinhao Cheng, Oded Padon, Zhihao Jia
We introduce Mirage, the first multi-level superoptimizer for tensor programs.
1 code implementation • 29 Feb 2024 • Xupeng Miao, Gabriele Oliaro, Xinhao Cheng, Mengdi Wu, Colin Unger, Zhihao Jia
This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests.
1 code implementation • 19 Feb 2024 • Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding.
no code implementations • 25 Jan 2024 • Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, LanTing LI, Phitchaya Mangpo Phothilimthana, Zhihao Jia
Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model.
1 code implementation • 13 Jan 2024 • Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia
Experiments show that QST can reduce the total memory footprint by up to 2.3$\times$ and speed up the finetuning process by up to 3$\times$ while achieving competent performance compared with the state-of-the-art.
no code implementations • 23 Dec 2023 • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data.
1 code implementation • 27 Nov 2023 • Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, Zhihao Jia
This paper aims to reduce the monetary cost of serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer access to spare GPUs at a much lower price than regular instances but may be preempted by the cloud at any time.
no code implementations • 30 Aug 2023 • Zhihao Jia, Bing Wang, Changhao Chen
In this work, we propose the Drone-NeRF framework for efficient reconstruction of unbounded large-scale scenes captured via drone oblique photography using Neural Radiance Fields (NeRF).
no code implementations • 17 Jul 2023 • Zikun Li, Jinjun Peng, Yixuan Mei, Sina Lin, Yi Wu, Oded Padon, Zhihao Jia
Applying reinforcement learning (RL) to quantum circuit optimization raises two main challenges: the large and varying action space and the non-uniform state representation.
3 code implementations • 16 May 2023 • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia
Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance.
no code implementations • 2 Oct 2022 • Zhihao Zhang, Zhuoming Chen, Heyang Huang, Zhihao Jia
To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework that optimizes quantum ML models using quantum optimization.
no code implementations • 2 Aug 2022 • Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shizhi Tang, Lei Xie, Kezhao Huang, Zhihao Jia
Boosting the runtime performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world tasks.
2 code implementations • 21 Jun 2022 • Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang, Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, Lichao Sun, Jundong Li, George H. Chen, Zhihao Jia, Philip S. Yu
To bridge this gap, we present BOND, to the best of our knowledge the first comprehensive benchmark for unsupervised outlier node detection on static attributed graphs, with the following highlights.
no code implementations • 4 May 2022 • Ferdinand Kossmann, Zhihao Jia, Alex Aiken
The Mixture of Experts architecture allows for outrageously large neural networks by scaling model parameter size independently from computational demand (FLOPs).
no code implementations • 26 Apr 2022 • John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu
DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales.
1 code implementation • 1 Nov 2021 • Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
Given the fast-evolving nature of the DL ecosystem, this manual approach often slows down continuous innovation across different layers: it prevents hardware vendors from quickly deploying their cutting-edge libraries, forces DL framework developers to repeatedly adjust hand-coded rules to accommodate new library versions, and leaves machine learning practitioners waiting for the integration of new technologies while often encountering unsatisfactory performance.
2 code implementations • 26 Oct 2021 • Yue Zhao, George H. Chen, Zhihao Jia
Outlier detection (OD) is a key learning task for finding rare and deviant data samples, with many time-critical applications such as fraud detection and intrusion detection.
1 code implementation • ICLR 2022 • Zhihao Zhang, Zhihao Jia
In addition, we design GradSign, an accurate and simple approximation of $\Psi$ using the gradients of a network evaluated at a random initialization state.
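For intuition, the PyTorch sketch below scores a randomly initialized network by how consistently per-sample gradients agree in sign, which is the core idea behind such a training-free proxy; the exact statistic $\Psi$ and the normalization used in GradSign differ, so treat this only as an illustrative approximation.

```python
# A minimal sketch of a gradient-sign agreement proxy in the spirit of GradSign, in PyTorch.
# The paper's exact statistic and normalization differ; this shows only the core idea of
# comparing sample-wise gradient signs at a random initialization.
import torch

def gradsign_proxy(net, inputs, targets, loss_fn=torch.nn.functional.cross_entropy):
    """Score a randomly initialized net by how consistently per-sample gradients agree in sign."""
    signs = []
    for x, y in zip(inputs, targets):
        net.zero_grad()
        loss = loss_fn(net(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in net.parameters() if p.grad is not None])
        signs.append(torch.sign(g))
    signs = torch.stack(signs)                 # (num_samples, num_params)
    # Per-parameter agreement: |mean of signs| is 1 when all samples agree, near 0 when they cancel.
    return signs.mean(dim=0).abs().sum().item()
```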
1 code implementation • 24 May 2021 • John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, Guoqing Harry Xu
Computation separation makes it possible to construct a deep, bounded-asynchronous pipeline where graph and tensor parallel tasks can fully overlap, effectively hiding the network latency incurred by Lambdas.
no code implementations • 12 Apr 2021 • Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie Amy Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Ajit Mathews, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers.
1 code implementation • 2 Nov 2020 • Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, Song Han
To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.
no code implementations • 9 Jun 2019 • Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, Alex Aiken
Graph Neural Networks (GNNs) are based on repeated aggregations of information across nodes' neighbors in a graph.
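In its simplest form, that repeated aggregation step reads like the NumPy sketch below: each node combines its own features with the mean of its neighbors' features and applies a nonlinearity. The mean aggregator and the two weight matrices are GraphSAGE-style illustrative choices, not the specific formulation used in this paper.

```python
# A minimal NumPy sketch of the neighbor-aggregation step that GNN layers repeat.
# The mean aggregator and the two weight matrices are illustrative (GraphSAGE-style) choices.
import numpy as np

def gnn_layer(H, adj, W_self, W_neigh):
    """One round of aggregation: combine each node's features with the mean of its neighbors'."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)   # avoid division by zero for isolated nodes
    neigh_mean = (adj @ H) / deg                       # mean over neighbors
    return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)  # ReLU

# Usage: 4 nodes, 3-dim features, one layer of aggregation.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
H = rng.standard_normal((4, 3))
H1 = gnn_layer(H, adj, rng.standard_normal((3, 3)), rng.standard_normal((3, 3)))
```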
no code implementations • 14 Jul 2018 • Zhihao Jia, Matei Zaharia, Alex Aiken
We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.
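As a rough illustration of guided randomized search over a strategy space, the toy sketch below mutates per-operator parallelism degrees and keeps changes that a cost model prefers. FlexFlow's actual SOAP space and its simulator-guided search are considerably richer; the strategy encoding, mutation rule, and cost model here are all assumptions made for exposition.

```python
# A toy sketch of guided randomized search over a parallelization-strategy space,
# in the spirit of (but not equivalent to) FlexFlow's SOAP search. The strategy
# encoding, mutation rule, and cost model are illustrative assumptions.
import random

def random_strategy(ops, degrees=(1, 2, 4, 8)):
    """Assign each operator a parallelism degree."""
    return {op: random.choice(degrees) for op in ops}

def mutate(strategy, degrees=(1, 2, 4, 8)):
    s = dict(strategy)
    s[random.choice(list(s))] = random.choice(degrees)
    return s

def search(ops, cost_model, iters=1000):
    """Greedy randomized search: keep a mutation whenever the estimated cost improves."""
    best = random_strategy(ops)
    best_cost = cost_model(best)
    for _ in range(iters):
        cand = mutate(best)
        cand_cost = cost_model(cand)
        if cand_cost < best_cost:
            best, best_cost = cand, cand_cost
    return best, best_cost

# Usage with a stand-in cost model that simply prefers moderate parallelism per operator.
ops = ["conv1", "conv2", "fc"]
best, cost = search(ops, lambda s: sum(abs(d - 4) for d in s.values()))
```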
no code implementations • ICML 2018 • Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken
The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks.
no code implementations • ICLR 2018 • Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken
DeePa is a deep learning framework that explores parallelism in all parallelizable dimensions to accelerate the training process of convolutional neural networks.