Search Results for author: Zhihao Jia

Found 31 papers, 15 papers with code

MagicPIG: LSH Sampling for Efficient LLM Generation

1 code implementation · 21 Oct 2024 · Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, Beidi Chen

MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy.
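
The core mechanism can be illustrated with a small CPU-side sketch (the single-table SimHash scheme and all names below are illustrative assumptions; the paper uses multiple hash tables and importance-weighted sampling):

```python
import numpy as np

# Hash each key once with random hyperplanes; at decode time, hash the query
# and attend only over keys that land in the same bucket.
def build_hash_table(keys, planes):
    codes = (keys @ planes > 0).astype(np.uint8)      # (n, bits) SimHash codes
    buckets = {}
    for i, code in enumerate(map(tuple, codes)):
        buckets.setdefault(code, []).append(i)
    return buckets

def sampled_attention(query, keys, values, buckets, planes):
    code = tuple((query @ planes > 0).astype(np.uint8))
    idx = buckets.get(code, list(range(len(keys))))   # fall back to all keys
    k, v = keys[idx], values[idx]
    scores = k @ query / np.sqrt(query.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v

rng = np.random.default_rng(0)
d, n, bits = 64, 4096, 8
planes = rng.standard_normal((d, bits))
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
table = build_hash_table(K, planes)
out = sampled_attention(rng.standard_normal(d), K, V, table, planes)
```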

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

1 code implementation · 7 Oct 2024 · Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia

Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection.
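
A minimal sketch of the position-persistent idea, under assumed shapes and a hypothetical split between a selection layer and reuse layers (the paper's actual layer placement and re-selection policy differ):

```python
import torch

# Select the top-k token positions once, then let subsequent layers reuse the
# same positions instead of re-scanning the full KV cache.
def select_positions(q, K, k):
    scores = K @ q                       # (n,) relevance of each cached token
    return scores.topk(k).indices        # positions chosen once...

def sparse_attention(q, K, V, positions):
    ks, vs = K[positions], V[positions]  # ...and reused by later layers
    w = torch.softmax(ks @ q / K.shape[-1] ** 0.5, dim=0)
    return w @ vs

n, d, topk = 4096, 64, 128
q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
positions = select_positions(q, K, topk)    # done in a selection layer
out = sparse_attention(q, K, V, positions)  # reused across later layers
```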

Position

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

no code implementations · 24 Jun 2024 · Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies.
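
A toy sketch of the scheduling idea, with a hypothetical four-stage graph (GraphPipe's planner discovers such partitions and schedules automatically; this ASAP-style schedule is only illustrative):

```python
# Unlike sequential pipeline parallelism, independent branches of the DNN
# become stages whose micro-batches can overlap in time.
stage_graph = {
    "embed":   [],                   # stage -> list of predecessor stages
    "branchA": ["embed"],            # branchA and branchB are independent,
    "branchB": ["embed"],            # so they process micro-batches in parallel
    "head":    ["branchA", "branchB"],
}

def schedule(stages, num_microbatches):
    # A micro-batch enters a stage once every predecessor stage finished it;
    # each stage processes its own micro-batches serially.
    finish = {}
    for mb in range(num_microbatches):
        for s in ["embed", "branchA", "branchB", "head"]:
            ready = max([finish.get((p, mb), 0) for p in stages[s]], default=0)
            start = max(ready, finish.get((s, mb - 1), 0))
            finish[(s, mb)] = start + 1
            print(f"t={start}: stage {s} runs micro-batch {mb}")

schedule(stage_graph, num_microbatches=2)
```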

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

1 code implementation · 4 Jun 2024 · Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families.
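
The basic draft-then-verify loop that SpecExec generalizes can be sketched as follows (the model interface and greedy acceptance rule are assumptions for illustration; SpecExec builds large speculation trees rather than a single chain):

```python
import torch

VOCAB = 100

def dummy_lm(seq):                      # stand-in for the draft/target models:
    g = torch.Generator().manual_seed(int(seq.sum()))
    return torch.randn(len(seq), VOCAB, generator=g)   # (len, vocab) logits

@torch.no_grad()
def speculative_step(draft, target, prefix, k=8):
    proposal = prefix.clone()
    for _ in range(k):                  # 1) cheap draft proposes k tokens
        proposal = torch.cat([proposal, draft(proposal)[-1].argmax()[None]])
    logits = target(proposal)           # 2) target scores all of them at once
    accepted = prefix.clone()           # 3) keep the longest agreeing prefix
    for i in range(len(prefix), len(proposal)):
        t = logits[i - 1].argmax()
        accepted = torch.cat([accepted, t[None]])
        if t != proposal[i]:            # first disagreement: stop, keeping
            break                       #    the target's own token instead
    return accepted

print(speculative_step(dummy_lm, dummy_lm, torch.tensor([1, 2, 3]), k=4))
```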

Text Generation

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

no code implementations · 3 Jun 2024 · Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters.
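
The max-flow view can be illustrated with a toy graph (the node layout and capacities in tokens/s below are invented for illustration; Helix's real formulation also models model placement and request scheduling):

```python
import networkx as nx

# GPUs are nodes; edge capacities encode per-GPU compute and network limits.
# The max flow from source to sink bounds achievable serving throughput.
G = nx.DiGraph()
G.add_edge("source", "A100", capacity=300)     # fast GPU
G.add_edge("source", "T4", capacity=80)        # slow GPU
G.add_edge("A100", "A100_out", capacity=300)
G.add_edge("T4", "T4_out", capacity=80)
G.add_edge("A100_out", "sink", capacity=250)   # network link limits
G.add_edge("T4_out", "sink", capacity=100)

throughput, flow = nx.maximum_flow(G, "source", "sink")
print("max serving throughput:", throughput)   # 330
```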

Language Modelling · Large Language Model +1

A Multi-Level Superoptimizer for Tensor Programs

1 code implementation · 9 May 2024 · Mengdi Wu, Xinhao Cheng, Oded Padon, Zhihao Jia

We introduce Mirage, the first multi-level superoptimizer for tensor programs.

Navigate

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

1 code implementation · 19 Feb 2024 · Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding.

Accelerating Retrieval-Augmented Language Model Serving with Speculation

no code implementations · 25 Jan 2024 · Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model.
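
A minimal retrieve-then-read sketch of the RaLM pattern (the embedding and generation stand-ins are placeholders; the paper's contribution, speculating on retrieval during decoding, is not shown here):

```python
import numpy as np

docs = ["LLM serving systems batch requests.",
        "Speculative retrieval hides knowledge-base latency."]

def embed(text):                      # toy embedding: normalized byte counts
    v = np.zeros(256)
    for b in text.encode():
        v[b] += 1
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(query, k=1):             # non-parametric knowledge-base lookup
    sims = [embed(d) @ embed(query) for d in docs]
    return [docs[i] for i in np.argsort(sims)[-k:]]

def generate(prompt):                 # parametric LM conditioned on retrieval
    context = " ".join(retrieve(prompt))
    return f"[answer conditioned on: {context!r}]"

print(generate("How can serving RaLMs be made faster?"))
```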

Language Modelling · Retrieval

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

1 code implementation · 13 Jan 2024 · Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

Experiments show that QST can reduce the total memory footprint by up to 2.3× and speed up the finetuning process by up to 3× while achieving performance comparable to the state of the art.
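
The side-tuning structure behind these savings can be sketched as follows (the sizes and the full-precision backbone are simplifying assumptions; QST additionally quantizes the backbone to 4 bits and downsamples its hidden states):

```python
import torch
import torch.nn as nn

# Frozen backbone: no gradients, no optimizer state, inference-only forward.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad_(False)

# Small trainable side network and task head: only these are updated.
side = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 512))
head = nn.Linear(512, 10)
opt = torch.optim.AdamW([*side.parameters(), *head.parameters()], lr=1e-4)

x, y = torch.randn(8, 512), torch.randint(0, 10, (8,))
with torch.no_grad():                 # backbone activations need no autograd
    h = backbone(x)
logits = head(h + side(h))            # gradients flow through the side net only
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
opt.step()
```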

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

no code implementations · 23 Dec 2023 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data.

Language Modelling · Large Language Model +1

SpotServe: Serving Generative Large Language Models on Preemptible Instances

1 code implementation · 27 Nov 2023 · Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, Zhihao Jia

This paper aims to reduce the monetary cost of serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer access to spare GPUs at a much lower price than regular instances but may be preempted by the cloud at any time.

Graph Matching

Drone-NeRF: Efficient NeRF Based 3D Scene Reconstruction for Large-Scale Drone Survey

no code implementations · 30 Aug 2023 · Zhihao Jia, Bing Wang, Changhao Chen

In this work, we propose the Drone-NeRF framework to enhance the efficient reconstruction of unbounded large-scale scenes suited for drone oblique photography using Neural Radiance Fields (NeRF).

3D Scene Reconstruction · Neural Rendering

Quarl: A Learning-Based Quantum Circuit Optimizer

no code implementations · 17 Jul 2023 · Zikun Li, Jinjun Peng, Yixuan Mei, Sina Lin, Yi Wu, Oded Padon, Zhihao Jia

Applying reinforcement learning (RL) to quantum circuit optimization raises two main challenges: the large and varying action space and the non-uniform state representation.

Reinforcement Learning (RL)

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

3 code implementations · 16 May 2023 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance.

Decoder · Language Modelling +1

Quark: A Gradient-Free Quantum Learning Framework for Classification Tasks

no code implementations · 2 Oct 2022 · Zhihao Zhang, Zhuoming Chen, Heyang Huang, Zhihao Jia

To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework that optimizes quantum ML models using quantum optimization.

Edge Detection

OLLIE: Derivation-based Tensor Program Optimizer

no code implementations · 2 Aug 2022 · Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shizhi Tang, Lei Xie, Kezhao Huang, Zhihao Jia

Boosting the runtime performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world tasks.

BOND: Benchmarking Unsupervised Outlier Node Detection on Static Attributed Graphs

2 code implementations · 21 Jun 2022 · Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang, Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, Lichao Sun, Jundong Li, George H. Chen, Zhihao Jia, Philip S. Yu

To bridge this gap, we present, to the best of our knowledge, the first comprehensive benchmark for unsupervised outlier node detection on static attributed graphs, called BOND.

Anomaly Detection · Benchmarking +2

Optimizing Mixture of Experts using Dynamic Recompilations

no code implementations · 4 May 2022 · Ferdinand Kossmann, Zhihao Jia, Alex Aiken

The Mixture of Experts architecture allows for outrageously large neural networks by scaling model parameter size independently from computational demand (FLOPs).
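
A minimal sketch of the top-k gated MoE layer this scaling relies on (sizes and the dense per-token loop are illustrative; production MoE layers use batched expert dispatch):

```python
import torch
import torch.nn as nn

class MoE(nn.Module):
    """Parameter count grows with num_experts; per-token FLOPs stay at k experts."""
    def __init__(self, d=64, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(num_experts))
        self.gate = nn.Linear(d, num_experts)
        self.k = k

    def forward(self, x):                       # x: (batch, d)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k experts run per token
            for b in range(x.shape[0]):
                e = int(idx[b, slot])
                out[b] += weights[b, slot] * self.experts[e](x[b])
        return out

print(MoE()(torch.randn(4, 64)).shape)          # torch.Size([4, 64])
```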

Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

no code implementations · 26 Apr 2022 · John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu

DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales.

Collage: Seamless Integration of Deep Learning Backends with Automatic Placement

1 code implementation · 1 Nov 2021 · Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia

Given the fast-evolving nature of the DL ecosystem, this manual approach often slows continuous innovation across layers: it prevents hardware vendors from quickly deploying their cutting-edge libraries, forces DL framework developers to repeatedly adjust hand-coded rules for new library versions, and leaves machine learning practitioners waiting for new technologies to be integrated, often with unsatisfactory performance.

Deep Learning

TOD: GPU-accelerated Outlier Detection via Tensor Operations

2 code implementations · 26 Oct 2021 · Yue Zhao, George H. Chen, Zhihao Jia

Outlier detection (OD) is a key learning task for finding rare and deviant data samples, with many time-critical applications such as fraud detection and intrusion detection.

Fraud Detection · Intrusion Detection +2

GradSign: Model Performance Inference with Theoretical Insights

1 code implementation · ICLR 2022 · Zhihao Zhang, Zhihao Jia

In addition, we design GradSign, an accurate and simple approximation of Ψ using the gradients of a network evaluated at a random initialization state.
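
A hedged sketch of the gradient-sign idea (the statistic below is a simplification in the spirit of the paper's Ψ-based analysis, not its precise formula): score a randomly initialized network by how consistently per-sample gradient signs agree across training samples.

```python
import torch
import torch.nn as nn

def gradsign_score(net, samples, loss_fn):
    sign_sum, count = None, 0
    for x, y in samples:
        net.zero_grad()
        loss_fn(net(x), y).backward()           # per-sample gradient at init
        g = torch.cat([p.grad.flatten() for p in net.parameters()])
        sign_sum = g.sign() if sign_sum is None else sign_sum + g.sign()
        count += 1
    # Coordinates where all samples agree in sign contribute 1; disagreement less.
    return sign_sum.abs().sum().item() / (count * sign_sum.numel())

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(8)]
print(gradsign_score(net, data, nn.functional.mse_loss))
```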

Neural Architecture Search

Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads

1 code implementation · 24 May 2021 · John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, Guoqing Harry Xu

Computation separation makes it possible to construct a deep, bounded-asynchronous pipeline where graph and tensor parallel tasks can fully overlap, effectively hiding the network latency incurred by Lambdas.

Graph Neural Network

IOS: Inter-Operator Scheduler for CNN Acceleration

1 code implementation · 2 Nov 2020 · Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, Song Han

To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.
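
The inter-operator parallelism that IOS schedules automatically can be sketched by hand with CUDA streams (the two-branch block below is a made-up example, not the paper's scheduler):

```python
import torch

# Two independent convolution branches of an Inception-style block can run on
# separate CUDA streams instead of serially; falls back to CPU if no GPU.
dev = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 64, 56, 56, device=dev)
branch1 = torch.nn.Conv2d(64, 96, 1).to(dev)
branch2 = torch.nn.Conv2d(64, 32, 3, padding=1).to(dev)

if dev == "cuda":
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(s1):
        y1 = branch1(x)                  # independent ops overlap on the GPU
    with torch.cuda.stream(s2):
        y2 = branch2(x)
    torch.cuda.synchronize()
else:
    y1, y2 = branch1(x), branch2(x)      # sequential fallback
out = torch.cat([y1, y2], dim=1)         # (1, 128, 56, 56)
```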

Redundancy-Free Computation Graphs for Graph Neural Networks

no code implementations · 9 Jun 2019 · Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, Alex Aiken

Graph Neural Networks (GNNs) are based on repeated aggregations of information across nodes' neighbors in a graph.
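
The redundancy this work removes can be seen in a tiny invented example: when nodes share neighbors, a plain GNN recomputes the same partial aggregate, whereas a redundancy-free computation graph computes it once and reuses it.

```python
import numpy as np

h = np.random.randn(4, 8)                       # node features
neighbors = {2: [0, 1], 3: [0, 1, 2]}           # nodes 2 and 3 share {0, 1}

# Naive aggregation: sum neighbor features independently per node.
naive = {v: sum(h[u] for u in ns) for v, ns in neighbors.items()}

# Redundancy-free: compute the shared sum h[0] + h[1] once and reuse it.
shared = h[0] + h[1]
reuse = {2: shared, 3: shared + h[2]}

assert all(np.allclose(naive[v], reuse[v]) for v in naive)
```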

Beyond Data and Model Parallelism for Deep Neural Networks

no code implementations · 14 Jul 2018 · Zhihao Jia, Matei Zaharia, Alex Aiken

We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.
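
A toy sketch of randomized search over a stand-in for the SOAP space (the cost model and plain random sampling are placeholders; FlexFlow evaluates candidates with an execution simulator and uses a guided MCMC-style search):

```python
import random

ops = ["conv1", "conv2", "fc"]
dims = ["sample", "operator", "attribute", "parameter"]   # the S-O-A-P dimensions

def random_strategy():
    # Assign each operator one parallelization dimension at random.
    return {op: random.choice(dims) for op in ops}

def simulated_cost(strategy):
    # Stand-in for FlexFlow's execution simulator: a made-up penalty per choice.
    penalty = {"sample": 1.0, "operator": 1.3, "attribute": 1.5, "parameter": 1.1}
    return sum(penalty[d] for d in strategy.values()) + random.random() * 0.1

best = min((random_strategy() for _ in range(200)), key=simulated_cost)
print("best strategy found:", best)
```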

Distributed, Parallel, and Cluster Computing

Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks

no code implementations · ICML 2018 · Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks.

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

no code implementations · 14 Feb 2018 · Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks.

Exploring the Hidden Dimension in Accelerating Convolutional Neural Networks

no code implementations · ICLR 2018 · Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

DeePa is a deep learning framework that explores parallelism in all parallelizable dimensions to accelerate the training process of convolutional neural networks.

Deep Learning
