Search Results for author: Junchen Jiang

Found 19 papers, 4 papers with code

RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation

no code implementations13 Dec 2024 Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay.

RAG · Scheduling
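
Below is a minimal, illustrative sketch of the quality/delay knob the abstract describes: retrieving more external chunks gives the LLM more grounding but lengthens the prompt and hence the response delay. The `retriever.top_k` and `llm.generate` calls are hypothetical placeholders, not RAGServe's actual interfaces.

```python
# Hypothetical retriever/LLM interfaces, for illustration only.
def answer_with_rag(query, retriever, llm, num_chunks=4):
    # More chunks -> potentially better grounding, but a longer prompt to
    # prefill, which increases response delay.
    chunks = retriever.top_k(query, k=num_chunks)
    prompt = "\n\n".join(chunks) + "\n\nQuestion: " + query
    return llm.generate(prompt)
```

A quality-aware system would adapt `num_chunks` (and similar configuration knobs) per query rather than fixing them globally.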

LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts

no code implementations20 Nov 2024 Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang

Although large language models (LLMs) show impressive performance on complex tasks, they still struggle with long-context understanding and incur high computational costs.

DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

no code implementations5 Nov 2024 YuHan Liu, YuYang Huang, Jiayi Yao, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse

Large Language Models (LLMs) are increasingly employed in complex workflows, where different LLMs and their fine-tuned variants collaborate on a shared task.

Computational Efficiency

SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing

no code implementations8 Oct 2024 Siddhant Ray, Xi Jiang, Jack Luo, Nick Feamster, Junchen Jiang

Instead, SwiftQueue uses a custom Transformer, which is well-studied for its expressiveness on sequential patterns, to predict the next packet's latency based on the latencies of recently received ACKs.
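
As a rough illustration of the idea in the snippet above (not SwiftQueue's actual architecture), a small Transformer encoder can regress the next packet's latency from a window of recent ACK latencies; the layer sizes and window length below are assumptions.

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    def __init__(self, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)          # one scalar latency per step
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)           # predicted next latency

    def forward(self, ack_latencies):               # shape: (batch, window, 1)
        h = self.encoder(self.embed(ack_latencies))
        return self.head(h[:, -1])                  # read off the last position

model = LatencyPredictor()
recent_acks = torch.rand(1, 16, 1)                  # 16 recent ACK latencies
next_latency = model(recent_acks)
```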

Do Large Language Models Need a Content Delivery Network?

1 code implementation16 Sep 2024 Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang

As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries.

In-Context Learning

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

2 code implementations26 May 2024 Jiayi Yao, Hanchen Li, YuHan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang

To speed up the prefill of long LLM inputs, one can pre-compute the KV cache of a text and re-use it when that text appears again as the prefix of another LLM input.

Language Modeling · Language Modelling +1
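
The prefix-reuse idea described above can be sketched with Hugging Face `transformers` as below. This only covers exact prefix reuse; CacheBlend itself goes further and fuses cached KV of chunks that are not strict prefixes. The model choice and prompts are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "Some long shared document text..."
ctx_ids = tok(context, return_tensors="pt").input_ids

# Prefill the shared context once and keep its KV cache.
with torch.no_grad():
    ctx_out = model(ctx_ids, use_cache=True)
kv_cache = ctx_out.past_key_values

# A later request that reuses the context as its prefix only needs to
# prefill the new suffix.
suffix_ids = tok(" Question: what does it say?", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(suffix_ids, past_key_values=kv_cache, use_cache=True)
```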

NetLLM: Adapting Large Language Models for Networking

no code implementations4 Feb 2024 Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, Fangxin Wang

Motivated by the recent success of large language models (LLMs), this work studies LLM adaptation for networking to explore a more sustainable design philosophy.

Answer Generation · Language Modelling +3

Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

no code implementations23 Jan 2024 Hanchen Li, YuHan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, Junchen Jiang

To render each generated token for users in real time, the large language model (LLM) server generates tokens one by one and streams each token (or a small group of tokens) over the network to the user right after generation; we refer to this as LLM token streaming.

Chatbot · Language Modelling +1
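
A bare-bones sketch of the token-streaming pattern defined above, with placeholder `llm.generate_stream` and `send` callables; Eloquent's contribution is making the transmission itself robust to packet loss, which this sketch does not attempt.

```python
def stream_tokens(llm, prompt, send, group_size=1):
    buffer = []
    for token in llm.generate_stream(prompt):   # yields tokens one by one
        buffer.append(token)
        if len(buffer) >= group_size:
            send("".join(buffer))                # push to the user immediately
            buffer.clear()
    if buffer:
        send("".join(buffer))                    # flush any trailing tokens
```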

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

2 code implementations11 Oct 2023 YuHan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, YuYang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang

Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality.

Language Modeling · Language Modelling +2
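
For intuition on why a 3.5-4.3x smaller KV cache matters when contexts are fetched over the network, here is back-of-the-envelope arithmetic with illustrative (assumed) model dimensions, not figures from the paper.

```python
# 2 accounts for keys and values; fp16 values take 2 bytes each.
layers, kv_heads, head_dim = 32, 32, 128        # e.g., a 7B-class transformer
seq_len, bytes_per_value = 32_000, 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")      # ~16.8 GB before compression
print(f"At 4x smaller: {kv_bytes / 4 / 1e9:.1f} GB")
```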

Automatic and Efficient Customization of Neural Networks for ML Applications

no code implementations7 Oct 2023 YuHan Liu, Chengcheng Wan, Kuntai Du, Henry Hoffmann, Junchen Jiang, Shan Lu, Michael Maire

ML APIs have greatly relieved application developers of the burden of designing and training their own neural network models -- classifying objects in an image can now be as simple as one line of Python code calling an API.
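
To make the "one line of Python" point concrete, here is an example using Google Cloud Vision as a representative ML API; any comparable vision API would do, and client details vary by library version.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

labels = client.label_detection(image=image).label_annotations  # the "one line"
print([label.description for label in labels])
```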

OneAdapt: Fast Configuration Adaptation for Video Analytics Applications via Backpropagation

no code implementations3 Oct 2023 Kuntai Du, YuHan Liu, Yitian Hao, Qizheng Zhang, Haodong Wang, YuYang Huang, Ganesh Ananthanarayanan, Junchen Jiang

While the high demand for network bandwidth and GPU resources could be substantially reduced by optimally adapting configuration knobs, such as video resolution and frame rate, current adaptation techniques fail to meet three requirements simultaneously: adapt configurations (i) with minimum extra GPU or bandwidth overhead, (ii) to reach near-optimal decisions based on how the data affects the final DNN's accuracy, and (iii) across a range of configuration knobs.

Deep Learning · object-detection +1
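
As a heavily simplified illustration of adapting knobs under a resource budget (not OneAdapt's method, which backpropagates through the analytics DNN; this sketch substitutes a finite-difference gradient estimate and assumes hypothetical `accuracy_proxy` and `cost` functions):

```python
def adapt_knobs(knobs, accuracy_proxy, cost, budget, lr=0.1, eps=1e-3):
    new_knobs = {}
    for name, value in knobs.items():
        # Finite-difference estimate of d(accuracy)/d(knob), standing in for
        # the gradient that OneAdapt obtains via backpropagation.
        grad = (accuracy_proxy({**knobs, name: value + eps}) -
                accuracy_proxy(knobs)) / eps
        new_knobs[name] = value + lr * grad
    # Reject the update if it exceeds the GPU/bandwidth budget.
    if cost(new_knobs) > budget:
        return knobs
    return new_knobs
```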

GRACE: Loss-Resilient Real-Time Video through Neural Codecs

no code implementations21 May 2023 Yihua Cheng, Ziyi Zhang, Hanchen Li, Anton Arapin, Yue Zhang, Qizheng Zhang, YuHan Liu, Xu Zhang, Francis Y. Yan, Amrita Mazumdar, Nick Feamster, Junchen Jiang

In real-time video communication, retransmitting lost packets over high-latency networks is not viable due to strict latency requirements.

Decoder

AccMPEG: Optimizing Video Encoding for Video Analytics

no code implementations26 Apr 2022 Kuntai Du, Qizheng Zhang, Anton Arapin, Haodong Wang, Zhengxu Xia, Junchen Jiang

This paper presents AccMPEG, a new video encoding and streaming system that meets all three requirements.

object-detection · Object Detection +1

Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers

no code implementations19 Dec 2020 Romil Bhardwaj, Zhengxu Xia, Ganesh Ananthanarayanan, Junchen Jiang, Nikolaos Karianakis, Yuanchao Shu, Kevin Hsieh, Victor Bahl, Ion Stoica

Compressed models deployed on edge servers for inference suffer from data drift, where the live video data diverges from the training data.

Domain-specific Communication Optimization for Distributed DNN Training

no code implementations16 Aug 2020 Hao Wang, Jingrong Chen, Xinchen Wan, Han Tian, Jiacheng Xia, Gaoxiong Zeng, Weiyan Wang, Kai Chen, Wei Bai, Junchen Jiang

Communication overhead poses an important obstacle to distributed DNN training and has drawn increasing attention in recent years.

Scheduling

Addressing Training Bias via Automated Image Annotation

no code implementations22 Sep 2018 Zhujun Xiao, Yanzi Zhu, Yuxin Chen, Ben Y. Zhao, Junchen Jiang, Hai-Tao Zheng

Building accurate DNN models requires training on large, labeled, context-specific datasets, especially those matching the target scenario.

Scaling Video Analytics Systems to Large Camera Deployments

no code implementations7 Sep 2018 Samvit Jain, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Joseph E. Gonzalez

Driven by advances in computer vision and the falling costs of camera hardware, organizations are deploying video cameras en masse for the spatial monitoring of their physical premises.
