Search Results for author: Shang Yang

Found 12 papers, 9 papers with code

Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

no code implementations • 2 Jul 2025 • Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency.

Image Generation · Prediction

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

2 code implementations • 20 Feb 2025 • Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han

On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

2 code implementations • 14 Oct 2024 • Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han

Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads, reducing the LLM's decoding and pre-filling memory and latency without compromising its long-context abilities.
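The dual-cache idea can be illustrated with a small Python sketch. This is not the authors' implementation; the class name, the sink/window sizes, and the token representation are all invented for illustration. The point is only that a streaming head's cache stays constant-length while a retrieval head's cache grows with the sequence.

```python
from collections import deque

class DuoKVCache:
    """Toy per-head KV cache in the spirit of DuoAttention (illustrative only).

    A retrieval head keeps the full KV history. A streaming head keeps a few
    initial "sink" entries plus a fixed-size window of recent entries, so its
    memory footprint is constant regardless of sequence length.
    """

    def __init__(self, is_retrieval_head, num_sink=4, window=8):
        self.is_retrieval_head = is_retrieval_head
        self.num_sink = num_sink
        self.full = []                       # full history (retrieval heads)
        self.sink = []                       # first few entries (streaming heads)
        self.recent = deque(maxlen=window)   # sliding window (streaming heads)

    def append(self, kv):
        if self.is_retrieval_head:
            self.full.append(kv)
        elif len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)           # deque evicts the oldest entry

    def cached(self):
        if self.is_retrieval_head:
            return list(self.full)
        return self.sink + list(self.recent)

# after 100 decoded tokens, the streaming head holds only 4 + 8 = 12 entries
retrieval, streaming = DuoKVCache(True), DuoKVCache(False)
for t in range(100):
    retrieval.append(t)
    streaming.append(t)
```

After the loop, `retrieval.cached()` holds all 100 entries while `streaming.cached()` holds the 4 sink entries plus the 8 most recent ones.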

GPU · Quantization +1

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

2 code implementations • 14 Oct 2024 • Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han

To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens.
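The decomposition can be sketched in a few lines of NumPy: quantize each continuous latent to its nearest codebook entry (the discrete token) and keep the quantization error as the continuous residual. This is a generic vector-quantization split for illustration, not HART's actual tokenizer; the codebook and shapes are made up.

```python
import numpy as np

def hybrid_tokenize(latents, codebook):
    """Split continuous latents into discrete tokens + continuous residuals
    (illustrative sketch of the hybrid-tokenizer idea, not HART's code)."""
    # squared distances between each latent and each codebook entry: (N, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)             # discrete "big picture"
    residuals = latents - codebook[tokens]    # continuous fine detail
    return tokens, residuals

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # hypothetical 16-entry codebook, dim 4
latents = rng.normal(size=(5, 4))     # hypothetical autoencoder latents
tokens, residuals = hybrid_tokenize(latents, codebook)
# by construction, codebook lookup plus residual reconstructs the latent exactly
recon = codebook[tokens] + residuals
```

The residual term is exactly what the discrete tokens cannot represent, which is why adding it back recovers the original latent.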

Image Generation · Image Reconstruction

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

1 code implementation • 14 Oct 2024 • Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han

With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality.
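A spatial compression ratio of r maps an H x W image to an (H/r) x (W/r) latent grid, so the arithmetic behind the f128 claim is easy to check. This is back-of-envelope illustration only, not the paper's code; the comparison against a typical f8 autoencoder is an assumption for scale.

```python
def latent_grid(hw, ratio):
    """Latent grid size for an H x W image at spatial compression ratio `ratio`
    (illustrative arithmetic, not the paper's implementation)."""
    h, w = hw
    assert h % ratio == 0 and w % ratio == 0, "image must divide evenly"
    return h // ratio, w // ratio

# at f128, a 1024 x 1024 image compresses to an 8 x 8 latent grid
f128 = latent_grid((1024, 1024), 128)
# a common f8 autoencoder would instead produce a 128 x 128 grid,
# i.e. (128/8)^2 = 256x more latent positions for the diffusion model
f8 = latent_grid((1024, 1024), 8)
```

Fewer latent positions is exactly what makes high-resolution diffusion cheaper downstream, provided reconstruction quality holds at the higher ratio.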

GPU · Image Generation

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

1 code implementation • 19 Aug 2024 • Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing.

Video Captioning · Video Question Answering +1

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

4 code implementations • 7 May 2024 • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.
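The name W4A8KV4 encodes the precision scheme: 4-bit weights, 8-bit activations, 4-bit KV cache. A generic symmetric quantization round-trip shows why the bit-width matters; this is a textbook sketch for illustration, not QServe's GPU kernels, and the per-tensor scaling is an assumption (real systems typically use per-group scales).

```python
import numpy as np

def quantize_sym(x, bits):
    """Per-tensor symmetric quantization sketch (not QServe's kernels).
    INT4 uses the range [-8, 7]; INT8 uses [-128, 127]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(size=256).astype(np.float32)   # stand-in for a weight tensor
q4, s4 = quantize_sym(w, 4)   # W4 / KV4: coarse 16-level grid
q8, s8 = quantize_sym(w, 8)   # A8: finer 256-level grid
err4 = np.abs(dequantize(q4, s4) - w).mean()
err8 = np.abs(dequantize(q8, s8) - w).mean()
```

The 4-bit grid halves memory traffic relative to 8-bit at the cost of larger rounding error, which is why the paper pairs the aggressive precisions with system-level co-design to keep dequantization off the slow path.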

GPU · Language Modelling +2

TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs

1 code implementation • 25 Oct 2023 • Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, Song Han

On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads.

Autonomous Driving · GPU +1
