Search Results for author: Guangxuan Xiao

Found 14 papers, 11 papers with code

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

1 code implementation • 16 Jun 2024 • Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

By loading only the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy.
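As a rough sketch of the query-aware selection described above: each KV-cache page keeps element-wise min/max key statistics, the current query is used to upper-bound each page's attention score, and only the Top-K highest-scoring pages are loaded for the actual attention. The function and variable names below are illustrative, not from the released code.

```python
import numpy as np

def topk_page_attention(q, keys, values, page_size=16, top_k=4):
    """Attend over only the Top-K 'critical' KV-cache pages for one query.

    q:      (d,)    current query vector
    keys:   (n, d)  cached keys
    values: (n, d)  cached values
    """
    n, d = keys.shape
    n_pages = int(np.ceil(n / page_size))

    # Per-page min/max key statistics (metadata kept alongside each page).
    scores = np.empty(n_pages)
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        k_min, k_max = page.min(axis=0), page.max(axis=0)
        # Upper bound on q @ k for any k in the page: per dimension, pick
        # whichever extreme maximizes the product with q.
        scores[p] = np.maximum(q * k_min, q * k_max).sum()

    # Load only the Top-K critical pages and run standard attention on them.
    critical = np.argsort(scores)[-top_k:]
    idx = np.concatenate([np.arange(p * page_size, min((p + 1) * page_size, n))
                          for p in sorted(critical)])
    logits = keys[idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]

# Tiny usage example with random data.
rng = np.random.default_rng(0)
out = topk_page_attention(rng.normal(size=64), rng.normal(size=(256, 64)),
                          rng.normal(size=(256, 64)))
print(out.shape)  # (64,)
```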

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

1 code implementation • 7 May 2024 • Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.

Language Modelling, Large Language Model +1
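For context, W4A8KV4 denotes 4-bit weights, 8-bit activations, and a 4-bit KV cache. The toy sketch below illustrates only the W4A8 dataflow that naming implies, i.e. unpacking packed 4-bit weights to integers so the matrix multiply can run in low precision with a single rescale at the end; it is not the paper's kernel or system co-design, and all names here are made up.

```python
import numpy as np

def pack_int4(w_int4):
    """Pack signed 4-bit weights (values in [-8, 7]) two per byte."""
    u = (w_int4.astype(np.int16) & 0xF).astype(np.uint8)       # unsigned nibbles
    return (u[..., 0::2] | (u[..., 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover signed 4-bit weights from the packed representation."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    both = np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)
    return np.where(both >= 8, both - 16, both)                 # sign-extend

def w4a8_matmul(x_int8, packed_w, w_scale, x_scale):
    """INT4-weight / INT8-activation GEMM with one rescale at the end."""
    w_int4 = unpack_int4(packed_w)                              # 4-bit -> 8-bit ints
    acc = x_int8.astype(np.int32) @ w_int4.T.astype(np.int32)   # integer GEMM
    return acc * (x_scale * w_scale)                            # back to float

# Usage: quantize a random layer and compare against the float result.
rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64)); x = rng.normal(size=(4, 64))
w_scale = np.abs(w).max() / 7;  x_scale = np.abs(x).max() / 127
w_q = np.clip(np.round(w / w_scale), -8, 7).astype(np.int8)
x_q = np.clip(np.round(x / x_scale), -128, 127).astype(np.int8)
print(np.abs(w4a8_matmul(x_q, pack_int4(w_q), w_scale, x_scale) - x @ w.T).max())
```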

Retrieval Head Mechanistically Explains Long-Context Factuality

1 code implementation • 24 Apr 2024 • Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context.

Continual Pretraining, Hallucination +3

BitDelta: Your Fine-Tune May Only Be Worth One Bit

1 code implementation • 15 Feb 2024 • James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
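The title's point is that the weight delta between the fine-tuned and base model can be compressed to a sign matrix plus a single scale per weight matrix. A minimal sketch of that 1-bit delta idea, initializing the scale to the mean absolute delta; the actual method additionally calibrates these scales, which is omitted here.

```python
import numpy as np

def compress_delta(w_base, w_finetuned):
    """1-bit compression of a fine-tune: keep sign(delta) and one scale."""
    delta = w_finetuned - w_base
    scale = np.abs(delta).mean()             # per-matrix scale (mean |delta|)
    signs = np.sign(delta).astype(np.int8)   # +/-1, storable as 1 bit per weight
    return signs, scale

def reconstruct(w_base, signs, scale):
    """Serve the fine-tune as base weights plus the 1-bit delta."""
    return w_base + scale * signs

# Usage on a toy weight matrix: the fine-tune is a small perturbation of the base.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(128, 128))
w_ft = w_base + 0.01 * rng.normal(size=(128, 128))
signs, scale = compress_delta(w_base, w_ft)
w_hat = reconstruct(w_base, signs, scale)
print(np.abs(w_hat - w_ft).mean(), np.abs(w_base - w_ft).mean())
```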

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

1 code implementation • 7 Feb 2024 • Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun

In this paper, we unveil the intrinsic capacity of LLMs for understanding extremely long sequences without any fine-tuning.

Efficient Streaming Language Models with Attention Sinks

5 code implementations • 29 Sep 2023 • Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis

In this paper, we first demonstrate that attention sinks emerge because of the strong attention scores directed at initial tokens, which act as a "sink" even when they are not semantically important.

Language Modelling
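That observation motivates the streaming cache policy the paper builds on it: keep the key/value entries of a few initial "sink" tokens permanently and otherwise retain only a sliding window of recent tokens, so the cache stays bounded during streaming. A minimal sketch, with illustrative names and sizes; it omits the paper's re-assignment of positions within the cache.

```python
from collections import deque

class SinkKVCache:
    """Bounded KV cache: a few permanent 'sink' entries + a recent window."""

    def __init__(self, n_sink=4, window=1020):
        self.n_sink = n_sink
        self.sink = []                       # KV of the first n_sink tokens, kept forever
        self.recent = deque(maxlen=window)   # rolling window of the latest tokens

    def append(self, kv):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)           # deque evicts the oldest automatically

    def entries(self):
        """KV entries actually attended to at the current step."""
        return self.sink + list(self.recent)

# Usage: stream 10,000 tokens; the cache never exceeds n_sink + window entries.
cache = SinkKVCache(n_sink=4, window=1020)
for t in range(10_000):
    cache.append(("k%d" % t, "v%d" % t))
print(len(cache.entries()))  # 1024
```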

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

1 code implementation • 17 May 2023 • Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han

FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation.

Denoising, Diffusion Personalization Tuning Free +2
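As a rough illustration of delayed subject conditioning: the denoising loop below uses the text-only condition for an initial fraction of steps (preserving layout and editability) and the subject-augmented condition afterwards (preserving identity). The stand-in denoiser, the 20% switch point, and all names are assumptions made for this sketch, not values from the paper.

```python
import numpy as np

def delayed_subject_conditioning(denoise_step, x_T, text_cond, subject_cond,
                                 num_steps=50, switch_ratio=0.2):
    """Denoising loop that delays subject conditioning.

    The first `switch_ratio` fraction of steps uses text-only conditioning so
    the overall layout stays editable; the remaining steps use the
    subject-augmented conditioning so the generated subject keeps its identity.
    """
    x = x_T
    switch_step = int(num_steps * switch_ratio)
    for t in range(num_steps):
        cond = text_cond if t < switch_step else subject_cond
        x = denoise_step(x, t, cond)
    return x

# Usage with a stand-in denoiser (a real pipeline would call the diffusion UNet).
rng = np.random.default_rng(0)
fake_denoise = lambda x, t, c: 0.9 * x + 0.1 * c
text_only = rng.normal(size=(16,))
with_subject = text_only + 0.5 * rng.normal(size=(16,))
result = delayed_subject_conditioning(fake_denoise, rng.normal(size=(16,)),
                                      text_only, with_subject)
print(result.shape)  # (16,)
```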

Sparse and Local Networks for Hypergraph Reasoning

no code implementations • 9 Mar 2023 • Guangxuan Xiao, Leslie Pack Kaelbling, Jiajun Wu, Jiayuan Mao

Reasoning about the relationships between entities from input facts (e.g., whether Ari is a grandparent of Charlie) generally requires explicit consideration of other entities that are not mentioned in the query (e.g., the parents of Charlie).

Knowledge Graphs, World Knowledge

Offsite-Tuning: Transfer Learning without Full Model

1 code implementation • 9 Feb 2023 • Guangxuan Xiao, Ji Lin, Song Han

In this paper, we propose Offsite-Tuning, a privacy-preserving and efficient transfer learning framework that can adapt billion-parameter foundation models to downstream data without access to the full model.

Privacy Preserving, Transfer Learning
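Structurally, the framework pairs a small trainable adapter (a few layers at each end of the model) with a lossy compressed "emulator" of the frozen middle layers; the data owner fine-tunes the adapter against the emulator on private data and returns only the adapter, which is plugged back into the full model. The toy sketch below shows that split; the layer counts, the layer-dropping compression, and the single "noisy update" standing in for fine-tuning are all illustrative assumptions, not the paper's recipe.

```python
import numpy as np

def run(layers, h):
    """Stand-in forward pass: each 'layer' is just a weight matrix."""
    for w in layers:
        h = np.tanh(h @ w)
    return h

def make_emulator(middle_layers, keep_every=2):
    """Lossy compression of the frozen middle; here, simple layer dropping.
    (The paper also distills the emulator; this is only a structural sketch.)"""
    return [w.copy() for i, w in enumerate(middle_layers) if i % keep_every == 0]

rng = np.random.default_rng(0)
full_model = [rng.normal(scale=0.1, size=(32, 32)) for _ in range(12)]

# Model owner: split off a small adapter (first/last layers), compress the
# frozen middle into an emulator, and ship only adapter + emulator.
bottom, middle, top = full_model[:2], full_model[2:-2], full_model[-2:]
emulator = make_emulator(middle)

# Data owner: fine-tune the adapter locally against the emulator on private
# data (sketched here as a single noisy update), never seeing `middle`.
x_private = rng.normal(size=(1, 32))
_ = run(bottom + emulator + top, x_private)            # local forward pass
tuned_bottom = [w + 0.01 * rng.normal(size=w.shape) for w in bottom]
tuned_top = [w + 0.01 * rng.normal(size=w.shape) for w in top]

# Model owner: plug the returned adapter back into the full model for serving.
y = run(tuned_bottom + middle + tuned_top, x_private)
print(y.shape)  # (1, 32)
```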

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

5 code implementations • 18 Nov 2022 • Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.

Quantization
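SmoothQuant's core trick is to migrate quantization difficulty from activations (which have outlier channels) to weights via a per-input-channel scale s_j = max|x_j|^α / max|w_j|^(1−α): dividing activations by s and multiplying weights by s leaves the layer's output unchanged while making both operands easier to quantize to INT8. A minimal numpy sketch; per-tensor symmetric INT8 quantization and α = 0.5 are simplifying choices here.

```python
import numpy as np

def quantize_int8(t):
    """Symmetric per-tensor INT8 quantization."""
    scale = np.abs(t).max() / 127
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

def smooth(x, w, alpha=0.5):
    """Migrate quantization difficulty from activations to weights.

    Per input channel j:  s_j = max|x_j|^alpha / max|w_j|^(1 - alpha),
    then x <- x / s and w <- w * s, which leaves x @ w.T unchanged.
    """
    s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=0) ** (1 - alpha))
    return x / s, w * s

# Usage: activations with a few outlier channels (the hard case for W8A8 PTQ).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 128)); x[:, :4] *= 50          # outlier channels
w = rng.normal(size=(256, 128))
ref = x @ w.T

for name, (xs, ws) in {"naive": (x, w), "smoothed": smooth(x, w)}.items():
    (xq, sx), (wq, sw) = quantize_int8(xs), quantize_int8(ws)
    approx = (xq.astype(np.int32) @ wq.T.astype(np.int32)) * sx * sw
    print(name, np.abs(approx - ref).mean())
```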

Efficient Training and Inference of Hypergraph Reasoning Networks

no code implementations • 29 Sep 2021 • Guangxuan Xiao, Leslie Pack Kaelbling, Jiajun Wu, Jiayuan Mao

To leverage the sparsity in hypergraph neural networks, SpaLoc represents the grounding of relationships such as parent and grandparent as sparse tensors and uses neural networks and finite-domain quantification operations to infer new facts based on the input.

Knowledge Graphs, Logical Reasoning +1
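The finite-domain quantification over sparse relation tensors can be illustrated with the parent/grandparent example mentioned above: grounding `parent` as a sparse boolean matrix, existential quantification over the intermediate entity becomes a sparse matrix product. A minimal sketch using scipy.sparse, not the paper's code or its learned relaxation.

```python
import numpy as np
from scipy import sparse

# Ground the 'parent' relation as a sparse boolean matrix over a finite domain
# of entities: parent[i, j] = 1 means entity i is a parent of entity j.
entities = ["ari", "bo", "charlie", "dana"]
idx = {name: i for i, name in enumerate(entities)}
pairs = [("ari", "bo"), ("bo", "charlie"), ("bo", "dana")]
rows, cols = zip(*[(idx[a], idx[b]) for a, b in pairs])
parent = sparse.csr_matrix((np.ones(len(pairs)), (rows, cols)),
                           shape=(len(entities), len(entities)))

# Existential quantification over the intermediate entity y:
#   grandparent(x, z) = exists y. parent(x, y) and parent(y, z)
# becomes a sparse matrix product followed by thresholding.
grandparent = (parent @ parent) > 0

for x, z in zip(*grandparent.nonzero()):
    print(f"{entities[x]} is a grandparent of {entities[z]}")
# -> ari is a grandparent of charlie; ari is a grandparent of dana
```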

Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks

1 code implementation • ICML Workshop AML 2021 • Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, Maosong Sun

In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks.

Backdoor Attack
