Search Results for author: Qingru Zhang

Found 11 papers, 8 papers with code

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

1 code implementation8 Mar 2024 Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao

Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference.

Quantization

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs

1 code implementation3 Nov 2023 Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao

In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers.

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

2 code implementations18 Mar 2023 Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, Tuo Zhao

Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e. g., low-rank increments.

Question Answering Text Generation

Less is More: Task-aware Layer-wise Distillation for Language Model Compression

1 code implementation4 Oct 2022 Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao

As such, TED reduces the knowledge gap between the two models and helps the student to fit better on the target task.

Language Modelling Model Compression

A Biased Graph Neural Network Sampler with Near-Optimal Regret

1 code implementation NeurIPS 2021 Qingru Zhang, David Wipf, Quan Gan, Le Song

Graph neural networks (GNN) have recently emerged as a vehicle for applying deep network architectures to graph and relational data.

Cannot find the paper you are looking for? You can Submit a new open access paper.