Search Results for author: Zhanpeng Zeng

Found 6 papers, 4 papers with code

LookupFFN: Making Transformers Compute-lite for CPU inference

1 code implementation • 12 Mar 2024 • Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh

While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons, including ease of workflow, security, and cost, have led to efforts investigating whether CPUs may be viable for routine inference in many sectors of industry.

Language Modelling

IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

no code implementations • 12 Mar 2024 • Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh

A popular strategy is the use of low bit-width integers to approximate the original entries in a matrix.
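To make that strategy concrete, here is a minimal sketch of the generic idea the sentence describes: symmetric uniform quantization of a float matrix to b-bit signed integers. The function name quantize_to_int and the parameter choices are illustrative assumptions, not an API from the paper, whose method goes further than this plain rounding scheme.

```python
import numpy as np

def quantize_to_int(x, bits=4):
    """Symmetric uniform quantization: map the float entries of x to
    signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax           # one scale for the whole matrix
    q = np.round(x / scale).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
q, scale = quantize_to_int(A, bits=4)
A_hat = q * scale                            # dequantized approximation of A
print("max abs error:", np.abs(A - A_hat).max())
```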

FrameQuant: Flexible Low-Bit Quantization for Transformers

no code implementations • 10 Mar 2024 • Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh

If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees.

Quantization
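The noise interpretation is easy to verify for a plain uniform quantizer: rounding never moves a value by more than half a step, so the quantized signal is exactly the original plus bounded noise. The sketch below checks this generic view under an assumed step size; it illustrates the interpretation only, not the paper's quantization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
s = 0.1                              # assumed quantization step size
q = s * np.round(x / s)              # uniform quantizer
e = q - x                            # the induced "noise"

# The error of a uniform quantizer is bounded by half a step, so
# quantization can be modeled as the addition of bounded noise.
print("noise range:", e.min(), e.max())      # within about [-s/2, s/2]
assert np.all(np.abs(e) <= s / 2 + 1e-12)
```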

Multi Resolution Analysis (MRA) for Approximate Self-Attention

1 code implementation • 21 Jul 2022 • Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh

Transformers have emerged as a preferred model for many tasks in natural language processing and vision.

You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

1 code implementation • 18 Nov 2021 • Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh

In this paper, we show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH) decreases the quadratic complexity of such models to linear.
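As a rough illustration of the LSH ingredient: with sign random projections (SimHash), two unit vectors agree on one hash bit with probability 1 - theta/pi (theta being the angle between them), so tau independent bits all agree with probability (1 - theta/pi)**tau, a similarity that Bernoulli trials can estimate. The sketch below only verifies that collision rate empirically; simhash, tau, and the trial loop are illustrative assumptions, not the paper's full linear-cost attention algorithm.

```python
import numpy as np

def simhash(x, projections):
    """Sign random projections: one hash bit per projection vector."""
    return projections @ x > 0

rng = np.random.default_rng(0)
d, tau = 32, 8                        # feature dim, hash bits per vector
q = rng.standard_normal(d); q /= np.linalg.norm(q)
k = rng.standard_normal(d); k /= np.linalg.norm(k)

# One sign projection matches with probability 1 - theta/pi, so tau
# independent bits all match with probability (1 - theta/pi)**tau --
# a quantity a Bernoulli trial samples without materializing the
# full quadratic attention matrix.
trials, hits = 10_000, 0
for _ in range(trials):
    P = rng.standard_normal((tau, d))
    hits += np.array_equal(simhash(q, P), simhash(k, P))

theta = np.arccos(np.clip(q @ k, -1.0, 1.0))
print("empirical:", hits / trials)
print("theory:   ", (1 - theta / np.pi) ** tau)
```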
