Search Results for author: Ankit Singh Rawat

Found 47 papers, 3 papers with code

Language Model Cascades: Token-level uncertainty and beyond

no code implementations15 Apr 2024 Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks.
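
To make the contrast concrete, here is a minimal sketch of deferral driven by token-level uncertainty in a two-model LM cascade; the length-normalised aggregation, threshold, and model interfaces are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def sequence_confidence(token_logprobs):
    # Aggregate per-token log-probabilities into one confidence score;
    # the length-normalised mean is one simple (illustrative) choice.
    return float(np.mean(token_logprobs))

def cascade_generate(prompt, small_lm, large_lm, threshold=-1.0):
    # Run the small model first; defer to the large model when the
    # aggregated token-level confidence falls below the threshold.
    text, token_logprobs = small_lm(prompt)   # hypothetical interface: returns text and per-token log-probs
    if sequence_confidence(token_logprobs) >= threshold:
        return text, "small"
    text, _ = large_lm(prompt)                # fall back to the larger, costlier model
    return text, "large"
```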

Language Modelling

Mechanics of Next Token Prediction with Self-Attention

no code implementations12 Mar 2024 Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1) Hard retrieval:}$ Given an input sequence, self-attention precisely selects the $\textit{high-priority input tokens}$ associated with the last input token.

Retrieval

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

no code implementations21 Feb 2024 M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak

Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation.

Text Generation

DistillSpec: Improving Speculative Decoding via Knowledge Distillation

no code implementations12 Oct 2023 Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
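
For reference, the speculative-decoding loop that DistillSpec builds on accepts or resamples each draft token against the target model. The sketch below shows one token position with toy next-token distributions; the acceptance rule is the standard one, but everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_draft, p_target):
    # One speculative-decoding step at a single position: the draft model
    # proposes a token, the target model verifies it, and on rejection we
    # resample from the residual so the output matches the target distribution.
    token = rng.choice(len(p_draft), p=p_draft)
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False
```

Distilling the draft model toward the target raises the acceptance probability above, which is what the reported latency gains rely on.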

Knowledge Distillation Language Modelling +1

What do larger image classifiers memorise?

no code implementations9 Oct 2023 Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels.

Image Classification Knowledge Distillation +2

Think before you speak: Training Language Models With Pause Tokens

no code implementations3 Oct 2023 Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token.
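
A minimal sketch of the input manipulation studied here: append learnable <pause> tokens so the model performs extra computation before emitting its answer. The token ids and the number of pauses below are purely illustrative.

```python
def append_pause_tokens(input_ids, pause_id, num_pauses=10):
    # Append num_pauses copies of the <pause> token id to the prompt;
    # the answer tokens are read out only after the pause span.
    return input_ids + [pause_id] * num_pauses

# Hypothetical usage: prompt_ids come from any tokenizer, and pause_id is a
# reserved vocabulary id whose embedding is learned during training.
prompt_ids = [101, 2054, 2003, 1017, 1009, 1018, 102]
padded = append_pause_tokens(prompt_ids, pause_id=50257, num_pauses=10)
```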

GSM8K Question Answering

When Does Confidence-Based Cascade Deferral Suffice?

no code implementations NeurIPS 2023 Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sanjiv Kumar

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn.
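
A minimal sketch of the confidence-based deferral rule the paper analyses, using the maximum softmax probability of the first classifier; the model interfaces and the threshold are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cascade_predict(x, small_model, large_model, threshold=0.9):
    # Invoke the small classifier first; defer to the large one when the
    # small model's maximum softmax probability is below the threshold.
    probs = softmax(small_model(x))           # hypothetical interface: returns logits
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"
    probs = softmax(large_model(x))
    return int(probs.argmax()), "large"
```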

On the Role of Attention in Prompt-tuning

no code implementations6 Jun 2023 Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting.

Large Language Models with Controllable Working Memory

no code implementations9 Nov 2022 Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge.

Counterfactual World Knowledge

The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

no code implementations12 Oct 2022 Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse.

Generalization Properties of Retrieval-based Models

no code implementations6 Oct 2022 Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.

Protein Folding Retrieval

A Fourier Approach to Mixture Learning

no code implementations5 Oct 2022 Mingda Qiao, Guru Guruganesh, Ankit Singh Rawat, Avinava Dubey, Manzil Zaheer

Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$.

Teacher Guided Training: An Efficient Framework for Knowledge Transfer

no code implementations14 Aug 2022 Manzil Zaheer, Ankit Singh Rawat, Seungyeon Kim, Chong You, Himanshu Jain, Andreas Veit, Rob Fergus, Sanjiv Kumar

In this paper, we propose the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pretrained generative models, while obviating the need to go through a large volume of data.

Generalization Bounds Image Classification +4

ELM: Embedding and Logit Margins for Long-Tail Learning

no code implementations27 Apr 2022 Wittawat Jitkrittum, Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners.

Contrastive Learning Long-tail Learning

FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

no code implementations28 Jan 2022 Jianyu Wang, Hang Qi, Ankit Singh Rawat, Sashank Reddi, Sagar Waghmare, Felix X. Yu, Gauri Joshi

In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server.
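
As a point of reference, a minimal FedAvg-style sketch of that classical setup (local gradient steps on private data, server-side averaging); the least-squares model, learning rate, and client interface are illustrative assumptions, and the paper's contribution targets the communication cost this sketch ignores.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=5):
    # Each client runs a few gradient steps of linear least-squares on its
    # private data and returns only the updated weights, never the data.
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_round(w, client_data):
    # The coordinating server averages the clients' locally updated models.
    updates = [local_update(w, X, y) for X, y in client_data]
    return np.mean(updates, axis=0)
```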

Federated Learning

When in Doubt, Summon the Titans: Efficient Inference with Large Models

no code implementations19 Oct 2021 Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar

In a nutshell, we use the large teacher models to guide the lightweight student models to make correct predictions only on a subset of "easy" examples; for the "hard" examples, we fall back to the teacher.

Image Classification

In defense of dual-encoders for neural ranking

no code implementations29 Sep 2021 Aditya Krishna Menon, Sadeep Jayasumana, Seungyeon Kim, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Transformer-based models such as BERT have proven successful in the information retrieval problem, which seeks to identify relevant documents for a given query.

Information Retrieval Natural Questions +1

When in Doubt, Summon the Titans: A Framework for Efficient Inference with Large Models

no code implementations29 Sep 2021 Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar

In a nutshell, we use the large teacher models to guide the lightweight student models to make correct predictions only on a subset of "easy" examples; for the "hard" examples, we fall back to the teacher.

Image Classification

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

no code implementations12 May 2021 Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar

Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account.

Retrieval

Distilling Double Descent

no code implementations13 Feb 2021 Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset.
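
A minimal sketch of that training signal: the student is fit to the teacher's temperature-softened output distribution via cross-entropy. The temperature and the absence of a ground-truth loss term are illustrative simplifications.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's: the student is trained on teacher-provided labels.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return -float(np.sum(p_teacher * log_p_student))
```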

On the Reproducibility of Neural Network Predictions

no code implementations5 Feb 2021 Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, Sanjiv Kumar

By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction.

Data Augmentation Image Classification

Overparameterisation and worst-case generalisation: friend or foe?

no code implementations ICLR 2021 Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples.

Structured Prediction

Modifying Memories in Transformer Models

no code implementations1 Dec 2020 Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar

In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}.

Memorization

Long-tail learning via logit adjustment

3 code implementations ICLR 2021 Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv Kumar

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples.
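
The paper's post-hoc variant has a particularly simple form: subtract scaled log class priors from the logits before taking the argmax. The sketch below follows that general recipe; the prior estimates, scaling tau, and numbers are illustrative.

```python
import numpy as np

def logit_adjusted_prediction(logits, class_priors, tau=1.0):
    # Post-hoc logit adjustment: penalise head classes by their log prior
    # so that tail classes are not systematically under-predicted.
    adjusted = logits - tau * np.log(class_priors)
    return int(np.argmax(adjusted))

# Illustrative usage with a skewed prior over four classes.
priors = np.array([0.7, 0.2, 0.07, 0.03])
logits = np.array([2.0, 1.8, 1.9, 1.7])
print(logit_adjusted_prediction(logits, priors))
```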

Long-tail Learning

Adversarial robustness via robust low rank representations

no code implementations NeurIPS 2020 Pranjal Awasthi, Himanshu Jain, Ankit Singh Rawat, Aravindan Vijayaraghavan

Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time.

Adversarial Robustness

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

no code implementations NeurIPS 2020 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.

Why distillation helps: a statistical perspective

no code implementations21 May 2020 Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques.

Knowledge Distillation Retrieval

Can gradient clipping mitigate label noise?

1 code implementation ICLR 2020 Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum.
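
For concreteness, a minimal sketch of the gradient-norm clipping referred to above (the threshold value is illustrative); the paper's question is whether this operation also confers robustness to label noise.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # Rescale the gradient so its L2 norm is at most max_norm, which bounds
    # the size of each optimisation step.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```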

Doubly-stochastic mining for heterogeneous retrieval

no code implementations23 Apr 2020 Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar

Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge.

Retrieval Stochastic Optimization

Federated Learning with Only Positive Labels

1 code implementation ICML 2020 Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class.

Federated Learning Multi-class Classification

Robust Large-Margin Learning in Hyperbolic Space

no code implementations NeurIPS 2020 Melanie Weber, Manzil Zaheer, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar

In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space.

Representation Learning

Reliable Distributed Clustering with Redundant Data Assignment

no code implementations20 Feb 2020 Venkata Gandikota, Arya Mazumdar, Ankit Singh Rawat

In this paper, we present distributed generalized clustering algorithms that can handle large scale data across multiple machines in spite of straggling or unreliable machines.

Clustering Dimensionality Reduction

Low-Rank Bottleneck in Multi-head Attention Models

no code implementations ICML 2020 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

The attention-based Transformer architecture has enabled significant advances in the field of natural language processing.

Are Transformers universal approximators of sequence-to-sequence functions?

no code implementations ICLR 2020 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the extent of parameter sharing in these models.

Multilabel reductions: what is my loss optimising?

no code implementations NeurIPS 2019 Aditya K. Menon, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging.

General Classification Information Retrieval +1

Concise Multi-head Attention Models

no code implementations25 Sep 2019 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

The attention-based Transformer architecture has enabled significant advances in the field of natural language processing.

Learning Network Parameters in the ReLU Model

no code implementations NeurIPS Workshop Deep_Invers 2019 Arya Mazumdar, Ankit Singh Rawat

Rectified linear units, or ReLUs, have become a preferred activation function for artificial neural networks.

Sampled Softmax with Random Fourier Features

no code implementations NeurIPS 2019 Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar

For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method.
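
A minimal sketch of the sampled softmax estimator described above, using uniform negative sampling with the standard log-correction of the logits; the paper's contribution is a better sampling distribution built from Random Fourier Features, which this sketch does not implement.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_softmax_loss(logits, true_class, num_samples=5):
    # Approximate the full softmax cross-entropy with the true class plus a
    # few uniformly sampled negatives, correcting each logit by the log of
    # its sampling probability.
    num_classes = len(logits)
    candidates = np.setdiff1d(np.arange(num_classes), [true_class])
    negatives = rng.choice(candidates, size=num_samples, replace=False)
    classes = np.concatenate(([true_class], negatives))
    q = np.full(len(classes), num_samples / num_classes)  # uniform proposal probability
    q[0] = 1.0                                            # true class is always included
    corrected = logits[classes] - np.log(q)
    corrected = corrected - corrected.max()
    log_probs = corrected - np.log(np.exp(corrected).sum())
    return -float(log_probs[0])
```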

Robust Gradient Descent via Moment Encoding with LDPC Codes

no code implementations22 May 2018 Raj Kumar Maity, Ankit Singh Rawat, Arya Mazumdar

We, instead, propose to encode the second moment of the data with a low-density parity-check (LDPC) code.

Distributed Computing

Representation Learning and Recovery in the ReLU Model

no code implementations12 Mar 2018 Arya Mazumdar, Ankit Singh Rawat

Given a set of observation vectors $\mathbf{y}^i \in \mathbb{R}^d, i =1, 2, \dots , n$, we aim to recover $d\times k$ matrix $A$ and the latent vectors $\{\mathbf{c}^i\} \subset \mathbb{R}^k$ under the model $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i +\mathbf{b})$, where $\mathbf{b}\in \mathbb{R}^d$ is a random bias.
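
For intuition, a minimal sketch of the forward direction of this model, i.e. generating observations $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i + \mathbf{b})$; the dimensions and Gaussian draws below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 20, 5, 100
A = rng.standard_normal((d, k))            # unknown d x k matrix to be recovered
b = rng.standard_normal(d)                 # random bias vector
C = rng.standard_normal((k, n))            # latent vectors c^i stacked as columns
Y = np.maximum(A @ C + b[:, None], 0.0)    # observations y^i = ReLU(A c^i + b)
```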

Dictionary Learning Representation Learning

Lifting high-dimensional nonlinear models with Gaussian regressors

no code implementations11 Dec 2017 Christos Thrampoulidis, Ankit Singh Rawat

Unfortunately, both least-squares and the Lasso fail to recover $\mathbf{x}_0$ when $\mu_\ell=0$.

Vocal Bursts Intensity Prediction

Associative Memory using Dictionary Learning and Expander Decoding

no code implementations29 Nov 2016 Arya Mazumdar, Ankit Singh Rawat

Designing an associative memory requires addressing two main tasks: 1) learning phase: given a dataset, learn a concise representation of the dataset in the form of a graphical model (or a neural network), 2) recall phase: given a noisy version of a message vector from the dataset, output the correct message vector via a neurally feasible algorithm over the network learnt during the learning phase.

Dictionary Learning

Associative Memory via a Sparse Recovery Model

no code implementations NeurIPS 2015 Arya Mazumdar, Ankit Singh Rawat

An associative memory is a structure learned from a dataset $\mathcal{M}$ of vectors (signals) in a way such that, given a noisy version of one of the vectors as input, the nearest valid vector from $\mathcal{M}$ (nearest neighbor) is provided as output, preferably via a fast iterative algorithm.
