Search Results for author: Ankit Singh Rawat

Found 47 papers, 3 papers with code

Language Model Cascades: Token-level uncertainty and beyond

no code implementations15 Apr 2024 Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks.
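
To make the contrast concrete, here is a minimal sketch of deferral driven by token-level uncertainty in a two-model LM cascade; the length-normalised aggregation, threshold, and model interfaces are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def sequence_confidence(token_logprobs):
    # Aggregate per-token log-probabilities into one confidence score;
    # the length-normalised mean is one simple (illustrative) choice.
    return float(np.mean(token_logprobs))

def cascade_generate(prompt, small_lm, large_lm, threshold=-1.0):
    # Run the small model first; defer to the large model when the
    # aggregated token-level confidence falls below the threshold.
    text, token_logprobs = small_lm(prompt)   # hypothetical interface: returns text and per-token log-probs
    if sequence_confidence(token_logprobs) >= threshold:
        return text, "small"
    text, _ = large_lm(prompt)                # fall back to the larger, costlier model
    return text, "large"
```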

Language Modelling

Mechanics of Next Token Prediction with Self-Attention

no code implementations12 Mar 2024 Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1) Hard retrieval:}$ Given an input sequence, self-attention precisely selects the $\textit{high-priority input tokens}$ associated with the last input token.

Retrieval

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

no code implementations21 Feb 2024 M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak

Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation.

Text Generation

DistillSpec: Improving Speculative Decoding via Knowledge Distillation

no code implementations12 Oct 2023 Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
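
For reference, the speculative-decoding loop that DistillSpec builds on accepts or resamples each draft token against the target model. The sketch below shows one token position with toy next-token distributions; the acceptance rule is the standard one, but everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_draft, p_target):
    # One speculative-decoding step at a single position: the draft model
    # proposes a token, the target model verifies it, and on rejection we
    # resample from the residual so the output matches the target distribution.
    token = rng.choice(len(p_draft), p=p_draft)
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False
```

Distilling the draft model toward the target raises the acceptance probability above, which is what the reported latency gains rely on.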

Knowledge Distillation Language Modelling +1

What do larger image classifiers memorise?

no code implementations9 Oct 2023 Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels.

Image Classification Knowledge Distillation +2

Think before you speak: Training Language Models With Pause Tokens

no code implementations3 Oct 2023 Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token.
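
A minimal sketch of the input manipulation studied here: append learnable <pause> tokens so the model performs extra computation before emitting its answer. The token ids and the number of pauses below are purely illustrative.

```python
def append_pause_tokens(input_ids, pause_id, num_pauses=10):
    # Append num_pauses copies of the <pause> token id to the prompt;
    # the answer tokens are read out only after the pause span.
    return input_ids + [pause_id] * num_pauses

# Hypothetical usage: prompt_ids come from any tokenizer, and pause_id is a
# reserved vocabulary id whose embedding is learned during training.
prompt_ids = [101, 2054, 2003, 1017, 1009, 1018, 102]
padded = append_pause_tokens(prompt_ids, pause_id=50257, num_pauses=10)
```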

GSM8K Question Answering

When Does Confidence-Based Cascade Deferral Suffice?

no code implementations NeurIPS 2023 Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sanjiv Kumar

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn.
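
A minimal sketch of the confidence-based deferral rule the paper analyses, using the maximum softmax probability of the first classifier; the model interfaces and the threshold are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cascade_predict(x, small_model, large_model, threshold=0.9):
    # Invoke the small classifier first; defer to the large one when the
    # small model's maximum softmax probability is below the threshold.
    probs = softmax(small_model(x))           # hypothetical interface: returns logits
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"
    probs = softmax(large_model(x))
    return int(probs.argmax()), "large"
```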

On the Role of Attention in Prompt-tuning

no code implementations6 Jun 2023 Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting.

Large Language Models with Controllable Working Memory

no code implementations9 Nov 2022 Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge.

Counterfactual World Knowledge

The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

no code implementations12 Oct 2022 Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse.

Generalization Properties of Retrieval-based Models

no code implementations6 Oct 2022 Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.

Protein Folding Retrieval

A Fourier Approach to Mixture Learning

no code implementations5 Oct 2022 Mingda Qiao, Guru Guruganesh, Ankit Singh Rawat, Avinava Dubey, Manzil Zaheer

Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$.

Teacher Guided Training: An Efficient Framework for Knowledge Transfer

no code implementations14 Aug 2022 Manzil Zaheer, Ankit Singh Rawat, Seungyeon Kim, Chong You, Himanshu Jain, Andreas Veit, Rob Fergus, Sanjiv Kumar

In this paper, we propose the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pretrained generative models, while obviating the need to go through a large volume of data.

Generalization Bounds Image Classification +4

ELM: Embedding and Logit Margins for Long-Tail Learning

no code implementations27 Apr 2022 Wittawat Jitkrittum, Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners.

Contrastive Learning Long-tail Learning

FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

no code implementations28 Jan 2022 Jianyu Wang, Hang Qi, Ankit Singh Rawat, Sashank Reddi, Sagar Waghmare, Felix X. Yu, Gauri Joshi

In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server.
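
As a point of reference, a minimal FedAvg-style sketch of that classical setup (local gradient steps on private data, server-side averaging); the least-squares model, learning rate, and client interface are illustrative assumptions, and the paper's contribution targets the communication cost this sketch ignores.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=5):
    # Each client runs a few gradient steps of linear least-squares on its
    # private data and returns only the updated weights, never the data.
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_round(w, client_data):
    # The coordinating server averages the clients' locally updated models.
    updates = [local_update(w, X, y) for X, y in client_data]
    return np.mean(updates, axis=0)
```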

Federated Learning

When in Doubt, Summon the Titans: Efficient Inference with Large Models

no code implementations19 Oct 2021 Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar

In a nutshell, we use the large teacher models to guide the lightweight student models to make correct predictions only on a subset of "easy" examples; for the "hard" examples, we fall back to the teacher.

Image Classification

In defense of dual-encoders for neural ranking

no code implementations29 Sep 2021 Aditya Krishna Menon, Sadeep Jayasumana, Seungyeon Kim, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Transformer-based models such as BERT have proven successful in the information retrieval problem, which seeks to identify relevant documents for a given query.

Information Retrieval Natural Questions +1

When in Doubt, Summon the Titans: A Framework for Efficient Inference with Large Models

no code implementations29 Sep 2021 Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar

In a nutshell, we use the large teacher models to guide the lightweight student models to make correct predictions only on a subset of "easy" examples; for the "hard" examples, we fall back to the teacher.

Image Classification

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

no code implementations12 May 2021 Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar

Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account.

Retrieval

Distilling Double Descent

no code implementations13 Feb 2021 Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset.
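
A minimal sketch of that training signal: the student is fit to the teacher's temperature-softened output distribution via cross-entropy. The temperature and the absence of a ground-truth loss term are illustrative simplifications.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's: the student is trained on teacher-provided labels.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return -float(np.sum(p_teacher * log_p_student))
```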

On the Reproducibility of Neural Network Predictions

no code implementations5 Feb 2021 Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, Sanjiv Kumar

By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction.

Data Augmentation Image Classification

Overparameterisation and worst-case generalisation: friend or foe?

no code implementations ICLR 2021 Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples.

Structured Prediction

Modifying Memories in Transformer Models

no code implementations1 Dec 2020 Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar

In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}.

Memorization

Long-tail learning via logit adjustment

3 code implementations ICLR 2021 Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv Kumar

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples.
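
The paper's post-hoc variant has a particularly simple form: subtract scaled log class priors from the logits before taking the argmax. The sketch below follows that general recipe; the prior estimates, scaling tau, and numbers are illustrative.

```python
import numpy as np

def logit_adjusted_prediction(logits, class_priors, tau=1.0):
    # Post-hoc logit adjustment: penalise head classes by their log prior
    # so that tail classes are not systematically under-predicted.
    adjusted = logits - tau * np.log(class_priors)
    return int(np.argmax(adjusted))

# Illustrative usage with a skewed prior over four classes.
priors = np.array([0.7, 0.2, 0.07, 0.03])
logits = np.array([2.0, 1.8, 1.9, 1.7])
print(logit_adjusted_prediction(logits, priors))
```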

Long-tail Learning

Adversarial robustness via robust low rank representations

no code implementations NeurIPS 2020 Pranjal Awasthi, Himanshu Jain, Ankit Singh Rawat, Aravindan Vijayaraghavan

Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time.

Adversarial Robustness

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

no code implementations NeurIPS 2020 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.

Why distillation helps: a statistical perspective

no code implementations21 May 2020 Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques.

Knowledge Distillation Retrieval

Can gradient clipping mitigate label noise?

1 code implementation ICLR 2020 Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum.
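
For concreteness, a minimal sketch of the gradient-norm clipping referred to above (the threshold value is illustrative); the paper's question is whether this operation also confers robustness to label noise.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # Rescale the gradient so its L2 norm is at most max_norm, which bounds
    # the size of each optimisation step.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```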

Doubly-stochastic mining for heterogeneous retrieval

no code implementations23 Apr 2020 Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar

Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge.

Retrieval Stochastic Optimization

Federated Learning with Only Positive Labels

1 code implementation ICML 2020 Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class.

Federated Learning Multi-class Classification

Robust Large-Margin Learning in Hyperbolic Space

no code implementations NeurIPS 2020 Melanie Weber, Manzil Zaheer, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar

In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space.

Representation Learning

Reliable Distributed Clustering with Redundant Data Assignment

no code implementations20 Feb 2020 Venkata Gandikota, Arya Mazumdar, Ankit Singh Rawat

In this paper, we present distributed generalized clustering algorithms that can handle large scale data across multiple machines in spite of straggling or unreliable machines.

Clustering Dimensionality Reduction

Low-Rank Bottleneck in Multi-head Attention Models

no code implementations ICML 2020 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

The attention-based Transformer architecture has enabled significant advances in the field of natural language processing.

Are Transformers universal approximators of sequence-to-sequence functions?

no code implementations ICLR 2020 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the extent of parameter sharing in these models.

Multilabel reductions: what is my loss optimising?

no code implementations NeurIPS 2019 Aditya K. Menon, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging.

General Classification Information Retrieval +1

Concise Multi-head Attention Models

no code implementations25 Sep 2019 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

The attention-based Transformer architecture has enabled significant advances in the field of natural language processing.

Learning Network Parameters in the ReLU Model

no code implementations NeurIPS Workshop Deep_Invers 2019 Arya Mazumdar, Ankit Singh Rawat

Rectified linear units, or ReLUs, have become a preferred activation function for artificial neural networks.

Sampled Softmax with Random Fourier Features

no code implementations NeurIPS 2019 Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar

For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method.
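
A minimal sketch of the sampled softmax estimator described above, using uniform negative sampling with the standard log-correction of the logits; the paper's contribution is a better sampling distribution built from Random Fourier Features, which this sketch does not implement.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_softmax_loss(logits, true_class, num_samples=5):
    # Approximate the full softmax cross-entropy with the true class plus a
    # few uniformly sampled negatives, correcting each logit by the log of
    # its sampling probability.
    num_classes = len(logits)
    candidates = np.setdiff1d(np.arange(num_classes), [true_class])
    negatives = rng.choice(candidates, size=num_samples, replace=False)
    classes = np.concatenate(([true_class], negatives))
    q = np.full(len(classes), num_samples / num_classes)  # uniform proposal probability
    q[0] = 1.0                                            # true class is always included
    corrected = logits[classes] - np.log(q)
    corrected = corrected - corrected.max()
    log_probs = corrected - np.log(np.exp(corrected).sum())
    return -float(log_probs[0])
```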

Robust Gradient Descent via Moment Encoding with LDPC Codes

no code implementations22 May 2018 Raj Kumar Maity, Ankit Singh Rawat, Arya Mazumdar

We, instead, propose to encode the second moment of the data with a low-density parity-check (LDPC) code.

Distributed Computing

Representation Learning and Recovery in the ReLU Model

no code implementations12 Mar 2018 Arya Mazumdar, Ankit Singh Rawat

Given a set of observation vectors $\mathbf{y}^i \in \mathbb{R}^d, i =1, 2, \dots , n$, we aim to recover $d\times k$ matrix $A$ and the latent vectors $\{\mathbf{c}^i\} \subset \mathbb{R}^k$ under the model $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i +\mathbf{b})$, where $\mathbf{b}\in \mathbb{R}^d$ is a random bias.
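
For intuition, a minimal sketch of the forward direction of this model, i.e. generating observations $\mathbf{y}^i = \mathrm{ReLU}(A\mathbf{c}^i + \mathbf{b})$; the dimensions and Gaussian draws below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 20, 5, 100
A = rng.standard_normal((d, k))            # unknown d x k matrix to be recovered
b = rng.standard_normal(d)                 # random bias vector
C = rng.standard_normal((k, n))            # latent vectors c^i stacked as columns
Y = np.maximum(A @ C + b[:, None], 0.0)    # observations y^i = ReLU(A c^i + b)
```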

Dictionary Learning Representation Learning

Lifting high-dimensional nonlinear models with Gaussian regressors

no code implementations11 Dec 2017 Christos Thrampoulidis, Ankit Singh Rawat

Unfortunately, both least-squares and the Lasso fail to recover $\mathbf{x}_0$ when $\mu_\ell=0$.

Vocal Bursts Intensity Prediction

Associative Memory using Dictionary Learning and Expander Decoding

no code implementations29 Nov 2016 Arya Mazumdar, Ankit Singh Rawat

Designing an associative memory requires addressing two main tasks: 1) learning phase: given a dataset, learn a concise representation of the dataset in the form of a graphical model (or a neural network), 2) recall phase: given a noisy version of a message vector from the dataset, output the correct message vector via a neurally feasible algorithm over the network learnt during the learning phase.

Dictionary Learning

Associative Memory via a Sparse Recovery Model

no code implementations NeurIPS 2015 Arya Mazumdar, Ankit Singh Rawat

An associative memory is a structure learned from a dataset $\mathcal{M}$ of vectors (signals) in a way such that, given a noisy version of one of the vectors as input, the nearest valid vector from $\mathcal{M}$ (nearest neighbor) is provided as output, preferably via a fast iterative algorithm.
