Search Results for author: Srinadh Bhojanapalli

Found 42 papers, 10 papers with code

Efficient Language Model Architectures for Differentially Private Federated Learning

no code implementations 12 Mar 2024 Jae Hun Ro, Srinadh Bhojanapalli, Zheng Xu, Yanxiang Zhang, Ananda Theertha Suresh

Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data leaving the devices.

Computational Efficiency Federated Learning +1
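
For orientation, a minimal sketch of the federated-averaging step that underlies cross-device FL, in NumPy. This is a generic illustration, not the private, communication-efficient architectures studied in the paper; the function and variable names are ours.

    import numpy as np

    def fedavg(global_params, client_params, client_sizes):
        """Weighted average of client model parameters (one federated round).

        global_params : list of np.ndarray, current server model
        client_params : list of lists of np.ndarray, one entry per client
        client_sizes  : number of local examples per client (averaging weights)
        """
        total = float(sum(client_sizes))
        new_params = []
        for layer_idx in range(len(global_params)):
            layer = sum(
                (n / total) * client_params[c][layer_idx]
                for c, n in enumerate(client_sizes)
            )
            new_params.append(layer)
        return new_params

    # Toy round with 3 clients and a single weight matrix.
    rng = np.random.default_rng(0)
    server = [rng.normal(size=(4, 4))]
    clients = [[server[0] + 0.01 * rng.normal(size=(4, 4))] for _ in range(3)]
    server = fedavg(server, clients, client_sizes=[100, 50, 25])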

HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

no code implementations 14 Feb 2024 Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memory-bound where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache.
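
The snippet only states the memory-bound motivation; as a hedged illustration of the general approximate-then-exact top-k idea, the sketch below scores all items cheaply with low-rank factors, shortlists candidates, and rescores only the shortlist exactly. The rank and shortlist sizes are illustrative, not HiRE's.

    import numpy as np

    def approx_then_exact_topk(W, x, k=10, rank=8, shortlist=64):
        """Cheap low-rank scoring to build a shortlist, then exact rescoring on it only.

        W : (num_items, dim) scoring matrix, x : (dim,) query.
        """
        # Offline: keep rank-`rank` SVD factors of W so online scoring is O(num_items * rank).
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        z = s[:rank] * (Vt[:rank] @ x)                 # small rank-dim intermediate
        approx_scores = U[:, :rank] @ z                # cheap approximate scores for every item
        cand = np.argpartition(-approx_scores, shortlist)[:shortlist]
        exact = W[cand] @ x                            # exact scores on the shortlist only
        return cand[np.argsort(-exact)[:k]]

    rng = np.random.default_rng(0)
    W, x = rng.normal(size=(10000, 256)), rng.normal(size=256)
    print(approx_then_exact_topk(W, x))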

Dual-Encoders for Extreme Multi-Label Classification

1 code implementation 16 Oct 2023 Nilesh Gupta, Devvrit Khatri, Ankit S Rawat, Srinadh Bhojanapalli, Prateek Jain, Inderjit Dhillon

We propose the decoupled softmax loss, a simple modification to the InfoNCE loss that overcomes the limitations of existing contrastive losses.

Classification Extreme Multi-Label Classification +2
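
For context, the standard in-batch InfoNCE loss that the paper modifies, as a minimal NumPy sketch; the decoupled softmax variant itself is not reproduced here.

    import numpy as np

    def in_batch_infonce(query_emb, doc_emb, temperature=0.05):
        """Standard InfoNCE with in-batch negatives for a dual encoder.

        query_emb, doc_emb : (batch, dim) L2-normalized embeddings;
        row i of doc_emb is the positive for row i of query_emb.
        """
        logits = query_emb @ doc_emb.T / temperature          # (batch, batch) similarity matrix
        logits -= logits.max(axis=1, keepdims=True)           # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))                   # positives sit on the diagonal

    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 32)); q /= np.linalg.norm(q, axis=1, keepdims=True)
    d = rng.normal(size=(8, 32)); d /= np.linalg.norm(d, axis=1, keepdims=True)
    print(in_batch_infonce(q, d))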

Functional Interpolation for Relative Positions Improves Long Context Transformers

no code implementations 6 Oct 2023 Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli

Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models.

Language Modelling Position
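
FIRE's exact functional form is not given in the snippet; the sketch below illustrates the general pattern of adding a learned function of relative position to the attention logits, using a tiny random-weight MLP over a log-scaled distance feature as a stand-in.

    import numpy as np

    def relative_position_bias(seq_len, hidden=16, rng=None):
        """Bias[i, j] = mlp(phi(i - j)): a learned function of relative position
        added to attention logits (functional form here is illustrative, not FIRE's)."""
        if rng is None:
            rng = np.random.default_rng(0)
        W1 = rng.normal(size=(1, hidden)); W2 = rng.normal(size=(hidden, 1))
        i, j = np.meshgrid(np.arange(seq_len), np.arange(seq_len), indexing="ij")
        rel = (i - j).astype(float)[..., None]                # (L, L, 1) relative distance
        feat = np.log1p(np.abs(rel)) * np.sign(rel)           # a length-robust monotone transform
        return (np.maximum(feat @ W1, 0.0) @ W2)[..., 0]      # (L, L) bias matrix

    def attention_with_bias(Q, K, V, bias):
        logits = Q @ K.T / np.sqrt(Q.shape[-1]) + bias        # positional bias enters the logits
        logits -= logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits); probs /= probs.sum(axis=-1, keepdims=True)
        return probs @ V

    rng = np.random.default_rng(1)
    L, d = 10, 8
    Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
    out = attention_with_bias(Q, K, V, relative_position_bias(L))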

Depth Dependence of $\mu$P Learning Rates in ReLU MLPs

no code implementations 13 May 2023 Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization.
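
A minimal sketch of the object studied: a width-n, depth-L fully connected ReLU network with mean-field-style (1/fan-in-scaled) initialization. The scaling constants are an assumption for illustration; the note's learning-rate analysis is not reproduced.

    import numpy as np

    def mean_field_relu_mlp(x, width=256, depth=8, rng=None):
        """Forward pass of a depth-L, width-n ReLU MLP with weights drawn i.i.d.
        N(0, c / fan_in), a mean-field-style initialization (constants chosen for illustration)."""
        if rng is None:
            rng = np.random.default_rng(0)
        h, fan_in = x, x.shape[-1]
        for _ in range(depth):
            W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, width))
            h = np.maximum(h @ W, 0.0)
            fan_in = width
        W_out = rng.normal(scale=np.sqrt(1.0 / fan_in), size=(fan_in, 1))
        return h @ W_out

    x = np.random.default_rng(1).normal(size=(4, 32))
    print(mean_field_relu_mlp(x).ravel())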

On student-teacher deviations in distillation: does it pay to disobey?

no code implementations NeurIPS 2023 Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network.

Knowledge Distillation
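
For reference, the standard temperature-scaled distillation objective that trains a student to mimic the teacher's soft probabilities (a minimal NumPy sketch; the paper's analysis of student-teacher deviations is not reproduced).

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Standard KD objective: cross-entropy with the hard labels plus
        KL(teacher_T || student_T) on temperature-softened probabilities."""
        p_t = softmax(teacher_logits / T)
        log_p_s = np.log(softmax(student_logits / T))
        kd = np.mean(np.sum(p_t * (np.log(p_t) - log_p_s), axis=1)) * T * T
        ce = -np.mean(np.log(softmax(student_logits)[np.arange(len(labels)), labels]))
        return alpha * kd + (1 - alpha) * ce

    rng = np.random.default_rng(0)
    s, t = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))
    y = rng.integers(0, 10, size=8)
    print(distillation_loss(s, t, y))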

On the Adversarial Robustness of Mixture of Experts

no code implementations 19 Oct 2022 Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, Srinadh Bhojanapalli

We next empirically evaluate the robustness of MoEs on ImageNet using adversarial attacks and show they are indeed more robust than dense models with the same computational cost.

Adversarial Robustness Open-Ended Question Answering
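
As a minimal, hedged illustration of an adversarial attack, the sketch below runs a single FGSM step against a linear softmax classifier; the paper's ImageNet-scale attacks on MoEs are, of course, much stronger.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def fgsm_linear(x, y, W, b, eps=0.5):
        """One-step FGSM on a linear softmax classifier: perturb x by eps * sign of the
        gradient of the cross-entropy loss with respect to the input."""
        p = softmax(W @ x + b)
        onehot = np.zeros_like(p); onehot[y] = 1.0
        grad_x = W.T @ (p - onehot)          # d(CE)/dx for logits = Wx + b
        return x + eps * np.sign(grad_x)

    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(10, 32)), rng.normal(size=10)
    x = rng.normal(size=32)
    y = int(np.argmax(W @ x + b))            # attack the model's own prediction
    x_adv = fgsm_linear(x, y, W, b)
    print(y, "->", int(np.argmax(W @ x_adv + b)))   # may or may not flip at this eps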

The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

no code implementations 12 Oct 2022 Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

This paper studies a curious phenomenon in machine learning models with Transformer architectures: their activation maps are sparse.
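
A minimal way to measure the phenomenon described above: the fraction of exactly zero entries in the post-ReLU activations of a Transformer-style feed-forward block (random weights here, so the measured level is only illustrative).

    import numpy as np

    def mlp_block_sparsity(x, d_ff=1024, rng=None):
        """Fraction of exactly-zero entries in the post-ReLU hidden activations
        of a Transformer feed-forward block (W_in is random in this sketch)."""
        if rng is None:
            rng = np.random.default_rng(0)
        d_model = x.shape[-1]
        W_in = rng.normal(scale=1.0 / np.sqrt(d_model), size=(d_model, d_ff))
        h = np.maximum(x @ W_in, 0.0)                # post-ReLU activation map
        return float((h == 0).mean())

    tokens = np.random.default_rng(1).normal(size=(128, 256))   # (seq_len, d_model)
    print("fraction of zero activations:", mlp_block_sparsity(tokens))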

Treeformer: Dense Gradient Trees for Efficient Attention Computation

no code implementations 18 Aug 2022 Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain

Based on such hierarchical navigation, we design Treeformer which can use one of two efficient attention layers -- TF-Attention and TC-Attention.

Retrieval

Robust Training of Neural Networks Using Scale Invariant Architectures

no code implementations 2 Feb 2022 Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models.

Leveraging redundancy in attention with Reuse Transformers

1 code implementation 13 Oct 2021 Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar

Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision.
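
A minimal implementation of the pairwise dot-product attention described above; the comment marks the input-dependent score matrix that, per the title, a Reuse-style layer would share across layers (the reuse mechanism itself is not shown).

    import numpy as np

    def dot_product_attention(Q, K, V):
        """Pairwise dot-product attention: every token attends to every other token
        through an input-dependent (seq_len x seq_len) score matrix."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])       # the matrix a Reuse-style layer could share
        scores -= scores.max(axis=-1, keepdims=True)
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        return probs @ V, probs

    rng = np.random.default_rng(0)
    L, d = 6, 16
    Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
    out, attn = dot_product_attention(Q, K, V)
    print(attn.shape)   # (6, 6): quadratic in sequence length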

Teacher's pet: understanding and mitigating biases in distillation

no code implementations 19 Jun 2021 Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model.

Image Classification Knowledge Distillation

Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation

no code implementations 16 Jun 2021 Srinadh Bhojanapalli, Ayan Chakrabarti, Himanshu Jain, Sanjiv Kumar, Michal Lukasik, Andreas Veit

State-of-the-art transformer models use pairwise dot-product based self-attention, which comes at a computational cost quadratic in the input sequence length.

A Simple and Effective Positional Encoding for Transformers

no code implementations EMNLP 2021 Pu-Chin Chen, Henry Tsai, Srinadh Bhojanapalli, Hyung Won Chung, Yin-Wen Chang, Chun-Sung Ferng

Our analysis shows that the gain actually comes from moving positional information from the input to the attention layer.

Position

Understanding Robustness of Transformers for Image Classification

no code implementations ICCV 2021 Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit

We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations.

Classification General Classification +1

On the Reproducibility of Neural Network Predictions

no code implementations 5 Feb 2021 Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, Sanjiv Kumar

By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction.

Data Augmentation Image Classification
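
Churn, as commonly defined in this line of work, is the fraction of examples on which two models disagree; a minimal sketch:

    import numpy as np

    def prediction_churn(logits_a, logits_b):
        """Fraction of examples on which two models' argmax predictions disagree."""
        return float(np.mean(np.argmax(logits_a, axis=1) != np.argmax(logits_b, axis=1)))

    rng = np.random.default_rng(0)
    run1 = rng.normal(size=(1000, 10))
    run2 = run1 + 0.5 * rng.normal(size=(1000, 10))   # e.g. a retrained model with a new seed
    print("churn:", prediction_churn(run1, run2))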

Modifying Memories in Transformer Models

no code implementations 1 Dec 2020 Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar

In this paper, we propose a new task of explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts.

Memorization

An efficient nonconvex reformulation of stagewise convex optimization problems

no code implementations NeurIPS 2020 Rudy Bunel, Oliver Hinder, Srinadh Bhojanapalli, Krishnamurthy Dvijotham

We establish theoretical properties of the nonconvex formulation, showing that it is (almost) free of spurious local minima and has the same global optimum as the convex problem.

Coping with Label Shift via Distributionally Robust Optimisation

1 code implementation ICLR 2021 Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra

The label shift problem refers to the supervised learning setting where the train and test label distributions do not match.
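
For context, the classical importance-weighting baseline for label shift, which reweights training examples by p_test(y)/p_train(y); the paper's distributionally robust approach is not reproduced, and the target label distribution is assumed known here.

    import numpy as np

    def label_shift_weights(train_labels, target_label_dist):
        """Per-example weights p_test(y) / p_train(y) for the classical
        importance-weighting fix to label shift (target distribution assumed known)."""
        classes, counts = np.unique(train_labels, return_counts=True)
        p_train = counts / counts.sum()
        ratio = {c: target_label_dist[c] / p_train[k] for k, c in enumerate(classes)}
        return np.array([ratio[y] for y in train_labels])

    y_train = np.array([0] * 80 + [1] * 20)              # train labels: 80/20 split
    w = label_shift_weights(y_train, target_label_dist={0: 0.5, 1: 0.5})
    print(w[:3], w[-3:])                                 # the minority class gets upweighted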

Semantic Label Smoothing for Sequence to Sequence Problems

no code implementations EMNLP 2020 Michal Lukasik, Himanshu Jain, Aditya Krishna Menon, Seungyeon Kim, Srinadh Bhojanapalli, Felix Yu, Sanjiv Kumar

Label smoothing has been shown to be an effective regularization strategy in classification that prevents overfitting and helps with label de-noising.

Machine Translation Translation

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

no code implementations NeurIPS 2020 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.
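
The paper's specific sparsity patterns are not spelled out in the snippet; as a hedged illustration of what $O(n)$ connections can look like, the sketch builds a boolean attention mask from a sliding window plus a few global tokens.

    import numpy as np

    def sparse_attention_mask(seq_len, window=2, n_global=1):
        """Boolean (seq_len x seq_len) mask with O(seq_len) True entries:
        each token attends to a local window plus a few global tokens."""
        i, j = np.meshgrid(np.arange(seq_len), np.arange(seq_len), indexing="ij")
        local = np.abs(i - j) <= window                  # sliding-window connections
        global_cols = j < n_global                       # everyone attends to global tokens
        global_rows = i < n_global                       # global tokens attend to everyone
        return local | global_cols | global_rows

    mask = sparse_attention_mask(seq_len=12)
    print(mask.sum(), "connections instead of", 12 * 12)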

Does label smoothing mitigate label noise?

no code implementations ICML 2020 Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors.

Learning with noisy labels
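
The mixing described above, as a minimal sketch:

    import numpy as np

    def smooth_labels(labels, num_classes, alpha=0.1):
        """Mix one-hot targets with the uniform distribution:
        (1 - alpha) * one_hot + alpha / num_classes."""
        one_hot = np.eye(num_classes)[labels]
        return (1.0 - alpha) * one_hot + alpha / num_classes

    print(smooth_labels(np.array([2, 0]), num_classes=4, alpha=0.1))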

Low-Rank Bottleneck in Multi-head Attention Models

no code implementations ICML 2020 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Attention-based Transformer architecture has enabled significant advances in the field of natural language processing.

Are Transformers universal approximators of sequence-to-sequence functions?

no code implementations ICLR 2020 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of parameter sharing in these models.
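
A quick numerical check of the property the result builds on: self-attention without positional encodings is permutation equivariant, so permuting the input tokens permutes the outputs the same way.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        S -= S.max(axis=-1, keepdims=True)
        P = np.exp(S); P /= P.sum(axis=-1, keepdims=True)
        return P @ V

    rng = np.random.default_rng(0)
    L, d = 7, 16
    X = rng.normal(size=(L, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    perm = rng.permutation(L)
    out_then_perm = self_attention(X, Wq, Wk, Wv)[perm]
    perm_then_out = self_attention(X[perm], Wq, Wk, Wv)
    print(np.allclose(out_then_perm, perm_then_out))      # True: permutation equivariant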

Concise Multi-head Attention Models

no code implementations 25 Sep 2019 Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

Attention-based Transformer architecture has enabled significant advances in the field of natural language processing.

The role of over-parametrization in generalization of neural networks

1 code implementation ICLR 2019 Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann Lecun, Nathan Srebro

Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization.

Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

2 code implementations 30 May 2018 Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann Lecun, Nathan Srebro

Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization.

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

no code implementations ICLR 2018 Behnam Neyshabur, Srinadh Bhojanapalli, Nathan Srebro

We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights.
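
A sketch of the core capacity quantity behind such bounds: the product of layer spectral norms times a sum of squared Frobenius-to-spectral ratios. Constants and the margin term are omitted, and the exact expression should be read as an assumption rather than the paper's bound verbatim.

    import numpy as np

    def spectral_complexity(weights):
        """Product of spectral norms times sqrt(sum of ||W_i||_F^2 / ||W_i||_2^2),
        the core quantity in spectrally-normalized margin bounds (constants omitted)."""
        spec = [np.linalg.norm(W, ord=2) for W in weights]      # largest singular values
        frob = [np.linalg.norm(W) for W in weights]             # Frobenius norms
        ratio_sum = sum((f / s) ** 2 for f, s in zip(frob, spec))
        return np.prod(spec) * np.sqrt(ratio_sum)

    rng = np.random.default_rng(0)
    net = [rng.normal(size=(64, 64)) / np.sqrt(64) for _ in range(3)]
    print(spectral_complexity(net))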

Exploring Generalization in Deep Learning

2 code implementations NeurIPS 2017 Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro

With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness.

Implicit Regularization in Matrix Factorization

no code implementations NeurIPS 2017 Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$.
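
A minimal instance of the setting described above: gradient descent on a factorization $X = UU^T$ of an underdetermined quadratic objective, from a small initialization. The measurement operator, step size, and iteration count are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, r_true = 20, 60, 2                              # m << d*(d+1)/2, so the problem is underdetermined
    B = rng.normal(size=(d, r_true))
    X_star = B @ B.T                                      # a planted low-rank PSD matrix
    A = rng.normal(size=(m, d, d))                        # random measurement matrices
    y = np.einsum("ijk,jk->i", A, X_star)                 # y_i = <A_i, X*>

    U = 0.01 * rng.normal(size=(d, d))                    # full-rank factor, small initialization
    lr = 0.001
    for _ in range(5000):
        X = U @ U.T
        resid = np.einsum("ijk,jk->i", A, X) - y
        grad_X = np.einsum("i,ijk->jk", resid, A) / m     # gradient of 0.5/m * ||A(X) - y||^2 w.r.t. X
        U -= lr * (grad_X + grad_X.T) @ U                 # chain rule through the factorization X = U U^T
        # (the paper studies which of the many global minima this procedure converges to)

    print("measurement residual:", np.linalg.norm(np.einsum("ijk,jk->i", A, U @ U.T) - y))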

Stabilizing GAN Training with Multiple Random Projections

2 code implementations ICLR 2018 Behnam Neyshabur, Srinadh Bhojanapalli, Ayan Chakrabarti

Training generative adversarial networks is unstable in high-dimensions as the true data distribution tends to be concentrated in a small fraction of the ambient space.

Single Pass PCA of Matrix Products

1 code implementation NeurIPS 2016 Shanshan Wu, Srinadh Bhojanapalli, Sujay Sanghavi, Alexandros G. Dimakis

In this paper we present a new algorithm for computing a low rank approximation of the product $A^TB$ by taking only a single pass over the two matrices $A$ and $B$.

Global Optimality of Local Search for Low Rank Matrix Recovery

no code implementations NeurIPS 2016 Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements.

Dropping Convexity for Faster Semi-definite Optimization

no code implementations 14 Sep 2015 Srinadh Bhojanapalli, Anastasios Kyrillidis, Sujay Sanghavi

To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.

A New Sampling Technique for Tensors

no code implementations 17 Feb 2015 Srinadh Bhojanapalli, Sujay Sanghavi

In this paper we propose new techniques to sample arbitrary third-order tensors, with an objective of speeding up tensor algorithms that have recently gained popularity in machine learning.

Tighter Low-rank Approximation via Sampling the Leveraged Element

1 code implementation 14 Oct 2014 Srinadh Bhojanapalli, Prateek Jain, Sujay Sanghavi

The first is a new method to directly compute a low-rank approximation (in efficient factored form) to the product of two given matrices; it computes a small random set of entries of the product, and then executes weighted alternating minimization (as before) on these.

Universal Matrix Completion

no code implementations 10 Feb 2014 Srinadh Bhojanapalli, Prateek Jain

The problem of low-rank matrix completion has recently generated a lot of interest leading to several results that offer exact solutions to the problem.

Low-Rank Matrix Completion

Completing Any Low-rank Matrix, Provably

no code implementations 12 Jun 2013 Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward

Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint, known as incoherence, on its row and column spaces.

Matrix Completion
