no code implementations • 15 Apr 2025 • Michal Lukasik, Lin Chen, Harikrishna Narasimhan, Aditya Krishna Menon, Wittawat Jitkrittum, Felix X. Yu, Sashank J. Reddi, Gang Fu, Mohammadhossein Bateni, Sanjiv Kumar
Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label.
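As a reference point for the AUC objective mentioned in this entry, here is a minimal sketch (not the paper's method) of computing the area under the ROC curve as the fraction of correctly ordered positive-negative pairs:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly
    (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-vs-negative score gaps
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Toy usage: perfect separation gives AUC = 1.0.
print(auc(np.array([0.9, 0.8, 0.3, 0.1]), np.array([1, 1, 0, 0])))
```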
no code implementations • 13 Mar 2025 • Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li
We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex optimization.
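For readers unfamiliar with the structured preconditioners referenced here, the diagonal case reduces to an AdaGrad-style update; the sketch below illustrates only that special case (layerwise and Kronecker-factored variants replace the per-coordinate accumulator with coarser or matrix-valued statistics) and is not the paper's unified analysis:

```python
import numpy as np

def diagonal_preconditioned_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One update with a diagonal preconditioner: accumulate squared gradients
    and rescale each coordinate by the inverse square root of the accumulator."""
    accum = accum + grad ** 2                    # diagonal second-moment statistics
    w = w - lr * grad / (np.sqrt(accum) + eps)   # preconditioned gradient step
    return w, accum
```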
no code implementations • 24 Feb 2025 • Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi
Large language models have shown remarkable reasoning abilities, and scaling laws suggest that a large parameter count, especially along the depth axis, is the primary driver.
no code implementations • 12 Feb 2025 • Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Zifeng Wang, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar
Large language models' significant advances in capabilities are accompanied by significant increases in inference costs.
no code implementations • 4 Feb 2025 • Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar
We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model.
no code implementations • 24 Dec 2024 • Kanchana Ranasinghe, Sadeep Jayasumana, Andreas Veit, Ayan Chakrabarti, Daniel Glasner, Michael S Ryoo, Srikumar Ramalingam, Sanjiv Kumar
Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images; however, the latency incurred by multiple costly inference iterations can restrict their applicability.
no code implementations • 12 Nov 2024 • Gaurav Menghani, Ravi Kumar, Sanjiv Kumar
One of the core pillars of efficient deep learning methods is architectural improvements such as the residual/skip connection, which has led to significantly better model convergence and quality.
no code implementations • 29 Oct 2024 • Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers' abilities to simulate learning algorithms like gradient descent.
no code implementations • 27 Oct 2024 • Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S. Dhillon, Cho-Jui Hsieh, Sanjiv Kumar
Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements.
no code implementations • 24 Oct 2024 • Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar
In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples.
no code implementations • 21 Oct 2024 • Giulia Desalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar
We present a novel soft prompt based framework, SoftSRV, that leverages a frozen pre-trained large language model (LLM) to generate targeted synthetic text sequences.
no code implementations • 14 Oct 2024 • Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli
Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks because their state size is constant with respect to their input sequence length.
no code implementations • 10 Oct 2024 • Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
To our knowledge, this is the first theoretical analysis for multi-layer Transformer in this setting.
no code implementations • 27 Sep 2024 • Nikunj Saunshi, Stefani Karp, Shankar Krishnan, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
These findings of training efficiency and inductive bias towards reasoning are verified at 1B, 2B and 8B parameter language models.
1 code implementation • 22 Jul 2024 • Yuchen Li, Alexandre Kirchmeyer, Aashay Mehta, Yilong Qin, Boris Dadachev, Kishore Papineni, Sanjiv Kumar, Andrej Risteski
While alternate classes of models have been explored, we have limited mathematical understanding of their fundamental power and limitations.
no code implementations • 25 Jun 2024 • Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
no code implementations • 4 Jun 2024 • Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)".
no code implementations • 29 May 2024 • Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar
Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for "hard" inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode.
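To make the cascade mechanism concrete, here is a minimal sketch of a confidence-based deferral rule; `small_model` and `large_model` are hypothetical callables that return a (prediction, confidence) pair, and the threshold is illustrative:

```python
def cascade_predict(x, small_model, large_model, threshold=0.8):
    """Two-model cascade: answer with the small model when it is confident,
    otherwise defer the input to the large model."""
    pred, confidence = small_model(x)
    if confidence >= threshold:
        return pred                 # cheap path for "easy" inputs
    pred, _ = large_model(x)        # deferral path for "hard" inputs
    return pred
```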
no code implementations • 15 Apr 2024 • Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar
While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks.
no code implementations • 14 Apr 2024 • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton
Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation.
no code implementations • NeurIPS 2023 • Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, Sanjiv Kumar
This paper introduces SOAR: Spilling with Orthogonality-Amplified Residuals, a novel data indexing technique for approximate nearest neighbor (ANN) search.
no code implementations • 7 Mar 2024 • Michal Lukasik, Harikrishna Narasimhan, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks.
no code implementations • 14 Feb 2024 • Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memory-bound where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache.
no code implementations • 13 Feb 2024 • Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance.
no code implementations • 8 Feb 2024 • Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork (e.g., depth-wise, width-wise) of the network at each step, progressively increasing the size in stages.
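A toy illustration of the depth-wise subnetwork sampling idea described here (assuming layers are interchangeable indices and the stage schedule is supplied externally; this is not the paper's full RAPTR procedure):

```python
import numpy as np

def sample_depthwise_subnetwork(num_layers, stage_fraction, rng=None):
    """Pick a random subset of layer indices to train at this step; the subset
    size grows with the training stage."""
    rng = rng or np.random.default_rng()
    k = max(1, int(round(stage_fraction * num_layers)))
    return sorted(rng.choice(num_layers, size=k, replace=False))

# e.g. an early stage trains ~25% of layers, later stages train more.
print(sample_depthwise_subnetwork(num_layers=24, stage_fraction=0.25))
```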
no code implementations • 24 Jan 2024 • Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia Desalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to standard SC loss.
no code implementations • 17 Dec 2023 • Srikumar Ramalingam, Pranjal Awasthi, Sanjiv Kumar
The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs.
no code implementations • 15 Dec 2023 • Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar
Answering complex natural language questions often necessitates multi-step reasoning and integrating external information.
Ranked #1 on Question Answering on Bamboogle
3 code implementations • CVPR 2024 • Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, Sanjiv Kumar
It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient.
no code implementations • 13 Oct 2023 • Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar
Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a \emph{trade-off}.
no code implementations • 12 Oct 2023 • Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal
Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
no code implementations • 9 Oct 2023 • Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar
The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels.
no code implementations • 6 Oct 2023 • Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli
Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models.
no code implementations • 3 Oct 2023 • Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan
Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token.
no code implementations • CVPR 2024 • Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar
Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts.
no code implementations • NeurIPS 2023 • Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sanjiv Kumar
Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn.
no code implementations • 13 May 2023 • Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar
In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization.
no code implementations • NeurIPS 2023 • Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar
Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network.
no code implementations • 29 Jan 2023 • Harikrishna Narasimhan, Aditya Krishna Menon, Wittawat Jitkrittum, Sanjiv Kumar
Recent work on selective classification with OOD detection (SCOD) has argued for the unified study of these problems; however, the formal underpinnings of this problem are still nascent, and existing techniques are heuristic in nature.
no code implementations • 28 Jan 2023 • Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar
Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps.
no code implementations • 28 Jan 2023 • Gui Citovsky, Giulia Desalvo, Sanjiv Kumar, Srikumar Ramalingam, Afshin Rostamizadeh, Yunjuan Wang
In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples is selected.
no code implementations • 27 Jan 2023 • Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, Sanjiv Kumar
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
no code implementations • 4 Jan 2023 • Philip Sun, Ruiqi Guo, Sanjiv Kumar
The approximate nearest neighbor (ANN) search problem is fundamental to efficiently serving many real-world machine learning applications.
no code implementations • 9 Nov 2022 • Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar
By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge.
no code implementations • 1 Nov 2022 • Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix Yu, Cho-Jui Hsieh, Inderjit S Dhillon, Sanjiv Kumar
Pretrained large language models (LLMs) are general purpose problem solvers applicable to a diverse set of tasks with prompts.
no code implementations • 28 Oct 2022 • Arslan Chaudhry, Aditya Krishna Menon, Andreas Veit, Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar
Towards this, we study two questions: (1) how does the Mixup loss that enforces linearity in the \emph{last} network layer propagate the linearity to the \emph{earlier} layers?
no code implementations • 12 Oct 2022 • Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar
This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse.
no code implementations • 11 Oct 2022 • Zonglin Li, Ruiqi Guo, Sanjiv Kumar
Language models can be augmented with a context retriever to incorporate knowledge from large external databases.
no code implementations • 14 Aug 2022 • Manzil Zaheer, Ankit Singh Rawat, Seungyeon Kim, Chong You, Himanshu Jain, Andreas Veit, Rob Fergus, Sanjiv Kumar
In this paper, we propose the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pretrained generative models, while obviating the need to go through a large volume of data.
no code implementations • 28 Jun 2022 • Felix Chern, Blake Hechtman, Andy Davis, Ruiqi Guo, David Majnemer, Sanjiv Kumar
This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall.
no code implementations • 27 Apr 2022 • Wittawat Jitkrittum, Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar
Long-tail learning is the problem of learning under skewed label distributions, which pose a challenge for standard learners.
no code implementations • 15 Feb 2022 • Taman Narayan, Heinrich Jiang, Sen Zhao, Sanjiv Kumar
Much effort has been devoted to making large and more accurate models, but relatively little has been put into understanding which examples are benefiting from the added complexity.
no code implementations • 2 Feb 2022 • Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models.
2 code implementations • NeurIPS 2021 • Erik Lindgren, Sashank Reddi, Ruiqi Guo, Sanjiv Kumar
These models are typically trained by optimizing the model parameters to score relevant "positive" pairs higher than the irrelevant "negative" ones.
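The "positives above negatives" objective mentioned here is commonly realized with in-batch negatives; the sketch below shows that standard formulation (assuming query and document embeddings are aligned row-wise), not this paper's negative-cache method:

```python
import numpy as np

def in_batch_softmax_loss(query_emb, doc_emb):
    """In-batch negatives loss for a dual encoder: row i of query_emb matches
    row i of doc_emb (the positive); all other rows act as negatives."""
    logits = query_emb @ doc_emb.T                       # (batch, batch) score matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives sit on the diagonal
```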
no code implementations • 19 Oct 2021 • Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar
In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples; for the "hard" examples, we fall-back to the teacher.
1 code implementation • 13 Oct 2021 • Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar
Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision.
no code implementations • 29 Sep 2021 • Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed, Sanjiv Kumar
In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples; for the "hard" examples, we fall-back to the teacher.
no code implementations • 29 Sep 2021 • Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar
We investigate the possibility of using the embeddings produced by a lightweight network more effectively with a nonlinear classification layer.
no code implementations • 29 Sep 2021 • Aditya Krishna Menon, Sadeep Jayasumana, Seungyeon Kim, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
Transformer-based models such as BERT have proven successful in information retrieval problems, which seek to identify relevant documents for a given query.
no code implementations • 29 Sep 2021 • Srikumar Ramalingam, Daniel Glasner, Kaushal Patel, Raviteja Vemulapalli, Sadeep Jayasumana, Sanjiv Kumar
Deep learning has yielded extraordinary results in vision and natural language processing, but this achievement comes at a cost.
1 code implementation • NeurIPS 2021 • Gui Citovsky, Giulia Desalvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, Sanjiv Kumar
The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources.
no code implementations • 19 Jun 2021 • Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar
Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model.
no code implementations • 16 Jun 2021 • Srinadh Bhojanapalli, Ayan Chakrabarti, Himanshu Jain, Sanjiv Kumar, Michal Lukasik, Andreas Veit
State-of-the-art transformer models use pairwise dot-product based self-attention, which comes at a computational cost quadratic in the input sequence length.
1 code implementation • 25 May 2021 • Baris Sumengen, Anand Rajagopalan, Gui Citovsky, David Simcha, Olivier Bachem, Pradipta Mitra, Sam Blasiak, Mason Liang, Sanjiv Kumar
Hierarchical Agglomerative Clustering (HAC) is one of the oldest but still most widely used clustering methods.
no code implementations • 19 May 2021 • Seungyeon Kim, Daniel Glasner, Srikumar Ramalingam, Cho-Jui Hsieh, Kishore Papineni, Sanjiv Kumar
It is generally believed that robust training of extremely large networks is critical to their success in real-world applications.
no code implementations • 12 May 2021 • Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar
Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account.
no code implementations • 26 Apr 2021 • Srikumar Ramalingam, Daniel Glasner, Kaushal Patel, Raviteja Vemulapalli, Sadeep Jayasumana, Sanjiv Kumar
Deep learning has yielded extraordinary results in vision and natural language processing, but this achievement comes at a cost.
no code implementations • AISTATS 2021 • Sashank J. Reddi, Rama Kumar Pasumarthi, Aditya Krishna Menon, Ankit Singh Rawat, Felix Yu, Seungyeon Kim, Andreas Veit, Sanjiv Kumar
Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings.
no code implementations • 5 Feb 2021 • Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, Sanjiv Kumar
By analyzing the relationship between churn and prediction confidences, we pursue an approach with two components for churn reduction.
no code implementations • ICLR 2021 • Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar
Overparameterised neural networks have demonstrated the remarkable ability to perfectly fit training samples, while still generalising to unseen test samples.
no code implementations • 8 Dec 2020 • Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar
We propose a kernelized classification layer for deep networks.
no code implementations • 1 Dec 2020 • Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar
In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}.
no code implementations • NeurIPS 2020 • Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar
We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.
1 code implementation • ICLR 2021 • Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra
The label shift problem refers to the supervised learning setting where the train and test label distributions do not match.
no code implementations • EMNLP 2020 • Michal Lukasik, Himanshu Jain, Aditya Krishna Menon, Seungyeon Kim, Srinadh Bhojanapalli, Felix Yu, Sanjiv Kumar
Label smoothing has been shown to be an effective regularization strategy in classification that prevents overfitting and helps with label de-noising.
no code implementations • NeurIPS 2020 • Yuhan Liu, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, Michael Riley
If each user has $m$ samples, we show that straightforward applications of Laplace or Gaussian mechanisms require the number of users to be $\mathcal{O}(k/(m\alpha^2) + k/\epsilon\alpha)$ to achieve an $\ell_1$ distance of $\alpha$ between the true and estimated distributions, with the privacy-induced penalty $k/\epsilon\alpha$ independent of the number of samples per user $m$.
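For context, the "straightforward application of the Laplace mechanism" referenced in the bound above amounts to adding per-bucket Laplace noise to an aggregated histogram; a minimal sketch, ignoring the user-level subtleties the paper addresses:

```python
import numpy as np

def laplace_histogram(counts, epsilon, sensitivity=1.0):
    """Add Laplace(sensitivity / epsilon) noise to each histogram bucket, then
    clip and renormalize to obtain a private distribution estimate."""
    noisy = counts + np.random.laplace(scale=sensitivity / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    return noisy / noisy.sum()
```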
no code implementations • NeurIPS 2020 • Hongge Chen, Si Si, Yang Li, Ciprian Chelba, Sanjiv Kumar, Duane Boning, Cho-Jui Hsieh
With this score, we can identify the pretraining examples in the pretraining task that contribute most to a prediction in the finetuning task.
3 code implementations • ICLR 2021 • Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv Kumar
Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples.
Ranked #53 on Long-tail Learning on ImageNet-LT
no code implementations • NeurIPS 2020 • Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function.
no code implementations • ICLR 2021 • Cheng-Yu Hsieh, Chih-Kuan Yeh, Xuanqing Liu, Pradeep Ravikumar, Seungyeon Kim, Sanjiv Kumar, Cho-Jui Hsieh
In this paper, we establish a novel set of evaluation criteria for such feature based explanations by robustness analysis.
no code implementations • 21 May 2020 • Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar
In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques.
1 code implementation • ICLR 2020 • Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum.
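The gradient clipping operation discussed here is typically implemented as rescaling by the global norm; a minimal sketch:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```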
no code implementations • 23 Apr 2020 • Ankit Singh Rawat, Aditya Krishna Menon, Andreas Veit, Felix Yu, Sashank J. Reddi, Sanjiv Kumar
Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge.
1 code implementation • ICML 2020 • Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar
We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class.
no code implementations • NeurIPS 2020 • Melanie Weber, Manzil Zaheer, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar
In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space.
no code implementations • ICML 2020 • Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar
Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors.
Ranked #13 on Learning with noisy labels on CIFAR-10N-Random3
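The label smoothing rule studied in the entry above mixes one-hot targets with the uniform distribution; a minimal sketch of that mixing step:

```python
import numpy as np

def smooth_labels(labels, num_classes, epsilon=0.1):
    """y_smooth = (1 - epsilon) * one_hot(y) + epsilon / num_classes."""
    onehot = np.eye(num_classes)[labels]
    return (1.0 - epsilon) * onehot + epsilon / num_classes
```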
8 code implementations • ICLR 2021 • Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data.
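For orientation, the baseline that adaptive federated optimizers build on is federated averaging; the sketch below shows one simplified round with a single local SGD step and a hypothetical `client_grads_fn`, whereas adaptive federated methods such as those in this paper replace the plain server average with an adaptive server-side update:

```python
import numpy as np

def federated_averaging_round(global_w, client_grads_fn, client_ids, lr=0.1):
    """One simplified FedAvg round: each client takes a local SGD step on its own
    data and the server averages the resulting updates."""
    deltas = []
    for cid in client_ids:
        local_w = global_w - lr * client_grads_fn(cid, global_w)  # one local step
        deltas.append(local_w - global_w)
    return global_w + np.mean(deltas, axis=0)                     # server average
```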
no code implementations • ICML 2020 • Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
Attention based Transformer architecture has enabled significant advances in the field of natural language processing.
no code implementations • ICLR 2020 • Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar
We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus.
no code implementations • ICLR 2020 • Ruiqi Guo, Quan Geng, David Simcha, Felix Chern, Phil Sun, Sanjiv Kumar
In this work, we focus directly on minimizing error in inner product approximation and derive a new class of quantization loss functions.
no code implementations • ICLR 2020 • Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models.
no code implementations • NeurIPS 2020 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the \emph{de facto} algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models.
no code implementations • NeurIPS 2019 • Chuan Guo, Ali Mousavi, Xiang Wu, Daniel N. Holtmann-Rice, Satyen Kale, Sashank Reddi, Sanjiv Kumar
In extreme classification settings, embedding-based neural network models are currently not competitive with sparse linear and tree-based methods in terms of accuracy.
no code implementations • NeurIPS 2019 • Aditya K. Menon, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar
Multilabel classification is a challenging problem arising in applications ranging from information retrieval to image tagging.
1 code implementation • ICLR 2020 • Yangjun Ruan, Yuanhao Xiong, Sashank Reddi, Sanjiv Kumar, Cho-Jui Hsieh
In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules.
no code implementations • 25 Sep 2019 • Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar
Attention based Transformer architecture has enabled significant advances in the field of natural language processing.
no code implementations • 25 Sep 2019 • Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD across important tasks, such as attention models.
no code implementations • 25 Sep 2019 • Patrick H. Chen, Sashank Reddi, Sanjiv Kumar, Cho-Jui Hsieh
We consider the learning to learn problem, where the goal is to leverage deep learning models to automatically learn (iterative) optimization algorithms for training machine learning models.
no code implementations • 20 Sep 2019 • Aditya Krishna Menon, Anand Rajagopalan, Baris Sumengen, Gui Citovsky, Qin Cao, Sanjiv Kumar
The second algorithm, OHAC, is an online counterpart to offline HAC, which is known to yield a 1/3-approximation to the MW revenue, and produce good quality clusters in practice.
4 code implementations • ICML 2020 • Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar
Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions.
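The sketch below illustrates the flavor of an anisotropic quantization loss: the residual component parallel to the datapoint, which perturbs large inner products most, is weighted more heavily than the orthogonal component. The weight `eta` is an arbitrary illustrative constant, not the paper's derived weighting:

```python
import numpy as np

def anisotropic_quantization_loss(x, x_quantized, eta=4.0):
    """Penalize the residual component parallel to x (which perturbs large inner
    products) eta times more than the orthogonal component."""
    r = x - x_quantized
    r_parallel = (np.dot(r, x) / (np.dot(x, x) + 1e-12)) * x
    r_orth = r - r_parallel
    return eta * np.dot(r_parallel, r_parallel) + np.dot(r_orth, r_orth)
```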
1 code implementation • 20 Aug 2019 • Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar
Motivated by this, differentially private stochastic gradient descent (SGD) algorithms for training machine learning models have been proposed.
no code implementations • NeurIPS 2019 • Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar
For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method.
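A crude sketch of the sampled softmax idea referenced here, using a uniform proposal and a hypothetical `logits_fn` that returns model scores for a given set of class indices; real implementations also exclude the true class from the negatives and correct logits by the log sampling probabilities:

```python
import numpy as np

def sampled_softmax_loss(logits_fn, true_class, num_classes, num_samples=100):
    """Softmax cross-entropy estimated over the true class plus a uniformly
    sampled subset of classes instead of all num_classes labels."""
    negatives = np.random.choice(num_classes, size=num_samples, replace=False)
    classes = np.concatenate(([true_class], negatives))
    logits = logits_fn(classes)              # scores for the sampled classes only
    logits = logits - logits.max()           # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))   # true class at index 0
```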
1 code implementation • 5 Jun 2019 • Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, Cho-Jui Hsieh
In this paper, we propose a new continuous neural network framework called Neural Stochastic Differential Equation (Neural SDE) network, which naturally incorporates various commonly used regularization mechanisms based on random noise injection.
3 code implementations • ICLR 2018 • Sashank J. Reddi, Satyen Kale, Sanjiv Kumar
Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients.
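For reference, the exponential-moving-average scaling these methods share looks like the following single Adam-style step (a generic textbook sketch, not this paper's AMSGrad fix):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Single Adam-style update: the step is scaled by the square root of an
    exponential moving average of squared past gradients."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```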
29 code implementations • ICLR 2020 • Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches.
Ranked #11 on Question Answering on SQuAD1.1 dev (F1 metric)
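A minimal sketch in the spirit of the layerwise adaptation strategy studied above (a LARS/LAMB-style trust ratio applied per layer); the exact published rule additionally folds in Adam moments and weight decay:

```python
import numpy as np

def layerwise_scaled_update(w, update, lr=1e-3):
    """Scale a layer's update by the ratio of the layer's weight norm to the
    update norm, so every layer moves a comparable relative amount."""
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update) + 1e-12
    trust_ratio = w_norm / u_norm if w_norm > 0 else 1.0
    return w - lr * trust_ratio * update
```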
no code implementations • 25 Mar 2019 • Xiang Wu, Ruiqi Guo, Sanjiv Kumar, David Simcha
More specifically, we decompose a residual vector locally into two orthogonal components and perform uniform quantization and multiscale quantization to each component respectively.
no code implementations • 20 Mar 2019 • Xiang Wu, Ruiqi Guo, David Simcha, Dave Dopson, Sanjiv Kumar
In this paper, we propose a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy.
no code implementations • 26 Jan 2019 • Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra
Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood.
1 code implementation • NeurIPS 2018 • Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, Sanjiv Kumar
In this work, we provide a new analysis of such methods applied to nonconvex stochastic optimization problems, characterizing the effect of increasing minibatch size.
no code implementations • ICLR 2019 • Patrick H. Chen, Si Si, Sanjiv Kumar, Yang Li, Cho-Jui Hsieh
The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-$k$ words in various tasks such as beam search in machine translation or next words prediction.
no code implementations • 16 Oct 2018 • Sashank J. Reddi, Satyen Kale, Felix Yu, Dan Holtmann-Rice, Jiecao Chen, Sanjiv Kumar
Furthermore, we identify a particularly intuitive class of loss functions in the aforementioned family and show that they are amenable to practical implementation in the large output space setting (i. e. computation is possible without evaluating scores of all labels) by developing a technique called Stochastic Negative Mining.
no code implementations • 1 Oct 2018 • Quan Geng, Wei Ding, Ruiqi Guo, Sanjiv Kumar
We show that the multiplicative gap of the lower bounds and upper bounds goes to zero in various high privacy regimes, proving the tightness of the lower and upper bounds and thus establishing the optimality of the truncated Laplacian mechanism.
no code implementations • 26 Sep 2018 • Quan Geng, Wei Ding, Ruiqi Guo, Sanjiv Kumar
We derive the optimal $(0, \delta)$-differentially private query-output independent noise-adding mechanism for single real-valued query function under a general cost-minimization framework.
no code implementations • ICML 2018 • Ian En-Hsu Yen, Satyen Kale, Felix Yu, Daniel Holtmann-Rice, Sanjiv Kumar, Pradeep Ravikumar
For problems with large output spaces, evaluation of the loss function and its gradient are expensive, typically taking linear time in the size of the output space.
1 code implementation • 26 Jun 2018 • Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi, Felix X. Yu, Daniel Holtmann-Rice, Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar
Our experiments show that there is indeed additional structure beyond sparsity in the real datasets; our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods.
no code implementations • NeurIPS 2018 • Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, H. Brendan McMahan
Distributed stochastic gradient descent is an important subroutine in distributed learning.
no code implementations • 21 Feb 2018 • Si Si, Sanjiv Kumar, Yang Li
Use of nonlinear feature maps via kernel approximation has led to success in many online learning tasks.
no code implementations • NeurIPS 2017 • Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N. Holtmann-Rice, David Simcha, Felix Yu
We propose a multiscale quantization approach for fast similarity search on large, high-dimensional datasets.
no code implementations • 29 Nov 2017 • Blaise Agüera y Arcas, Beat Gfeller, Ruiqi Guo, Kevin Kilgour, Sanjiv Kumar, James Lyon, Julian Odell, Marvin Ritter, Dominik Roblek, Matthew Sharifi, Mihajlo Velimirović
To reduce battery consumption, a small music detector runs continuously on the mobile device's DSP chip and wakes up the main application processor only when it is confident that music is present.
2 code implementations • ICCV 2017 • Xu Zhang, Felix X. Yu, Sanjiv Kumar, Shih-Fu Chang
We propose a simple, yet powerful regularization technique that can be used to significantly improve both the pairwise and triplet losses in learning local feature descriptors.
no code implementations • 1 May 2017 • Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, Ray Kurzweil
This paper presents a computationally efficient machine-learned method for natural language response suggestion.
2 code implementations • ICML 2017 • Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, Le Song
Learning-based binary hashing has become a powerful paradigm for fast search and retrieval in massive databases.
no code implementations • ICML 2017 • Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan
Motivated by the need for distributed learning and optimization algorithms with low communication cost, we study communication efficient algorithms for distributed mean estimation.
no code implementations • NeurIPS 2016 • Felix X. Yu, Ananda Theertha Suresh, Krzysztof Choromanski, Daniel Holtmann-Rice, Sanjiv Kumar
We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error.
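A sketch of the construction described here for the single-block case, assuming the number of features does not exceed the input dimension and using the kernel convention exp(-gamma * ||x - y||^2); stacking independent blocks handles larger feature counts:

```python
import numpy as np

def orthogonal_random_features(X, d_features, gamma=1.0, seed=0):
    """Random Fourier features where the i.i.d. Gaussian projection is replaced
    by orthonormal rows rescaled to chi-distributed norms."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assert d_features <= d, "stack independent blocks for more features"
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))       # random orthogonal matrix
    norms = np.sqrt(rng.chisquare(df=d, size=d_features))  # match Gaussian row norms
    W = Q[:d_features] * norms[:, None] * np.sqrt(2.0 * gamma)
    b = rng.uniform(0.0, 2.0 * np.pi, d_features)
    return np.sqrt(2.0 / d_features) * np.cos(X @ W.T + b)
```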
no code implementations • ICCV 2015 • Xu Zhang, Felix X. Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang, Shih-Fu Chang
We propose a family of structured matrices to speed up orthogonal projections for high-dimensional data commonly seen in computer vision applications.
no code implementations • NeurIPS 2015 • Jeffrey Pennington, Felix Xinnan X. Yu, Sanjiv Kumar
Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials.
no code implementations • 20 Nov 2015 • Felix X. Yu, Aditya Bhaskara, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang
To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix.
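A minimal sketch of the circulant projection step described here: multiplying by a circulant matrix is a circular convolution, so it can be applied in O(d log d) with the FFT; `r` is the vector defining the circulant matrix and `signs` a random ±1 diagonal:

```python
import numpy as np

def circulant_binary_embedding(x, r, signs):
    """Binary code sign(C(r) D x): apply a random sign flip (D), multiply by the
    circulant matrix defined by r via FFT-based circular convolution, binarize."""
    x_flipped = x * signs
    proj = np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x_flipped)))
    return np.sign(proj)
```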
no code implementations • 16 Nov 2015 • Anna Choromanska, Krzysztof Choromanski, Mariusz Bojarski, Tony Jebara, Sanjiv Kumar, Yann Lecun
We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors.
no code implementations • NeurIPS 2015 • Vikas Sindhwani, Tara N. Sainath, Sanjiv Kumar
We consider the task of building compact deep learning pipelines suitable for deployment on storage and power constrained mobile devices.
no code implementations • 17 Sep 2015 • Jun Wang, Wei Liu, Sanjiv Kumar, Shih-Fu Chang
Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions.
no code implementations • 4 Sep 2015 • Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, David Simcha
We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS).
no code implementations • 10 Jun 2015 • Krzysztof Choromanski, Sanjiv Kumar, Xiaofeng Liu
To achieve fast clustering, we propose to represent each cluster by a skeleton set which is updated continuously as new data is seen.
no code implementations • 12 Mar 2015 • Felix X. Yu, Sanjiv Kumar, Henry Rowley, Shih-Fu Chang
This leads to much more compact maps without hurting the performance.
no code implementations • ICCV 2015 • Yu Cheng, Felix X. Yu, Rogerio S. Feris, Sanjiv Kumar, Alok Choudhary, Shih-Fu Chang
We explore the redundancy of parameters in deep neural networks by replacing the conventional linear projection in fully-connected layers with the circulant projection.
no code implementations • NeurIPS 2014 • Wei Liu, Cun Mu, Sanjiv Kumar, Shih-Fu Chang
Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases.
no code implementations • 13 May 2014 • Felix X. Yu, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang
To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix.
1 code implementation • 24 Feb 2014 • Felix X. Yu, Krzysztof Choromanski, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang
Learning from Label Proportions (LLP) is a learning setting, where the training data is provided in groups, or "bags", and only the proportion of each class in each bag is known.
no code implementations • 4 Jun 2013 • Felix X. Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang
We study the problem of learning with label proportions in which the training data is provided in groups and only the proportion of each class in each group is known.
no code implementations • CVPR 2013 • Yunchao Gong, Sanjiv Kumar, Henry A. Rowley, Svetlana Lazebnik
Recent advances in visual recognition indicate that to achieve good retrieval and classification accuracy on large-scale datasets like ImageNet, extremely high-dimensional visual descriptors, e.g., Fisher Vectors, are needed.
no code implementations • NeurIPS 2012 • Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik
Such data typically arises in a large number of vision and text applications where counts or frequencies are used as features.
no code implementations • NeurIPS 2009 • Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices.