no code implementations • 4 Feb 2025 • Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar
We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model.
no code implementations • 24 Oct 2024 • Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar
In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples.
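A rough illustration of the soft-labeling component (point 1): a distillation-style objective mixes the hard next-token label with the SLM's softened distribution. This is a generic sketch, not the paper's exact loss; `alpha` and `T` are hypothetical mixing weight and temperature.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_label_loss(student_logits, slm_logits, hard_labels, alpha=0.5, T=2.0):
    """Cross-entropy on hard labels mixed with KL to the SLM's soft labels."""
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(len(hard_labels)), hard_labels] + 1e-12).mean()

    # Temperature-softened teacher (SLM) and student distributions.
    p_slm = softmax(slm_logits / T)
    log_p_student_T = np.log(softmax(student_logits / T) + 1e-12)
    kl = (p_slm * (np.log(p_slm + 1e-12) - log_p_student_T)).sum(axis=-1).mean()

    return (1 - alpha) * ce + alpha * (T ** 2) * kl

logits_s = rng.normal(size=(4, 10))
logits_t = rng.normal(size=(4, 10))
print(soft_label_loss(logits_s, logits_t, np.array([1, 3, 5, 7])))
```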
no code implementations • 21 Oct 2024 • Giulia Desalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar
We present a novel soft prompt based framework, SoftSRV, that leverages a frozen pre-trained large language model (LLM) to generate targeted synthetic text sequences.
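A minimal sketch of the general soft-prompting mechanism (not SoftSRV's particular parameterization): a block of learnable prompt embeddings is prepended to the token embeddings of a frozen backbone, and only the prompt receives gradients. The tiny Transformer encoder below is a stand-in for the frozen LLM.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Learnable soft prompt prepended to a frozen backbone (stand-in for an LLM)."""

    def __init__(self, backbone, embed, prompt_len=8, d_model=64):
        super().__init__()
        self.backbone = backbone            # frozen model
        self.embed = embed                  # frozen token-embedding table
        for p in list(backbone.parameters()) + list(embed.parameters()):
            p.requires_grad_(False)
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_ids):
        tok = self.embed(token_ids)                                   # (batch, seq, d)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return self.backbone(torch.cat([prompt, tok], dim=1))

d_model = 64
embed = nn.Embedding(1000, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
model = SoftPromptWrapper(backbone, embed, prompt_len=8, d_model=d_model)

ids = torch.randint(0, 1000, (2, 16))
out = model(ids)   # only model.soft_prompt is trainable
```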
no code implementations • 24 Jan 2024 • Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia Desalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to standard SC loss.
no code implementations • 12 Oct 2023 • Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal
Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
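DistillSpec builds on speculative decoding; for context, the sketch below shows only the standard accept/reject rule for a single drafted token (the usual speculative sampling recipe, not code from the paper), where `p_target` and `q_draft` are the target and draft models' next-token distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(token, p_target, q_draft):
    """Standard speculative sampling step for one drafted token; returns the accepted id."""
    accept_prob = min(1.0, p_target[token] / max(q_draft[token], 1e-12))
    if rng.random() < accept_prob:
        return token
    # Rejected: resample from the residual distribution max(0, p - q), renormalized.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)
```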
no code implementations • 28 Jan 2023 • Gui Citovsky, Giulia Desalvo, Sanjiv Kumar, Srikumar Ramalingam, Afshin Rostamizadeh, Yunjuan Wang
In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples is selected.
no code implementations • 7 Oct 2022 • Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh
Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label.
no code implementations • NeurIPS 2021 • Kareem Amin, Giulia Desalvo, Afshin Rostamizadeh
Consider a setting where we wish to automate an expensive task with a machine learning algorithm using a limited labeling resource.
1 code implementation • NeurIPS 2021 • Gui Citovsky, Giulia Desalvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, Sanjiv Kumar
The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources.
no code implementations • ICLR 2022 • Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh
In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is to also maintain a low difference in predictions compared to the base model (i.e., predictive "churn").
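A minimal way to quantify the predictive churn referred to here is the fraction of examples whose predicted label changes between the base and updated models; the helper below is illustrative, not the paper's exact metric.

```python
import numpy as np

def prediction_churn(base_preds, new_preds):
    """Fraction of examples whose predicted label changed between model versions."""
    return float(np.mean(np.asarray(base_preds) != np.asarray(new_preds)))

# Example: 2 of 5 predictions flipped after retraining -> churn = 0.4
print(prediction_churn([0, 1, 1, 0, 2], [0, 1, 0, 1, 2]))
```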
no code implementations • 4 Jun 2021 • Heinrich Jiang, Afshin Rostamizadeh
We show under standard non-parametric assumptions that a classical support estimator can be repurposed as an offline algorithm attaining an excess query cost of $\widetilde{\Theta}(n^{D/(D+1)})$ compared to the optimal learner, where $n$ is the number of datapoints and $D$ is the dimension.
1 code implementation • ICLR 2021 • Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh
Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective.
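A bare-bones sketch of this standard optimization view, using federated averaging on a linear model with synthetic client data (the paper's posterior-inference reformulation is not shown); the client count, step sizes, and round counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic local datasets for 3 clients: y = X @ w_true + noise.
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ w_true + 0.1 * rng.normal(size=50)
    clients.append((X, y))

w_global = np.zeros(2)
for rnd in range(20):                       # communication rounds
    local_weights = []
    for X, y in clients:
        w = w_global.copy()
        for _ in range(5):                  # local SGD steps on the client's data
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.05 * grad
        local_weights.append(w)
    w_global = np.mean(local_weights, axis=0)   # server averages client models

print(w_global)   # approaches w_true
```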
2 code implementations • NeurIPS 2020 • Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, Ameesh Makadia
Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto $O(n)$ or $SO(n)$.
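For reference, the standard symmetric-orthogonalization projection onto $SO(n)$ via the SVD looks like the NumPy sketch below; the sign flip on the last singular direction enforces determinant $+1$.

```python
import numpy as np

def special_orthogonalize(M):
    """Project a square matrix M onto SO(n) via SVD (nearest rotation in Frobenius norm)."""
    U, _, Vt = np.linalg.svd(M)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag(np.append(np.ones(M.shape[0] - 1), d))
    return U @ D @ Vt

M = np.random.randn(3, 3)
R = special_orthogonalize(M)
print(np.allclose(R @ R.T, np.eye(3)), np.linalg.det(R))  # True, ~1.0
```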
1 code implementation • 2 Dec 2019 • Shuang Song, David Berthelot, Afshin Rostamizadeh
This analysis can be used to measure the relative value of labeled/unlabeled data at different points of the learning curve, where we find that although the incremental value of labeled data can be as much as 20x that of unlabeled, it quickly diminishes to less than 3x once more than 2,000 labeled examples are observed.
no code implementations • 28 Jun 2019 • Jean-François Kagy, Tolga Kayadelen, Ji Ma, Afshin Rostamizadeh, Jana Strnadova
In a live setting, we tested the use of active learning to select text sentences for human annotation when training a Thai segmentation machine learning model.
no code implementations • 30 Apr 2019 • Mohammadhossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni, Afshin Rostamizadeh
To achieve this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and design a data structure to query the submodular function in amortized $O(\log n )$ time (where $n$ is the input vocabulary size).
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
5 code implementations • ICLR 2018 • Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, Ameet Talwalkar
Modern learning models are characterized by large hyperparameter spaces and long training times.
1 code implementation • 26 Jun 2018 • Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi, Felix X. Yu, Daniel Holtmann-Rice, Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar
Our experiments show that there is indeed additional structure beyond sparsity in the real datasets; our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods.
no code implementations • ICLR 2018 • Lisha Li, Kevin Jamieson, Afshin Rostamizadeh, Katya Gonina, Moritz Hardt, Benjamin Recht, Ameet Talwalkar
Modern machine learning models are characterized by large hyperparameter search spaces and prohibitively expensive training costs.
18 code implementations • 21 Mar 2016 • Lisha Li, Kevin Jamieson, Giulia Desalvo, Afshin Rostamizadeh, Ameet Talwalkar
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters.
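Hyperband's core subroutine is successive halving: evaluate many configurations on a small budget, keep the best fraction, and repeat with a larger budget. A toy sketch, where `evaluate(config, budget)` is a hypothetical stand-in for partially training a configuration and returning its validation loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(config, budget):
    """Hypothetical stand-in: noisy validation loss that sharpens with more budget."""
    return config["lr_badness"] + rng.normal(scale=1.0 / np.sqrt(budget))

def successive_halving(configs, min_budget=1, eta=3):
    budget = min_budget
    while len(configs) > 1:
        losses = [evaluate(c, budget) for c in configs]
        keep = max(1, len(configs) // eta)           # keep the top 1/eta configurations
        order = np.argsort(losses)[:keep]
        configs = [configs[i] for i in order]
        budget *= eta                                # give the survivors more budget
    return configs[0]

configs = [{"lr_badness": rng.uniform(0, 1)} for _ in range(27)]
print(successive_halving(configs))
```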
no code implementations • 29 Sep 2015 • Mehryar Mohri, Afshin Rostamizadeh, Dmitry Storcheus
The generalization error bound is based on a careful analysis of the empirical Rademacher complexity of the relevant hypothesis set.
no code implementations • NeurIPS 2014 • Kareem Amin, Afshin Rostamizadeh, Umar Syed
Motivated by real-time advertising exchanges, we analyze the problem of pricing inventory in a repeated posted-price auction.
no code implementations • 9 Aug 2014 • Ameet Talwalkar, Afshin Rostamizadeh
Crucial to the performance of this technique is the assumption that a matrix can be well approximated by working exclusively with a subset of its columns.
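The column-subset approximation in question is the Nyström method: sample a few columns $C$ of a PSD matrix $K$, take the intersection block $W$, and approximate $K \approx C W^{+} C^{\top}$. A small NumPy sketch on a synthetic kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# A PSD kernel matrix K built from random points (RBF kernel).
X = rng.normal(size=(500, 5))
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

# Nystrom approximation from a random subset of 50 columns.
idx = rng.choice(K.shape[0], size=50, replace=False)
C = K[:, idx]                       # sampled columns
W = K[np.ix_(idx, idx)]             # intersection block
K_approx = C @ np.linalg.pinv(W) @ C.T

print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))   # relative error
```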
no code implementations • NeurIPS 2013 • Kareem Amin, Afshin Rostamizadeh, Umar Syed
Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer's value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism.
no code implementations • 1 May 2013 • Mehryar Mohri, Afshin Rostamizadeh
We present a brief survey of existing mistake bounds and introduce novel bounds for the Perceptron or the kernel Perceptron algorithm.
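For context, the classical bound such surveys start from is Novikoff's: if all examples lie in a ball of radius $R$ and some unit-norm hyperplane separates them with margin $\gamma$, the Perceptron makes at most

$$\left(\frac{R}{\gamma}\right)^{2}$$

mistakes, regardless of the order in which the examples arrive.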
no code implementations • 2 Mar 2012 • Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment.
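Up to notation, the centered alignment in question is the quantity below, where $K^{c} = HKH$ denotes the centered kernel matrix with centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^{\top}$:

$$\hat{\rho}(K_1, K_2) \;=\; \frac{\langle K_1^{c},\, K_2^{c}\rangle_F}{\|K_1^{c}\|_F\,\|K_2^{c}\|_F}.$$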
no code implementations • NeurIPS 2009 • Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
This paper studies the general problem of learning kernels based on a polynomial combination of base kernels.
no code implementations • 19 Feb 2009 • Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh
This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions for which we also give novel algorithms.
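The discrepancy being minimized is, for a loss $L$ and hypothesis set $H$, the quantity below, which measures how differently two distributions score pairs of hypotheses:

$$\mathrm{disc}_L(P, Q) \;=\; \max_{h, h' \in H}\,\Bigl|\,\mathbb{E}_{x \sim P}\bigl[L(h(x), h'(x))\bigr] \;-\; \mathbb{E}_{x \sim Q}\bigl[L(h(x), h'(x))\bigr]\,\Bigr|.$$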
no code implementations • NeurIPS 2008 • Mehryar Mohri, Afshin Rostamizadeh
In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample.
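The data-dependent quantity in question is the empirical Rademacher complexity of a hypothesis class $H$ on a sample $S = (x_1, \dots, x_m)$, with i.i.d. uniform $\pm 1$ variables $\sigma_i$:

$$\widehat{\mathfrak{R}}_S(H) \;=\; \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^{m}\sigma_i\, h(x_i)\right].$$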
no code implementations • NeurIPS 2008 • Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh
The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain.
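One representative combination rule from this setting is the distribution-weighted combination, sketched here under the assumption that each source hypothesis $h_k$, its source distribution $D_k$, and mixture weights $\lambda_k$ are available:

$$h_{\lambda}(x) \;=\; \sum_{k=1}^{K} \frac{\lambda_k D_k(x)}{\sum_{j=1}^{K}\lambda_j D_j(x)}\, h_k(x).$$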
no code implementations • NeurIPS 2007 • Mehryar Mohri, Afshin Rostamizadeh
We also illustrate their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.
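For concreteness, Kernel Ridge Regression, one of the algorithms covered, returns the hypothesis $h(x) = \sum_{i=1}^{m}\alpha_i K(x_i, x)$ with

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y},$$

where $K$ is the kernel matrix on the training sample, $\lambda > 0$ the ridge parameter, and $\mathbf{y}$ the vector of training labels.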