no code implementations • 22 Apr 2024 • Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, ZiYi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.
no code implementations • 14 Dec 2023 • Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang
Specifically, for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B.
no code implementations • NeurIPS 2023 • Rachel Ward, Tamara G. Kolda
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = C (\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})})^2 \log(1/\epsilon)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{T} \|^2 \leq \epsilon \| \mathbf{A}\|^2$ with high probability starting from an atypical random initialization.
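As a rough illustration of the procedure (not the paper's constants), the following numpy sketch runs alternating gradient descent on a synthetic rank-$r$ matrix starting from a small random initialization; the step size, initialization scale, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 50, 5
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # synthetic rank-r target

# Small random initialization; the scale is an illustrative choice.
X = 0.1 * rng.standard_normal((m, r))
Y = 0.1 * rng.standard_normal((n, r))

eta = 0.1 / np.linalg.norm(A, 2)          # illustrative step size
for _ in range(3000):
    X = X - eta * (X @ Y.T - A) @ Y       # gradient step in X with Y held fixed
    Y = Y - eta * (X @ Y.T - A).T @ X     # gradient step in Y with the updated X held fixed

rel_err = np.linalg.norm(A - X @ Y.T) ** 2 / np.linalg.norm(A) ** 2
print(f"relative error ||A - XY^T||^2 / ||A||^2 = {rel_err:.2e}")
```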
no code implementations • 9 May 2023 • Hung-Hsu Chou, Holger Rauhut, Rachel Ward
By analyzing key invariants of the gradient flow and using the Łojasiewicz theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that, in contrast to plain gradient flow, weight normalization enables a robust bias that persists even if the weights are initialized at a practically large scale.
no code implementations • 4 Oct 2022 • Yijun Dong, Yuege Xie, Rachel Ward
At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization.
no code implementations • 15 Jun 2022 • Raghu Bollapragada, Tyler Chen, Rachel Ward
Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature.
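For concreteness, here is a minimal sketch of the stochastic heavy-ball (momentum) method on a least-squares problem; the problem, step size, and momentum parameter are illustrative and not tied to the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

x = np.zeros(d)
x_prev = x.copy()
lr, beta = 0.01, 0.9                       # illustrative step size and momentum
for _ in range(20000):
    i = rng.integers(n)                    # pick one sample
    g = (A[i] @ x - b[i]) * A[i]           # stochastic gradient of 0.5*(a_i^T x - b_i)^2
    x, x_prev = x - lr * g + beta * (x - x_prev), x

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("distance to least-squares solution:", np.linalg.norm(x - x_star))
```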
no code implementations • 19 May 2022 • Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry
In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas.
no code implementations • 16 May 2022 • Nhat Ho, Tongzheng Ren, Sujay Sanghavi, Purnamrita Sarkar, Rachel Ward
Therefore, the total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of GD for solving parameter estimation in non-regular statistical models, while being comparable to that of GD in regular statistical settings.
no code implementations • 14 Apr 2022 • Zhijun Chen, Hayden Schaeffer, Rachel Ward
The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models.
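As a small numerical illustration, the sketch below forms a random Fourier feature matrix and inspects its singular values; the dimensions and feature map are illustrative choices, not the regime analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 2000, 10, 400                 # samples, input dimension, random features
X = rng.standard_normal((N, d))         # data points
W = rng.standard_normal((d, m))         # random feature weights
b = rng.uniform(0, 2 * np.pi, m)        # random phases

Phi = np.cos(X @ W + b)                 # random feature matrix used in regression

s = np.linalg.svd(Phi, compute_uv=False)
print("largest / smallest singular value:", s[0], s[-1])
print("condition number:", s[0] / s[-1])
```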
no code implementations • 1 Mar 2022 • Juncai He, Richard Tsai, Rachel Ward
In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input.
no code implementations • 24 Feb 2022 • Shuo Yang, Yijun Dong, Rachel Ward, Inderjit S. Dhillon, Sujay Sanghavi, Qi Lei
Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data.
no code implementations • 11 Feb 2022 • Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward
We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives.
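A minimal sketch of the AdaGrad-Norm update (a single scalar step size divided by the square root of the accumulated squared gradient norms) on a toy least-squares problem; the constants eta and the initial accumulator are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

x = np.zeros(d)
eta, acc = 1.0, 1e-2                      # illustrative eta and initial accumulator b_0^2
for _ in range(5000):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]          # stochastic gradient
    acc += g @ g                          # accumulate squared gradient norms
    x -= (eta / np.sqrt(acc)) * g         # AdaGrad-Norm step: one global adaptive step size
print("final mean-squared loss:", np.mean((A @ x - b) ** 2))
```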
1 code implementation • 7 Dec 2021 • Yuege Xie, Bobby Shi, Hayden Schaeffer, Rachel Ward
Inspired by the success of the iterative magnitude pruning technique in finding lottery tickets of neural networks, we propose a new method -- Sparser Random Feature Models via IMP (ShRIMP) -- to efficiently fit high-dimensional data with inherent low-dimensional structure in the form of sparse variable dependencies.
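The rough recipe, as described, is to fit a random feature regression, prune the smallest-magnitude coefficients, and refit on the surviving features. The sketch below illustrates that loop only; the pruning schedule, feature map, and target function are illustrative and not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 400, 5, 600
X = rng.uniform(-1, 1, (N, d))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2          # depends on only two variables

W = rng.standard_normal((d, m))
b = rng.uniform(0, 2 * np.pi, m)
Phi = np.cos(X @ W + b)                                 # random feature matrix

keep = np.arange(m)
for _ in range(5):                                      # illustrative number of pruning rounds
    c = np.linalg.lstsq(Phi[:, keep], y, rcond=None)[0] # (re)fit on surviving features
    order = np.argsort(-np.abs(c))                      # sort by coefficient magnitude
    keep = keep[order[: max(10, len(keep) // 2)]]       # drop the smaller half

c = np.linalg.lstsq(Phi[:, keep], y, rcond=None)[0]
resid = Phi[:, keep] @ c - y
print("features kept:", len(keep), " train RMSE:", np.sqrt(np.mean(resid ** 2)))
```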
1 code implementation • 20 Sep 2021 • Dimitris Giannakis, Amelia Henriksen, Joel A. Tropp, Rachel Ward
This algorithm dramatically reduces the costs of training and prediction without sacrificing forecasting skill.
no code implementations • 17 Sep 2021 • Xiaoxia Wu, Yuege Xie, Simon Du, Rachel Ward
We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods.
no code implementations • NeurIPS 2021 • Robert Lunde, Purnamrita Sarkar, Rachel Ward
We consider the problem of quantifying uncertainty for the estimation error of the leading eigenvector from Oja's algorithm for streaming principal component analysis, where the data are generated IID from some unknown distribution.
2 code implementations • 4 Mar 2021 • Abolfazl Hashemi, Hayden Schaeffer, Robert Shi, Ufuk Topcu, Giang Tran, Rachel Ward
In particular, we provide generalization bounds for functions in a certain class (that is dense in a reproducing kernel Hilbert space) depending on the number of samples and the distribution of features.
no code implementations • 6 Feb 2021 • De Huang, Jonathan Niles-Weed, Rachel Ward
We analyze Oja's algorithm for streaming $k$-PCA and prove that it achieves performance nearly matching that of an optimal offline algorithm.
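For reference, a minimal sketch of Oja's algorithm for streaming k-PCA: a d-by-k iterate is updated with each incoming sample and re-orthonormalized by QR; the spiked-covariance stream and decaying step size here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 50, 3, 20000
U = np.linalg.qr(rng.standard_normal((d, k)))[0]        # true top-k subspace (spiked model)

Q = np.linalg.qr(rng.standard_normal((d, k)))[0]        # random orthonormal start
for t in range(1, T + 1):
    x = U @ (3.0 * rng.standard_normal(k)) + rng.standard_normal(d)  # one streaming sample
    eta = 1.0 / (t + 10)                                 # illustrative decaying step size
    Q = Q + eta * np.outer(x, x @ Q)                     # Oja update
    Q, _ = np.linalg.qr(Q)                               # re-orthonormalize

cosines = np.linalg.svd(U.T @ Q, compute_uv=False)       # principal-angle cosines
print("smallest principal-angle cosine:", cosines[-1])
```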
no code implementations • 15 Jun 2020 • Yuege Xie, Hung-Hsu Chou, Holger Rauhut, Rachel Ward
Motivated by surprisingly good generalization properties of learned deep neural networks in overparameterized scenarios and by the related double descent phenomenon, this paper analyzes the relation between smoothness and low generalization error in an overparameterized linear learning problem.
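A small sketch of the overparameterized linear setting: with more features than samples, least squares interpolates the training data, and numpy's lstsq returns the minimum-norm interpolant; the dimensions, signal, and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                # overparameterized: more features than samples
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)

# lstsq returns the minimum-norm solution of the underdetermined system X w = y
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

X_test = rng.standard_normal((2000, d))
print("train MSE:", np.mean((X @ w_hat - y) ** 2))            # essentially zero (interpolation)
print("test  MSE:", np.mean((X_test @ (w_hat - w_true)) ** 2))
```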
no code implementations • NeurIPS 2020 • Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu
For certain stepsizes of $g$ and $w$, we show that they can converge close to the minimum norm solution.
no code implementations • 28 Aug 2019 • Yuege Xie, Xiaoxia Wu, Rachel Ward
We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality.
no code implementations • 26 Jul 2019 • Denali Molitor, Deanna Needell, Rachel Ward
Gradient descent is a simple and widely used optimization method for machine learning.
1 code implementation • 28 May 2019 • Amelia Henriksen, Rachel Ward
We also show that AdaOja performs comparably to state-of-the-art algorithms (History PCA and Streaming Power Method) in the same streaming PCA setting.
no code implementations • 19 Feb 2019 • Xiaoxia Wu, Simon S. Du, Rachel Ward
Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks.
no code implementations • 25 Nov 2018 • Lam Si Tung Ho, Hayden Schaeffer, Giang Tran, Rachel Ward
In this work, we study the problem of learning nonlinear functions from corrupted and dependent data.
1 code implementation • 5 Jun 2018 • Rachel Ward, Xiaoxia Wu, Leon Bottou
Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule.
no code implementations • 7 Mar 2018 • Xiaoxia Wu, Rachel Ward, Léon Bottou
Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice.
no code implementations • 17 Oct 2016 • Soledad Villar, Afonso S. Bandeira, Andrew J. Blumberg, Rachel Ward
The Gromov-Hausdorff distance provides a metric on the set of isometry classes of compact metric spaces.
no code implementations • 22 Feb 2016 • Dustin G. Mixon, Soledad Villar, Rachel Ward
We introduce a model-free relax-and-round algorithm for k-means clustering based on a semidefinite relaxation due to Peng and Wei.
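The Peng-Wei SDP relaxation minimizes $\langle D, Z \rangle$ over matrices $Z$ that are positive semidefinite, entrywise nonnegative, have rows summing to one, and have trace $k$, where $D$ holds squared pairwise distances. The sketch below solves that SDP with cvxpy and then rounds with a simple k-means step on the rows of $Z$; this rounding is an illustrative choice and need not match the paper's procedure.

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k, per = 3, 15
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.standard_normal((per, 2)) for c in centers])
n = len(X)

D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared pairwise distances

# Peng-Wei SDP relaxation of k-means
Z = cp.Variable((n, n), PSD=True)
constraints = [Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == k]
cp.Problem(cp.Minimize(cp.sum(cp.multiply(D, Z))), constraints).solve()

# Illustrative rounding: cluster the rows of the (nearly block-diagonal) solution
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z.value)
print(labels.reshape(k, per))
```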
no code implementations • 25 Jun 2015 • Chris D. White, Sujay Sanghavi, Rachel Ward
This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$).
no code implementations • 18 Aug 2014 • Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, Rachel Ward
Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $\Delta = 4$.
no code implementations • 28 Apr 2014 • Karin Knudson, Rayan Saab, Rachel Ward
Consider the recovery of an unknown signal ${x}$ from quantized linear measurements.
no code implementations • NeurIPS 2014 • Deanna Needell, Nathan Srebro, Rachel Ward
Furthermore, we show how reweighting the sampling distribution (i.e., importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence on the average smoothness, dominating previous results.
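A minimal sketch of the reweighting idea: sample rows with probability proportional to their smoothness constants (for least squares, the squared row norms) and rescale each sampled gradient so the update stays unbiased; the step size and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d)) * rng.uniform(0.1, 5.0, (n, 1))   # rows with very uneven norms
x_true = rng.standard_normal(d)
b = A @ x_true

L = (A ** 2).sum(axis=1)             # smoothness constant of each 0.5*(a_i^T x - b_i)^2
p = L / L.sum()                      # importance-sampling distribution

x = np.zeros(d)
eta = 0.5 / L.mean()                 # illustrative step size tied to the average smoothness
for _ in range(20000):
    i = rng.choice(n, p=p)
    g = (A[i] @ x - b[i]) * A[i]
    x -= eta * g / (n * p[i])        # rescale so the expected update is the full gradient step
print("distance to x_true:", np.linalg.norm(x - x_true))
```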
no code implementations • 12 Sep 2013 • Abhinav Nellore, Rachel Ward
For a certain class of distributions, we prove that the linear programming relaxation of $k$-medoids clustering---a variant of $k$-means clustering where means are replaced by exemplars from within the dataset---distinguishes points drawn from nonoverlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large.
no code implementations • 12 Jun 2013 • Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, Rachel Ward
Matrix completion, i.e., the exact and provable recovery of a low-rank matrix from a small subset of its elements, is currently only known to be possible if the matrix satisfies a restrictive structural constraint---known as incoherence---on its row and column spaces.
no code implementations • 11 Oct 2012 • Deanna Needell, Rachel Ward
Consider the problem of reconstructing a multidimensional signal from an underdetermined set of measurements, as in the setting of compressed sensing.
no code implementations • 8 Oct 2012 • Felix Krahmer, Rachel Ward
For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices comprised of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions.