Search Results for author: Behrooz Ghorbani

Found 18 papers, 6 papers with code

Order Matters in the Presence of Dataset Imbalance for Multilingual Learning

no code implementations NeurIPS 2023 Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani

In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance.

Language Modelling Machine Translation +3
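
For context, a standard baseline in imbalanced multilingual training is static temperature-based sampling, which work like this examines alternatives to. A minimal, hypothetical sketch (not the paper's method; the temperature value is illustrative):

```python
import numpy as np

def sampling_rates(sizes, T=5.0):
    # Temperature-based sampling: draw task i with probability proportional
    # to its data share p_i ** (1/T). T=1 mirrors the raw imbalance;
    # larger T flattens the distribution toward uniform.
    p = np.asarray(sizes, dtype=float)
    p /= p.sum()
    q = p ** (1.0 / T)
    return q / q.sum()

print(sampling_rates([1_000_000, 10_000]))  # a high- vs low-resource pair
```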

Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation

1 code implementation 17 May 2023 Markus Freitag, Behrooz Ghorbani, Patrick Fernandes

Recent advances in machine translation (MT) have shown that Minimum Bayes Risk (MBR) decoding can be a powerful alternative to beam search decoding, especially when combined with neural-based utility functions.

Machine Translation
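
A minimal sketch of the two ingredients in the title, assuming generic next-token probabilities and a pluggable `utility` function (the paper pairs MBR with neural utility functions; the names and defaults here are illustrative):

```python
import numpy as np

def epsilon_sample_step(probs, eps=0.02, rng=np.random.default_rng(0)):
    # Epsilon sampling: drop tokens whose probability falls below eps,
    # renormalize, and sample from the truncated distribution.
    p = np.where(probs >= eps, probs, 0.0)
    p /= p.sum()
    return rng.choice(len(p), p=p)

def mbr_decode(candidates, utility):
    # Monte Carlo MBR: score each sampled candidate by its average utility
    # against the other samples (acting as pseudo-references), keep the best.
    def expected_utility(c):
        others = [r for r in candidates if r is not c]
        return sum(utility(c, r) for r in others) / len(others)
    return max(candidates, key=expected_utility)
```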

Scaling Laws for Multilingual Neural Machine Translation

no code implementations 19 Feb 2023 Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each language pair and examine the role of language similarity in the scaling behavior of our models.

Machine Translation Translation
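
The flavor of that joint formulation can be sketched as a per-language-pair power law in an effective parameter count (hedged: the notation below is mine, not the paper's exact ansatz):

```latex
% Loss of language pair i in a multilingual model with N total parameters
% trained with mixture weights w: a power law in the effective parameters
% f_i(w) N allocated to pair i, decaying toward an irreducible floor.
L_i(N, w) \approx \beta_i \bigl( f_i(w)\, N \bigr)^{-p_i} + L_i^{\infty}
```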

Binarized Neural Machine Translation

1 code implementation NeurIPS 2023 Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat

In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind.

Binarization Machine Translation +2
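
A rough sketch of the standard ingredient behind such schemes, weight binarization trained with a straight-through estimator (illustrative; BMT's exact treatment of scales and activations may differ):

```python
import numpy as np

def binarize_forward(w):
    # Forward: replace weights by their sign, rescaled by the mean absolute
    # value so the binarized tensor preserves the original magnitude.
    scale = np.abs(w).mean()
    return scale * np.sign(w)

def binarize_backward(grad_out, w):
    # Straight-through estimator: backpropagate through sign() as if it
    # were the identity, zeroing the gradient where |w| > 1.
    return grad_out * (np.abs(w) <= 1.0)
```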

Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

no code implementations 23 Sep 2022 Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer

Recent research has proposed a series of specialized optimization algorithms for deep multi-task models.
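
One well-known family of such algorithms is "gradient surgery" in the PCGrad style; a minimal sketch of the pairwise projection (illustrative of the methods under evaluation, not a proposal of this paper):

```python
import numpy as np

def pcgrad_pair(g1, g2):
    # If the two task gradients conflict (negative inner product),
    # remove from g1 its component along g2.
    dot = g1 @ g2
    if dot < 0:
        g1 = g1 - (dot / (g2 @ g2)) * g2
    return g1
```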

Data Scaling Laws in NMT: The Effect of Noise and Architecture

no code implementations 4 Feb 2022 Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).

Language Modelling Machine Translation +1
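
A hedged sketch of fitting such a data scaling law, using a saturating power law L(D) = beta * D^(-p) + L_inf on synthetic numbers (the paper's exact functional form and fitting protocol may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, beta, p, L_inf):
    # Loss decays as a power of the dataset size toward a floor L_inf.
    return beta * np.power(D, -p) + L_inf

D = np.array([1e6, 2e6, 4e6, 8e6, 1.6e7])      # training set sizes (synthetic)
L = np.array([3.10, 2.80, 2.60, 2.45, 2.35])   # held-out losses (synthetic)
(beta, p, L_inf), _ = curve_fit(scaling_law, D, L, p0=[100.0, 0.3, 2.0])
print(f"fitted exponent p = {p:.3f}, loss floor L_inf = {L_inf:.3f}")
```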

Examining Scaling and Transfer of Language Model Architectures for Machine Translation

no code implementations 1 Feb 2022 Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing.

Language Modelling Machine Translation +2
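
A rough illustration of the structural difference between the two paradigms (hypothetical token layout, not the paper's exact setup):

```python
# A decoder-only LM sees source and target as one concatenated stream in a
# single stack; an encoder-decoder model routes them through separate stacks.
src, tgt = ["Guten", "Tag"], ["Good", "day"]

lm_input = src + ["<sep>"] + tgt            # single stack, causal attention
encdec_input = {
    "encoder": src,                         # bidirectional encoder stack
    "decoder": ["<bos>"] + tgt,             # causal decoder with cross-attention
}
```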

A Loss Curvature Perspective on Training Instability in Deep Learning

no code implementations 8 Oct 2021 Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat

In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics.

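A minimal sketch of one quantity tracked in such studies, the top Hessian eigenvalue, via power iteration on Hessian-vector products (finite-difference HVPs keep the sketch self-contained; autodiff HVPs are the practical choice):

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, iters=50, eps=1e-4):
    # Power iteration: repeatedly apply the Hessian (approximated here by a
    # central finite difference of the gradient) and renormalize.
    v = np.random.default_rng(0).standard_normal(w.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        lam = v @ hv                      # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam
```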

When Do Neural Networks Outperform Kernel Methods?

1 code implementation NeurIPS 2020 Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance.

Image Classification
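
A minimal sketch of the RKHS side of that comparison, kernel ridge regression with an RBF kernel (illustrative hyperparameters; the paper's kernels and tasks may differ):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X_tr, y_tr, X_te, lam=1e-3, gamma=1.0):
    # Solve (K + lam * I) alpha = y, then predict with the cross-kernel.
    K = rbf_kernel(X_tr, X_tr, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_tr)
    return rbf_kernel(X_te, X_tr, gamma) @ alpha
```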

Limitations of Lazy Training of Two-layers Neural Network

1 code implementation NeurIPS 2019 Song Mei, Theodor Misiakiewicz, Behrooz Ghorbani, Andrea Montanari

We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class labels.
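
A hedged sketch of sampling data under the two models (the specific quadratic and the mixture covariances are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 1000

# Model (1): Gaussian features, responses from an unknown quadratic f_*.
A = rng.standard_normal((d, d)); A = (A + A.T) / 2    # symmetric quadratic form
X1 = rng.standard_normal((n, d))
y1 = np.einsum("ni,ij,nj->n", X1, A, X1)

# Model (2): mixture of two centered Gaussians; labels are the components.
labels = rng.integers(0, 2, size=n)
scales = np.where(labels == 0, 1.0, 1.5)[:, None]     # illustrative covariances
X2 = scales * rng.standard_normal((n, d))
y2 = 2 * labels - 1
```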

The Effect of Network Depth on the Optimization Landscape

no code implementations 28 May 2019 Behrooz Ghorbani, Ying Xiao, Shankar Krishnan

It is well-known that deeper neural networks are harder to train than shallower ones.

Linearized two-layers neural networks in high dimension

no code implementations 27 Apr 2019 Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

Both approaches studied here, the random features (RF) model and the neural tangent (NT) model, can be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$.

regression
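
A minimal sketch of one of the two linearizations, random-features (RF) regression: ridge regression on top of $N$ random ReLU features (hyperparameters illustrative):

```python
import numpy as np

def rf_fit(X, y, N=512, lam=1e-3, rng=np.random.default_rng(0)):
    # First layer is random and frozen; only the linear readout is trained.
    W = rng.standard_normal((X.shape[1], N)) / np.sqrt(X.shape[1])
    Phi = np.maximum(X @ W, 0.0)                       # random ReLU features
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
    return W, theta

def rf_predict(X, W, theta):
    return np.maximum(X @ W, 0.0) @ theta
```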

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

1 code implementation 29 Jan 2019 Behrooz Ghorbani, Shankar Krishnan, Ying Xiao

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process.
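
A compact sketch of the underlying estimator, Lanczos-based spectrum estimation driven by matrix-vector products (`matvec` stands in for a Hessian-vector product; a dense matrix works for testing):

```python
import numpy as np

def lanczos_spectrum(matvec, dim, k=30, rng=np.random.default_rng(0)):
    # Run k Lanczos steps, then read off Ritz values and weights from the
    # tridiagonal matrix: an estimate of the eigenvalue density as seen
    # from one random probe vector (stochastic Lanczos quadrature).
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    alphas, betas, V = [], [], [v]
    for _ in range(k):
        w = matvec(V[-1])
        alpha = V[-1] @ w
        w = w - alpha * V[-1] - (betas[-1] * V[-2] if betas else 0.0)
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        if beta < 1e-10:
            break
        betas.append(beta)
        V.append(w / beta)
    m = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
    vals, vecs = np.linalg.eigh(T)
    return vals, vecs[0] ** 2     # Ritz values and their quadrature weights
```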

An Instability in Variational Inference for Topic Models

no code implementations 2 Feb 2018 Behrooz Ghorbani, Hamid Javadi, Andrea Montanari

Namely, for certain regimes of the model parameters, variational inference outputs a non-trivial decomposition into topics even when the underlying data contain no actual topic structure.

Topic Models Variational Inference
