In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that often performs worse than expected.
We consider Bayesian optimization of objective functions of the form $\rho[ F(x, W) ]$, where $F$ is a black-box, expensive-to-evaluate function and $\rho$ denotes either the value-at-risk (VaR) or conditional value-at-risk (CVaR) risk measure, computed with respect to the randomness induced by the environmental random variable $W$.
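For concreteness, the two risk measures can be estimated by Monte Carlo over $W$ at a fixed design $x$. The sketch below uses a hypothetical toy stand-in for the expensive black box $F$; the function names and the sampling distribution are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def empirical_var_cvar(samples, alpha=0.9):
    """Empirical VaR and CVaR at level alpha from i.i.d. loss samples.

    VaR is the alpha-quantile of the loss distribution; CVaR is the
    mean of the losses at or above that quantile.
    """
    s = np.asarray(samples, dtype=float)
    var = np.quantile(s, alpha)
    cvar = s[s >= var].mean()
    return var, cvar

def F(x, w):
    # Hypothetical cheap stand-in for the expensive black box F(x, W).
    return (x - 1.0) ** 2 + w

# Estimate rho[F(x, W)] at x = 0.5 by sampling the environmental variable W.
rng = np.random.default_rng(0)
w_samples = rng.normal(size=10_000)
var, cvar = empirical_var_cvar(F(0.5, w_samples), alpha=0.9)
```

In a Bayesian optimization loop, estimates of this kind (or a surrogate-based analogue) would serve as the noisy objective values at each queried design $x$.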
We show, however, that gradient descent combined with proper normalization avoids being trapped by the spurious local optimum and converges to a global optimum in polynomial time, when the weight of the first layer is initialized at 0 and that of the second layer is initialized arbitrarily in a ball.
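To illustrate the kind of normalization involved, here is a minimal sketch of gradient descent under the weight-normalization reparameterization $w = g \, v / \lVert v \rVert$, applied to a toy least-squares problem. This is an assumed, simplified setting for illustration only, not the paper's two-layer architecture or its exact initialization scheme.

```python
import numpy as np

# Toy least-squares problem: recover w_star from y = X @ w_star.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_star = rng.normal(size=5)
y = X @ w_star

v = rng.normal(size=5)  # direction parameter, initialized randomly in a ball
g = 1.0                 # scale parameter
lr = 0.1
for _ in range(2000):
    nv = np.linalg.norm(v)
    w = g * v / nv
    grad_w = X.T @ (X @ w - y) / len(y)
    # Chain rule through the reparameterization w = g * v / ||v||.
    grad_g = grad_w @ v / nv
    grad_v = (g / nv) * (grad_w - grad_g * v / nv)
    g -= lr * grad_g
    v -= lr * grad_v

w = g * v / np.linalg.norm(v)
```

Because the update to $v$ is orthogonal to $v$, the direction and scale are optimized on decoupled timescales, which is one intuition for why normalization helps escape bad regions of the loss surface.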
Substantial empirical evidence has corroborated that noise plays a crucial role in the effective and efficient training of neural networks.
Asynchronous momentum stochastic gradient descent (Async-MSGD) is one of the most popular algorithms in distributed machine learning.
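The serial core of Async-MSGD is the standard momentum SGD update; in the asynchronous setting, each worker applies this update using a possibly stale read of the shared parameters. The following minimal sketch shows the serial update on a toy quadratic; the function name and step sizes are illustrative assumptions.

```python
import numpy as np

def msgd_step(theta, grad, velocity, lr=0.1, momentum=0.9):
    """One momentum SGD update; in Async-MSGD, grad may be computed
    from a stale copy of theta read by an asynchronous worker."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(300):
    theta, v = msgd_step(theta, theta, v)
```

The staleness of the gradient, controlled by the communication delay, is precisely what distinguishes the asynchronous algorithm's dynamics from this serial baseline.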
Our theoretical findings partially corroborate the empirical success of MSGD in training deep neural networks.