Understanding Over-parameterization in Generative Adversarial Networks
A broad class of unsupervised deep learning methods, such as Generative Adversarial Networks (GANs), involves training over-parameterized models in which the number of model parameters exceeds the size of the training dataset. Indeed, most successful GANs used in practice are trained with over-parameterized generator and discriminator networks, both in terms of depth and width. A large body of work in supervised learning has shown the importance of such model over-parameterization for the convergence of gradient descent (GD) to globally optimal solutions. In contrast, the unsupervised setting, and GANs in particular, involves non-convex concave min-max optimization problems that are often trained using alternating Gradient Descent/Ascent (GDA). The role and benefits of model over-parameterization in the convergence of GDA to a global saddle point in non-convex concave problems are far less understood. In this work, we present a comprehensive theoretical and empirical analysis of the importance of model over-parameterization in GANs. We theoretically show that in an over-parameterized GAN model with a $1$-layer neural network generator and a linear discriminator, GDA converges to a global saddle point of the underlying non-convex concave min-max problem. To the best of our knowledge, this is the first global-convergence result for GDA in such settings. Our theory follows from a more general result that holds for a broader class of nonlinear generators and discriminators satisfying certain assumptions (including deeper generators and random-feature discriminators). Our theory utilizes and builds upon a novel connection with the convergence analysis of linear time-varying dynamical systems, which may have broader implications for understanding the convergence behavior of GDA for non-convex concave problems involving over-parameterized models.
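To make the alternating GDA scheme concrete, the following is a minimal, illustrative sketch on a toy problem of our own construction (not the paper's exact model or objective): a linear "generator" that shifts Gaussian noise by a parameter vector, a linear discriminator, and a regularized moment-matching objective that is concave in the discriminator. All names and constants here are hypothetical choices for illustration.

```python
# Toy alternating GDA sketch (illustrative only; not the paper's model):
#   min_theta max_w  w^T (mean(real) - mean(G_theta(z))) - 0.5 * ||w||^2
# The -0.5 * ||w||^2 term makes the inner maximization concave.
import numpy as np

rng = np.random.default_rng(0)
d = 2
theta_true = np.array([1.5, -0.5])                # mean of the "real" data
real = theta_true + 0.1 * rng.standard_normal((1000, d))

theta = np.zeros(d)                                # generator: G_theta(z) = z + theta
w = np.zeros(d)                                    # linear discriminator: D_w(x) = w @ x
eta = 0.1                                          # step size for both players

for _ in range(500):
    z = rng.standard_normal((64, d))               # minibatch of latent noise
    fake = z + theta                               # generated samples
    grad_w = real.mean(0) - fake.mean(0) - w       # gradient for the ascent player
    grad_theta = -w                                # gradient for the descent player
    w = w + eta * grad_w                           # ascent step (discriminator)
    theta = theta - eta * grad_theta               # descent step (generator)

# theta drifts toward the real data mean as GDA approaches the saddle point
print(theta)
```

With this step size the coupled updates spiral into the unique saddle point, so `theta` ends up close to the empirical mean of `real`; the spiraling (rather than monotone) approach is the typical signature of GDA dynamics on min-max problems.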
We also empirically study the role of model over-parameterization in GANs through several large-scale experiments on the CIFAR-10 and CelebA datasets. Our experiments show that over-parameterization improves the quality of generated samples across various model architectures and datasets. Remarkably, we observe that over-parameterization leads to faster and more stable GDA convergence across the board.