Illuminating Dark Knowledge via Random Matrix Ensembles
It is all but certain that machine learning models based on deep neural networks will soon feature ubiquitously in a wide variety of critical products and services that people rely on. This should be a major cause of concern given that we still lack a rigorous understanding of the failure modes of these systems, and can hardly make guarantees about the conditions under which the models are expected to work. In particular, we would like to understand how these models manage to generalize so well, even when seemingly overparametrized, effectively evading many of the intuitions expected from statistical learning theory. We argue that Distillation (Caruana et al., 2006, Hinton et al., 2014) provides us with a rich playground for understanding what enables generalization in a concrete setting. We carry out a precise high-dimensional analysis of generalization under distillation in a real world setting, eschewing ad hoc assumptions, and instead consider models actually encountered in the wild.
PDF Abstract