Feature selection is one of the most fundamental problems in machine
learning. An extensive body of work on information-theoretic feature selection
is based on maximizing the mutual information between subsets of features and
the class labels.
Practical methods are forced to rely on
approximations due to the difficulty of estimating mutual information. We
demonstrate that the approximations made by existing methods rest on
unrealistic assumptions. We formulate a more flexible and general class of
assumptions based on variational distributions and use them to derive
tractable lower bounds on mutual information. These bounds define a novel
information-theoretic framework for feature selection, which we prove to be
optimal under tree graphical models given a proper choice of variational
distributions. Our experiments demonstrate that the proposed method strongly
outperforms existing information-theoretic feature selection approaches.
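As an illustration of the kind of variational lower bound referred to above (a sketch of the standard Barber-Agakov bound, not necessarily the exact form used in this work), replacing the intractable conditional $p(y \mid \mathbf{x})$ with a variational distribution $q(y \mid \mathbf{x})$ gives
\[
I(\mathbf{x}; y) \;=\; H(y) - H(y \mid \mathbf{x})
\;\ge\; H(y) + \mathbb{E}_{p(\mathbf{x}, y)}\bigl[\log q(y \mid \mathbf{x})\bigr],
\]
with equality when $q(y \mid \mathbf{x}) = p(y \mid \mathbf{x})$; the gap is the expected KL divergence $\mathbb{E}_{p(\mathbf{x})}\bigl[\mathrm{KL}\bigl(p(y \mid \mathbf{x}) \,\|\, q(y \mid \mathbf{x})\bigr)\bigr]$, which is non-negative. Choosing a tractable family for $q$ makes the bound computable even when the mutual information itself is not.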