Decomposing Mutual Information for Representation Learning
Many self-supervised representation learning methods maximize mutual information (MI) across views. In this paper, we transform each view into a set of subviews and then decompose the original MI bound into a sum of bounds involving conditional MI between the subviews. E.g.,~given two views $x$ and $y$ of the same input example, we can split $x$ into two subviews, $x^{\prime}$ and $x^{\prime\prime}$, which depend only on $x$ but are otherwise unconstrained. The following holds: $I(x; y) \geq I(x^{\prime\prime}; y) + I(x^{\prime}; y | x^{\prime\prime})$, due to the chain rule and information processing inequality. By maximizing both terms in the decomposition, our approach explicitly rewards the encoder for any information about $y$ which it extracts from $x^{\prime\prime}$, and for information about $y$ extracted from $x^{\prime}$ in excess of the information from $x^{\prime\prime}$. We provide a novel contrastive lower-bound on conditional MI, that relies on sampling contrast sets from $p(y|x^{\prime\prime})$. By decomposing the original MI into a sum of increasingly challenging MI bounds between sets of increasingly informed views, our representations can capture more of the total information shared between the original views. We empirically test the method in a vision domain and for dialogue generation.
PDF Abstract