HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Many real-world problems are inherently multimodal, from the communicative modalities humans use to express social and emotional states to the force, proprioception, and visual sensors ubiquitous on robots. While there has been an explosion of interest in multimodal representation learning, these methods are still largely focused on a small set of modalities, primarily in the language, vision, and audio space. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios. Since adding new models for every new modality or task becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? We propose two new information-theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar two modalities $\{X_1,X_2\}$ are by measuring how much information can be transferred from $X_1$ to $X_2$, while (2) interaction heterogeneity studies how similarly pairs of modalities $\{X_1,X_2\}, \{X_3,X_4\}$ interact by measuring how much interaction information can be transferred from $\{X_1,X_2\}$ to $\{X_3,X_4\}$. We show the importance of these proposed metrics in high-modality scenarios as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to $10$ modalities and $15$ tasks from $5$ different research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, but it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and the model transfers to entirely new modalities and tasks during fine-tuning.
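To make the idea of using a heterogeneity score to decide parameter sharing concrete, the following is a minimal sketch, not the paper's implementation. It assumes a precomputed, symmetric pairwise heterogeneity score (the abstract describes transfer as directional, so symmetry is a simplification), a hypothetical threshold, and made-up modality names and values. The helper `group_by_heterogeneity` is illustrative only; it groups modalities whose score falls at or below the threshold, so that each group could share parameters.

```python
from itertools import combinations


def group_by_heterogeneity(names, het, threshold):
    """Group modalities whose pairwise heterogeneity is <= threshold.

    `het` maps unordered pairs (as tuples) to a hypothetical heterogeneity
    score; lower means more similar. Grouping is transitive (single linkage),
    implemented with a small union-find.
    """
    parent = {n: n for n in names}

    def find(n):
        # Walk to the root of n's group, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        score = het.get((a, b), het.get((b, a)))
        if score is not None and score <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Made-up scores for four placeholder modalities (illustration only).
names = ["A", "B", "C", "D"]
het = {
    ("A", "B"): 0.90, ("A", "C"): 0.80, ("A", "D"): 0.95,
    ("B", "C"): 0.70, ("B", "D"): 0.85, ("C", "D"): 0.20,
}
groups = group_by_heterogeneity(names, het, threshold=0.3)
print(groups)
```

Under these invented scores, only `C` and `D` are similar enough to be grouped; `A` and `B` would each keep their own parameters. How the scores are actually computed and how sharing is decided are details the abstract does not specify.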
