Embodied Multimodal Agents to Bridge the Understanding Gap

In this paper we argue that embodied multimodal agents, i.e., avatars, can play an important role in moving natural language processing toward “deep understanding.” Fully-featured interactive agents, model encounters between two “people,” but a language-only agent has little environmental and situational awareness. Multimodal agents bring new opportunities for interpreting visuals, locational information, gestures, etc., which are more axes along which to communicate. We propose that multimodal agents, by facilitating an embodied form of human-computer interaction, provide additional structure that can be used to train models that move NLP systems closer to genuine “understanding” of grounded language, and we discuss ongoing studies using existing systems.

PDF Abstract
No code implementations yet. Submit your code now


  Add Datasets introduced or used in this paper

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.


No methods listed for this paper. Add relevant methods here