3 Feb 2022 • Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, Yuke Zhu
Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
Additional experiments explore the role of language-based encodings in these results; we find that a simple adapter layer mapping observations and action histories into LM embeddings can be trained, so that language modeling provides an effective initializer even for tasks with no language as input or output.
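The adapter idea above can be illustrated with a minimal sketch: a single linear map that projects an observation/action-history feature vector into the LM's embedding space, so the frozen LM consumes it as a pseudo-token. The dimensions, initialization, and function names here are illustrative assumptions, not the paper's implementation.

```python
import random


def make_adapter(obs_dim, embed_dim, seed=0):
    """Hypothetical adapter: one linear layer y = W x + b mapping an
    observation/action-history feature vector into the LM embedding
    space (a sketch only; the actual architecture may differ)."""
    rng = random.Random(seed)
    # Small random weights (embed_dim x obs_dim) and a zero bias.
    W = [[rng.gauss(0.0, 0.02) for _ in range(obs_dim)]
         for _ in range(embed_dim)]
    b = [0.0] * embed_dim

    def adapter(x):
        # Each input vector becomes one pseudo-token embedding that a
        # frozen LM could consume in place of a word embedding.
        return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
                for row, b_j in zip(W, b)]

    return adapter


adapter = make_adapter(obs_dim=4, embed_dim=8)
emb = adapter([0.5, -1.0, 0.25, 2.0])  # 8-dimensional pseudo-token embedding
```

In practice only the adapter parameters (and possibly the LM's output head) would be trained, which is what lets language-model pre-training act as an initializer for non-linguistic tasks.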
We investigate this question in the setting of learning general-purpose visual representations from a black-box generative model rather than directly from data.
In this paper, we introduce Watch-And-Help (WAH), a challenge for testing social intelligence in agents.
We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment.
A novel network design, the Cascade Segmentation Module, is proposed to parse a scene into stuff, objects, and object parts in a cascade, improving over the baselines.
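The cascade described above can be sketched as a toy decision procedure: stuff labels are decided first, pixels judged to belong to objects are refined by an object stage, and object pixels are further refined into parts. The dictionary-based inputs and the `"object"` placeholder label are assumptions for illustration, not the module's real interface.

```python
def cascade_parse(stuff_scores, object_scores, part_scores):
    """Toy stuff -> object -> part cascade (assumed structure, not the
    paper's implementation). Each argument maps a pixel index to a
    dict of {label: score} produced by that stage."""
    labels = {}
    for px, scores in stuff_scores.items():
        stuff_label = max(scores, key=scores.get)
        if stuff_label != "object":
            # Background stuff is decided by the first stage alone.
            labels[px] = stuff_label
            continue
        # Object pixels get a finer label from the object stage...
        obj = max(object_scores[px], key=object_scores[px].get)
        parts = part_scores.get(px)
        # ...and, where a part head fires, a part label as well.
        labels[px] = (obj, max(parts, key=parts.get)) if parts else obj
    return labels


stuff = {0: {"sky": 0.9, "object": 0.1}, 1: {"sky": 0.2, "object": 0.8}}
objs = {1: {"person": 0.7, "car": 0.3}}
parts = {1: {"head": 0.6, "torso": 0.4}}
result = cascade_parse(stuff, objs, parts)  # pixel 0 -> stuff, pixel 1 -> object+part
```

The design choice the cascade reflects is that later stages only spend capacity on the regions the earlier stages flag, rather than predicting all label granularities everywhere.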
Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets.
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision.