In this paper we argue that embodied multimodal agents, i.e., avatars, can play an important role in moving natural language processing toward “deep understanding.” Fully-featured interactive agents model encounters between two “people,” but a language-only agent has little environmental or situational awareness.
In this paper, we present an approach to grounding linguistic positional and directional labels directly in human motions during a disorienting balancing task in a multi-axis rotational device.
We present a five-year retrospective on the development of the VoxWorld platform. First introduced as a multimodal platform for modeling motion language, VoxWorld has since evolved into a platform for rapidly building and deploying embodied agents with contextual and situational awareness that can interact with humans in multiple modalities and explore their environments.
Loanwords are words incorporated from one language into another without translation.
Annotating cross-document event coreference links is a time-consuming and cognitively demanding task that can compromise annotation quality and efficiency.
We (1) manually transcribe utterances in a dataset of triads collaboratively solving a problem involving dialogue and physical object manipulation, (2) annotate collaborative moves according to these gold-standard transcripts, and then (3) apply these annotations to utterances that have been automatically segmented using toolkits from Google and OpenAI's Whisper.
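As a rough illustration of step (3), the sketch below assigns each gold-standard collaborative-move annotation to the automatically segmented utterance with which it overlaps most in time. The data classes and the maximal-overlap heuristic are illustrative assumptions, not the alignment procedure used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: float  # seconds
    end: float
    label: str    # collaborative move label or transcript text

def overlap(a: Span, b: Span) -> float:
    """Duration (in seconds) shared by two time spans."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def project_annotations(gold: list[Span], auto_segments: list[Span]) -> dict[int, list[str]]:
    """Attach each gold annotation to the ASR segment it overlaps most."""
    projected: dict[int, list[str]] = {i: [] for i in range(len(auto_segments))}
    for ann in gold:
        best = max(range(len(auto_segments)),
                   key=lambda i: overlap(ann, auto_segments[i]))
        if overlap(ann, auto_segments[best]) > 0:
            projected[best].append(ann.label)
    return projected

# Toy example: one gold "offer" move lands in the second ASR segment.
gold = [Span(3.2, 5.0, "offer")]
auto = [Span(0.0, 2.9, "so what do we do"), Span(2.9, 5.1, "maybe put the red block here")]
print(project_annotations(gold, auto))   # {0: [], 1: ['offer']}
```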
We present a novel method for using agent experiences gathered through an embodied simulation to ground contextualized word vectors to object representations.
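One minimal way to picture such a grounding, assuming contextualized token vectors and per-object feature vectors collected from simulation episodes are already available, is to fit a linear map from the word-vector space into the object-representation space. The dimensions, random stand-in data, and ridge-regression choice below are illustrative assumptions, not the method presented in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: contextualized word vectors (e.g., 768-d
# embeddings of object mentions) paired with object representations gathered
# from embodied-simulation episodes (here, a made-up 32-d feature vector).
word_vecs = rng.normal(size=(200, 768))    # one row per mention in context
object_reps = rng.normal(size=(200, 32))   # simulation-derived object features

# Fit a ridge-regularized linear grounding map W: word space -> object space.
lam = 1.0
A = word_vecs.T @ word_vecs + lam * np.eye(768)
W = np.linalg.solve(A, word_vecs.T @ object_reps)   # shape (768, 32)

# Ground a new mention: project its contextual vector into object space and
# retrieve the nearest known object representation.
query = rng.normal(size=(768,))
projected = query @ W
nearest = np.argmin(np.linalg.norm(object_reps - projected, axis=1))
print("closest object instance:", nearest)
```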
Moreover, we show that AxomiyaBERTa can leverage phonological signals for even more challenging tasks, such as a novel cross-document coreference task on a translated version of the ECB+ corpus, on which we present a new SOTA result for a low-resource language.
VoxML is a modeling language used to map natural language expressions into real-time visualizations using commonsense semantic knowledge of objects and events.
Event Coreference Resolution (ECR) is the task of linking mentions of the same event either within or across documents.
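A toy illustration of the ECR setup (not the models evaluated in the paper): score every pair of event mentions, link pairs above a threshold, and take the transitive closure so that linked mentions form coreference clusters. The lemma-overlap scorer and the threshold are stand-in assumptions.

```python
from itertools import combinations

# Each mention is (doc_id, trigger lemma, set of argument lemmas) -- a toy schema.
mentions = [
    ("doc1", "acquire", {"google", "youtube"}),
    ("doc2", "buy",     {"google", "youtube"}),
    ("doc2", "resign",  {"ceo"}),
]

def score(m1, m2) -> float:
    """Stand-in pairwise scorer: argument overlap plus a trigger-match bonus."""
    args1, args2 = m1[2], m2[2]
    jaccard = len(args1 & args2) / len(args1 | args2)
    return jaccard + (0.5 if m1[1] == m2[1] else 0.0)

# Union-find over mention indices gives the transitive closure of pairwise links.
parent = list(range(len(mentions)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

THRESHOLD = 0.8
for i, j in combinations(range(len(mentions)), 2):
    if score(mentions[i], mentions[j]) >= THRESHOLD:
        parent[find(i)] = find(j)

clusters = {}
for i in range(len(mentions)):
    clusters.setdefault(find(i), []).append(i)
print(list(clusters.values()))   # [[0, 1], [2]]: the two acquisition mentions corefer
```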
In this paper, we present methods for two types of metacognitive tasks in an AI system: rapidly expanding a neural classification model to accommodate a new category of object, and recognizing when a novel object type is observed instead of misclassifying the observation as a known class.
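To make the two metacognitive tasks concrete, here is a minimal nearest-centroid sketch, an assumption for illustration rather than the system's actual architecture: a new category is accommodated by storing a centroid computed from a handful of example embeddings, and an observation whose distance to every centroid exceeds a threshold is flagged as a novel object type instead of being forced into a known class.

```python
import numpy as np

class CentroidClassifier:
    """Toy open-set classifier over fixed-length embedding vectors."""

    def __init__(self, novelty_threshold: float):
        self.centroids: dict[str, np.ndarray] = {}
        self.threshold = novelty_threshold

    def add_class(self, name: str, examples: np.ndarray) -> None:
        """Rapidly accommodate a new category from a few example embeddings."""
        self.centroids[name] = examples.mean(axis=0)

    def classify(self, embedding: np.ndarray) -> str:
        """Return the nearest known class, or 'novel' if nothing is close enough."""
        if not self.centroids:
            return "novel"
        dists = {name: np.linalg.norm(embedding - c) for name, c in self.centroids.items()}
        best = min(dists, key=dists.get)
        return best if dists[best] <= self.threshold else "novel"

rng = np.random.default_rng(1)
clf = CentroidClassifier(novelty_threshold=5.0)
clf.add_class("mug", rng.normal(loc=0.0, size=(5, 8)))
clf.add_class("block", rng.normal(loc=5.0, size=(5, 8)))
print(clf.classify(rng.normal(loc=0.0, size=8)))    # likely "mug"
print(clf.classify(rng.normal(loc=20.0, size=8)))   # far from both centroids -> "novel"
```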
In recent years, data-intensive AI, particularly the domain of natural language processing and understanding, has seen significant progress driven by the advent of large datasets and deep neural networks, which have sidelined more classical AI approaches to the field.
We present a new interface for controlling a navigation robot in novel environments using coordinated gesture and language.
In this paper, we present an analysis of computationally generated mixed-modality definite referring expressions using combinations of gesture and linguistic descriptions.
We present an architecture for integrating real-time, multimodal input into a computational agent's contextual model.
In this paper, we argue that simulation platforms enable a novel type of embodied spatial reasoning, one facilitated by a formal model of object and event semantics that renders the continuous quantitative search space of an open-world, real-time environment tractable.
We present a novel approach to introducing new spatial structures to an AI agent, combining deep learning over qualitative spatial relations with various heuristic search algorithms.
Most work within the computational event modeling community has tended to focus on the interpretation and ordering of events that are associated with verbs and event nominals in linguistic expressions.
Simulation and automatic visualization of events from natural language descriptions and supplementary modalities, such as gestures, allows humans to use their native capabilities as linguistic and visual interpreters to collaborate on tasks with an artificial agent or to put semantic intuitions to the test in an environment where user and agent share a common context. In previous work (Pustejovsky and Krishnaswamy, 2014; Pustejovsky, 2013a), we introduced a method for modeling natural language expressions within a 3D simulation environment built on top of the game development platform Unity (Goldstone, 2009).
Human communication is a multimodal activity, involving not only speech and written expressions, but intonation, images, gestures, visual cues, and the interpretation of actions through perception.
The generated simulations act as a conceptual "debugger" for the semantics of different motion verbs: that is, by testing for consistency and informativeness in the model, simulations expose the presuppositions associated with linguistic expressions and their compositions.
This paper introduces the Event Capture Annotation Tool (ECAT), a user-friendly, open-source interface tool for annotating events and their participants in video, capable of extracting the 3D positions and orientations of objects in video captured by Microsoft's Kinect® hardware.
We present the specification for a modeling language, VoxML, which encodes semantic knowledge of real-world objects represented as three-dimensional models, and of events and attributes related to and enacted over these objects.