We investigate a new AI task, Multi-Agent Interactive Question Answering, in which several agents jointly explore a scene in an interactive environment to answer a question.
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the expert's options conditioned on the currently learned policy, and an M-step that updates the agent's low- and high-level policies simultaneously to minimize the newly proposed option-occupancy measure between the expert and the agent.
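The E/M alternation above can be sketched in a tabular setting. Everything below (the state/action/option sizes, random demonstrations, and the normalized-count M-step) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N_OPTIONS, N_STATES, N_ACTIONS = 2, 4, 3

# Hypothetical tabular policies: the high level picks an option given a
# state; the low level picks an action given (state, option).
high = np.full((N_STATES, N_OPTIONS), 1.0 / N_OPTIONS)
low = np.full((N_OPTIONS, N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)

# Toy expert demonstrations: (state, action) pairs with latent options.
demos = list(zip(rng.integers(0, N_STATES, 50),
                 rng.integers(0, N_ACTIONS, 50)))

for _ in range(20):
    counts_high = np.zeros_like(high)
    counts_low = np.zeros_like(low)
    # E-step: posterior over the expert's latent option for each (s, a),
    # conditioned on the current learned policies.
    for s, a in demos:
        post = high[s] * low[:, s, a]
        post /= post.sum()
        counts_high[s] += post
        counts_low[:, s, a] += post
    # M-step: update both policy levels simultaneously toward the expert's
    # option-augmented occupancy (here a simple normalized-count update).
    high = (counts_high + 1e-8) / (counts_high + 1e-8).sum(1, keepdims=True)
    low = (counts_low + 1e-8) / (counts_low + 1e-8).sum(2, keepdims=True)
```

The normalized-count M-step stands in for the paper's occupancy-matching objective; the point is only the structure of the alternation.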
At its core, EIP models the tactile sensor as a group of coordinated particles, and the elastic theory is applied to regulate the deformation of particles during the contact process.
Deep multimodal fusion, which uses multiple sources of data for classification or regression, has exhibited a clear advantage over unimodal counterparts in various applications.
In this paper, we propose Invariance Propagation to focus on learning representations invariant to category-level variations, which are provided by different instances from the same category.
Increasing the depth of GCNs, which is expected to permit more expressivity, has been shown to degrade performance, especially on node classification.
Using a gating mechanism that discriminates unseen samples from seen samples can decompose the Generalized Zero-Shot Learning (GZSL) problem into a conventional Zero-Shot Learning (ZSL) problem and a supervised classification problem.
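This decomposition can be sketched as a simple routing function. The threshold gate and the toy stand-in classifiers below are invented for illustration (none of the names or values come from the paper):

```python
import numpy as np

def gzsl_predict(x, gate, seen_clf, zsl_clf):
    """Route a sample through a gate: samples judged 'seen' go to a
    conventional supervised classifier, the rest to a ZSL classifier."""
    return seen_clf(x) if gate(x) else zsl_clf(x)

# Toy stand-ins: the gate thresholds the max seen-class score, a common
# heuristic for detecting out-of-distribution (unseen-class) samples.
seen_scores = lambda x: np.array([0.1, 0.8]) if x > 0 else np.array([0.3, 0.2])
gate = lambda x: seen_scores(x).max() > 0.5
seen_clf = lambda x: f"seen_{seen_scores(x).argmax()}"
zsl_clf = lambda x: "unseen_semantic_match"
```

For example, `gzsl_predict(1.0, gate, seen_clf, zsl_clf)` routes to the supervised classifier, while a low-confidence sample falls through to the ZSL branch.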
We propose a general method to train a single convolutional neural network which is capable of switching image resolutions at inference.
Embodiment is an important characteristic of all intelligent agents (creatures and robots), yet existing scene-description tasks mainly analyze images passively, separating semantic understanding of the scenario from the agent's interaction with the environment.
In this paper, we present a multimodal mobile teleoperation system that consists of a novel vision-based hand pose regression network (Transteleop) and an IMU-based arm tracking method.
In this paper, we propose a novel task, Manipulation Question Answering (MQA), where the robot performs manipulation actions to change the environment in order to answer a given question.
Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.
The proposed architecture, termed NICE-GAN, exhibits two advantages over previous approaches: first, it is more compact, since no independent encoding component is required; second, the plug-in encoder is trained directly by the adversarial loss, making it more informative and more effectively trained when a multi-scale discriminator is applied.
While almost all state-of-the-art object detectors utilize predefined anchors to enumerate possible locations, scales, and aspect ratios in the search for objects, their performance and generalization ability are also limited by the design of the anchors.
In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations.
In deep CTR models, exploiting users' historical data is essential for learning users' behaviors and interests.
This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations.
Different functional areas of the human brain play different roles in brain activity, a fact that has received insufficient research attention in the brain-computer interface (BCI) field.
In FoveaBox, an instance is assigned to adjacent feature levels to make the model more accurate. We demonstrate its effectiveness on standard benchmarks and report extensive experimental analysis.
In this paper, we present TeachNet, a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands.
In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.
In this paper, we begin by investigating current feature pyramid solutions, and then reformulate feature pyramid construction as a feature reconfiguration process.
First, we model cognitive events based on EEG data by characterizing the data using EEG optical flow, which is designed to preserve multimodal EEG information in a uniform representation.
As a new classification platform, deep learning has recently received increasing attention from researchers and has been successfully applied to many domains.
Herein, we propose a novel approach to modeling cognitive events from EEG data by reducing the task to a video classification problem, designed to preserve the multimodal information of EEG.
Learning and inferring movement is a very challenging problem due to its high dimensionality and its dependence on varied environments and tasks.
The goal of task transfer in reinforcement learning is to migrate an agent's action policy from a source task to a target task.
Linear Dynamical Systems (LDSs) are fundamental tools for modeling spatio-temporal data in various disciplines.
To address (a), we design the reverse connection, which enables the network to detect objects at multiple levels of the CNN.
In this way, the sequential representation of an image can be naturally translated into a sequence of words, serving as the target sequence of the RNN model.
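As a rough illustration of turning an image representation into a word sequence, here is a greedy RNN unroll with random, untrained weights; the vocabulary, weight shapes, and special tokens are all invented for the example and are not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["<start>", "a", "cat", "on", "mat", "<end>"]
H, V = 8, len(VOCAB)

# Hypothetical weights (random here); in practice these are learned.
Wh = rng.normal(size=(H, H))   # hidden-to-hidden
Wx = rng.normal(size=(H, V))   # word-to-hidden
Wo = rng.normal(size=(V, H))   # hidden-to-vocabulary logits

def decode(image_feature, max_len=10):
    """Greedily unroll an RNN: the image feature initializes the hidden
    state, and each step emits the most likely next word."""
    h, token, words = image_feature, VOCAB.index("<start>"), []
    for _ in range(max_len):
        x = np.eye(V)[token]              # one-hot previous word
        h = np.tanh(Wh @ h + Wx @ x)      # recurrent state update
        token = int(np.argmax(Wo @ h))    # greedy word choice
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return words
```

With trained weights, beam search would typically replace the greedy `argmax` step; the structure of the unroll is the same.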
We then devise efficient algorithms to perform sparse coding and dictionary learning on the space of infinite-dimensional subspaces.
To enhance the performance of LDSs, in this paper, we address the challenging issue of performing sparse coding on the space of LDSs, where both data and dictionary atoms are LDSs.
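For intuition, ordinary Euclidean sparse coding over a fixed dictionary can be sketched with ISTA (iterative shrinkage-thresholding); note this toy does not capture the paper's contribution, which is performing sparse coding on the non-Euclidean space of LDSs:

```python
import numpy as np

def ista(D, x, lam=0.1, steps=200):
    """Plain ISTA for sparse coding: minimize over c
    0.5 * ||x - D c||^2 + lam * ||c||_1, with D a fixed dictionary."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ (D @ c - x)              # gradient of the smooth term
        c = c - g / L                      # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0.0)  # soft-threshold
    return c
```

On the LDS space, both the distance term and the dictionary atoms must respect the system geometry, so the gradient and thresholding steps above are only a Euclidean analogue.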
Almost all of the current top-performing object detection networks employ region proposals to guide the search for object instances.