This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency.
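As an illustration of how such learned correspondence features can drive dense label propagation (a minimal NumPy sketch, not the exact formulation of any of the surveyed methods; the temperature value and the shapes are illustrative assumptions):

```python
import numpy as np

def propagate_labels(feat_ref, labels_ref, feat_tgt, temperature=0.07):
    """Propagate per-pixel labels from a reference frame to a target frame.

    feat_ref:   (N, D) L2-normalized features of reference-frame pixels
    labels_ref: (N, C) one-hot or soft labels of reference-frame pixels
    feat_tgt:   (M, D) L2-normalized features of target-frame pixels
    Returns a (M, C) array of soft labels for the target frame.
    """
    sim = feat_tgt @ feat_ref.T / temperature        # (M, N) pairwise similarity
    sim -= sim.max(axis=1, keepdims=True)            # numerical stability
    affinity = np.exp(sim)
    affinity /= affinity.sum(axis=1, keepdims=True)  # softmax over reference pixels
    # Each target pixel's label is an affinity-weighted average of reference labels.
    return affinity @ labels_ref

# Toy usage: 4 reference pixels, 2 target pixels, 8-D features, 3 classes.
rng = np.random.default_rng(0)
f_ref = rng.normal(size=(4, 8)); f_ref /= np.linalg.norm(f_ref, axis=1, keepdims=True)
f_tgt = rng.normal(size=(2, 8)); f_tgt /= np.linalg.norm(f_tgt, axis=1, keepdims=True)
y_ref = np.eye(3)[[0, 1, 2, 0]]
print(propagate_labels(f_ref, y_ref, f_tgt))         # (2, 3) soft labels
```

The quality of the propagated labels then rests entirely on how well the features identify corresponding pixels across frames, which is what the self-supervised signals are meant to teach.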
We show that the emergent communication can be grounded to the agent observations and the spatial structure of the 3D environment.
Next, to tackle harder tracking cases, we mine hard examples from an unlabeled pool of real videos using a tracker trained on our hallucinated video data.
While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards.
We propose a flexible person generation framework called Dressing in Order (DiOr), which supports 2D pose transfer, virtual try-on, and several fashion editing tasks.
However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results.
We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.
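One plausible way to realize this adaptation, sketched below under stated assumptions (the two-layer mapping, the MSE objective, and all names here are illustrative, not necessarily the paper's method): fit a small network that maps old-space features to new-space features using images currently available to both extractors, then apply it to the stored vectors of past classes.

```python
import torch
import torch.nn as nn

# Hypothetical setup: old_feats / new_feats are features of the *currently
# available* images, extracted by the frozen old model and the updated model.
def fit_feature_adapter(old_feats, new_feats, dim, epochs=100, lr=1e-3):
    """Fit a small mapping g: old feature space -> new feature space."""
    g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(g(old_feats), new_feats)
        loss.backward()
        opt.step()
    return g

dim = 64
old = torch.randn(512, dim)                          # old-model features
new = torch.randn(512, dim)                          # new-model features
adapter = fit_feature_adapter(old, new, dim)

# Stored vectors of past classes, whose original images are no longer kept,
# are migrated into the updated feature space by the learned mapping.
stored_old_class_feats = torch.randn(100, dim)
migrated = adapter(stored_old_class_feats).detach()
```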
The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object.
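In standard translation-embedding notation this reads as follows (the projection matrices and translation vector are schematic, following the description above):

```latex
% Subject/object region features x_s, x_o are projected into a shared
% low-dimensional space, where the predicate acts as a translation:
\[
  W_s \mathbf{x}_s + \mathbf{t}_p \;\approx\; W_o \mathbf{x}_o ,
\]
% so a triple (subject, predicate, object) is scored by how small
% \| W_s x_s + t_p - W_o x_o \| is.
```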
Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities.
Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.
Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer.
In addition, for the first time on the visual dialog dataset, we assess the performance of a system that asks questions, and demonstrate how visual dialog can be generated by combining discriminative question generation and question answering.
This work presents a method for adapting a single, fixed deep neural network to multiple tasks without affecting performance on already learned tasks.
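A minimal sketch of one way to realize this with per-task binary masks over frozen weights (the straight-through estimator and the initialization below are illustrative assumptions, not necessarily the paper's exact scheme):

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """A frozen linear layer gated by a learned per-task binary mask.

    The shared weight never changes, so previously learned tasks are
    unaffected; each new task trains only its own real-valued mask scores.
    """
    def __init__(self, linear: nn.Linear, threshold: float = 5e-3):
        super().__init__()
        self.weight = linear.weight.detach()    # frozen shared weights
        self.bias = linear.bias.detach() if linear.bias is not None else None
        # Initialized above threshold so the mask starts as all-ones (assumption).
        self.scores = nn.Parameter(1e-2 * torch.ones_like(self.weight))
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()   # binary mask
        # Straight-through trick: the forward pass uses the hard mask, while
        # gradients flow to the real-valued scores as if the mask were identity.
        mask = hard + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)

layer = MaskedLinear(nn.Linear(16, 8))
out = layer(torch.randn(4, 16))   # (4, 8); only `scores` receives gradients
```

Storing one bit per weight per task is what keeps the per-task overhead small relative to training a separate network.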
This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model.
This paper explores image caption generation using conditional variational auto-encoders (CVAEs).
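For reference, the objective such a model maximizes is the conditional ELBO (standard CVAE notation; the choice of priors, encoders, and decoders is where the paper's specific design lives):

```latex
% The caption x is generated conditioned on the image c via a latent code z:
\[
  \log p_\theta(x \mid c) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, c)}\bigl[\log p_\theta(x \mid z, c)\bigr]
  \;-\; \mathrm{KL}\bigl(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\bigr).
\]
% Sampling different z at test time yields diverse captions for one image.
```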
This paper presents a method for adding multiple tasks to a single deep neural network while avoiding catastrophic forgetting.
This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story.
Image-language matching tasks have recently attracted a lot of attention in the computer vision field.
This work proposes Recurrent Neural Network (RNN) models to predict structured 'image situations' -- actions and noun entities fulfilling semantic roles related to the action.
This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues.
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset.
This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset.
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each.
In this paper, we define a new task, Exact Street to Shop, where our goal is to match a real-world example of a garment item to the same item in an online shop.
We learn to predict 'informative edge' probability maps using two recent methods that exploit local and global context, respectively: structured edge detection forests, and a fully convolutional network for pixelwise labeling.
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities.
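A minimal sketch of such a two-branch architecture (layer widths, input dimensions, and the retrieval-by-cosine usage are illustrative placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Maps image and text features into a shared space via stacked linear
    projections with nonlinearities, one branch per modality."""
    def __init__(self, img_dim=4096, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(), nn.Linear(1024, embed_dim))
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, 1024), nn.ReLU(), nn.Linear(1024, embed_dim))

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that dot products behave like cosine similarity.
        x = F.normalize(self.img_branch(img_feats), dim=-1)
        y = F.normalize(self.txt_branch(txt_feats), dim=-1)
        return x, y

model = TwoBranchEmbedding()
img, txt = model(torch.randn(32, 4096), torch.randn(32, 300))
sim = img @ txt.T   # (32, 32) similarity matrix usable for cross-modal retrieval
# A margin-based ranking loss over matched vs. mismatched pairs is one
# common way to train embeddings of this kind.
```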
The Flickr30k dataset has become a standard benchmark for sentence-based image description.
One of the most promising ways to improve the performance of deep convolutional neural networks is to increase the number of convolutional layers.
This work proposes a method to interpret a scene by assigning a semantic label at every pixel and inferring the spatial extent of individual object instances together with their occlusion relationships.
Deep convolutional neural networks (CNNs) have shown their promise as a universal representation for recognition.
Recent advances in visual recognition indicate that to achieve good retrieval and classification accuracy on large-scale datasets like ImageNet, extremely high-dimensional visual descriptors, e.g., Fisher Vectors, are needed.
This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled.
This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation).
Count- and frequency-valued data typically arise in a large number of vision and text applications where such statistics are used as features.
This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings.
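As a point of reference for what "similar vectors map to similar strings" means operationally, here is a generic sign-of-random-projection (LSH) baseline; it illustrates the goal but is not presented as the paper's own learning procedure:

```python
import numpy as np

def random_hyperplane_codes(X, n_bits=32, seed=0):
    """Generic LSH baseline: each bit is the sign of a random projection,
    so vectors with small angles tend to agree on many bits."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_bits))   # random hyperplane normals
    return (X @ R > 0).astype(np.uint8)         # (n, n_bits) binary codes

def hamming(a, b):
    return int(np.count_nonzero(a != b))

X = np.random.default_rng(1).normal(size=(3, 128))
# Append a slightly perturbed copy of X[0]; its code should stay close to X[0]'s.
X = np.vstack([X, X[0] + 0.01 * np.random.default_rng(2).normal(size=128)])
C = random_hyperplane_codes(X)
print(hamming(C[0], C[3]), hamming(C[0], C[1]))  # near-duplicate vs. unrelated
```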
This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions.
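For context, the classical orthogonal expansion for multivariate binary densities is the Bahadur expansion; whether this is the exact basis used in the paper is an assumption:

```latex
% Let x = (x_1, ..., x_d) with x_i in {0,1}, marginals p_i = P(x_i = 1),
% and standardized variables z_i = (x_i - p_i) / sqrt(p_i (1 - p_i)).
% The joint density is written relative to the independence model as a
% series in orthogonal products of the z_i:
\[
  p(\mathbf{x}) \;=\; \prod_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i}
  \Bigl[\, 1 + \sum_{i < j} r_{ij}\, z_i z_j
        + \sum_{i < j < k} r_{ijk}\, z_i z_j z_k + \cdots \Bigr],
\]
% where the coefficients r_{ij...} = E[z_i z_j ...] can be estimated from
% data and the series truncated for a tractable approximation.
```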