To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
Ranked #1 on Semi-Supervised Video Object Segmentation on MOSE
To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes and combinations of these, further allowing to adjust the strength of each input.
We present a system for collision-free control of a robot manipulator that uses only RGB views of the world.
Generalization bounds which assess the difference between the true risk and the empirical risk, have been studied extensively.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
To study and quantify this concern, we introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features, i. e., modalities.
To address this challenge we propose to reformulate WIE as a context-aware Webpage Object Detection task.
Ranked #1 on Webpage Object Detection on CoVA (using extra training data)
We show that the emergent communication can be grounded to the agent observations and the spatial structure of the 3D environment.
It is fundamental for personal robots to reliably navigate to a specified goal.
OIA models interactions between the sentence-corresponding image and important regions in other images of the sequence.
While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards.
However, regularization with the functional entropy is challenging.
Ranked #3 on Visual Question Answering (VQA) on VQA-CP
To tackle this issue, we propose an energy-based prior defined by the product of a base prior distribution and a reweighting factor, designed to bring the base closer to the aggregate posterior.
Ranked #6 on Image Generation on CelebA 256x256 (FID metric)
However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results.
1 code implementation • 21 Apr 2020 • Mang Tik Chiu, Xingqian Xu, Kai Wang, Jennifer Hobbs, Naira Hovakimyan, Thomas S. Huang, Honghui Shi, Yunchao Wei, Zilong Huang, Alexander Schwing, Robert Brunner, Ivan Dozier, Wyatt Dozier, Karen Ghandilyan, David Wilson, Hyunseong Park, Junhee Kim, Sungho Kim, Qinghui Liu, Michael C. Kampffmeyer, Robert Jenssen, Arnt B. Salberg, Alexandre Barbosa, Rodrigo Trevisan, Bingchen Zhao, Shaozuo Yu, Siwei Yang, Yin Wang, Hao Sheng, Xiao Chen, Jingyi Su, Ram Rajagopal, Andrew Ng, Van Thong Huynh, Soo-Hyung Kim, In-Seop Na, Ujjwal Baid, Shubham Innani, Prasad Dutande, Bhakti Baheti, Sanjay Talbar, Jianyu Tang
The first Agriculture-Vision Challenge aims to encourage research in developing novel and effective algorithms for agricultural pattern recognition from aerial images, especially for the semantic segmentation task associated with our challenge dataset.
In many vision-based reinforcement learning (RL) problems, the agent controls a movable object in its visual field, e. g., the player's avatar in video games and the robotic arm in visual grasping and manipulation.
2 code implementations • • Mang Tik Chiu, Xingqian Xu, Yunchao Wei, Zilong Huang, Alexander Schwing, Robert Brunner, Hrant Khachatrian, Hovnatan Karapetyan, Ivan Dozier, Greg Rose, David Wilson, Adrian Tudor, Naira Hovakimyan, Thomas S. Huang, Honghui Shi
To encourage research in computer vision for agriculture, we present Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns.
no code implementations • 14 Dec 2019 • Dawit Belayneh, Federico Carminati, Amir Farbin, Benjamin Hooberman, Gulrukh Khattak, Miaoyuan Liu, Junze Liu, Dominick Olivito, Vitória Barin Pacela, Maurizio Pierini, Alexander Schwing, Maria Spiropulu, Sofia Vallecorsa, Jean-Roch Vlimant, Wei Wei, Matt Zhang
These networks can serve as fast and computationally light methods for particle shower simulation and reconstruction for current and future experiments at particle colliders.
Despite impressive recent progress that has been reported on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.
For joint inference over multiple variables, a variety of structured prediction techniques have been developed to model correlations among variables and thereby improve predictions.
%In fact, contrastive disentanglement and unsupervised recovery are often combined in that we seek additional variations that exhibit salient factors/properties.
Through unsupervised clustering, supervised partitioning, and a zero-shot-like generalization analysis we show that our word embeddings complement text-only embeddings like GloVe by better representing similarities and differences between visual concepts that are difficult to obtain from text corpora alone.
We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future.
Generative adversarial nets (GANs) and variational auto-encoders have significantly improved our distribution modeling capabilities, showing promise for dataset augmentation, image-to-image translation and feature learning.
Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities.
We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities.
Ranked #1 on Visual Dialog on VisDial v0.9 val
We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches.
Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation.
Distributed training of deep nets is an important technique to address some of the present day computing challenges like memory consumption and computational demands.
Deep structured models are widely used for tasks like semantic segmentation, where explicit correlations between variables provide important prior information which generally helps to reduce the data needs of deep nets.
We achieve the trifecta: (1) High accuracy for the diverse captions as evaluated by standard captioning metrics and user studies; (2) Faster computation of diverse captions compared to beam search and diverse beam search; and (3) High diversity as evaluated by counting novel sentences, distinct n-grams and mutual overlap (i. e., mBleu-4) scores.
While this is particularly true for early GAN formulations, there has been significant empirically motivated and theoretically founded progress to improve stability, for instance, by using the Wasserstein distance rather than the Jenson-Shannon divergence.
In addition, for the first time on the visual dialog dataset, we assess the performance of a system asking questions, and demonstrate how visual dialog can be generated from discriminative question generation and question answering.
Ranked #7 on Visual Dialog on VisDial v0.9 val
In recent years significant progress has been made in image captioning, using Recurrent Neural Networks powered by long-short-term-memory (LSTM) units.
We start from linear discriminators in which case conjugate duality provides a mechanism to reformulate the saddle point objective into a maximization problem, such that both the generator and the discriminator of this 'dualing GAN' act in concert.
Generating diverse questions for given images is an important task for computational education, entertainment and AI assistants.
no code implementations • 9 Sep 2015 • Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dieter Hendricks, Nancy Reid
The need for new methods to deal with big data is a common theme in most scientific fields, although its definition tends to vary with the context.
In this paper we derive an efficient algorithm to learn the parameters of structured predictors in general graphical models.