Existing work in hierarchical reinforcement learning provides agents with structural representations of subtasks but is not affordance-aware. By grounding our definition of hierarchical affordances in the present state, our approach is more flexible than the many approaches that ground their subtask dependencies in a symbolic history.
Reinforcement learning frequently involves intractable integrals, for example when computing expectations in policy evaluation and policy iteration.
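As a minimal illustration of the kind of expectation involved, such integrals are commonly replaced by sample averages. The sketch below (the `mc_policy_value` helper and the toy two-action policy are illustrative, not from any of the papers listed) estimates the expected reward under a stochastic policy by Monte Carlo:

```python
import random

def mc_policy_value(policy, reward, num_samples=10_000, seed=0):
    """Monte Carlo estimate of E_{a ~ policy}[reward(a)].

    `policy` is a list of (action, probability) pairs; `reward`
    maps an action to a scalar. The intractable expectation is
    replaced by a sample average over drawn actions.
    """
    rng = random.Random(seed)
    actions = [a for a, _ in policy]
    probs = [p for _, p in policy]
    total = 0.0
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs, k=1)[0]
        total += reward(a)
    return total / num_samples

# Toy policy: expected reward is 0.3 * 1.0 + 0.7 * 0.0 = 0.3
estimate = mc_policy_value([("a", 0.3), ("b", 0.7)],
                           lambda a: 1.0 if a == "a" else 0.0)
```

With 10,000 samples the estimate concentrates tightly around the true expectation of 0.3.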
We term this approach Co-training Videos and Images for Action Recognition (CoVeR).
Generic unstructured neural networks have been shown to struggle on out-of-distribution compositional generalization.
Many types of physics-informed neural network models have been proposed in recent years as approaches for learning solutions to differential equations.
We propose to address this problem by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge.
We analyze the grounded SCAN (gSCAN) benchmark, which was recently proposed to study systematic generalization for grounded language understanding.
Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs such as books or article collections.
This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision.
Identifying a short segment in a long video that semantically matches a text query is a challenging task with important potential applications in language-based video search, browsing, and navigation.
Summarization is the task of compressing source document(s) into coherent and succinct passages.
Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe them.
Collectively, the POLL problem setting, the Firehose datasets, and the ConGraD algorithm enable a complete benchmark for reproducible research on web-scale continual learning.
Many methods have been proposed to quantify the predictive uncertainty associated with the outputs of deep neural networks.
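One widely used family of such methods measures the disagreement among an ensemble of independently trained models. The sketch below (the `ensemble_predict` helper and the toy linear "models" are illustrative assumptions) uses the standard deviation of ensemble predictions as a simple uncertainty proxy:

```python
import statistics

def ensemble_predict(models, x):
    """Mean prediction and a simple uncertainty estimate from an
    ensemble of independently trained models.

    `models` is a list of callables x -> prediction; the spread
    (population std. dev.) of their outputs serves as a proxy for
    predictive uncertainty at the input x.
    """
    preds = [m(x) for m in models]
    return statistics.fmean(preds), statistics.pstdev(preds)

# Toy ensemble: three slightly different linear fits to the same data
models = [lambda x: 2.0 * x + 0.1,
          lambda x: 2.1 * x - 0.1,
          lambda x: 1.9 * x + 0.0]
mean, std = ensemble_predict(models, 1.0)
```

Inputs where the ensemble members disagree strongly receive a large `std`, flagging predictions that should not be trusted.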
Multi-agent settings in the real world often involve tasks with varying types and quantities of agents and non-agent entities; however, common patterns of behavior often emerge among these agents/entities.
To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially.
To narrate a sequence of images, we use the predicted anchor word embeddings and the image features as the joint input to a seq2seq model.
On the other hand, we have only just started to understand and analyze how they are able to adapt quickly to new tasks.
Meta-learning methods, most notably Model-Agnostic Meta-Learning (MAML; Finn et al., 2017), have achieved great success in adapting to new tasks quickly after having been trained on similar tasks.
In this paper, we propose a new decoder where the output summary is generated by conditioning on both the input text and the latent topics of the document.
Solving tasks with sparse rewards is one of the most important challenges in reinforcement learning.
Many few-shot learning methods address this challenge by learning an instance embedding function from seen classes and applying it to instances from unseen classes with limited labels.
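A common instantiation of this idea classifies a query by its distance to per-class mean embeddings, as in prototypical networks. The sketch below (names and the toy 2-D embeddings are illustrative, not from the paper) shows the nearest-class-mean rule, assuming embeddings have already been computed:

```python
def prototype_classify(support, query_embedding):
    """Nearest-class-mean classification in an embedding space.

    `support` maps each class label to a list of embedding vectors
    (plain tuples here) for its few labeled examples; the query is
    assigned to the class whose mean embedding (prototype) is closest.
    """
    def mean(vectors):
        n = len(vectors)
        return tuple(sum(v[i] for v in vectors) / n
                     for i in range(len(vectors[0])))

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    prototypes = {label: mean(vecs) for label, vecs in support.items()}
    return min(prototypes,
               key=lambda lbl: sq_dist(prototypes[lbl], query_embedding))

# Two unseen classes, two labeled examples each
support = {"cat": [(0.9, 0.1), (1.1, -0.1)],
           "dog": [(-1.0, 0.0), (-0.8, 0.2)]}
label = prototype_classify(support, (0.8, 0.0))
```

The quality of the result hinges entirely on the embedding function learned from the seen classes; the classification rule itself requires no further training.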
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings.
Our approach learns textual and visual representations jointly: latent visual factors couple together a skip-gram model for co-occurrence in linguistic data and a generative latent variable model for visual data.
The key idea is to complement the discriminative losses with another loss that measures whether the predicted summary preserves the same information as the original video.
These properties make the approach particularly appealing for transfer learning for open-ended Visual QA, where the source dataset on which the model is learned has limited overlap with the target dataset in the space of answers.
Analogous to domain adaptation for visual recognition, this setting is appealing when the target dataset does not have a sufficient amount of labeled data to learn an "in-domain" model.
We apply the procedures to reconstruct decoy answers for two popular Visual QA datasets as well as to create a new Visual QA dataset from the Visual Genome project, resulting in the largest dataset for this task.
We advocate that holistic inference of image concepts provides valuable information for detailed pixel labeling.
13 Jan 2017 • Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurélien Bellet, Linxi Fan, Michael Collins, Daniel Hsu, Brian Kingsbury, Michael Picheny, Fei Sha
First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection.
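For context, the random features in question approximate a kernel with an explicit, finite-dimensional map, so that inner products of features approximate kernel evaluations. The sketch below shows the standard random Fourier feature construction for a Gaussian (RBF) kernel (a generic baseline, not the paper's selection method; function and variable names are my own):

```python
import math
import random

def random_fourier_features(x, weights, offsets):
    """Map x (a tuple) to features z such that z(x)·z(y) approximates
    the RBF kernel exp(-||x - y||^2 / 2)."""
    d = len(weights)
    return [math.sqrt(2.0 / d) *
            math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)]

rng = random.Random(0)
dim, num_features = 2, 2000
# Gaussian frequencies and uniform phase offsets, per Rahimi & Recht
weights = [tuple(rng.gauss(0.0, 1.0) for _ in range(dim))
           for _ in range(num_features)]
offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]

x, y = (0.5, 0.0), (0.0, 0.5)
zx = random_fourier_features(x, weights, offsets)
zy = random_fourier_features(y, weights, offsets)
approx = sum(a * b for a, b in zip(zx, zy))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
```

The approximation error shrinks roughly as one over the square root of the number of features, which is why reducing the number of features required (as the paper proposes) matters for efficiency.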
We introduce a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identifying the most suitable text describing a scene, given several similar options.
Accurately measuring the similarity between text documents lies at the core of many real-world applications of machine learning.
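As a simple baseline for such similarity measures, documents are often compared by the cosine of their TF-IDF vectors. The sketch below (a generic bag-of-words baseline, not the paper's method; helper names are my own) makes the idea concrete:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (as dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * idf[w] for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat".split(),
        "the cat sat on the rug".split(),
        "stock markets fell sharply today".split()]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # near-paraphrases
sim_far = cosine(vecs[0], vecs[2])    # unrelated topics
```

Bag-of-words cosine captures lexical overlap only; documents with no shared words score zero even when semantically related, which motivates richer document-distance measures.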
We advocate that high-recall holistic inference of image concepts provides valuable information for detailed pixel labeling.
Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision.
We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots.
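For contrast with the supervised approach, a common unsupervised baseline selects keyframes greedily for diversity: repeatedly pick the frame farthest from everything already selected. The sketch below (an illustrative baseline, not the paper's supervised method; names and toy features are my own) shows this farthest-point strategy:

```python
def select_keyframes(frames, k):
    """Greedy diversity-based keyframe selection.

    `frames` is a list of per-frame feature vectors (tuples).
    Starting from the first frame, we repeatedly add the frame whose
    minimum distance to the already-selected set is largest.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    selected = [0]  # seed with the first frame
    while len(selected) < k:
        best = max((i for i in range(len(frames)) if i not in selected),
                   key=lambda i: min(sq_dist(frames[i], frames[j])
                                     for j in selected))
        selected.append(best)
    return sorted(selected)

# Five frames: two near-duplicate pairs plus one outlier
frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
keyframes = select_keyframes(frames, 3)
```

The greedy rule skips near-duplicate frames and covers the distinct visual content, which is the behavior a learned, supervised selector tries to improve upon.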
Leveraging class semantic descriptions and examples of known objects, zero-shot learning makes it possible to train a recognition model for an object class whose examples are not available.
Zero-shot learning (ZSL) methods have been studied in the unrealistic setting where test data are assumed to come from unseen classes only.
We study large-scale kernel methods for acoustic modeling and compare to DNNs on performance metrics related to both acoustic modeling and recognition.
Video summarization has become indispensable for digesting, browsing, and searching today's ever-growing video collections.
Given semantic descriptions of object classes, zero-shot learning aims to accurately recognize objects of the unseen classes, from which no examples are available at the training stage, by associating them to the seen classes, from which labeled examples are provided.
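In its simplest form, this association assigns an image to the unseen class whose semantic description vector is most similar to the image's embedding in a shared space learned from seen classes. The sketch below (a generic nearest-description baseline; the helper name and toy attribute vectors are illustrative assumptions) makes the rule concrete:

```python
import math

def zero_shot_classify(image_embedding, class_descriptions):
    """Assign an image to the unseen class whose semantic description
    vector has the highest cosine similarity to the image embedding,
    assuming both live in a shared space learned from seen classes.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    return max(class_descriptions,
               key=lambda c: cosine(image_embedding, class_descriptions[c]))

# Toy attribute vectors: (striped, four-legged, aquatic)
descriptions = {"zebra": (1.0, 1.0, 0.0),
                "dolphin": (0.0, 0.0, 1.0)}
label = zero_shot_classify((0.9, 0.8, 0.1), descriptions)
```

No labeled examples of "zebra" or "dolphin" are needed at training time; only their semantic descriptions and an embedding function learned from other classes.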
In this paper, we propose an image caption system that exploits the parallel structures between images and sentences.
The computational complexity of kernel methods has often been a major barrier for applying them to large-scale learning problems.
Extensive empirical studies validate our contributions, including applications on challenging document and video summarization, where flexibility in modeling the kernel matrix and balancing different errors is indispensable.
Existing methods to learn visual attributes are prone to learning the wrong thing---namely, properties that are correlated with the attribute of interest among training samples.
We propose a new approach for metric learning by framing it as learning a sparse combination of locally discriminative metrics that are inexpensive to generate from the training data.
We further show that the communication cost of dFW is optimal by deriving a lower-bound on the communication cost required to construct an $\epsilon$-approximate solution.
By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other; by maximum learnability, we ensure that a strong discriminative model can be learned from each domain.
Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components' local similarity values.
On various benchmark data sets, we demonstrate that these methods not only match the current state of the art in kNN classification error but, in the case of χ2-LMNN, obtain the best results in 19 out of 20 learning settings.
Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM).
In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses.
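A standard kernel-based measure of (in)dependence is the Hilbert–Schmidt Independence Criterion (HSIC), which is near zero for independent variables and grows with dependence. The sketch below implements the usual biased empirical estimator with Gaussian kernels (a generic estimator, not necessarily the exact variant used in the paper; names are my own):

```python
import math

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between two scalar samples of equal length.

    Builds Gaussian Gram matrices for xs and ys, double-centers them,
    and returns trace(K~ L~) / n^2. Values near zero suggest
    independence; larger values indicate dependence.
    """
    n = len(xs)

    def gram(zs):
        return [[math.exp(-(a - b) ** 2 / (2 * sigma ** 2)) for b in zs]
                for a in zs]

    def center(K):
        row = [sum(r) / n for r in K]
        tot = sum(row) / n
        return [[K[i][j] - row[i] - row[j] + tot for j in range(n)]
                for i in range(n)]

    K, L = center(gram(xs)), center(gram(ys))
    return sum(K[i][j] * L[j][i]
               for i in range(n) for j in range(n)) / n ** 2

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [3.0, 0.0, 6.0, 1.0, 7.0, 2.0, 5.0, 4.0]  # scrambled pairing
dep = hsic(xs, xs)  # perfectly dependent
ind = hsic(xs, ys)  # dependence broken by scrambling
```

Dimension-reduction frameworks of this kind search for projections of the covariates that keep such dependence with the response as large as possible.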
By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification.
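To illustrate how topic mixture proportions compress a document, the sketch below performs a single soft-assignment (E-step-like) pass under given topic–word distributions and uniform priors; the result is a low-dimensional vector usable as a document representation. The `topic_proportions` helper and toy topics are illustrative assumptions, not the paper's model:

```python
def topic_proportions(doc, topics):
    """Soft-assign each token to a topic under uniform topic priors,
    then average the assignments into mixture proportions.

    `doc` is a list of tokens; `topics` maps a topic name to a
    {word: probability} distribution. Unseen words get a tiny
    floor probability to avoid division by zero.
    """
    names = list(topics)
    counts = {t: 0.0 for t in names}
    for word in doc:
        likes = [topics[t].get(word, 1e-9) for t in names]
        z = sum(likes)
        for t, like in zip(names, likes):
            counts[t] += like / z
    n = len(doc)
    return {t: counts[t] / n for t in names}

topics = {"sports": {"game": 0.5, "team": 0.5},
          "finance": {"stock": 0.5, "market": 0.5}}
props = topic_proportions("game team stock game".split(), topics)
```

The resulting proportions (here roughly 3/4 "sports", 1/4 "finance") summarize the document in as many dimensions as there are topics, which is the representation a downstream classifier would consume.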