In this work, we propose applying abstract meaning representation (AMR) based semantic parsing models to parse textual descriptions of a visual scene into scene graphs; to the best of our knowledge, this is the first work to do so.
We present a new form of ensemble method, Devil's Advocate, which uses a deliberately dissenting model to force the other submodels within the ensemble to collaborate better.
Recent GAN-based text-to-image generation models have advanced to the point where they can generate photo-realistic images that semantically match their descriptions.
Furthermore, the qualitative analysis shows that the unadapted VG model often fails to find correct objects due to a strong bias learned from the pre-training data.
We evaluate our method on the first-person video benchmark dataset TREK-150 and on a custom dataset, RMOT-223, that we collected from a UR5e robot.
Specifically, our model learns to predict the target moment from the joint probability of the given query and the complement of negative queries for each candidate frame.
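As a rough formalization (the notation here is ours, not necessarily the paper's): for a candidate frame $f_t$, a given query $q$, and negative queries $\{\bar{q}_j\}$, the moment score could take the form

$$ s(f_t) = P(q \mid f_t) \prod_j \bigl(1 - P(\bar{q}_j \mid f_t)\bigr), $$

so a frame is favored when it matches the given query while failing to match the negative queries.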
Tasks that involve interaction with various targets are called multi-target tasks.
Our method, coined Learning by Sketching (LBS), learns to convert an image into a set of colored strokes that explicitly incorporate the geometric information of the scene in a single inference step without requiring a sketch dataset.
Experiments on standard benchmarks demonstrate the effectiveness of the method, in particular when label noise complicates the identification of bias-conflicting examples.
In Self-Supervised Learning (SSL), it is known that frequent collisions, in which a target sample and its negative samples share the same class, can decrease performance.
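As a back-of-the-envelope illustration (ours, not the paper's): with $C$ balanced classes and $K$ negatives drawn uniformly at random, each negative shares the target's class with probability $1/C$, so the expected number of collisions per target is

$$ \mathbb{E}[\#\,\text{collisions}] = \frac{K}{C}, $$

which grows quickly as batches get larger or the number of classes shrinks.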
Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query.
Recently, adversarial imitation learning has emerged as a scalable reward acquisition method for inverse reinforcement learning (IRL) problems.
To this end, we design a simple yet effective two-stage scene graph parsing framework utilizing abstract meaning representation, SGRAM (Scene GRaph parsing via Abstract Meaning representation): 1) transforming a textual description of an image into an AMR graph (Text-to-AMR) and 2) feeding the AMR graph to a Transformer-based language model to generate a scene graph (AMR-to-SG).
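As a schematic illustration only, a minimal sketch of this two-stage pipeline (the stub functions and toy outputs below are hypothetical stand-ins for the seq2seq models, not the released implementation):

```python
def text_to_amr(caption: str) -> str:
    """Stage 1 (Text-to-AMR): parse the caption into a linearized AMR graph.
    A real system would call a pretrained seq2seq AMR parser here."""
    # Toy output for the caption "a boy throws a ball":
    return "(throw-01 :ARG0 (boy) :ARG1 (ball))"

def amr_to_scene_graph(amr: str) -> list:
    """Stage 2 (AMR-to-SG): a Transformer-based language model would take the
    linearized AMR and generate scene-graph triples; we hard-code one triple
    to show the target format."""
    return [("boy", "throws", "ball")]

if __name__ == "__main__":
    amr = text_to_amr("a boy throws a ball")
    print(amr_to_scene_graph(amr))  # [('boy', 'throws', 'ball')]
```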
We compare our approach with Unlikelihood (UL) training in a text continuation task on commonsense natural language inference (NLI) corpora to show which method better models coherence by avoiding unlikely continuations.
We present an automated learning framework for a robotic sketching agent that is capable of learning stroke-based rendering and motor control simultaneously.
The initial years of an infant's life are known as the critical period, during which neural plasticity significantly impacts the overall development of learning ability.
Additionally, we propose an aligned cross-modal representation learning method that learns semantic representations of visual objects and words in a self-supervised manner based on cross-modal relational graph networks.
As a result, GST scales the amount of training data up to an order of magnitude beyond VisDial (from 1.2M to 12.9M QA pairs).
Knowledge-based visual question answering (QA) aims to answer a question which requires visually-grounded external knowledge beyond image content itself.
To validate this hypothesis, we adapt this notion of critical periods to learning in AI agents and investigate critical periods for AI agents in a virtual environment.
Face-swapping models have been drawing attention for their compelling generation quality, but their complex architectures and loss functions often require careful tuning for successful training.
Learning in a multi-target environment without prior knowledge about the targets requires a large number of samples and makes generalization difficult.
In this paper, we propose the Video Turing Test to provide effective and practical assessments of video understanding intelligence as well as human-likeness evaluation of AI agents.
In this paper, we challenge existing multiple-choice video question answering by reformulating it as open-ended video question answering.
Then we propose a top-down evaluation system for VideoQA, based on the cognitive process of humans and story elements: Cognitive Modules for Evaluation (CogME).
MASN consists of a motion module, an appearance module, and a motion-appearance fusion module.
Assessing advertisements, specifically on the basis of user preferences and ad quality, is crucial to the marketing industry.
One of the inherent limitations of current AI systems, stemming from passive learning mechanisms (e.g., supervised learning), is that they perform well on labeled datasets but cannot deduce knowledge on their own.
To further investigate the effectiveness of our proposed method, we evaluate our approach on a real-world problem, image retrieval with visual scene graphs.
The formulation draws a strong connection between adversarial learning and energy-based reinforcement learning; thus, the architecture is capable of recovering a reward function that induces a multi-modal policy.
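One well-known instance of such a connection, shown here purely as an illustration and not necessarily this paper's exact formulation, is the AIRL-style discriminator that embeds a learned reward-like function $f_\theta$ against the current policy $\pi$:

$$ D_\theta(s, a) = \frac{\exp(f_\theta(s, a))}{\exp(f_\theta(s, a)) + \pi(a \mid s)}, $$

so that at optimality $f_\theta$ behaves as an energy-based (maximum-entropy) reward.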
Through the re-training process, some of the noise can be compensated for, while other noise can be utilized to learn better representations.
Active learning is widely used to reduce labeling effort and training time by repeatedly querying only the most beneficial samples from unlabeled data.
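For concreteness, a minimal uncertainty-sampling sketch of one such query step (a common acquisition rule shown as an assumption; the sklearn-style `predict_proba` interface and pool names are illustrative, not this paper's method):

```python
import numpy as np

def select_queries(model, pool_x, k=10):
    """Pick the k most uncertain unlabeled samples to query for labels."""
    probs = model.predict_proba(pool_x)    # (n_pool, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)  # least-confident score per sample
    return np.argsort(-uncertainty)[:k]    # indices of the k most uncertain samples
```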
In the end, we show not only that we can build a better machine training framework based on the human experiment results, but also that the human results can be empirically confirmed through imitated machine experiments: human-like active learning has a crucial effect on learning performance.
Inspired by recent trends in vision and language learning, we explore applications of attention mechanisms for visio-lingual fusion in the context of story-based video understanding.
In experiments, the integrated scene graph is applied to image-caption retrieval as a downstream task.
We propose an in silico molecular associative memory model for pattern learning, storage, and denoising using a Pairwise Markov Random Field (PMRF) model.
Despite recent progress in computer vision and natural language processing, developing a machine that can understand video stories remains hard due to their intrinsic difficulty.
Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context.
Here, we propose Cut-Based Graph Learning Networks (CB-GLNs) for learning video data by discovering these complex structures of the video.
Next, we gradually add random noise to the word representations and repeat the training process from scratch, initializing with the noised word representations.
While image quality has a crucial influence, auxiliary information about ad images, such as tags and target subjects, can also determine image preference.
We present a generative adversarial network (GAN) that conducts manifold learning and alignment (MLA): the task of learning the multi-manifold structure underlying data and aligning those manifolds without any correspondence information.
Generative replay (GR) is a method to alleviate catastrophic forgetting in continual learning (CL) by generating previous task data and learning it together with the data from new tasks.
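A minimal sketch of one generative-replay update, under assumed interfaces (the `old_generator.sample(n)` helper, returning inputs and pseudo-labels for past tasks, is hypothetical):

```python
import torch

def generative_replay_step(model, old_generator, new_batch, opt, replay_ratio=1.0):
    """Mix real new-task data with samples replayed from a generator
    trained on previous tasks, then take one classifier update."""
    x_new, y_new = new_batch
    n_replay = int(len(x_new) * replay_ratio)
    x_old, y_old = old_generator.sample(n_replay)  # replayed past-task data
    x = torch.cat([x_new, x_old])
    y = torch.cat([y_new, y_old])
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```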
However, most sequential data, such as videos, have complex temporal dependencies that imply variable-length semantic flows and their compositions, which are hard to capture with conventional methods.
We present an encoder-powered generative adversarial network (EncGAN) that is able to learn both the multi-manifold structure and the abstract features of data.
Problem difficulty was operationalized by the number of carries involved in solving a given problem.
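For concreteness, a small helper (ours, not from the paper) that computes this difficulty measure; for example, 12 + 34 involves no carries while 58 + 67 involves two:

```python
def count_carries(a: int, b: int) -> int:
    """Count carry operations when adding two non-negative integers digit by digit."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = 1 if (a % 10 + b % 10 + carry) >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries

assert count_carries(12, 34) == 0  # no carries
assert count_carries(58, 67) == 2  # two carries
```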
Video understanding is emerging as a new paradigm for studying human-like AI.
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
Exploiting deep generative models' remarkable ability to learn the data-manifold structure, some recent studies have proposed geometric data interpolation methods based on geodesic curves on the learned data manifold.
While conventional methods for sequential learning focus on interactions between consecutive inputs, we propose a new method that captures composite semantic flows with variable-length dependencies.
Ablation studies confirm that the dual attention mechanism combined with late fusion achieves the best performance.
The task of multi-image cued story generation, as in the visual storytelling dataset (VIST) challenge, is to compose multiple coherent sentences from a given sequence of images.
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly.
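A minimal single-glimpse sketch of a bilinear attention map in this spirit (the dimensions, projections, and softmax normalization below are our assumptions for illustration, not the exact BAN implementation):

```python
import torch
import torch.nn.functional as F

class BilinearAttention(torch.nn.Module):
    """Single-glimpse bilinear attention over (object, token) pairs.
    X: visual features (B, n_obj, d_v); Y: question features (B, n_tok, d_q)."""
    def __init__(self, d_v, d_q, d_h):
        super().__init__()
        self.U = torch.nn.Linear(d_v, d_h)  # projects visual features
        self.V = torch.nn.Linear(d_q, d_h)  # projects question features
        self.p = torch.nn.Linear(d_h, 1)    # pools the joint space to a logit

    def forward(self, X, Y):
        # Low-rank bilinear pooling: every (object, token) pair interacts
        # through a Hadamard product in the projected space.
        joint = self.U(X).unsqueeze(2) * self.V(Y).unsqueeze(1)  # (B, n_obj, n_tok, d_h)
        logits = self.p(joint).squeeze(-1)                       # (B, n_obj, n_tok)
        B, n_obj, n_tok = logits.shape
        # Normalize over all object-token pairs to obtain the attention map.
        return F.softmax(logits.view(B, -1), dim=-1).view(B, n_obj, n_tok)
```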
Goal-oriented dialogue tasks occur when a questioner asks an action-oriented question and an answerer responds with the intent of letting the questioner know a correct action to take.
The parameter domain of the loss surface can be decomposed into regions in which activation values (zero or one for rectified linear units) are consistent.
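As a one-unit illustration (our notation, not the paper's): for $f_{v,w}(x) = v \max(0, wx)$ on a fixed dataset $\{x_i\}_{i=1}^n$, the parameter space decomposes into regions indexed by activation patterns $s \in \{0, 1\}^n$:

$$ \mathcal{R}_s = \{(v, w) : \mathbf{1}[w x_i > 0] = s_i \ \text{for all } i\}. $$

Within each $\mathcal{R}_s$, every rectified linear unit is fixed to one branch, so the network output, and hence the loss, is a smooth function of the parameters there.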
Kim et al. (2016) show that the Hadamard product in multimodal deep networks, well known as a joint function in visual question answering tasks, implicitly performs an attentional mechanism over visual inputs.
The game involves two players: a Teller and a Drawer.
In multi-agent cooperative task experiments, our model learns 20% faster than the existing state-of-the-art model.
This is mainly due to 1) the reconstruction of video stories in a combined scene-dialogue form that utilizes the latent embedding, and 2) the attention mechanism.
Catastrophic forgetting is a problem in which a neural network loses information about the first task after being trained on a second task.
To address this issue, the subgoal and option frameworks have been proposed.
However, in most service robot applications, the user needs to move so that the robot can see them face to face.
We propose a model called the composite generative adversarial network, which reveals the complex structure of images using multiple generators, each of which generates a part of the image.
We present Multimodal Residual Networks (MRN) for multimodal residual learning of visual question answering, extending the idea of deep residual learning.
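A sketch of one multimodal residual block in this spirit (the layer shapes and tanh nonlinearities below are illustrative assumptions, not the exact MRN architecture):

```python
import torch

class MultimodalResidualBlock(torch.nn.Module):
    """One residual block joining a question vector q (d_q) and a visual
    vector v (d_v) into a d_h-dimensional joint representation."""
    def __init__(self, d_q, d_v, d_h):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_q, d_h)
        self.v_proj = torch.nn.Linear(d_v, d_h)
        self.shortcut = torch.nn.Linear(d_q, d_h)

    def forward(self, q, v):
        # Joint residual function: element-wise (Hadamard) product of
        # nonlinear projections of the two modalities.
        joint = torch.tanh(self.q_proj(q)) * torch.tanh(self.v_proj(v))
        # The residual (shortcut) connection carries the question signal forward.
        return self.shortcut(q) + joint
```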
The proposed architecture consists of deep representation learners and fast learnable shallow kernel networks, both of which synergize to track the information of new data.
We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification.
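In a typical local-metric formulation (a generic form, stated as an assumption rather than this paper's exact definition), the squared distance depends on the query point:

$$ d^2(x, x') = (x - x')^\top A(x)\,(x - x'), \qquad A(x) \succeq 0, $$

where choosing a constant $A$ recovers the standard global Mahalanobis metric.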