Topic models have long been prevalent for discovering latent semantics when modeling long documents.
Scene Text Recognition (STR) models have achieved high performance in recent years on benchmark datasets where text images are presented with minimal noise.
However, even in paired video-text segments, only a subset of the frames are semantically relevant to the corresponding text, with the remainder representing noise; the proportion of noisy frames grows with video length.
We propose a novel, generative adversarial framework for probing and improving these models' reasoning capabilities.
1 code implementation • Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details.
In traditional Visual Question Generation (VQG), most images have multiple concepts (e.g. objects and categories) for which a question could be generated, but models are trained to mimic an arbitrary choice of concept as given in their training data.
Experiments on Visual Question Answering as a downstream task demonstrate the effectiveness of the proposed generative model, which is able to improve strong UpDn-based models to achieve state-of-the-art performance.
This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the textual input is produced.
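One widely used fixed baseline policy in SiMT, against which adaptive policies are typically compared, is wait-k: read k source tokens, then alternate between writing one target token and reading one more source token. The sketch below is illustrative only (the function name and signature are not from the paper):

```python
def wait_k_schedule(num_source_tokens, num_target_tokens, k):
    """For a fixed wait-k policy, return how many source tokens have
    been read before each target token is written."""
    schedule = []
    for t in range(num_target_tokens):
        # Read k tokens up front, then read one more per written token,
        # capped at the full source length.
        schedule.append(min(k + t, num_source_tokens))
    return schedule
```

An adaptive policy replaces this fixed schedule with a learned read/write decision at each step, which is where additional context such as visual input can help.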
Current work on Visual Question Answering (VQA) explores deterministic approaches conditioned on various types of image and question features.
In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
Deep learning approaches for Visual-Inertial Odometry (VIO) have proven successful, but they rarely focus on incorporating robust fusion strategies for dealing with imperfect input sensory data.
Due to the sparse rewards and high degree of environment variation, reinforcement learning approaches such as Deep Deterministic Policy Gradient (DDPG) are plagued by issues of high variance when applied in complex real-world environments.
Inertial information processing plays a pivotal role in ego-motion awareness for mobile agents, as inertial measurements are entirely egocentric and not environment dependent.
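The egocentric nature of inertial sensing is what makes classical dead reckoning possible: acceleration is integrated twice to recover position, with no reference to the environment. A minimal 1-D Euler-integration sketch (names are illustrative, not from any specific system):

```python
def dead_reckon(accels, dt, v0=0.0, p0=0.0):
    """Integrate 1-D accelerometer samples twice (Euler steps) to
    track velocity and position from egocentric measurements only."""
    v, p = v0, p0
    positions = []
    for a in accels:
        v += a * dt  # acceleration -> velocity
        p += v * dt  # velocity -> position
        positions.append(p)
    return positions
```

The double integration also explains the core difficulty: a constant sensor bias produces position error that grows quadratically with time, which is one motivation for learned inertial models.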
In this framework, real images are first converted to a synthetic domain representation that reduces complexity arising from lighting and texture.
We compare and analyze sequential, random access, and stack memory architectures for recurrent neural network language models.
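The three architectures differ in how the controller addresses memory: sequentially (tape-like), by content or position (random access), or last-in-first-out (stack). A toy discrete stack memory illustrates the LIFO access pattern (this is a simplification; the compared models use differentiable, soft versions of these operations):

```python
def stack_memory_read(ops):
    """Toy discrete stack memory: a controller emits push/pop actions,
    and every read sees only the top of the stack (LIFO access)."""
    stack, reads = [], []
    for op, val in ops:
        if op == "push":
            stack.append(val)
        elif op == "pop" and stack:
            stack.pop()
        reads.append(stack[-1] if stack else None)
    return reads
```

LIFO access is a natural fit for nested structure (e.g. matching brackets or centre-embedded clauses), which bounded sequential memories handle poorly.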
Topic models have been widely explored as probabilistic generative models of documents.
Ranked #2 on Topic Models on 20NewsGroups
Developing a dialogue agent that is capable of making autonomous decisions and communicating by natural language is one of the long-term goals of machine learning research.
In this work we explore deep generative models of text in which the latent representation of a document is itself drawn from a discrete language model distribution.
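The generative story of such a model can be caricatured in a few lines: draw a discrete latent variable, then draw each word from a distribution conditioned on it. This toy sketch uses a single categorical latent and independent word draws, which is a deliberate simplification of the paper's model; all names here are illustrative:

```python
import math
import random

def generate_document(theta, vocab, topic_word_logits, length, seed=0):
    """Toy generative story for a latent-variable document model:
    draw a discrete latent z ~ Categorical(theta), then draw each
    word i.i.d. from softmax(topic_word_logits[z])."""
    rng = random.Random(seed)
    z = rng.choices(range(len(theta)), weights=theta)[0]
    # Numerically stable softmax over the vocabulary for topic z.
    logits = topic_word_logits[z]
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    return [rng.choices(vocab, weights=probs)[0] for _ in range(length)]
```

Inference then inverts this story: a neural variational posterior approximates p(z | document), trained by maximising an evidence lower bound.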
We validate this framework on two very different text modelling applications, generative document modelling and supervised question answering.
Ranked #1 on Question Answering on QASent
This paper presents novel Bayesian optimisation algorithms for minimum error rate training of statistical machine translation systems.
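The core loop of Bayesian optimisation is: fit a surrogate to the error rates observed at previously evaluated weight settings, then pick the next setting by trading off predicted error against uncertainty. The sketch below stands in a kernel-smoothed mean and a crude distance-based uncertainty bonus for the Gaussian-process posterior and acquisition function a real system would use; everything here is an illustrative assumption, not the paper's algorithm:

```python
import math

def propose_next_weight(observed, candidates, beta=1.0):
    """Toy Bayesian-optimisation step for tuning one MT feature weight.
    observed: list of (weight, error_rate) pairs already evaluated.
    Returns the candidate minimising a lower-confidence-bound-style
    acquisition: predicted error minus an exploration bonus."""
    def acquisition(w):
        # Kernel similarities to each observed point (RBF-like).
        ks = [math.exp(-(w - wo) ** 2) for wo, _ in observed]
        mean = sum(k * e for k, (_, e) in zip(ks, observed)) / sum(ks)
        uncertainty = 1.0 - max(ks)  # far from all observations -> high
        return mean - beta * uncertainty  # lower is better (minimise error)
    return min(candidates, key=acquisition)
```

With a reasonable exploration weight, the step prefers unexplored regions until their plausible best stops beating the incumbent, which is how BO can tune MERT weights with few expensive decoder runs.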