We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset.
Decomposing a scene into its shape, reflectance and illumination is a fundamental problem in computer vision and graphics.
The general approach is to embed both textual and visual information into a common space (the grounded space), confined by an explicit relationship between both modalities.
This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination.
Knowledge about the hidden factors that determine particular system dynamics is crucial for both explaining them and pursuing goal-directed interventions.
The novel DISTributed Artificial neural Network Architecture (DISTANA) is a generative, recurrent graph convolutional neural network.
Extensive experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images.
We introduce a distributed spatio-temporal artificial neural network architecture (DISTANA).
Approximate nearest neighbor (ANN) search in high dimensions is an integral part of several computer vision systems and gains importance in deep learning with explicit memory representations.
Traditional convolution layers are specifically designed to exploit the natural data representation of images -- a fixed and regular grid.
As handheld video cameras are now commonplace and available in every smartphone, images and videos can be recorded almost anywhere at any time.
We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization.
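As background for the extension mentioned above, the following is a minimal sketch of standard Product Quantization itself, not the paper's method: vectors are split into subvectors, each subvector is quantized against a per-subspace codebook, and query-to-database distances are computed asymmetrically by summing per-subspace lookups. For brevity the codebooks here are random samples from the data; real PQ trains them with k-means per subspace. All function names are illustrative, not from the paper.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length subvectors."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebooks(data, m, k):
    """Toy 'training': sample k codewords per subspace from the data.
    (Standard PQ would run k-means in each subspace instead.)"""
    books = []
    for j in range(m):
        subs = [split(v, m)[j] for v in data]
        books.append(random.sample(subs, k))
    return books

def encode(vec, books):
    """Compress a vector to m codeword indices (one per subspace)."""
    return [min(range(len(bk)), key=lambda i: sqdist(sub, bk[i]))
            for sub, bk in zip(split(vec, len(books)), books)]

def adc(query, code, books):
    """Asymmetric distance: exact query vs. quantized database vector."""
    return sum(sqdist(sub, bk[c])
               for sub, c, bk in zip(split(query, len(books)), code, books))

# Usage: encode a small random dataset and find the nearest code to a query.
random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(100)]
books = train_codebooks(data, m=4, k=16)
codes = [encode(v, books) for v in data]
query = data[0]
nearest = min(range(len(codes)), key=lambda i: adc(query, codes[i], books))
```

The asymmetric step is what makes PQ fast in practice: the per-subspace distances from the query to every codeword can be precomputed once into a lookup table, after which each database vector costs only m table lookups and additions.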
Rating how aesthetically pleasing an image appears is a highly complex matter and depends on a large number of different visual factors.
Aligning video sequences is a fundamental yet still unsolved component for a broad range of applications in computer graphics and vision.
Specifically, transfer learning from the task of object recognition is exploited to train more effective features for material classification.
The proposed approach reliably detects roads with and without lane markings and thus increases the robustness and availability of road course estimations and augmented reality navigation.
This paper proposes a method for transferring the RGB color spectrum to near-infrared (NIR) images using deep multi-scale convolutional neural networks.
Videos consisting of thousands of high-resolution frames are challenging for existing structure-from-motion (SfM) and simultaneous localization and mapping (SLAM) techniques.