We propose the Video Language Co-Attention Network (VLCN), a novel memory-enhanced model for Video Question Answering (VideoQA).
Emotional expressions are inherently multimodal, integrating facial behavior, speech, and gaze, but their automatic recognition is often limited to a single modality, e.g. speech during a phone call.
We propose the Unified Model of Saliency and Scanpaths (UMSS), a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations.
We present VQA-MHUG, a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker.
We present the Multimodal Human-like Attention Network (MULAN), the first method for multimodal integration of human-like attention on image and text during training of VQA models.
The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer.
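To make this kind of encoder concrete, below is a minimal sketch of a convolutional encoder that extracts image features and emits a single-channel activation map per input crop; the class name `FaceActivationEncoder` and all layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FaceActivationEncoder(nn.Module):
    """Hypothetical sketch: a convolutional encoder that extracts image
    features and predicts a single-channel activation map for each face
    crop. Layer sizes are illustrative, not the paper's actual design."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # image feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, 1)           # 1x1 conv -> activation map

    def forward(self, x):                         # x: (B, 3, H, W) face crops
        feats = self.features(x)                  # (B, 64, H, W) features
        return torch.sigmoid(self.head(feats))    # (B, 1, H, W) per-face map
```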
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP).
We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural network (CNN), and XLNet Transformer architectures.
With an ever-increasing number of mobile devices competing for our attention, quantifying when, how often, or for how long users visually attend to their devices has emerged as a core challenge in mobile human-computer interaction.
Moreover, we discuss how our method enables the calculation of additional attention metrics that, for the first time, enable researchers from different domains to study and quantify attention allocation during mobile interactions in the wild.
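As a toy illustration of such attention metrics, here is a sketch in Python that derives glance counts and dwell times from per-frame attention labels; the function name, input format, and metric definitions are assumptions for illustration, not the paper's actual formulations.

```python
from itertools import groupby

def attention_metrics(timestamps, attending):
    """Illustrative attention metrics from per-frame labels (assumed format):
    timestamps -- monotonically increasing frame times in seconds
    attending  -- booleans, True while the user visually attends the device
    These metric definitions are assumptions, not the paper's exact ones."""
    dwells = []
    for is_on, frames in groupby(zip(timestamps, attending), key=lambda p: p[1]):
        frames = list(frames)
        if is_on:                                   # one contiguous glance
            dwells.append(frames[-1][0] - frames[0][0])
    total = sum(dwells)
    return {
        "num_glances": len(dwells),                 # how often users attend
        "total_dwell_s": total,                     # for how long in total
        "mean_dwell_s": total / len(dwells) if dwells else 0.0,
    }
```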
Conventional feature-based and model-based gaze estimation methods have proven to perform well in settings with controlled illumination and specialized cameras.
Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze.
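Such evaluations conventionally report the mean angular error between predicted and ground-truth 3D gaze directions. A minimal sketch of this standard metric follows; the function name and the (N, 3) array layout are assumptions.

```python
import numpy as np

def angular_error_deg(pred, true):
    """Mean angular error (degrees) between predicted and ground-truth
    3D gaze direction vectors -- the standard gaze estimation metric.
    pred, true: arrays of shape (N, 3); need not be unit length."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```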
Such visual decoding is challenging for two reasons: 1) the search target only resides in the user's mind as a subjective visual pattern that most often cannot even be described verbally by the person, and 2) it is, as yet, unclear whether gaze fixations contain sufficient information for this task at all.
We present GazeDirector, a new approach for eye gaze redirection that uses model-fitting.
Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts.
Predicting the target of visual search from eye fixation (gaze) data is a challenging problem with many applications in human-computer interaction.
Common computational methods for automated eye movement detection, i.e. the task of detecting different types of eye movement in a continuous stream of gaze data, are limited in that they either involve thresholding on hand-crafted signal features, require a separate detector for each movement type, or require pre-segmented data.
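For concreteness, below is a minimal sketch of the classic velocity-threshold (I-VT) approach, an example of the hand-crafted thresholding baseline this sentence refers to, not the proposed method; the 50 deg/s threshold is an illustrative default.

```python
import numpy as np

def ivt_classify(x, y, t, velocity_threshold=50.0):
    """Minimal velocity-threshold (I-VT) detector -- an example of
    thresholding on a hand-crafted signal feature (gaze velocity).
    x, y: gaze positions in degrees of visual angle; t: times in seconds.
    Returns one label per sample: 'fixation' or 'saccade'."""
    vel = np.hypot(np.diff(x), np.diff(y)) / np.diff(t)  # deg/s per step
    vel = np.append(vel, vel[-1])                        # pad to input length
    return np.where(vel < velocity_threshold, "fixation", "saccade")
```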
We show that our retrieval system can cope with this variability using personalisation through an online learning-based retrieval formulation.
3D gaze information is important for scene-centric attention analysis, but accurate estimation and analysis of 3D gaze in real-world environments remain challenging.
We further study the influence of image resolution, vision aids, and recording location (indoor vs. outdoor) on pupil detection performance.
Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation.
An increasing number of works explore collaborative human-computer systems in which human gaze is used to enhance computer vision systems.
Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets have been collected under controlled laboratory conditions and methods have not been evaluated across multiple datasets.
Previous work on predicting the target of visual search from human fixations only considered closed-world settings in which training labels are available and predictions are performed for a known set of potential targets.
Commercial head-mounted eye trackers provide useful features to customers in industry and research but are expensive and rely on closed source hardware and software.
We describe Ubic, a framework that allows users to bridge the gap between digital cryptography and the physical world.