In this paper, we tackle these limitations for the specific problem of few-shot full 3D head reconstruction, by endowing coordinate-based representations with a probabilistic shape prior that enables faster convergence and better generalization when using few input images (down to three).
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation.
Transfer learning approaches can reduce the data requirements of deep learning algorithms.
Recent work has addressed the generation of human poses represented by 2D/3D coordinates of human joints for sign language.
The task of video object segmentation with referring expressions (language-guided VOS) is to generate, given a linguistic phrase and a video, binary masks for the object to which the phrase refers.
Our method consists of first predicting pseudo-masks for the unlabeled pool of samples, together with a score that predicts the quality of each mask.
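A minimal sketch of how such a quality score could drive sample selection, assuming a hypothetical `predict_with_score` interface and a budget-based ranking (neither detail is specified above):

```python
import numpy as np

def select_for_annotation(model, unlabeled_images, budget):
    """Rank unlabeled samples by predicted mask quality and return the
    `budget` samples the model is least confident about."""
    scores = []
    for image in unlabeled_images:
        # predict_with_score is a hypothetical interface returning a
        # pseudo-mask and a scalar estimate of its quality.
        mask, quality = model.predict_with_score(image)
        scores.append(quality)
    order = np.argsort(scores)  # lowest predicted quality first
    return [unlabeled_images[i] for i in order[:budget]]
```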
Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: images and their spoken and textual narratives.
We perform an extensive evaluation of skill discovery methods on controlled environments and show that EDL offers significant advantages, such as overcoming the coverage problem, reducing the dependence of learned skills on the initial state, and allowing the user to define a prior over which behaviors should be learned.
The goal of this work is to segment the objects in an image that are referred to by a sequence of linguistic descriptions (referring expressions).
This work addresses the challenge of hate speech detection in Internet memes and, unlike any previous work to our knowledge, attempts to use visual information to detect hate speech automatically.
This paper investigates modifying an existing neural network architecture for static saliency prediction using two types of recurrences that integrate information from the temporal domain.
Methods that move towards less supervised scenarios are key for image segmentation, as dense labels demand significant human intervention.
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker.
Multi-object video object segmentation is a challenging task, especially in the zero-shot case, where no object mask is given at the initial frame and the model has to find the objects to be segmented along the sequence.
Our system predicts ingredients as sets by means of a novel architecture, modeling their dependencies without imposing any order, and then generates cooking instructions by attending to both the image and its inferred ingredients simultaneously.
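A minimal sketch of order-invariant set prediction as multi-label classification: one sigmoid per ingredient, so no ordering is imposed. For brevity it drops the dependency modeling mentioned above, and the feature dimension, vocabulary size, and 0.5 threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

class IngredientSetHead(nn.Module):
    """Multi-label head: one sigmoid per ingredient, so predictions form
    a set with no ordering imposed."""
    def __init__(self, feat_dim, vocab_size):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, image_features):
        return torch.sigmoid(self.classifier(image_features))

head = IngredientSetHead(feat_dim=512, vocab_size=1000)  # sizes are illustrative
probs = head(torch.randn(4, 512))   # (batch, vocab): membership probabilities
ingredient_sets = probs > 0.5       # threshold into binary set indicators
```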
Evolution Strategies (ES) emerged as a scalable alternative to popular Reinforcement Learning (RL) techniques, providing an almost perfect speedup when distributed across hundreds of CPU cores thanks to a reduced communication overhead.
We introduce PathGAN, a deep neural network for visual scanpath prediction trained on adversarial examples.
This work adapts a deep neural model for image saliency prediction to the temporal domain of egocentric video.
Adaptive Computation Time for Recurrent Neural Networks (ACT) is one of the most promising architectures for variable computation.
We aim to tackle a novel task in action detection: Online Detection of Action Start (ODAS) in untrimmed, streaming videos.
We present a recurrent model for semantic instance segmentation that sequentially generates binary masks and their associated class probabilities for every object in an image.
A fully automatic technique for segmenting the liver and localizing its unhealthy tissues is a convenient tool for diagnosing hepatic diseases and assessing the response to the corresponding treatments.
This work explores attention models to weight the contribution of local convolutional representations for the instance search task.
We propose a novel Active Learning framework capable of effectively training a convolutional neural network for semantic segmentation of medical images with a limited amount of labeled training data.
We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph.
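A minimal sketch of the skip-update idea behind such a model: a learned gate decides at each step whether to compute a new hidden state or copy the previous one. The hard thresholding here is a simplification (the published model uses a straight-through estimator and a cumulative update probability), and for clarity the new state is always computed rather than actually skipped:

```python
import torch
import torch.nn as nn

class SkipGRUCell(nn.Module):
    """GRU cell with a binary gate that either updates the hidden state
    or copies it, effectively skipping steps in the unrolled graph."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.update_gate = nn.Linear(hidden_size, 1)

    def forward(self, x, h):
        p_update = torch.sigmoid(self.update_gate(h))  # update probability
        u = (p_update > 0.5).float()                   # hard gate (simplified)
        h_new = self.cell(x, h)                        # always computed in this sketch
        return u * h_new + (1.0 - u) * h               # update or copy the state
```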
This work aims at disentangling the contributions of the 'adjectives' and 'nouns' in the visual prediction of ANPs.
This paper introduces an unsupervised framework to extract semantically rich features for video representation.
The first part of the network consists of a model trained to generate saliency volumes, whose parameters are fit by back-propagation from a binary cross-entropy (BCE) loss computed over downsampled versions of the saliency volumes.
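A sketch of such an objective, assuming saliency volumes shaped (batch, time, H, W) and average-pooling as the downsampling step; both are illustrative choices rather than the exact training setup:

```python
import torch
import torch.nn.functional as F

def downsampled_bce(pred_volume, gt_volume, factor=4):
    """pred_volume, gt_volume: (batch, time, H, W), values in [0, 1].
    Both are pooled to a coarser grid before the BCE is computed."""
    pred = F.avg_pool2d(pred_volume, kernel_size=factor)
    gt = F.avg_pool2d(gt_volume, kernel_size=factor)
    return F.binary_cross_entropy(pred, gt)
```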
In this paper, we go beyond this spatial information and propose a local-aware encoding of convolutional features based on semantic information predicted in the target image.
We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples.
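A sketch of an adversarially trained saliency objective in this spirit: a content BCE term on the predicted map plus an adversarial term that rewards predictions the discriminator labels as real. The weight `alpha` and the discriminator interface are assumptions, not SalGAN's exact formulation:

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_map, gt_map, disc_on_fake, alpha=0.05):
    """Content BCE on the saliency map plus an adversarial term that
    pushes the discriminator output on fakes toward 'real' (1)."""
    content = F.binary_cross_entropy(pred_map, gt_map)
    adversarial = F.binary_cross_entropy(
        disc_on_fake, torch.ones_like(disc_on_fake))
    return alpha * content + adversarial
```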
We argue that, while this loss seems unavoidable when working with large numbers of object candidates, the much smaller number of region proposals generated by our reinforcement learning agent makes it feasible to extract features for each location without sharing convolutional computation among regions.
This thesis report studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework.
This thesis explores different approaches that use Convolutional and Recurrent Neural Networks to classify and temporally localize activities in videos; furthermore, an implementation that achieves this is proposed.
This work presents a retrieval pipeline and evaluation scheme for the problem of finding the last appearance of personal objects in a large dataset of images captured from a wearable camera.
This work explores the suitability for instance retrieval of image- and region-wise representations pooled from an object detection CNN such as Faster R-CNN.
This work proposes a simple instance retrieval pipeline based on encoding the convolutional features of a CNN using the bag-of-words (BoW) aggregation scheme.
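A sketch of BoW aggregation over convolutional features, assuming a k-means codebook fit offline; the codebook size is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def fit_codebook(local_descriptors, n_words=1000):
    """local_descriptors: (num_descriptors, channels) conv activations
    pooled from many images; n_words is an illustrative choice."""
    return MiniBatchKMeans(n_clusters=n_words, n_init=3).fit(local_descriptors)

def bow_encode(feature_map, codebook):
    """feature_map: (channels, H, W) conv activations for one image."""
    c, h, w = feature_map.shape
    descriptors = feature_map.reshape(c, h * w).T   # one descriptor per location
    words = codebook.predict(descriptors)           # nearest visual word
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-8)     # L2-normalized BoW vector
```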
Visual multimedia has become an inseparable part of our digital social lives, and it often captures moments tied with deep affections.
The prediction of salient areas in images has been traditionally addressed with hand-crafted features based on neuroscience principles.
Our solution is based on the combination of visual features extracted from convolutional neural networks with temporal information using a hierarchical classifier scheme.
This paper explores the potential of brain-computer interfaces in segmenting objects from images.