The task of video object segmentation with referring expressions (language-guided VOS) is to, given a linguistic phrase and a video, generate binary masks for the object to which the phrase refers.
Ranked #1 on Referring Expression Segmentation on A2Dre test
Our method consists in first predicting pseudo-masks for the unlabeled pool of samples, together with a score predicting the quality of the mask.
Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth.
This paper presents a study showing the benefits of the EfficientNet models compared with heavier Convolutional Neural Networks (CNNs) in the Document Classification task, essential problem in the digitalization process of institutions.
Ranked #1 on Multi-Modal Document Classification on Tobacco-3482
We perform an extensive evaluation of skill discovery methods on controlled environments and show that EDL offers significant advantages, such as overcoming the coverage problem, reducing the dependence of learned skills on the initial state, and allowing the user to define a prior over which behaviors should be learned.
Methods that move towards less supervised scenarios are key for image segmentation, as dense labels demand significant human intervention.
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker.
Evolution Strategies (ES) emerged as a scalable alternative to popular Reinforcement Learning (RL) techniques, providing an almost perfect speedup when distributed across hundreds of CPU cores thanks to a reduced communication overhead.
We present a recurrent model for semantic instance segmentation that sequentially generates binary masks and their associated class probabilities for every object in an image.
A fully automatic technique for segmenting the liver and localizing its unhealthy tissues is a convenient tool in order to diagnose hepatic diseases and assess the response to the according treatments.
We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph.
This paper introduces an unsupervised framework to extract semantically rich features for video representation.
We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples.
We argue that, while this loss seems unavoidable when working with large amounts of object candidates, the much more reduced amount of region proposals generated by our reinforcement learning agent allows considering to extract features for each location without sharing convolutional computation among regions.