Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives, as well as the availability of vast collections of digital cooking recipes and food images for training machine learning models.
Our method consists of first predicting pseudo-masks for the unlabeled pool of samples, together with a score that predicts the quality of each mask.
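As an illustrative sketch only (the paper learns the quality score; here it is approximated by mean softmax confidence, and `pseudo_mask_and_score` / `select_reliable` are hypothetical helper names):

```python
import numpy as np

def pseudo_mask_and_score(probs: np.ndarray):
    """Given per-pixel class probabilities of shape (C, H, W), return a
    hard pseudo-mask and a scalar confidence score for the whole mask."""
    pseudo_mask = probs.argmax(axis=0)         # (H, W) predicted labels
    confidence = probs.max(axis=0).mean()      # mean per-pixel max-probability
    return pseudo_mask, float(confidence)

def select_reliable(prob_maps, keep_ratio=0.5):
    """Rank the unlabeled pool by score and keep the most reliable masks."""
    scored = [pseudo_mask_and_score(p) for p in prob_maps]
    order = np.argsort([-s for _, s in scored])  # descending by score
    k = int(len(scored) * keep_ratio)
    return [(i, scored[i][0]) for i in order[:k]]  # (pool index, pseudo-mask)
```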
In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019.
Methods that move towards less supervised scenarios are key for image segmentation, as dense labels demand significant human intervention.
In this paper, we identify an important reproducibility challenge in the image-to-set prediction literature that impedes fair comparison among published methods: researchers use different evaluation protocols to assess their contributions.
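For illustration, a per-sample set F1 is one common protocol choice, and discrepancies often arise from details such as averaging per sample versus globally over all items; a minimal sketch under that assumption:

```python
def set_f1(pred: set, gt: set) -> float:
    """F1 between a predicted label set and a ground-truth label set."""
    if not pred and not gt:
        return 1.0  # both empty: perfect agreement by convention
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```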
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker.
Multi-object video object segmentation is a challenging task, especially in the zero-shot case, where no object mask is given at the initial frame and the model must find the objects to be segmented along the sequence.
Our system predicts ingredients as sets by means of a novel architecture that models their dependencies without imposing any order, and then generates cooking instructions by attending to both the image and its inferred ingredients simultaneously.
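A minimal sketch of order-free set decoding, assuming independent per-ingredient sigmoid probabilities (the paper's decoder is a learned architecture; `decode_ingredient_set` and the toy vocabulary are illustrative):

```python
import numpy as np

def decode_ingredient_set(logits, vocab, threshold=0.5):
    """Order-free set decoding: keep every ingredient whose
    independent (sigmoid) probability exceeds a threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return {vocab[i] for i, p in enumerate(probs) if p >= threshold}

vocab = ["salt", "flour", "egg", "sugar", "butter"]
print(decode_ingredient_set([2.1, -0.3, 1.5, -1.0, 0.8], vocab))
# -> {'salt', 'egg', 'butter'}: no ordering is imposed on the output
```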
In this paper, we introduce Recipe1M+, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images.
We present a recurrent model for semantic instance segmentation that sequentially generates binary masks and their associated class probabilities for every object in an image.
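A minimal PyTorch sketch of the idea, assuming a toy encoder and fixed-size mask logits (the actual model is more elaborate; this only illustrates the one-object-per-step recurrence):

```python
import torch
import torch.nn as nn

class RecurrentInstanceSegmenter(nn.Module):
    """Sketch: encode the image once, then emit one object per LSTM step
    as (mask logits, class logits, stop probability)."""
    def __init__(self, n_classes=20, feat_dim=64, hidden=128, out_hw=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.rnn = nn.LSTMCell(feat_dim * 8 * 8, hidden)
        self.mask_head = nn.Linear(hidden, out_hw * out_hw)  # per-step binary mask
        self.cls_head = nn.Linear(hidden, n_classes)         # per-step class scores
        self.stop_head = nn.Linear(hidden, 1)                # when to stop emitting
        self.out_hw = out_hw

    def forward(self, image, max_steps=5):
        feats = self.encoder(image).flatten(1)               # (B, feat_dim*8*8)
        h = feats.new_zeros(feats.size(0), self.rnn.hidden_size)
        c = torch.zeros_like(h)
        outputs = []
        for _ in range(max_steps):                           # one instance per step
            h, c = self.rnn(feats, (h, c))
            mask = self.mask_head(h).view(-1, self.out_hw, self.out_hw)
            outputs.append((mask, self.cls_head(h), torch.sigmoid(self.stop_head(h))))
        return outputs
```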
In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over one million cooking recipes and 800k food images.
This thesis explores different approaches that use Convolutional and Recurrent Neural Networks to classify and temporally localize activities in videos; furthermore, an implementation to achieve this is proposed.
This work explores the suitability for instance retrieval of image- and region-wise representations pooled from an object detection CNN such as Faster R-CNN.
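A simplified sketch of region-wise pooling (average pooling over box crops of a conv feature map rather than the detector's actual RoI pooling layer; `region_pooled_descriptors` is a hypothetical helper):

```python
import numpy as np

def region_pooled_descriptors(feat_map, boxes):
    """Pool one descriptor per region for retrieval: average the conv
    activations inside each non-empty box, then L2-normalize.
    feat_map: (C, H, W) conv features; boxes: (x0, y0, x1, y1) in
    integer feature-map coordinates with x1 > x0 and y1 > y0."""
    descs = []
    for x0, y0, x1, y1 in boxes:
        region = feat_map[:, y0:y1, x0:x1]            # crop the region
        d = region.mean(axis=(1, 2))                  # C-dim descriptor
        descs.append(d / (np.linalg.norm(d) + 1e-8))  # L2-normalize
    return np.stack(descs)
```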
This work proposes a simple instance retrieval pipeline based on encoding the convolutional features of a CNN using the bag-of-words (BoW) aggregation scheme.
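A minimal sketch of the aggregation, assuming a visual-word codebook learned offline with k-means (`bow_descriptor` is an illustrative helper, not the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_descriptor(conv_feats, kmeans):
    """Bag of words over local CNN features: treat each spatial activation
    as a C-dim local descriptor, assign it to its nearest cluster, and
    build a normalized histogram of visual-word assignments."""
    C, H, W = conv_feats.shape
    local = conv_feats.reshape(C, H * W).T            # (H*W, C) local descriptors
    words = kmeans.predict(local)                     # visual-word assignments
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

# The codebook would be fit offline on local features from many images, e.g.:
# kmeans = KMeans(n_clusters=1000).fit(all_local_feats)
```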
Visual media are powerful means of expressing emotions and sentiments.
This paper explores processing techniques to deal with noisy data in crowdsourced object segmentation tasks.
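One common baseline for such processing is per-pixel majority voting over annotator masks; a minimal sketch, not necessarily the technique the paper settles on:

```python
import numpy as np

def majority_vote_mask(masks, min_votes=None):
    """Fuse noisy crowdsourced binary masks of one object: a pixel is kept
    as foreground if at least half of the annotators marked it."""
    masks = np.asarray(masks, dtype=np.int32)  # (N, H, W) binary masks
    if min_votes is None:
        min_votes = masks.shape[0] / 2.0
    return (masks.sum(axis=0) >= min_votes).astype(np.uint8)
```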
Our solution is based on the combination of visual features extracted from convolutional neural networks with temporal information using a hierarchical classifier scheme.
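A minimal sketch of that combination, assuming per-frame CNN features and generic scikit-learn-style classifiers (`temporal_pool`, `coarse_clf`, and `fine_clfs` are hypothetical names):

```python
import numpy as np

def temporal_pool(frame_feats, window=16):
    """Average per-frame CNN features over fixed temporal windows,
    turning a (T, D) sequence into a shorter sequence of clip features."""
    T = len(frame_feats)
    return np.stack([frame_feats[t:t + window].mean(axis=0)
                     for t in range(0, T, window)])

def hierarchical_predict(clip_feat, coarse_clf, fine_clfs):
    """Two-level scheme: a coarse classifier picks an activity group,
    then that group's fine classifier picks the specific activity."""
    group = coarse_clf.predict([clip_feat])[0]
    return group, fine_clfs[group].predict([clip_feat])[0]
```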
We show that it is indeed possible to detect such objects in complex images, and that users with previous knowledge of the dataset or experience with the RSVP paradigm outperform others.