1 code implementation • 29 Jul 2022 • Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara
In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts.
Ranked #22 on Cross-Modal Retrieval on COCO 2014
1 code implementation • 21 Feb 2022 • Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities.
no code implementations • 14 Jul 2021 • Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara
Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation.
no code implementations • 2 Jun 2021 • Marco Cagrandi, Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara
In this paper, we present a novel approach for novel object captioning (NOC) that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly.
no code implementations • 27 Apr 2020 • Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering.
2 code implementations • CVPR 2020 • Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding.
Ranked #2 on Image Captioning on MS COCO
no code implementations • International Conference on Image Analysis and Processing 2019 • Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara
As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios.
1 code implementation • 5 Mar 2019 • Matteo Stefanini, Riccardo Lancellotti, Lorenzo Baraldi, Simone Calderara
The experiments compare our proposal with state-of-the-art solutions available in the literature, demonstrating that our proposal achieves better performance.