1 code implementation • 19 Mar 2025 • Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model.
1 code implementation • 15 Mar 2025 • Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara
We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space.
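As an illustration of the geometric idea, the sketch below (a minimal example, not the authors' implementation) maps Euclidean embeddings onto the Poincaré ball with the exponential map at the origin and measures the hyperbolic distance between two points; entailment-style hierarchies typically place general concepts near the origin and specific ones towards the boundary. The function names are hypothetical.

```python
import torch

def project_to_poincare_ball(v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball (curvature c = 1)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Geodesic distance between points inside the unit Poincare ball."""
    sq_diff = (x - y).pow(2).sum(dim=-1)
    denom = (1 - x.pow(2).sum(dim=-1)) * (1 - y.pow(2).sum(dim=-1))
    return torch.acosh(1 + 2 * sq_diff / denom.clamp_min(eps))

# Toy usage: a "safe" point kept near the origin, an "unsafe" one pushed outwards.
safe = project_to_poincare_ball(0.1 * torch.randn(8))
unsafe = project_to_poincare_ball(2.0 * torch.randn(8))
print(poincare_distance(safe, unsafe))
```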
1 code implementation • 3 Mar 2025 • Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Cross-modal retrieval is becoming increasingly effective and attracting growing interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs.
no code implementations • 26 Dec 2024 • Roberto Amoroso, Gengyuan Zhang, Rajat Koner, Lorenzo Baraldi, Rita Cucchiara, Volker Tresp
This progress is largely driven by the effective alignment between visual data and the language space of MLLMs.
1 code implementation • 12 Dec 2024 • Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words".
no code implementations • 4 Dec 2024 • Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs.
1 code implementation • 28 Nov 2024 • Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings.
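A rough sketch of this kind of attention-guided alignment is given below, with dummy tensors standing in for DINOv2 patch features, a [CLS]-to-patch attention map, and a projected text embedding; the loss simply pulls the attention-weighted visual pooling towards the text vector, which is an illustrative assumption rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins: 256 patch features (dim 768), a CLS-to-patch attention map,
# and a text embedding already projected to the same dimensionality.
patch_feats = torch.randn(1, 256, 768)      # DINOv2-like patch tokens
cls_attention = torch.rand(1, 256)          # attention of [CLS] over patches
text_embed = torch.randn(1, 768)            # textual embedding

# Normalize the attention map and pool the patches it highlights.
weights = cls_attention.softmax(dim=-1).unsqueeze(-1)   # (1, 256, 1)
pooled_visual = (weights * patch_feats).sum(dim=1)      # (1, 768)

# Cosine alignment loss between the pooled local features and the text.
loss = 1 - F.cosine_similarity(pooled_visual, text_embed, dim=-1).mean()
print(loss.item())
```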
1 code implementation • 25 Nov 2024 • Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data.
1 code implementation • 23 Oct 2024 • Luca Barsellotti, Roberto Bigazzi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
In recent years, research interest in visual navigation towards objects in indoor environments has grown significantly.
1 code implementation • 9 Oct 2024 • Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions.
no code implementations • 16 Sep 2024 • Federico Betti, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe
Diffusion models have significantly advanced generative AI, but they encounter difficulties when generating complex combinations of multiple objects.
no code implementations • 29 Aug 2024 • Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level.
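For context, reward-based fine-tuning of this kind is usually implemented with self-critical sequence training, where the reward of a sampled caption is baselined by the reward of the greedy caption; the sketch below uses dummy log-probabilities and scalar rewards (in practice a metric such as CIDEr), so it only illustrates the shape of the loss.

```python
import torch

def scst_loss(sample_logprobs: torch.Tensor,
              sample_reward: torch.Tensor,
              greedy_reward: torch.Tensor) -> torch.Tensor:
    """Self-critical policy-gradient loss: advantage = r(sample) - r(greedy)."""
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * sample_logprobs.sum(dim=-1)).mean()

# Dummy batch: per-token log-probs of sampled captions and scalar rewards
# (the rewards would normally come from a captioning metric such as CIDEr).
logprobs = torch.randn(4, 20, requires_grad=True)   # (batch, caption length)
r_sample = torch.tensor([0.9, 1.1, 0.7, 1.3])
r_greedy = torch.tensor([1.0, 1.0, 1.0, 1.0])

loss = scst_loss(logprobs, r_sample, r_greedy)
loss.backward()
print(loss.item())
```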
1 code implementation • 26 Aug 2024 • Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions.
1 code implementation • 29 Jul 2024 • Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge.
1 code implementation • 29 Jul 2024 • Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara
To support the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced using four different generators.
no code implementations • 21 May 2024 • Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara
The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images.
no code implementations • 23 Apr 2024 • Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Multimodal LLMs are the natural evolution of LLMs, extending their capabilities to work beyond the purely textual modality.
no code implementations • 15 Apr 2024 • Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, Rita Cucchiara
Vision-and-Language Navigation (VLN) is a challenging task that involves an agent following human instructions and navigating a previously unknown environment to reach a specified goal.
no code implementations • CVPR 2024 • Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form.
no code implementations • 11 Mar 2024 • Roberto Bigazzi, Lorenzo Baraldi, Shreyas Kousik, Rita Cucchiara, Marco Pavone
Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments.
1 code implementation • 19 Feb 2024 • Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Connecting text and visual modalities plays an essential role in generative intelligence.
1 code implementation • 27 Nov 2023 • Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator.
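One plausible training signal for such a fine-tuning stage is sketched below: redirect the text encoder so that unsafe prompts are mapped close to the representation of their sanitized rewrites. This is an assumption for illustration only (a linear layer stands in for CLIP's text encoder, and this is not necessarily the objective used in the paper).

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in: a linear layer in place of CLIP's text encoder;
# in practice the embeddings would come from an actual CLIP model.
text_encoder = torch.nn.Linear(512, 512)
unsafe_inputs = torch.randn(8, 512)   # features of unsafe prompts
safe_inputs = torch.randn(8, 512)     # features of their safe rewrites

# Redirection objective: map unsafe prompts onto their sanitized counterparts.
unsafe_out = F.normalize(text_encoder(unsafe_inputs), dim=-1)
safe_target = F.normalize(text_encoder(safe_inputs), dim=-1).detach()
loss = 1 - F.cosine_similarity(unsafe_out, safe_target, dim=-1).mean()
loss.backward()
print(loss.item())
```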
1 code implementation • ICCV 2023 • Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions.
no code implementations • 18 Jul 2023 • Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe
Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs.
1 code implementation • 12 Jun 2023 • Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara
The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks.
no code implementations • 4 Apr 2023 • Samuele Poppi, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network.
1 code implementation • 4 Apr 2023 • Vittorio Pippi, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks.
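The most basic form of such synthetic data is a word rendered with a handwriting-style font, as in the minimal sketch below; the font path is a placeholder, and real pipelines add distortions, backgrounds, and varied calligraphic styles.

```python
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, font_path: str, font_size: int = 48) -> Image.Image:
    """Render a single word on a white canvas, a basic form of synthetic HTR data."""
    font = ImageFont.truetype(font_path, font_size)
    # Measure the text so the canvas fits it with a small margin.
    left, top, right, bottom = font.getbbox(word)
    img = Image.new("L", (right - left + 20, bottom - top + 20), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((10 - left, 10 - top), word, font=font, fill=0)
    return img

# "handwriting.ttf" is a placeholder path to any handwriting-style font.
render_word("muratori", "handwriting.ttf").save("synthetic_word.png")
```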
2 code implementations • 2 Apr 2023 • Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto del Bimbo, Rita Cucchiara
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
1 code implementation • CVPR 2023 • Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures.
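For reference, the widely used CLIP-Score that this line of work builds on is just a rescaled, clipped cosine similarity between CLIP image and text features; a minimal version on precomputed features is sketched below (the paper proposes a different, learned metric, which is not shown here).

```python
import torch
import torch.nn.functional as F

def clip_score(image_feats: torch.Tensor, text_feats: torch.Tensor,
               w: float = 2.5) -> torch.Tensor:
    """Reference CLIP-Score: w * max(cos(image, caption), 0) on CLIP features."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    cos = (image_feats * text_feats).sum(dim=-1)
    return w * cos.clamp_min(0)

# Dummy precomputed CLIP features for a batch of image-caption pairs.
print(clip_score(torch.randn(4, 512), torch.randn(4, 512)))
```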
no code implementations • 17 Jan 2023 • Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
The development of embodied agents that can communicate with humans in natural language has gained increasing interest over recent years, as it facilitates the diffusion of robotic platforms in human-populated environments.
no code implementations • 17 Aug 2022 • Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content.
no code implementations • 16 Aug 2022 • Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, Rita Cucchiara
With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years.
1 code implementation • 29 Jul 2022 • Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara
In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts.
Ranked #22 on Cross-Modal Retrieval on COCO 2014 (using extra training data)
no code implementations • 26 Jul 2022 • Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
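The retrieval side of such a kNN memory can be reduced to a nearest-neighbour lookup over cached visual keys paired with captions, as in the toy sketch below; names and sizes are illustrative, and large-scale systems would typically use an approximate index instead of exact cosine search.

```python
import torch
import torch.nn.functional as F

def knn_retrieve(query: torch.Tensor, memory_keys: torch.Tensor,
                 memory_captions: list[str], k: int = 3) -> list[str]:
    """Return the k captions whose keys are closest (cosine) to the query image."""
    sims = F.normalize(query, dim=-1) @ F.normalize(memory_keys, dim=-1).T
    topk = sims.topk(k, dim=-1).indices.squeeze(0)
    return [memory_captions[i] for i in topk.tolist()]

# Toy external corpus: 1000 cached visual keys with their associated captions.
keys = torch.randn(1000, 512)
captions = [f"caption {i}" for i in range(1000)]
query_image_feature = torch.randn(1, 512)
print(knn_retrieve(query_image_feature, keys, captions))
```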
1 code implementation • 19 Apr 2022 • Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
This feature is challenging for occupancy-based agents which are usually trained in crowded domestic environments with plenty of occupancy information.
1 code implementation • 18 Apr 2022 • Federico Landi, Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
To make a step towards this setting, we propose Spot the Difference: a novel task for Embodied AI where the agent has access to an outdated map of the environment and needs to recover the correct layout in a fixed time budget.
1 code implementation • 21 Feb 2022 • Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities.
no code implementations • 24 Nov 2021 • Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions.
1 code implementation • 14 Sep 2021 • Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
The proposed exploration approach outperforms DRL-based competitors relying on intrinsic rewards and surpasses the agents trained with a dense extrinsic reward computed with the environment layouts.
no code implementations • 31 Aug 2021 • Federico Landi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Numerical results suggest that the cell state contains useful information that is worth including in the gate structure.
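The finding refers to letting the gates of a recurrent cell also observe the previous cell state; a compact peephole-style LSTM cell along those lines is sketched below, as an illustrative variant rather than the exact formulation studied in the paper.

```python
import torch
import torch.nn as nn

class PeepholeLSTMCell(nn.Module):
    """LSTM cell whose gates also read the previous cell state c_{t-1}."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Gates take [x_t, h_{t-1}, c_{t-1}] as input.
        self.gates = nn.Linear(input_size + 2 * hidden_size, 4 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x, h, c):
        z = self.gates(torch.cat([x, h, c], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new

cell = PeepholeLSTMCell(16, 32)
h = c = torch.zeros(1, 32)
h, c = cell(torch.randn(1, 16), h, c)
print(h.shape, c.shape)
```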
no code implementations • 14 Jul 2021 • Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara
Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation.
no code implementations • 2 Jun 2021 • Marco Cagrandi, Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara
In this paper, we present a novel approach for novel object captioning (NOC) that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly.
1 code implementation • 12 May 2021 • Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
In this work, we detail how to transfer the knowledge acquired in simulation into the real world.
1 code implementation • 20 Apr 2021 • Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
As the request for deep learning solutions increases, the need for explainability is even more fundamental.
1 code implementation • 15 Feb 2021 • Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara
The recently proposed action spotting task consists in finding the exact timestamp in which an event occurs.
Ranked #1 on Action Spotting on SoccerNet
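In practice, spotting predictions are typically scored by checking whether a predicted timestamp falls within a tolerance window of a ground-truth event; the helper below (hypothetical names, simplified single-class matching) illustrates that idea.

```python
def spotting_hits(predicted: list[float], ground_truth: list[float],
                  tolerance: float = 5.0) -> int:
    """Count ground-truth events matched by a prediction within +/- tolerance seconds."""
    hits, used = 0, set()
    for gt in ground_truth:
        for i, p in enumerate(predicted):
            if i not in used and abs(p - gt) <= tolerance:
                hits += 1
                used.add(i)
                break
    return hits

# Toy example: three annotated events, two predictions land inside the window.
print(spotting_hits([12.0, 45.5, 90.0], [10.0, 44.0, 70.0]))
```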
no code implementations • 20 Jul 2020 • Matteo Fabbri, Fabio Lanzi, Riccardo Gasparini, Simone Calderara, Lorenzo Baraldi, Rita Cucchiara
In this document, we report our proposal for modeling the risk of possible contagion in a given area monitored by RGB cameras where people freely move and interact.
no code implementations • 14 Jul 2020 • Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path.
no code implementations • 27 Apr 2020 • Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering.
2 code implementations • CVPR 2020 • Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding.
Ranked #2 on Image Captioning on MS COCO
1 code implementation • 9 Dec 2019 • Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara
Action Detection is a complex task that aims to detect and classify human actions in video clips.
1 code implementation • 27 Nov 2019 • Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara
Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination.
no code implementations • 7 Oct 2019 • Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The ability to generate natural language explanations conditioned on the visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans.
no code implementations • International Conference on Image Analysis and Processing 2019 • Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara
As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios.
1 code implementation • 5 Jul 2019 • Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara
In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination with the only guidance of a natural language instruction.
1 code implementation • 5 Mar 2019 • Matteo Stefanini, Riccardo Lancellotti, Lorenzo Baraldi, Simone Calderara
The experiments compare our proposal with state-of-the-art solutions available in the literature, demonstrating that our approach achieves better performance.
1 code implementation • 4 Mar 2019 • Stefano Pini, Marcella Cornia, Federico Bolelli, Lorenzo Baraldi, Rita Cucchiara
Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag.
1 code implementation • CVPR 2019 • Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior.
1 code implementation • CVPR 2019 • Matteo Tomei, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain.
1 code implementation • CVPR 2018 • Lorenzo Baraldi, Matthijs Douze, Rita Cucchiara, Hervé Jégou
This paper considers a learnable approach for comparing and aligning videos.
no code implementations • 26 Jun 2017 • Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara
Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions.
Ranked #2 on Image Captioning on Flickr30k Captions test (using extra training data)
2 code implementations • 29 Nov 2016 • Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara
Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations.
no code implementations • CVPR 2017 • Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description.
no code implementations • 5 Oct 2016 • Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
This paper presents a novel approach for temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure.
2 code implementations • 5 Sep 2016 • Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara
Current state of the art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps.
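A minimal example of the pattern described above is shown below, with random feature maps standing in for a CNN backbone: a small convolutional head non-linearly combines the last-layer features into a single-channel map, which is upsampled and squashed into a saliency heatmap. This is a generic sketch, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    """Combine last-layer conv features into a single-channel saliency map."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.combine = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, feats, out_size):
        sal = self.combine(feats)
        sal = F.interpolate(sal, size=out_size, mode="bilinear", align_corners=False)
        return torch.sigmoid(sal)

# Random 2048-channel features standing in for a CNN backbone's last conv layer.
features = torch.randn(1, 2048, 15, 20)
print(SaliencyHead(2048)(features, out_size=(480, 640)).shape)  # (1, 1, 480, 640)
```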
no code implementations • 9 Apr 2016 • Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
This paper presents a novel retrieval pipeline for video collections, which aims to retrieve the most significant parts of an edited video for a given query, and represent them with thumbnails which are at the same time semantically meaningful and aesthetically remarkable.
1 code implementation • 29 Oct 2015 • Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots.
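The core idea can be sketched as a siamese-style distance between consecutive shot descriptors: embed each shot, measure the distance between neighbours, and place a scene cut wherever the distance exceeds a threshold. Names, dimensions, and the threshold below are illustrative only, not the paper's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShotEmbedder(nn.Module):
    """Project raw shot descriptors into a space where distance means dissimilarity."""

    def __init__(self, in_dim: int = 512, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def scene_boundaries(shot_feats: torch.Tensor, embedder: ShotEmbedder,
                     threshold: float = 0.8) -> list[int]:
    """Indices i such that a scene cut is placed between shot i and shot i + 1."""
    emb = embedder(shot_feats)
    dists = (emb[1:] - emb[:-1]).norm(dim=-1)
    return (dists > threshold).nonzero(as_tuple=True)[0].tolist()

shots = torch.randn(30, 512)  # dummy descriptors for 30 consecutive shots
print(scene_boundaries(shots, ShotEmbedder()))
```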