The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements.
Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics of an image and translating them into linguistically coherent descriptions.
The field of multi-object tracking has recently seen a renewed interest in the good old scheme of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches.
Recent advancements in Digital Document Restoration (DDR) have led to significant breakthroughs in analyzing highly damaged written artifacts.
Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly effective technique for representing 3D reconstructions of objects and scenes derived from sets of images.
Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs.
The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification.
In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions.
Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets.
Machine Unlearning has recently emerged as a paradigm for selectively removing the impact of training data points from a network.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner.
In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks.
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts written in natural language.
Generating synthetic images of handwritten text in a writer-specific style is a challenging task, especially in the case of unseen styles and new words, and even more so when the latter contain characters that are rarely encountered during training.
The CLIP model has recently been proven very effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures.
There is growing interest in applying Deep Learning techniques to tabular data, in order to replicate the success of other Artificial Intelligence areas in this structured domain.
Denoising Diffusion Probabilistic Models have shown impressive generation quality, although their long sampling chain leads to high computational costs.
The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last few years, as it facilitates the adoption of robotic platforms in human-populated environments.
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content.
With the aim of fostering research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of ancient Italian manuscripts edited by a single author over 60 years.
This work tackles Weakly Supervised Anomaly Detection, in which a predictor is allowed to learn not only from normal examples but also from a few labeled anomalies made available during training.
In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts.
In this paper, we investigate the development of an image captioning approach equipped with a kNN memory, through which knowledge can be retrieved from an external corpus to aid the generation process.
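As a rough illustration of the retrieval step, the sketch below selects the k corpus entries closest to an image (or decoding-context) embedding by cosine similarity; the tensor names and shapes are assumptions, not the paper's actual interface.

    import torch
    import torch.nn.functional as F

    def knn_retrieve(query, corpus_keys, corpus_captions, k=5):
        # query: (d,) image or decoding-context embedding
        # corpus_keys: (N, d) precomputed embeddings of the external corpus
        # corpus_captions: list of N caption strings
        sims = F.cosine_similarity(query.unsqueeze(0), corpus_keys)  # (N,)
        top = sims.topk(k)
        return [corpus_captions[i] for i in top.indices.tolist()], top.values

The retrieved captions (or statistics derived from them) can then be passed to the decoder as additional context during generation.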
This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations.
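A minimal sketch of one way to realize this, assuming class prototypes placed at the vertices of a regular simplex so that all pairwise angles are equal and maximal: the backbone emits a (C-1)-dimensional embedding, and the logits come from a single fixed, non-learned matrix multiplication.

    import torch

    def simplex_prototypes(num_classes):
        C = num_classes
        V = torch.eye(C) - 1.0 / C             # centered simplex vertices (rows sum to 0)
        V = V / V.norm(dim=1, keepdim=True)    # unit norm; pairwise cosine = -1/(C-1)
        ones = torch.ones(C, 1)
        Q, _ = torch.linalg.qr(ones, mode="complete")
        return V @ Q[:, 1:]                    # (C, C-1): coordinates in the zero-sum hyperplane

    P = simplex_prototypes(10)                 # fixed once; never updated by gradients
    features = torch.randn(32, 9)              # backbone embeddings of dimension C-1
    logits = features @ P.T                    # the one fixed matrix multiply, then softmax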
In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm and then occludes their corresponding positional embeddings (PEs).
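A simplified sketch of the shuffle-and-occlude step, operating per patch rather than block-wise and with hypothetical shapes:

    import torch

    def jigsaw_shuffle(patches, pos_emb, ratio=0.25):
        # patches: (B, N, D) patch embeddings; pos_emb: (1, N, D) positional embeddings
        B, N, D = patches.shape
        n_sel = int(N * ratio)
        idx = torch.randperm(N)[:n_sel]        # patches selected for the jigsaw puzzle
        perm = idx[torch.randperm(n_sel)]      # random permutation among the selected ones
        shuffled = patches.clone()
        shuffled[:, idx] = patches[:, perm]    # shuffle the selected patch contents
        pos = pos_emb.expand(B, N, D).clone()
        pos[:, idx] = 0.0                      # occlude the corresponding PEs
        return shuffled + pos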
In this paper, we introduce SeeFar to achieve vehicle speed estimation and traffic flow analysis based on YOLOv5 and DeepSORT from a moving drone.
Human trajectory forecasting is a key component of autonomous vehicles, socially aware robots and advanced video-surveillance applications.
This feature is challenging for occupancy-based agents which are usually trained in crowded domestic environments with plenty of occupancy information.
To make a step towards this setting, we propose Spot the Difference: a novel task for Embodied AI where the agent has access to an outdated map of the environment and needs to recover the correct layout in a fixed time budget.
Dress Code is more than 3x larger than publicly available datasets for image-based virtual try-on and features high-resolution paired images (1024x768) with front-view, full-body reference models.
To this end, we conceive a novel distillation strategy that allows a knowledge transfer from a teacher network to a student one, the latter fed with fewer observations (just two).
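A hedged sketch of this kind of distillation, in which the teacher sees the full observation window while the student receives only the last two observations; the function names and loss weighting are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, obs_traj, future_traj, alpha=0.5):
        # obs_traj: (B, T_obs, 2) observed coordinates; future_traj: (B, T_fut, 2)
        with torch.no_grad():
            t_pred = teacher(obs_traj)         # teacher uses all T_obs observations
        s_pred = student(obs_traj[:, -2:])     # student sees just the last two
        task_loss = F.mse_loss(s_pred, future_traj)
        distill_loss = F.mse_loss(s_pred, t_pred)
        return task_loss + alpha * distill_loss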
Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities.
While captioning models have obtained compelling results in describing natural images, there is a growing effort to increase their capability of dealing with real-world concepts.
Recently, learning frameworks have shown the capability of inferring accurate shape, pose, and texture of an object from a single RGB image.
The proposed exploration approach outperforms DRL-based competitors relying on intrinsic rewards and surpasses the agents trained with a dense extrinsic reward computed with the environment layouts.
Deep learning-based methods for video pedestrian detection and tracking require large volumes of training data to achieve good performance.
Since 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation.
Gesture recognition is a fundamental tool to enable novel interaction paradigms in a variety of application scenarios like Mixed Reality environments, touchless public kiosks, entertainment systems, and more.
In this paper, we present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly.
In this work, we detail how to transfer the knowledge acquired in simulation into the real world.
As the demand for deep learning solutions increases, the need for explainability becomes even more fundamental.
The recently proposed action spotting task consists in finding the exact timestamp at which an event occurs.
In this document, we report our proposal for modeling the risk of possible contagion in a given area monitored by RGB cameras where people freely move and interact.
In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path.
In this work we propose a deep learning pipeline to predict the visual future appearance of an urban scene.
Understanding human motion behaviour is a critical task for several possible applications like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment.
Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, socially aware robots, and advanced video surveillance applications.
The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering.
At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation.
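An illustrative fully-convolutional 3D autoencoder in this spirit; the channel widths and layer count are assumptions rather than the paper's configuration.

    import torch.nn as nn

    class VolumetricHeatmapAE(nn.Module):
        def __init__(self, joints=14, code_ch=8):
            super().__init__()
            self.encoder = nn.Sequential(      # compress (B, J, D, H, W) heatmaps
                nn.Conv3d(joints, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv3d(32, code_ch, 3, stride=2, padding=1),
            )
            self.decoder = nn.Sequential(      # reconstruct heatmaps from the code
                nn.ConvTranspose3d(code_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose3d(32, joints, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, heatmaps):
            code = self.encoder(heatmaps)      # dense intermediate representation
            return self.decoder(code), code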
Therefore, we additionally introduce a task classifier that predicts the task label of each example, to deal with settings in which a task oracle is not available.
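A minimal sketch of the idea, with hypothetical module names and sizes: a small task classifier predicts the task label from the features, and at inference time its prediction routes each example to the corresponding task head.

    import torch
    import torch.nn as nn

    class TaskAwareHead(nn.Module):
        def __init__(self, feat_dim, classes_per_task, num_tasks):
            super().__init__()
            self.task_clf = nn.Linear(feat_dim, num_tasks)   # predicts the task label
            self.heads = nn.ModuleList(
                nn.Linear(feat_dim, classes_per_task) for _ in range(num_tasks)
            )

        def forward(self, feats):                            # feats: (B, feat_dim)
            task_logits = self.task_clf(feats)               # trained with CE on task ids
            task_ids = task_logits.argmax(dim=1)             # no task oracle at test time
            logits = torch.stack(
                [self.heads[int(t)](f) for t, f in zip(task_ids, feats)]
            )
            return logits, task_logits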
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding.
Action Detection is a complex task that aims to detect and classify human actions in video clips.
Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination.
The ability to generate natural language explanations conditioned on visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans.
An Image Completion Network (ICN) is then trained to generate a realistic image starting from this geometric guidance.
In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination guided only by a natural language instruction.
Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag.
We present a novel and hierarchical approach for supervised classification of signals defined over a fixed graph, reflecting shared properties of the dataset.
Can faces acquired by low-cost depth sensors be useful to catch some characteristic details of the face?
When you see a person in a crowd, occluded by other persons, you miss visual information that can be used to recognize, re-identify or simply classify him or her.
Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the outside.
The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain.
Novelty detection is commonly referred to as the discrimination of observations that do not conform to a learned model of regularity.
Semi-supervised learning is a popular class of techniques to learn from labeled and unlabeled data.
This paper considers a learnable approach for comparing and aligning videos.
In this paper, an adversarial architecture for facial depth map estimation from monocular intensity images is presented.
Multi-People Tracking in an open-world setting requires a special effort in precise detection.
Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only.
Two public datasets have been exploited: the first one, called Pandora, is used to train a deep binary classifier with face and non-face images.
In this paper we propose a deep architecture for detecting people attributes (e.g., gender, race, clothing, etc.) in surveillance contexts.
Awareness of the road scene is an essential component for both autonomous vehicles and Advanced Driver Assistance Systems, and is gaining importance for both academia and car companies.
Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions.
We address unsupervised optical flow estimation for ego-centric motion.
In this paper, we tackle the pose estimation problem through a deep learning network working in a regression manner.
HMMs are widely used in action and gesture recognition due to their implementation simplicity, low computational requirement, scalability and high parallelism.
Recently, deep learning approaches have achieved promising results in various fields of computer vision.
In this work, we present a new deep learning framework for head localization and pose estimation on depth images.
Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations.
The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description.
Despite the advent of autonomous cars, it is likely that, at least in the near future, human attention will still maintain a central role as a guarantee in terms of legal responsibility during the driving task.
This paper presents a novel approach for temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure.
To help accelerate progress in multi-target, multi-camera tracking systems, we present (i) a new pair of precision-recall measures of performance that treats errors of all types uniformly and emphasizes correct identification over sources of error; (ii) the largest fully-annotated and calibrated data set to date with more than 2 million frames of 1080p, 60fps video taken by 8 cameras observing more than 2,700 identities over 85 minutes; and (iii) a reference software system as a comparison baseline.
Current state-of-the-art models for saliency prediction employ fully convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps.
To effectively register an egocentric video sequence under these conditions, we propose to tackle the source of the problem: the matching process.
This paper presents a novel retrieval pipeline for video collections, which aims to retrieve the most significant parts of an edited video for a given query, and represent them with thumbnails which are at the same time semantically meaningful and aesthetically remarkable.
We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots.
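As a rough sketch of how a learned shot distance can drive the segmentation, the snippet below embeds shot descriptors, measures the distance between temporally adjacent shots, and opens a new scene where the distance exceeds a threshold; the embedding network and threshold are assumptions, and the actual grouping strategy may differ.

    import torch
    import torch.nn.functional as F

    def segment_scenes(shot_features, embed_net, threshold=0.5):
        # shot_features: (S, D) one descriptor per shot, in temporal order
        z = F.normalize(embed_net(shot_features), dim=1)     # learned shot embedding
        d = 1 - (z[:-1] * z[1:]).sum(dim=1)                  # cosine distance to next shot
        boundaries = (d > threshold).nonzero().flatten() + 1
        return boundaries                                    # shot indices starting new scenes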
Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of individuals.