The first untrained module aims to return a rough alignment between textual phrases and bounding boxes.
Long-term trajectory forecasting is an important and challenging problem in the fields of computer vision, machine learning, and robotics.
In this paper, we propose an end-to-end architecture that exploits Proximity-Aware Tasks (referred to as Risk and Proximity Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviors.
Human intention prediction is a growing area of research where an activity in a video has to be anticipated by a vision-based system.
Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on large datasets.
Human trajectory forecasting is a key component of autonomous vehicles, social-aware robots and advanced video-surveillance applications.
To this end, we conceive a novel distillation strategy that allows a knowledge transfer from a teacher network to a student one, the latter fed with fewer observations (just two).
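As a rough illustration of teacher-to-student transfer, the sketch below combines a task loss on the student's prediction with a term that pulls the student's features toward the teacher's. The mean-squared-error form, the weight `alpha`, and all function names are assumptions made for illustration; the sentence above does not specify the actual distillation objective.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_pred, student_feat, teacher_feat, target, alpha=0.5):
    """Hypothetical objective: task loss plus a feature-matching term.

    student_pred : student's forecast, computed from few observations
    student_feat : student's intermediate features
    teacher_feat : features of a teacher that saw more observations
    target       : ground-truth future values
    alpha        : assumed weight of the transfer term
    """
    task = mse(student_pred, target)
    transfer = mse(student_feat, teacher_feat)
    return task + alpha * transfer


# A perfect prediction is still penalized when the features disagree.
loss = distillation_loss([1.0, 2.0], [1.0, 1.0], [0.0, 0.0], [1.0, 2.0])
print(loss)
```

In this toy setup the teacher's features act as a fixed target, so only the student would be updated during training.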
In this paper, we present a novel approach to incrementally learn an Abstract Model of an unknown environment, and show how an agent can reuse the learned model for tackling the Object Goal Navigation task.
Action anticipation in egocentric videos is a difficult task due to the inherently multi-modal nature of human actions.
In this way, we are able to control the compactness of the features of the same class around the centers of the Gaussians, thus controlling the ability of the classifier to detect samples from unknown classes.
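One way to picture why compactness matters: if the features of each known class cluster tightly around a center, a sample that lies far from every center can be flagged as unknown. The sketch below is a minimal illustration under that assumption. The Euclidean distance, the fixed `radius` threshold, and the function names are hypothetical and are not taken from the source.

```python
import math


def nearest_center(feature, centers):
    """Return (label, distance) of the class center closest to `feature`."""
    best_label, best_dist = None, math.inf
    for label, center in centers.items():
        dist = math.dist(feature, center)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


def classify_open_set(feature, centers, radius):
    """Assign the nearest known class, or "unknown" if it lies beyond `radius`.

    `radius` is an assumed rejection threshold; tighter class clusters
    allow a smaller radius and therefore a sharper rejection of outliers.
    """
    label, dist = nearest_center(feature, centers)
    return label if dist <= radius else "unknown"


centers = {"cat": (0.0, 0.0), "dog": (4.0, 0.0)}
print(classify_open_set((0.5, 0.2), centers, radius=1.0))
print(classify_open_set((2.0, 3.0), centers, radius=1.0))
```

The first query falls within the assumed radius of the "cat" center, while the second is far from both centers and is rejected.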
One of the most serious public health problems in Peru and worldwide is Tuberculosis (TB), which is produced by a bacterium known as Mycobacterium tuberculosis.
We study this question in the context of Object Navigation, a problem in which an agent has to reach an object of a specific class while moving in a complex domestic environment.
Tuberculosis (TB), caused by a germ called Mycobacterium tuberculosis, is one of the most serious public health problems in Peru and the world.
Tuberculosis, caused by a bacterium called Mycobacterium tuberculosis, is one of the most serious public health problems worldwide.
Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video surveillance applications.
Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels, extending the concept of label smoothing.
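For reference, standard label smoothing replaces a one-hot target with a mixture of the one-hot vector and a prior distribution over classes. The sketch below shows that baseline and lets the prior be non-uniform, which is one conceivable way to spread mass over plausible but unlabeled future actions. The `prior` argument and the value of `eps` are illustrative assumptions, not the extension proposed in the source.

```python
def smooth_labels(num_classes, target, eps=0.1, prior=None):
    """Soft target: (1 - eps) * one_hot(target) + eps * prior.

    With `prior=None` a uniform prior is used, which gives standard
    label smoothing. A non-uniform prior (assumed here to sum to 1)
    shifts mass toward classes considered plausible alternatives.
    """
    if prior is None:
        prior = [1.0 / num_classes] * num_classes
    return [
        (1.0 - eps) * (1.0 if k == target else 0.0) + eps * prior[k]
        for k in range(num_classes)
    ]


# Uniform smoothing over 4 actions, with action 2 observed.
uniform = smooth_labels(4, target=2, eps=0.2)

# Hypothetical prior that treats action 3 as the only plausible alternative.
informed = smooth_labels(4, target=2, eps=0.2, prior=[0.0, 0.0, 0.0, 1.0])
print(uniform, informed)
```

In both cases the soft target still sums to one, so it can be used directly with a cross-entropy loss.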
Mimicking human ability to forecast future positions or interpret complex interactions in urban scenarios, such as streets, shopping malls or squares, is essential to develop socially compliant robots or self-driving cars.
One of the main problems in webly-supervised learning is cleaning the noisy labeled data from the web.
To this end, we propose a "context-aware" LSTM recurrent neural network model, which can learn and predict human motion in crowded spaces such as a sidewalk, a museum or a shopping mall.
In this paper we deal with the problem of predicting action progress in videos.
Automatic image annotation is among the fundamental problems in computer vision and pattern recognition, and it is becoming increasingly important for developing algorithms that can search and browse large-scale image collections.
When given a single frame of a video, humans can not only interpret the content of the scene but also forecast the near future.
Some images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata.
Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image.
Our approach exploits collective knowledge embedded in user-generated tags and web sources, and visual similarity of keyframes and images uploaded to social sites like YouTube and Flickr, as well as web sources like Google and Bing.