This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the performance of IT-LVLMs on fundamental computer vision tasks.
One of the objectives of Continual Learning is to learn new concepts continually over a stream of experiences and at the same time avoid catastrophic forgetting.
Existing question answering methods often assume that the input content (e.g., documents or videos) is always accessible to solve the task.
In this paper, we address the problem of continual learning for video data.
Lifelong language learning seeks to have models continuously learn multiple tasks in a sequential order without suffering from catastrophic forgetting.
This paper introduces Memory Outlier Elimination (MOE), a method for identifying and eliminating outliers in the memory buffer by choosing samples from label-homogeneous subpopulations.
Due to the success of pre-trained language models, versions for languages other than English have been released in recent years.
Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level.
Current datasets for training social behaviors are usually borrowed from surveillance applications that capture visual data from a bird's-eye perspective.
Recently, few-shot learning has received increasing interest.
On the other hand, a set of trainable masks provides the key mechanism to selectively choose relevant weights from the KB to solve each task.
The field of natural language understanding has experienced exponential progress in the last few years, with impressive results in several tasks.
The state of the art, previously dominated by pre-trained word embeddings, is now being pushed forward by large pre-trained contextual representation models.
DACT-BERT adds an adaptive computation mechanism to the regular processing pipeline of BERT.
As a working hypothesis, we speculate that during learning some weights focus on mining patterns from frequent examples while others are in charge of memorizing rare long-tail samples.
We introduce two temporal attention modules that can be plugged into traditional memory-augmented recurrent neural networks to improve their performance in natural language processing tasks.
Every year physicians face an increasing demand for image-based diagnosis from patients, a problem that can be addressed with recent artificial intelligence methods.
We propose a multi-head attention mechanism as a blending layer in a neural network model that translates natural language to a high-level behavioral language for indoor robot navigation.
This paper presents a novel attention-based algorithm for achieving adaptive computation called DACT, which, unlike existing ones, is end-to-end differentiable.
Inspired by research in psychology, we introduce a behavioral approach for visual navigation using topological maps.
We propose an end-to-end deep learning model for translating free-form natural language instructions to a high-level plan for behavioral robot navigation.
Traditional video understanding tasks include human action recognition and actor/object semantic segmentation.
A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image.
Advances in image processing and computer vision in recent years have brought about the use of visual features in artwork recommendation.
Compared to other areas, artwork recommendation has received little attention, despite the continuous growth of the artwork market.
Consequently, a main conclusion of this work is that general-purpose commonsense ontologies improve performance on visual reasoning tasks when properly filtered to select meaningful visual relations.
In this paper, we introduce a new hierarchical model for human action recognition using body joint locations.
To obtain key-sequences, we introduce a loss function that, for each video, leads to the identification of a sparse set of representative key-frames capturing both the relevant particularities of the input video and the relevant generalities of the complete class collection.