This paper introduces MERLIM, a Multi-modal Evaluation Benchmark: a scalable test-bed for assessing the performance of Instruction-Tuned Large Vision-Language Models (IT-LVLMs) on fundamental computer vision tasks.
In this paper, we address the problem of continual learning for video data.
We perform in-depth evaluations of existing CL methods in vCLIMB and identify two challenges unique to video data.
Current language models are typically trained with a self-supervised objective, whose main focus is learning representations at the word or sentence level.
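As a minimal illustration of such a self-supervised scheme, the sketch below builds BERT-style masked-language-modeling training pairs; the helper name `make_mlm_example` and the toy sentence are hypothetical, not from any of the cited papers. The supervision signal is free: the targets are simply the original tokens at the masked positions.

```python
import random

MASK = "[MASK]"  # placeholder token the model must fill in

def make_mlm_example(tokens, mask_prob=0.15, rng=None):
    """Create one masked-language-modeling training pair:
    the input with some tokens replaced by [MASK], plus the
    (position, original token) targets to be predicted.
    Hypothetical helper for illustration only."""
    rng = rng or random.Random(0)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)          # hide the token from the model
            targets.append((i, tok))     # remember what it should predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = make_mlm_example(tokens, mask_prob=0.5)
```

In practice the prediction is made by a neural network and the masking ratio is closer to 15%, but the key point is that no human labels are needed: the corpus itself supplies both inputs and targets.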
Few-shot learning has recently attracted increasing interest.
Our evaluation indicates that both the Transformer architecture and the contextual information are essential for achieving the best results on this item recommendation task.