Multi-Instance Retrieval
13 papers with code • 1 benchmark • 1 dataset
Most implemented papers
Learning Video Representations from Large Language Models
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs).
Learning video retrieval models with relevance-aware online mining
Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention.
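The technique named in this entry, relevance-aware online mining, decides which in-batch pairs should act as positives and negatives based on how semantically relevant their captions are to the query video. The sketch below is only a rough illustration of that idea under a symmetric InfoNCE-style loss, where overly relevant off-diagonal pairs are excluded from the negative set; the relevance matrix, threshold, and function names are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_relevance_masking(video_emb, text_emb, relevance,
                                            temperature=0.07, rel_threshold=0.5):
    """Symmetric InfoNCE-style loss that ignores in-batch 'negatives' whose
    caption is highly relevant to the video (hypothetical sketch).

    video_emb, text_emb: (B, D) L2-normalized embeddings, paired by index.
    relevance:           (B, B) precomputed video-caption relevance in [0, 1],
                         e.g. noun/verb overlap between narrations.
    """
    logits = video_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Mask out off-diagonal pairs that are too relevant to be true negatives.
    too_relevant = (relevance > rel_threshold) & ~torch.eye(
        logits.size(0), dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(too_relevant, float('-inf'))

    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

A natural variant, closer in spirit to relevance-aware mining, is to promote such highly relevant pairs to extra positives rather than merely masking them out of the negative set.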
Egocentric Video-Language Pretraining
Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention.
Relevance-based Margin for Contrastively-trained Video Retrieval Models
We show that even if we carefully tuned the fixed margin, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance.
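Below is a minimal sketch of how a relevance-derived margin can replace a fixed, hand-tuned one in a triplet-style retrieval loss: the margin is computed from the relevance gap between the positive and the sampled negative caption, so there is no margin hyper-parameter to tune. All names are illustrative and the exact formulation in the paper may differ.

```python
import torch.nn.functional as F

def relevance_scaled_triplet_loss(video_emb, pos_text_emb, neg_text_emb,
                                  pos_rel, neg_rel):
    """Triplet loss whose margin shrinks when the negative caption is almost
    as relevant as the positive one (illustrative sketch only).

    video_emb, pos_text_emb, neg_text_emb: (B, D) embeddings.
    pos_rel, neg_rel: (B,) relevance scores in [0, 1] of the positive and
                      negative captions with respect to each video.
    """
    pos_sim = F.cosine_similarity(video_emb, pos_text_emb, dim=-1)
    neg_sim = F.cosine_similarity(video_emb, neg_text_emb, dim=-1)
    # The margin is the relevance gap itself: a caption nearly as relevant
    # as the positive should not be pushed far away.
    margin = (pos_rel - neg_rel).clamp(min=0.0)
    return F.relu(neg_sim - pos_sim + margin).mean()
```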
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
In this report, we propose a video-language pretraining (VLP) based solution, built on EgoVLP, for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
HierVL: Learning Hierarchical Video-Language Embeddings
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text.
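A rough sketch of the hierarchical idea follows, assuming clip embeddings are mean-pooled into a long-term video embedding that is contrasted with a video-level text summary, while individual clips are contrasted with their own narrations; HierVL's actual aggregation and training schedule are more elaborate than this illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings, both (B, D)."""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_emb, narration_emb, summary_emb, parent_weight=1.0):
    """Two-level objective: clips match their narrations, and the aggregated
    long-term video embedding matches a summary text (illustrative sketch).

    clip_emb:      (B, T, D) embeddings of T clips per video.
    narration_emb: (B, T, D) embeddings of the clip-level narrations.
    summary_emb:   (B, D) embedding of a video-level text summary.
    """
    B, T, D = clip_emb.shape
    # Child level: every clip should retrieve its own short narration.
    child = info_nce(clip_emb.reshape(B * T, D), narration_emb.reshape(B * T, D))
    # Parent level: mean-pool clips into a long-term video representation.
    video_emb = clip_emb.mean(dim=1)
    parent = info_nce(video_emb, summary_emb)
    return child + parent_weight * parent
```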
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks.
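"Fusion in the backbone" refers to inserting cross-modal attention directly into the uni-modal encoder layers, so the same backbone can run either as a standalone encoder (for fast retrieval) or as a fusion encoder (for joint video-text reasoning). The block below is a hypothetical illustration of that switchable pattern, not EgoVLPv2's actual module.

```python
import torch.nn as nn

class FusionBlock(nn.Module):
    """Transformer block with optional cross-attention to the other modality,
    so the backbone can act as a uni-modal encoder (fuse=False) or a fusion
    encoder (fuse=True). Illustrative sketch only."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, other=None, fuse=False):
        # Standard self-attention over this modality's tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention to the other modality only when fusion is enabled.
        if fuse and other is not None:
            h = self.norm2(x)
            x = x + self.cross_attn(h, other, other, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))
```

In retrieval mode one would call block(video_tokens) alone; in fusion mode, block(video_tokens, other=text_tokens, fuse=True).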
Training a Large Video Model on a Single Machine in a Day
Videos are big, complex to pre-process, and slow to train on.
EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?
Due to the occurrence of diverse EgoHOIs in the real world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal the diminished performance of current egocentric video-language models (EgoVLMs) on fine-grained concepts, indicating that these models still lack a full spectrum of egocentric understanding.
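A common way to probe this kind of fine-grained hand-object understanding is to build hard negative captions that differ from the ground-truth caption only in an HOI-relevant word (the verb or the object noun) and check whether the model still ranks the true caption first. The sketch below is a hypothetical illustration of that recipe; EgoHOIBench's construction and the EgoNCE++ training objective are more involved and are described in the paper.

```python
import torch
import torch.nn.functional as F

def hoi_negative_captions(caption, verb, alt_verbs):
    """Build hard negatives by swapping only the verb in the caption
    (hypothetical helper; real benchmarks also perturb object nouns)."""
    return [caption.replace(verb, v, 1) for v in alt_verbs if v != verb]

def ranks_true_caption_first(video_emb, text_encoder, caption, negatives):
    """Return True if the video embedding scores the true caption above all
    of its word-level negatives. `text_encoder` maps a list of strings to
    (N, D) embeddings; it is a stand-in for an actual EgoVLM text tower."""
    texts = [caption] + negatives
    text_emb = F.normalize(text_encoder(texts), dim=-1)   # (N, D)
    video_emb = F.normalize(video_emb, dim=-1)             # (D,)
    scores = text_emb @ video_emb                          # (N,)
    return bool(torch.argmax(scores) == 0)
```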