We propose LocFormer, a Transformer-based model for video grounding that operates with a constant memory footprint regardless of the video length, i.e., the number of frames.
We demonstrate that, with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images while matching state-of-the-art accuracy on existing narrow-domain datasets, such as fashion.
From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment.
This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query.
The availability of a large labeled dataset is a key requirement for applying deep learning methods to solve various computer vision tasks.
Despite recent advances in opinion mining for written reviews, few works have tackled the problem for other sources of reviews.
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
Given an untrimmed video and a sentence as the query, the goal is to determine the start and end of the visual moment in the video that corresponds to the query sentence.