no code implementations • CVPR 2022 • Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon
To capture the high-level concepts required to solve the task, we propose modeling the long-term temporal context of both the video and the music signals, using a Transformer network for each modality.
no code implementations • CVPR 2022 • Basile Van Hoorick, Purva Tendulkar, Didac Suris, Dennis Park, Simon Stent, Carl Vondrick
For computer vision systems to operate in dynamic situations, they need to be able to represent and reason about object permanence.
no code implementations • CVPR 2019 • Didac Suris, Adria Recasens, David Bau, David Harwath, James Glass, Antonio Torralba
Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images.