2 code implementations • 8 Jul 2024 • Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D).
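The TAP-3D task takes a video plus a set of query points and asks for a metric 3D trajectory and a visibility flag for every query in every frame. A minimal sketch of that input/output format is below; the array shapes, names, and error metric are illustrative assumptions, not TAPVid-3D's actual API.

```python
import numpy as np

# Illustrative TAP-3D interface (shapes and names assumed for exposition):
# each query point yields a 3D position and a visibility flag per frame.
num_frames, num_queries = 120, 8

queries = np.zeros((num_queries, 3), dtype=np.float32)   # (t, x, y) where tracking starts
tracks_xyz = np.zeros((num_queries, num_frames, 3), dtype=np.float32)  # predicted 3D positions
visible = np.ones((num_queries, num_frames), dtype=bool)  # per-frame visibility

def mean_3d_error(pred, gt, vis):
    """Average 3D end-point error over visible frames only (a simple stand-in metric)."""
    err = np.linalg.norm(pred - gt, axis=-1)              # (queries, frames)
    return (err * vis).sum() / np.maximum(vis.sum(), 1)
```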
2 code implementations • 1 Feb 2024 • Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes.
Ranked #1 on Point Tracking on TAP-Vid-RGB-Stacking
no code implementations • 20 Dec 2023 • Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark.
no code implementations • CVPR 2024 • João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling.
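The setting is concrete enough to sketch: one model, one ordered stream, one gradient step per frame. The toy loop below (placeholder model, next-frame objective, random frames) only illustrates the protocol, not the paper's architecture or loss.

```python
import torch

# Toy single-stream online learning loop: frames arrive strictly in order,
# and the model is updated after every frame -- no mini-batches, no
# shuffling, no augmentation. Model and objective are placeholders.
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 64 * 64, 3 * 64 * 64))
opt = torch.optim.SGD(model.parameters(), lr=1e-4)

def video_stream(num_frames=1000):
    for _ in range(num_frames):          # stand-in for a real video source
        yield torch.rand(1, 3, 64, 64)

prev = None
for frame in video_stream():
    if prev is not None:
        pred = model(prev)               # predict the current frame from the previous one
        loss = torch.nn.functional.mse_loss(pred, frame.flatten(1))
        opt.zero_grad()
        loss.backward()
        opt.step()                       # one gradient step per frame, in stream order
    prev = frame
```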
no code implementations • 12 Oct 2023 • Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, Yannis Avrithis
But are we making the best use of data?
1 code implementation • NeurIPS 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark -- the Perception Test -- to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4).
Ranked #1 on Point Tracking on Perception Test
3 code implementations • 7 Nov 2022 • Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move.
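This is the TAP (Tracking Any Point) setting: every query point gets a 2D position and an occlusion flag in every frame. A sketch of that representation and a threshold-style position metric, with the array layout and threshold chosen for illustration:

```python
import numpy as np

# Common point-track layout in the TAP literature (assumed for illustration):
# per-frame (x, y) positions plus an occlusion flag for each query point.
num_points, num_frames = 5, 50
tracks = np.zeros((num_points, num_frames, 2), dtype=np.float32)  # (x, y) in pixels
occluded = np.zeros((num_points, num_frames), dtype=bool)

def position_accuracy(pred, gt, occ, thresh=8.0):
    """Fraction of visible points within `thresh` pixels of ground truth."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    vis = ~occ
    return ((dist < thresh) & vis).sum() / np.maximum(vis.sum(), 1)
```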
no code implementations • 12 Oct 2022 • Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
Humans learn powerful representations of objects and scenes by observing how they evolve over time.
no code implementations • 30 Sep 2022 • Skanda Koppula, Yazhe Li, Evan Shelhamer, Andrew Jaegle, Nikhil Parthasarathy, Relja Arandjelovic, João Carreira, Olivier Hénaff
Self-supervised methods have achieved remarkable success in transfer learning, often matching or exceeding the accuracy of supervised pre-training.
no code implementations • 17 Mar 2022 • Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, Peter Battaglia
We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction.
1 code implementation • 16 Mar 2022 • Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, Relja Arandjelović
The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks.
3 code implementations • 15 Feb 2022 • Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression.
Ranked #35 on Language Modelling on WikiText-103
no code implementations • CVPR 2022 • Wang Yifan, Carl Doersch, Relja Arandjelović, João Carreira, Andrew Zisserman
Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases.
2 code implementations • ICCV 2021 • Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira
Self-supervised pretraining has been shown to yield powerful representations for transfer learning.
Ranked #60 on Semantic Segmentation on Cityscapes val (using extra training data)
no code implementations • 21 Oct 2020 • Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset.
no code implementations • 1 May 2020 • Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
The AVA-Kinetics dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and by extending the original AVA dataset with these new AVA-annotated Kinetics clips.
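For reference, the public AVA CSVs store one action instance per row: video id, keyframe timestamp in seconds, a normalized person box, an action id, and a person track id. A small parsing sketch (the row itself is a made-up example):

```python
import csv, io

# Parse one AVA-style annotation row (values are made-up examples).
row = "sample_video,902,0.226,0.312,0.481,0.915,12,0"
video_id, ts, x1, y1, x2, y2, action_id, person_id = next(csv.reader(io.StringIO(row)))

box = tuple(float(v) for v in (x1, y1, x2, y2))   # normalized (x1, y1, x2, y2)
print(video_id, int(ts), box, int(action_id), int(person_id))
```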
1 code implementation • CVPR 2020 • Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman
Given this shared embedding, we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.
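Step (i) is the easiest to picture: once both vocabularies live in one visually grounded space, translation is nearest-neighbour search. A toy sketch with random placeholder embeddings (MUVE learns the real ones from narrated video):

```python
import numpy as np

# Word translation by cosine nearest neighbour in a shared embedding space.
# Embeddings here are random placeholders, so the output is meaningless;
# with learned, visually grounded embeddings, translate("chien") -> "dog".
rng = np.random.default_rng(0)
src_vocab, tgt_vocab = ["chien", "pomme", "voiture"], ["dog", "apple", "car"]
src_emb = rng.normal(size=(3, 64))
tgt_emb = rng.normal(size=(3, 64))

def translate(word):
    v = src_emb[src_vocab.index(word)]
    sims = tgt_emb @ v / (np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(v))
    return tgt_vocab[int(np.argmax(sims))]
```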
no code implementations • ICCV 2019 • Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman
The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.
no code implementations • CVPR 2019 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce the Action Transformer model for recognizing and localizing human actions in video clips.
Ranked #6 on Action Recognition on AVA v2.1
1 code implementation • CVPR 2019 • Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman
True video understanding requires making sense of non-Lambertian scenes, where the color of light arriving at the camera sensor encodes information not just about the last object it collided with, but about multiple media -- colored windows, dirty mirrors, smoke or rain.
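A common way to formalize such scenes is as a per-pixel mixture of colour layers; the sketch below uses a simple convex blend, which is one assumption, not necessarily the paper's exact image formation model.

```python
import numpy as np

# Observed frame as a blend of a scene layer and a reflection layer
# (a simple convex mixture, assumed here for illustration).
H, W = 64, 64
scene = np.random.rand(H, W, 3)        # light transmitted from behind the glass
reflection = np.random.rand(H, W, 3)   # light reflected off the glass
alpha = 0.3                            # reflection strength

observed = (1 - alpha) * scene + alpha * reflection
```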
no code implementations • 26 Jul 2018 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce a simple baseline for action localization on the AVA dataset.
Ranked #12 on Action Recognition on AVA v2.1
no code implementations • 24 Nov 2015 • Shubham Tulsiani, Abhishek Kar, Qi-Xing Huang, João Carreira, Jitendra Malik
Actions as simple as grasping an object or navigating around it require a rich understanding of that object's 3D shape from a given viewpoint.
no code implementations • ICCV 2015 • Abhishek Kar, Shubham Tulsiani, João Carreira, Jitendra Malik
We consider the problem of enriching current object detection systems with veridical object sizes and relative depth estimates from a single image.
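The geometry behind this is the pinhole relation: an object of real height H at depth Z projects to h = f * H / Z pixels for focal length f in pixels, so a known category size pins down depth. A worked example with made-up numbers:

```python
# Pinhole size/depth relation: h = f * H / Z  =>  Z = f * H / h.
f = 1000.0   # focal length, pixels
H = 1.7      # assumed real height of a person, metres
h = 170.0    # detected box height, pixels

Z = f * H / h
print(f"estimated depth: {Z:.1f} m")   # 10.0 m
```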
1 code implementation • ICCV 2015 • Shubham Tulsiani, João Carreira, Jitendra Malik
We address the task of predicting pose for objects of unannotated categories, given a small seed set of annotated object classes.
no code implementations • CVPR 2015 • Abhishek Kar, Shubham Tulsiani, João Carreira, Jitendra Malik
Object reconstruction from a single image -- in the wild -- is a problem where we can make progress and get meaningful results today.
no code implementations • CVPR 2015 • João Carreira, Abhishek Kar, Shubham Tulsiani, Jitendra Malik
All that structure-from-motion algorithms "see" are sets of 2D points.