no code implementations • 30 Aug 2023 • Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, Jon Scholz
For robots to be useful outside labs and specialized factories, we need a way to teach them new behaviors quickly.
1 code implementation • ICCV 2023 • Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Ranked #1 on Visual Tracking on RGB-Stacking.
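The core recipe in TAP-style trackers is to compare a feature vector taken at the query point against feature maps from every other frame and read positions off the resulting correlation maps. Below is a minimal, hypothetical sketch of that comparison step (not the TAPIR model itself; the shapes, the assumption of pre-normalized features, and the soft-argmax readout are all illustrative):

```python
# Illustrative sketch of TAP-style point tracking via feature correlation.
# This is NOT the TAPIR model; names and shapes are hypothetical.
import torch
import torch.nn.functional as F

def track_query_point(feats, query_t, query_xy):
    """feats: [T, C, H, W] per-frame feature maps (assumed L2-normalized).
    query_t: index of the query frame; query_xy: integer (x, y) in feature-map coords.
    Returns [T, 2] estimated (x, y) positions via soft-argmax of correlations."""
    T, C, H, W = feats.shape
    x, y = query_xy
    query_feat = feats[query_t, :, y, x]                    # [C] feature at the query point
    corr = torch.einsum('c,tchw->thw', query_feat, feats)   # [T, H, W] per-frame cost maps
    prob = F.softmax(corr.view(T, -1), dim=-1).view(T, H, W)
    ys = torch.arange(H, dtype=torch.float32)
    xs = torch.arange(W, dtype=torch.float32)
    exp_y = torch.einsum('thw,h->t', prob, ys)              # soft-argmax over rows
    exp_x = torch.einsum('thw,w->t', prob, xs)              # soft-argmax over cols
    return torch.stack([exp_x, exp_y], dim=-1)              # [T, 2]
```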
1 code implementation • 23 May 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4).
2 code implementations • 7 Nov 2022 • Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move.
1 code implementation • DeepMind 2022 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal benchmark – the Perception Test – that aims to extensively evaluate perception and reasoning skills of multimodal models.
1 code implementation • CVPR 2022 • Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details.
no code implementations • CVPR 2022 • Wang Yifan, Carl Doersch, Relja Arandjelović, João Carreira, Andrew Zisserman
Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases.
7 code implementations • ICLR 2022 • Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible.
Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric).
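Perceiver IO achieves this generality by routing everything through a fixed-size latent array: inputs are read into the latents with cross-attention, the latents are processed with self-attention, and task-specific output queries attend to the latents to decode. A minimal sketch of that read/process/write pattern follows; the layer counts, widths, and omission of layernorms and MLP blocks are simplifications of ours, not the official implementation:

```python
# Minimal Perceiver-IO-style read/process/write sketch (illustrative only).
import torch
import torch.nn as nn

class TinyPerceiverIO(nn.Module):
    def __init__(self, in_dim, out_dim, latent_dim=256, n_latents=128, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.in_proj = nn.Linear(in_dim, latent_dim)
        self.read = nn.MultiheadAttention(latent_dim, 4, batch_first=True)
        self.process = nn.ModuleList(
            [nn.MultiheadAttention(latent_dim, 4, batch_first=True) for _ in range(depth)])
        self.write = nn.MultiheadAttention(latent_dim, 4, batch_first=True)
        self.out_proj = nn.Linear(latent_dim, out_dim)

    def forward(self, inputs, output_queries):
        # inputs: [B, M, in_dim]; output_queries: [B, N, latent_dim].
        x = self.in_proj(inputs)
        z = self.latents.expand(inputs.size(0), -1, -1)
        z, _ = self.read(z, x, x)                   # cross-attention: inputs -> latents
        for layer in self.process:
            h, _ = layer(z, z, z)                   # latent self-attention
            z = z + h
        out, _ = self.write(output_queries, z, z)   # decode: output queries attend to latents
        return self.out_proj(out)
```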
8 code implementations • NeurIPS 2020 • Jean-Baptiste Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
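Concretely, BYOL trains a predictor on top of the online network's projection to regress the target network's projection of the other view, while the target's weights follow an exponential moving average of the online weights. A hedged sketch of those two ingredients (the full method symmetrizes the loss over both views and uses specific encoder/projector/predictor architectures):

```python
# Sketch of the BYOL objective and target update (illustrative, not the
# complete training loop).
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online prediction of one view
    and the target projection of the other view (targets get no gradient)."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    # Target parameters track an exponential moving average of online params.
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_(o, alpha=1 - tau)
```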
4 code implementations • NeurIPS 2020 • Carl Doersch, Ankush Gupta, Andrew Zisserman
In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains.
31 code implementations • 13 Jun 2020 • Jean-Baptiste Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
Ranked #2 on Self-Supervised Person Re-Identification on SYSU-30k. Tasks: Representation Learning, Self-Supervised Image Classification, +3
no code implementations • NeurIPS 2019 • Carl Doersch, Andrew Zisserman
In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints.
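One plausible version of such pre-processing is dense optical flow between consecutive frames; keypoint-motion extraction would be analogous. The sketch below uses OpenCV's Farneback method with typical parameter values, which is illustrative rather than the paper's exact pipeline:

```python
# Hedged sketch: extracting a dense optical-flow motion cue with OpenCV.
import cv2

def flow_between(frame_a, frame_b):
    """Returns an HxWx2 array of per-pixel (dx, dy) motion between frames."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(
        gray_a, gray_b, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```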
4 code implementations • ICML 2020 • Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aaron van den Oord
Human observers can learn to recognize new categories of images from a handful of examples, yet doing so with artificial ones remains an open challenge.
Ranked #6 on Contrastive Learning on imagenet-1k.
1 code implementation • CVPR 2019 • Anurag Arnab, Carl Doersch, Andrew Zisserman
We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos.
Ranked #1 on Monocular 3D Human Pose Estimation on Human3.6M (Use Video Sequence metric).
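The bundle-adjustment idea can be illustrated as jointly optimizing the 3D joints of all frames against 2D keypoint evidence plus a temporal smoothness prior. Here is a toy gradient-descent sketch, assuming a differentiable camera model `project` and pre-detected 2D keypoints (both hypothetical stand-ins for the paper's actual formulation and solver):

```python
# Illustrative "bundle-adjustment-style" refinement over a video.
import torch

def refine_poses(joints3d, keypoints2d, project, steps=200, smooth_w=10.0):
    # joints3d: [T, J, 3] initial estimates; keypoints2d: [T, J, 2] detections.
    joints3d = joints3d.clone().requires_grad_(True)
    opt = torch.optim.Adam([joints3d], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        reproj = ((project(joints3d) - keypoints2d) ** 2).mean()   # 2D evidence
        smooth = ((joints3d[1:] - joints3d[:-1]) ** 2).mean()      # temporal prior
        loss = reproj + smooth_w * smooth
        loss.backward()
        opt.step()
    return joints3d.detach()
```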
no code implementations • 5 Apr 2019 • Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly L. Stachenfeld, Pushmeet Kohli, Peter W. Battaglia, Jessica B. Hamrick
Our results show that agents which use structured representations (e.g., objects and scene graphs) and structured policies (e.g., object-centric actions) outperform those which use less structured representations, and generalize better beyond their training when asked to reason about larger scenes.
no code implementations • CVPR 2019 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce the Action Transformer model for recognizing and localizing human actions in video clips.
Ranked #6 on Action Recognition on AVA v2.1. Tasks: Action Recognition, Recognizing And Localizing Human Actions, +1
no code implementations • 11 Sep 2018 • Mateusz Malinowski, Carl Doersch
Visual QA is a pivotal challenge for higher-level reasoning, requiring understanding language, vision, and relationships between many objects in a scene.
no code implementations • ECCV 2018 • Mateusz Malinowski, Carl Doersch, Adam Santoro, Peter Battaglia
Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing which would be prohibitive to perform on all sensory inputs.
Ranked #7 on Visual Question Answering (VQA) on CLEVR.
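One simple way to realize such hard attention is to keep only the k spatial features with the largest L2 norm and discard the rest. The sketch below illustrates that selection step only; it is not the paper's full model:

```python
# Hedged sketch of norm-based hard attention over spatial features.
import torch

def hard_attend(features, k):
    # features: [B, N, C] spatial feature vectors; returns the [B, k, C]
    # subset with the largest feature magnitudes.
    norms = features.norm(dim=-1)        # [B, N] per-location magnitudes
    idx = norms.topk(k, dim=-1).indices  # indices of the k strongest features
    return torch.gather(
        features, 1, idx.unsqueeze(-1).expand(-1, -1, features.size(-1)))
```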
no code implementations • 26 Jul 2018 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce a simple baseline for action localization on the AVA dataset.
Ranked #12 on Action Recognition on AVA v2.1.
no code implementations • 10 Mar 2018 • Simon Schmitt, Jonathan J. Hudson, Augustin Žídek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Küttler, Andrew Zisserman, Karen Simonyan, S. M. Ali Eslami
Our method places no constraints on the architecture of the teacher or student agents, and it regulates itself to allow the students to surpass their teachers in performance.
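A kickstarting-style setup can be sketched as the student's usual RL loss plus a distillation term toward the teacher's policy, with a weight that is driven toward zero so the student is eventually free to depart from, and outperform, the teacher. The fixed weight below is a placeholder; the paper adjusts the weighting automatically during training:

```python
# Sketch of a kickstarting-style auxiliary distillation loss (illustrative).
import torch
import torch.nn.functional as F

def kickstart_loss(rl_loss, student_logits, teacher_logits, distill_w):
    # KL from the (frozen) teacher policy to the student policy.
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    distill = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       teacher_probs, reduction='batchmean')
    # Anneal distill_w toward 0 over training so the student can surpass
    # the teacher rather than merely imitate it.
    return rl_loss + distill_w * distill
```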
no code implementations • ICCV 2017 • Carl Doersch, Andrew Zisserman
We investigate methods for combining multiple self-supervised tasks, i.e., supervised tasks where data can be collected without manual labeling, in order to train a single visual representation.
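The combination can be sketched as one shared trunk with a separate head and loss per pretext task, trained jointly. The task names, heads, and use of a classification loss below are placeholders, not the paper's exact task set:

```python
# Sketch of multi-task self-supervision with a shared trunk (illustrative).
import torch
import torch.nn as nn

class MultiTaskSelfSup(nn.Module):
    def __init__(self, trunk, heads):  # heads: dict of task name -> nn.Module
        super().__init__()
        self.trunk = trunk
        self.heads = nn.ModuleDict(heads)

    def forward(self, batches):
        # batches: dict of task name -> (inputs, automatically derived labels).
        total = 0.0
        for task, (x, y) in batches.items():
            pred = self.heads[task](self.trunk(x))  # shared representation
            total = total + nn.functional.cross_entropy(pred, y)
        return total
```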
no code implementations • 25 Jun 2016 • Jacob Walker, Carl Doersch, Abhinav Gupta, Martial Hebert
We show that our method is able to successfully predict events in a wide variety of scenes and can produce multiple different predictions when the future is ambiguous.
27 code implementations • 19 Jun 2016 • Carl Doersch
In just three years, Variational Autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions.
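The tutorial's central object is the evidence lower bound: a reconstruction term plus a closed-form KL divergence pulling the encoder's posterior toward a unit Gaussian prior, optimized end-to-end via the reparameterization trick. A minimal sketch of both pieces:

```python
# Minimal sketch of the VAE objective and reparameterization trick.
import torch

def vae_loss(x, x_recon, mu, logvar):
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction='sum')
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    std = (0.5 * logvar).exp()
    # z ~ N(mu, sigma^2), written so gradients flow through mu and std.
    return mu + std * torch.randn_like(std)
```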
2 code implementations • 21 Nov 2015 • Philipp Krähenbühl, Carl Doersch, Jeff Donahue, Trevor Darrell
Convolutional Neural Networks spread through computer vision like a wildfire, impacting almost all visual tasks imaginable.
3 code implementations • ICCV 2015 • Carl Doersch, Abhinav Gupta, Alexei A. Efros
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation.
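The pretext task samples a patch and one of its eight neighbors, then asks a network to classify the neighbor's relative position. A sketch of the pair-sampling step, with patch size and gap as illustrative values (the image is assumed large enough for the sampling bounds):

```python
# Sketch of context-prediction pair sampling: 8 relative positions = 8 classes.
import random

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch=96, gap=48):
    h, w = image.shape[:2]
    step = patch + gap
    cy = random.randint(step, h - step - patch)   # center patch location
    cx = random.randint(step, w - step - patch)
    label = random.randrange(8)                   # which neighbor to sample
    dy, dx = OFFSETS[label]
    center = image[cy:cy + patch, cx:cx + patch]
    ny, nx = cy + dy * step, cx + dx * step
    neighbor = image[ny:ny + patch, nx:nx + patch]
    return center, neighbor, label                # train f(center, neighbor) -> label
```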
no code implementations • 27 Apr 2015 • Aayush Bansal, Abhinav Shrivastava, Carl Doersch, Abhinav Gupta
Building on the success of recent discriminative mid-level elements, we propose a surprisingly simple approach for object detection which performs comparably to current state-of-the-art approaches on the PASCAL VOC comp-3 detection challenge (no external data).
no code implementations • NeurIPS 2013 • Carl Doersch, Abhinav Gupta, Alexei A. Efros
We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset.
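A Purity-Coverage curve can be sketched as ranking discovered clusters by purity and tracing average purity against the fraction of the data covered as clusters are added. The input format and the exact purity definition below are assumptions for illustration, not the paper's precise formulation:

```python
# Hedged sketch of a Purity-Coverage curve for visual discovery evaluation.
import numpy as np

def purity_coverage_curve(clusters, n_total):
    # clusters: list of 1-D integer label arrays, one per discovered element.
    purities = [np.max(np.bincount(c)) / len(c) for c in clusters]
    order = np.argsort(purities)[::-1]            # most pure clusters first
    sizes = np.array([len(clusters[i]) for i in order], dtype=float)
    pur = np.array([purities[i] for i in order])
    coverage = np.cumsum(sizes) / n_total         # fraction of data covered
    avg_purity = np.cumsum(pur * sizes) / np.cumsum(sizes)
    return coverage, avg_purity                   # plot avg_purity vs coverage
```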