no code implementations • 19 Dec 2024 • João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video.
no code implementations • 3 Dec 2024 • Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions.
no code implementations • 8 Nov 2024 • Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf
By using a combination of cross-attention and positional embeddings we disentangle the representation structure and image structure.
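A minimal sketch of this readout pattern, assuming a set of latent tokens decoded by per-pixel positional-embedding queries (names and sizes below are our own illustration, not the paper's code):

```python
import torch
import torch.nn as nn

# Latent tokens carry scene content with no fixed grid layout; a positional
# query per pixel cross-attends into them, so image structure lives entirely
# in the queries. Sizes are arbitrary and purely illustrative.
dim, n_tokens, H, W = 64, 32, 16, 16
tokens = torch.randn(1, n_tokens, dim)    # set-structured representation
pos_queries = torch.randn(1, H * W, dim)  # per-pixel positional queries (learned in practice)
readout = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
to_rgb = nn.Linear(dim, 3)

feats, _ = readout(pos_queries, tokens, tokens)  # queries: positions; keys/values: content
image = to_rgb(feats).reshape(1, H, W, 3)
print(image.shape)  # torch.Size([1, 16, 16, 3])
```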
no code implementations • 24 Sep 2024 • Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
To train the policy, we use an order of magnitude less robot interaction data than the video prediction model was trained on.
2 code implementations • 8 Jul 2024 • Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D).
2 code implementations • 1 Feb 2024 • Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes.
Ranked #1 on Point Tracking on TAP-Vid-RGB-Stacking
no code implementations • CVPR 2024 • João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling.
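A minimal sketch of the single-stream setting, assuming next-frame pixel prediction as the task (our simplification; the paper's tasks and architectures are richer):

```python
import torch
import torch.nn as nn

# One gradient step per incoming frame: batch size 1, stream order preserved,
# no shuffling, no augmentation, no replay buffer.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3 * 32 * 32))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

stream = (torch.rand(1, 3, 32, 32) for _ in range(100))  # stand-in video stream
prev = next(stream)
for frame in stream:
    pred = model(prev)                              # predict the next frame from the current one
    loss = (pred - frame.flatten(1)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    prev = frame
print(loss.item())
```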
no code implementations • 30 Aug 2023 • Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, Jon Scholz
For robots to be useful outside labs and specialized factories, we need a way to teach them useful new behaviors quickly.
3 code implementations • ICCV 2023 • Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Ranked #1 on Visual Tracking on Kinetics
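To make the TAP task's input/output contract concrete, here is a hypothetical template-matching baseline (not TAPIR itself): given a clip and a query point, it returns a position per frame plus a matching score that crudely stands in for the occlusion estimate real TAP models predict.

```python
import cv2
import numpy as np

def track_point(video, t0, x0, y0, half=8):
    """Track one query point by matching its RGB patch against every frame."""
    template = video[t0, y0 - half:y0 + half, x0 - half:x0 + half]
    tracks = []
    for frame in video:
        scores = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
        _, best, _, (x, y) = cv2.minMaxLoc(scores)
        tracks.append((x + half, y + half, best))  # patch center + confidence
    return np.array(tracks)                        # (T, 3): x, y, score per frame

video = np.random.randint(0, 255, (5, 64, 64, 3), np.uint8)  # stand-in clip
print(track_point(video, t0=0, x0=32, y0=32).shape)          # (5, 3)
```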
1 code implementation • NeurIPS 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g., Flamingo, SeViLA, or GPT-4).
Ranked #1 on Point Tracking on Perception Test
3 code implementations • 7 Nov 2022 • Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move.
1 code implementation • DeepMind 2022 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal benchmark – the Perception Test – that aims to extensively evaluate perception and reasoning skills of multimodal models.
1 code implementation • CVPR 2022 • Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details.
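Kubric itself is open source; a minimal scene-and-render script looks roughly like the following, adapted from the project's hello-world example (exact API details may vary across versions):

```python
import kubric as kb
from kubric.renderer.blender import Blender as KubricRenderer

# Build a tiny synthetic scene and render a single frame.
scene = kb.Scene(resolution=(256, 256))
renderer = KubricRenderer(scene)

scene += kb.Cube(name="floor", scale=(10, 10, 0.1), position=(0, 0, -0.1))
scene += kb.Sphere(name="ball", scale=1, position=(0, 0, 1.0))
scene += kb.DirectionalLight(name="sun", position=(-1, -0.5, 3),
                             look_at=(0, 0, 0), intensity=1.5)
scene += kb.PerspectiveCamera(name="camera", position=(3, -1, 4),
                              look_at=(0, 0, 1))

frame = renderer.render_still()
kb.write_png(frame["rgba"], "output.png")
```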
no code implementations • CVPR 2022 • Wang Yifan, Carl Doersch, Relja Arandjelović, João Carreira, Andrew Zisserman
Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases.
8 code implementations • ICLR 2022 • Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible.
Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric)
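The core Perceiver IO pattern is compact enough to sketch. The toy module below is our illustration, not DeepMind's implementation: a fixed-size latent array encodes arbitrary-length inputs via cross-attention, and task-specific output queries decode from the latents, avoiding quadratic self-attention over the inputs.

```python
import torch
import torch.nn as nn

class TinyPerceiverIO(nn.Module):
    def __init__(self, dim=64, n_latents=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.encode = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.decode = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, inputs, queries):
        lat = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        lat, _ = self.encode(lat, inputs, inputs)  # latents attend to the inputs
        lat = self.process(lat)                    # self-attention in latent space only
        out, _ = self.decode(queries, lat, lat)    # output queries attend to latents
        return out

x = torch.randn(2, 1000, 64)          # e.g. 1000 tokens from any modality
q = torch.randn(2, 10, 64)            # one query per desired output element
print(TinyPerceiverIO()(x, q).shape)  # torch.Size([2, 10, 64])
```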
8 code implementations • NeurIPS 2020 • Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
6 code implementations • NeurIPS 2020 • Carl Doersch, Ankush Gupta, Andrew Zisserman
In this work, we illustrate how the neural network representations that underpin modern vision systems are subject to supervision collapse: they lose any information not needed for the training task, including information that may be necessary for transfer to new tasks or domains.
30 code implementations • 13 Jun 2020 • Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
Ranked #2 on Self-Supervised Person Re-Identification on SYSU-30k
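The BYOL update fits in a few lines. Below is a minimal paraphrase (the real recipe uses a ResNet encoder, an MLP projector, a symmetrized loss, and LARS): an online network plus predictor regresses the L2-normalized output of an exponential-moving-average target network, with no negative pairs.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

online = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
predictor = nn.Linear(128, 128)  # only the online branch has a predictor
target = copy.deepcopy(online)   # target network: EMA copy of the online one
for p in target.parameters():
    p.requires_grad = False
opt = torch.optim.SGD(list(online.parameters()) + list(predictor.parameters()), lr=0.1)

def byol_step(view1, view2, tau=0.99):
    pred = F.normalize(predictor(online(view1)), dim=-1)  # online prediction
    with torch.no_grad():
        tgt = F.normalize(target(view2), dim=-1)          # target projection, no gradient
    loss = 2 - 2 * (pred * tgt).sum(dim=-1).mean()        # MSE of L2-normalized vectors
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                 # EMA update of the target
        for po, pt in zip(online.parameters(), target.parameters()):
            pt.mul_(tau).add_((1 - tau) * po)
    return loss.item()

v1, v2 = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)  # two augmented views
print(byol_step(v1, v2))
```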
no code implementations • NeurIPS 2019 • Carl Doersch, Andrew Zisserman
In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints.
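As an illustration of that kind of motion preprocessing (our example, using OpenCV's Farneback flow rather than necessarily the paper's method), raw RGB frames can be replaced with dense optical flow before training:

```python
import cv2
import numpy as np

def rgb_video_to_flow(frames):
    """frames: sequence of (H, W, 3) uint8 images -> (T-1, H, W, 2) flow field."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = [
        cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        for a, b in zip(grays[:-1], grays[1:])
    ]
    return np.stack(flows)

frames = [np.random.randint(0, 255, (64, 64, 3), np.uint8) for _ in range(4)]
print(rgb_video_to_flow(frames).shape)  # (3, 64, 64, 2)
```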
4 code implementations • ICML 2020 • Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aaron van den Oord
Human observers can learn to recognize new categories of images from a handful of examples, yet doing so with artificial ones remains an open challenge.
Ranked #6 on Contrastive Learning on imagenet-1k
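At the heart of CPC-style training is an InfoNCE contrastive loss. A generic sketch (not DeepMind's implementation), in which each anchor must identify its positive among the rest of the batch:

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(a.size(0))  # matching pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)  # paired embeddings
print(info_nce(z1, z2).item())
```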
1 code implementation • CVPR 2019 • Anurag Arnab, Carl Doersch, Andrew Zisserman
We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos.
Ranked #1 on Monocular 3D Human Pose Estimation on Human3.6M (Use Video Sequence metric)
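A heavily simplified sketch of the bundle-adjustment flavor of the method (a toy pinhole camera and made-up weights, ours rather than the paper's formulation): optimize the full sequence of 3D joints jointly to fit 2D detections under a temporal-smoothness prior, instead of lifting each frame independently.

```python
import torch

T, J = 30, 17                        # frames, joints
detections_2d = torch.rand(T, J, 2)  # stand-in 2D keypoint detections
joints_3d = torch.zeros(T, J, 3, requires_grad=True)
opt = torch.optim.Adam([joints_3d], lr=1e-2)

for _ in range(200):
    depth = joints_3d[..., 2:] + 5.0               # offset keeps depth positive
    proj = joints_3d[..., :2] / depth              # toy pinhole projection
    reproj = (proj - detections_2d).pow(2).mean()  # data term
    smooth = (joints_3d[1:] - joints_3d[:-1]).pow(2).mean()  # temporal prior
    loss = reproj + 0.1 * smooth
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```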
no code implementations • 5 Apr 2019 • Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly L. Stachenfeld, Pushmeet Kohli, Peter W. Battaglia, Jessica B. Hamrick
Our results show that agents which use structured representations (e.g., objects and scene graphs) and structured policies (e.g., object-centric actions) outperform those which use less structured representations, and generalize better beyond their training when asked to reason about larger scenes.
no code implementations • CVPR 2019 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce the Action Transformer model for recognizing and localizing human actions in video clips.
Ranked #6 on Action Recognition on AVA v2.1
no code implementations • 11 Sep 2018 • Mateusz Malinowski, Carl Doersch
Visual QA is a pivotal challenge for higher-level reasoning, requiring understanding language, vision, and relationships between many objects in a scene.
no code implementations • ECCV 2018 • Mateusz Malinowski, Carl Doersch, Adam Santoro, Peter Battaglia
Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing which would be prohibitive to perform on all sensory inputs.
Ranked #8 on Visual Question Answering (VQA) on CLEVR
no code implementations • 26 Jul 2018 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce a simple baseline for action localization on the AVA dataset.
Ranked #12 on Action Recognition on AVA v2.1
no code implementations • 10 Mar 2018 • Simon Schmitt, Jonathan J. Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, S. M. Ali Eslami
Our method places no constraints on the architecture of the teacher or student agents, and it regulates itself to allow the students to surpass their teachers in performance.
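A sketch of the distillation term as we understand it (illustrative only; the paper's losses and weighting schedule are more involved): the student's RL objective is augmented with a KL term toward the teacher's policy, and annealing that term's weight toward zero frees the student to surpass the teacher.

```python
import torch
import torch.nn.functional as F

def kickstart_loss(student_logits, teacher_logits, rl_loss, distill_weight):
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )  # KL(teacher || student)
    return rl_loss + distill_weight * kl

s, t = torch.randn(8, 6), torch.randn(8, 6)  # logits over 6 actions
print(kickstart_loss(s, t, rl_loss=torch.tensor(1.0), distill_weight=0.5).item())
```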
no code implementations • ICCV 2017 • Carl Doersch, Andrew Zisserman
We investigate methods for combining multiple self-supervised tasks--i.e., supervised tasks where data can be collected without manual labeling--in order to train a single visual representation.
Ranked #45 on Self-Supervised Image Classification on ImageNet (Top 5 Accuracy metric)
no code implementations • 25 Jun 2016 • Jacob Walker, Carl Doersch, Abhinav Gupta, Martial Hebert
We show that our method is able to successfully predict events in a wide variety of scenes and can produce multiple different predictions when the future is ambiguous.
27 code implementations • 19 Jun 2016 • Carl Doersch
In just three years, Variational Autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions.
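The tutorial's subject reduces to a few lines of code. A minimal VAE sketch with the reparameterization trick and an ELBO loss (sizes are illustrative, e.g. flattened MNIST):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x, recon_logits, mu, logvar):
    rec = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0,I))
    return rec + kl  # negative ELBO, to be minimized

x = torch.rand(8, 784)  # stand-in for flattened MNIST digits
recon, mu, logvar = TinyVAE()(x)
print(elbo_loss(x, recon, mu, logvar).item())
```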
2 code implementations • 21 Nov 2015 • Philipp Krähenbühl, Carl Doersch, Jeff Donahue, Trevor Darrell
Convolutional Neural Networks spread through computer vision like a wildfire, impacting almost all visual tasks imaginable.
3 code implementations • ICCV 2015 • Carl Doersch, Abhinav Gupta, Alexei A. Efros
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation.
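The pretext task itself is easy to sketch (our illustration, with hypothetical patch sizes): sample a patch and one of its eight neighbors, then train a network to classify the neighbor's relative position; the only label is the image layout itself.

```python
import numpy as np

def sample_patch_pair(image, patch=16, gap=4):
    """Sample a center patch and one of its 8 neighbors; the label (0-7)
    is the neighbor's relative position."""
    H, W, _ = image.shape
    step = patch + gap
    cy = np.random.randint(step, H - step - patch)
    cx = np.random.randint(step, W - step - patch)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    label = np.random.randint(8)
    dy, dx = offsets[label]
    center = image[cy:cy + patch, cx:cx + patch]
    neighbor = image[cy + dy * step:cy + dy * step + patch,
                     cx + dx * step:cx + dx * step + patch]
    return center, neighbor, label  # train a siamese CNN to predict the label

img = np.random.rand(128, 128, 3)
c, n, y = sample_patch_pair(img)
print(c.shape, n.shape, y)  # (16, 16, 3) (16, 16, 3) and a label in 0..7
```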
no code implementations • 27 Apr 2015 • Aayush Bansal, Abhinav Shrivastava, Carl Doersch, Abhinav Gupta
Building on the success of recent discriminative mid-level elements, we propose a surprisingly simple approach for object detection that performs comparably to current state-of-the-art approaches on the PASCAL VOC comp-3 detection challenge (no external data).
no code implementations • NeurIPS 2013 • Carl Doersch, Abhinav Gupta, Alexei A. Efros
We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset.