no code implementations • ECCV 2020 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic
The annotation is particularly difficult for temporal action localization where large parts of the video present no action, or background.
no code implementations • 3 May 2023 • Chuhan Zhang, Antoine Miech, Jiajun Shen, Jean-Baptiste Alayrac, Pauline Luc
Large-scale visual language models are widely used as pre-trained models and then adapted for various downstream tasks.
no code implementations • 23 Mar 2023 • Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman
The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes.
1 code implementation • 23 Jan 2023 • Adrià Recasens, Jason Lin, Joāo Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering.
1 code implementation • 24 Nov 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.
5 code implementations • DeepMind 2022 • Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Ranked #1 on
Action Recognition
on RareAct
1 code implementation • CVPR 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
In this paper, we seek to temporally localize object states (e. g. "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.
3 code implementations • 15 Feb 2022 • Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression.
7 code implementations • ICLR 2022 • Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, Joāo Carreira
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible.
Ranked #1 on
Optical Flow Estimation
on KITTI 2015
(Average End-Point Error metric)
1 code implementation • 1 May 2021 • Chrisantha Fernando, S. M. Ali Eslami, Jean-Baptiste Alayrac, Piotr Mirowski, Dylan Banarse, Simon Osindero
Whilst there are perhaps only a few scientific methods, there seem to be almost as many artistic methods as there are artists.
no code implementations • EMNLP 2021 • Rémi Leblond, Jean-Baptiste Alayrac, Laurent SIfre, Miruna Pislar, Jean-Baptiste Lespiau, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals
Beam search is the go-to method for decoding auto-regressive machine translation models.
1 code implementation • ICCV 2021 • Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Ross Hemsley, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-bastien Grill, Aäron van den Oord, Andrew Zisserman
Most successful self-supervised learning methods are trained to align the representations of two independent views from the data.
Ranked #1 on
Self-Supervised Audio Classification
on ESC-50
no code implementations • CVPR 2021 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
2 code implementations • ICCV 2021 • Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira
Self-supervised pretraining has been shown to yield powerful representations for transfer learning.
Ranked #52 on
Semantic Segmentation
on Cityscapes val
(using extra training data)
1 code implementation • 31 Jan 2021 • Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh
Recently multimodal transformer models have gained popularity because their performance on language and vision tasks suggest they learn rich visual-linguistic representations.
1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".
1 code implementation • NeurIPS 2020 • Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
1 code implementation • ACL 2020 • Daniel Fried, Jean-Baptiste Alayrac, Phil Blunsom, Chris Dyer, Stephen Clark, Aida Nematzadeh
We apply a generative segmental model of task structure, guided by narration, to action segmentation in video.
1 code implementation • CVPR 2020 • Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman
Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.
4 code implementations • CVPR 2020 • Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
Annotating videos is cumbersome, expensive and not scalable.
Ranked #3 on
Action Recognition
on RareAct
no code implementations • ICCV 2019 • Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman
The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.
4 code implementations • ICCV 2019 • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
Ranked #4 on
Temporal Action Localization
on CrossTask
1 code implementation • NeurIPS 2019 • Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli
Recent work has uncovered the interesting (and somewhat surprising) finding that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification.
2 code implementations • CVPR 2019 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.
Ranked #5 on
Temporal Action Localization
on CrossTask
1 code implementation • CVPR 2019 • Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman
True video understanding requires making sense of non-lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain.
no code implementations • 22 Sep 2018 • Meera Hahn, Nataniel Ruiz, Jean-Baptiste Alayrac, Ivan Laptev, James M. Rehg
Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.
1 code implementation • NeurIPS 2018 • Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, Cordelia Schmid
Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization.
2 code implementations • ICCV 2017 • Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic
Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks.
Ranked #32 on
Video Retrieval
on LSMDC
1 code implementation • ICLR 2018 • Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, Simon Lacoste-Julien
We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction.
1 code implementation • ICCV 2017 • Jean-Baptiste Alayrac, Josev Sivic, Ivan Laptev, Simon Lacoste-Julien
We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision.
no code implementations • 30 May 2016 • Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, Simon Lacoste-Julien
In this paper, we propose several improvements on the block-coordinate Frank-Wolfe (BCFW) algorithm from Lacoste-Julien et al. (2013) recently used to optimize the structured support vector machine (SSVM) objective in the context of structured prediction, though it has wider applications.
no code implementations • CVPR 2016 • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien
Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.
Ranked #7 on
Temporal Action Localization
on CrossTask