Search Results for author: Jean-Baptiste Alayrac

Found 32 papers, 23 papers with code

Learning Actionness via Long-range Temporal Order Verification

no code implementations • ECCV 2020 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic

Annotation is particularly difficult for temporal action localization, where large parts of the video contain no action, i.e., background.

Action Recognition · Temporal Action Localization

Three ways to improve feature alignment for open vocabulary detection

no code implementations • 23 Mar 2023 • Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman

The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes.

Language Modelling

Zorro: the masked multimodal transformer

1 code implementation • 23 Jan 2023 • Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering.

Audio Tagging · Multimodal Deep Learning +2
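To make the concatenation idea concrete, here is a toy PyTorch sketch of Zorro-style masked attention: each unimodal stream attends only within its own modality, while a small set of fusion tokens attends to everything. All sizes and names are illustrative assumptions, not the paper's code.

```python
import torch

def zorro_attention_mask(n_audio: int, n_video: int, n_fusion: int) -> torch.Tensor:
    """Boolean mask, True where a query token is allowed to attend to a key token."""
    n = n_audio + n_video + n_fusion
    allowed = torch.zeros(n, n, dtype=torch.bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)
    allowed[a, a] = True   # audio tokens attend only to audio tokens
    allowed[v, v] = True   # video tokens attend only to video tokens
    allowed[f, :] = True   # fusion tokens attend to the full concatenated sequence
    return allowed

# One backbone over the concatenated sequence; the mask does the (non-)fusion.
tokens = torch.randn(2, 16 + 32 + 4, 256)                 # (batch, tokens, dim)
attn = torch.nn.MultiheadAttention(256, num_heads=8, batch_first=True)
blocked = ~zorro_attention_mask(16, 32, 4)                # MHA masks out True entries
out, _ = attn(tokens, tokens, tokens, attn_mask=blocked)
```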

Multi-Task Learning of Object State Changes from Uncurated Videos

1 code implementation • 24 Nov 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.

Multi-Task Learning · Self-Supervised Learning +1

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

1 code implementation • CVPR 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

In this paper, we seek to temporally localize object states (e.g., "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.

Generative Art Using Neural Visual Grammars and Dual Encoders

1 code implementation • 1 May 2021 • Chrisantha Fernando, S. M. Ali Eslami, Jean-Baptiste Alayrac, Piotr Mirowski, Dylan Banarse, Simon Osindero

Whilst there are perhaps only a few scientific methods, there seem to be almost as many artistic methods as there are artists.

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

1 code implementation • 31 Jan 2021 • Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh

Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests that they learn rich visual-linguistic representations.

Image Retrieval · Retrieval +2

RareAct: A video dataset of unusual interactions

1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".

Action Recognition

Self-Supervised MultiModal Versatile Networks

1 code implementation • NeurIPS 2020 • Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.

Action Recognition In Videos · Audio Classification +2
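A minimal sketch of one way to realize this design, in the spirit of the paper's fine and coarse spaces: video and audio meet in a fine-grained space, and a further projection carries them into a coarser space where text is embedded directly. Dimensions and module names below are assumptions.

```python
import torch
import torch.nn as nn

class FACEmbedder(nn.Module):
    """Video/audio meet in a fine-grained space; text joins in a coarser one."""
    def __init__(self, d_video=1024, d_audio=512, d_text=300,
                 d_fine=512, d_coarse=256):
        super().__init__()
        self.video_to_fine = nn.Linear(d_video, d_fine)
        self.audio_to_fine = nn.Linear(d_audio, d_fine)
        self.fine_to_coarse = nn.Linear(d_fine, d_coarse)   # va -> vat space
        self.text_to_coarse = nn.Linear(d_text, d_coarse)   # text enters coarsely

    def forward(self, video, audio, text):
        v_fine = self.video_to_fine(video)          # fine-grained video-audio space
        a_fine = self.audio_to_fine(audio)
        v_coarse = self.fine_to_coarse(v_fine)      # projected into the vat space
        t_coarse = self.text_to_coarse(text)
        return (v_fine, a_fine), (v_coarse, t_coarse)

# Contrastive losses would then pair (v_fine, a_fine) and (v_coarse, t_coarse).
```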

Visual Grounding in Video for Unsupervised Word Translation

1 code implementation • CVPR 2020 • Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.

Translation · Visual Grounding +1
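As a toy illustration of point (i): once both vocabularies are embedded in the shared visually grounded space, a word can be mapped across languages by a cosine nearest-neighbor lookup. The sketch below assumes precomputed embedding matrices; it is not the paper's MUVE pipeline.

```python
import numpy as np

def nearest_neighbor_translate(src_vecs, tgt_vecs, tgt_words):
    """Map each source word to the closest target word in the shared space."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)    # cosine-similarity lookup
    return [tgt_words[j] for j in nearest]
```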

Controllable Attention for Structured Layered Video Decomposition

no code implementations • ICCV 2019 • Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman

The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.

Action Recognition · Reflection Removal

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

4 code implementations • ICCV 2019 • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

Action Localization · Retrieval +2
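A hedged sketch of the kind of joint embedding objective such narrated clips enable: a max-margin ranking loss, as used by the paper, that pulls each clip toward its own narration and pushes it away from others. The margin value and in-batch negatives here are assumptions.

```python
import torch
import torch.nn.functional as F

def max_margin_ranking_loss(video_emb, text_emb, margin=0.1):
    """Row i of each (batch, dim) matrix is a clip and its own narration."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sims = v @ t.T                          # (batch, batch) cosine similarities
    pos = sims.diag().unsqueeze(1)          # similarity of matched pairs
    # Hinge on every mismatched pair, in both retrieval directions.
    loss_v2t = F.relu(margin + sims - pos).fill_diagonal_(0).mean()
    loss_t2v = F.relu(margin + sims.t() - pos).fill_diagonal_(0).mean()
    return loss_v2t + loss_t2v
```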

Are Labels Required for Improving Adversarial Robustness?

1 code implementation • NeurIPS 2019 • Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli

Recent work has uncovered the interesting (and somewhat surprising) finding that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification.

Adversarial Robustness
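One natural reading of this finding, sketched below in PyTorch: pseudo-labels from a standard classifier can stand in for real labels when adversarially training on unlabeled data. The PGD settings and function names are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """L-infinity PGD around x, maximizing cross-entropy against labels y."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv

def unsupervised_adv_step(model, teacher, x_unlabeled, optimizer):
    """Pseudo-label unlabeled images, then train on adversarial examples."""
    with torch.no_grad():
        pseudo = teacher(x_unlabeled).argmax(dim=1)   # labels from a standard model
    x_adv = pgd_attack(model, x_unlabeled, pseudo)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), pseudo)
    loss.backward()
    optimizer.step()
    return loss.item()
```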

Cross-task weakly supervised learning from instructional videos

2 code implementations • CVPR 2019 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.

Weakly-supervised Learning
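The ordered list of steps acts as a structural constraint on where each step can occur. As a simplified illustration of how such an ordering constraint can be enforced, the dynamic program below picks one frame per step so that steps appear in the given order while maximizing per-frame step scores; it is a toy under stated assumptions, not the paper's exact inference.

```python
import numpy as np

def best_ordered_step_times(scores):
    """scores: (T, K) per-frame scores for K ordered steps, assuming T >= K.
    Choose one frame per step with strictly increasing times, maximizing
    the total score, via dynamic programming."""
    T, K = scores.shape
    dp = np.full((T, K), -np.inf)
    dp[:, 0] = scores[:, 0]
    for k in range(1, K):
        best_prev = np.maximum.accumulate(dp[:, k - 1])  # best over t' <= t
        dp[1:, k] = best_prev[:-1] + scores[1:, k]       # step k strictly later
    times = np.zeros(K, dtype=int)
    times[-1] = int(np.argmax(dp[:, -1]))
    for k in range(K - 2, -1, -1):                       # backtrack
        times[k] = int(np.argmax(dp[: times[k + 1], k]))
    return times
```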

The Visual Centrifuge: Model-Free Layered Video Representations

1 code implementation • CVPR 2019 • Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman

True video understanding requires making sense of non-Lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple media -- colored windows, dirty mirrors, smoke or rain.

Color Constancy · Video Understanding

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions

no code implementations • 22 Sep 2018 • Meera Hahn, Nataniel Ruiz, Jean-Baptiste Alayrac, Ivan Laptev, James M. Rehg

Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.

Object Recognition

SEARNN: Training RNNs with Global-Local Losses

1 code implementation • ICLR 2018 • Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, Simon Lacoste-Julien

We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction.

Machine Translation · Optical Character Recognition (OCR) +4
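A heavily reduced sketch of the L2S flavor of training: roll in with the model, price every candidate token at each step by the cost of a rollout that commits to it, and apply a cost-sensitive local loss. Here `rnn_step` and `rollout_cost` are abstracted assumptions, and the full-vocabulary rollout is the naive version; the paper also studies cheaper sampling schemes.

```python
import torch
import torch.nn.functional as F

def searnn_local_loss(step_logits, candidate_costs):
    """Cost-sensitive local loss: log-loss against the cheapest candidate token."""
    target = candidate_costs.argmin().unsqueeze(0)
    return F.cross_entropy(step_logits.unsqueeze(0), target)

def searnn_sequence_loss(rnn_step, rollout_cost, inputs, hidden, vocab_size):
    """Roll in with the model; at each step, price every token by a rollout."""
    total = 0.0
    for x_t in inputs:
        logits, hidden = rnn_step(x_t, hidden)    # logits: (vocab,)
        costs = torch.stack([rollout_cost(tok, hidden) for tok in range(vocab_size)])
        total = total + searnn_local_loss(logits, costs)
    return total
```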

Joint Discovery of Object States and Manipulation Actions

1 code implementation • ICCV 2017 • Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision.

Action Recognition · Clustering +1

Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs

no code implementations • 30 May 2016 • Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, Simon Lacoste-Julien

In this paper, we propose several improvements on the block-coordinate Frank-Wolfe (BCFW) algorithm of Lacoste-Julien et al. (2013), recently used to optimize the structured support vector machine (SSVM) objective in the context of structured prediction, though it has wider applications.

Structured Prediction
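For orientation, here is an illustrative NumPy sketch of BCFW on the SSVM with gap-based sampling of training examples, one flavor of the improvements studied; the feature map `psi`, task `loss`, and loss-augmented decoder `max_oracle` are problem-specific and assumed given.

```python
import numpy as np

def bcfw_gap_sampling(psi, loss, max_oracle, n, d, lam, n_iters, seed=0):
    """psi(i, y): feature difference phi(x_i, y_i) - phi(x_i, y), shape (d,).
    loss(i, y): task loss L(y_i, y).  max_oracle(w, i): loss-augmented decoding."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                            # w is the sum of all blocks w_i
    w_blocks = np.zeros((n, d))
    l_blocks = np.zeros(n)
    gaps = np.ones(n)                          # uniform until gaps are observed
    for _ in range(n_iters):
        i = rng.choice(n, p=gaps / gaps.sum())      # favor large-gap blocks
        y_hat = max_oracle(w, i)                    # loss-augmented decoding
        w_s = psi(i, y_hat) / (lam * n)             # Frank-Wolfe corner, block i
        l_s = loss(i, y_hat) / n
        gap = lam * (w_blocks[i] - w_s) @ w - l_blocks[i] + l_s
        gaps[i] = max(gap, 1e-12)                   # this block's duality gap
        denom = lam * np.dot(w_blocks[i] - w_s, w_blocks[i] - w_s)
        gamma = float(np.clip(gap / denom, 0.0, 1.0)) if denom > 0 else 0.0
        w_new = (1 - gamma) * w_blocks[i] + gamma * w_s  # line-search step
        w += w_new - w_blocks[i]                    # keep the global sum in sync
        w_blocks[i] = w_new
        l_blocks[i] = (1 - gamma) * l_blocks[i] + gamma * l_s
    return w
```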

Unsupervised Learning from Narrated Instruction Videos

no code implementations • CVPR 2016 • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

Clustering
