no code implementations • ECCV 2020 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic
The annotation is particularly difficult for temporal action localization, where large parts of the video contain no action, i.e., background.
1 code implementation • 27 Feb 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
no code implementations • 20 Dec 2022 • Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as an image.
no code implementations • 14 Dec 2022 • Alaaeldin El-Nouby, Matthew J. Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, Hervé Jégou
In this work, we attempt to bring these lines of research closer by revisiting vector quantization for image compression.
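As background for this entry, here is a minimal numpy sketch of plain vector quantization: each feature vector is replaced by the index of its nearest codebook entry, and reconstruction looks that index back up. All sizes and names are illustrative assumptions, not the paper's model.

```python
# A minimal sketch of vector quantization for compression (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))     # 256 codewords of dimension 8
vectors = rng.normal(size=(1000, 8))     # e.g., flattened image patches

# Encode: index of the nearest codeword (squared Euclidean distance).
d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d2.argmin(axis=1)                # 1000 indices; each fits in one byte

# Decode: look the codewords back up; quality depends on the codebook size.
reconstruction = codebook[codes]
mse = ((vectors - reconstruction) ** 2).mean()
print(f"{codes.size} one-byte codes, reconstruction MSE = {mse:.3f}")
```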
1 code implementation • 24 Nov 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.
1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
no code implementations • 19 Sep 2022 • Quentin Le Lidec, Wilson Jallet, Ivan Laptev, Cordelia Schmid, Justin Carpentier
Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages.
1 code implementation • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.
1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.
1 code implementation • 26 Jul 2022 • Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev
We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.
1 code implementation • 16 Jun 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.
Ranked #1 on Zero-Shot Learning on iVQA (using extra training data)
no code implementations • 10 May 2022 • Robin Strudel, Ivan Laptev, Cordelia Schmid
Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.
1 code implementation • 10 May 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.
1 code implementation • CVPR 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
Ranked #1 on Spatio-Temporal Video Grounding on VidSTG. Related tasks: Language-Based Temporal Localization, Natural Language Visual Grounding, +5 more
1 code implementation • CVPR 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
In this paper, we seek to temporally localize object states (e.g., "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.
1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
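A hypothetical sketch of the dynamic combination idea: per-candidate action scores from a fine (local) and a coarse (global-map) scale are blended with a state-dependent scalar gate. This stand-in ignores the graph-transformer machinery; all names and shapes are assumptions.

```python
# Gate-based fusion of fine- and coarse-scale action scores (illustrative).
import numpy as np

def fuse_scores(fine_scores, coarse_scores, state_feat, w, b=0.0):
    """Blend per-candidate scores from two scales with a learned sigmoid gate."""
    gate = 1.0 / (1.0 + np.exp(-(state_feat @ w + b)))   # scalar in [0, 1]
    return gate * fine_scores + (1.0 - gate) * coarse_scores

rng = np.random.default_rng(0)
fine = rng.normal(size=5)      # scores over 5 navigable candidates (local view)
coarse = rng.normal(size=5)    # scores for the same candidates (global map)
state = rng.normal(size=16)    # agent state feature
print(fuse_scores(fine, coarse, state, rng.normal(size=16)).argmax())
```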
no code implementations • 20 Dec 2021 • Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave
Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets from different domains.
no code implementations • 2 Nov 2021 • Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic
First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions.
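Contact-dynamics formulations of this kind typically build on the rigid-body equations of motion. One common form (an assumption here, not a quote of the paper's exact objective) couples motion, joint torques, and contact forces as

$$M(q)\,\ddot{q} + C(q,\dot{q})\,\dot{q} + g(q) = \tau + \sum_{k} J_{c_k}(q)^{\top} f_{c_k},$$

where $q$ is the pose, $M$ the mass matrix, $C$ the Coriolis term, $g$ gravity, $\tau$ the joint torques, and $f_{c_k}$ the force at contact $k$ mapped through the contact Jacobian $J_{c_k}$.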
1 code implementation • NeurIPS 2021 • ShiZhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
Ranked #3 on Vision and Language Navigation on RxR
no code implementations • NeurIPS 2021 • Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier
Notably, images depend both on the properties of observed scenes and on the process of image formation.
no code implementations • 9 Sep 2021 • Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys
Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.
1 code implementation • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #3 on Vision and Language Navigation on VLN Challenge
1 code implementation • 16 Aug 2021 • Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid
Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos.
no code implementations • 1 Jul 2021 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev
Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.
10 code implementations • NeurIPS 2021 • Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
Ranked #41 on Instance Segmentation on COCO minival
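The "transposed" attention described in this entry can be sketched as follows: the attention map is a d x d matrix built from L2-normalized keys and queries, so cost scales with the feature dimension rather than the token count. This is a simplified numpy rendering; the softmax axis and normalization details are assumptions, not the authors' implementation.

```python
# Cross-covariance ("transposed") attention, simplified (illustrative only).
import numpy as np

def xc_attention(q, k, v, tau=1.0):
    """q, k, v: (n_tokens, d). Returns (n_tokens, d)."""
    qn = q / np.linalg.norm(q, axis=0, keepdims=True)   # normalize over tokens
    kn = k / np.linalg.norm(k, axis=0, keepdims=True)
    logits = (kn.T @ qn) / tau                          # (d, d) cross-covariance
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)       # softmax over channels
    return v @ attn                                     # mixes channels, not tokens

rng = np.random.default_rng(0)
n, d = 196, 64                                          # e.g., 14x14 patch tokens
out = xc_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                   rng.normal(size=(n, d)))
print(out.shape)                                        # (196, 64)
```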
6 code implementations • ICCV 2021 • Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
In this paper we introduce Segmenter, a transformer model for semantic segmentation.
Ranked #14 on Semantic Segmentation on PASCAL Context
no code implementations • CVPR 2021 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
1 code implementation • 10 Feb 2021 • Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification.
1 code implementation • ICCV 2021 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.
Ranked #2 on Zero-Shot Learning on How2QA (using extra training data)
1 code implementation • 13 Nov 2020 • Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic
We evaluate our method on simple single- and two-object actions from the Something-Something dataset.
1 code implementation • 25 Aug 2020 • Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid
Motion planning and obstacle avoidance is a key challenge in robotics applications.
Task: Robotics
1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
no code implementations • 30 Apr 2020 • Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux
In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions.
no code implementations • CVPR 2020 • Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid
Modeling hand-object manipulations is essential for understanding how humans interact with their environment.
no code implementations • 15 Apr 2020 • Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid
We then show the success of our visual policies for building arches from different primitives.
1 code implementation • CVPR 2020 • Anna Kukleva, Makarand Tapaswi, Ivan Laptev
Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.
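A hypothetical sketch of learning from clip-level weak labels via multiple instance learning: each clip contains many candidate character pairs, the clip score is the maximum over candidates, and only the clip label supervises it. This illustrates the weak-supervision principle, not the paper's exact model.

```python
# Multiple-instance loss with clip-level labels (illustrative only).
import numpy as np

def clip_loss(candidate_scores, clip_label):
    """candidate_scores: raw scores for each candidate pair in one clip."""
    p = 1.0 / (1.0 + np.exp(-candidate_scores.max()))   # best candidate wins
    return -(clip_label * np.log(p) + (1 - clip_label) * np.log(1 - p))

rng = np.random.default_rng(0)
scores = rng.normal(size=12)            # 12 candidate pairs in the clip
print(clip_loss(scores, clip_label=1))  # gradient would flow to the argmax
```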
1 code implementation • CVPR 2020 • Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen
We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations.
4 code implementations • CVPR 2020 • Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
Annotating videos is cumbersome, expensive and not scalable.
Ranked #3 on Action Recognition on RareAct
1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.
1 code implementation • 2 Aug 2019 • Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision.
4 code implementations • ICCV 2019 • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
Ranked #4 on Temporal Action Localization on CrossTask
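The general recipe for learning a joint video-text embedding from narrated clips can be sketched as a contrastive objective: matched (video, narration) pairs are pulled together and mismatched pairs pushed apart. This is a generic InfoNCE-style loss written in numpy, not the paper's exact formulation.

```python
# Contrastive video-text embedding loss (illustrative only).
import numpy as np

def contrastive_loss(video_emb, text_emb, tau=0.07):
    """video_emb, text_emb: (batch, dim); row i of each is a matched pair."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / tau                        # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    logZ = np.log(np.exp(logits).sum(axis=1))
    return (logZ - np.diag(logits)).mean()        # cross-entropy on the diagonal

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 128)), rng.normal(size=(8, 128))))
```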
2 code implementations • 23 Apr 2019 • Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic
We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera.
1 code implementation • CVPR 2019 • Sungyeon Kim, Minkyo Seo, Ivan Laptev, Minsu Cho, Suha Kwak
Metric learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images belongs to the same class or not.
3 code implementations • CVPR 2019 • Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, Cordelia Schmid
Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation.
Ranked #10 on 3D Hand Pose Estimation on FreiHAND (PA-F@5mm metric)
1 code implementation • CVPR 2019 • Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic
First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions.
2 code implementations • CVPR 2019 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.
Ranked #5 on Temporal Action Localization on CrossTask
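The ordering constraint used in this line of work can be made concrete with a small dynamic program: given per-frame affinities for K steps, assign one frame per step so that steps occur in their listed order. This is an illustrative sketch of the constraint only; the paper's actual inference may differ.

```python
# DP alignment of K ordered steps to T frames (illustrative only).
import numpy as np

def align_steps(scores):
    """scores: (T, K) frame-step affinities. Returns one frame index per step."""
    T, K = scores.shape
    best = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    best[:, 0] = scores[:, 0]
    for k in range(1, K):
        for t in range(k, T):
            prev = best[:t, k - 1]               # step k-1 must occur earlier
            back[t, k] = prev.argmax()
            best[t, k] = prev.max() + scores[t, k]
    frames = [int(best[:, K - 1].argmax())]      # backtrack from the last step
    for k in range(K - 1, 0, -1):
        frames.append(int(back[frames[-1], k]))
    return frames[::-1]

rng = np.random.default_rng(0)
print(align_steps(rng.normal(size=(30, 4))))     # 4 ordered steps in 30 frames
```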
1 code implementation • 18 Mar 2019 • Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid
Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.
no code implementations • ICCV 2019 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
We seek to detect visual relations in images in the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.
no code implementations • 6 Dec 2018 • Tuan-Hung Vu, Anton Osokin, Ivan Laptev
Our goal in this paper is to learn discriminative models for the temporal evolution of object appearance and to use such models for object detection.
1 code implementation • 24 Sep 2018 • Nikolai Chinaev, Alexander Chigorin, Ivan Laptev
Estimation of facial shapes plays a central role for face transfer and animation.
no code implementations • 22 Sep 2018 • Meera Hahn, Nataniel Ruiz, Jean-Baptiste Alayrac, Ivan Laptev, James M. Rehg
Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.
1 code implementation • NeurIPS 2018 • Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, Cordelia Schmid
Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization.
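Discriminative clustering with supervision constraints is often written (an assumption here, in the common DIFFRAC-style form, not quoted from the paper) as a constrained least-squares problem:

$$\min_{Y \in \mathcal{Y}} \; \min_{W} \; \frac{1}{2N} \lVert Y - XW \rVert_F^2 + \frac{\lambda}{2} \lVert W \rVert_F^2,$$

where $X$ stacks the $N$ feature vectors, $W$ is a linear classifier, and the constraint set $\mathcal{Y}$ restricts the latent label assignment $Y$ to agree with the available supervision (e.g., which labels may appear in which clips).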
no code implementations • 28 Jun 2018 • Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid
In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.
2 code implementations • ECCV 2018 • Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid
Human shape estimation is an important task for video editing, animation and fashion industry.
Ranked #2 on 3D Human Pose Estimation on Surreal (using extra training data)
5 code implementations • 7 Apr 2018 • Antoine Miech, Ivan Laptev, Josef Sivic
We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.
Ranked #28 on Video Retrieval on LSMDC (using extra training data)
no code implementations • ICCV 2017 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
This paper introduces a novel approach for modeling visual relations between pairs of objects.
2 code implementations • ICCV 2017 • Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic
Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks.
Ranked #30 on Video Retrieval on LSMDC
4 code implementations • 21 Jun 2017 • Antoine Miech, Ivan Laptev, Josef Sivic
In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.
1 code implementation • ICCV 2017 • Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien
We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision.
2 code implementations • CVPR 2017 • Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid
In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.
1 code implementation • 14 Sep 2016 • Vadim Kantorov, Maxime Oquab, Minsu Cho, Ivan Laptev
The additive model encourages the predicted object region to be supported by its surrounding context region.
Ranked #4 on Weakly Supervised Object Detection on Charades
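The additive scoring described in this entry can be sketched as a region score that sums a box term and a surrounding-context term before regions compete via softmax. This is a hypothetical rendering; feature and weight names are illustrative assumptions.

```python
# Additive region-plus-context scoring for weak localization (illustrative).
import numpy as np

def additive_scores(roi_feats, context_feats, w_roi, w_ctx):
    """One score per candidate region; softmax turns scores into weights."""
    s = roi_feats @ w_roi + context_feats @ w_ctx     # additive combination
    e = np.exp(s - s.max())
    return e / e.sum()                                # competition across regions

rng = np.random.default_rng(0)
weights = additive_scores(rng.normal(size=(50, 32)), rng.normal(size=(50, 32)),
                          rng.normal(size=32), rng.normal(size=32))
print(weights.argmax(), weights.max())                # best-supported region
```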
no code implementations • 25 Jul 2016 • Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta
We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments).
no code implementations • CVPR 2016 • Guillaume Seguin, Piotr Bojanowski, Remi Lajugie, Ivan Laptev
We address the problem of segmenting multiple object instances in complex videos.
no code implementations • CVPR 2016 • Suha Kwak, Minsu Cho, Ivan Laptev
We address the problem of learning a pose-aware, compact embedding that projects images with similar human poses to be placed close-by in the embedding space.
no code implementations • 21 Apr 2016 • Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, Mubarak Shah
Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos.
1 code implementation • 15 Apr 2016 • Gül Varol, Ivan Laptev, Cordelia Schmid
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure.
Ranked #59 on Action Recognition on UCF101
no code implementations • 6 Apr 2016 • Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta
Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects.
1 code implementation • ICCV 2015 • Tuan-Hung Vu, Anton Osokin, Ivan Laptev
First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image.
no code implementations • CVPR 2016 • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien
Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.
Ranked #7 on Temporal Action Localization on CrossTask
no code implementations • ICCV 2015 • Guilhem Chéron, Ivan Laptev, Cordelia Schmid
This work targets human action recognition in video.
no code implementations • CVPR 2015 • Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic
Successful visual object recognition methods typically rely on training datasets containing lots of richly annotated images.
no code implementations • ICCV 2015 • Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid
Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.
no code implementations • ICCV 2015 • Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid
This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.
no code implementations • CVPR 2015 • Visesh Chari, Simon Lacoste-Julien, Ivan Laptev, Josef Sivic
Multi-object tracking has been recently approached with the min-cost network flow optimization techniques.
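The min-cost network flow view of tracking can be illustrated on a toy instance: detections become nodes, edges link detections across frames with integer costs (e.g., scaled appearance distances), and each unit of flow from source to sink traces one track. A minimal sketch with networkx, under assumed entry/exit penalties, not the paper's formulation.

```python
# Toy multi-object tracking as min-cost network flow (illustrative only).
import networkx as nx

G = nx.DiGraph()
G.add_node("s", demand=-2)                  # two tracks leave the source
G.add_node("t", demand=2)                   # and must reach the sink
for d in ["a1", "a2"]:                      # detections in frame 1
    G.add_edge("s", d, capacity=1, weight=0)
    G.add_edge(d, "t", capacity=1, weight=10)   # penalty for ending early
for d in ["b1", "b2"]:                      # detections in frame 2
    G.add_edge("s", d, capacity=1, weight=10)   # penalty for starting late
    G.add_edge(d, "t", capacity=1, weight=0)
# Transition costs between frames (lower = more similar appearance).
for u, v, cost in [("a1", "b1", 1), ("a1", "b2", 5),
                   ("a2", "b1", 5), ("a2", "b2", 1)]:
    G.add_edge(u, v, capacity=1, weight=cost)

flow = nx.min_cost_flow(G)
links = [(u, v) for u, nbrs in flow.items() for v, f in nbrs.items()
         if f > 0 and u != "s" and v != "t"]
print(links)                                # [('a1', 'b1'), ('a2', 'b2')]
```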
no code implementations • 4 Jul 2014 • Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic
We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone", extracted from, for example, the associated text script.
1 code implementation • CVPR 2014 • Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic
We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.
no code implementations • CVPR 2014 • Vadim Kantorov, Ivan Laptev
Local video features provide state-of-the-art performance for action recognition.
no code implementations • NeurIPS 2011 • Vincent Delaitre, Josef Sivic, Ivan Laptev
First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors.