Search Results for author: Ivan Laptev

Found 79 papers, 46 papers with code

Learning Actionness via Long-range Temporal Order Verification

no code implementations ECCV 2020 Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic

The annotation is particularly difficult for temporal action localization where large parts of the video present no action, or background.

Action Recognition Temporal Action Localization

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

1 code implementation • 27 Feb 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.

 Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modelling

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

no code implementations • 20 Dec 2022 • Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as an image.

Multimodal Machine Translation Translation

Multi-Task Learning of Object State Changes from Uncurated Videos

1 code implementation • 24 Nov 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.

Multi-Task Learning Self-Supervised Learning +1

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Instruction-driven history-aware policies for robotic manipulations

1 code implementation • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modelling Navigate +1

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

1 code implementation • 26 Jul 2022 • Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.

Object Reconstruction

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

1 code implementation • 16 Jun 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.

 Ranked #1 on Zero-Shot Learning on iVQA (using extra training data)

Fill Mask Language Modelling +6

Weakly-supervised segmentation of referring expressions

no code implementations • 10 May 2022 • Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.

Image Segmentation Referring Expression +4

Learning to Answer Visual Questions from Web Videos

1 code implementation • 10 May 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering Question Generation +5

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

1 code implementation CVPR 2022 Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

In this paper, we seek to temporally localize object states (e.g. "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +1

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

no code implementations • 20 Dec 2021 • Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jégou, Edouard Grave

Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets, from different domains.

Denoising Instance Segmentation +1

Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

no code implementations • 2 Nov 2021 • Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions.

Human-Object Interaction Detection

Reconstructing and grounding narrated instructional videos in 3D

no code implementations • 9 Sep 2021 • Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.

3D Reconstruction

Airbert: In-domain Pretraining for Vision-and-Language Navigation

1 code implementation ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

no code implementations • 1 Jul 2021 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.

Reinforcement Learning (RL)

XCiT: Cross-Covariance Image Transformers

10 code implementations • NeurIPS 2021 • Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.

Instance Segmentation object-detection +3
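The "transposed" attention described in the excerpt above can be illustrated in a few lines. This is a rough sketch, not the authors' implementation: it L2-normalizes keys and queries along the token axis, builds a d×d channel attention map from their cross-covariance, and re-mixes the value channels; `tau` stands in for the paper's learnable temperature, and the epsilon guard is our own addition.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_covariance_attention(Q, K, V, tau=1.0):
    """Attention over feature channels instead of tokens.

    Q, K, V: (N, d) arrays of N tokens with d channels.  The attention
    map is (d, d), built from the cross-covariance of L2-normalized keys
    and queries, so its size is independent of the number of tokens N."""
    Qn = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-8)
    A = softmax(Kn.T @ Qn / tau, axis=0)  # (d, d); each column sums to 1
    return V @ A                          # (N, d): tokens with re-mixed channels
```

Because the map is d×d rather than N×N, cost scales linearly in the number of tokens, which is what makes this attractive for high-resolution images.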

Training Vision Transformers for Image Retrieval

1 code implementation • 10 Feb 2021 • Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou

Transformers have shown outstanding results for natural language understanding and, more recently, for image classification.

Image Classification Image Retrieval +3

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Ranked #2 on Zero-Shot Learning on How2QA (using extra training data)

Question Answering Question Generation +5

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

1 code implementation • 13 Nov 2020 • Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

We evaluate our method on simple single- and two-object actions from the Something-Something dataset.

Learning Obstacle Representations for Neural Motion Planning

1 code implementation • 25 Aug 2020 • Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

Motion planning and obstacle avoidance is a key challenge in robotics applications.


RareAct: A video dataset of unusual interactions

1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".

Action Recognition

Occlusion resistant learning of intuitive physics from videos

no code implementations • 30 Apr 2020 • Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions.

Learning visual policies for building 3D shape categories

no code implementations • 15 Apr 2020 • Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

We then show the success of our visual policies for building arches from different primitives.

Learning Interactions and Relationships between Movie Characters

1 code implementation CVPR 2020 Anna Kukleva, Makarand Tapaswi, Ivan Laptev

Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.

Action Modifiers: Learning from Adverbs in Instructional Videos

1 code implementation CVPR 2020 Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations.

Retrieval Weakly-supervised Learning

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +2

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

4 code implementations ICCV 2019 Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

Action Localization Retrieval +2

Monte-Carlo Tree Search for Efficient Visually Guided Rearrangement Planning

2 code implementations • 23 Apr 2019 • Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic

We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera.

Deep Metric Learning Beyond Binary Supervision

1 code implementation CVPR 2019 Sungyeon Kim, Minkyo Seo, Ivan Laptev, Minsu Cho, Suha Kwak

Metric Learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images are of the same class or not.

Image Captioning Image Retrieval +4
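For context, the binary-supervision setup the excerpt criticizes reduces every pair of images to same/different class and enforces one fixed margin. A minimal sketch of that baseline (the standard triplet loss, not the paper's method):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss under binary supervision: 'positive' shares the
    anchor's class, 'negative' does not, and the same fixed margin is
    enforced for every triplet regardless of how similar the pair
    actually is -- the limitation the paper targets."""
    d_pos = float(np.sum((anchor - positive) ** 2))  # squared distance to same-class image
    d_neg = float(np.sum((anchor - negative) ** 2))  # squared distance to other-class image
    return max(0.0, d_pos - d_neg + margin)
```

The paper's point is that this discards the degree of similarity between images; its approach instead exploits continuous similarity labels.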

Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video

1 code implementation CVPR 2019 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions.

Cross-task weakly supervised learning from instructional videos

2 code implementations CVPR 2019 Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.

Weakly-supervised Learning

Learning to Augment Synthetic Images for Sim2Real Policy Transfer

1 code implementation • 18 Mar 2019 • Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.

Object Localization

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images in the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.


Tube-CNN: Modeling temporal evolution of appearance for object detection in video

no code implementations • 6 Dec 2018 • Tuan-Hung Vu, Anton Osokin, Ivan Laptev

Our goal in this paper is to learn discriminative models for the temporal evolution of object appearance and to use such models for object detection.

object-detection Object Detection +1

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions

no code implementations • 22 Sep 2018 • Meera Hahn, Nataniel Ruiz, Jean-Baptiste Alayrac, Ivan Laptev, James M. Rehg

Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.

Object Recognition

Modeling Spatio-Temporal Human Track Structure for Action Localization

no code implementations • 28 Jun 2018 • Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.

Human Detection Optical Flow Estimation +3

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

5 code implementations • 7 Apr 2018 • Antoine Miech, Ivan Laptev, Josef Sivic

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Ranked #28 on Video Retrieval on LSMDC (using extra training data)

Retrieval Text Retrieval +2

Learnable pooling with Context Gating for video classification

4 code implementations • 21 Jun 2017 • Antoine Miech, Ivan Laptev, Josef Sivic

In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.

Classification General Classification +2

Joint Discovery of Object States and Manipulation Actions

1 code implementation ICCV 2017 Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision.

Action Recognition Temporal Action Localization

Learning from Synthetic Humans

2 code implementations CVPR 2017 Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.

2D Human Pose Estimation 3D Human Pose Estimation +1

Much Ado About Time: Exhaustive Annotation of Temporal Data

no code implementations • 25 Jul 2016 • Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta

We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments).

Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation

no code implementations CVPR 2016 Suha Kwak, Minsu Cho, Ivan Laptev

We address the problem of learning a pose-aware, compact embedding that projects images with similar human poses to be placed close-by in the embedding space.

Action Recognition Image Retrieval +3

The THUMOS Challenge on Action Recognition for Videos "in the Wild"

no code implementations • 21 Apr 2016 • Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, Mubarak Shah

Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos.

Action Classification Action Recognition +3

Context-aware CNNs for person head detection

1 code implementation ICCV 2015 Tuan-Hung Vu, Anton Osokin, Ivan Laptev

First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image.

Face Detection Head Detection +1

Unsupervised Learning from Narrated Instruction Videos

no code implementations CVPR 2016 Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

Weakly-Supervised Alignment of Video With Text

no code implementations ICCV 2015 Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid

Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.
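A temporal assignment of text units to video frames under an ordering constraint, as described above, can be illustrated with a small dynamic program. Everything here is a simplified sketch under assumptions: the similarity matrix is presumed to come from dotting video features with text features projected by the learned linear map, and the paper's actual learning formulation is not reproduced.

```python
import numpy as np

def monotonic_assignment(sim):
    """Assign each of T text units to one of N video frames so that
    assigned frames are non-decreasing in time, maximizing total
    similarity.  sim: (T, N) similarity matrix.  O(T*N) dynamic program."""
    T, N = sim.shape
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    dp[0] = sim[0]
    for t in range(1, T):
        run_max = np.maximum.accumulate(dp[t - 1])  # best score over frames <= n
        arg = np.zeros(N, dtype=int)                # argmax of that running best
        cur = 0
        for n in range(1, N):
            if dp[t - 1, n] > dp[t - 1, cur]:
                cur = n
            arg[n] = cur
        dp[t] = run_max + sim[t]
        back[t] = arg
    path = [int(np.argmax(dp[-1]))]                 # backtrack best assignment
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For instance, with three text units whose similarities peak at frames 0, 2, and 4, the program recovers the ordered assignment [0, 2, 4].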

Unsupervised Object Discovery and Tracking in Video Collections

no code implementations ICCV 2015 Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.

Object Discovery Video Understanding

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations • 4 Jul 2014 • Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone", extracted from, for example, the associated text script.

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

1 code implementation CVPR 2014 Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic

We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.

Action Classification Action Localization +4

Learning person-object interactions for action recognition in still images

no code implementations NeurIPS 2011 Vincent Delaitre, Josef Sivic, Ivan Laptev

First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors.

Action Recognition In Still Images
