Search Results for author: Josef Sivic

Found 54 papers, 31 papers with code

Learning Actionness via Long-range Temporal Order Verification

no code implementations ECCV 2020 Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic

Annotation is particularly difficult for temporal action localization, where large parts of the video contain no action, only background.

Action Recognition

Learning to Answer Visual Questions from Web Videos

1 code implementation 10 May 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering, Question Generation +3

Focal Length and Object Pose Estimation via Render and Compare

1 code implementation 11 Apr 2022 Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic

We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.

Pose Estimation, Translation

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

1 code implementation 22 Mar 2022 Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

In this paper, we seek to temporally localize object states (e.g., "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation

1 code implementation 21 Mar 2022 Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic

This work investigates learning pixel-wise semantic image segmentation in urban scenes without any manual annotation, just from the raw non-curated data collected by cars which, equipped with cameras and LiDAR sensors, drive around a city.

Unsupervised Semantic Segmentation

Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

no code implementations 2 Nov 2021 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions.

Human-Object Interaction Detection

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

1 code implementation ICCV 2021 Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell

Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object.

Human-Object Interaction Detection

Reconstructing and grounding narrated instructional videos in 3D

no code implementations 9 Sep 2021 Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.

3D Reconstruction

Single-view robot pose and joint angle estimation via render & compare

no code implementations CVPR 2021 Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image.

Visualizing computation in large-scale cellular automata

no code implementations 1 Apr 2021 Hugo Cisneros, Josef Sivic, Tomas Mikolov

Emergent processes in complex systems such as cellular automata can perform computations of increasing complexity, and could possibly lead to artificial evolution.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Ranked #1 on Video Question Answering on ActivityNet-QA (using extra training data)

Question Answering, Question Generation +3

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

1 code implementation 13 Nov 2020 Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

We evaluate our method on simple single- and two-object actions from the Something-Something dataset.

CosyPose: Consistent multi-view multi-object 6D pose estimation

2 code implementations ECCV 2020 Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene.

6D Pose Estimation, 6D Pose Estimation using RGB

RareAct: A video dataset of unusual interactions

1 code implementation 3 Aug 2020 Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".

Action Recognition

Occlusion resistant learning of intuitive physics from videos

no code implementations 30 Apr 2020 Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions.

Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

1 code implementation ECCV 2020 Ignacio Rocco, Relja Arandjelović, Josef Sivic

In this work we target the problem of estimating accurately localised correspondences between a pair of images.

Evolving Structures in Complex Systems

1 code implementation 4 Nov 2019 Hugo Cisneros, Josef Sivic, Tomas Mikolov

In this paper we propose an approach for measuring growth of complexity of emerging patterns in complex systems such as cellular automata.

Artificial Life

Finding Moments in Video Collections Using Natural Language

2 code implementations 30 Jul 2019 Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell

We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.

Moment Retrieval, Re-Ranking +2

Monte-Carlo Tree Search for Efficient Visually Guided Rearrangement Planning

2 code implementations 23 Apr 2019 Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic

We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera.

Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video

1 code implementation CVPR 2019 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions.

Cross-task weakly supervised learning from instructional videos

2 code implementations CVPR 2019 Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Neighbourhood Consensus Networks

3 code implementations NeurIPS 2018 Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic

Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences.

Ranked #2 on Semantic correspondence on PF-PASCAL (PCK (weak) metric)

Semantic correspondence, Visual Localization

Localizing Moments in Video with Temporal Language

1 code implementation EMNLP 2018 Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.

Video Understanding

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

5 code implementations 7 Apr 2018 Antoine Miech, Ivan Laptev, Josef Sivic

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Ranked #15 on Video Retrieval on LSMDC (using extra training data)

Video Retrieval

End-to-end weakly-supervised semantic alignment

2 code implementations CVPR 2018 Ignacio Rocco, Relja Arandjelović, Josef Sivic

We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category.

Semantic correspondence

Localizing Moments in Video with Natural Language

2 code implementations ICCV 2017 Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment.

Learnable pooling with Context Gating for video classification

4 code implementations 21 Jun 2017 Antoine Miech, Ivan Laptev, Josef Sivic

In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.

Classification, Frame +3

ActionVLAD: Learning spatio-temporal aggregation for action classification

no code implementations CVPR 2017 Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.

Action Classification, Classification +2

Convolutional neural network architecture for geometric matching

5 code implementations CVPR 2017 Ignacio Rocco, Relja Arandjelović, Josef Sivic

We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters.

Geometric Matching

NetVLAD: CNN architecture for weakly supervised place recognition

15 code implementations CVPR 2016 Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph.

Image Retrieval, Visual Place Recognition

Unsupervised Learning from Narrated Instruction Videos

no code implementations CVPR 2016 Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

24/7 Place Recognition by View Synthesis

no code implementations CVPR 2015 Akihiko Torii, Relja Arandjelović, Josef Sivic, Masatoshi Okutomi, Tomas Pajdla

We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings built or destroyed.

Visual Place Recognition

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations 4 Jul 2014 Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script.

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

no code implementations CVPR 2014 Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic

We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.

Action Classification, Action Localization +4

Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models

no code implementations CVPR 2014 Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, Josef Sivic

This paper poses object category detection in images as a type of 2D-to-3D alignment problem, utilizing the large quantities of 3D CAD models that have been made publicly available online.

Visual Place Recognition with Repetitive Structures

no code implementations CVPR 2013 Akihiko Torii, Josef Sivic, Tomas Pajdla, Masatoshi Okutomi

Even more importantly, they violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance.

Visual Place Recognition

Learning and Calibrating Per-Location Classifiers for Visual Place Recognition

no code implementations CVPR 2013 Petr Gronat, Guillaume Obozinski, Josef Sivic, Tomas Pajdla

The aim of this work is to localize a query photograph by finding other images depicting the same place in a large geotagged image database.

Object Recognition, Two-sample testing +1

Learning person-object interactions for action recognition in still images

no code implementations NeurIPS 2011 Vincent Delaitre, Josef Sivic, Ivan Laptev

First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors.

Action Recognition In Still Images

Segmenting Scenes by Matching Image Composites

no code implementations NeurIPS 2009 Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.

Scene Segmentation
