Search Results for author: Josef Sivic

Found 69 papers, 42 papers with code

Learning Actionness via Long-range Temporal Order Verification

no code implementations ECCV 2020 Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic

The annotation is particularly difficult for temporal action localization where large parts of the video present no action, or background.

Action Recognition Temporal Action Localization

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

no code implementations NeurIPS 2023 Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic

We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries.

3D Semantic Occupancy Prediction 3D Semantic Segmentation +3

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

1 code implementation 12 Dec 2023 Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

We address the task of generating temporally consistent and physically plausible images of actions and object state transformations.

Object

FocalPose++: Focal Length and Object Pose Estimation via Render and Compare

1 code implementation 15 Nov 2023 Martin Cífka, Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Vladimir Petrik, Josef Sivic

We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.

Object Pose Estimation

Learning to design protein-protein interactions with enhanced generalization

2 code implementations 27 Oct 2023 Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic

Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics.

VidChapters-7M: Video Chapters at Scale

no code implementations NeurIPS 2023 Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Dense Video Captioning Navigate

Language-Guided Music Recommendation for Video via Prompt Analogies

no code implementations CVPR 2023 Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell

A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.

Language Modelling Music Recommendation +1

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations CVPR 2023 Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.

Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modelling +1

MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

no code implementations 13 Dec 2022 Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, Josef Sivic

Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.

6D Pose Estimation Object

Multi-Task Learning of Object State Changes from Uncurated Videos

1 code implementation 24 Nov 2022 Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.

Multi-Task Learning Object +2

Benchmarking Learning Efficiency in Deep Reservoir Computing

2 code implementations 29 Sep 2022 Hugo Cisneros, Josef Sivic, Tomas Mikolov

In this paper, we introduce a benchmark of increasingly difficult tasks together with a data efficiency metric to measure how quickly machine learning models learn from training data.

Benchmarking

Imitrob: Imitation Learning Dataset for Training and Evaluating 6D Object Pose Estimators

1 code implementation 16 Sep 2022 Jiri Sedlar, Karla Stepanova, Radoslav Skoviera, Jan K. Behrens, Matus Tuna, Gabriela Sejnova, Josef Sivic, Robert Babuska

This paper introduces a dataset for training and evaluating methods for 6D pose estimation of hand-held tools in task demonstrations captured by a standard RGB camera.

6D Pose Estimation 6D Pose Estimation using RGB +2

Learning Object Manipulation Skills from Video via Approximate Differentiable Physics

2 code implementations 3 Aug 2022 Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi

We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as pull something from right to left or put something in front of something.

3D Reconstruction Friction +1

Learning to Answer Visual Questions from Web Videos

1 code implementation 10 May 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering Question Generation +4

Focal Length and Object Pose Estimation via Render and Compare

2 code implementations CVPR 2022 Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic

We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.

Object Pose Estimation +1

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

1 code implementation CVPR 2022 Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

In this paper, we seek to temporally localize object states (e.g., "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.

Object

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation

1 code implementation 21 Mar 2022 Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic

This work investigates learning pixel-wise semantic image segmentation in urban scenes without any manual annotation, just from the raw non-curated data collected by cars which, equipped with cameras and LiDAR sensors, drive around a city.

Image Segmentation Segmentation +1

Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

no code implementations 2 Nov 2021 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions.

Human-Object Interaction Detection Object

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

1 code implementation ICCV 2021 Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell

Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object.

Human-Object Interaction Detection Object +2

Reconstructing and grounding narrated instructional videos in 3D

no code implementations 9 Sep 2021 Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.

3D Reconstruction

Single-view robot pose and joint angle estimation via render & compare

no code implementations CVPR 2021 Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image.

Pose Estimation Robot Pose Estimation

Visualizing computation in large-scale cellular automata

no code implementations 1 Apr 2021 Hugo Cisneros, Josef Sivic, Tomas Mikolov

Emergent processes in complex systems such as cellular automata can perform computations of increasing complexity, and could possibly lead to artificial evolution.

Clustering

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Question Answering Question Generation +4

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

1 code implementation 13 Nov 2020 Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

We evaluate our method on simple single- and two-object actions from the Something-Something dataset.

Object

CosyPose: Consistent multi-view multi-object 6D pose estimation

3 code implementations ECCV 2020 Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene.

6D Pose Estimation 6D Pose Estimation using RGB +1

RareAct: A video dataset of unusual interactions

1 code implementation 3 Aug 2020 Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".

Action Recognition

Occlusion resistant learning of intuitive physics from videos

no code implementations 30 Apr 2020 Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions.

Object

Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

1 code implementation ECCV 2020 Ignacio Rocco, Relja Arandjelović, Josef Sivic

In this work we target the problem of estimating accurately localised correspondences between a pair of images.

Evolving Structures in Complex Systems

1 code implementation 4 Nov 2019 Hugo Cisneros, Josef Sivic, Tomas Mikolov

In this paper we propose an approach for measuring growth of complexity of emerging patterns in complex systems such as cellular automata.

Artificial Life

Finding Moments in Video Collections Using Natural Language

2 code implementations 30 Jul 2019 Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell

We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.

Moment Retrieval Re-Ranking +3

Monte-Carlo Tree Search for Efficient Visually Guided Rearrangement Planning

2 code implementations 23 Apr 2019 Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic

We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera.

Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video

1 code implementation CVPR 2019 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions.

Object

Cross-task weakly supervised learning from instructional videos

2 code implementations CVPR 2019 Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.

Weakly-supervised Learning

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Retrieval

Neighbourhood Consensus Networks

3 code implementations NeurIPS 2018 Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic

Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences.

Ranked #2 on Semantic correspondence on PF-PASCAL (PCK (weak) metric)

Semantic correspondence Visual Localization

Localizing Moments in Video with Temporal Language

1 code implementation EMNLP 2018 Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.

Natural Language Queries Retrieval +1

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

5 code implementations 7 Apr 2018 Antoine Miech, Ivan Laptev, Josef Sivic

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Ranked #31 on Video Retrieval on LSMDC (using extra training data)

Retrieval Text Retrieval +2

End-to-end weakly-supervised semantic alignment

2 code implementations CVPR 2018 Ignacio Rocco, Relja Arandjelović, Josef Sivic

We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category.

Semantic correspondence

Localizing Moments in Video with Natural Language

2 code implementations ICCV 2017 Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment.

Natural Language Queries

Learnable pooling with Context Gating for video classification

5 code implementations 21 Jun 2017 Antoine Miech, Ivan Laptev, Josef Sivic

In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.

Classification Clustering +3

ActionVLAD: Learning spatio-temporal aggregation for action classification

no code implementations CVPR 2017 Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.

Action Classification Classification +3

Convolutional neural network architecture for geometric matching

5 code implementations CVPR 2017 Ignacio Rocco, Relja Arandjelović, Josef Sivic

We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters.

Geometric Matching

NetVLAD: CNN architecture for weakly supervised place recognition

15 code implementations CVPR 2016 Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph.

Image Retrieval Retrieval +1

Unsupervised Learning from Narrated Instruction Videos

no code implementations CVPR 2016 Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

Clustering

24/7 Place Recognition by View Synthesis

no code implementations CVPR 2015 Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, Tomas Pajdla

We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings built or destroyed.

Visual Place Recognition

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations 4 Jul 2014 Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone", extracted from, for example, the associated text script.

Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models

no code implementations CVPR 2014 Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, Josef Sivic

This paper poses object category detection in images as a type of 2D-to-3D alignment problem, utilizing the large quantities of 3D CAD models that have been made publicly available online.

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

1 code implementation CVPR 2014 Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic

We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.

Action Classification Action Localization +4

Learning and Calibrating Per-Location Classifiers for Visual Place Recognition

no code implementations CVPR 2013 Petr Gronat, Guillaume Obozinski, Josef Sivic, Tomas Pajdla

The aim of this work is to localize a query photograph by finding other images depicting the same place in a large geotagged image database.

Object Recognition Two-sample testing +1

Visual Place Recognition with Repetitive Structures

no code implementations CVPR 2013 Akihiko Torii, Josef Sivic, Tomas Pajdla, Masatoshi Okutomi

Even more importantly, they violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance.

Retrieval Visual Place Recognition

Learning person-object interactions for action recognition in still images

no code implementations NeurIPS 2011 Vincent Delaitre, Josef Sivic, Ivan Laptev

First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors.

Action Recognition In Still Images Object

Segmenting Scenes by Matching Image Composites

no code implementations NeurIPS 2009 Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman

In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.

Scene Segmentation Segmentation
