no code implementations • ECCV 2020 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic
The annotation is particularly difficult for temporal action localization, where large parts of the video contain no action, i.e., background.
1 code implementation • 13 Mar 2025 • Evangelos Kazakos, Cordelia Schmid, Josef Sivic
We perform extensive ablations that demonstrate the importance of pre-training using our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model.
no code implementations • 13 Mar 2025 • Georgy Ponimatkin, Martin Cífka, Tomáš Souček, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
Third, we thoroughly evaluate and ablate our 6D pose estimation method on YCB-V and HOPE-Video datasets as well as a new dataset of instructional videos manually annotated with approximate 6D object trajectories.
1 code implementation • 24 Dec 2024 • Petr Kouba, Joan Planas-Iglesias, Jiri Damborsky, Jiri Sedlar, Stanislav Mazurenko, Josef Sivic
Third, we introduce a method for fine-tuning a protein inverse folding model to steer it toward desired flexibility in specified regions.
1 code implementation • 2 Dec 2024 • Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic
The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions.
no code implementations • 19 Nov 2024 • Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron
The objective of this work is to manipulate visual timelines (e.g., a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users.
no code implementations • 12 Nov 2024 • Evangelos Kazakos, Cordelia Schmid, Josef Sivic
We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels.
1 code implementation • 4 Nov 2024 • Anton Bushuiev, Roman Bushuiev, Nikola Zadorozhny, Raman Samusevich, Hannes Stärk, Jiri Sedlar, Tomáš Pluskal, Josef Sivic
Data scarcity and distribution shifts often hinder the ability of machine learning models to generalize when applied to proteins and other biological data.
1 code implementation • 30 Oct 2024 • Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J. J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal
To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data.
Tasks: De novo molecule generation from MS/MS spectrum; de novo molecule generation from MS/MS spectrum (bonus chemical formulae)
1 code implementation • 16 Apr 2024 • Anton Bushuiev, Roman Bushuiev, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic
To overcome the data leakage, we recommend constructing data splits based on 3D structural similarity of protein-protein interfaces and suggest corresponding algorithms.
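The similarity-based splitting idea above can be sketched in a few lines. This is a hypothetical helper (function name, threshold, and data layout are illustrative, not the paper's API): items whose pairwise interface similarity exceeds a threshold are first grouped together, and whole groups are then assigned to one side of the split, so near-duplicates never straddle train and test.

```python
# Hypothetical sketch: leakage-free train/test splits from a pairwise
# similarity matrix. Items above the similarity threshold are merged with
# union-find, and entire groups go to one split.
def similarity_splits(sim, threshold=0.5, test_fraction=0.3):
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # union items that are too similar to be separated across splits
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    train, test = [], []
    for members in sorted(clusters.values(), key=len, reverse=True):
        # fill the test split with whole clusters until it is large enough
        (test if len(test) < test_fraction * n else train).extend(members)
    return sorted(train), sorted(test)
```

By construction, no pair with similarity at or above the threshold ends up split across train and test.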
no code implementations • NeurIPS 2023 • Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries.
Tasks: 3D Semantic Occupancy Prediction, 3D Semantic Segmentation
1 code implementation • CVPR 2024 • Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic
We address the task of generating temporally consistent and physically plausible images of actions and object state transformations.
no code implementations • 7 Dec 2023 • Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell
To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos.
1 code implementation • 15 Nov 2023 • Martin Cífka, Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Vladimir Petrik, Josef Sivic
We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.
2 code implementations • 27 Oct 2023 • Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic
Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics.
no code implementations • NeurIPS 2023 • Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
1 code implementation • CVPR 2023 • Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications.
no code implementations • CVPR 2023 • Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell
A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
3 code implementations • CVPR 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
1 code implementation • 13 Dec 2022 • Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, Josef Sivic
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
1 code implementation • 24 Nov 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.
2 code implementations • 29 Sep 2022 • Hugo Cisneros, Josef Sivic, Tomas Mikolov
In this paper, we introduce a benchmark of increasingly difficult tasks together with a data efficiency metric to measure how quickly machine learning models learn from training data.
1 code implementation • 16 Sep 2022 • Jiri Sedlar, Karla Stepanova, Radoslav Skoviera, Jan K. Behrens, Matus Tuna, Gabriela Sejnova, Josef Sivic, Robert Babuska
This paper introduces a dataset for training and evaluating methods for 6D pose estimation of hand-held tools in task demonstrations captured by a standard RGB camera.
2 code implementations • 3 Aug 2022 • Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as "pull something from right to left" or "put something in front of something".
3 code implementations • 16 Jun 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.
Ranked #1 on Zero-Shot Video Question Answer on TVQA
1 code implementation • 10 May 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.
2 code implementations • CVPR 2022 • Georgy Ponimatkin, Yann Labbé, Bryan Russell, Mathieu Aubry, Josef Sivic
We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object.
1 code implementation • CVPR 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
Ranked #3 on Spatio-Temporal Video Grounding on VidSTG
1 code implementation • CVPR 2022 • Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic
In this paper, we seek to temporally localize object states (e.g., "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.
1 code implementation • 21 Mar 2022 • Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
This work investigates learning pixel-wise semantic image segmentation in urban scenes without any manual annotation, just from the raw non-curated data collected by cars which, equipped with cameras and LiDAR sensors, drive around a city.
no code implementations • 2 Nov 2021 • Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic
First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions.
1 code implementation • ICCV 2021 • Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell
Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object.
no code implementations • 9 Sep 2021 • Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys
Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.
no code implementations • CVPR 2021 • Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic
We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image.
Ranked #3 on Robot Pose Estimation on DREAM-dataset
no code implementations • 1 Apr 2021 • Hugo Cisneros, Josef Sivic, Tomas Mikolov
Emergent processes in complex systems such as cellular automata can perform computations of increasing complexity, and could possibly lead to artificial evolution.
no code implementations • CVPR 2021 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
1 code implementation • ICCV 2021 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.
Ranked #1 on Video Question Answering on VideoQA
1 code implementation • 13 Nov 2020 • Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic
We evaluate our method on simple single- and two-object actions from the Something-Something dataset.
4 code implementations • ECCV 2020 • Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic
Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene.
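The cross-image matching step can be illustrated with a toy sketch (the data layout is made up for illustration, and the paper's actual matching is more involved): pose hypotheses from different images, assumed here to be already expressed in a common world frame, are grouped when their translations agree within a tolerance, yielding one consistent object instance per group.

```python
# Toy grouping of per-image 6D pose hypotheses into scene-consistent
# object instances: hypotheses whose world-frame translations agree
# within `tol` metres are merged, tracking a running centroid.
def group_hypotheses(hyps, tol=0.05):
    groups = []  # each: {"centroid": (x, y, z), "members": [(image_id, t), ...]}
    for img, t in hyps:
        placed = False
        for g in groups:
            c = g["centroid"]
            if sum((a - b) ** 2 for a, b in zip(t, c)) ** 0.5 <= tol:
                g["members"].append((img, t))
                n = len(g["members"])
                # incremental centroid update
                g["centroid"] = tuple((ci * (n - 1) + ti) / n for ci, ti in zip(c, t))
                placed = True
                break
        if not placed:
            groups.append({"centroid": t, "members": [(img, t)]})
    return groups
```

Groups supported by hypotheses from several images would then be kept as objects of the final scene.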
1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".
no code implementations • 30 Apr 2020 • Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux
In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions.
1 code implementation • ECCV 2020 • Ignacio Rocco, Relja Arandjelović, Josef Sivic
In this work we target the problem of estimating accurately localised correspondences between a pair of images.
4 code implementations • CVPR 2020 • Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
Annotating videos is cumbersome, expensive and not scalable.
Ranked #3 on Action Recognition on RareAct
1 code implementation • 4 Nov 2019 • Hugo Cisneros, Josef Sivic, Tomas Mikolov
In this paper we propose an approach for measuring growth of complexity of emerging patterns in complex systems such as cellular automata.
no code implementations • ICCV 2019 • Hajime Taira, Ignacio Rocco, Jiri Sedlar, Masatoshi Okutomi, Josef Sivic, Tomas Pajdla, Torsten Sattler, Akihiko Torii
The pose with the largest geometric consistency with the query image, e.g., in the form of an inlier count, is then selected in a second stage.
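The inlier-count selection can be sketched as follows (a minimal illustration with a made-up data layout, not the paper's pipeline): each candidate pose is scored by how many of its reprojected model points fall within a pixel threshold of the matched 2D detections, and the best-scoring pose is kept.

```python
import math

# Score a candidate pose by counting reprojected points that land within
# a pixel threshold of their matched detections, then keep the best pose.
def inlier_count(projected, detected, px_thresh=4.0):
    return sum(1 for (u, v), (x, y) in zip(projected, detected)
               if math.hypot(u - x, v - y) <= px_thresh)

def select_pose(candidates, detected, px_thresh=4.0):
    # candidates: list of (pose_id, projected_points) pairs
    return max(candidates,
               key=lambda c: inlier_count(c[1], detected, px_thresh))[0]
```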
1 code implementation • 2 Aug 2019 • Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision.
2 code implementations • 30 Jul 2019 • Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell
We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
4 code implementations • ICCV 2019 • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
Ranked #4 on Temporal Action Localization on CrossTask
Tasks: Action Localization, Long Video Retrieval (Background Removed)
4 code implementations • 9 May 2019 • Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions.
Ranked #8 on Image Matching on IMC PhotoTourism
2 code implementations • 23 Apr 2019 • Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic
We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera.
1 code implementation • CVPR 2019 • Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic
First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions.
2 code implementations • CVPR 2019 • Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.
Ranked #5 on Temporal Action Localization on CrossTask
no code implementations • ICCV 2019 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.
3 code implementations • NeurIPS 2018 • Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, Josef Sivic
Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences.
Ranked #2 on Semantic correspondence on PF-PASCAL (PCK (weak) metric)
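The weak pairwise supervision described in this entry can be illustrated with a toy hinge loss over image-level match scores (a sketch for intuition, not the paper's actual training objective): scores of matching pairs are pushed above scores of non-matching pairs by a margin, with no point-level correspondence labels needed.

```python
# Toy hinge loss over image-pair match scores: every matching pair should
# score at least `margin` higher than every non-matching pair.
def pairwise_hinge_loss(match_scores, nonmatch_scores, margin=1.0):
    loss = 0.0
    for pos in match_scores:
        for neg in nonmatch_scores:
            loss += max(0.0, margin - pos + neg)
    return loss / (len(match_scores) * len(nonmatch_scores))
```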
1 code implementation • EMNLP 2018 • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset.
5 code implementations • 7 Apr 2018 • Antoine Miech, Ivan Laptev, Josef Sivic
We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.
Ranked #33 on Video Retrieval on LSMDC (using extra training data)
1 code implementation • CVPR 2018 • Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, Akihiko Torii
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map.
2 code implementations • CVPR 2018 • Ignacio Rocco, Relja Arandjelović, Josef Sivic
We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category.
2 code implementations • ICCV 2017 • Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment.
no code implementations • ICCV 2017 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
This paper introduces a novel approach for modeling visual relations between pairs of objects.
2 code implementations • CVPR 2018 • Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, Tomas Pajdla
Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds.
2 code implementations • ICCV 2017 • Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic
Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks.
Ranked #35 on Video Retrieval on LSMDC
no code implementations • CVPR 2017 • Torsten Sattler, Akihiko Torii, Josef Sivic, Marc Pollefeys, Hajime Taira, Masatoshi Okutomi, Tomas Pajdla
3D structure-based methods employ 3D models of the scene to estimate the full 6DOF pose of a camera very accurately.
5 code implementations • 21 Jun 2017 • Antoine Miech, Ivan Laptev, Josef Sivic
In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.
no code implementations • CVPR 2017 • Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video.
Ranked #8 on Long-video Activity Recognition on Breakfast
5 code implementations • CVPR 2017 • Ignacio Rocco, Relja Arandjelović, Josef Sivic
We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters.
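The classical fitting step that this work's network learns to replace can be sketched for the affine case: a 2D affine transform has six parameters and is recovered exactly from three point correspondences by solving two 3x3 linear systems (a self-contained illustration, not the paper's method).

```python
# Recover a 2D affine map x' = a*x + b*y + c, y' = d*x + e*y + f
# from exactly three point correspondences.
def solve3(A, b):
    # Gauss-Jordan elimination on a 3x3 system with partial pivoting.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def affine_from_points(src, dst):
    A = [[x, y, 1.0] for x, y in src]
    ax = solve3(A, [x for x, _ in dst])  # x' = ax[0]*x + ax[1]*y + ax[2]
    ay = solve3(A, [y for _, y in dst])  # y' = ay[0]*x + ay[1]*y + ay[2]
    return ax, ay
```

With more than three (noisy) correspondences one would instead fit the parameters by least squares, typically inside a RANSAC loop.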
14 code implementations • CVPR 2016 • Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic
We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph.
Ranked #3 on Visual Place Recognition on Mid-Atlantic Ridge
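The aggregation at the heart of this entry's architecture (NetVLAD) is a differentiable, soft-assignment version of classical VLAD, which can be sketched as follows: each local descriptor's residual to its nearest cluster centre is summed per cluster, and the stacked vector is L2-normalised (a rough hard-assignment sketch, not the trainable layer itself).

```python
import math

# Hard-assignment VLAD: sum residuals of local descriptors to their
# nearest cluster centre, stack per-cluster sums, and L2-normalise.
def vlad(descriptors, centres):
    dim = len(centres[0])
    agg = [[0.0] * dim for _ in centres]
    for d in descriptors:
        # nearest centre by squared Euclidean distance
        k = min(range(len(centres)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centres[i])))
        for j in range(dim):
            agg[k][j] += d[j] - centres[k][j]
    flat = [v for row in agg for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]
```

NetVLAD replaces the hard `min` assignment with a learned softmax weighting so the whole pipeline can be trained end to end.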
no code implementations • CVPR 2016 • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien
Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.
Ranked #7 on Temporal Action Localization on CrossTask
no code implementations • CVPR 2015 • Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, Tomas Pajdla
We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings built or destroyed.
no code implementations • CVPR 2015 • Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic
Successful visual object recognition methods typically rely on training datasets containing lots of richly annotated images.
no code implementations • CVPR 2015 • Visesh Chari, Simon Lacoste-Julien, Ivan Laptev, Josef Sivic
Multi-object tracking has been recently approached with the min-cost network flow optimization techniques.
no code implementations • 4 Jul 2014 • Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic
We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone", extracted from, for example, the associated text script.
1 code implementation • CVPR 2014 • Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic
We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.
no code implementations • CVPR 2014 • Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, Josef Sivic
This paper poses object category detection in images as a type of 2D-to-3D alignment problem, utilizing the large quantities of 3D CAD models that have been made publicly available online.
no code implementations • CVPR 2013 • Petr Gronat, Guillaume Obozinski, Josef Sivic, Tomas Pajdla
The aim of this work is to localize a query photograph by finding other images depicting the same place in a large geotagged image database.
no code implementations • CVPR 2013 • Akihiko Torii, Josef Sivic, Tomas Pajdla, Masatoshi Okutomi
Even more importantly, they violate the feature independence assumed in the bag-of-visual-words representation, which often leads to over-counting evidence and significant degradation of retrieval performance.
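A standard remedy for the over-counting described above can be sketched with a burstiness-style reweighting (an illustrative fix in this spirit, not necessarily the exact scheme of the paper): raw visual-word counts are replaced by their square roots, so repeated features from, say, a repetitive facade contribute sub-linearly.

```python
import math

# Down-weight repeated visual words: replace raw per-word counts with
# their square roots before building the bag-of-visual-words vector.
def burstiness_weights(word_ids):
    counts = {}
    for w in word_ids:
        counts[w] = counts.get(w, 0) + 1
    return {w: math.sqrt(c) for w, c in counts.items()}
```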
no code implementations • NeurIPS 2011 • Vincent Delaitre, Josef Sivic, Ivan Laptev
First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors.
no code implementations • NeurIPS 2009 • Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman
In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.