1 code implementation • 10 May 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.
no code implementations • 10 May 2022 • Robin Strudel, Ivan Laptev, Cordelia Schmid
Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.
no code implementations • 20 Apr 2022 • Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid
This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.
no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
1 code implementation • 30 Mar 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
Ranked #1 on
Spatio-Temporal Video Grounding
on VidSTG
Language-Based Temporal Localization
Natural Language Visual Grounding
no code implementations • 28 Feb 2022 • Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari
In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.
no code implementations • 23 Feb 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
no code implementations • 4 Feb 2022 • Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid
Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.
Ranked #4 on
Image Classification
on mini WebVision 1.0
no code implementations • 20 Jan 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Recent video and language pretraining frameworks lack the ability to generate sentences.
1 code implementation • 12 Jan 2022 • Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.
Ranked #1 on
Action Classification
on Kinetics-400
(using extra training data)
no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
1 code implementation • NeurIPS 2021 • ShiZhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
no code implementations • NeurIPS 2021 • Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier
Notably, images depend both on the properties of observed scenes and on the process of image formation.
no code implementations • 29 Sep 2021 • Jae Myung Kim, Eunji Kim, Sungroh Yoon, Jungwoo Lee, Cordelia Schmid, Zeynep Akata
Explaining a complex black-box system in a post-hoc manner is important to understand its predictions.
1 code implementation • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #2 on
Vision and Language Navigation
on VLN Challenge
1 code implementation • 16 Aug 2021 • Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid
Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos.
1 code implementation • NeurIPS 2021 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid
The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.
Ranked #2 on
Video Prediction
on Kinetics-600 12 frames, 64x64
no code implementations • 1 Jul 2021 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev
Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.
no code implementations • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on
Audio Classification
on VGGSound
(Top 5 Accuracy metric)
no code implementations • CVPR 2021 • Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov
To address this issue, we introduce a new challenging task to generate HD maps.
no code implementations • 15 Jun 2021 • Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.
1 code implementation • NeurIPS 2021 • Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce
Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.
1 code implementation • ICCV 2021 • Alexander Pashevich, Cordelia Schmid, Chen Sun
We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.
5 code implementations • ICCV 2021 • Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
In this paper we introduce Segmenter, a transformer model for semantic segmentation.
Ranked #9 on
Semantic Segmentation
on PASCAL Context
3 code implementations • 12 Apr 2021 • Ahmet Iscen, André Araujo, Boqing Gong, Cordelia Schmid
An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.
Ranked #4 on
Long-tail Learning
on iNaturalist 2018
1 code implementation • 6 Apr 2021 • Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, Cordelia Schmid
This paper introduces temporally local metrics for Multi-Object Tracking.
no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
We focus on contrastive methods for self-supervised video representation learning.
no code implementations • ICCV 2021 • Tonmoy Saikia, Cordelia Schmid, Thomas Brox
CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.
no code implementations • ICCV 2021 • Anurag Arnab, Chen Sun, Cordelia Schmid
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.
4 code implementations • ICCV 2021 • Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #6 on
Action Classification
on Moments in Time
(Top 5 Accuracy metric, using extra training data)
no code implementations • ICCV 2021 • Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.
no code implementations • 10 Dec 2020 • Yves Dufournaud, Cordelia Schmid, Radu Horaud
In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one.
no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
1 code implementation • ICCV 2021 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.
Ranked #1 on
Video Question Answering
on ActivityNet-QA
(using extra training data)
1 code implementation • 25 Aug 2020 • Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid
Motion planning and obstacle avoidance is a key challenge in robotics applications.
Robotics
2 code implementations • 19 Aug 2020 • Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Cong-Cong Li, Dragomir Anguelov
Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
no code implementations • 29 Jul 2020 • Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross
Based on this observation, we propose to use text as a method for learning video representations.
2 code implementations • ECCV 2020 • Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.
Ranked #6 on
Video Retrieval
on ActivityNet
(using extra training data)
no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
1 code implementation • 28 Jun 2020 • Pavel Tokmakov, Martial Hebert, Cordelia Schmid
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
no code implementations • ECCV 2020 • Yuhua Chen, Luc van Gool, Cordelia Schmid, Cristian Sminchisescu
To handle inherent modeling error in the consistency loss (e.g. Lambertian assumptions) and for better generalization, we further introduce a learned, output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.
1 code implementation • NeurIPS 2020 • Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola
Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.
Ranked #50 on
Self-Supervised Image Classification
on ImageNet
no code implementations • ECCV 2020 • Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan
To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum.
5 code implementations • CVPR 2020 • Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid
Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights).
no code implementations • CVPR 2020 • Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid
Modeling hand-object manipulations is essential for understanding how humans interact with their environment.
no code implementations • 15 Apr 2020 • Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid
We then show the success of our visual policies for building arches from different primitives.
no code implementations • ECCV 2020 • Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid
We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
1 code implementation • ECCV 2020 • Nikita Dvornik, Cordelia Schmid, Julien Mairal
Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.
Ranked #2 on
Few-Shot Image Classification
on Meta-Dataset
no code implementations • 12 Mar 2020 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari
Eye movement and strategic placement of the visual field onto the retina gives animals increased resolution of the scene and suppresses distracting information.
2 code implementations • ICML 2020 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
The mark is robust to strong variations such as different architectures or optimization methods.
no code implementations • 22 Jan 2020 • Tonmoy Saikia, Thomas Brox, Cordelia Schmid
To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.
1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.
no code implementations • 25 Oct 2019 • Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan
Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.
1 code implementation • ECCV 2020 • Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondrej Chum, Cordelia Schmid
In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given.
no code implementations • 29 Aug 2019 • Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou
Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.
1 code implementation • 2 Aug 2019 • Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision.
no code implementations • ICCV 2019 • Valentin Gabeur, Jean-Sebastien Franco, Xavier Martin, Cordelia Schmid, Gregory Rogez
In this paper, we tackle the problem of 3D human shape estimation from single RGB images.
no code implementations • ICCV 2019 • Yuhua Chen, Cordelia Schmid, Cristian Sminchisescu
We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video - addressing the difficulty of acquiring realistic ground-truth for such tasks.
no code implementations • 13 Jun 2019 • Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.
no code implementations • 29 Apr 2019 • Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid
In this work we study the problem of action detection in a highly-imbalanced dataset.
3 code implementations • CVPR 2019 • Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, Cordelia Schmid
Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation.
no code implementations • CVPR 2019 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid
This paper focuses on multi-person action forecasting in videos.
3 code implementations • ICCV 2019 • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.
Ranked #1 on
Action Classification
on YouCook2
1 code implementation • ICCV 2019 • Nikita Dvornik, Cordelia Schmid, Julien Mairal
Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples.
1 code implementation • 18 Mar 2019 • Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid
Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.
2 code implementations • 5 Jan 2019 • Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.
no code implementations • NeurIPS 2019 • Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek
We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.
no code implementations • ICCV 2019 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.
no code implementations • CVPR 2019 • Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.
no code implementations • 30 Nov 2018 • Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, Cordelia Schmid
To achieve this, we study different modulation signals and exploration for hierarchical controllers.
no code implementations • 27 Sep 2018 • Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek
First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.
no code implementations • ICLR 2019 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.
no code implementations • 6 Sep 2018 • Nikita Dvornik, Julien Mairal, Cordelia Schmid
In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.
1 code implementation • ECCV 2018 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid
A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Ranked #14 on
Action Recognition
on AVA v2.1
no code implementations • ECCV 2018 • Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari
Generative adversarial networks (GANs) are one of the most popular methods for generating images today.
4 code implementations • ECCV 2018 • Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari
Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.
Ranked #2 on
Incremental Learning
on ImageNet - 10 steps
(# M Params metric)
2 code implementations • ECCV 2018 • Nikita Dvornik, Julien Mairal, Cordelia Schmid
For this approach to be successful, we show that modeling appropriately the visual context surrounding objects is crucial to place them in the right environment.
1 code implementation • NeurIPS 2018 • Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, Cordelia Schmid
Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization.
no code implementations • 28 Jun 2018 • Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid
In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.
1 code implementation • ICLR 2019 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods.
no code implementations • CVPR 2018 • Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid
We use the human joints as these keypoints and term our Pose moTion representation PoTion.
Ranked #1 on
Skeleton Based Action Recognition
on J-HMDB
no code implementations • NeurIPS 2018 • Daan Wynen, Cordelia Schmid, Julien Mairal
In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.
no code implementations • 25 Apr 2018 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari
In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.
1 code implementation • CVPR 2018 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari
Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).
2 code implementations • ECCV 2018 • Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid
Human shape estimation is an important task for video editing, animation and fashion industry.
Ranked #2 on
3D Human Pose Estimation
on Surreal
(using extra training data)
no code implementations • 1 Mar 2018 • Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.
no code implementations • 12 Feb 2018 • Grégory Rogez, Cordelia Schmid
Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.
no code implementations • 1 Dec 2017 • Pavel Tokmakov, Cordelia Schmid, Karteek Alahari
We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.
Ranked #15 on
Unsupervised Video Object Segmentation
on DAVIS 2016
no code implementations • ICCV 2017 • Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid
dog and cat jumping, enabling to detect actions of an object without training with these object-actions pairs.
3 code implementations • ICCV 2017 • Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari
Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data.
2 code implementations • ICCV 2017 • Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid
Real-time scene understanding has become crucial in many applications such as autonomous driving.
Ranked #2 on
Real-Time Object Detection
on PASCAL VOC 2007
no code implementations • ICCV 2017 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
This paper introduces a novel approach for modeling visual relations between pairs of objects.
no code implementations • 19 Jul 2017 • Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid
In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations.
no code implementations • 13 Jul 2017 • Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, Lianwen Jin
To this end, we regard the evolving landmark data as a high-dimensional path and apply non-linear path signature techniques to provide an expressive, robust, non-linear, and interpretable representation for the sequential events.
no code implementations • CVPR 2017 • Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.
Ranked #4 on
3D Multi-Person Pose Estimation (root-relative)
on MuPoTS-3D
(MPJPE metric)
4 code implementations • CVPR 2018 • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Ranked #2 on
Temporal Action Localization
on UCF101-24
1 code implementation • ICCV 2017 • Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, Jean Ponce
This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category.
2 code implementations • ICCV 2017 • Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid
We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores.
Ranked #3 on
Temporal Action Localization
on J-HMDB-21
no code implementations • 25 Apr 2017 • Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki
We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations.
no code implementations • ICCV 2017 • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid
The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences.
no code implementations • 21 Mar 2017 • Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce
Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout.
2 code implementations • CVPR 2017 • Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid
In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.
no code implementations • CVPR 2017 • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid
The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved.
Ranked #22 on
Unsupervised Video Object Segmentation
on DAVIS 2016
(using extra training data)
no code implementations • ICCV 2017 • Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek
We propose "Areas of Attention", a novel attention-based model for automatic image captioning.
no code implementations • ECCV 2016 • Xiaojiang Peng, Cordelia Schmid
We propose a multi-region two-stream R-CNN model for action detection in realistic videos.
Ranked #4 on
Temporal Action Localization
on UCF101-24
no code implementations • NeurIPS 2016 • Grégory Rogez, Cordelia Schmid
Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.
no code implementations • 17 May 2016 • Philippe Weinzaepfel, Xavier Martin, Cordelia Schmid
We introduce an approach for spatio-temporal human action localization using sparse spatial supervision.
1 code implementation • 15 Apr 2016 • Gül Varol, Ivan Laptev, Cordelia Schmid
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure.
Ranked #58 on
Action Recognition
on UCF101
no code implementations • 23 Mar 2016 • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid
We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.
no code implementations • 1 Mar 2016 • Mattis Paulin, Julien Mairal, Matthijs Douze, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid
Convolutional neural networks (CNNs) have recently received a lot of attention due to their ability to model local stationary structures in natural images in a multi-scale fashion, when learning all model parameters with supervision.
no code implementations • ICCV 2015 • Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronnin, Cordelia Schmid
Patch-level descriptors underlie several important computer vision tasks, such as stereo-matching or content-based image retrieval.
no code implementations • CVPR 2016 • Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce
Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category.
no code implementations • 3 Oct 2015 • Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid
It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization.
no code implementations • ICCV 2015 • Yang Hua, Karteek Alahari, Cordelia Schmid
Tracking-by-detection approaches are some of the most successful object trackers in recent years.
no code implementations • 14 Sep 2015 • Gaurav Sharma, Frederic Jurie, Cordelia Schmid
We validate our method on three recent challenging datasets of human attributes and actions.
no code implementations • 15 Aug 2015 • Danila Potapov, Matthijs Douze, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid
While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as "pursuit" or "romance" remains challenging. We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises.
1 code implementation • 25 Jun 2015 • Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid
We introduce a novel matching algorithm, called DeepMatching, to compute dense correspondences between images.
Ranked #4 on
Dense Pixel Correspondence Estimation
on HPatches
Dense Pixel Correspondence Estimation
Optical Flow Estimation
no code implementations • ICCV 2015 • Guilhem Chéron, Ivan Laptev, Cordelia Schmid
This work targets human action recognition in video.
1 code implementation • 8 Jun 2015 • Matthijs Douze, Jérôme Revaud, Jakob Verbeek, Hervé Jégou, Cordelia Schmid
We address the problem of specific video event retrieval.
no code implementations • ICCV 2015 • Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid
We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art with a margin of 15%, 7% and 12% respectively in mAP.
no code implementations • CVPR 2015 • Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid
We compare the results obtained with several state-of-the-art optical flow approaches and study the impact of the different cues used in the random forest. Furthermore, we introduce a new dataset, the YouTube Motion Boundaries dataset (YMB), that comprises 60 sequences taken from real-world videos with manually annotated motion boundaries.
no code implementations • ICCV 2015 • Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid
Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.
no code implementations • ICCV 2015 • Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid
This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.
no code implementations • 21 Apr 2015 • Heng Wang, Dan Oneata, Jakob Verbeek, Cordelia Schmid
We also use the homography to cancel out camera motion from the optical flow.
2 code implementations • 30 Mar 2015 • Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid
Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce.
Ranked #5 on
Zero-Shot Action Recognition
on Kinetics
no code implementations • 3 Mar 2015 • Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid
In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.
Multiple Instance Learning
Weakly-Supervised Object Localization
no code implementations • CVPR 2015 • Minsu Cho, Suha Kwak, Cordelia Schmid, Jean Ponce
This paper addresses unsupervised discovery and localization of dominant objects from a noisy image collection with multiple object classes.
no code implementations • CVPR 2015 • Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid
We propose a novel approach for optical flow estimation, targeted at large displacements with significant occlusions.
1 code implementation • 6 Jan 2015 • Vicky Kalogeiton, Vittorio Ferrari, Cordelia Schmid
Object detection is one of the most important challenges in computer vision.
no code implementations • 4 Jul 2014 • Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic
We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script.
no code implementations • NeurIPS 2014 • Julien Mairal, Piotr Koniusz, Zaid Harchaoui, Cordelia Schmid
An important goal in visual recognition is to devise image representations that are invariant to particular transformations.
Ranked #24 on
Image Classification
on MNIST
no code implementations • CVPR 2014 • Dan Oneata, Jakob Verbeek, Cordelia Schmid
Transformation of the FV by power and L2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks.
no code implementations • CVPR 2014 • Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid
In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.
Multiple Instance Learning
Weakly-Supervised Object Localization
no code implementations • CVPR 2014 • Anoop Cherian, Julien Mairal, Karteek Alahari, Cordelia Schmid
In this paper, we present a method for estimating articulated human poses in videos.
no code implementations • CVPR 2014 • Mattis Paulin, Jerome Revaud, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid
We propose a principled algorithm, Image Transformation Pursuit (ITP), for the automatic selection of a compact set of transformations.
no code implementations • CVPR 2013 • Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid
The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e. g. class hierarchies) or to transition smoothly from zero-shot learning to learning with large quantities of data.
no code implementations • CVPR 2013 • Gaurav Sharma, Frederic Jurie, Cordelia Schmid
We propose a new model for recognizing human attributes (e. g. wearing a suit, sitting, short hair) and actions (e. g. running, riding a horse) in still images.
no code implementations • CVPR 2013 • Jerome Revaud, Matthijs Douze, Cordelia Schmid, Herve Jegou
Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain.