Search Results for author: Cordelia Schmid

Found 155 papers, 54 papers with code

Location-Aware Self-Supervised Transformers

1 code implementation 5 Dec 2022 Mathilde Caron, Neil Houlsby, Cordelia Schmid

In this work, we propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.

Contrastive Learning Image Classification +1

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

no code implementations 25 Nov 2022 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones.

SSIM

AVATAR submission to the Ego4D AV Transcription Challenge

no code implementations 18 Nov 2022 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

no code implementations 17 Nov 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Learning Reward Functions for Robotic Manipulation by Observing Humans

no code implementations 16 Nov 2022 Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Observing a human demonstrator manipulate objects provides a rich, scalable and inexpensive source of data for learning robotic policies.

A Memory Transformer Network for Incremental Learning

no code implementations 10 Oct 2022 Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid

We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.

class-incremental learning Incremental Learning

Instruction-driven history-aware policies for robotic manipulations

no code implementations 11 Sep 2022 Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

no code implementations 24 Aug 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modelling Navigate +1

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

no code implementations 26 Jul 2022 Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.

Object Reconstruction

M&M Mix: A Multimodal Multiview Transformer Ensemble

no code implementations 20 Jun 2022 Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.

Ranked #1 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Recognition Video Recognition

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

1 code implementation 16 Jun 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.

Ranked #1 on Zero-Shot Learning on iVQA (using extra training data)

Fill Mask Language Modelling +6

AVATAR: Unconstrained Audiovisual Speech Recognition

no code implementations 15 Jun 2022 Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition speech-recognition

Learning to Answer Visual Questions from Web Videos

1 code implementation 10 May 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering Question Generation +4

Weakly-supervised segmentation of referring expressions

no code implementations 10 May 2022 Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.

Image Segmentation Referring Expression +4

Assembly Planning from Observations under Physical Constraints

no code implementations 20 Apr 2022 Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid

This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.

object-detection Object Detection +1

Learning Audio-Video Modalities from Image Captions

no code implementations 1 Apr 2022 Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Image Captioning Retrieval +2

The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

no code implementations 28 Feb 2022 Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.

Motion Segmentation

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +1

Multiview Transformers for Video Recognition

1 code implementation CVPR 2022 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Classification Action Recognition +1

Masking Modalities for Cross-modal Video Retrieval

no code implementations 1 Nov 2021 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

Variational Perturbations for Visual Feature Attribution

no code implementations 29 Sep 2021 Jae Myung Kim, Eunji Kim, Sungroh Yoon, Jungwoo Lee, Cordelia Schmid, Zeynep Akata

Explaining a complex black-box system in a post-hoc manner is important to understand its predictions.

Airbert: In-domain Pretraining for Vision-and-Language Navigation

1 code implementation ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

CCVS: Context-aware Controllable Video Synthesis

1 code implementation NeurIPS 2021 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.

Optical Flow Estimation Self-Supervised Learning +2

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

no code implementations 1 Jul 2021 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.

reinforcement-learning

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification Action Recognition +1

Residual Reinforcement Learning from Demonstrations

no code implementations 15 Jun 2021 Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.

reinforcement-learning

Large-Scale Unsupervised Object Discovery

1 code implementation NeurIPS 2021 Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce

Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.

Multi-object discovery Object Discovery +1

Episodic Transformer for Vision-and-Language Navigation

1 code implementation ICCV 2021 Alexander Pashevich, Cordelia Schmid, Chen Sun

We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.

Vision and Language Navigation

Class-Balanced Distillation for Long-Tailed Visual Recognition

3 code implementations 12 Apr 2021 Ahmet Iscen, André Araujo, Boqing Gong, Cordelia Schmid

An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.

Knowledge Distillation Long-tail Learning

Improving robustness against common corruptions with frequency biased models

no code implementations ICCV 2021 Tonmoy Saikia, Cordelia Schmid, Thomas Brox

CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.

Data Augmentation object-detection +1

ViViT: A Video Vision Transformer

4 code implementations ICCV 2021 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #7 on Action Classification on Moments in Time (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +3

Unified Graph Structured Models for Video Understanding

no code implementations ICCV 2021 Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +3

Learning Temporal Dynamics from Cycles in Narrated Video

no code implementations ICCV 2021 Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.

Image Matching with Scale Adjustment

no code implementations 10 Dec 2020 Yves Dufournaud, Cordelia Schmid, Radu Horaud

In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one.

Look Before you Speak: Visually Contextualized Utterances

no code implementations CVPR 2021 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Ranked #2 on Zero-Shot Learning on How2QA (using extra training data)

Question Answering Question Generation +4

Learning Obstacle Representations for Neural Motion Planning

1 code implementation 25 Aug 2020 Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

Motion planning and obstacle avoidance are key challenges in robotics applications.

Robotics

Multi-modal Transformer for Video Retrieval

1 code implementation ECCV 2020 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.

Ranked #11 on Video Retrieval on ActivityNet (using extra training data)

Natural Language Queries Retrieval +1

Consistency Guided Scene Flow Estimation

no code implementations ECCV 2020 Yuhua Chen, Luc van Gool, Cordelia Schmid, Cristian Sminchisescu

To handle inherent modeling error in the consistency loss (e.g. Lambertian assumptions) and for better generalization, we further introduce a learned, output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.

Scene Flow Estimation

What Makes for Good Views for Contrastive Learning?

1 code implementation NeurIPS 2020 Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning.

Contrastive Learning Data Augmentation +8

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

4 code implementations CVPR 2020 Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights).

Self-Driving Cars

Learning visual policies for building 3D shape categories

no code implementations 15 Apr 2020 Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

We then show the success of our visual policies for building arches from different primitives.

Memory-Efficient Incremental Learning Through Feature Adaptation

no code implementations ECCV 2020 Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid

We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.

Incremental Learning

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification

1 code implementation ECCV 2020 Nikita Dvornik, Cordelia Schmid, Julien Mairal

Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.

Few-Shot Image Classification General Classification

Beyond the Camera: Neural Networks in World Coordinates

no code implementations 12 Mar 2020 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari

Eye movement and strategic placement of the visual field onto the retina give animals increased resolution of the scene and suppress distracting information.

Action Recognition Video Stabilization +1

Optimized Generic Feature Learning for Few-shot Classification across Domains

no code implementations 22 Jan 2020 Tonmoy Saikia, Thomas Brox, Cordelia Schmid

To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.

BIG-bench Machine Learning Classification +3

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation 9 Dec 2019 Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +1

Learning to Track Any Object

no code implementations 25 Oct 2019 Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.

Instance Segmentation Object Tracking +4

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference

no code implementations 29 Aug 2019 Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou

Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.

Self-supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera

no code implementations ICCV 2019 Yuhua Chen, Cordelia Schmid, Cristian Sminchisescu

We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video, addressing the difficulty of acquiring realistic ground-truth for such tasks.

Optical Flow Estimation Self-Supervised Learning +1

Learning Video Representations using Contrastive Bidirectional Transformer

no code implementations 13 Jun 2019 Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.

Automatic Speech Recognition Representation Learning +4

A Study on Action Detection in the Wild

no code implementations 29 Apr 2019 Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

In this work we study the problem of action detection in a highly-imbalanced dataset.

Action Detection

Learning to Augment Synthetic Images for Sim2Real Policy Transfer

1 code implementation 18 Mar 2019 Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.

Object Localization

Adaptive Density Estimation for Generative Models

no code implementations NeurIPS 2019 Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

Density Estimation

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Retrieval

A Structured Model For Action Detection

no code implementations CVPR 2019 Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.

Action Detection Video Understanding

Modulated Policy Hierarchies

no code implementations 30 Nov 2018 Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, Cordelia Schmid

To achieve this, we study different modulation signals and exploration for hierarchical controllers.

reinforcement-learning

Coverage and Quality Driven Training of Generative Image Models

no code implementations 27 Sep 2018 Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.

Déjà Vu: an empirical evaluation of the memorization properties of ConvNets

no code implementations ICLR 2019 Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.

Data Augmentation Memorization

On the Importance of Visual Context for Data Augmentation in Scene Understanding

no code implementations 6 Sep 2018 Nikita Dvornik, Julien Mairal, Cordelia Schmid

In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.

Data Augmentation Instance Segmentation +4

Actor-Centric Relation Network

1 code implementation ECCV 2018 Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Action Classification Action Detection +2

How good is my GAN?

no code implementations ECCV 2018 Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Generative adversarial networks (GANs) are one of the most popular methods for generating images today.

General Classification Image Classification

End-to-End Incremental Learning

5 code implementations ECCV 2018 Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari

Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.

Ranked #2 on Incremental Learning on ImageNet - 10 steps (# M Params metric)

Image Classification Incremental Learning

Modeling Visual Context is Key to Augmenting Object Detection Datasets

2 code implementations ECCV 2018 Nikita Dvornik, Julien Mairal, Cordelia Schmid

For this approach to be successful, we show that modeling appropriately the visual context surrounding objects is crucial to place them in the right environment.

Data Augmentation object-detection +1

Modeling Spatio-Temporal Human Track Structure for Action Localization

no code implementations 28 Jun 2018 Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.

Human Detection Optical Flow Estimation +3

Spreading vectors for similarity search

1 code implementation ICLR 2019 Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods.

Quantization

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

no code implementations NeurIPS 2018 Daan Wynen, Cordelia Schmid, Julien Mairal

In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

no code implementations 25 Apr 2018 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.

General Classification Video Classification +1

Actor and Observer: Joint Modeling of First and Third-Person Videos

1 code implementation CVPR 2018 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).

Action Recognition

Image-based Synthesis for Deep 3D Human Pose Estimation

no code implementations 12 Feb 2018 Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

3D Human Pose Estimation 3D Pose Estimation +1

Learning to Segment Moving Objects

no code implementations 1 Dec 2017 Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.

Motion Estimation Motion Segmentation +3

Incremental Learning of Object Detectors without Catastrophic Forgetting

3 code implementations ICCV 2017 Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data.

Incremental Learning object-detection +1

Detecting Parts for Action Localization

no code implementations 19 Jul 2017 Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid

In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations.

Action Localization

Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition

no code implementations 13 Jul 2017 Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, Lianwen Jin

To this end, we regard the evolving landmark data as a high-dimensional path and apply non-linear path signature techniques to provide an expressive, robust, non-linear, and interpretable representation for the sequential events.

Action Classification Action Recognition +2

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

4 code implementations CVPR 2018 Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Action Recognition Video Understanding

SCNet: Learning Semantic Correspondence

1 code implementation ICCV 2017 Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, Jean Ponce

This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category.

Semantic correspondence

Action Tubelet Detector for Spatio-Temporal Action Localization

2 code implementations ICCV 2017 Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid

We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores.

Spatio-Temporal Action Localization Temporal Action Localization

SfM-Net: Learning of Structure and Motion from Video

no code implementations 25 Apr 2017 Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations.

Motion Estimation Optical Flow Estimation

Learning Video Object Segmentation with Visual Memory

no code implementations ICCV 2017 Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences.

Motion Segmentation Semantic Segmentation +2

Proposal Flow: Semantic Correspondences from Object Proposals

no code implementations 21 Mar 2017 Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout.

Learning from Synthetic Humans

2 code implementations CVPR 2017 Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.

3D Human Pose Estimation Human Part Segmentation

Learning Motion Patterns in Videos

no code implementations CVPR 2017 Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved.

Ranked #21 on Unsupervised Video Object Segmentation on DAVIS 2016 (using extra training data)

Motion Segmentation Optical Flow Estimation +2

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

no code implementations NeurIPS 2016 Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

Ranked #94 on 3D Human Pose Estimation on Human3.6M (PA-MPJPE metric)

3D Human Pose Estimation 3D Pose Estimation +1

Human Action Localization with Sparse Spatial Supervision

no code implementations 17 May 2016 Philippe Weinzaepfel, Xavier Martin, Cordelia Schmid

We introduce an approach for spatio-temporal human action localization using sparse spatial supervision.

Action Localization

Long-term Temporal Convolutions for Action Recognition

1 code implementation 15 Apr 2016 Gül Varol, Ivan Laptev, Cordelia Schmid

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure.

Action Recognition Optical Flow Estimation

Weakly-Supervised Semantic Segmentation using Motion Cues

no code implementations 23 Mar 2016 Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.

Image Segmentation Weakly supervised Semantic Segmentation +1

Convolutional Patch Representations for Image Retrieval: an Unsupervised Approach

no code implementations 1 Mar 2016 Mattis Paulin, Julien Mairal, Matthijs Douze, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid

Convolutional neural networks (CNNs) have recently received a lot of attention due to their ability to model local stationary structures in natural images in a multi-scale fashion, when learning all model parameters with supervision.

Image Classification Image Retrieval +1

Proposal Flow

no code implementations CVPR 2016 Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category.

Approximate Fisher Kernels of non-iid Image Models for Image Categorization

no code implementations 3 Oct 2015 Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization.

Image Categorization

Online Object Tracking with Proposal Selection

no code implementations ICCV 2015 Yang Hua, Karteek Alahari, Cordelia Schmid

Tracking-by-detection approaches are some of the most successful object trackers in recent years.

Visual Object Tracking

Expanded Parts Model for Semantic Description of Humans in Still Images

no code implementations 14 Sep 2015 Gaurav Sharma, Frederic Jurie, Cordelia Schmid

We validate our method on three recent challenging datasets of human attributes and actions.

Beat-Event Detection in Action Movie Franchises

no code implementations 15 Aug 2015 Danila Potapov, Matthijs Douze, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid

While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as "pursuit" or "romance" remains challenging. We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises.

Classification Event Detection +1

Learning to track for spatio-temporal action localization

no code implementations ICCV 2015 Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art with a margin of 15%, 7% and 12% respectively in mAP.

Spatio-Temporal Action Localization Temporal Action Localization +1

Learning to Detect Motion Boundaries

no code implementations CVPR 2015 Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid

We compare the results obtained with several state-of-the-art optical flow approaches and study the impact of the different cues used in the random forest. Furthermore, we introduce a new dataset, the YouTube Motion Boundaries dataset (YMB), that comprises 60 sequences taken from real-world videos with manually annotated motion boundaries.

Boundary Detection Optical Flow Estimation

Weakly-Supervised Alignment of Video With Text

no code implementations ICCV 2015 Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid

Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.

Unsupervised Object Discovery and Tracking in Video Collections

no code implementations ICCV 2015 Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.

Object Discovery Video Understanding

Label-Embedding for Image Classification

2 code implementations 30 Mar 2015 Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid

Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce.

Classification General Classification +3

Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning

no code implementations 3 Mar 2015 Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.

Multiple Instance Learning Weakly-Supervised Object Localization

Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals

no code implementations CVPR 2015 Minsu Cho, Suha Kwak, Cordelia Schmid, Jean Ponce

This paper addresses unsupervised discovery and localization of dominant objects from a noisy image collection with multiple object classes.

Object Discovery

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations 4 Jul 2014 Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script.

Convolutional Kernel Networks

no code implementations NeurIPS 2014 Julien Mairal, Piotr Koniusz, Zaid Harchaoui, Cordelia Schmid

An important goal in visual recognition is to devise image representations that are invariant to particular transformations.

Image Classification

Efficient Action Localization with Approximately Normalized Fisher Vectors

no code implementations CVPR 2014 Dan Oneata, Jakob Verbeek, Cordelia Schmid

Transformation of the FV by power and L2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks.

Action Recognition General Classification +2

Multi-fold MIL Training for Weakly Supervised Object Localization

no code implementations CVPR 2014 Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.

Multiple Instance Learning Weakly-Supervised Object Localization

Transformation Pursuit for Image Classification

no code implementations CVPR 2014 Mattis Paulin, Jerome Revaud, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid

We propose a principled algorithm – Image Transformation Pursuit (ITP) – for the automatic selection of a compact set of transformations.

Classification General Classification +1

Event Retrieval in Large Video Collections with Circulant Temporal Encoding

no code implementations CVPR 2013 Jerome Revaud, Matthijs Douze, Cordelia Schmid, Herve Jegou

Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain.

Copy Detection Quantization +1

Label-Embedding for Attribute-Based Classification

no code implementations CVPR 2013 Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid

The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e.g., class hierarchies) or to transition smoothly from zero-shot learning to learning with large quantities of data.

Classification General Classification +2

Expanded Parts Model for Human Attribute and Action Recognition in Still Images

no code implementations CVPR 2013 Gaurav Sharma, Frederic Jurie, Cordelia Schmid

We propose a new model for recognizing human attributes (e.g., wearing a suit, sitting, short hair) and actions (e.g., running, riding a horse) in still images.

Action Recognition In Still Images
