Search Results for author: Cordelia Schmid

Found 204 papers, 83 papers with code

Visual Lexicon: Rich Image Features in Language Space

no code implementations 9 Dec 2024 Xudong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid

We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language.

Image Generation Image Reconstruction +2
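
To make the "vocabulary tokens" idea above concrete, here is a minimal PyTorch sketch of mapping image features into the embedding space of a frozen text vocabulary. Everything here (module names, dimensions, the attention-pooling design) is an illustrative assumption rather than the Visual Lexicon architecture; only the general notion of encoding an image as soft text-vocabulary tokens comes from the summary above.

```python
import torch
import torch.nn as nn

class VocabSpaceEncoder(nn.Module):
    """Illustrative sketch: pool image patch features into a fixed number of
    tokens that live in the embedding space of a frozen text vocabulary."""

    def __init__(self, feat_dim=768, vocab_size=32000, embed_dim=512, num_tokens=64):
        super().__init__()
        # Frozen text-token embedding table (stands in for a language model's vocabulary).
        self.vocab_embed = nn.Embedding(vocab_size, embed_dim)
        self.vocab_embed.weight.requires_grad_(False)
        # Learned queries that pool image features into a fixed number of tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.to_logits = nn.Linear(feat_dim, vocab_size)

    def forward(self, patch_feats):                        # (B, N_patches, feat_dim)
        q = self.queries.expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_feats, patch_feats)  # (B, num_tokens, feat_dim)
        probs = self.to_logits(pooled).softmax(dim=-1)      # distribution over the vocabulary
        # Soft "vocabulary tokens": convex combinations of text-token embeddings.
        return probs @ self.vocab_embed.weight              # (B, num_tokens, embed_dim)

tokens = VocabSpaceEncoder()(torch.randn(2, 196, 768))
print(tokens.shape)  # torch.Size([2, 64, 512])
```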

Language-Guided Image Tokenization for Generation

no code implementations 8 Dec 2024 Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu

Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on the ImageNet-256 and -512 benchmarks, respectively, across varying numbers of tokens.

Descriptive Text-to-Image Generation

Grounded Video Caption Generation

no code implementations 12 Nov 2024 Evangelos Kazakos, Cordelia Schmid, Josef Sivic

We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels.

Caption Generation Image Captioning

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

no code implementations 31 Oct 2024 Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data.

Language Modeling Language Modelling +2

Towards Zero-Shot Multimodal Machine Translation

2 code implementations 18 Jul 2024 Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images).

Language Modelling Multimodal Machine Translation +1

DataDream: Few-shot Guided Dataset Generation

1 code implementation 15 Jul 2024 Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, Zeynep Akata

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications.

Classification Dataset Generation +2

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

no code implementations 13 Jun 2024 Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot

We additionally train two types of multilingual models to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data, and (2) a model trained on captioning data only.

Few-Shot Learning In-Context Learning

Smoke and Mirrors in Causal Downstream Tasks

1 code implementation 27 May 2024 Riccardo Cadei, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, Francesco Locatello

Machine Learning and AI have the potential to transform data-driven scientific discovery, enabling accurate predictions for several scientific phenomena.

Causal Inference Representation Learning +1

Learning text-to-video retrieval from image captioning

no code implementations 26 Apr 2024 Lucas Ventura, Cordelia Schmid, Gül Varol

In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos.

Image Captioning Image Retrieval +4

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

no code implementations 24 Apr 2024 Zerui Chen, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

To address these limitations, we propose a new framework ViViDex to improve vision-based policy learning from human videos.

Learning Correlation Structures for Vision Transformers

no code implementations CVPR 2024 Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.

Action Classification Action Recognition +2
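
The snippet above describes attention that exploits correlation patterns in key-query interactions. The sketch below illustrates the general idea by convolving each query's correlation map over keys before the softmax; it is a simplified stand-in, not the paper's StructSA operator, and all dimensions are made up.

```python
import torch
import torch.nn as nn

class CorrelationStructureAttention(nn.Module):
    """Rough sketch of attention that looks at the *spatial structure* of
    query-key correlation maps (a simplified stand-in for StructSA)."""

    def __init__(self, dim=256, h=14, w=14):
        super().__init__()
        self.h, self.w = h, w
        self.qkv = nn.Linear(dim, 3 * dim)
        # A small conv that re-weights each query's correlation map using its
        # local spatial pattern (the "structure-aware" part of this sketch).
        self.struct = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (B, H*W, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        corr = (q @ k.transpose(1, 2)) * self.scale          # (B, N, N) correlation maps
        # View each query's correlations over keys as an h x w map and convolve it.
        maps = corr.reshape(B * N, 1, self.h, self.w)
        corr = self.struct(maps).reshape(B, N, N)
        attn = corr.softmax(dim=-1)
        return attn @ v                                      # (B, N, dim)

out = CorrelationStructureAttention()(torch.randn(2, 14 * 14, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```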

Streaming Dense Video Captioning

1 code implementation CVPR 2024 Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.

Dense Video Captioning

SUGAR: Pre-training 3D Visual Representations for Robotics

no code implementations CVPR 2024 ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation and referring expression grounding to analyze cluttered scenes.

3D Instance Segmentation 3D Object Recognition +5
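
The snippet lists five pre-training objectives trained jointly. The toy function below shows one common way such objectives are combined into a single weighted loss; the per-task loss functions, tensor shapes, and equal default weights are placeholder assumptions, not the actual SUGAR recipe.

```python
import torch
import torch.nn.functional as F

def multitask_pretraining_loss(outputs, targets, weights=None):
    """Toy illustration of combining several pre-training objectives into one
    weighted scalar loss, in the spirit of the five tasks listed above."""
    losses = {
        "distillation":  F.mse_loss(outputs["feat"], targets["teacher_feat"]),
        "masked_points": F.mse_loss(outputs["recon"], targets["points"]),
        "grasp_pose":    F.l1_loss(outputs["grasp"], targets["grasp"]),
        "instance_seg":  F.cross_entropy(outputs["seg_logits"], targets["seg_labels"]),
        "grounding":     F.cross_entropy(outputs["ground_logits"], targets["ground_labels"]),
    }
    weights = weights or {name: 1.0 for name in losses}
    total = sum(weights[name] * value for name, value in losses.items())
    return total, losses

# Random tensors standing in for model outputs and targets.
B = 4
outputs = {"feat": torch.randn(B, 256), "recon": torch.randn(B, 1024, 3),
           "grasp": torch.randn(B, 7), "seg_logits": torch.randn(B, 10),
           "ground_logits": torch.randn(B, 5)}
targets = {"teacher_feat": torch.randn(B, 256), "points": torch.randn(B, 1024, 3),
           "grasp": torch.randn(B, 7), "seg_labels": torch.randint(0, 10, (B,)),
           "ground_labels": torch.randint(0, 5, (B,))}
total, per_task = multitask_pretraining_loss(outputs, targets)
```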

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

2 code implementations CVPR 2024 Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.

Time-, Memory- and Parameter-Efficient Visual Adaptation

no code implementations 5 Feb 2024 Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Here, we outperform both a prior adaptor-based method, which could only scale to a 1-billion-parameter backbone, and full fine-tuning of a smaller backbone, with the same GPU and less training time.

Video Classification

RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

no code implementations 11 Jan 2024 Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf

To capture long spatio-temporal dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a single latent code to model an entire video clip.

Generative Adversarial Network Optical Flow Estimation +1

Pixel-Aligned Language Model

no code implementations CVPR 2024 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modeling Language Modelling

Dense Optical Tracking: Connecting the Dots

1 code implementation CVPR 2024 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot

Optical Flow Estimation Point Tracking

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

1 code implementation 27 Sep 2023 ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.

Multi-Task Learning Robot Manipulation Generalization

VidChapters-7M: Video Chapters at Scale

no code implementations NeurIPS 2023 Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Dense Video Captioning Navigate

CoVR-2: Automatic Data Construction for Composed Video Retrieval

1 code implementation 28 Aug 2023 Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database.

Composed Video Retrieval (CoVR) Language Modelling +4

POCO: 3D Pose and Shape Estimation with Confidence

1 code implementation 24 Aug 2023 Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass.

Action Recognition Pose Estimation +1
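
One standard way to obtain a confidence alongside a regressed pose in a single forward pass is to predict a variance and train with a heteroscedastic Gaussian negative log-likelihood. The sketch below illustrates that pattern only; it is not the exact POCO formulation, and the head architecture and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class PoseWithConfidenceHead(nn.Module):
    """Sketch of a regressor that outputs a pose estimate and a scalar
    confidence (via a predicted log-variance) in one forward pass."""

    def __init__(self, feat_dim=2048, pose_dim=72):
        super().__init__()
        self.pose = nn.Linear(feat_dim, pose_dim)
        self.log_var = nn.Linear(feat_dim, 1)   # larger variance = lower confidence

    def forward(self, feats):
        return self.pose(feats), self.log_var(feats)

def nll_loss(pred_pose, log_var, gt_pose):
    # Heteroscedastic Gaussian NLL: confident (low-variance) errors are penalized more.
    sq_err = ((pred_pose - gt_pose) ** 2).mean(dim=-1, keepdim=True)
    return (torch.exp(-log_var) * sq_err + log_var).mean()

head = PoseWithConfidenceHead()
pose, log_var = head(torch.randn(8, 2048))
loss = nll_loss(pose, log_var, torch.randn(8, 72))
confidence = torch.exp(-log_var)   # can be used to filter uncertain predictions
```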

UnLoc: A Unified Framework for Video Localization Tasks

1 code implementation ICCV 2023 Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.

Action Segmentation Moment Retrieval +5

Object Goal Navigation with Recursive Implicit Maps

no code implementations 10 Aug 2023 ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.

Navigate Object +1

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

no code implementations 28 Jul 2023 Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.

object-detection Object Detection +1

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations NeurIPS 2023 Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification Object +3

How can objects help action recognition?

1 code implementation CVPR 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition Object

Dense Video Object Captioning from Disjoint Supervision

1 code implementation 20 Jun 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.

Object Sentence +2

Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

2 code implementations ICCV 2023 Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.

Classification Language Modeling +2

Learning Video-Conditioned Policies for Unseen Manipulation Tasks

no code implementations 10 May 2023 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos.

Action Recognition Robot Manipulation +1

Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

no code implementations CVPR 2023 Ahmet Iscen, Alireza Fathi, Cordelia Schmid

Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems.

 Ranked #1 on Image Classification on WebVision-1000 (using extra training data)

Learning with noisy labels Long-tail Learning

Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

no code implementations 6 Apr 2023 Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata

In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.

Cross-Modal Retrieval Image-text Retrieval +2

Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

2 code implementations CVPR 2023 Youngwook Kim, Jae Myung Kim, Jieun Jeong, Cordelia Schmid, Zeynep Akata, Jungwoo Lee

Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.

Classification Multi-Label Classification +1

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

no code implementations CVPR 2023 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

(ii) We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).

Automatic Speech Recognition Domain Adaptation +2

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations CVPR 2023 Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.

 Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modeling +2

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

2 code implementations 20 Dec 2022 Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.

Multimodal Machine Translation Translation

Audiovisual Masked Autoencoders

2 code implementations ICCV 2023 Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?

 Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)

Audio Classification Representation Learning

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

1 code implementation ICCV 2023 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones.

SSIM

AVATAR submission to the Ego4D AV Transcription Challenge

no code implementations 18 Nov 2022 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.

Decoder

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation 17 Nov 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object Relation

Learning Reward Functions for Robotic Manipulation by Observing Humans

no code implementations 16 Nov 2022 Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies; yet it is able to accelerate the training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.

Contrastive Learning
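
The core idea summarized above, using a distance learned from human videos as a dense reward, can be sketched as follows. The encoder, embedding size, and the simple negative-Euclidean-distance reward are illustrative assumptions, not the HOLD implementation.

```python
import torch
import torch.nn as nn

class DistanceReward:
    """Illustrative use of a learned functional distance as a dense reward:
    the closer the current observation's embedding is to the goal image's
    embedding, the higher the reward. The encoder here is a random stand-in
    for a model trained on human videos."""

    def __init__(self, encoder: nn.Module, goal_image: torch.Tensor):
        self.encoder = encoder.eval()
        with torch.no_grad():
            self.goal_emb = self.encoder(goal_image.unsqueeze(0))

    @torch.no_grad()
    def __call__(self, observation: torch.Tensor) -> float:
        emb = self.encoder(observation.unsqueeze(0))
        distance = torch.norm(emb - self.goal_emb, dim=-1)
        return (-distance).item()      # a sparse task reward can be added on top

# Toy usage with a tiny random CNN encoder and random images.
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 32))
reward_fn = DistanceReward(encoder, torch.randn(3, 64, 64))
print(reward_fn(torch.randn(3, 64, 64)))
```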

A Memory Transformer Network for Incremental Learning

no code implementations 10 Oct 2022 Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid

We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.

class-incremental learning Class Incremental Learning +1

Instruction-driven history-aware policies for robotic manipulations

2 code implementations 11 Sep 2022 Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)

Robot Manipulation Generalization

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation 24 Aug 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modeling Language Modelling +4

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

2 code implementations 26 Jul 2022 Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.

hand-object pose Object Reconstruction

M&M Mix: A Multimodal Multiview Transformer Ensemble

no code implementations 20 Jun 2022 Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.

Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Recognition Video Recognition

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation 15 Jun 2022 Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Learning to Answer Visual Questions from Web Videos

1 code implementation 10 May 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Dataset Generation Question Answering +5

Weakly-supervised segmentation of referring expressions

no code implementations 10 May 2022 Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.

Image Segmentation Referring Expression +5

Assembly Planning from Observations under Physical Constraints

no code implementations 20 Apr 2022 Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid

This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.

Object object-detection +2

Learning Audio-Video Modalities from Image Captions

no code implementations 1 Apr 2022 Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Image Captioning Retrieval +4

The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

no code implementations 28 Feb 2022 Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.

Motion Segmentation

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +2

Multiview Transformers for Video Recognition

1 code implementation CVPR 2022 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #5 on Action Classification on MiT (using extra training data)

Action Classification Action Recognition +1

Masking Modalities for Cross-modal Video Retrieval

no code implementations 1 Nov 2021 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

Variational Perturbations for Visual Feature Attribution

no code implementations 29 Sep 2021 Jae Myung Kim, Eunji Kim, Sungroh Yoon, Jungwoo Lee, Cordelia Schmid, Zeynep Akata

Explaining a complex black-box system in a post-hoc manner is important to understand its predictions.

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

CCVS: Context-aware Controllable Video Synthesis

1 code implementation NeurIPS 2021 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.

Decoder Optical Flow Estimation +3

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

no code implementations 1 Jul 2021 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.

reinforcement-learning Reinforcement Learning +1

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification Action Recognition +2
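
A simplified sketch of fusing two modality streams through a small set of shared "bottleneck" tokens, which is the mechanism the paper title points to: each modality attends only to its own tokens plus the bottleneck, so cross-modal information must pass through it. Layer types, sizes, and the bottleneck-averaging step are assumptions for illustration, not the exact MBT design.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Sketch of multimodal fusion through a few shared bottleneck tokens."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, video, audio, bottleneck):
        nb = bottleneck.size(1)
        # The video stream updates its own tokens and a copy of the bottleneck tokens.
        v = self.video_layer(torch.cat([video, bottleneck], dim=1))
        video, b_v = v[:, :-nb], v[:, -nb:]
        # The audio stream does the same; the two bottleneck copies are then averaged.
        a = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        audio, b_a = a[:, :-nb], a[:, -nb:]
        return video, audio, (b_v + b_a) / 2

layer = BottleneckFusionLayer()
v, a, b = layer(torch.randn(2, 100, 256), torch.randn(2, 50, 256), torch.randn(2, 4, 256))
```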

Residual Reinforcement Learning from Demonstrations

no code implementations 15 Jun 2021 Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.

reinforcement-learning Reinforcement Learning +1

Large-Scale Unsupervised Object Discovery

1 code implementation NeurIPS 2021 Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce

Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than, the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.

Multi-object discovery Object +2

Episodic Transformer for Vision-and-Language Navigation

1 code implementation ICCV 2021 Alexander Pashevich, Cordelia Schmid, Chen Sun

We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.

Vision and Language Navigation

Class-Balanced Distillation for Long-Tailed Visual Recognition

3 code implementations 12 Apr 2021 Ahmet Iscen, André Araujo, Boqing Gong, Cordelia Schmid

An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.

Image Classification Knowledge Distillation +1
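
The snippet describes learning features with instance-balanced sampling and the classifier with class-balanced sampling. Below is a small sketch of a class-balanced sampler built on PyTorch's WeightedRandomSampler; the long-tailed toy labels and sampler settings are illustrative, not the paper's training setup.

```python
import torch
from torch.utils.data import WeightedRandomSampler
from collections import Counter

def class_balanced_sampler(labels):
    """Sample each class with (roughly) equal probability, as typically used
    for the classifier stage; the feature-learning stage would instead use a
    plain shuffled (instance-balanced) sampler."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# Toy long-tailed label list: class 0 is 50x more frequent than class 2.
labels = [0] * 500 + [1] * 50 + [2] * 10
sampler = class_balanced_sampler(labels)
drawn = [labels[i] for i in sampler]
print({c: drawn.count(c) for c in sorted(set(drawn))})   # roughly equal counts per class
```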

Improving robustness against common corruptions with frequency biased models

no code implementations ICCV 2021 Tonmoy Saikia, Cordelia Schmid, Thomas Brox

CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.

Data Augmentation object-detection +1

ViViT: A Video Vision Transformer

10 code implementations ICCV 2021 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #8 on Action Classification on MiT (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +4

Unified Graph Structured Models for Video Understanding

no code implementations ICCV 2021 Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +4

Learning Temporal Dynamics from Cycles in Narrated Video

no code implementations ICCV 2021 Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.

Image Matching with Scale Adjustment

no code implementations 10 Dec 2020 Yves Dufournaud, Cordelia Schmid, Radu Horaud

In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one.

Look Before you Speak: Visually Contextualized Utterances

no code implementations CVPR 2021 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Question Answering Question Generation +4

Learning Obstacle Representations for Neural Motion Planning

1 code implementation 25 Aug 2020 Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

Motion planning and obstacle avoidance are key challenges in robotics applications.

Robotics

Multi-modal Transformer for Video Retrieval

1 code implementation ECCV 2020 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.

 Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)

Natural Language Queries Retrieval +2

Consistency Guided Scene Flow Estimation

no code implementations ECCV 2020 Yuhua Chen, Luc van Gool, Cordelia Schmid, Cristian Sminchisescu

To handle inherent modeling error in the consistency loss (e.g., Lambertian assumptions) and for better generalization, we further introduce a learned output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.

Scene Flow Estimation
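
The learned output refinement described above (prediction, loss, and gradient in; an update out) can be sketched as a small network applied to a flow field. The architecture, the toy smoothness loss, and the single refinement step below are assumptions meant only to illustrate the pattern, not the paper's network.

```python
import torch
import torch.nn as nn

class LearnedRefiner(nn.Module):
    """Sketch of a refinement step mapping (initial prediction, loss value,
    loss gradient) to an additive update on the prediction."""

    def __init__(self, channels=2):                 # e.g. a 2-channel flow field
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, pred, loss_fn):
        pred = pred.detach().requires_grad_(True)
        loss = loss_fn(pred)
        grad, = torch.autograd.grad(loss, pred)
        loss_map = torch.full_like(pred[:, :1], loss.item())   # broadcast scalar loss
        update = self.net(torch.cat([pred, grad, loss_map], dim=1))
        return pred.detach() + update               # refined prediction

# Toy "consistency" loss: encourage a spatially smooth flow field.
flow = torch.randn(1, 2, 64, 64)
smoothness = lambda f: (f[..., 1:] - f[..., :-1]).abs().mean()
refined = LearnedRefiner()(flow, smoothness)
```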

TAO: A Large-Scale Benchmark for Tracking Any Object

no code implementations ECCV 2020 Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum.

Multi-Object Tracking Object +2

What Makes for Good Views for Contrastive Learning?

1 code implementation NeurIPS 2020 Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.

Contrastive Learning Data Augmentation +8
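
Since the snippet is about contrastive learning between views, here is a minimal InfoNCE loss between two batches of view embeddings, the standard objective such work builds on; the temperature and embedding sizes are arbitrary, and this is not the paper's specific view-selection method.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Minimal InfoNCE loss between two batches of view embeddings: matching
    indices are positives, all other samples in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

# Two "views" of the same batch (in practice: two augmentations of each image).
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce(z1, z2))
```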

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

4 code implementations CVPR 2020 Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).

Graph Neural Network Self-Driving Cars
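
VectorNet's vectorized representation encodes each map element or agent track as a polyline of vector segments. The sketch below shows a PointNet-style polyline encoder (shared MLP plus max-pooling) loosely in that spirit; the feature sizes and input vector format are assumptions, and the global interaction stage is only hinted at in a comment.

```python
import torch
import torch.nn as nn

class PolylineEncoder(nn.Module):
    """Sketch of encoding a polyline (e.g. a lane segment or an agent track) as
    a single feature: a shared MLP over its constituent vectors, then max-pooling."""

    def __init__(self, in_dim=4, hidden=64):     # each vector: (x_start, y_start, x_end, y_end)
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, vectors):                   # (num_polylines, num_vectors, in_dim)
        return self.mlp(vectors).max(dim=1).values   # (num_polylines, hidden)

# 10 polylines (lanes/agents), each made of 20 vector segments.
polyline_feats = PolylineEncoder()(torch.randn(10, 20, 4))
# A global interaction graph (e.g. self-attention over polyline_feats) would follow.
print(polyline_feats.shape)   # torch.Size([10, 64])
```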

Learning visual policies for building 3D shape categories

no code implementations 15 Apr 2020 Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

We then show the success of our visual policies for building arches from different primitives.

Object

Memory-Efficient Incremental Learning Through Feature Adaptation

no code implementations ECCV 2020 Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid

We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.

Incremental Learning
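
The snippet describes adapting previously stored feature vectors to an updated feature space without the original images. A common way to sketch this is to fit a small mapping network on pairs of old/new features computed from currently available data, then apply it to the stored vectors; the network size, optimizer, and training loop below are illustrative, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def adapt_stored_features(old_feats_current, new_feats_current, stored_old_feats,
                          epochs=100, lr=1e-3):
    """Learn a mapping from the old backbone's feature space to the new one,
    using images that are still available (features from both backbones), and
    apply it to stored feature vectors whose original images are gone."""
    mapper = nn.Sequential(nn.Linear(old_feats_current.size(1), 512), nn.ReLU(),
                           nn.Linear(512, new_feats_current.size(1)))
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(mapper(old_feats_current), new_feats_current)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return mapper(stored_old_feats)          # stored vectors moved to the new space

adapted = adapt_stored_features(torch.randn(256, 512), torch.randn(256, 512),
                                torch.randn(1000, 512))
```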

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition
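
The snippet describes a BERT-based classifier mapping transcribed speech segments to action labels. The sketch below uses a small generic transformer text classifier instead of BERT to stay self-contained; the vocabulary size, depth, and the number of action classes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SpeechToActionClassifier(nn.Module):
    """Generic text classifier standing in for the BERT-based model described
    above: token IDs of a transcribed speech segment go in, logits over a
    fixed set of action labels come out."""

    def __init__(self, vocab_size=30522, dim=256, num_actions=18):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, num_actions)

    def forward(self, token_ids):                  # (B, seq_len)
        x = self.encoder(self.embed(token_ids))
        return self.head(x.mean(dim=1))            # mean-pool over tokens, then classify

model = SpeechToActionClassifier()
logits = model(torch.randint(0, 30522, (8, 32)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 18, (8,)))
```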

Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification

1 code implementation ECCV 2020 Nikita Dvornik, Cordelia Schmid, Julien Mairal

Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.

feature selection Few-Shot Image Classification +2

Beyond the Camera: Neural Networks in World Coordinates

no code implementations 12 Mar 2020 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari

Eye movement and strategic placement of the visual field onto the retina give animals increased resolution of the scene and suppress distracting information.

Action Recognition Video Stabilization +1

Optimized Generic Feature Learning for Few-shot Classification across Domains

no code implementations 22 Jan 2020 Tonmoy Saikia, Thomas Brox, Cordelia Schmid

To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.

BIG-bench Machine Learning Classification +3

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation 9 Dec 2019 Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +2

Learning to Track Any Object

no code implementations 25 Oct 2019 Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.

Instance Segmentation Object +5

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference

no code implementations 29 Aug 2019 Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou

Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.

Self-supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera

no code implementations ICCV 2019 Yuhua Chen, Cordelia Schmid, Cristian Sminchisescu

We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video - addressing the difficulty of acquiring realistic ground-truth for such tasks.

Monocular Depth Estimation Optical Flow Estimation +3

Learning Video Representations using Contrastive Bidirectional Transformer

no code implementations 13 Jun 2019 Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

A Study on Action Detection in the Wild

no code implementations 29 Apr 2019 Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

In this work we study the problem of action detection in a highly-imbalanced dataset.

Action Detection

Diversity with Cooperation: Ensemble Methods for Few-Shot Classification

1 code implementation ICCV 2019 Nikita Dvornik, Cordelia Schmid, Julien Mairal

Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples.

Classification Diversity +3

Learning to Augment Synthetic Images for Sim2Real Policy Transfer

1 code implementation 18 Mar 2019 Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.

Object Localization

Adaptive Density Estimation for Generative Models

no code implementations NeurIPS 2019 Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

Decoder Density Estimation

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Retrieval Triplet

A Structured Model For Action Detection

no code implementations CVPR 2019 Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.

Action Detection Video Understanding

Coverage and Quality Driven Training of Generative Image Models

no code implementations 27 Sep 2018 Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.

Decoder

Déjà Vu: an empirical evaluation of the memorization properties of ConvNets

no code implementations ICLR 2019 Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.

Data Augmentation Memorization

On the Importance of Visual Context for Data Augmentation in Scene Understanding

no code implementations 6 Sep 2018 Nikita Dvornik, Julien Mairal, Cordelia Schmid

In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.

Data Augmentation Instance Segmentation +7

Actor-Centric Relation Network

1 code implementation ECCV 2018 Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Action Classification Action Detection +5

How good is my GAN?

no code implementations ECCV 2018 Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Generative adversarial networks (GANs) are one of the most popular methods for generating images today.

Diversity General Classification +1

End-to-End Incremental Learning

6 code implementations ECCV 2018 Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari

Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.

Image Classification Incremental Learning

Modeling Visual Context is Key to Augmenting Object Detection Datasets

2 code implementations ECCV 2018 Nikita Dvornik, Julien Mairal, Cordelia Schmid

For this approach to be successful, we show that modeling appropriately the visual context surrounding objects is crucial to place them in the right environment.

Data Augmentation object-detection +1

Modeling Spatio-Temporal Human Track Structure for Action Localization

no code implementations 28 Jun 2018 Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.

Human Detection Optical Flow Estimation +3

Spreading vectors for similarity search

2 code implementations ICLR 2019 Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods.

Quantization Triplet

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

no code implementations NeurIPS 2018 Daan Wynen, Cordelia Schmid, Julien Mairal

In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.

Actor and Observer: Joint Modeling of First and Third-Person Videos

1 code implementation CVPR 2018 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).

Action Recognition Temporal Action Localization

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

no code implementations 25 Apr 2018 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.

General Classification Video Classification +1

Image-based Synthesis for Deep 3D Human Pose Estimation

no code implementations 12 Feb 2018 Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

3D Human Pose Estimation 3D Pose Estimation +1

Learning to Segment Moving Objects

no code implementations 1 Dec 2017 Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.

Motion Estimation Motion Segmentation +4