1 code implementation • 12 Dec 2024 • Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand
This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos.
Ranked #1 on Multiple-choice on Neptune-Full
no code implementations • 9 Dec 2024 • Xudong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid
We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language.
no code implementations • 8 Dec 2024 • Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu
Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on the ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens.
no code implementations • 12 Nov 2024 • Evangelos Kazakos, Cordelia Schmid, Josef Sivic
We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels.
no code implementations • 31 Oct 2024 • Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen
Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data.
1 code implementation • 2 Oct 2024 • Ricardo Garcia, ShiZhe Chen, Cordelia Schmid
3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
Ranked #1 on Robot Manipulation Generalization on GEMBench
2 code implementations • 18 Jul 2024 • Matthieu Futeral, Cordelia Schmid, Benoît Sagot, Rachel Bawden
Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e. models are trained on sentences with their translations and accompanying images).
1 code implementation • 15 Jul 2024 • Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications.
no code implementations • 13 Jun 2024 • Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot
We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only.
1 code implementation • 27 May 2024 • Riccardo Cadei, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, Francesco Locatello
Machine Learning and AI have the potential to transform data-driven scientific discovery, enabling accurate predictions for several scientific phenomena.
no code implementations • 26 Apr 2024 • Lucas Ventura, Cordelia Schmid, Gül Varol
In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos.
no code implementations • 24 Apr 2024 • Zerui Chen, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
To address these limitations, we propose a new framework ViViDex to improve vision-based policy learning from human videos.
no code implementations • CVPR 2024 • Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.
Ranked #5 on Multiple-choice on Neptune-Full
no code implementations • CVPR 2024 • Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.
Ranked #5 on Action Recognition on Diving-48
1 code implementation • CVPR 2024 • Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video.
no code implementations • CVPR 2024 • ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation, and referring expression grounding to analyze cluttered scenes.
2 code implementations • CVPR 2024 • Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid
In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.
no code implementations • 2 Mar 2024 • Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi
SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene.
no code implementations • 5 Feb 2024 • Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab
Here, we outperform a prior adaptor-based method, which could only scale to a 1-billion-parameter backbone, as well as full fine-tuning of a smaller backbone, with the same GPU and less training time.
no code implementations • 11 Jan 2024 • Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf
To capture long spatio-temporal dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a single latent code to model an entire video clip.
no code implementations • CVPR 2024 • Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
no code implementations • CVPR 2024 • Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab
Here we outperform a prior adaptor-based method, which could only scale to a 1-billion-parameter backbone, as well as full fine-tuning of a smaller backbone, with the same GPU and less training time.
no code implementations • 14 Dec 2023 • Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
1 code implementation • CVPR 2024 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid
Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot .
1 code implementation • 27 Sep 2023 • ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev
The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.
Ranked #5 on Robot Manipulation Generalization on GEMBench
no code implementations • NeurIPS 2023 • Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
1 code implementation • 28 Aug 2023 • Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database.
Ranked #1 on Composed Video Retrieval (CoVR) on WebVid-CoVR
1 code implementation • 24 Aug 2023 • Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas
To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass.
1 code implementation • ICCV 2023 • Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.
Ranked #1 on Action Segmentation on COIN
no code implementations • 10 Aug 2023 • ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid
Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.
no code implementations • 28 Jul 2023 • Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.
no code implementations • NeurIPS 2023 • Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.
1 code implementation • CVPR 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.
1 code implementation • 20 Jun 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.
2 code implementations • ICCV 2023 • Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata
The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.
no code implementations • 12 Jun 2023 • Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems.
Ranked #3 on Fine-Grained Image Recognition on OVEN
1 code implementation • 8 Jun 2023 • Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
We present a framework that formulates visual question answering as modular code generation.
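The entry above frames visual question answering as generating a short program that composes vision modules. A minimal sketch of the idea, with hypothetical stub primitives (`find`, `count`) standing in for real detection modules; this is an illustration of the framing, not the paper's actual API:

```python
# Hypothetical vision primitives; in the real framework these would be
# backed by detectors, captioners, etc.
def find(image, category):
    """Stub detector: return the boxes in `image` labeled `category`."""
    return [box for box in image["boxes"] if box["label"] == category]

def count(boxes):
    """Count the detected boxes."""
    return len(boxes)

# The kind of program a language model might generate for the question
# "How many dogs are there?"
def generated_program(image):
    dogs = find(image, "dog")
    return count(dogs)

# Toy "image" represented as pre-extracted boxes for the sketch.
image = {"boxes": [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]}
print(generated_program(image))  # prints 2
```

The answer is produced by executing the generated code, so intermediate reasoning steps (detection, counting) stay inspectable.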
no code implementations • 10 May 2023 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev
To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos.
no code implementations • CVPR 2024 • Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab
The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.
Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)
1 code implementation • CVPR 2023 • Zerui Chen, ShiZhe Chen, Cordelia Schmid, Ivan Laptev
In particular, we address reconstruction of hands and manipulated objects from monocular RGB images.
Ranked #5 on hand-object pose on DexYCB
1 code implementation • ICCV 2023 • Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.
Ranked #21 on Zero-Shot Video Question Answer on NExT-QA
no code implementations • CVPR 2023 • Ahmet Iscen, Alireza Fathi, Cordelia Schmid
Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems.
Ranked #1 on Image Classification on WebVision-1000 (using extra training data)
no code implementations • 6 Apr 2023 • Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata
In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
2 code implementations • CVPR 2023 • Youngwook Kim, Jae Myung Kim, Jieun Jeong, Cordelia Schmid, Zeynep Akata, Jungwoo Lee
Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.
no code implementations • CVPR 2023 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).
3 code implementations • CVPR 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
2 code implementations • 20 Dec 2022 • Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.
1 code implementation • CVPR 2023 • Ziniu Hu, Ahmet Iscen, Chen Sun, ZiRui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi
REVEAL consists of four key components: the memory, the encoder, the retriever and the generator.
Ranked #9 on Visual Question Answering (VQA) on OK-VQA
2 code implementations • ICCV 2023 • Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?
Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)
1 code implementation • 5 Dec 2022 • Mathilde Caron, Neil Houlsby, Cordelia Schmid
Pixel-level labels are particularly expensive to acquire.
1 code implementation • ICCV 2023 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid
This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones.
no code implementations • 18 Nov 2022 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.
1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
no code implementations • 16 Nov 2022 • Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid
Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies. Yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
no code implementations • 10 Oct 2022 • Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
no code implementations • 19 Sep 2022 • Quentin Le Lidec, Wilson Jallet, Ivan Laptev, Cordelia Schmid, Justin Carpentier
Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages.
2 code implementations • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.
Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)
1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.
Ranked #1 on Visual Navigation on SOON Test
no code implementations • 14 Aug 2022 • Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid
In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.
2 code implementations • 26 Jul 2022 • Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev
We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.
Ranked #9 on hand-object pose on DexYCB
no code implementations • 8 Jul 2022 • Anurag Arnab, Xuehan Xiong, Alexey Gritsenko, Rob Romijnders, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid
Transfer learning is the predominant paradigm for training deep networks on small target datasets.
no code implementations • 20 Jun 2022 • Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.
Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
3 code implementations • 16 Jun 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.
Ranked #1 on Zero-Shot Video Question Answer on TVQA
1 code implementation • 15 Jun 2022 • Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.
1 code implementation • 10 May 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.
no code implementations • 10 May 2022 • Robin Strudel, Ivan Laptev, Cordelia Schmid
Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.
no code implementations • 20 Apr 2022 • Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid
This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.
no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
Ranked #6 on Zero-shot Text to Audio Retrieval on AudioCaps
1 code implementation • CVPR 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
Ranked #2 on Spatio-Temporal Video Grounding on VidSTG
no code implementations • 28 Feb 2022 • Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari
In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.
1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
Ranked #5 on Visual Navigation on SOON Test
1 code implementation • CVPR 2022 • Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid
Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.
no code implementations • CVPR 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Recent video and language pretraining frameworks lack the ability to generate sentences.
Ranked #15 on Video Captioning on MSR-VTT (using extra training data)
1 code implementation • CVPR 2022 • Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.
Ranked #5 on Action Classification on MiT (using extra training data)
no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
1 code implementation • NeurIPS 2021 • ShiZhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
Ranked #3 on Vision and Language Navigation on RxR
no code implementations • NeurIPS 2021 • Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier
Notably, images depend both on the properties of observed scenes and on the process of image formation.
no code implementations • 29 Sep 2021 • Jae Myung Kim, Eunji Kim, Sungroh Yoon, Jungwoo Lee, Cordelia Schmid, Zeynep Akata
Explaining a complex black-box system in a post-hoc manner is important to understand its predictions.
2 code implementations • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #3 on Vision and Language Navigation on VLN Challenge
1 code implementation • 16 Aug 2021 • Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid
Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos.
Ranked #5 on hand-object pose on HO-3D v2
1 code implementation • NeurIPS 2021 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid
The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.
Ranked #8 on Video Generation on BAIR Robot Pushing
no code implementations • 1 Jul 2021 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev
Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.
1 code implementation • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #2 on Action Classification on Kinetics-Sounds
no code implementations • CVPR 2021 • Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov
To address this issue, we introduce a new challenging task to generate HD maps.
no code implementations • 15 Jun 2021 • Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.
1 code implementation • NeurIPS 2021 • Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce
Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than, the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.
1 code implementation • ICCV 2021 • Alexander Pashevich, Cordelia Schmid, Chen Sun
We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.
8 code implementations • ICCV 2021 • Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
In this paper we introduce Segmenter, a transformer model for semantic segmentation.
Ranked #15 on Semantic Segmentation on PASCAL Context
3 code implementations • 12 Apr 2021 • Ahmet Iscen, André Araujo, Boqing Gong, Cordelia Schmid
An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.
Ranked #13 on Long-tail Learning on iNaturalist 2018
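The decoupled recipe described above trains the feature extractor with instance-balanced (uniform) sampling and the classifier with class-balanced sampling. A small sketch of the two sampling distributions on a toy long-tailed label set (the data and numbers are illustrative, not from the paper):

```python
from collections import Counter

# Toy long-tailed dataset: 90 cats, 9 dogs, 1 axolotl.
labels = ["cat"] * 90 + ["dog"] * 9 + ["axolotl"] * 1
counts = Counter(labels)

# Instance-balanced sampling: every example is equally likely,
# so frequent classes dominate the minibatches.
instance_weights = [1.0] * len(labels)

# Class-balanced sampling: weight each example by 1 / (class frequency),
# so every class contributes equally in expectation.
class_weights = [1.0 / counts[y] for y in labels]

# Expected per-class probability mass under class-balanced sampling.
total = sum(class_weights)
mass = {c: sum(w for y, w in zip(labels, class_weights) if y == c) / total
        for c in counts}
print(mass)  # each of the three classes gets ~1/3 of the sampling mass
```

These weights would typically be fed to a weighted sampler (e.g. PyTorch's `WeightedRandomSampler`) during the classifier-retraining stage.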
1 code implementation • 6 Apr 2021 • Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, Cordelia Schmid
This paper introduces temporally local metrics for Multi-Object Tracking.
no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
We focus on contrastive methods for self-supervised video representation learning.
no code implementations • ICCV 2021 • Tonmoy Saikia, Cordelia Schmid, Thomas Brox
CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.
10 code implementations • ICCV 2021 • Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #8 on Action Classification on MiT (Top 5 Accuracy metric, using extra training data)
no code implementations • ICCV 2021 • Anurag Arnab, Chen Sun, Cordelia Schmid
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.
no code implementations • ICCV 2021 • Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.
no code implementations • 10 Dec 2020 • Yves Dufournaud, Cordelia Schmid, Radu Horaud
In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one.
no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
1 code implementation • ICCV 2021 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.
Ranked #1 on Video Question Answering on VideoQA
1 code implementation • 25 Aug 2020 • Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid
Motion planning and obstacle avoidance is a key challenge in robotics applications.
4 code implementations • 19 Aug 2020 • Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Cong-Cong Li, Dragomir Anguelov
Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
no code implementations • 29 Jul 2020 • Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross
Based on this observation, we propose to use text as a method for learning video representations.
no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
1 code implementation • ECCV 2020 • Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)
1 code implementation • 28 Jun 2020 • Pavel Tokmakov, Martial Hebert, Cordelia Schmid
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
no code implementations • ECCV 2020 • Yuhua Chen, Luc van Gool, Cordelia Schmid, Cristian Sminchisescu
To handle inherent modeling error in the consistency loss (e.g. Lambertian assumptions) and for better generalization, we further introduce a learned, output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.
no code implementations • ECCV 2020 • Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan
To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum.
1 code implementation • NeurIPS 2020 • Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola
Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.
Ranked #2 on Contrastive Learning on imagenet-1k
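The core objective behind contrastive learning between views can be sketched as an InfoNCE-style loss: matching rows of two view batches are positives, all other pairs are negatives. This is a minimal NumPy sketch of the general objective, not the paper's exact multiview formulation:

```python
import numpy as np

def info_nce(view1, view2, temperature=0.1):
    """InfoNCE-style contrastive loss between two views of the same batch:
    row i of view1 and row i of view2 are a positive pair; every other
    pairing acts as a negative."""
    a = view1 / np.linalg.norm(view1, axis=1, keepdims=True)
    b = view2 / np.linalg.norm(view2, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # positives on diagonal

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_aligned = info_nce(x, x)                        # identical views
loss_random = info_nce(x, rng.normal(size=(8, 16)))  # unrelated views
print(loss_aligned < loss_random)  # True
```

Minimizing this loss pulls embeddings of the same underlying sample together across views while pushing apart embeddings of different samples.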
4 code implementations • CVPR 2020 • Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid
Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights).
no code implementations • CVPR 2020 • Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid
Modeling hand-object manipulations is essential for understanding how humans interact with their environment.
Ranked #9 on hand-object pose on HO-3D v2
no code implementations • 15 Apr 2020 • Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid
We then show the success of our visual policies for building arches from different primitives.
no code implementations • ECCV 2020 • Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid
We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.
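One simple way to adapt stored feature vectors to an updated feature space without the original images is to fit a mapping on images that are available under both the old and new model, then apply it to the memorized vectors. The linear least-squares adapter below is a minimal sketch of that idea under a made-up exact-linear-drift assumption, not the paper's method:

```python
import numpy as np

def fit_feature_adapter(old_feats, new_feats):
    """Learn a linear map from the old feature space to the updated one,
    using samples whose features are known in both spaces
    (least-squares sketch)."""
    W, *_ = np.linalg.lstsq(old_feats, new_feats, rcond=None)
    return W

rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 16))       # assumed drift between models
old = rng.normal(size=(100, 16))           # features from previous model
new = old @ true_map                       # same images, updated model
W = fit_feature_adapter(old, new)

# Adapt memorized old-class features without re-reading their images.
stored = rng.normal(size=(5, 16))
adapted = stored @ W
print(np.allclose(adapted, stored @ true_map))  # True
```

The point of the sketch is the workflow: only currently available data is used to estimate the mapping, yet all previously stored vectors can be brought into the new space.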
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
1 code implementation • ECCV 2020 • Nikita Dvornik, Cordelia Schmid, Julien Mairal
Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.
Ranked #4 on Few-Shot Image Classification on Meta-Dataset
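The "adapt a generic representation to new classes from a few labeled samples" step can be illustrated with a nearest-class-mean classifier over embeddings: average the few support embeddings per class, then assign queries to the closest mean. This is a generic few-shot baseline sketch, not the ensemble method of the entry:

```python
import numpy as np

def prototype_classify(support, support_labels, queries):
    """Nearest-class-mean few-shot classifier: build one prototype per
    class from the few labelled support embeddings, then assign each
    query to the nearest prototype."""
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in classes])
    dists = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]

support = np.array([[0.0, 0.1], [0.1, 0.0],    # class 0 near the origin
                    [1.0, 1.1], [1.1, 1.0]])   # class 1 near (1, 1)
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.05, 0.0], [0.9, 1.0]])
pred = prototype_classify(support, labels, queries)
print(pred)  # [0 1]
```

No gradient steps are needed at adaptation time: the "training" for a new class is just averaging its support embeddings, which is why a good generic representation matters so much.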
no code implementations • 12 Mar 2020 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari
Eye movement and strategic placement of the visual field onto the retina give animals increased resolution of the scene and suppress distracting information.
2 code implementations • ICML 2020 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
The mark is robust to strong variations such as different architectures or optimization methods.
no code implementations • 22 Jan 2020 • Tonmoy Saikia, Thomas Brox, Cordelia Schmid
To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.
1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.
no code implementations • 25 Oct 2019 • Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan
Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.
1 code implementation • ECCV 2020 • Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondrej Chum, Cordelia Schmid
In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given.
no code implementations • 29 Aug 2019 • Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou
Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.
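The simplest concrete instance of membership inference is a loss-threshold attack: since models fit their training data more tightly, a sample with unusually low loss is predicted to be a member. The numbers below are toy values chosen for illustration, not results from the paper:

```python
import numpy as np

def membership_attack(losses, threshold):
    """Loss-threshold membership inference: predict 'member' for samples
    whose loss under the trained model falls below the threshold."""
    return losses < threshold

# Toy losses: members were fit by the model, non-members were not.
member_losses = np.array([0.05, 0.10, 0.08])
nonmember_losses = np.array([0.90, 1.20, 0.75])
pred_members = membership_attack(member_losses, threshold=0.5)
pred_nonmembers = membership_attack(nonmember_losses, threshold=0.5)
print(pred_members.all(), pred_nonmembers.any())  # True False
```

Real attacks must calibrate the threshold (e.g. per-sample or via shadow models), but the decision rule itself is this one comparison.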
1 code implementation • 2 Aug 2019 • Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision.
no code implementations • ICCV 2019 • Valentin Gabeur, Jean-Sebastien Franco, Xavier Martin, Cordelia Schmid, Gregory Rogez
In this paper, we tackle the problem of 3D human shape estimation from single RGB images.
no code implementations • ICCV 2019 • Yuhua Chen, Cordelia Schmid, Cristian Sminchisescu
We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video - addressing the difficulty of acquiring realistic ground-truth for such tasks.
no code implementations • 13 Jun 2019 • Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.
Automatic Speech Recognition (ASR) +5
no code implementations • 29 Apr 2019 • Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid
In this work we study the problem of action detection in a highly-imbalanced dataset.
3 code implementations • CVPR 2019 • Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, Cordelia Schmid
Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation.
Ranked #7 on hand-object pose on DexYCB
no code implementations • CVPR 2019 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid
This paper focuses on multi-person action forecasting in videos.
3 code implementations • ICCV 2019 • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.
Ranked #1 on Action Classification on YouCook2
1 code implementation • ICCV 2019 • Nikita Dvornik, Cordelia Schmid, Julien Mairal
Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples.
1 code implementation • 18 Mar 2019 • Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid
Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.
1 code implementation • 5 Jan 2019 • Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.
Active Speaker Detection Audio-Visual Active Speaker Detection +4
no code implementations • NeurIPS 2019 • Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek
We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.
no code implementations • ICCV 2019 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic
We seek to detect visual relations in images in the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.
no code implementations • CVPR 2019 • Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid
A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.
no code implementations • 30 Nov 2018 • Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, Cordelia Schmid
To achieve this, we study different modulation signals and exploration for hierarchical controllers.
no code implementations • 27 Sep 2018 • Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek
First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.
no code implementations • ICLR 2019 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.
no code implementations • 6 Sep 2018 • Nikita Dvornik, Julien Mairal, Cordelia Schmid
In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.
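The blending step above (pasting an object crop into an existing scene using its instance mask) can be sketched in a few lines; the hard binary mask and fixed paste location here are simplifying assumptions, whereas the paper's contribution is choosing *where* to paste via context modeling:

```python
import numpy as np

def paste_object(scene, obj, mask, top, left):
    """Blend an object crop into a scene at (top, left) using its
    instance-segmentation mask (1 = object pixel, 0 = background)."""
    out = scene.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = np.where(mask[..., None] == 1,
                                               obj, region)
    return out

scene = np.zeros((6, 6, 3), dtype=np.uint8)          # toy black scene
obj = np.full((2, 2, 3), 255, dtype=np.uint8)        # toy white object
mask = np.array([[1, 0],
                 [1, 1]])                            # L-shaped instance
aug = paste_object(scene, obj, mask, top=2, left=2)
print(aug[2, 2, 0], aug[2, 3, 0])  # 255 0
```

Only pixels under the mask are overwritten, so the augmented image keeps the surrounding scene intact while gaining a new annotated instance for free.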
1 code implementation • ECCV 2018 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid
A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Ranked #15 on Action Recognition on AVA v2.1
no code implementations • ECCV 2018 • Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari
Generative adversarial networks (GANs) are one of the most popular methods for generating images today.
6 code implementations • ECCV 2018 • Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari
Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.
Ranked #2 on Incremental Learning on ImageNet100 - 10 steps (# M Params metric)
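A standard ingredient against catastrophic forgetting in this setting is combining cross-entropy on the enlarged label set with a distillation term that keeps the old-class outputs close to the previous model's. The sketch below shows that combined loss in NumPy with made-up logits; the temperature and weighting are illustrative assumptions:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def incremental_loss(new_logits, old_logits, labels, n_old, T=2.0):
    """Cross-entropy on all classes plus a distillation term that keeps
    the old-class outputs close to the previous model's predictions,
    counteracting catastrophic forgetting."""
    p = softmax(new_logits)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    q_old = softmax(old_logits[:, :n_old], T)   # previous model, softened
    p_old = softmax(new_logits[:, :n_old], T)   # current model, softened
    distill = -np.mean((q_old * np.log(p_old + 1e-12)).sum(axis=1))
    return ce + distill

rng = np.random.default_rng(0)
logits_prev = rng.normal(size=(4, 5))             # 5 old classes
labels = np.array([5, 6, 5, 6])                   # samples of 2 new classes
logits_new = np.concatenate([logits_prev, rng.normal(size=(4, 2))], axis=1)
loss = incremental_loss(logits_new, logits_prev, labels, n_old=5)
print(loss > 0)  # True
```

The cross-entropy term pulls the network toward the new classes while the distillation term penalizes drifting away from what it already knew, trading plasticity against stability.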
2 code implementations • ECCV 2018 • Nikita Dvornik, Julien Mairal, Cordelia Schmid
For this approach to be successful, we show that modeling appropriately the visual context surrounding objects is crucial to place them in the right environment.
1 code implementation • NeurIPS 2018 • Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, Cordelia Schmid
Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization.
no code implementations • 28 Jun 2018 • Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid
In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.
2 code implementations • ICLR 2019 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods.
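The discretization step that indexing methods build on can be illustrated with plain Lloyd's algorithm (k-means): each vector is replaced by the id of its nearest centroid. This is the classic baseline the entry's method improves on, sketched with a deterministic initialization for reproducibility:

```python
import numpy as np

def kmeans_quantize(x, k, iters=20):
    """Lloyd's algorithm: discretise a continuous distribution into k
    cells, replacing each vector by the id of its nearest centroid."""
    # Deterministic init: k points spread evenly through the dataset.
    centroids = x[np.linspace(0, len(x) - 1, k).astype(int)]
    for _ in range(iters):
        codes = ((x[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (codes == j).any():
                centroids[j] = x[codes == j].mean(axis=0)
    return codes, centroids

rng = np.random.default_rng(1)
# Two well-separated clusters should map to two distinct codes.
x = np.concatenate([rng.normal(0, 0.1, size=(50, 2)),
                    rng.normal(5, 0.1, size=(50, 2))])
codes, _ = kmeans_quantize(x, k=2)
print(len(set(codes[:50])), len(set(codes[50:])))  # 1 1
```

At search time only the small integer codes need to be stored and compared, which is what makes such discretization the workhorse of large-scale indexing.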
no code implementations • CVPR 2018 • Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid
We use the human joints as these keypoints and term our Pose moTion representation PoTion.
Ranked #1 on Skeleton Based Action Recognition on J-HMDB
no code implementations • NeurIPS 2018 • Daan Wynen, Cordelia Schmid, Julien Mairal
In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.
1 code implementation • CVPR 2018 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari
Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).
no code implementations • 25 Apr 2018 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari
In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.
2 code implementations • ECCV 2018 • Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid
Human shape estimation is an important task for video editing, animation and fashion industry.
Ranked #3 on 3D Human Pose Estimation on Surreal (using extra training data)
no code implementations • 1 Mar 2018 • Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid
We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.
3D Human Pose Estimation 3D Multi-Person Pose Estimation (absolute) +1
no code implementations • 12 Feb 2018 • Grégory Rogez, Cordelia Schmid
Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.
no code implementations • 1 Dec 2017 • Pavel Tokmakov, Cordelia Schmid, Karteek Alahari
We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.