1 code implementation • ICCV 2023 • Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne
Large-scale noisy web image-text datasets have proven effective for learning robust vision-language models.
no code implementations • ICCV 2023 • Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations.
no code implementations • 21 May 2023 • Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass
Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each.
1 code implementation • 1 May 2023 • Felix Petersen, Tobias Sutter, Christian Borgelt, Dongsung Huh, Hilde Kuehne, Yuekai Sun, Oliver Deussen
We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons.
1 code implementation • CVPR 2023 • Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah
The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction.
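Hungarian matching of this kind pairs each predicted graph node with a ground-truth node so that the total assignment cost is minimal. A minimal sketch of the idea (the cost construction and function names here are illustrative, not the paper's exact formulation; real implementations typically use `scipy.optimize.linear_sum_assignment` rather than brute force):

```python
from itertools import permutations
from math import exp, log

def hungarian_match_loss(pred_scores, target_ids):
    """Optimal 1-to-1 matching between predicted nodes and ground-truth
    nodes, returning the mean matched negative log-likelihood.
    Illustrative stand-in for a situation-graph matching loss."""
    # Per-prediction log-softmax over classes.
    log_probs = []
    for row in pred_scores:
        z = log(sum(exp(s) for s in row))
        log_probs.append([s - z for s in row])
    # cost[i][j] = -log p(prediction i emits the class of target j).
    cost = [[-log_probs[i][t] for t in target_ids]
            for i in range(len(pred_scores))]
    # Brute-force Hungarian matching (fine for small graphs).
    best = min(
        sum(cost[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(target_ids)))
    )
    return best / len(target_ids)
```

The matched cost can then be combined with a VQA cross-entropy term, as the abstract describes.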
1 code implementation • 11 Apr 2023 • Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, Michael Moeller
Though research has shown the complementarity of camera- and inertial-based data, datasets which offer both modalities remain scarce.
Tasks: Egocentric Activity Recognition, Human Activity Recognition (+2 more)
no code implementations • 29 Mar 2023 • Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.
1 code implementation • 23 Mar 2023 • Anna Kukleva, Moritz Böhle, Bernt Schiele, Hilde Kuehne, Christian Rupprecht
Such a schedule results in a constant 'task switching' between an emphasis on instance discrimination and group-wise discrimination, and thereby ensures that the model learns both group-wise features and instance-specific details.
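The alternating emphasis can be pictured as a periodic weighting between the two objectives. A sketch with a cosine schedule, where the period and weight shape are hypothetical choices for illustration, not the paper's exact schedule:

```python
import math

def loss_weights(epoch, period=10):
    """Alternate emphasis between instance discrimination and group-wise
    (clustering) discrimination via a cosine schedule. Illustrative:
    period and weighting are assumptions, not the paper's values."""
    w_group = 0.5 * (1 + math.cos(2 * math.pi * epoch / period))  # in [0, 1]
    return {"instance": 1 - w_group, "group": w_group}
```

At epoch 0 the group-wise term dominates; half a period later the instance term does, so the model repeatedly revisits both objectives.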
1 code implementation • ICCV 2023 • Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof
We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary.
Ranked #2 on Zero-Shot Action Recognition on Kinetics
no code implementations • 9 Mar 2023 • Wei Lin, Anna Kukleva, Horst Possegger, Hilde Kuehne, Horst Bischof
Temporal action segmentation in untrimmed videos has gained increased attention recently.
1 code implementation • ICCV 2023 • Nina Shvetsova, Felix Petersen, Anna Kukleva, Bernt Schiele, Hilde Kuehne
Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images.
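The standard objective behind this idea is the InfoNCE loss: matched views form positive pairs on the diagonal of a similarity matrix, and all other pairings act as negatives. A minimal sketch of that baseline objective (not the paper's proposed variant):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss over two batches of embeddings: row i of z_a and
    row i of z_b are a positive pair; all other rows are negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # positives on the diagonal
```

Minimizing this pulls positive pairs together and pushes negatives apart, which is exactly the distance behavior the abstract describes.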
1 code implementation • CVPR 2023 • Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof
Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches, both for a single distribution shift and for the challenging case of random distribution shifts.
1 code implementation • 15 Oct 2022 • Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen
Recently, research has increasingly focused on developing efficient neural network architectures.
1 code implementation • 7 Oct 2022 • Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English.
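Matching the teacher's cross-modal predictions is a knowledge-distillation objective: the student, fed non-English text, is trained to reproduce the distribution the English-fed teacher produces over candidate videos. A hedged sketch of that distillation loss (temperature and shapes are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over retrieval similarity scores.
    Illustrative cross-lingual distillation objective."""
    def log_softmax(x):
        x = x / temperature
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    t = np.exp(log_softmax(teacher_logits))  # teacher distribution
    return np.mean(np.sum(t * (np.log(t) - log_softmax(student_logits)),
                          axis=-1))
```

The loss is zero when the student's distribution equals the teacher's and grows as the two diverge.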
1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.
Ranked #1 on Audio Tagging on AudioSet (using extra training data)
1 code implementation • 12 Sep 2022 • Felix Vogel, Nina Shvetsova, Leonid Karlinsky, Hilde Kuehne
We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical zero-shot notion emerges from large-scale webly-supervised training.
no code implementations • 3 Aug 2022 • Tim Frommknecht, Pedro Alves Zipf, Quanfu Fan, Nina Shvetsova, Hilde Kuehne
While accuracy on ImageNet and similar datasets has increased over time, performance on tasks beyond the classification of natural images is yet to be explored.
1 code implementation • 5 Jul 2022 • Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels da Vitoria Lobo, Mubarak Shah
Transformers for visual-language representation learning have attracted considerable interest and shown strong performance on visual question answering (VQA) and grounding.
1 code implementation • 15 Jun 2022 • Felix Petersen, Hilde Kuehne, Christian Borgelt, Oliver Deussen
In this work, we relax this assumption and optimize the model for multiple k simultaneously instead of using a single k. Leveraging recent advances in differentiable sorting and ranking, we propose a differentiable top-k cross-entropy classification loss.
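One way to picture a differentiable top-k objective: relax the true class's rank with pairwise sigmoids and penalize soft ranks above k. The sketch below is an illustrative surrogate in the spirit of differentiable ranking, not the paper's exact loss; the temperature `tau` and penalty shape are assumptions:

```python
import numpy as np

def soft_topk_loss(scores, y, k=5, tau=1.0):
    """Smooth surrogate for 'true class ranked within top k'.
    soft_rank relaxes the hard rank via pairwise sigmoids; the
    softplus term penalizes ranks above k. Illustrative only."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    s_y = scores[y]
    others = np.delete(scores, y)
    soft_rank = 1.0 + sig((others - s_y) / tau).sum()  # ~ rank of class y
    return np.logaddexp(0.0, (soft_rank - k) / tau)    # softplus penalty
```

The loss is near zero whenever the true class sits comfortably inside the top k and grows smoothly as it drops below, so it can be averaged over several values of k.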
Ranked #54 on Image Classification on ImageNet
1 code implementation • 30 Mar 2022 • Wei Lin, Anna Kukleva, Kunyang Sun, Horst Possegger, Hilde Kuehne, Horst Bischof
To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation that, on the one hand, leverages the joint spatial information in images and videos and, on the other hand, trains an independent spatio-temporal model to bridge the modality gap.
1 code implementation • ICLR 2022 • Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen
We introduce a family of sigmoid functions and prove that they produce differentiable sorting networks that are monotonic.
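The building block of such a sorting network is a relaxed compare-and-swap: instead of a hard min/max, the two outputs are sigmoid-weighted blends of the inputs. A minimal sketch of this idea using the logistic sigmoid inside an odd-even transposition network (the steepness value is an illustrative assumption; the paper's contribution is characterizing which sigmoid families keep the relaxation monotonic):

```python
import numpy as np

def soft_swap(a, b, steepness=10.0):
    """Relaxed compare-and-swap: blends min/max with a logistic sigmoid.
    As steepness grows, this approaches the hard (min, max)."""
    s = 1.0 / (1.0 + np.exp(-steepness * (b - a)))  # ~1 if already ordered
    lo = s * a + (1 - s) * b
    hi = s * b + (1 - s) * a
    return lo, hi

def soft_sort(x, steepness=10.0):
    """Odd-even transposition sorting network built from soft_swap."""
    x = list(x)
    n = len(x)
    for rnd in range(n):
        for i in range(rnd % 2, n - 1, 2):
            x[i], x[i + 1] = soft_swap(x[i], x[i + 1], steepness)
    return x
```

Because every operation is differentiable, gradients flow through the whole network, which is what makes ordering supervision possible.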
1 code implementation • CVPR 2022 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne
In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrates them into a fused representation in a joint multi-modal embedding space.
1 code implementation • 8 Dec 2021 • Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification.
1 code implementation • CVPR 2022 • Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky
The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system.
no code implementations • 1 Dec 2021 • Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data.
1 code implementation • 8 Nov 2021 • Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
In this paper, we explore self-supervised audio-visual models that learn from instructional videos.
no code implementations • 20 Oct 2021 • Felix Petersen, Bastian Goldluecke, Oliver Deussen, Hilde Kuehne
Recently introduced differentiable renderers can be leveraged to learn the 3D geometry of objects from 2D images, but those approaches require additional supervision to enable the renderer to produce an output that can be compared to the input image.
1 code implementation • NeurIPS 2021 • Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen
The integration of algorithmic components into neural architectures has gained increased attention recently, as it allows training neural networks with new forms of supervision such as ordering constraints or silhouettes instead of using ground truth labels.
no code implementations • 29 Sep 2021 • Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen
We propose a sampling-free approximate formulation of Gaussian variational auto-encoders.
no code implementations • 29 Sep 2021 • Felix Petersen, Christian Borgelt, Mikhail Yurochkin, Hilde Kuehne, Oliver Deussen
We propose a new approach to propagating probability distributions through neural networks.
1 code implementation • ICCV 2021 • Anna Kukleva, Hilde Kuehne, Bernt Schiele
Both generalized and incremental few-shot learning have to deal with three major challenges: learning novel classes from only a few samples per class, preventing catastrophic forgetting of base classes, and classifier calibration across novel and base classes.
1 code implementation • CVPR 2021 • Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah
In this paper, we focus on a more relaxed setting: the grounding of relevant visual entities in a weakly supervised manner by training on the VQA task alone.
1 code implementation • 9 May 2021 • Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen
Sorting and ranking supervision is a method for training neural networks end-to-end based on ordering constraints.
no code implementations • 30 Apr 2021 • Sirnam Swetha, Hilde Kuehne, Yogesh S Rawat, Mubarak Shah
This paper proposes a novel approach for unsupervised sub-action learning in complex activities.
Ranked #27 on Action Segmentation on Breakfast
1 code implementation • ICCV 2021 • Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Multimodal self-supervised learning is attracting increasing attention, as it allows not only training large networks without human supervision but also searching and retrieving data across various modalities.
1 code implementation • ICCV 2021 • Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky
In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
Ranked #1 on Phrase Grounding on Visual Genome
1 code implementation • 16 Jun 2020 • Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval.
Tasks: Automatic Speech Recognition (ASR) (+5 more)
no code implementations • 29 Jan 2020 • Rosaura G. VidalMata, Walter J. Scheirer, Anna Kukleva, David Cox, Hilde Kuehne
Understanding the structure of complex activities in untrimmed videos is a challenging task in the area of action recognition.
1 code implementation • NeurIPS 2019 • Quanfu Fan, Chun-Fu Chen, Hilde Kuehne, Marco Pistoia, David Cox
Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets.
Ranked #78 on Action Recognition on Something-Something V2 (using extra training data)
no code implementations • 3 Jun 2019 • Hilde Kuehne, Alexander Richard, Juergen Gall
Action recognition has become a rapidly developing research field within the last decade.
1 code implementation • 3 Jun 2019 • Hilde Kuehne, Ahsan Iqbal, Alexander Richard, Juergen Gall
Action recognition has so far mainly focused on classifying hand-selected, pre-clipped actions, reaching impressive results in this field.
2 code implementations • CVPR 2019 • Anna Kukleva, Hilde Kuehne, Fadime Sener, Juergen Gall
The task of temporally detecting and segmenting actions in untrimmed videos has seen increased attention recently.
no code implementations • CVPR 2018 • Alexander Richard, Hilde Kuehne, Ahsan Iqbal, Juergen Gall
Video learning is an important task in computer vision and has experienced increasing interest in recent years.
no code implementations • 27 Jun 2017 • Ahsan Iqbal, Alexander Richard, Hilde Kuehne, Juergen Gall
In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition.
1 code implementation • CVPR 2018 • Alexander Richard, Hilde Kuehne, Juergen Gall
Action detection and temporal segmentation of actions in videos are topics of increasing interest.
1 code implementation • CVPR 2017 • Alexander Richard, Hilde Kuehne, Juergen Gall
We present an approach for weakly supervised learning of human actions.
no code implementations • 7 Oct 2016 • Hilde Kuehne, Alexander Richard, Juergen Gall
Our system is based on the idea that, given a sequence of input data and a transcript, i.e., a list of the actions in the order they occur in the video, it is possible to infer the actions within the video stream and thus learn the related action models without the need for any frame-based annotation.
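Transcript-constrained inference of this kind can be pictured as a dynamic-programming alignment: each frame is assigned to the current or next transcript action so that the total frame score is maximal. A hedged sketch of the idea (the scoring and function names are illustrative, not the paper's HMM formulation; it assumes the transcript fits within the sequence):

```python
import math

def align_transcript(frame_scores, transcript):
    """Align an ordered action transcript to frame-wise class scores.
    frame_scores[t][c] scores class c at frame t; the result assigns
    every frame to one transcript action, in transcript order."""
    n_frames, n_acts = len(frame_scores), len(transcript)
    NEG = -math.inf
    dp = [[NEG] * n_acts for _ in range(n_frames)]
    back = [[0] * n_acts for _ in range(n_frames)]
    dp[0][0] = frame_scores[0][transcript[0]]  # must start at action 0
    for t in range(1, n_frames):
        for j in range(n_acts):
            stay = dp[t - 1][j]                      # remain in action j
            move = dp[t - 1][j - 1] if j > 0 else NEG  # advance from j-1
            best = max(stay, move)
            if best == NEG:
                continue
            back[t][j] = j if stay >= move else j - 1
            dp[t][j] = best + frame_scores[t][transcript[j]]
    # Backtrack from the last action at the last frame.
    labels, j = [0] * n_frames, n_acts - 1
    for t in range(n_frames - 1, -1, -1):
        labels[t] = transcript[j]
        j = back[t][j]
    return labels
```

Given per-frame scores, the alignment yields frame-level labels that respect the transcript's order, which is exactly the weak supervision the abstract describes.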
no code implementations • 7 Sep 2015 • Hilde Kuehne, Juergen Gall, Thomas Serre
We describe an end-to-end generative approach for the segmentation and recognition of human activities.
no code implementations • 25 Aug 2015 • Hilde Kuehne, Juergen Gall, Thomas Serre
Through extensive system evaluations, we demonstrate that combining compact video representations based on Fisher Vectors with HMM-based modeling yields very significant gains in accuracy, and that, when properly trained with sufficient training samples, structured temporal models outperform unstructured bag-of-words models by a large margin on the tested performance metric.
no code implementations • CVPR 2014 • Hilde Kuehne, Ali Arslan, Thomas Serre
This paper describes a framework for modeling human activities as temporally structured processes.