2 code implementations • 9 Apr 2024 • David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP.
no code implementations • 29 Feb 2024 • Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke
Furthermore, we show that, using the same prompts, we can successfully employ LLMs to improve retrieval on EpicSounds compared to using the dataset's original audio class labels.
1 code implementation • 14 Nov 2023 • Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata
In particular, our framework exploits a pre-trained large language model (LLM) to generate text, guided by a pre-trained audio-language model so that the resulting captions describe the audio content (a minimal sketch of this guidance follows below).
Ranked #1 on Zero-shot Audio Captioning on Clotho
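The guided-decoding idea above lends itself to a compact illustration. The sketch below simplifies the framework's token-level guidance to reranking whole candidate captions: an audio-language model (such as CLAP) scores LLM-proposed captions against the audio, and the best-matching one is kept. All encoders here are random placeholders, and every name and dimension is an assumption, not the released implementation.

```python
import numpy as np

# Placeholder stand-ins for pre-trained models; in the real framework these
# would be a CLAP-style audio-language model and a GPT-style LLM.
rng = np.random.default_rng(0)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder audio encoder returning an L2-normalised embedding."""
    v = rng.standard_normal(512)  # ignores the input; illustration only
    return v / np.linalg.norm(v)

def embed_text(caption: str) -> np.ndarray:
    """Placeholder text encoder returning an L2-normalised embedding."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def llm_candidates(prompt: str, n: int = 5) -> list[str]:
    """Placeholder for sampling n caption continuations from an LLM."""
    return [f"{prompt} candidate {i}" for i in range(n)]

def zero_shot_caption(waveform: np.ndarray) -> str:
    """Keep the LLM caption that the audio-language model scores highest."""
    audio_emb = embed_audio(waveform)
    candidates = llm_candidates("This is a sound of")
    scores = [float(embed_text(c) @ audio_emb) for c in candidates]
    return candidates[int(np.argmax(scores))]

print(zero_shot_caption(np.zeros(16000)))
```

The released method applies such guidance during generation rather than as a post-hoc rerank; the rerank here is just the simplest way to show the coupling between the two pre-trained models.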
1 code implementation • 8 Nov 2023 • Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
Converting a model's internals to text can yield human-understandable insights about the model.
1 code implementation • 26 Sep 2023 • Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space.
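As a rough picture of the joint embedding space described above, the sketch below aligns projected video features with a composed adverb-action text embedding using a symmetric contrastive loss. The concatenation-based composition, the dimensions, and the temperature are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

# Toy batch of pre-extracted features (all dimensions are assumptions).
video_feats = torch.randn(8, 1024)   # video embeddings
action_emb = torch.randn(8, 300)     # text embedding of each action
adverb_emb = torch.randn(8, 300)     # text embedding of each adverb

proj_video = torch.nn.Linear(1024, 256)
proj_text = torch.nn.Linear(600, 256)  # composes adverb + action by concatenation

v = F.normalize(proj_video(video_feats), dim=-1)
t = F.normalize(proj_text(torch.cat([adverb_emb, action_emb], dim=-1)), dim=-1)

# Symmetric contrastive loss: matching video/adverb-action pairs attract,
# which makes retrieval in either direction a nearest-neighbour lookup.
logits = v @ t.T / 0.07
targets = torch.arange(8)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```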
1 code implementation • 7 Sep 2023 • Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process.
1 code implementation • ICCV 2023 • Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata
We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data.
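A minimal sketch of such post-hoc injection, assuming only that semantic class embeddings (e.g. word embeddings of the class names) are available for the unseen classes; the weight-generator network and all dimensions are hypothetical.

```python
import torch

sem_dim, feat_dim = 300, 2048

# Hypothetical mapping from a class's semantic embedding to classifier weights.
weight_generator = torch.nn.Sequential(
    torch.nn.Linear(sem_dim, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, feat_dim),
)

seen_classifier = torch.nn.Linear(feat_dim, 10)  # pre-trained head, 10 seen classes
unseen_class_emb = torch.randn(3, sem_dim)       # 3 new classes, no image data

# Inject: generate weights for the unseen classes and append them post hoc
# (biases are omitted to keep the sketch short).
new_weights = weight_generator(unseen_class_emb)             # (3, feat_dim)
all_weights = torch.cat([seen_classifier.weight.data, new_weights], dim=0)

features = torch.randn(1, feat_dim)
logits = features @ all_weights.T                            # scores all 13 classes
```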
1 code implementation • 20 Jul 2023 • Leander Girrbach, Anders Christensen, Ole Winther, Zeynep Akata, A. Sophia Koepke
Whilst this captures useful information for linear classifiers, we find that no relevant spatial structure is present in later layers of deep neural networks, making neural persistence roughly equivalent to the variance of weights.
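If the equivalence holds, it suggests a very cheap diagnostic. The snippet below computes per-layer weight variance as a stand-in for neural persistence; this is our reading of the stated finding, not the paper's released code.

```python
import torch

# Per-layer weight variance as a simple proxy for neural persistence,
# following the approximate equivalence reported above.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, float(module.weight.var()))
```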
1 code implementation • ICCV 2023 • Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata
The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.
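To make the setup concrete, here is a sketch of descriptor-based zero-shot classification in that spirit: each class is represented by the mean embedding of several natural-language descriptors, and an image is assigned to the closest class. The encoders are random placeholders for CLIP and the hand-written descriptors stand in for LLM output; every name here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_embed(text: str) -> np.ndarray:
    """Placeholder for CLIP's text encoder (L2-normalised)."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def clip_image_embed(image) -> np.ndarray:
    """Placeholder for CLIP's image encoder (L2-normalised)."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

# Hand-written stand-ins for LLM-generated class descriptors.
descriptors = {
    "hen": ["a hen, which has a small red comb", "a hen, which has feathers"],
    "tiger": ["a tiger, which has black stripes", "a tiger, which has claws"],
}

# Each class prototype is the mean embedding of its descriptor prompts.
class_protos = {
    c: np.mean([clip_text_embed(d) for d in ds], axis=0)
    for c, ds in descriptors.items()
}

image_emb = clip_image_embed(None)
pred = max(class_protos, key=lambda c: float(class_protos[c] @ image_emb))
print(pred)
```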
no code implementations • 6 Apr 2023 • Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata
In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
1 code implementation • 25 Oct 2022 • Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, Andreas Geiger
Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene.
Ranked #6 on CARLA longest6 on CARLA
2 code implementations • 20 Jul 2022 • Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
Ranked #2 on GZSL Video Classification on UCF-GZSL(main)
1 code implementation • 5 Apr 2022 • Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset.
Ranked #1 on Explanation Generation on CLEVR-X
1 code implementation • CVPR 2022 • Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata
Focusing on the relatively underexplored task of audio-visual zero-shot learning, we propose to learn multi-modal representations from audio-visual data using cross-modal attention, and we exploit textual label embeddings to transfer knowledge from seen to unseen classes (see the sketch after this entry).
Ranked #1 on ZSL Video Classification on UCF-GZSL (cls)
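A minimal sketch of that transfer step, assuming pre-extracted audio and visual features: a single linear layer stands in for the cross-modal attention, and unseen classes become reachable because scoring only needs their textual label embeddings. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Toy features; none of these dimensions come from the paper.
audio = torch.randn(4, 512)
video = torch.randn(4, 512)
class_text_emb = F.normalize(torch.randn(7, 300), dim=-1)  # seen + unseen labels

fuse = torch.nn.Linear(1024, 300)  # stands in for the cross-modal attention block

av = F.normalize(fuse(torch.cat([audio, video], dim=-1)), dim=-1)

# Zero-shot prediction: nearest textual label embedding, so no training
# videos of the unseen classes are ever required.
pred = (av @ class_text_emb.T).argmax(dim=-1)
print(pred.tolist())
```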
1 code implementation • 17 Dec 2021 • A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Ranked #1 on Audio to Text Retrieval on SoundDescs
1 code implementation • 5 May 2021 • Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie
We consider the task of retrieving audio using free-form natural language queries (see the retrieval sketch after this entry).
Ranked #1 on Audio/Video to Text Retrieval on AudioCaps
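With trained encoders, this retrieval task reduces to nearest-neighbour search in a shared embedding space. Below is a minimal sketch under that assumption, with random tensors standing in for the encoders' outputs.

```python
import torch
import torch.nn.functional as F

# Pre-computed, L2-normalised embeddings (placeholders for trained encoders).
audio_gallery = F.normalize(torch.randn(1000, 512), dim=-1)  # indexed clips
query_emb = F.normalize(torch.randn(1, 512), dim=-1)         # encoded text query

# Rank the gallery by cosine similarity and return the best-matching clips.
scores = (query_emb @ audio_gallery.T).squeeze(0)
print(scores.topk(5).indices.tolist())
```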
no code implementations • 4 May 2021 • Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Recent advances in XAI provide explanations for models trained on still images.
1 code implementation • CVPR 2021 • Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
Having access to multi-modal cues (e.g. vision and audio) allows some cognitive tasks to be performed faster than learning from a single modality.
no code implementations • 28 Oct 2019 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information.
no code implementations • ECCV 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio).
2 code implementations • 21 Aug 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time.
Ranked #2 on Unsupervised Facial Landmark Detection on 300W