2 code implementations • 9 Apr 2024 • David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP.
no code implementations • 29 Feb 2024 • Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke
Furthermore, we show that, using the same prompts, we can successfully employ LLMs to improve retrieval on EpicSounds compared to using the dataset's original audio class labels.
1 code implementation • 14 Nov 2023 • Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata
In particular, our framework exploits a pre-trained large language model (LLM) to generate text, guided by a pre-trained audio-language model so that the resulting captions describe the audio content (a minimal sketch of this guidance follows below).
Ranked #1 on Zero-shot Audio Captioning on Clotho
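The guided-decoding idea above lends itself to a compact illustration. The sketch below simplifies the framework's token-level guidance to reranking whole candidate captions: an audio-language model (such as CLAP) scores LLM-proposed captions against the audio, and the best-matching one is kept. All encoders here are random placeholders, and every name and dimension is an assumption, not the released implementation.

```python
import numpy as np

# Placeholder stand-ins for pre-trained models; in the real framework these
# would be a CLAP-style audio-language model and a GPT-style LLM.
rng = np.random.default_rng(0)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder audio encoder returning an L2-normalised embedding."""
    v = rng.standard_normal(512)  # ignores the input; illustration only
    return v / np.linalg.norm(v)

def embed_text(caption: str) -> np.ndarray:
    """Placeholder text encoder returning an L2-normalised embedding."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def llm_candidates(prompt: str, n: int = 5) -> list[str]:
    """Placeholder for sampling n caption continuations from an LLM."""
    return [f"{prompt} candidate {i}" for i in range(n)]

def zero_shot_caption(waveform: np.ndarray) -> str:
    """Keep the LLM caption that the audio-language model scores highest."""
    audio_emb = embed_audio(waveform)
    candidates = llm_candidates("This is a sound of")
    scores = [float(embed_text(c) @ audio_emb) for c in candidates]
    return candidates[int(np.argmax(scores))]

print(zero_shot_caption(np.zeros(16000)))
```

The released method applies such guidance during generation rather than as a post-hoc rerank; the rerank here is just the simplest way to show the coupling between the two pre-trained models.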
1 code implementation • 8 Nov 2023 • Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
Converting a model's internals to text can yield human-understandable insights about the model.
1 code implementation • 26 Sep 2023 • Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space.
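As a rough picture of the joint embedding space described above, the sketch below aligns projected video features with a composed adverb-action text embedding using a symmetric contrastive loss. The concatenation-based composition, the dimensions, and the temperature are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

# Toy batch of pre-extracted features (all dimensions are assumptions).
video_feats = torch.randn(8, 1024)   # video embeddings
action_emb = torch.randn(8, 300)     # text embedding of each action
adverb_emb = torch.randn(8, 300)     # text embedding of each adverb

proj_video = torch.nn.Linear(1024, 256)
proj_text = torch.nn.Linear(600, 256)  # composes adverb + action by concatenation

v = F.normalize(proj_video(video_feats), dim=-1)
t = F.normalize(proj_text(torch.cat([adverb_emb, action_emb], dim=-1)), dim=-1)

# Symmetric contrastive loss: matching video/adverb-action pairs attract,
# which makes retrieval in either direction a nearest-neighbour lookup.
logits = v @ t.T / 0.07
targets = torch.arange(8)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```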
1 code implementation • 7 Sep 2023 • Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process.
1 code implementation • ICCV 2023 • Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata
We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data.
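A minimal sketch of such post-hoc injection, assuming only that semantic class embeddings (e.g. word embeddings of the class names) are available for the unseen classes; the weight-generator network and all dimensions are hypothetical.

```python
import torch

sem_dim, feat_dim = 300, 2048

# Hypothetical mapping from a class's semantic embedding to classifier weights.
weight_generator = torch.nn.Sequential(
    torch.nn.Linear(sem_dim, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, feat_dim),
)

seen_classifier = torch.nn.Linear(feat_dim, 10)  # pre-trained head, 10 seen classes
unseen_class_emb = torch.randn(3, sem_dim)       # 3 new classes, no image data

# Inject: generate weights for the unseen classes and append them post hoc
# (biases are omitted to keep the sketch short).
new_weights = weight_generator(unseen_class_emb)             # (3, feat_dim)
all_weights = torch.cat([seen_classifier.weight.data, new_weights], dim=0)

features = torch.randn(1, feat_dim)
logits = features @ all_weights.T                            # scores all 13 classes
```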
1 code implementation • 20 Jul 2023 • Leander Girrbach, Anders Christensen, Ole Winther, Zeynep Akata, A. Sophia Koepke
Whilst this captures useful information for linear classifiers, we find that no relevant spatial structure is present in later layers of deep neural networks, making neural persistence roughly equivalent to the variance of weights.
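If the equivalence holds, it suggests a very cheap diagnostic. The snippet below computes per-layer weight variance as a stand-in for neural persistence; this is our reading of the stated finding, not the paper's released code.

```python
import torch

# Per-layer weight variance as a simple proxy for neural persistence,
# following the approximate equivalence reported above.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, float(module.weight.var()))
```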
1 code implementation • ICCV 2023 • Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata
The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.
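To make the setup concrete, here is a sketch of descriptor-based zero-shot classification in that spirit: each class is represented by the mean embedding of several natural-language descriptors, and an image is assigned to the closest class. The encoders are random placeholders for CLIP and the hand-written descriptors stand in for LLM output; every name here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_embed(text: str) -> np.ndarray:
    """Placeholder for CLIP's text encoder (L2-normalised)."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def clip_image_embed(image) -> np.ndarray:
    """Placeholder for CLIP's image encoder (L2-normalised)."""
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

# Hand-written stand-ins for LLM-generated class descriptors.
descriptors = {
    "hen": ["a hen, which has a small red comb", "a hen, which has feathers"],
    "tiger": ["a tiger, which has black stripes", "a tiger, which has claws"],
}

# Each class prototype is the mean embedding of its descriptor prompts.
class_protos = {
    c: np.mean([clip_text_embed(d) for d in ds], axis=0)
    for c, ds in descriptors.items()
}

image_emb = clip_image_embed(None)
pred = max(class_protos, key=lambda c: float(class_protos[c] @ image_emb))
print(pred)
```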
no code implementations • 6 Apr 2023 • Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata
In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
1 code implementation • 25 Oct 2022 • Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, Andreas Geiger
Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene.
Ranked #6 on CARLA longest6 on CARLA
2 code implementations • 20 Jul 2022 • Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
Ranked #2 on GZSL Video Classification on UCF-GZSL(main)
1 code implementation • 5 Apr 2022 • Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset.
Ranked #1 on Explanation Generation on CLEVR-X
1 code implementation • CVPR 2022 • Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata
Focusing on the relatively underexplored task of audio-visual zero-shot learning, we propose to learn multi-modal representations from audio-visual data using cross-modal attention, and we exploit textual label embeddings to transfer knowledge from seen to unseen classes (see the sketch after this entry).
Ranked #1 on ZSL Video Classification on UCF-GZSL (cls)
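A minimal sketch of that transfer step, assuming pre-extracted audio and visual features: a single linear layer stands in for the cross-modal attention, and unseen classes become reachable because scoring only needs their textual label embeddings. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Toy features; none of these dimensions come from the paper.
audio = torch.randn(4, 512)
video = torch.randn(4, 512)
class_text_emb = F.normalize(torch.randn(7, 300), dim=-1)  # seen + unseen labels

fuse = torch.nn.Linear(1024, 300)  # stands in for the cross-modal attention block

av = F.normalize(fuse(torch.cat([audio, video], dim=-1)), dim=-1)

# Zero-shot prediction: nearest textual label embedding, so no training
# videos of the unseen classes are ever required.
pred = (av @ class_text_emb.T).argmax(dim=-1)
print(pred.tolist())
```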
1 code implementation • 17 Dec 2021 • A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Ranked #1 on Audio to Text Retrieval on SoundDescs
1 code implementation • 5 May 2021 • Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie
We consider the task of retrieving audio using free-form natural language queries (see the retrieval sketch after this entry).
Ranked #1 on Audio/Video to Text Retrieval on AudioCaps
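With trained encoders, this retrieval task reduces to nearest-neighbour search in a shared embedding space. Below is a minimal sketch under that assumption, with random tensors standing in for the encoders' outputs.

```python
import torch
import torch.nn.functional as F

# Pre-computed, L2-normalised embeddings (placeholders for trained encoders).
audio_gallery = F.normalize(torch.randn(1000, 512), dim=-1)  # indexed clips
query_emb = F.normalize(torch.randn(1, 512), dim=-1)         # encoded text query

# Rank the gallery by cosine similarity and return the best-matching clips.
scores = (query_emb @ audio_gallery.T).squeeze(0)
print(scores.topk(5).indices.tolist())
```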
no code implementations • 4 May 2021 • Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Recent advances in XAI provide explanations for models trained on still images.
1 code implementation • CVPR 2021 • Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata
Having access to multi-modal cues (e.g. vision and audio) allows some cognitive tasks to be performed faster than learning from a single modality.
no code implementations • 28 Oct 2019 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information.
no code implementations • ECCV 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio).
2 code implementations • 21 Aug 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time.
Ranked #2 on Unsupervised Facial Landmark Detection on 300W