1 code implementation • 14 Nov 2023 • Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata
Our framework uses a pre-trained large language model (LLM) to generate text, guided by a pre-trained audio-language model, so that the resulting captions describe the audio content.
Ranked #1 on Zero-shot Audio Captioning on Clotho
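As a rough illustration of the guidance idea in the entry above, the sketch below picks the next token by adding an audio-text matching score to a language-model log-probability. It is a toy under stated assumptions, not the paper's method: the function `guided_next_token`, the `weight` parameter, and all scores are hypothetical, and the real framework may combine the two signals differently.

```python
from typing import Dict


def guided_next_token(lm_logprobs: Dict[str, float],
                      audio_scores: Dict[str, float],
                      weight: float = 1.0) -> str:
    """Return the candidate with the highest combined score.

    combined = LM log-probability + weight * audio-text matching score.
    Candidates missing from audio_scores get a matching score of 0.0.
    """
    return max(
        lm_logprobs,
        key=lambda tok: lm_logprobs[tok] + weight * audio_scores.get(tok, 0.0),
    )


# Hypothetical scores: the LLM alone prefers "cat", while the
# audio-language model finds "dog" the best match for the clip.
lm_logprobs = {"dog": -1.0, "cat": -0.5, "car": -2.0}
audio_scores = {"dog": 0.9, "cat": 0.1, "car": 0.2}

choice = guided_next_token(lm_logprobs, audio_scores, weight=1.0)
unguided = guided_next_token(lm_logprobs, audio_scores, weight=0.0)
```

With `weight=0.0` the audio signal is ignored and the LLM's own preference wins; raising `weight` shifts the choice toward tokens that match the audio.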
1 code implementation • 8 Nov 2023 • Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
Converting a model's internals to text can yield human-understandable insights about the model.
1 code implementation • NeurIPS 2023 • Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata
These findings demonstrate that LLMs are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their hidden strengths and biases.
1 code implementation • 19 Aug 2022 • Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch
To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip.
Ranked #7 on Video Captioning on VATEX
1 code implementation • 5 Apr 2022 • Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset.
Ranked #1 on Explanation Generation on CLEVR-X
2 code implementations • ICCV 2021 • Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, Thomas Lukasiewicz
e-ViL is a benchmark for explainable vision-language tasks that establishes a unified evaluation framework and provides the first comprehensive comparison of existing approaches that generate NLEs for VL tasks.
no code implementations • 22 Jul 2019 • Xiahan Shi, Leonard Salewski, Martin Schiegg, Zeynep Akata, Max Welling
We consider the extended setup of generalized few-shot learning (GFSL), in which the model must classify over the joint label space of both previously seen and novel classes.
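To make the joint label space in the GFSL entry above concrete, here is a minimal sketch that classifies a query against prototypes of both seen and novel classes at once. A nearest-prototype classifier is used only as a simple stand-in and is not claimed to be the paper's model; the class names, feature vectors, and helper functions are all hypothetical.

```python
from typing import Dict, List, Sequence


def mean(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """One prototype (mean feature vector) per class."""
    return {label: mean(examples) for label, examples in support.items()}


def classify(x: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Predict the label whose prototype is closest to x."""
    return min(prototypes, key=lambda label: sq_dist(x, prototypes[label]))


# Seen classes have more examples; the novel class has a single shot.
seen = {"cat": [[0.0, 0.0], [0.2, 0.0]], "dog": [[1.0, 1.0], [1.0, 0.8]]}
novel = {"fox": [[0.0, 1.0]]}

# The joint label space is the union of seen and novel classes.
joint = {**build_prototypes(seen), **build_prototypes(novel)}

pred_novel = classify([0.1, 0.9], joint)
pred_seen = classify([0.1, 0.1], joint)
```

The point of the sketch is that a query is never told in advance whether it belongs to a seen or a novel class; both kinds of label compete in the same prediction.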