Video-adverb retrieval with compositional adverb-action embeddings

1 code implementation • 26 Sep 2023 • Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space.

Text-to-feature diffusion for audio-visual few-shot learning

1 code implementation • 7 Sep 2023 • Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process.

Semantic Image Synthesis with Semantically Coupled VQ-Model

no code implementations • 6 Sep 2022 • Stephan Alaniz, Thomas Hummel, Zeynep Akata

Semantic image synthesis enables control over unconditional image generation by allowing guidance on what is being generated.

Temporal and cross-modal attention for audio-visual zero-shot learning

2 code implementations • 20 Jul 2022 • Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.

Crossmodal Language Grounding in an Embodied Neurocognitive Model

1 code implementation • 24 Jun 2020 • Stefan Heinrich, Yuan Yao, Tobias Hinz, Zhiyuan Liu, Thomas Hummel, Matthias Kerzel, Cornelius Weber, Stefan Wermter

From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration.

