no code implementations • 7 Sep 2023 • Tal Shaharabany, Ariel Shaulov, Lior Wolf
Instead, captioning is performed as an inference-time process that combines three networks, one for each desired quality: (i) a large language model, in our case GPT-2, chosen for convenience; (ii) a model that provides a matching score between an audio file and a text, for which we use the multimodal matching network ImageBind; and (iii) a text classifier, trained on a dataset we collected automatically by prompting GPT-4 to generate both audible and inaudible sentences.
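The three-score inference scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each scorer (`lm_score`, `audio_match_score`, `audibility_score`) is a hypothetical stand-in for GPT-2 likelihood, ImageBind audio-text similarity, and the audibility classifier respectively, and the candidate captions and weights are invented for the example.

```python
# Hedged sketch: pick the candidate caption maximizing a weighted sum of
# three scores. All scorers below are toy stand-ins, NOT the real models.

def lm_score(text):
    # Stand-in for a GPT-2 fluency score (here: shorter is slightly better).
    return -0.1 * len(text.split())

def audio_match_score(audio, text):
    # Stand-in for an ImageBind audio-text matching score.
    return 1.0 if ("sound" in text or "bark" in text) else 0.0

def audibility_score(text):
    # Stand-in for the audibility classifier's probability that the
    # sentence describes something one can hear.
    return 1.0 if any(w in text for w in ("bark", "siren", "music")) else 0.2

def best_caption(audio, candidates, weights=(1.0, 1.0, 1.0)):
    # Combine the three scores and return the highest-scoring candidate.
    def total(text):
        return (weights[0] * lm_score(text)
                + weights[1] * audio_match_score(audio, text)
                + weights[2] * audibility_score(text))
    return max(candidates, key=total)

candidates = [
    "a red door",                      # visual, inaudible
    "a dog barks loudly",              # audible, matches the audio
    "the sound of a distant siren",    # audible, weaker match
]
print(best_caption(None, candidates))  # → a dog barks loudly
```

In the actual system the candidates are not a fixed list but are produced token by token by the language model, with the matching and audibility scores steering generation; the stub above only illustrates how the three signals are fused into a single ranking criterion.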
Ranked #2 on Zero-shot Audio Captioning on AudioCaps