no code implementations • 2 Apr 2024 • Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta
In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models.
no code implementations • 14 Mar 2022 • Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta
In this work, we present a new state-of-the-art result on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2, and TGIF, obtained by a single model.
Ranked #1 on Video Retrieval on TGIF (using extra training data)