Search Results for author: Gengyuan Zhang

Found 6 papers, 3 papers with code

SPOT! Revisiting Video-Language Models for Event Understanding

no code implementations • 21 Nov 2023 • Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, Volker Tresp

This raises a question: with such weak supervision, can the video representations in video-language models learn to distinguish even factual discrepancies in textual descriptions and to understand fine-grained events?

Attribute • Video Understanding
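One illustrative way to probe this question is sketched below, under assumptions not taken from the paper: CLIP as the backbone, mean-pooled frame embeddings as the video representation, and a hand-written factual/manipulated caption pair. The check is simply whether the video embedding scores the factual caption higher.

```python
# Hedged sketch (not the paper's protocol) of probing whether a CLIP-style
# video representation separates a caption from a factually manipulated one.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for sampled video frames; replace with real decoded frames.
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
captions = [
    "a man opens a door and walks into the room",  # factual caption
    "a man closes a door and leaves the room",     # manipulated caption
]

inputs = processor(text=captions, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Mean-pool frame embeddings into one video embedding, then compare captions.
video_emb = out.image_embeds.mean(dim=0, keepdim=True)
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
sims = (video_emb @ text_emb.T).squeeze(0)
print({c: round(s.item(), 4) for c, s in zip(captions, sims)})
```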

Multi-event Video-Text Retrieval

1 code implementation • ICCV 2023 • Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp

In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, which addresses scenarios in which each video contains multiple different events, a niche variant of the conventional Video-Text Retrieval task.

Language Modelling • Retrieval • +2
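As a rough illustration of the task setup, the sketch below scores a text query against several per-event embeddings of one video. The max-over-events aggregation and all names are assumptions for illustration, not the MeVTR method or its released code.

```python
# Hedged sketch of multi-event video-text scoring: a video is represented by
# several event embeddings instead of a single clip-level vector.
import torch
import torch.nn.functional as F

def score(event_embs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """event_embs: (num_events, dim); text_emb: (dim,).
    Returns the best event-to-text cosine similarity for this video."""
    e = F.normalize(event_embs, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return (e @ t).max()  # illustrative aggregation, not MeVTR's

# Toy example: one video with 3 events, two candidate text queries.
torch.manual_seed(0)
video_events = torch.randn(3, 512)
queries = torch.randn(2, 512)
for i, q in enumerate(queries):
    print(f"query {i}: similarity = {score(video_events, q):.4f}")
```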

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

1 code implementation • 24 Jul 2023 • Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr

This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g. …

Image-text matching • Language Modelling • +4
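Below is a minimal sketch of text-side prompt engineering on an image-text matching model, using CLIP via Hugging Face transformers. The prompt templates and label set are illustrative assumptions, not examples drawn from the survey.

```python
# Hedged sketch: varying the text prompt template changes the zero-shot
# label distribution an image-text matching model assigns to an image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "cat", "car"]
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

image = Image.new("RGB", (224, 224))  # stand-in; use a real image in practice
for template in templates:
    prompts = [template.format(label) for label in labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)
    print(template, {l: round(p.item(), 3) for l, p in zip(labels, probs)})
```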

Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning

1 code implementation • 12 Jul 2023 • Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp

This makes us wonder whether, based on visual cues, Vision-Language Models pre-trained on large-scale image-text resources can match or even outperform humans' capability in reasoning about times and location.
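One hedged way to exercise this kind of probing is to pose time and location questions to an off-the-shelf VQA model. The sketch below uses BLIP, which is an assumption for illustration, not the benchmark or the models evaluated in the paper.

```python
# Hedged sketch of probing a VLM for times/location reasoning via VQA.
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.new("RGB", (384, 384))  # stand-in; use a real photo in practice
for question in ["In which decade was this photo taken?",
                 "In which country was this photo taken?"]:
    inputs = processor(image, question, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    print(question, "->", processor.decode(out[0], skip_special_tokens=True))
```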
