Search Results for author: Gengyuan Zhang

Found 12 papers, 4 papers with code

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

no code implementations • 21 Feb 2025 • Gengyuan Zhang, Mingcong Ding, Tong Liu, Yao Zhang, Volker Tresp

Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, where a video is treated as a sequence of visual events, remains underexplored.

Misinformation

FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

no code implementations • 7 Oct 2024 • Haokun Chen, Hang Li, Yao Zhang, Jinhe Bi, Gengyuan Zhang, Yueqi Zhang, Philip Torr, Jindong Gu, Denis Krompass, Volker Tresp

However, directly applying a pretrained LDM to heterogeneous OSFL results in significant distribution shifts in the synthetic data, leading to performance degradation in classification models trained on such data.

Federated Learning

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

1 code implementation • 30 Sep 2024 • Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp

The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage it for complex spatial-temporal reasoning in long-form video analysis.

EgoSchema · Language Modelling · +5

Multimodal Pragmatic Jailbreak on Text-to-image Models

no code implementations • 27 Sep 2024 • Tong Liu, Zhixin Lai, Gengyuan Zhang, Philip Torr, Vera Demberg, Volker Tresp, Jindong Gu

This work introduces a novel type of jailbreak that triggers T2I models to generate images with visual text, where the image and the text, although each safe in isolation, combine to form unsafe content.

Localizing Events in Videos with Multimodal Queries

no code implementations • 14 Jun 2024 • Gengyuan Zhang, Mang Ling Ada Fok, Jialu Ma, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search.

Natural Language Queries · Video Understanding

SPOT! Revisiting Video-Language Models for Event Understanding

no code implementations • 21 Nov 2023 • Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, Volker Tresp

This raises a question: with such weak supervision, can video representations in video-language models distinguish even factual discrepancies in textual descriptions and understand fine-grained events?

Attribute · Video Understanding

Multi-event Video-Text Retrieval

1 code implementation • ICCV 2023 • Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp

In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, which addresses scenarios in which each video contains multiple different events, a niche variant of the conventional Video-Text Retrieval task.

Language Modelling · Text Retrieval · +1

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

2 code implementations • 24 Jul 2023 • Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr

This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion).

Image-text Matching · Language Modelling · +5

Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning

1 code implementation • 12 Jul 2023 • Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp

This makes us wonder whether, based on visual cues, Vision-Language Models pre-trained on large-scale image-text resources can match or even outperform human capability in reasoning about times and locations.
