no code implementations • EMNLP 2021 • Zhen Han, Gengyuan Zhang, Yunpu Ma, Volker Tresp
Various temporal knowledge graph (KG) completion models have been proposed in the recent literature.
no code implementations • 21 Feb 2025 • Gengyuan Zhang, Mingcong Ding, Tong Liu, Yao Zhang, Volker Tresp
Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which a video is treated as a sequence of visual events, remains underexplored.
no code implementations • 26 Dec 2024 • Roberto Amoroso, Gengyuan Zhang, Rajat Koner, Lorenzo Baraldi, Rita Cucchiara, Volker Tresp
This progress is largely driven by the effective alignment between visual data and the language space of MLLMs.
no code implementations • 7 Oct 2024 • Haokun Chen, Hang Li, Yao Zhang, Jinhe Bi, Gengyuan Zhang, Yueqi Zhang, Philip Torr, Jindong Gu, Denis Krompass, Volker Tresp
However, directly applying a pretrained latent diffusion model (LDM) to heterogeneous one-shot federated learning (OSFL) results in significant distribution shifts in the synthetic data, degrading the performance of classification models trained on it.
1 code implementation • 30 Sep 2024 • Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp
The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage it for complex spatial-temporal reasoning in long-form video analysis.
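For context, a common way to cut this redundancy is query-conditioned frame selection: keep only the frames most similar to the question before handing them to an LLM. The following is a generic sketch of that idea, not the method of the paper above; the embedding dimensions, random inputs, and the `select_relevant_frames` helper are all illustrative assumptions.

```python
import numpy as np

def select_relevant_frames(frame_embs: np.ndarray, query_emb: np.ndarray, k: int = 8):
    """Return indices of the k frames most similar to the query embedding."""
    # Cosine similarity between each frame embedding and the query.
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    sims = frames @ query
    # Keep temporal order so a downstream LLM sees frames in sequence.
    return np.sort(np.argsort(sims)[-k:])

# Toy demo: 500 "frames" and one "query" as random 512-d embeddings.
rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(500, 512))
query_emb = rng.normal(size=512)
print(select_relevant_frames(frame_embs, query_emb, k=8))
```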
no code implementations • 27 Sep 2024 • Tong Liu, Zhixin Lai, Gengyuan Zhang, Philip Torr, Vera Demberg, Volker Tresp, Jindong Gu
This work introduces a novel type of jailbreak that triggers T2I models to generate images containing visual text, where the image and the text, each considered safe in isolation, combine to form unsafe content.
no code implementations • 14 Jun 2024 • Gengyuan Zhang, Mang Ling Ada Fok, Jialu Ma, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu
Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search.
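Query-based event localization is often reduced to scoring each frame against the query and grouping contiguous above-threshold frames into temporal segments. Below is a minimal sketch under that assumption; the scores, threshold, and `localize_events` helper are made up for illustration and are not the paper's model.

```python
import numpy as np

def localize_events(scores: np.ndarray, threshold: float = 0.5):
    """Group contiguous above-threshold frames into (start, end) segments."""
    above = scores >= threshold
    # Rising/falling edges of the boolean mask mark segment boundaries.
    edges = np.diff(above.astype(int), prepend=0, append=0)
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]  # exclusive end index
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Toy per-frame relevance scores for a 12-frame clip.
scores = np.array([.1, .2, .8, .9, .7, .1, .1, .6, .9, .2, .1, .1])
print(localize_events(scores, threshold=0.5))  # [(2, 5), (7, 9)]
```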
no code implementations • 21 Nov 2023 • Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, Volker Tresp
This raises a question: with such weak supervision, can the video representations in video-language models learn to distinguish even factual discrepancies in textual descriptions and to understand fine-grained events?
1 code implementation • ICCV 2023 • Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, a niche variant of conventional Video-Text Retrieval that addresses scenarios in which each video contains multiple distinct events.
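One way to see what changes in the multi-event setting is the evaluation: each video owns several positive captions rather than one. Here is a minimal sketch of text-to-video Recall@K under that assumption; the `recall_at_k` helper and toy numbers are illustrative and are not the MeVTR benchmark code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, positives: list[set], k: int = 5) -> float:
    """Text-to-video Recall@K where each video owns several event captions.

    sim: (num_texts, num_videos) similarity matrix.
    positives[t]: set of video indices that are correct for text t.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]  # top-k videos per text
    hits = [bool(positives[t] & set(topk[t])) for t in range(sim.shape[0])]
    return float(np.mean(hits))

# Toy setup: 4 event captions over 2 videos; captions 0-2 describe video 0.
rng = np.random.default_rng(1)
sim = rng.normal(size=(4, 2))
positives = [{0}, {0}, {0}, {1}]
print(recall_at_k(sim, positives, k=1))
```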
2 code implementations • 24 Jul 2023 • Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr
This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion).
1 code implementation • 12 Jul 2023 • Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp
This makes us wonder whether, based on visual cues, vision-language models pre-trained on large-scale image-text resources can match or even surpass human capability in reasoning about times and locations.
no code implementations • 19 Nov 2022 • Yao Zhang, Haokun Chen, Ahmed Frikha, Yezi Yang, Denis Krompass, Gengyuan Zhang, Jindong Gu, Volker Tresp
Visual Question Answering (VQA) is a multidisciplinary research task.