TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

1 code implementation 30 Oct 2024 Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan

Our study of existing benchmarks shows that the visual temporal reasoning capability of Multimodal Foundation Models (MFMs) is likely overestimated, because many questions can be solved using a single frame, a few frames, or out-of-order frames.

EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

1 code implementation 12 Sep 2024 Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu

Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image, requiring the model to effectively incorporate both sources of guidance.

The CLIP Model is Secretly an Image-to-Prompt Converter

no code implementations NeurIPS 2023 Yuxuan Ding, Chunna Tian, Haoxuan Ding, Lingqiao Liu

The Stable Diffusion model is a prominent text-to-image generation model that takes a text prompt as input, which is encoded using the Contrastive Language-Image Pre-Training (CLIP) model.

Position-Aware Relation Learning for RGB-Thermal Salient Object Detection

no code implementations 21 Sep 2022 Heng Zhou, Chunna Tian, Zhenxi Zhang, Chengyang Li, Yuxuan Ding, Yongqiang Xie, Zhongbo Li

FRDF utilizes the directional information between object pixels to effectively enhance the intra-class compactness of salient regions.

Seeking Subjectivity in Visual Emotion Distribution Learning

no code implementations 25 Jul 2022 Jingyuan Yang, Jie Li, Leida Li, Xiumei Wang, Yuxuan Ding, Xinbo Gao

In psychology, the Object-Appraisal-Emotion model has demonstrated that each individual's emotion is affected by his/her subjective appraisal, which is further formed by the affective memory.

Don't Stop Learning: Towards Continual Learning for the CLIP Model

no code implementations 19 Jul 2022 Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, Haoxuan Ding

The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model that has attracted increasing attention in the computer vision community.

Stimuli-Aware Visual Emotion Analysis

no code implementations 4 Sep 2021 Jingyuan Yang, Jie Li, Xiumei Wang, Yuxuan Ding, Xinbo Gao

Then, we design three specific networks, i.e., Global-Net, Semantic-Net and Expression-Net, to extract distinct emotional features from different stimuli simultaneously.

