Search Results for author: Qifan Yu

Found 14 papers, 5 papers with code

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

no code implementations • 10 Jun 2025 • Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang

To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities.
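As a concrete illustration of what subtask-level evaluation combined with a graph-based progress metric can look like, here is a toy Python sketch; the task, capability labels, and scoring rules are invented for illustration and are not taken from OmniEval.

    # Toy illustration of subtask-level evaluation plus a graph-based
    # progress metric, in the spirit of the abstract; every structure
    # and score here is invented, not taken from OmniEval.
    from collections import defaultdict

    # One agent task decomposed into subtasks, each tagged with a capability.
    subtasks = [
        {"id": "open_app",  "capability": "grounding", "passed": True},
        {"id": "find_item", "capability": "planning",  "passed": True},
        {"id": "checkout",  "capability": "planning",  "passed": False},
    ]
    order = ["open_app", "find_item", "checkout"]  # dependency chain

    # Subtask-level evaluation: per-capability pass rates.
    by_cap = defaultdict(list)
    for s in subtasks:
        by_cap[s["capability"]].append(s["passed"])
    cap_scores = {c: sum(v) / len(v) for c, v in by_cap.items()}

    # Graph-based metric: how far along the dependency chain the agent
    # got before its first failure.
    passed = {s["id"]: s["passed"] for s in subtasks}
    progress = 0
    for sid in order:
        if not passed[sid]:
            break
        progress += 1

    print(cap_scores)                            # {'grounding': 1.0, 'planning': 0.5}
    print(f"progress: {progress}/{len(order)}")  # progress: 2/3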

Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

1 code implementation • 24 Mar 2025 • Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li

The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution.

Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning

no code implementations • 12 Feb 2025 • Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He

Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers.
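The excerpt compresses the training trick into one sentence; the PyTorch sketch below shows one plausible reading of it, in which a single weight-shared block is looped several times and each iteration's output is supervised against the corresponding CoT step. All names and hyperparameters are illustrative, not the authors' implementation.

    # Plausible sketch of loop-aligned intermediate supervision; module
    # and loss names are illustrative, not the authors' implementation.
    import torch
    import torch.nn as nn

    class LoopedTransformer(nn.Module):
        def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_loops=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            # One weight-shared block, reused at every loop iteration.
            self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.head = nn.Linear(d_model, vocab_size)
            self.n_loops = n_loops

        def forward(self, tokens):
            h = self.embed(tokens)
            per_loop_logits = []
            for _ in range(self.n_loops):
                h = self.block(h)                  # same parameters each pass
                per_loop_logits.append(self.head(h))
            return per_loop_logits

    def loop_aligned_loss(per_loop_logits, cot_step_targets):
        """Supervise loop iteration t with the t-th chain-of-thought step."""
        ce = nn.CrossEntropyLoss()
        losses = [
            ce(logits.flatten(0, 1), target.flatten())
            for logits, target in zip(per_loop_logits, cot_step_targets)
        ]
        return torch.stack(losses).mean()

    model = LoopedTransformer()
    tokens = torch.randint(0, 1000, (2, 16))                       # (batch, seq)
    targets = [torch.randint(0, 1000, (2, 16)) for _ in range(4)]  # one per loop
    loss = loop_aligned_loss(model(tokens), targets)
    loss.backward()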

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

no code implementations • CVPR 2025 • Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, Tat-Seng Chua

Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events.

Question Answering · Video Understanding

Unified Generative and Discriminative Training for Multi-modal Large Language Models

no code implementations • 1 Nov 2024 • Wei Chow, Juncheng Li, Qifan Yu, Kaihang Pan, Hao Fei, Zhiqi Ge, Shuai Yang, Siliang Tang, Hanwang Zhang, Qianru Sun

Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation.

Dynamic Time Warping · Image-text Classification +6
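For context on the discriminative baseline the excerpt refers to, the snippet below shows the standard zero-shot image-text classification pattern with an off-the-shelf CLIP checkpoint via Hugging Face transformers; it illustrates that baseline, not the unified training proposed in the paper.

    # Standard zero-shot classification with CLIP, the discriminative
    # baseline the excerpt refers to; not code from the paper itself.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # any local image
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # image-text similarity scores
    probs = logits.softmax(dim=-1)             # one probability per label
    print(dict(zip(labels, probs[0].tolist())))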

Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration

1 code implementation • 30 Sep 2024 • Kaihang Pan, Zhaoyu Fan, Juncheng Li, Qifan Yu, Hao Fei, Siliang Tang, Richang Hong, Hanwang Zhang, Qianru Sun

In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting.

knowledge editing

A high-accuracy multi-model mixing retrosynthetic method

no code implementations • 6 Sep 2024 • Shang Xiang, Lin Yao, Zhen Wang, Qifan Yu, Wentan Liu, Wentao Guo, Guolin Ke

The field of computer-aided synthesis planning (CASP) has seen rapid advancements in recent years, achieving significant progress across various algorithmic benchmarks.

Diversity · model +1

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

1 code implementation • CVPR 2024 • Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, Yueting Zhuang

Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks.

Attribute · counterfactual +3

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

no code implementations • 15 Aug 2023 • Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang

Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.

Image Inpainting
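The autoregressive frame-by-frame pattern the excerpt describes can be sketched with off-the-shelf diffusers pipelines, as below; pose guidance and the paper's consistency mechanisms are omitted, and the model ID and strength values are arbitrary choices rather than the authors' settings.

    # Loose sketch of autoregressive frame synthesis with a pretrained
    # text-to-image diffusion model (via diffusers). Pose conditioning
    # and the paper's inpainting tricks are omitted; settings arbitrary.
    import torch
    from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

    t2i = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    # Reuse the same weights for image-to-image refinement of later frames.
    i2i = StableDiffusionImg2ImgPipeline(**t2i.components).to("cuda")

    prompt = "a dancer in a red dress, studio lighting"
    frames = [t2i(prompt).images[0]]  # first frame from text alone

    for _ in range(15):
        # Autoregressive step: condition each frame on the previous one,
        # with low strength so identity and appearance stay consistent.
        nxt = i2i(prompt, image=frames[-1], strength=0.4).images[0]
        frames.append(nxt)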

Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration

1 code implementation • 22 May 2023 • Qifan Yu, Juncheng Li, Wentao Ye, Siliang Tang, Yueting Zhuang

Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images.

Data Augmentation · Prompt Engineering +2

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

1 code implementation • ICCV 2023 • Qifan Yu, Juncheng Li, Yu Wu, Siliang Tang, Wei Ji, Yueting Zhuang

Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner.

Graph Generation · Language Modeling +2
