Search Results for author: Haodong Duan

Found 50 papers, 43 papers with code

GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

no code implementations • 30 Apr 2025 • Siqi Li, Yufan Shen, Xiangnan Chen, Jiayi Chen, Hengwei Ju, Haodong Duan, Song Mao, Hongbin Zhou, Bo Zhang, Pinlong Cai, Licheng Wen, Botian Shi, Yong liu, Xinyu Cai, Yu Qiao

We evaluate the GDI-Bench on various open-source and closed-source models, conducting decoupled analyses in the visual and reasoning domains.

MM-IFEngine: Towards Multimodal Instruction Following

1 code implementation • 10 Apr 2025 • Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang

The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they carry those instructions out correctly.

Instruction Following

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

no code implementations • 25 Mar 2025 • Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly.

Autonomous Navigation, Question Answering +1

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

1 code implementation • 18 Mar 2025 • Xinyu Fang, Zhijian Chen, Kai Lan, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, ZiCheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin

To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks.

Information Density Principle for MLLM Benchmarks

1 code implementation • 13 Mar 2025 • Chunyi Li, Xiaozhe Li, ZiCheng Zhang, Yuan Tian, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Jia Wang, Haodong Duan, Kai Chen, Guangtao Zhai

Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs.

Diversity

Image Quality Assessment: From Human to Machine Preference

1 code implementation • 13 Mar 2025 • Chunyi Li, Yuan Tian, Xiaoyue Ling, ZiCheng Zhang, Haodong Duan, HaoNing Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, Weisi Lin, Guangtao Zhai

Considering the huge gap between human and machine visual systems, this paper is the first to propose Image Quality Assessment for Machine Vision as a research topic.

Image Quality Assessment

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

no code implementations • 13 Mar 2025 • Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang

We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies.

Multimodal Reasoning
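
The Best-of-N strategy named in the VisualPRM entry above is simple to sketch: sample N candidate responses, score each response's reasoning steps with the process reward model, and keep the top-scoring candidate. A minimal illustration in Python, where `policy_model.sample` and `reward_model.score_steps` are hypothetical stand-ins rather than the paper's released interfaces:

```python
from statistics import fmean

def best_of_n(question, image, policy_model, reward_model, n=8):
    """Select the best of n sampled responses using a process reward model (PRM).

    `policy_model.sample` and `reward_model.score_steps` are hypothetical
    stand-ins for an MLLM sampler and a PRM that scores each reasoning step.
    """
    best_response, best_score = None, float("-inf")
    for _ in range(n):
        response = policy_model.sample(question, image)   # one candidate chain
        step_scores = reward_model.score_steps(question, image, response.steps)
        score = fmean(step_scores)                        # aggregate per-step scores
        if score > best_score:
            best_response, best_score = response, score
    return best_response
```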

Visual-RFT: Visual Reinforcement Fine-Tuning

1 code implementation • 3 Mar 2025 • Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce.

Few-Shot Object Detection, Fine-Grained Image Classification +4
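
RFT of the kind described in the Visual-RFT entry above depends on a reward that can be verified directly from the model's answer. As a hedged illustration (not the Visual-RFT release), a detection-style reward could score a predicted bounding box by its overlap with the ground truth:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(predicted_box, gt_box, threshold=0.5):
    """Verifiable reward for a localization answer: full credit above an IoU
    threshold, partial credit below it. Illustrative reward shape only."""
    overlap = iou(predicted_box, gt_box)
    return 1.0 if overlap >= threshold else overlap
```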

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

1 code implementation • 25 Feb 2025 • Xiangyu Zhao, Shengyuan Ding, ZiCheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment.

Visual Question Answering (VQA)

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

1 code implementation • 7 Feb 2025 • Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH.

Hallucination, Position +2

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

1 code implementation • 21 Jan 2025 • Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen

The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs).

Synthetic Data Generation, World Knowledge

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

1 code implementation • 9 Jan 2025 • Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang

OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time visual perception: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately.

Benchmarking, Video Understanding

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

4 code implementations • 22 Nov 2024 • Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He

As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.

Image Classification, MME +1

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

1 code implementation • 23 Oct 2024 • Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Existing visual alignment methods, primarily designed for single-image scenarios, struggle to effectively handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs.

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

1 code implementation • 21 Oct 2024 • Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen

In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM.

All, model

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

1 code implementation • 16 Oct 2024 • Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, Kai Chen

To address these shortcomings, we introduce ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs.

NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?

2 code implementations • 16 Jul 2024 • Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, Kai Chen

To address these limitations, we introduce NeedleBench, a synthetic framework for assessing retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths.

4k, 8k +2
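
The mechanic behind needle-in-a-haystack evaluation of this kind is easy to picture: hide key facts ("needles") at controlled depths inside filler text of an adjustable length, then ask the model to retrieve or reason over them. The sketch below is purely illustrative; NeedleBench's actual construction is bilingual and includes multi-needle reasoning chains:

```python
def build_haystack(filler_sentences, needles, context_len, depths):
    """Insert `needles` (key facts) at relative `depths` into a context of
    roughly `context_len` sentences drawn from `filler_sentences`.

    Hypothetical construction for illustration, not the benchmark's pipeline.
    """
    haystack = (filler_sentences * (context_len // len(filler_sentences) + 1))[:context_len]
    for needle, depth in zip(needles, depths):
        position = int(depth * len(haystack))   # e.g. depth=0.5 -> middle
        haystack.insert(position, needle)
    return " ".join(haystack)

prompt = build_haystack(
    filler_sentences=["The sky was a pale shade of grey."],
    needles=["The password to the vault is 7241."],
    context_len=2000,
    depths=[0.5],
)
```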

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

1 code implementation • 16 Jul 2024 • Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, YuAn Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research.

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

1 code implementation • 25 Jun 2024 • Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang

We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network.

Object, Object Recognition +1
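
The Conv-Gate fusion described in the MG-LLaVA entry above can be read as a gated merge of low-resolution base features with high-resolution detail features. The PyTorch layer below is an assumption-laden reconstruction for illustration, not the released MG-LLaVA module:

```python
import torch
import torch.nn as nn

class ConvGateFusion(nn.Module):
    """Fuse base (low-res) and fine-grained (high-res) visual features with a
    learned convolutional gate. Hypothetical reconstruction, not the paper's code."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, base_feat, hires_feat):
        # Align spatial sizes before gating (the high-res branch is usually larger).
        hires_feat = nn.functional.interpolate(
            hires_feat, size=base_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        g = self.gate(torch.cat([base_feat, hires_feat], dim=1))
        return base_feat + g * hires_feat   # gated injection of fine detail

fused = ConvGateFusion(256)(torch.randn(1, 256, 24, 24), torch.randn(1, 256, 48, 48))
```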

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

1 code implementation • 20 Jun 2024 • Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties.

Language Modelling, Large Language Model

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

1 code implementation • 20 Jun 2024 • Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding.

Form, Video Understanding

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

no code implementations • 6 Jun 2024 • Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos of arbitrary resolution, aspect ratio, and length.

Video Captioning, Video Generation +2

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

1 code implementation • 20 May 2024 • Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen

MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.

College Mathematics, GSM8K +1

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

1 code implementation • 9 Apr 2024 • Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents.

Answer Selection, Long-Context Understanding

Are We on the Right Way for Evaluating Large Vision-Language Models?

1 code implementation • 29 Mar 2024 • Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

World Knowledge

InternLM2 Technical Report

3 code implementations • 26 Mar 2024 • Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, FuKai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, JIA YU, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).

4k, Long-Context Understanding

BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues

1 code implementation • 20 Oct 2023 • Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, Kai Chen

In contrast, other LLMs struggle to generate multi-turn dialogues of satisfactory quality due to poor instruction-following capability, tendency to generate lengthy utterances, or limited general capability.

Instruction Following

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

no code implementations • 20 Sep 2023 • Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo

It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.

Action Classification, Action Recognition +2
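
SkeleTR's two-stage design (per-skeleton graph convolutions, then stacked Transformer encoders across people) can be outlined as follows; the module sizes are assumptions, and a plain MLP stands in for the graph convolutions, so this is a structural sketch rather than the SkeleTR implementation:

```python
import torch
import torch.nn as nn

class TwoStageSkeletonModel(nn.Module):
    """Stage 1: per-person encoder -> one token per person.
    Stage 2: Transformer encoder over person tokens to model interactions.
    Illustrative sketch only."""
    def __init__(self, in_dim=3 * 17, embed_dim=256, num_classes=60):
        super().__init__()
        # Stand-in for graph convolutions: a small per-person MLP over joint features.
        self.person_encoder = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.interaction_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, skeletons):           # (batch, persons, frames, joints * coords)
        tokens = self.person_encoder(skeletons).mean(dim=2)   # pool over frames
        tokens = self.interaction_encoder(tokens)             # person interactions
        return self.head(tokens.mean(dim=1))                  # clip-level logits

logits = TwoStageSkeletonModel()(torch.randn(2, 4, 32, 3 * 17))
```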

MMBench: Is Your Multi-modal Model an All-around Player?

3 code implementations • 12 Jul 2023 • YuAn Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs.

All, Instruction Following +2

Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

1 code implementation • 17 Feb 2023 • Yujie Zhou, Haodong Duan, Anyi Rao, Bing Su, Jiaqi Wang

Specifically, we construct a negative-sample-free triplet stream structure that is composed of an anchor stream without any masking, a spatial masking stream with Central Spatial Masking (CSM), and a temporal masking stream with Motion Attention Temporal Masking (MATM).

Contrastive Learning, Data Augmentation +6

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

no code implementations • ICCV 2023 • Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo

It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in the wild.

Action Classification, Action Recognition +3

Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks

1 code implementation • 20 Sep 2022 • Haodong Duan, Yue Zhao, Kai Chen, Yuanjun Xiong, Dahua Lin

Deep learning models have achieved excellent recognition results on large-scale video benchmarks.

Action Recognition

PYSKL: Towards Good Practices for Skeleton Action Recognition

1 code implementation • 19 May 2022 • Haodong Duan, Jiaqi Wang, Kai Chen, Dahua Lin

The toolbox supports a wide variety of skeleton action recognition algorithms, including approaches based on GCN and CNN.

Action Recognition, Skeleton Based Action Recognition

OCSampler: Compressing Videos to One Clip with Single-step Sampling

1 code implementation • CVPR 2022 • Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, LiMin Wang

Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance; in contrast, we present a new paradigm that learns instance-specific video condensation policies to select the informative frames representing an entire video in a single step.

Video Recognition
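
OCSampler's single-step paradigm contrasts with sequential per-frame policies: one lightweight pass scores every candidate frame at once and keeps the top-k in temporal order. A hedged sketch with hypothetical modules (not the OCSampler code):

```python
import torch
import torch.nn as nn

class SingleStepFrameSelector(nn.Module):
    """Score all candidate frames in one forward pass, then keep the top-k.
    Illustrative; the real OCSampler learns a condensation policy end to end."""
    def __init__(self, feat_dim=512, k=8):
        super().__init__()
        self.k = k
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, frame_feats):                  # (batch, num_frames, feat_dim)
        scores = self.scorer(frame_feats).squeeze(-1)          # (batch, num_frames)
        topk = scores.topk(self.k, dim=1).indices.sort(dim=1).values
        # Gather the chosen frames, preserving temporal order.
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx), topk

clip, kept = SingleStepFrameSelector()(torch.randn(2, 64, 512))
```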

Revisiting Skeleton-based Action Recognition

4 code implementations • CVPR 2022 • Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, Bo Dai

In this work, we propose PoseC3D, a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons.

Group Activity Recognition, Pose Estimation +1
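
The 3D heatmap stack behind PoseC3D is concrete enough to sketch: render each joint of each frame as a 2D Gaussian and stack the per-joint maps over time into a K x T x H x W volume for a 3D CNN. A small NumPy illustration of the representation (not the PoseC3D pipeline):

```python
import numpy as np

def keypoints_to_heatmap_volume(keypoints, height, width, sigma=2.0):
    """keypoints: array of shape (T, K, 2) with (x, y) joint coordinates.
    Returns a (K, T, H, W) stack of Gaussian heatmaps, one channel per joint.
    Illustrative sketch of the representation only."""
    T, K, _ = keypoints.shape
    ys, xs = np.mgrid[0:height, 0:width]
    volume = np.zeros((K, T, height, width), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y = keypoints[t, k]
            volume[k, t] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma**2))
    return volume

volume = keypoints_to_heatmap_volume(np.random.rand(32, 17, 2) * 56, 56, 56)
```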

Omni-sourced Webly-supervised Learning for Video Recognition

3 code implementations • ECCV 2020 • Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, Dahua Lin

Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning.

Ranked #7 on Action Recognition on UCF101 (using extra training data)

Action Classification, Action Recognition +1

TRB: A Novel Triplet Representation for Understanding 2D Human Body

2 code implementations • ICCV 2019 • Haodong Duan, Kwan-Yee Lin, Sheng Jin, Wentao Liu, Chen Qian, Wanli Ouyang

In this paper, we propose the Triplet Representation for Body (TRB) -- a compact 2D human body representation, with skeleton keypoints capturing human pose information and contour keypoints containing human shape information.

Conditional Image Generation, Open-Ended Question Answering +1
