Search Results for author: Xiaoye Qu

Found 68 papers, 44 papers with code

Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

no code implementations • 4 Jun 2025 • Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng

Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL).

Multimodal Reasoning • Reinforcement Learning (RL)

Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

1 code implementation • 26 May 2025 • Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, Yu Cheng

While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios.

Decision Making • Hierarchical Reinforcement Learning

Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

1 code implementation • 26 May 2025 • Yifan Jia, Kailin Jiang, Yuyang Liang, Qihan Ren, Yi Xin, Rui Yang, Fenze Feng, Mingcai Chen, Hengyang Lu, Haozhe Wang, Xiaoye Qu, Dongrui Liu, Lizhen Cui, Yuntao Du

Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs.

Benchmarking • RAG • +1

SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards

1 code implementation • 25 May 2025 • Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng

Our analysis of the attention map confirms enhanced focus on critical regions, which brings improvements in accuracy.

Image Captioning • Multimodal Reasoning • +3

Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning

1 code implementation • 25 May 2025 • Xinyao Liao, Wei Wei, Xiaoye Qu, Yu Cheng

Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions.

Denoising • Reinforcement Learning (RL)

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

1 code implementation • 13 May 2025 • Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng

We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".

Reinforcement Learning (RL) • Visual Reasoning

Learning to Reason under Off-Policy Guidance

1 code implementation • 21 Apr 2025 • Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR).

Math • Reinforcement Learning (RL)

SEE: Continual Fine-tuning with Sequential Ensemble of Experts

1 code implementation • 9 Apr 2025 • Zhilin Wang, Yafu Li, Xiaoye Qu, Yu Cheng

Some approaches use routers to assign tasks to experts, but in continual learning, they often require retraining for optimal performance.

Continual Learning • Multi-Task Learning

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

1 code implementation • 27 Mar 2025 • Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, BoWen Zhou, Yu Cheng

Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference.

Survey

Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts

1 code implementation • 7 Mar 2025 • Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng

Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training.

Mixture-of-Experts • State Space Models

Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think

1 code implementation • CVPR 2025 • Jie Tian, Xiaoye Qu, Zhenyi Lu, Wei Wei, Sichen Liu, Yu Cheng

(3) With the above two-stage models excelling in motion controllability and degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM.

Denoising • Image to Video Generation

Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment

1 code implementation • 24 Feb 2025 • Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu, Wei Wei, Yu Cheng

While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT).

image-classification • Image Classification • +4
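The low-rank update that LoRA refers to can be sketched in a few lines of NumPy. This is a generic illustration of the technique the snippet names, not the method this paper proposes; the layer shapes, rank, and scaling value are arbitrary assumptions:

```python
import numpy as np

d, k, r = 512, 512, 8        # hypothetical layer shape and LoRA rank
alpha = 16                   # hypothetical scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialized so the update starts at 0

def lora_forward(x):
    # y = x W^T + (alpha / r) * x (B A)^T; only A and B are trained
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, k))
y = lora_forward(x)
print(y.shape)                       # (2, 512)
# Trainable parameters: r*(d+k) instead of d*k
print(r * (d + k), "vs", d * k)      # 8192 vs 262144
```

The parameter-count comparison at the end is the "parameter-efficient" part: the frozen weight stays untouched, and only the two small factors receive gradients.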

LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

1 code implementation • 11 Feb 2025 • Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng

In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very long input sequences.

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

1 code implementation • 22 Jan 2025 • Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters.

Instruction Following

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

1 code implementation • 6 Jan 2025 • Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng

Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios.

Decision Making

Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints

1 code implementation • 26 Nov 2024 • Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Xiaoye Qu, Tianlong Chen, Yu Cheng

Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability.

Denoising • Image Generation • +1

LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

1 code implementation • 24 Nov 2024 • Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, Yu Cheng

Recently, inspired by the concept of sparsity, Mixture-of-Experts (MoE) models have gained increasing popularity for scaling model size while keeping the number of activated parameters constant.

Math • Mixture-of-Experts
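The sparsity idea in the snippet above — a large total parameter count while the number of activated parameters stays constant — can be illustrated with a minimal top-k MoE layer. This is a textbook-style sketch, not the LLaMA-MoE construction; all sizes and the softmax-over-top-k gating are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2

W_gate = rng.normal(size=(d, n_experts)) * 0.1
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]

def moe_forward(x):
    # Route each token to its top-k experts; only those experts run.
    logits = x @ W_gate
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        idx = np.argsort(logits[i])[-top_k:]        # chosen experts
        w = np.exp(logits[i][idx])
        w /= w.sum()                                # renormalized gate weights
        for j, g in zip(idx, w):
            out[i] += g * (tok @ experts[j])
        # Activated parameters per token: top_k * d * d, independent of n_experts.
    return out

x = rng.normal(size=(4, d))
y = moe_forward(x)
print(y.shape)  # (4, 64)
```

Growing `n_experts` scales total capacity, while the per-token compute is fixed by `top_k` — the property the abstract describes.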

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

1 code implementation • 28 Sep 2024 • Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng

In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence.

image-classification • Image Classification • +4

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

1 code implementation • 21 Sep 2024 • Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng

To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf).

RAG • Retrieval-augmented Generation

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

1 code implementation • 30 Aug 2024 • Xiaoye Qu, Jiashuo Sun, Wei Wei, Yu Cheng

By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, our MVP can effectively reduce hallucinations in LVLMs. Extensive experiments verify that our proposed MVP significantly mitigates the hallucination problem across four well-known LVLMs.

Hallucination

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

1 code implementation • 22 Aug 2024 • Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, Yu Cheng

Only a few studies have explored the conflicts between the inherent knowledge of LLMs and the retrieved contextual knowledge.

Misinformation

Mitigating Multilingual Hallucination in Large Vision-Language Models

1 code implementation • 1 Aug 2024 • Xiaoye Qu, Mingyang Song, Wei Wei, Jianfeng Dong, Yu Cheng

In this paper, we make the first attempt to mitigate this important multilingual hallucination in LVLMs.

Hallucination

Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation

no code implementations • 1 Aug 2024 • Xiaoye Qu, Qiyuan Chen, Wei Wei, Jishuo Sun, Jianfeng Dong

To assess the capability of our proposed ARA model in reducing hallucination, we employ three widely used LVLM models (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks.

Hallucination • Image Comprehension • +2

A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends

1 code implementation • 10 Jul 2024 • Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, Wei Hu

Compared to traditional Large Language Models (LLMs), LVLMs present great potential and challenges due to their closer proximity to multi-resource real-world applications and the complexity of multi-modal processing.

Data Poisoning

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

2 code implementations • 24 Jun 2024 • Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng

Motivated by this limitation, we investigate building MoE models from existing dense large language models.

Mixture-of-Experts

Timo: Towards Better Temporal Reasoning for Language Models

1 code implementation • 20 Jun 2024 • Zhaochen Su, Jun Zhang, Tong Zhu, Xiaoye Qu, Juntao Li, Min Zhang, Yu Cheng

Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks?

Question Answering

Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

1 code implementation • 17 Jun 2024 • Tong Zhu, Daize Dong, Xiaoye Qu, Jiacheng Ruan, Wenliang Chen, Yu Cheng

Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales.

Mixture-of-Experts

On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion

no code implementations • 17 Jun 2024 • Chenghao Fan, Zhenyi Lu, Wei Wei, Jie Tian, Xiaoye Qu, Dangyang Chen, Yu Cheng

Can we fine-tune a series of task-specific small models and transfer their knowledge directly to a much larger model without additional training?

In-Context Learning • Task Arithmetic • +1

Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

1 code implementation • 17 Jun 2024 • Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, Yu Cheng

In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input.
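The two stages described above — modularizing knowledge into shared and exclusive components, then merging them dynamically per input — can be caricatured with flattened task vectors. A loose sketch under the assumptions that "shared" is the mean task vector, "exclusive" is each task's residual, and the router weights come from elsewhere (the actual method also compresses the exclusive parts):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=100)  # pretrained weights, flattened for illustration
finetuned = {t: base + rng.normal(size=100) * 0.1 for t in ["qa", "sum", "nli"]}

# Stage 1: modularize each task's knowledge into shared and exclusive components.
task_vectors = {t: w - base for t, w in finetuned.items()}
shared = np.mean(list(task_vectors.values()), axis=0)
exclusive = {t: v - shared for t, v in task_vectors.items()}

# Stage 2: dynamically merge, weighting exclusive knowledge per input.
def merge(router_weights):
    # router_weights: task -> weight, e.g. produced by a router over the input
    delta = shared + sum(w * exclusive[t] for t, w in router_weights.items())
    return base + delta

merged = merge({"qa": 0.9, "sum": 0.05, "nli": 0.05})
print(merged.shape)  # (100,)
```

A sanity check on the decomposition: routing entirely to one task recovers exactly that task's fine-tuned weights, since `shared + exclusive[t] == task_vectors[t]` by construction.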

Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models

1 code implementation • 11 Jun 2024 • Zhenyi Lu, Jie Tian, Wei Wei, Xiaoye Qu, Yu Cheng, Wenfeng Xie, Dangyang Chen

Our approach is grounded in the empirical observation that pairwise comparisons can effectively alleviate boundary ambiguity and inherent bias.

text-classification • Text Classification

A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching

no code implementations • 5 Mar 2024 • Dong Yao, Asaad Alghamdi, Qingrong Xia, Xiaoye Qu, Xinyu Duan, Zhefeng Wang, Yi Zheng, Baoxing Huai, Peilun Cheng, Zhou Zhao

Although DC-Match is a simple yet effective method for semantic matching, it highly depends on the external NER techniques to identify the keywords of sentences, which limits the performance of semantic matching for minor languages since satisfactory NER tools are usually hard to obtain.

Chatbot • Community Question Answering • +5

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

1 code implementation • 19 Feb 2024 • Jihai Zhang, Xiang Lan, Xiaoye Qu, Yu Cheng, Mengling Feng, Bryan Hooi

Self-Supervised Contrastive Learning has proven effective in deriving high-quality representations from unlabeled data.

Contrastive Learning

Enhancing Low-Resource Relation Representations through Multi-View Decoupling

1 code implementation • 26 Dec 2023 • Chenghao Fan, Wei Wei, Xiaoye Qu, Zhenyi Lu, Wenfeng Xie, Yu Cheng, Dangyang Chen

Recently, prompt-tuning with pre-trained language models (PLMs) has demonstrated significant improvements on relation extraction (RE) tasks.

Relation • Relation Extraction • +1

Mirror: A Universal Framework for Various Information Extraction Tasks

1 code implementation • 9 Nov 2023 • Tong Zhu, Junfei Ren, Zijian Yu, Mengsong Wu, Guoliang Zhang, Xiaoye Qu, Wenliang Chen, Zhefeng Wang, Baoxing Huai, Min Zhang

Sharing knowledge between information extraction tasks has always been a challenge due to the diverse data formats and task variations.

Machine Reading Comprehension • Triplet

Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

1 code implementation • 6 Nov 2023 • Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun Yang, Xun Wang, Meng Wang

In this manner, our framework is able to learn the unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases.

Action Understanding • Representation Learning • +1

MIRACLE: Towards Personalized Dialogue Generation with Latent-Space Multiple Personal Attribute Control

1 code implementation • 22 Oct 2023 • Zhenyi Lu, Wei Wei, Xiaoye Qu, Xianling Mao, Dangyang Chen, Jixiong Chen

Subsequently, we employ a conditional variational auto-encoder to align with the dense personalized responses within a latent joint attribute space.

Attribute • Dialogue Generation • +1

TREA: Tree-Structure Reasoning Schema for Conversational Recommendation

1 code implementation • 20 Jul 2023 • Wendi Li, Wei Wei, Xiaoye Qu, Xian-Ling Mao, Ye Yuan, Wenfeng Xie, Dangyang Chen

TREA constructs a multi-hierarchical scalable tree as the reasoning structure to clarify the causal relationships between mentioned entities, and fully utilizes historical conversations to generate more reasonable and suitable responses for recommended results.

Conversational Recommendation • Knowledge Graphs • +1

From Region to Patch: Attribute-Aware Foreground-Background Contrastive Learning for Fine-Grained Fashion Retrieval

1 code implementation • 17 May 2023 • Jianfeng Dong, Xiaoman Peng, Zhe Ma, Daizong Liu, Xiaoye Qu, Xun Yang, Jixiang Zhu, Baolong Liu

As the attribute-specific similarity typically corresponds to the specific subtle regions of images, we propose a Region-to-Patch Framework (RPF) that consists of a region-aware branch and a patch-aware branch to extract fine-grained attribute-related visual features for precise retrieval in a coarse-to-fine manner.

Attribute • Contrastive Learning • +2

A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends

no code implementations • 7 Feb 2023 • Xiaoye Qu, Yingjie Gu, Qingrong Xia, Zechang Li, Zhefeng Wang, Baoxing Huai

In this paper, we provide a comprehensive review of the development of Arabic NER, especially the recent advances in deep learning and pre-trained language models.

Feature Engineering • Language Modeling • +5

Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble

1 code implementation • 13 Dec 2022 • Xiaoye Qu, Jun Zeng, Daizong Liu, Zhefeng Wang, Baoxing Huai, Pan Zhou

Distantly-Supervised Named Entity Recognition (DS-NER) effectively alleviates the data scarcity problem in NER by automatically generating training samples.

named-entity-recognition • Named Entity Recognition • +1

Reducing the Vision and Language Bias for Temporal Sentence Grounding

no code implementations • 27 Jul 2022 • Daizong Liu, Xiaoye Qu, Wei Hu

In this paper, we study the above issue of selection biases and accordingly propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both vision and language modalities for enhancing the model generalization ability.

Information Retrieval • Multimodal Reasoning • +3

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

no code implementations • 14 Jan 2022 • Daizong Liu, Xiaoye Qu, Yinzhen Wang, Xing Di, Kai Zou, Yu Cheng, Zichuan Xu, Pan Zhou

Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query.

Clustering • Sentence • +1

Exploring Motion and Appearance Information for Temporal Sentence Grounding

no code implementations • 3 Jan 2022 • Daizong Liu, Xiaoye Qu, Pan Zhou, Yang Liu

Then, we develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations, respectively.

Object • object-detection • +3

Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

no code implementations • 3 Jan 2022 • Daizong Liu, Xiaoye Qu, Xing Di, Yu Cheng, Zichuan Xu, Pan Zhou

To tackle this issue, we propose a memory-augmented network, called the Memory-Guided Semantic Learning Network (MGSL-Net), that learns and memorizes rarely appearing content in TSG tasks.

Sentence • Temporal Sentence Grounding

Efficient Document-level Event Extraction via Pseudo-Trigger-aware Pruned Complete Graph

1 code implementation • 11 Dec 2021 • Tong Zhu, Xiaoye Qu, Wenliang Chen, Zhefeng Wang, Baoxing Huai, Nicholas Jing Yuan, Min Zhang

Most previous studies of document-level event extraction mainly focus on building argument chains in an autoregressive way, which achieves a certain success but is inefficient in both training and inference.

Document-level Event Extraction • Event Extraction

Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding

no code implementations • EMNLP 2021 • Daizong Liu, Xiaoye Qu, Pan Zhou

A key solution to temporal sentence grounding (TSG) exists in how to learn effective alignment between vision and language features extracted from an untrimmed video and a sentence description.

Sentence • Temporal Sentence Grounding

Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos

no code implementations • EMNLP 2021 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou

However, the performance of the bottom-up model is inferior to its top-down counterpart, as it fails to exploit segment-level interaction.

Sentence

Coarse to Fine: Domain Adaptive Crowd Counting via Adversarial Scoring Network

no code implementations • 27 Jul 2021 • Zhikang Zou, Xiaoye Qu, Pan Zhou, Shuangjie Xu, Xiaoqing Ye, Wenhao Wu, Jin Ye

Specifically, at the coarse-grained stage, we design a dual-discriminator strategy to adapt the source domain to be close to the targets from the perspectives of both global and local feature space via adversarial learning.

Crowd Counting • Transfer Learning

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

no code implementations • CVPR 2021 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, Yulai Xie

This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.

Sentence • Temporal Sentence Grounding

Hierarchical Similarity Learning for Language-based Product Image Retrieval

1 code implementation • 18 Feb 2021 • Zhe Ma, Fenghao Liu, Jianfeng Dong, Xiaoye Qu, Yuan He, Shouling Ji

In this paper, we focus on the cross-modal similarity measurement, and propose a novel Hierarchical Similarity Learning (HSL) network.

Image Retrieval • Retrieval • +1

Progressive Localization Networks for Language-based Moment Localization

no code implementations • 2 Feb 2021 • Qi Zheng, Jianfeng Dong, Xiaoye Qu, Xun Yang, Yabing Wang, Pan Zhou, Baolong Liu, Xun Wang

The language-based setting of this task allows for an open set of target activities, resulting in large variation in the temporal lengths of video moments.

Read, Retrospect, Select: An MRC Framework to Short Text Entity Linking

no code implementations • 7 Jan 2021 • Yingjie Gu, Xiaoye Qu, Zhefeng Wang, Baoxing Huai, Nicholas Jing Yuan, Xiaolin Gui

Entity linking (EL) for the rapidly growing short text (e.g., search queries and news titles) is critical to industrial applications.

Entity Linking • Machine Reading Comprehension • +1

Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network

no code implementations • COLING 2020 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou

In this paper, we propose a novel deep rectification-modulation network (RMN), transforming this task into a multi-step reasoning process by repeating rectification and modulation.

Sentence

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

no code implementations • 6 Aug 2020 • Xiaoye Qu, Pengwei Tang, Zhikang Zhou, Yu Cheng, Jianfeng Dong, Pan Zhou

In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.

Sentence

Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

1 code implementation • 4 Aug 2020 • Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou, Zichuan Xu

To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.

Graph Attention • Sentence

Enhanced 3D convolutional networks for crowd counting

no code implementations • 12 Aug 2019 • Zhikang Zou, Huiliang Shao, Xiaoye Qu, Wei Wei, Pan Zhou

Recently, convolutional neural networks (CNNs) have become the de facto leading method for crowd counting.

Crowd Counting

Adversarial Category Alignment Network for Cross-domain Sentiment Classification

no code implementations • NAACL 2019 • Xiaoye Qu, Zhikang Zou, Yu Cheng, Yang Yang, Pan Zhou

Cross-domain sentiment classification aims to predict sentiment polarity on a target domain utilizing a classifier learned from a source domain.

Classification • General Classification • +2
