Search Results for author: Qinghao Ye

Found 28 papers, 14 papers with code

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

no code implementations • 1 Mar 2024 • Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.

Representation Learning

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

no code implementations • 26 Feb 2024 • Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval.

Retrieval Text Retrieval +1

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

1 code implementation • 14 Dec 2023 • Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities.

Contrastive Learning Data Augmentation

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

1 code implementation • 12 Dec 2023 • Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang

We first analyzed the representation distribution of textual and visual tokens in MLLM, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them.

Ranked #74 on Visual Question Answering on MM-Vet

Contrastive Learning Hallucination +4

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

1 code implementation • 30 Nov 2023 • Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang

In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.

Language Modelling Large Language Model

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

2 code implementations • 7 Nov 2023 • Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks.

Ranked #11 on Visual Question Answering (VQA) on InfiMM-Eval

Language Modelling Large Language Model +1

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

2 code implementations • 8 Oct 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, Fei Huang

Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs.

Language Modelling Large Language Model +1

Evaluation and Analysis of Hallucination in Large Vision-Language Models

1 code implementation • 29 Aug 2023 • Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, Haoyu Tang

In this paper, we propose Hallucination Evaluation based on Large Language Models (HaELM), an LLM-based hallucination evaluation framework.

Hallucination Hallucination Evaluation

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

no code implementations • 17 Jul 2023 • Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) upon the backbone for top-level visual abstraction.

Text Summarization

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

1 code implementation • 4 Jul 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang

Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding.

document understanding Language Modelling +2

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

1 code implementation • 7 Jun 2023 • Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.

Cross-Modal Retrieval Language Modelling +3

Transforming Visual Scene Graphs to Image Captions

1 code implementation • 3 May 2023 • Xu Yang, Jiawei Peng, Zihua Wang, Haiyang Xu, Qinghao Ye, Chenliang Li, Songfang Huang, Fei Huang, Zhangzikang Li, Yu Zhang

In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs.

Attribute Descriptive +1

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

1 code implementation • 27 Apr 2023 • Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl.

Ranked #3 on Visual Question Answering (VQA) on HallusionBench

Visual Question Answering (VQA) Zero-Shot Video Question Answer

ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human

1 code implementation • 16 Apr 2023 • Junfeng Tian, Hehong Chen, Guohai Xu, Ming Yan, Xing Gao, Jianhai Zhang, Chenliang Li, Jiayi Liu, Wenshen Xu, Haiyang Xu, Qi Qian, Wei Wang, Qinghao Ye, Jiejing Zhang, Ji Zhang, Fei Huang, Jingren Zhou

In this paper, we present ChatPLUG, a Chinese open-domain dialogue system for digital human applications that instruction finetunes on a wide range of dialogue tasks in a unified internet-augmented format.

World Knowledge

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

4 code implementations • 1 Feb 2023 • Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Ranked #1 on Video Captioning on MSR-VTT

Action Classification Image Classification +7

Learning Trajectory-Word Alignments for Video-Language Tasks

no code implementations • ICCV 2023 • Xu Yang, Zhangzikang Li, Haiyang Xu, Hanwang Zhang, Qinghao Ye, Chenliang Li, Ming Yan, Yu Zhang, Fei Huang, Songfang Huang

To amend this, we propose a novel TW-BERT to learn Trajectory-Word alignment by a newly designed trajectory-to-word (T2W) attention for solving video-language tasks.

Question Answering Retrieval +4

BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization

no code implementations • ICCV 2023 • Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

In this paper, we propose a Bottom-Up Patch Summarization approach named BUS, inspired by the Document Summarization Task in NLP, to learn a concise visual summary of lengthy visual token sequences, guided by textual semantics.

Abstractive Text Summarization Document Summarization

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

no code implementations • ICCV 2023 • Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement, respectively.

Ranked #1 on Visual Question Answering (VQA) on TGIF-QA

TGIF-Action TGIF-Frame +7

All grains, one scheme (AGOS): Learning multigrain instance representation for aerial scene classification

1 code implementation • IEEE Transactions on Geoscience and Remote Sensing 2022 • Qi Bi, Beichen Zhou, Kun Qin, Qinghao Ye, Gui-Song Xia

Finally, our SSF module allows our framework to learn the same scene scheme from multigrain instance representations and fuses them, so that the entire framework is optimized as a whole.

Aerial Scene Classification Multiple Instance Learning +1

All Grains, One Scheme (AGOS): Learning Multi-grain Instance Representation for Aerial Scene Classification

1 code implementation • IEEE Transactions on Geoscience and Remote Sensing 2022 • Qi Bi, Beichen Zhou, Kun Qin, Qinghao Ye, Gui-Song Xia

Finally, our SSF allows our framework to learn the same scene scheme from multi-grain instance representations and fuses them, so that the entire framework is optimized as a whole.

Ranked #1 on Scene Recognition on AID

Aerial Scene Classification Image Classification +3

AI-based Medical e-Diagnosis for Fast and Automatic Ventricular Volume Measurement in the Patients with Normal Pressure Hydrocephalus

no code implementations • 31 Jan 2022 • Xi Zhou, Qinghao Ye, Xiaolin Yang, Jiakuan Chen, Haiqin Ma, Jun Xia, Javier Del Ser, Guang Yang

Finally, we verify the reliability of the model and achieve automatic measurement of VV and ICV.

Segmentation

Exploring Global Diversity and Local Context for Video Summarization

no code implementations • 27 Jan 2022 • Yingchao Pan, Ouhan Huang, Qinghao Ye, Zhongjin Li, Wenjiang Wang, Guodun Li, Yuxing Chen

By combining these two attention mechanisms, a video SUMmarization model with Diversified Contextual Attention scheme is developed, namely SUM-DCA.

Video Summarization

Robust Weakly Supervised Learning for COVID-19 Recognition Using Multi-Center CT Images

no code implementations • 9 Dec 2021 • Qinghao Ye, Yuan Gao, Weiping Ding, Zhangming Niu, Chengjia Wang, Yinghui Jiang, Minhao Wang, Evandro Fei Fang, Wade Menpes-Smith, Jun Xia, Guang Yang

The multi-domain shift problem in multi-center and multi-scanner studies is therefore nontrivial, and addressing it is crucial for dependable recognition and for reproducible and objective diagnosis and prognosis.

Computed Tomography (CT) Weakly-supervised Learning

Explainable AI For COVID-19 CT Classifiers: An Initial Comparison Study

no code implementations • 25 Apr 2021 • Qinghao Ye, Jun Xia, Guang Yang

XAI is an AI model that is programmed to explain its goals, logic, and decision making so that the end users can understand.

Decision Making Explainable Artificial Intelligence (XAI) +1

Unbox the Black-box for the Medical Explainable AI via Multi-modal and Multi-centre Data Fusion: A Mini-Review, Two Showcases and Beyond

no code implementations • 3 Feb 2021 • Guang Yang, Qinghao Ye, Jun Xia

Explainable Artificial Intelligence (XAI) is an emerging research topic of machine learning aimed at unboxing how AI systems' black-box choices are made.

BIG-bench Machine Learning Decision Making +2

Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion

no code implementations • ICCV 2021 • Qinghao Ye, Xiyue Shen, Yuan Gao, ZiRui Wang, Qi Bi, Ping Li, Guang Yang

Video highlight detection plays an increasingly important role in social media content filtering; however, it remains highly challenging to develop automated video highlight detection methods because of the lack of temporal annotations (i.e., where the highlight moments are in long videos) for supervised learning.

Highlight Detection Model Optimization

Exploring global diverse attention via pairwise temporal relation for video summarization

no code implementations • 23 Sep 2020 • Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, Ling Shao

In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention called SUM-GDA, which adapts the attention mechanism from a global perspective to consider pairwise temporal relations of video frames.

Relation Video Summarization

Application of Time Series Analysis to Traffic Accidents in Los Angeles

no code implementations • 28 Nov 2019 • Qinghao Ye, Kaiyuan Hu, Yizhe WANG

The primary objective of this paper is to apply a set of methods for the time series analysis of traffic accidents in Los Angeles in the past few years.

Time Series Time Series Analysis
