Search Results for author: Qinghao Ye

Found 28 papers, 14 papers with code

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

no code implementations • 1 Mar 2024 • Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.

Representation Learning

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

no code implementations • 26 Feb 2024 • Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval.

Retrieval Text Retrieval +1

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

1 code implementation • 14 Dec 2023 • Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities.

Contrastive Learning Data Augmentation

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

1 code implementation • 12 Dec 2023 • Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang

We first analyzed the representation distribution of textual and visual tokens in MLLM, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them.

Contrastive Learning Hallucination +4

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

1 code implementation • 30 Nov 2023 • Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang

In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.

Language Modelling Large Language Model

Evaluation and Analysis of Hallucination in Large Vision-Language Models

1 code implementation • 29 Aug 2023 • Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, Haoyu Tang

In this paper, we propose Hallucination Evaluation based on Large Language Models (HaELM), an LLM-based hallucination evaluation framework.

Hallucination Hallucination Evaluation

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

no code implementations • 17 Jul 2023 • Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

Specifically, we incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform coarse-grained visual token extraction, and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) to the backbone for top-level visual abstraction.

Text Summarization

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

1 code implementation • 4 Jul 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang

Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding.

document understanding Language Modelling +2

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

1 code implementation • 7 Jun 2023 • Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.

Cross-Modal Retrieval Language Modelling +3

Transforming Visual Scene Graphs to Image Captions

1 code implementation • 3 May 2023 • Xu Yang, Jiawei Peng, Zihua Wang, Haiyang Xu, Qinghao Ye, Chenliang Li, Songfang Huang, Fei Huang, Zhangzikang Li, Yu Zhang

In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs.

Attribute Descriptive +1

ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human

1 code implementation • 16 Apr 2023 • Junfeng Tian, Hehong Chen, Guohai Xu, Ming Yan, Xing Gao, Jianhai Zhang, Chenliang Li, Jiayi Liu, Wenshen Xu, Haiyang Xu, Qi Qian, Wei Wang, Qinghao Ye, Jiejing Zhang, Ji Zhang, Fei Huang, Jingren Zhou

In this paper, we present ChatPLUG, a Chinese open-domain dialogue system for digital human applications that is instruction-finetuned on a wide range of dialogue tasks in a unified internet-augmented format.

World Knowledge

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

4 code implementations • 1 Feb 2023 • Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Action Classification Image Classification +7

Learning Trajectory-Word Alignments for Video-Language Tasks

no code implementations • ICCV 2023 • Xu Yang, Zhangzikang Li, Haiyang Xu, Hanwang Zhang, Qinghao Ye, Chenliang Li, Ming Yan, Yu Zhang, Fei Huang, Songfang Huang

To amend this, we propose a novel TW-BERT that learns Trajectory-Word alignment through a newly designed trajectory-to-word (T2W) attention for solving video-language tasks.

Question Answering Retrieval +4

BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization.

no code implementations • ICCV 2023 • Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

In this paper, we propose a Bottom-Up Patch Summarization approach named BUS, inspired by the document summarization task in NLP, to learn a concise visual summary of lengthy visual token sequences, guided by textual semantics.

Abstractive Text Summarization Document Summarization

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

no code implementations • ICCV 2023 • Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang

We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively.

TGIF-Action TGIF-Frame +7

Exploring Global Diversity and Local Context for Video Summarization

no code implementations • 27 Jan 2022 • Yingchao Pan, Ouhan Huang, Qinghao Ye, Zhongjin Li, Wenjiang Wang, Guodun Li, Yuxing Chen

By combining these two attention mechanisms, a video SUMmarization model with Diversified Contextual Attention scheme is developed, namely SUM-DCA.

Video Summarization

Robust Weakly Supervised Learning for COVID-19 Recognition Using Multi-Center CT Images

no code implementations • 9 Dec 2021 • Qinghao Ye, Yuan Gao, Weiping Ding, Zhangming Niu, Chengjia Wang, Yinghui Jiang, Minhao Wang, Evandro Fei Fang, Wade Menpes-Smith, Jun Xia, Guang Yang

The multi-domain shift problem in multi-center and multi-scanner studies is therefore nontrivial, and addressing it is crucial for dependable recognition and critical for reproducible and objective diagnosis and prognosis.

Computed Tomography (CT) Weakly-supervised Learning

Explainable AI For COVID-19 CT Classifiers: An Initial Comparison Study

no code implementations • 25 Apr 2021 • Qinghao Ye, Jun Xia, Guang Yang

XAI is an AI model that is programmed to explain its goals, logic, and decision making so that end users can understand them.

Decision Making Explainable Artificial Intelligence (XAI) +1

Unbox the Black-box for the Medical Explainable AI via Multi-modal and Multi-centre Data Fusion: A Mini-Review, Two Showcases and Beyond

no code implementations • 3 Feb 2021 • Guang Yang, Qinghao Ye, Jun Xia

Explainable Artificial Intelligence (XAI) is an emerging research topic of machine learning aimed at unboxing how AI systems' black-box choices are made.

BIG-bench Machine Learning Decision Making +2

Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion

no code implementations • ICCV 2021 • Qinghao Ye, Xiyue Shen, Yuan Gao, ZiRui Wang, Qi Bi, Ping Li, Guang Yang

Video highlight detection plays an increasingly important role in social media content filtering; however, it remains highly challenging to develop automated video highlight detection methods because of the lack of temporal annotations (i.e., where the highlight moments are in long videos) for supervised learning.

Highlight Detection Model Optimization

Exploring global diverse attention via pairwise temporal relation for video summarization

no code implementations • 23 Sep 2020 • Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, Ling Shao

In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention, called SUM-GDA, which adapts the attention mechanism from a global perspective to consider pairwise temporal relations of video frames.

Relation Video Summarization

Application of Time Series Analysis to Traffic Accidents in Los Angeles

no code implementations • 28 Nov 2019 • Qinghao Ye, Kaiyuan Hu, Yizhe WANG

The primary objective of this paper is to apply a set of methods for the time series analysis of traffic accidents in Los Angeles in the past few years.

Time Series Time Series Analysis
