Search Results for author: Chaoya Jiang

Found 12 papers, 3 papers with code

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

no code implementations22 Aug 2024 Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.

Language Modelling Large Language Model +1

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

no code implementations21 Jul 2024 Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu

However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images remain underexplored.

In-Context Learning Multiple-choice

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

no code implementations24 Feb 2024 Chaoya Jiang, Wei Ye, Mengfan Dong, Hongrui Jia, Haiyang Xu, Ming Yan, Ji Zhang, Shikun Zhang

Large Vision Language Models exhibit remarkable capabilities but struggle with hallucinations inconsistencies between images and their descriptions.

Hallucination Hallucination Evaluation

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

no code implementations11 Jan 2024 Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songhang Huang, Fei Huang

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models.

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

1 code implementation14 Dec 2023 Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities.

Contrastive Learning Data Augmentation

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

1 code implementation CVPR 2024 Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang

We first analyzed the representation distribution of textual and visual tokens in MLLM, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them.

Contrastive Learning Hallucination +5

BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

no code implementations17 Jul 2023 Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

Specifically, We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) upon the backbone for top-level visual abstraction.

Decoder Text Summarization

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

2 code implementations8 Jun 2023 Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang

To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences.

Language Modelling Large Language Model

Exploiting Pseudo Image Captions for Multimodal Summarization

no code implementations9 May 2023 Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, Shikun Zhang

Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives.

Common Sense Reasoning Contrastive Learning +1

BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization.

no code implementations ICCV 2023 Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

In this paper, we propose a Bottom-Up Patch Summarization approach named BUS which is inspired by the Document Summarization Task in NLP to learn a concise visual summary of lengthy visual token sequences, guided by textual semantics.

Abstractive Text Summarization Decoder +1

SIMILARITY LEARNING FOR COVER SONG IDENTIFICATION USING CROSS-SIMILARITY MATRICES OF MULTI-LEVEL DEEP SEQUENCES

no code implementations14 May 2020 Chaoya Jiang, Deshun Yang, Xiaoou Chen

One part is a network for learn- ing the deep sequence representation of music tracks, and the other is a similarity estimation network which takes as input the cross- similarity matrices calculated from the deep sequences of a pair of tracks.

Cover song identification Metric Learning

Cannot find the paper you are looking for? You can Submit a new open access paper.