Search Results for author: Xiaoyi Dong

Found 46 papers, 29 papers with code

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

1 code implementation16 Jul 2024 Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, YuAn Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research.

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

1 code implementation17 Jun 2024 Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs).

Visual Question Answering

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

no code implementations6 Jun 2024 Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolution, aspect ratios, and length.

Video Captioning Video Generation +2

Bootstrap3D: Improving 3D Content Creation with Synthetic Data

no code implementations31 May 2024 Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data.

Denoising Descriptive

Streaming Long Video Understanding with Large Language Models

no code implementations25 May 2024 Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang

After the encoding process, the Adaptive Memory Selection strategy selects a constant number of question-related memories from all the historical memories and feeds them into the LLM to generate informative responses.

Question Answering Video Understanding

ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing

no code implementations18 May 2024 Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, Dahua Lin

Instruction-based image editing focuses on equipping a generative model with the capacity to adhere to human-written instructions for editing images.

Unified Scene Representation and Reconstruction for 3D Large Language Models

no code implementations19 Apr 2024 Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.

3D Reconstruction Scene Understanding +1

Are We on the Right Way for Evaluating Large Vision-Language Models?

1 code implementation29 Mar 2024 Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

World Knowledge

InternLM2 Technical Report

3 code implementations26 Mar 2024 Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, FuKai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, JIA YU, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).

4k Long-Context Understanding

Long-CLIP: Unlocking the Long-Text Capability of CLIP

1 code implementation22 Mar 2024 Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities.

Image Retrieval Language Modelling +3

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

2 code implementations20 Mar 2024 Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.

Contrastive Learning Fine-Grained Visual Recognition +3

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

1 code implementation22 Feb 2024 Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, Jiaqi Wang

We present DualFocus, a novel framework for integrating macro and micro perspectives within multi-modal large language models (MLLMs) to enhance vision-language task performance.

Hallucination

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

2 code implementations CVPR 2024 Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu

Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens, and re-allocate the token selection if necessary.

Hallucination

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

1 code implementation28 Nov 2023 Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, Conghui He

Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem", in which the models generate textual descriptions that inaccurately depict or entirely fabricate content from associated images.

Hallucination

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

1 code implementation21 Nov 2023 Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data.

Descriptive visual instruction following +2

PersonMAE: Person Re-Identification Pre-Training with Masked AutoEncoders

no code implementations8 Nov 2023 Hezhen Hu, Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Lu Yuan, Dong Chen, Houqiang Li

Pre-training is playing an increasingly important role in learning generic feature representation for Person Re-identification (ReID).

Person Re-Identification

Emotional Listener Portrait: Neural Listener Head Generation with Emotion

no code implementations ICCV 2023 Luchuan Song, Guojun Yin, Zhenchao Jin, Xiaoyi Dong, Chenliang Xu

Listener head generation centers on generating non-verbal behaviors (e. g., smile) of a listener in reference to the information delivered by a speaker.

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

1 code implementation25 Aug 2023 Zhiyuan Zhao, Linke Ouyang, Bin Wang, Siyuan Huang, Pan Zhang, Xiaoyi Dong, Jiaqi Wang, Conghui He

Despite the great advance of Multimodal Large Language Models (MLLMs) in both instruction dataset building and benchmarking, the independence of training and evaluation makes current MLLMs hard to further improve their capability under the guidance of evaluation results with a relatively low human cost.

Benchmarking

VIGC: Visual Instruction Generation and Correction

2 code implementations24 Aug 2023 Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He

A practical solution to this problem would be to utilize the available multimodal large language models (MLLMs) to generate instruction data for vision-language tasks.

Hallucination Image Captioning +1

Diversity-Aware Meta Visual Prompting

1 code implementation CVPR 2023 Qidong Huang, Xiaoyi Dong, Dongdong Chen, Weiming Zhang, Feifei Wang, Gang Hua, Nenghai Yu

We present Diversity-Aware Meta Visual Prompting~(DAM-VP), an efficient and effective prompting method for transferring pre-trained models to downstream tasks with frozen backbone.

Diversity Visual Prompting

CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet

1 code implementation12 Dec 2022 Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Shuyang Gu, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

Recent studies have shown that CLIP has achieved remarkable success in performing zero-shot inference while its fine-tuning performance is not satisfactory.

PointCAT: Contrastive Adversarial Training for Robust Point Cloud Recognition

no code implementations16 Sep 2022 Qidong Huang, Xiaoyi Dong, Dongdong Chen, Hang Zhou, Weiming Zhang, Kui Zhang, Gang Hua, Nenghai Yu

Notwithstanding the prominent performance achieved in various applications, point cloud recognition models have often suffered from natural corruptions and adversarial perturbations.

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

no code implementations CVPR 2023 Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

Second, masked self-distillation is also consistent with vision-language contrastive from the perspective of training objective as both utilize the visual encoder for feature aligning, and thus is able to learn local semantics getting indirect supervision from the language.

Representation Learning

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

1 code implementation14 Jul 2022 Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

The first design is motivated by the observation that using a pretrained MAE to extract the features as the BERT prediction target for masked tokens can achieve better pretraining performance.

Decoder Object Detection +2

Shape-invariant 3D Adversarial Point Clouds

1 code implementation CVPR 2022 Qidong Huang, Xiaoyi Dong, Dongdong Chen, Hang Zhou, Weiming Zhang, Nenghai Yu

In this paper, we propose a novel Point-Cloud Sensitivity Map to boost both the efficiency and imperceptibility of point perturbations.

Protecting Celebrities from DeepFake with Identity Consistency Transformer

1 code implementation CVPR 2022 Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Ting Zhang, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, Baining Guo

In this work we propose Identity Consistency Transformer, a novel face forgery detection method that focuses on high-level semantics, specifically identity information, and detecting a suspect face by finding identity inconsistency in inner and outer face regions.

Face Swapping

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

6 code implementations CVPR 2022 Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo

By further pretraining on the larger dataset ImageNet-21K, we achieve 87. 5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55. 7 mIoU.

Image Classification Semantic Segmentation

Identity-Driven DeepFake Detection

no code implementations7 Dec 2020 Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, Baining Guo

Our approach takes as input the suspect image/video as well as the target identity information (a reference image or video).

DeepFake Detection Face Swapping

LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud-based Deep Networks

no code implementations1 Nov 2020 Hang Zhou, Dongdong Chen, Jing Liao, Weiming Zhang, Kejiang Chen, Xiaoyi Dong, Kunlin Liu, Gang Hua, Nenghai Yu

To overcome these shortcomings, this paper proposes a novel label guided adversarial network (LG-GAN) for real-time flexible targeted point cloud attack.

GreedyFool: Distortion-Aware Sparse Adversarial Attack

1 code implementation NeurIPS 2020 Xiaoyi Dong, Dongdong Chen, Jianmin Bao, Chuan Qin, Lu Yuan, Weiming Zhang, Nenghai Yu, Dong Chen

Sparse adversarial samples are a special branch of adversarial samples that can fool the target model by only perturbing a few pixels.

Adversarial Attack

Robust Superpixel-Guided Attentional Adversarial Attack

no code implementations CVPR 2020 Xiaoyi Dong, Jiangfan Han, Dongdong Chen, Jiayang Liu, Huanyu Bian, Zehua Ma, Hongsheng Li, Xiaogang Wang, Weiming Zhang, Nenghai Yu

Firstly, because of the contradiction between the local smoothness of natural images and the noisy property of these adversarial perturbations, this "pixel-wise" way makes these methods not robust to image processing based defense methods and steganalysis based detection methods.

Adversarial Attack Steganalysis

Self-Robust 3D Point Recognition via Gather-Vector Guidance

no code implementations CVPR 2020 Xiaoyi Dong, Dongdong Chen, Hang Zhou, Gang Hua, Weiming Zhang, Nenghai Yu

In this paper, we look into the problem of 3D adversary attack, and propose to leverage the internal properties of the point clouds and the adversarial examples to design a new self-robust deep neural network (DNN) based 3D recognition systems.

Position

Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once

no code implementations ICCV 2019 Jiangfan Han, Xiaoyi Dong, Ruimao Zhang, Dong-Dong Chen, Weiming Zhang, Nenghai Yu, Ping Luo, Xiaogang Wang

Recently, generation-based methods have received much attention since they directly use feed-forward networks to generate the adversarial samples, which avoid the time-consuming iterative attacking procedure in optimization-based and gradient-based methods.

Classification General Classification

CAAD 2018: Powerful None-Access Black-Box Attack Based on Adversarial Transformation Network

no code implementations3 Nov 2018 Xiaoyi Dong, Weiming Zhang, Nenghai Yu

In this paper, we propose an improvement of Adversarial Transformation Networks(ATN) to generate adversarial examples, which can fool white-box models and black-box models with a state of the art performance and won the 2rd place in the non-target task in CAAD 2018.

Cannot find the paper you are looking for? You can Submit a new open access paper.