Search Results for author: Anwen Hu

Found 17 papers, 14 papers with code

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

1 code implementation • 4 Jul 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang

Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding.

Document Understanding • Language Modelling • +2

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

1 code implementation • 30 Nov 2023 • Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang

In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.

Language Modelling • Large Language Model

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

1 code implementation • 19 Mar 2024 • Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs.

Document Understanding • Optical Character Recognition (OCR)

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

1 code implementation • 7 Jun 2023 • Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.

Cross-Modal Retrieval • Language Modelling • +3

Movie101: A New Movie Understanding Benchmark

1 code implementation • 20 May 2023 • Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin

Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.

Video Captioning

Question-controlled Text-aware Image Captioning

1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin

To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).

Image Captioning • Question Answering

MPMQA: Multimodal Question Answering on Product Manuals

1 code implementation • 19 Apr 2023 • Liang Zhang, Anwen Hu, Jing Zhang, Shuo Hu, Qin Jin

Taking into account the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving the most related pages and then generating multimodal answers.

Question Answering • Sentence

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

1 code implementation • 10 May 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

Existing metrics provide only a single score to measure caption quality, which makes them less explainable and informative.

Benchmarking • Image Captioning

ICECAP: Information Concentrated Entity-aware Image Captioning

1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin

In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.

Image Captioning • Retrieval • +1

Accommodating Audio Modality in CLIP for Multimodal Processing

1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin

In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.

AudioCaps • Contrastive Learning • +4

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

no code implementations • 29 May 2022 • Liang Zhang, Anwen Hu, Qin Jin

Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.

Language Acquisition • Retrieval • +2

Explore and Tell: Embodied Visual Captioning in 3D Environments

no code implementations • ICCV 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.

Image Captioning • Navigate • +1
