Search Results for author: Anwen Hu

Found 17 papers, 14 papers with code

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

1 code implementation • 4 Jul 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang

Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding.

Document Understanding • Language Modelling • +2

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

1 code implementation • 30 Nov 2023 • Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang

In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.

Language Modelling • Large Language Model

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

1 code implementation • 19 Mar 2024 • Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs.

Document Understanding • Optical Character Recognition (OCR)

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

1 code implementation • 7 Jun 2023 • Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.

Cross-Modal Retrieval • Language Modelling • +3

Movie101: A New Movie Understanding Benchmark

1 code implementation • 20 May 2023 • Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin

Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.

Video Captioning

Question-controlled Text-aware Image Captioning

1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin

To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).

Image Captioning • Question Answering

MPMQA: Multimodal Question Answering on Product Manuals

1 code implementation • 19 Apr 2023 • Liang Zhang, Anwen Hu, Jing Zhang, Shuo Hu, Qin Jin

Taking into account the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving the most related pages and then generating multimodal answers.

Question Answering • Sentence

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

1 code implementation • 10 May 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

Existing metrics provide only a single score to measure caption quality, which makes them less explainable and informative.

Benchmarking • Image Captioning

ICECAP: Information Concentrated Entity-aware Image Captioning

1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin

In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.

Image Captioning • Retrieval • +1

Accommodating Audio Modality in CLIP for Multimodal Processing

1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin

In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.

AudioCaps • Contrastive Learning • +4

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

no code implementations • 29 May 2022 • Liang Zhang, Anwen Hu, Qin Jin

Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.

Language Acquisition • Retrieval • +2

Explore and Tell: Embodied Visual Captioning in 3D Environments

no code implementations • ICCV 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.

Image Captioning • Navigate • +1
