Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

11 Dec 2023 · Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou

Instruction tuning data is essential for training Multimodal Large Language Models (MLLMs). However, creating high-quality instruction tuning data presents significant challenges. Asking humans to label instruction tuning data is labor-intensive and time-consuming. Prior works that prompt GPT-4 for data generation are not only costly but also perform unsatisfactorily on complex tasks (i.e., grounding-based reasoning tasks). To address these challenges, we are the first to explore empowering MLLMs with the ability to generate instruction tuning data by following user instructions. Specifically, we developed Genixer, an innovative data generation pipeline that produces a variety of high-quality instruction tuning data covering nine representative tasks, e.g., Common VQA, REC, REG, and PointQ. Genixer provides a unified solution for data generation with four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering the MLLM, and (iv) data generation and filtering. To assess the quality of the generated data, we conducted a human evaluation and a user preference study. Subsequently, we generated two instruction tuning datasets for training two representative MLLMs, LLaVA1.5 and Shikra, and observed consistent improvements across various VQA tasks and multimodal benchmarks. For instance, performance on the VizWiz benchmark improved from 50.0% to 53.8%, and on ScienceQA it increased from 66.8% to 69.7%, further confirming the quality of the generated instruction tuning data. The data, code, and models will be released.
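As a rough illustration of how steps (ii) and (iv) might fit together, the following is a minimal sketch and not the authors' implementation. The template wording, the `Sample` record, the `keep` filter rule, and `stub_generator` are all hypothetical; a real setup would call the tuned MLLM from step (iii) in place of the stub, and steps (i) and (iii) (data collection and model training) are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Sample:
    image_id: str
    task: str
    question: str
    answer: str


# Step (ii): a hypothetical instruction template; the paper's actual
# templates are not given in the abstract.
TEMPLATE = "Generate a question-answer pair for the task: {task}."


def build_prompt(task: str) -> str:
    return TEMPLATE.format(task=task)


# A generator maps (image_id, prompt) to a (question, answer) pair.
Generator = Callable[[str, str], Tuple[str, str]]


def generate(image_ids: Iterable[str], task: str, generator: Generator) -> List[Sample]:
    """Step (iv), generation half: ask the generator for one pair per image."""
    prompt = build_prompt(task)
    samples = []
    for image_id in image_ids:
        question, answer = generator(image_id, prompt)
        samples.append(Sample(image_id, task, question, answer))
    return samples


def keep(sample: Sample) -> bool:
    """Step (iv), filtering half: an assumed rule that drops empty fields."""
    return bool(sample.question.strip()) and bool(sample.answer.strip())


def run_pipeline(image_ids: Iterable[str], task: str, generator: Generator) -> List[Sample]:
    return [s for s in generate(image_ids, task, generator) if keep(s)]


def stub_generator(image_id: str, prompt: str) -> Tuple[str, str]:
    """Stand-in for the tuned MLLM; returns an empty answer for one image."""
    if image_id == "img_2":
        return ("What is shown?", "")
    return (f"What is in {image_id}?", "a placeholder answer")


if __name__ == "__main__":
    for sample in run_pipeline(["img_1", "img_2", "img_3"], "Common VQA", stub_generator):
        print(sample)
```

Separating generation from filtering keeps the filter rule swappable, which matters because the abstract does not say what filtering criteria Genixer actually applies.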

