LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms the previous supervised state of the art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
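
The two-stage curriculum can be summarized in a compact sketch. The code below is a minimal illustration in PyTorch, assuming a LLaVA-style architecture (frozen vision encoder, a vision-to-language projection layer, and an LLM) and following the usual LLaVA recipe of unfreezing progressively more of the model; the class names, stand-in modules, dimensions, learning rates, and dummy batches are illustrative assumptions, not the released LLaVA-Med implementation.

```python
# Minimal sketch of the two-stage curriculum, assuming a LLaVA-style model:
# frozen vision encoder, trainable vision-to-language projection, and an LLM.
# All modules below are toy stand-ins, not the released LLaVA-Med code.
import torch
import torch.nn as nn

class ToyLlavaMed(nn.Module):
    def __init__(self, vis_dim=768, hid_dim=4096, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stand-in for a CLIP-style encoder
        self.projector = nn.Linear(vis_dim, hid_dim)       # vision-to-language projection
        self.llm = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.Linear(hid_dim, vocab))

    def forward(self, image_feats):
        with torch.no_grad():                              # vision encoder stays frozen throughout
            v = self.vision_encoder(image_feats)
        return self.llm(self.projector(v))

def train_stage(model, batches, trainable, lr, epochs=1):
    """Train only the named submodules; everything else is frozen."""
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(t) for t in trainable)
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image_feats, targets in batches:
            loss = loss_fn(model(image_feats), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()

model = ToyLlavaMed()
# Dummy batches standing in for (a) PubMed Central figure-caption pairs and
# (b) GPT-4 generated instruction-following conversations.
caption_batches = [(torch.randn(4, 768), torch.randint(0, 32000, (4,)))]
instruct_batches = [(torch.randn(4, 768), torch.randint(0, 32000, (4,)))]

# Stage 1: biomedical concept alignment -- update only the projection layer.
train_stage(model, caption_batches, trainable=("projector",), lr=2e-3)
# Stage 2: instruction tuning -- update the projection layer and the LLM together.
train_stage(model, instruct_batches, trainable=("projector", "llm"), lr=2e-5)
```

Keeping the language model frozen in the first stage confines learning to aligning biomedical image features with the existing vocabulary; the second stage then teaches open-ended conversational behavior on the instruction-following data.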

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ColonINST-v1 (Seen) | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | Accuracy | 93.84 | # 2 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 93.62 | # 4 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | Accuracy | 93.52 | # 5 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 87.22 | # 17 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 79.24 | # 5 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | Accuracy | 78.04 | # 10 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | Accuracy | 77.38 | # 12 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 66.51 | # 16 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | Intersection over Union | 64.69 | # 2 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | Intersection over Union | 41.6 | # 10 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | Intersection over Union | 39.43 | # 12 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | Intersection over Union | 13.39 | # 17 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | Intersection over Union | 41.97 | # 3 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | Intersection over Union | 24.89 | # 11 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | Intersection over Union | 20.85 | # 12 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | Intersection over Union | 12.95 | # 15 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 99.3 | # 3 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | Accuracy | 97.74 | # 9 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | Accuracy | 97.35 | # 10 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 90.4 | # 14 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | Accuracy | 75.25 | # 3 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | Accuracy | 75.07 | # 5 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 73.05 | # 8 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 70.00 | # 13 |
