InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales the vision foundation model up to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level and pixel-level recognition, and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and is a good alternative to ViT-22B. We hope that our research contributes to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
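InternVL-C and InternVL-G, the contrastive variants evaluated in the tables below, score images against text in the standard CLIP-style way: both modalities are embedded, L2-normalized, and compared by cosine similarity. The sketch below illustrates that zero-shot classification protocol; the `encode_image`/`encode_text` functions are stand-ins for InternVL's actual encoders (see the repository above for the real model-loading code), and the prompt template is the usual convention rather than something specified on this page.

```python
# Sketch of CLIP-style zero-shot classification, the protocol behind the
# "Zero-Shot Transfer Image Classification" results below. The two encode_*
# functions are placeholders for InternVL-C's image and text towers.
import torch

def encode_image(images: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real run would call the InternVL-C vision encoder here.
    return torch.randn(images.shape[0], 768)

def encode_text(prompts: list[str]) -> torch.Tensor:
    # Placeholder: a real run would tokenize and call the text encoder here.
    return torch.randn(len(prompts), 768)

class_names = ["cat", "dog", "bird"]
prompts = [f"a photo of a {c}" for c in class_names]  # common prompt template

images = torch.randn(4, 3, 224, 224)  # dummy batch of preprocessed images

# L2-normalize both embeddings so the dot product is cosine similarity,
# then pick the class with the highest similarity for each image.
img_emb = torch.nn.functional.normalize(encode_image(images), dim=-1)
txt_emb = torch.nn.functional.normalize(encode_text(prompts), dim=-1)
logits = img_emb @ txt_emb.T          # (num_images, num_classes)
pred = logits.argmax(dim=-1)
print([class_names[i] for i in pred.tolist()])
```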
Results from the Paper
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data).
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Transfer Image Classification | CN-ImageNet | InternVL-C | Accuracy (Private) | 64.5 | # 1 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | InternVL-G | Image-to-text R@1 | 74.9 | # 1 |
| | | | Image-to-text R@5 | 91.3 | # 2 |
| | | | Image-to-text R@10 | 95.2 | # 3 |
| | | | Text-to-image R@1 | 58.6 | # 1 |
| | | | Text-to-image R@5 | 81.3 | # 2 |
| | | | Text-to-image R@10 | 88.0 | # 2 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | InternVL-C | Image-to-text R@1 | 70.6 | # 4 |
| | | | Image-to-text R@5 | 89.0 | # 6 |
| | | | Image-to-text R@10 | 93.5 | # 6 |
| | | | Text-to-image R@1 | 54.1 | # 3 |
| | | | Text-to-image R@5 | 77.3 | # 4 |
| | | | Text-to-image R@10 | 84.6 | # 5 |
| Zero-shot Image Retrieval | COCO-CN | InternVL-C | R@1 | 68.9 | # 5 |
| | | | R@5 | 91.9 | # 3 |
| | | | R@10 | 96.5 | # 4 |
| Zero-shot Image Retrieval | COCO-CN | InternVL-G | R@1 | 73.8 | # 2 |
| | | | R@5 | 94.4 | # 2 |
| | | | R@10 | 98.1 | # 2 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | InternVL-C | Image-to-text R@1 | 94.7 | # 3 |
| | | | Image-to-text R@5 | 99.6 | # 3 |
| | | | Image-to-text R@10 | 99.9 | # 2 |
| | | | Text-to-image R@1 | 81.7 | # 4 |
| | | | Text-to-image R@5 | 96.0 | # 4 |
| | | | Text-to-image R@10 | 98.2 | # 3 |
| Image-to-Text Retrieval | Flickr30k | InternVL-C-FT (finetuned, w/o ranking) | Recall@1 | 97.2 | # 4 |
| | | | Recall@5 | 100 | # 1 |
| | | | Recall@10 | 100 | # 1 |
| Image-to-Text Retrieval | Flickr30k | InternVL-G-FT (finetuned, w/o ranking) | Recall@1 | 97.9 | # 1 |
| | | | Recall@5 | 100 | # 1 |
| | | | Recall@10 | 100 | # 1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | InternVL-G | Image-to-text R@1 | 95.7 | # 1 |
| | | | Image-to-text R@5 | 99.7 | # 2 |
| | | | Image-to-text R@10 | 99.9 | # 2 |
| | | | Text-to-image R@1 | 85.0 | # 3 |
| | | | Text-to-image R@5 | 97.0 | # 2 |
| | | | Text-to-image R@10 | 98.6 | # 2 |
| Zero-shot Image Retrieval | Flickr30k-CN | InternVL-C | R@1 | 75.1 | # 3 |
| | | | R@5 | 92.9 | # 3 |
| | | | R@10 | 96.4 | # 3 |
| Image Retrieval | Flickr30k-CN | InternVL-C-FT | R@1 | 85.2 | # 2 |
| | | | R@5 | 98.5 | # 2 |
| | | | R@10 | 97.0 | # 7 |
| Image Retrieval | Flickr30k-CN | InternVL-G-FT | R@1 | 85.9 | # 1 |
| | | | R@5 | 98.7 | # 1 |
| | | | R@10 | 97.1 | # 6 |
| Zero-shot Image Retrieval | Flickr30k-CN | InternVL-G | R@1 | 77.7 | # 2 |
| | | | R@5 | 94.8 | # 2 |
| | | | R@10 | 97.3 | # 2 |
| Zero-Shot Transfer Image Classification | Food-101 | InternVL-C | Top 1 Accuracy | 95.3 | # 3 |
| Zero-Shot Transfer Image Classification | ImageNet | InternVL-C | Accuracy (Private) | 83.2 | # 10 |
| Zero-Shot Transfer Image Classification | ImageNet-A | InternVL-C | Accuracy (Private) | 83.8 | # 7 |
| Zero-Shot Transfer Image Classification | ImageNet-Sketch | InternVL-C | Accuracy (Private) | 73.9 | # 5 |
| Zero-Shot Transfer Image Classification | ImageNet V2 | InternVL-C | Accuracy (Private) | 77.3 | # 8 |
| MMR total | MRR-Benchmark | InternVL2-1B | Total Column Score | 237 | # 12 |
| MMR total | MRR-Benchmark | InternVL2-8B | Total Column Score | 368 | # 6 |
| Zero-Shot Video Retrieval | MSR-VTT-full | InternVL-G | text-to-video R@1 | 46.3 | # 1 |
| | | | text-to-video R@5 | 70.5 | # 1 |
| | | | text-to-video R@10 | 79.6 | # 1 |
| | | | video-to-text R@1 | 42.4 | # 2 |
| | | | video-to-text R@5 | 65.9 | # 2 |
| | | | video-to-text R@10 | 75.4 | # 2 |
| Zero-Shot Video Retrieval | MSR-VTT-full | InternVL-C | text-to-video R@1 | 44.7 | # 2 |
| | | | text-to-video R@5 | 68.2 | # 2 |
| | | | text-to-video R@10 | 78.4 | # 2 |
| | | | video-to-text R@1 | 40.2 | # 3 |
| | | | video-to-text R@5 | 63.1 | # 3 |
| | | | video-to-text R@10 | 74.1 | # 3 |
| Zero-Shot Transfer Image Classification | ObjectNet | InternVL-C | Accuracy (Private) | 80.6 | # 6 |
| Visual Question Answering (VQA) | VQA v2 test-dev | InternVL-C | Accuracy | 81.2 | # 9 |
| Zero-shot Image Retrieval | XTD10 | InternVL-G | EN-Recall@10 | 98.6 | # 1 |
| | | | ES-Recall@10 | 97.7 | # 1 |
| | | | FR-Recall@10 | 96.5 | # 1 |
| | | | ZH-Recall@10 | 96.7 | # 1 |
| | | | KO-Recall@10 | 95.1 | # 1 |
| | | | RU-Recall@10 | 94.8 | # 1 |
| | | | JA-Recall@10 | 96.1 | # 1 |
| | | | IT-Recall@10 | 96.9 | # 1 |
| Zero-shot Image Retrieval | XTD10 | InternVL-C | EN-Recall@10 | 97.3 | # 2 |
| | | | ES-Recall@10 | 95.7 | # 2 |
| | | | FR-Recall@10 | 95.1 | # 2 |
| | | | ZH-Recall@10 | 95.6 | # 2 |
| | | | KO-Recall@10 | 92.2 | # 3 |
| | | | RU-Recall@10 | 93.3 | # 2 |
| | | | JA-Recall@10 | 95.5 | # 2 |
| | | | IT-Recall@10 | 96.0 | # 2 |
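The R@K / Recall@K numbers above follow the standard retrieval protocol: each query (an image or a caption) ranks all candidates by embedding similarity, and Recall@K is the fraction of queries whose ground-truth match appears among the top K results. A minimal, self-contained computation is sketched below; random features stand in for real InternVL-C/G embeddings, and a one-to-one image-caption pairing is assumed for simplicity.

```python
# Minimal Recall@K computation for image-to-text retrieval, matching the
# R@1/R@5/R@10 metrics reported above. Random features stand in for real
# InternVL embeddings; query i's ground-truth caption is assumed to be i.
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    # sim[i, j] = similarity between query i and candidate j.
    topk = sim.topk(k, dim=-1).indices                   # (num_queries, k)
    targets = torch.arange(sim.shape[0]).unsqueeze(-1)   # (num_queries, 1)
    hits = (topk == targets).any(dim=-1).float()         # 1 if truth in top-k
    return hits.mean().item()

num_pairs, dim = 1000, 768
img = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=-1)

sim = img @ txt.T  # image-to-text cosine-similarity matrix
for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(sim, k):.3f}")
```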