In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and specify task prompts as also images.
In this report, we present a fast and accurate object detection method dubbed DAMO-YOLO, which achieves higher performance than the state-of-the-art YOLO series.
Ranked #29 on Real-Time Object Detection on COCO
This paper proposes DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process.
Ranked #33 on Instance Segmentation on COCO test-dev
This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders.
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Zero-Shot Video Retrieval on VATEX
The combination of generative pre-training and a new dataset for this task results in $77$% stronger performance on melody transcription relative to the strongest available baseline.
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data.
Ranked #1 on Image Classification on ImageNet (finetuned)
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
Ranked #1 on Speech Recognition on Tedlium