Images Speak in Images: A Generalist Painter for In-Context Visual Learning

baaivision/painter 5 Dec 2022

In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution: redefine the output of core vision tasks as images, and specify task prompts as images as well.

Keypoint Detection Semantic Segmentation

DAMO-YOLO : A Report on Real-Time Object Detection Design

tinyvision/damo-yolo 23 Nov 2022

In this report, we present a fast and accurate object detection method dubbed DAMO-YOLO, which achieves higher performance than the state-of-the-art YOLO series.

Neural Architecture Search object-detection +1

DiffusionInst: Diffusion Model for Instance Segmentation

chenhaoxing/DiffusionInst 6 Dec 2022

This paper proposes DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process.

Image Denoising Image Generation +3

Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model

wyhuai/ddnm 1 Dec 2022

Most existing Image Restoration (IR) models are task-specific and cannot be generalized to different degradation operators.

Colorization Deblurring +7

SimVTP: Simple Video Text Pre-training with Masked Autoencoders

mayuelala/simvtp 7 Dec 2022

This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders.

Contrastive Learning Text Matching

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

opengvlab/internvideo 6 Dec 2022

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

Action Classification Action Recognition +6

Melody transcription via generative pre-training

chrisdonahue/sheetsage 4 Dec 2022

The combination of generative pre-training and a new dataset for this task results in 77% stronger performance on melody transcription relative to the strongest available baseline.

Chord Recognition Information Retrieval +2

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

baaivision/eva 14 Nov 2022

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data.

Ranked #1 on Instance Segmentation on LVIS v1.0 val (using extra training data)

Action Classification Action Recognition +7

Robust Speech Recognition via Large-Scale Weak Supervision

openai/whisper Preprint 2022

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.

Robust Speech Recognition Spoken language identification

