Open Vocabulary Object Detection
75 papers with code • 4 benchmarks • 6 datasets
Open-vocabulary detection (OVD) aims to generalize beyond the limited set of base classes labeled during the training phase. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference time.
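A minimal sketch of the idea shared by most methods listed below: the detector's fixed classifier is replaced by text embeddings, so the set of detectable classes is whatever vocabulary is supplied at inference. This uses OpenAI's CLIP package; `region_features` is a hypothetical stand-in for any detector's per-box embeddings.

```python
# Sketch of the open-vocabulary classification head common to OVD methods:
# region features are scored against text embeddings, so the class set is
# whatever vocabulary you supply at inference.
# Requires: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# The "open" vocabulary: any class names, including ones never labeled in training.
vocabulary = ["zebra", "fire hydrant", "unicycle"]
prompts = clip.tokenize([f"a photo of a {c}" for c in vocabulary]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(prompts).float()
    text_emb /= text_emb.norm(dim=-1, keepdim=True)

# Hypothetical per-region features from a detector, projected to CLIP's
# embedding size (512 for ViT-B/32) and L2-normalized.
region_features = torch.randn(100, 512, device=device)
region_features /= region_features.norm(dim=-1, keepdim=True)

# Cosine similarity -> per-region class scores over the open vocabulary.
logits = 100.0 * region_features @ text_emb.T
scores, labels = logits.softmax(dim=-1).max(dim=-1)
```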
Most implemented papers
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
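The distillation in the title refers to pushing the detector's region embeddings toward CLIP image embeddings of the cropped proposals, so regions inherit CLIP's open-vocabulary space. A hedged sketch of that loss follows (function and argument names are assumptions, not ViLD's code):

```python
# Sketch of ViLD-style distillation: the detector's (trainable) region
# embeddings are regressed toward frozen CLIP image embeddings of the
# corresponding proposal crops.
import torch.nn.functional as F

def vild_distillation_loss(region_embeddings, clip_crop_embeddings):
    """L1 distillation between detector region embeddings and CLIP image
    embeddings of the proposal crops.

    region_embeddings:     (N, D) from the detector head (trainable)
    clip_crop_embeddings:  (N, D) from CLIP's image encoder (no grad)
    """
    region_embeddings = F.normalize(region_embeddings, dim=-1)
    clip_crop_embeddings = F.normalize(clip_crop_embeddings.detach(), dim=-1)
    return F.l1_loss(region_embeddings, clip_crop_embeddings)
```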
Simple Open-Vocabulary Object Detection with Vision Transformers
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification.
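This model (OWL-ViT) ships in Hugging Face transformers, so a minimal zero-shot detection example is easy to run; method names below follow recent transformers releases and may differ in older versions.

```python
# Zero-shot detection with OWL-ViT via Hugging Face transformers.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")  # any RGB image
queries = [["a photo of a traffic light", "a photo of a bicycle"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```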
Scaling Open-Vocabulary Object Detection
However, with OWL-ST we can scale to over 1B examples, yielding a further large improvement: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (a 43% relative improvement).
PointCLIP: Point Cloud Understanding by CLIP
On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D.
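A hedged sketch of such an inter-view adapter, with assumed dimensions and names: per-view CLIP features (from multi-view projections of the point cloud) are fused into a global feature by a small MLP and blended back into each view residually.

```python
# Sketch of a PointCLIP-style inter-view adapter (shapes are assumptions).
import torch
import torch.nn as nn

class InterViewAdapter(nn.Module):
    def __init__(self, num_views=6, dim=512, bottleneck=128):
        super().__init__()
        self.fuse = nn.Sequential(           # fuse concatenated view features
            nn.Linear(num_views * dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, view_feats):           # (B, V, D) per-view CLIP features
        b, v, d = view_feats.shape
        global_feat = self.fuse(view_feats.reshape(b, v * d))   # (B, D)
        # Residual blend: each view keeps its own feature plus global context.
        return view_feats + global_feat.unsqueeze(1)            # (B, V, D)
```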
Open-Vocabulary DETR with Conditional Matching
To this end, we propose a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image.
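A hedged sketch of the conditioning idea (class and argument names are hypothetical): each object query carries a CLIP embedding of either the class name or an exemplar image, so matching reduces to "does this box match the condition?" rather than a closed-set class prediction.

```python
# Sketch of OV-DETR-style conditional queries.
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    def __init__(self, num_queries=100, d_model=256, clip_dim=512):
        super().__init__()
        self.base_queries = nn.Embedding(num_queries, d_model)
        self.cond_proj = nn.Linear(clip_dim, d_model)  # CLIP -> query space

    def forward(self, clip_condition):
        """clip_condition: (clip_dim,) text or exemplar-image embedding."""
        cond = self.cond_proj(clip_condition)          # (d_model,)
        # Every query carries the condition; the decoder then predicts
        # binary "matches the condition" scores plus boxes.
        return self.base_queries.weight + cond         # (num_queries, d_model)
```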
Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization
Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary.
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
In this paper, we unify CLIP and GPT into a single 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection.
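A sketch of the prompting side of this idea (the prompt texts below are illustrative, not the paper's): 3D-aware descriptions, generated by an LLM in the paper, replace generic templates when building CLIP's text classifier.

```python
# Sketch: LLM-generated, projection-aware prompts encoded with CLIP and
# averaged into one open-vocabulary classifier weight per class.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Stand-ins for GPT-generated descriptions of the class "airplane".
descriptions = [
    "a silhouette depth map of an airplane with two wings",
    "a sparse point cloud rendering of an airplane fuselage",
]
tokens = clip.tokenize(descriptions).to(device)
with torch.no_grad():
    emb = model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)
classifier_weight = emb.mean(dim=0)  # one text weight for "airplane"
```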
X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion
We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images for different object categories, is a feasible way to make Copy-Paste truly scalable.
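A sketch of the zero-shot filtering step (threshold, prompt format, and distractor handling are assumptions): candidate images crawled or generated for a category are kept only if CLIP scores that category confidently against distractor labels.

```python
# Sketch: CLIP as a zero-shot filter for noisily sourced category images.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def keep_image(path, category, distractors, threshold=0.5):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    texts = clip.tokenize(
        [f"a photo of a {category}"] + [f"a photo of a {d}" for d in distractors]
    ).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, texts)
        probs = logits_per_image.softmax(dim=-1)[0]
    return probs[0].item() >= threshold  # index 0 is the target category
```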
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
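One ingredient of that recipe is cropped positional embeddings: during contrastive pretraining, a random region of the positional-embedding grid is cropped and resized back to full size, so position embeddings behave like region crops rather than whole images. A sketch with assumed shapes:

```python
# Sketch of cropped positional embeddings (CPE) for a ViT.
import torch
import torch.nn.functional as F

def cropped_positional_embeddings(pos_embed, grid=14):
    """pos_embed: (1, grid*grid, D) learned position embeddings."""
    d = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, grid, grid, d).permute(0, 3, 1, 2)  # (1, D, H, W)
    # Random crop of the positional grid...
    h = torch.randint(grid // 2, grid + 1, (1,)).item()
    w = torch.randint(grid // 2, grid + 1, (1,)).item()
    top = torch.randint(0, grid - h + 1, (1,)).item()
    left = torch.randint(0, grid - w + 1, (1,)).item()
    crop = pe[:, :, top:top + h, left:left + w]
    # ...resized back to the full grid, then flattened for the ViT.
    crop = F.interpolate(crop, size=(grid, grid), mode="bilinear",
                         align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, grid * grid, d)
```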
Taming Self-Training for Open-Vocabulary Object Detection
This work identifies two challenges of using self-training in OVD: noisy pseudo-labels (PLs) from vision-language models (VLMs) and frequent distribution changes of the PLs.
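A hedged sketch of the two standard remedies for these challenges (the specifics below are assumptions, not this paper's exact method): filter low-confidence pseudo-labels, and generate them with a slowly updated EMA teacher so their distribution shifts gradually rather than with every student step.

```python
# Sketch: confidence filtering plus an EMA teacher for stable pseudo-labels.
import torch

def filter_pseudo_labels(boxes, scores, labels, threshold=0.6):
    """Keep only pseudo-labels the VLM or teacher scores confidently."""
    keep = scores >= threshold
    return boxes[keep], scores[keep], labels[keep]

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights track the student slowly, stabilizing pseudo-labels."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```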