Open Vocabulary Object Detection
56 papers with code • 4 benchmarks • 6 datasets
Open-vocabulary detection (OVD) aims to generalize beyond the limited set of base classes labeled during training. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference time.
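In practice, most OVD methods swap the detector's fixed classifier head for text embeddings of arbitrary category names and score each region proposal by similarity. A minimal sketch of that idea, using toy stand-in vectors rather than a real CLIP text encoder (the `vocab` embeddings and `classify_region` helper are illustrative, not any paper's API):

```python
import numpy as np

def classify_region(region_feat, vocab):
    """Score one region feature against an open vocabulary of text
    embeddings via cosine similarity (toy stand-in for a CLIP encoder)."""
    names = list(vocab)
    T = np.stack([vocab[n] for n in names])           # (V, D) text embeddings
    T = T / np.linalg.norm(T, axis=1, keepdims=True)  # L2-normalize rows
    r = region_feat / np.linalg.norm(region_feat)
    scores = T @ r                                    # cosine similarities
    return names[int(np.argmax(scores))], scores

# Toy 4-D embeddings; "zebra" is a novel class added only at inference.
vocab = {
    "cat":   np.array([1.0, 0.0, 0.0, 0.0]),
    "dog":   np.array([0.0, 1.0, 0.0, 0.0]),
    "zebra": np.array([0.0, 0.0, 1.0, 0.0]),  # never labeled in training
}
region = np.array([0.1, 0.0, 0.9, 0.1])
label, _ = classify_region(region, vocab)
```

Because the vocabulary is just a dictionary of name-to-embedding pairs, new classes can be added at test time without retraining the detector.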
Libraries
Use these libraries to find Open Vocabulary Object Detection models and implementations.

Most implemented papers
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are pretrained CLIP models and image-level supervision.
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection.
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision.
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
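F-VLM keeps the vision-language backbone frozen and, at inference, combines the trained detector's class scores with region scores from the frozen VLM. A hedged sketch of that geometric score fusion; the `alpha` weight and the toy score values are illustrative (F-VLM tunes the weighting separately for base and novel classes):

```python
import numpy as np

def fuse_scores(det_scores, vlm_scores, alpha=0.35):
    """Geometric fusion of detector scores with frozen-VLM region scores:
    base classes lean on the detector, novel classes on the VLM."""
    det_scores = np.asarray(det_scores, dtype=float)
    vlm_scores = np.asarray(vlm_scores, dtype=float)
    return det_scores ** (1 - alpha) * vlm_scores ** alpha

# Toy scores for three classes: the detector is confident on a base class,
# while the frozen VLM rescues a novel class the detector under-scores.
det = np.array([0.90, 0.05, 0.10])   # base, background, novel
vlm = np.array([0.60, 0.10, 0.80])
fused = fuse_scores(det, vlm)
```

The geometric mean keeps a high fused score only when neither source strongly rejects the class, which is why the novel class (high VLM score, low detector score) still ends up above background.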
Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models
Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts.
Open-vocabulary Attribute Detection
The objective of this novel task and benchmark is to probe the object-level attribute information learned by vision-language models.
Learning Object-Language Alignments for Open-Vocabulary Object Detection
In this paper, we propose a novel open-vocabulary object detection framework that learns directly from image-text pairs.
X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion
We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable.
Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space
Specifically, cheap scene graph supervision data can be easily obtained by parsing image language descriptions into semantic graphs.
Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection
Current methods for open-vocabulary object detection (OVOD) rely on a pre-trained vision-language model (VLM) to acquire the recognition ability.