Aligning Bag of Regions for Open-Vocabulary Object Detection

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, even though the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of a bag of regions rather than individual regions alone. The proposed method groups contextually interrelated regions into a bag. The embeddings of the regions in a bag are treated as embeddings of words in a sentence and are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is trained to align with the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
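The core mechanism described above (projecting region features into pseudo-word embeddings, feeding them through the VLM text encoder to get a single bag-of-regions embedding, and aligning that embedding with a frozen VLM's embedding of the image region covering the bag) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation (see the linked ovdet repository for that): the module names, the simple linear projection, the stand-in transformer encoders, and the cosine-distance alignment loss are all assumptions made for brevity; in practice frozen CLIP text and image encoders and a contrastive loss over many bags would be used.

```python
# Minimal sketch of bag-of-regions alignment (hypothetical, simplified).
# Stand-in encoders are used in place of a real VLM (e.g. CLIP); in practice
# the text/image encoders would come from a frozen pre-trained VLM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BagOfRegionsAligner(nn.Module):
    def __init__(self, region_dim=256, embed_dim=512):
        super().__init__()
        # Projects detector region features to "pseudo-word" embeddings that
        # the VLM text encoder can consume (assumed to be a linear layer here).
        self.to_pseudo_words = nn.Linear(region_dim, embed_dim)
        # Stand-in for the VLM text encoder; a frozen CLIP text transformer
        # would be used in the real method.
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, region_feats):
        # region_feats: (num_regions_in_bag, region_dim) -- one bag of
        # contextually interrelated regions.
        words = self.to_pseudo_words(region_feats).unsqueeze(0)   # (1, N, D)
        encoded = self.text_encoder(words)                        # (1, N, D)
        # Pool the "sentence" of regions into a single bag embedding.
        bag_embedding = encoded.mean(dim=1).squeeze(0)            # (D,)
        return F.normalize(bag_embedding, dim=-1)


def alignment_loss(bag_embedding, teacher_embedding):
    # teacher_embedding: embedding of the image crop covering the bag,
    # produced by a frozen VLM image encoder (not implemented here).
    # Cosine distance is shown for brevity instead of a contrastive loss.
    teacher_embedding = F.normalize(teacher_embedding, dim=-1)
    return 1.0 - (bag_embedding * teacher_embedding).sum()


if __name__ == "__main__":
    aligner = BagOfRegionsAligner()
    regions = torch.randn(5, 256)   # 5 region features forming one bag
    teacher = torch.randn(512)      # frozen VLM embedding of the crop
    loss = alignment_loss(aligner(regions), teacher)
    loss.backward()
    print(float(loss))
```

Only the detector-side modules receive gradients in this sketch; the teacher embedding plays the role of the frozen VLM target that the bag-of-regions embedding is aligned to.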


Results from the Paper
Ranked #7 on Open Vocabulary Object Detection on MSCOCO (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Open Vocabulary Object Detection | LVIS v1.0 | BARON | AP novel (LVIS base training) | 22.6 | #11 |
| Open Vocabulary Object Detection | MSCOCO | BARON | AP 0.5 | 42.7 | #7 |