The Places dataset was proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories, with more than 5,000 images per category.
751 PAPERS • 4 BENCHMARKS
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. To date, ScanNet v2, the newest version of ScanNet, has collected 1513 annotated scans with approximately 90% surface coverage. For the semantic segmentation task, the dataset is labeled with 20 classes of annotated 3D voxelized objects.
689 PAPERS • 15 BENCHMARKS
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, which include stuff classes such as sky, road, and grass, and discrete objects such as person, car, and bed.
606 PAPERS • 16 BENCHMARKS
The Places205 dataset is a large-scale scene-centric dataset with 205 common scene categories. The training set contains around 2,500,000 images from these categories, with between 5,000 and 15,000 images per scene category. The validation set contains 100 images per category (a total of 20,500 images), and the test set contains 200 images per category (a total of 41,000 images).
467 PAPERS • 1 BENCHMARK
The SUN RGBD dataset contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. As many as 700 object categories are labeled. The training and testing sets contain 5285 and 5050 images, respectively.
302 PAPERS • 12 BENCHMARKS
The Places365 dataset is a scene recognition dataset. It is composed of 10 million images spanning 434 scene classes. There are two versions of the dataset: Places365-Standard, with 1.8 million training and 36000 validation images from 365 scene classes, and Places365-Challenge-2016, in which the training set is extended with up to 6.2 million extra images, including 69 new scene classes (leading to a total of 8 million training images from 434 scene classes).
20 PAPERS • 6 BENCHMARKS
The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. Of these, 397 well-sampled categories are used to evaluate numerous state-of-the-art algorithms for scene recognition.
19 PAPERS • 5 BENCHMARKS
This is a new and challenging video database of dynamic scenes that more than doubles the size of those previously available. The dataset is explicitly split into two subsets of equal size, containing videos with and without camera motion, to allow for systematic study of how this variable interacts with the defining dynamics of the scene itself.
11 PAPERS • 1 BENCHMARK
This is the original data provided by MIT.
6 PAPERS • 1 BENCHMARK
This annotated dataset was released to enable dynamic scene classification. It includes 80 hours of diverse, high-quality driving video clips collected in the San Francisco Bay Area, with temporal annotations for road places, road types, weather, and road surface conditions.
4 PAPERS • NO BENCHMARKS YET
The AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) is a multimodal learning dataset that aims to explore the contribution of both audio and conventional visual messages to scene recognition. In summary, the dataset contains 5075 pairs of geotagged aerial images and sounds, classified into 13 scene classes: airport, sports land, beach, bridge, farmland, forest, grassland, harbor, lake, orchard, residential area, shrub land, and train station.
2 PAPERS • NO BENCHMARKS YET
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset designed for object classification on mobile robots operating in a changing environment (such as a household), where it is important to learn new, never-before-seen objects on the fly. The dataset can also be used for other learning use cases, such as instance segmentation or depth estimation, or wherever household objects or continual learning are of interest.
1 PAPER • 2 BENCHMARKS
MAI is a dataset for multi-scene recognition in single aerial images. It consists of 3,923 labelled large-scale images from Google Earth imagery covering the United States, Germany, and France. Each image is 512 × 512 pixels, and spatial resolutions vary from 0.3 m/pixel to 0.6 m/pixel. After the aerial images were captured, multiple scene-level labels were manually assigned to each image from a total of 24 scene categories: apron, baseball, beach, commercial, farmland, woodland, parking lot, port, residential, river, storage tanks, sea, bridge, lake, park, roundabout, soccer field, stadium, train station, works, golf course, runway, sparse shrub, and tennis court.
1 PAPER • NO BENCHMARKS YET
This dataset is an extremely challenging set of more than 5,000 original Indian food images captured and crowdsourced from more than 800 urban and rural areas, where each image is manually reviewed and verified by computer vision professionals at ****DC Labs.
0 PAPERS • NO BENCHMARKS YET
This dataset is an extremely challenging set of more than 3,000 original stair images captured and crowdsourced from more than 500 urban and rural areas, where each image is manually reviewed and verified by computer vision professionals at Datacluster Labs.
0 PAPERS • NO BENCHMARKS YET