The CVUSA dataset is a cross-view matching benchmark that pairs street-level and aerial views from different regions of the US. The task supports localizing street-view images without GPS coordinates. Google Street View panoramas serve as the ground images, and the matching aerial images, at zoom level 19, are obtained from Microsoft Bing Maps. The dataset comprises 35,532 image pairs for training and 8,884 image pairs for testing; recall is the primary evaluation metric.
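Recall@k here means the fraction of street-view queries whose true aerial counterpart ranks among the k most similar gallery images. A minimal Python sketch of this evaluation, assuming pre-computed, L2-normalised embeddings (array names and sizes are illustrative, not part of the dataset release):

```python
import numpy as np

def recall_at_k(street_emb: np.ndarray, aerial_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of street-view queries whose matching aerial image is in the top-k.

    Assumes row i of street_emb corresponds to row i of aerial_emb and that
    both are L2-normalised, so the dot product equals cosine similarity.
    """
    sims = street_emb @ aerial_emb.T               # (N, N) query-gallery similarities
    correct = np.diag(sims)                        # similarity to the true match
    ranks = (sims > correct[:, None]).sum(axis=1)  # gallery items scored above the true match
    return float((ranks < k).mean())

# Toy usage with 1,000 random 512-d embeddings (the real CVUSA test set has 8,884 pairs):
rng = np.random.default_rng(0)
q = rng.normal(size=(1000, 512)); q /= np.linalg.norm(q, axis=1, keepdims=True)
g = rng.normal(size=(1000, 512)); g /= np.linalg.norm(g, axis=1, keepdims=True)
print(recall_at_k(q, g, k=1))
```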
14 PAPERS • 2 BENCHMARKS
The CVACT dataset is a cross-view matching benchmark that pairs street-level and aerial views from Canberra, Australia. The task supports localizing street-view images without GPS coordinates. Google Street View panoramas serve as the ground images, and the matching aerial images are obtained from the Google Maps API. The dataset comprises 35,532 image pairs for training and 8,884 image pairs for evaluation; recall is the primary evaluation metric. To further test generalization relative to CVUSA, CVACT additionally provides a much larger test set of 92,802 images.
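With a gallery of that size, the full query-gallery similarity matrix becomes expensive to hold in memory, so the same recall computation is usually done in chunks. A sketch of one way to do this, under the same assumptions as the example above (chunk size and names are illustrative):

```python
import numpy as np

def recall_at_k_chunked(street_emb: np.ndarray, aerial_emb: np.ndarray,
                        k: int = 1, chunk: int = 2048) -> float:
    """Recall@k without materialising the full N x N similarity matrix.

    Row i of street_emb is assumed to match row i of aerial_emb;
    both arrays are assumed L2-normalised.
    """
    n, hits = len(street_emb), 0
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        sims = street_emb[start:end] @ aerial_emb.T                    # (chunk, N)
        correct = sims[np.arange(end - start), np.arange(start, end)]  # true-match scores
        ranks = (sims > correct[:, None]).sum(axis=1)
        hits += int((ranks < k).sum())
    return hits / n
```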
8 PAPERS • 2 BENCHMARKS
Similar to CVUSA and CVACT, the VIGOR dataset contains satellite and street imagery, and the task is to match street images to satellite images in order to localize them. Data were collected from four major US cities: San Francisco, New York, Seattle, and Chicago. Unlike the earlier datasets, VIGOR defines two settings: a same-area setting, in which images from all four cities appear in both the training and validation splits, and a cross-area setting, in which training uses two cities (New York and Seattle) and evaluation uses the other two (Chicago and San Francisco). In addition, the dataset contains semi-positive images, which lie very close to an actual ground-truth match and therefore act as distractors for the matching task. In total, the dataset consists of 90,618 satellite images and 105,214 street images.
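The two settings differ only in how cities are assigned to splits. A hypothetical sketch of that assignment (the pair lists and the 80/20 ratio are illustrative; in practice the official split files shipped with VIGOR should be used):

```python
from typing import Dict, List, Tuple

# City names come from the VIGOR description; the split logic below is a
# hypothetical sketch, not the dataset's official split files.
def make_splits(pairs_by_city: Dict[str, List[Tuple[str, str]]],
                setting: str = "same-area"):
    """Return (train_pairs, eval_pairs) of (street_image, satellite_image) paths.

    same-area : every city contributes to both training and evaluation.
    cross-area: train on New York + Seattle, evaluate on Chicago + San Francisco.
    """
    if setting == "same-area":
        train, evaluation = [], []
        for city, pairs in pairs_by_city.items():
            cut = int(0.8 * len(pairs))        # illustrative per-city 80/20 split
            train += pairs[:cut]
            evaluation += pairs[cut:]
    elif setting == "cross-area":
        train = pairs_by_city["NewYork"] + pairs_by_city["Seattle"]
        evaluation = pairs_by_city["Chicago"] + pairs_by_city["SanFrancisco"]
    else:
        raise ValueError(f"unknown setting: {setting}")
    return train, evaluation
```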
6 PAPERS • 2 BENCHMARKS
The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.
2 PAPERS • 1 BENCHMARK
The standard evaluation protocol of the Cross-View Time dataset allows certain cameras to be shared between the training and testing sets. This protocol emulates scenarios in which we need to verify the authenticity of images from a particular set of devices and locations. Given the ubiquity of surveillance systems (CCTV), this is a common scenario, especially for big cities and high-visibility events (e.g., protests, concerts, terrorist attempts, sports events). In such cases, we can leverage the availability of historical photographs from that device and collect additional images from previous days, months, and years. This allows the model to better capture how time influences the appearance of that specific place, likely leading to better verification accuracy. However, there may also be cases in which the data originates from heterogeneous sources, such as social media; in that setting, it is essential that models are optimized on cameras different from those used for evaluation.
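A sketch contrasting the two protocols described above: a camera-shared split that trains on a camera's historical images and tests on its later ones, versus a camera-disjoint split for heterogeneous sources. The Photo fields and function names are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set, Tuple

@dataclass
class Photo:
    path: str
    camera_id: str       # identifier of the (fixed) capturing camera
    timestamp: datetime

def camera_shared_split(photos: List[Photo],
                        cutoff: datetime) -> Tuple[List[Photo], List[Photo]]:
    """Camera-shared protocol sketch: the same cameras appear in both sets,
    but training uses historical images taken before `cutoff` and testing
    uses the later ones."""
    train = [p for p in photos if p.timestamp < cutoff]
    test = [p for p in photos if p.timestamp >= cutoff]
    return train, test

def camera_disjoint_split(photos: List[Photo],
                          test_cameras: Set[str]) -> Tuple[List[Photo], List[Photo]]:
    """Alternative protocol for heterogeneous sources (e.g., social media):
    evaluation cameras are never seen during training."""
    train = [p for p in photos if p.camera_id not in test_cameras]
    test = [p for p in photos if p.camera_id in test_cameras]
    return train, test
```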
1 PAPER • 1 BENCHMARK