Furthermore, the holistic features are refined by the multi-scale temporal relations in a novel fusion module for yielding more discriminative video representations.
By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery.
We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images and learning from noisy labels for this task, respectively.
The detected keypoints are subsequently reformulated as a closed polygon, which is the semantic boundary of the building.
With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition.
In this paper, we introduce a novel problem of event recognition in unconstrained aerial videos in the remote sensing community and present a large-scale, human-annotated dataset, named ERA (Event Recognition in Aerial videos), consisting of 2, 864 videos each with a label from 25 different classes corresponding to an event unfolding 5 seconds.
Aerial image scene classification is a fundamental problem for understanding high-resolution remote sensing images and has become an active research task in the field of remote sensing due to its important role in a wide range of applications.