In this context, we introduce a global-scale, high-quality image-text dataset for remote sensing, providing natural language descriptions for Sentinel-2 data to facilitate the understanding of satellite imagery for common users.
Subseasonal forecasting, which is pivotal for agriculture, water resource management, and early warning of disasters, faces challenges due to the chaotic nature of the atmosphere.
Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks.
Current remote sensing foundation models typically specialize in a single modality or a specific spatial resolution range, limiting their versatility for downstream datasets.
To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization.
To tackle this problem, we propose a method for monocular height estimation from optical imagery, which is currently one of the richest sources of remote sensing data.
In this context, few-shot object detection (FSOD) has emerged as a promising direction, which aims at enabling the model to detect novel objects with only few of them annotated.
We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning.
Motivated by this, we explore a geometric deep learning-based temporal Graph Convolutional Network (GCN) for precipitation nowcasting.
To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task.
We present PolyGNN, a polyhedron-based graph neural network for 3D building reconstruction from point clouds.
We present the RSSOD-Bench dataset for salient object detection (SOD) in optical remote sensing imagery.
Towards a fair and comprehensive analysis of existing methods, the proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data.
Ranked #1 on Semantic Segmentation on GAMUS
Self-supervised pre-training bears potential to generate expressive representations without human annotation.
Ranked #1 on Multi-Label Image Classification on BigEarthNet (official test set) (using extra training data)
Furthermore, a new platform for Earth observation, termed EarthNets, is released as a means of achieving a fair and consistent evaluation of deep learning methods on remote sensing data.
To address this problem, we propose to aggregate the learnable covariance matrices with a deformable 4D Transformer to effectively predict the segmentation map.
Ranked #1 on Few-Shot Semantic Segmentation on FSS-1000 (5-shot)
Furthermore, a novel unsupervised semantic segmentation task based on height estimation is first introduced in this work.
Therefore, we propose a new benchmark dataset to study the transferability of height estimation models in a cross-dataset setting.
In order to provide every user with flexible access to change information and help them better understand land-cover changes, we introduce a novel task: change detection-based visual question answering (CDVQA) on multi-temporal aerial images.
Discriminative local theme-level and object-level representations can be selected with the DLFS module from the spatially-correlated multi-modal RGB-D features.
Extensive experiments demonstrate the proposed CM is efficient and robust to fit arbitrary-shaped text instances, and also validate the effectiveness of MPF and constraints loss for discriminative text features recognition.
The main contributions of this work are as follows: 1) a novel VCD module is proposed, which exploits learnable Gaussian kernels to enable feature learning with structured adaptive-context; 2) variational Bayesian probabilistic modeling is introduced for the training of VCD module, which can make it continuous and more stable; 3) a perspective-aware guidance module is designed to take advantage of multi-modal information for RGB-D segmentation.
Ranked #1 on Scene Parsing on Cityscapes test
Our contributions are as follows: 1) We propose a multi-resolution feature fusion network architecture which exploits densely connected deconvolution layers with skip connections, and can learn more effective features for the small size object; 2) We frame the traffic sign detection as a spatial sequence classification and regression task, and propose a vertical spatial sequence attention (VSSA) module to gain more context information for better detection performance.