In this work, we address the task of unsupervised domain adaptation (UDA) for semantic segmentation in the presence of multiple target domains: the objective is to train a single model that can handle all these domains at test time.
Despite the recent progress of generative adversarial networks (GANs) at synthesizing photo-realistic images, producing complex urban scenes remains a challenging problem.
While most works focus only on the image modality, there are many important multi-modal datasets.
In this paper, we introduce a novel target criterion for model confidence, namely the true class probability (TCP).
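As a rough illustration, the TCP of a prediction is usually taken to be the softmax probability the model assigns to the ground-truth class (as opposed to the maximum class probability). A minimal sketch, assuming softmax outputs are already available (the function name is ours, not from the paper):

```python
import numpy as np

def true_class_probability(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return the probability assigned to the true class of each sample.

    probs:  (N, C) array of softmax outputs.
    labels: (N,) array of ground-truth class indices.
    """
    return probs[np.arange(len(labels)), labels]

# Example: two samples, three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 2])
print(true_class_probability(probs, labels))  # [0.7 0.1]
```

Note that TCP requires the true label, so it is only computable on annotated data; at test time it must be estimated, which is what motivates learning a confidence model.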
In this paper, we propose a multi-task learning model that predicts pedestrian actions and crossing intent, and forecasts their future paths from video sequences.
While fully-supervised deep learning yields good models for urban scene semantic segmentation, these models struggle to generalize to new environments with, for instance, different lighting or weather conditions.
In this work, we define and address a novel domain adaptation (DA) problem in semantic scene segmentation, where the target domain not only exhibits a data distribution shift w.r.t.
In this work, we explore how to learn from multiple modalities and propose cross-modal UDA (xMUDA), where we assume the presence of 2D images and 3D point clouds for 3D semantic segmentation.
Semantic segmentation models are limited in their ability to scale to large numbers of object classes.
As a result, the performance of the trained semantic segmentation model on the target domain is boosted.
Our goal in this paper is to learn discriminative models for the temporal evolution of object appearance and to use such models for object detection.
Semantic segmentation is a key problem for many computer vision tasks.
This paper proposes a novel memory-based online video representation that is efficient, accurate and predictive.
First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image.