We introduce AutoRF, a new approach for learning neural 3D object representations in which each object in the training set is observed from only a single view.
Most deep UDA approaches operate in a single-source, single-target scenario, i.e., they assume that the source and the target samples arise from a single distribution.
We introduce the problem of weakly supervised Multi-Object Tracking and Segmentation, i.e., joint weakly supervised instance segmentation and multi-object tracking, in which we do not provide any kind of mask annotation.
Crop-based training strategies decouple training resolution from GPU memory consumption, allowing the use of large-capacity panoptic segmentation networks on multi-megapixel images.
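The crop-based idea above can be illustrated with a minimal sketch: training always sees fixed-size crops, so memory stays constant regardless of the full image resolution. The crop size and function names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_crop(image, labels, crop_size=512, rng=None):
    """Sample an aligned random crop from an image/label pair.

    Training on fixed-size crops keeps GPU memory bounded no matter
    how large the full-resolution input is. `crop_size` is a
    hypothetical hyperparameter chosen for illustration.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    img_crop = image[top:top + crop_size, left:left + crop_size]
    lbl_crop = labels[top:top + crop_size, left:left + crop_size]
    return img_crop, lbl_crop

# A 4-megapixel image still yields constant-memory 512x512 crops.
image = np.zeros((2000, 2000, 3), dtype=np.uint8)
labels = np.zeros((2000, 2000), dtype=np.int64)
img_c, lbl_c = sample_crop(image, labels)
```

In practice the crop would also be augmented (scaling, flipping) before being fed to the network; the point here is only that the training tensor size is decoupled from the source resolution.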
Pseudo-LiDAR-based methods for monocular 3D object detection have received considerable attention in the community due to the performance gains exhibited on the KITTI3D benchmark, in particular on the commonly reported validation split.
In this work we review the coarse-to-fine spatial feature pyramid concept, which is used in state-of-the-art optical flow estimation networks to make exploration of the pixel flow search space computationally tractable and efficient.
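As a rough sketch of the coarse-to-fine pyramid idea, the two operations involved are building downsampled feature levels and upsampling an estimated flow field to the next finer level (doubling both its resolution and its displacement magnitudes). This is a generic illustration, not the architecture from the paper.

```python
import numpy as np

def build_pyramid(img, levels=3):
    """Average-pool a 2D map into a coarse-to-fine pyramid (coarsest first)."""
    pyr = [img]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        # 2x2 average pooling; trim odd borders first.
        pooled = pyr[-1][:h - h % 2, :w - w % 2] \
            .reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyr.append(pooled)
    return pyr[::-1]

def upsample_flow(flow):
    """Double the flow field's resolution and scale displacements by 2,
    so a coarse estimate initializes the search at the finer level."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

pyr = build_pyramid(np.arange(96, dtype=float).reshape(8, 12))
up = upsample_flow(np.ones((2, 3, 2)))
```

Searching a small window at each level, seeded by the upsampled coarse flow, is what makes large displacements tractable: the effective search range grows exponentially with the number of levels while the per-level cost stays constant.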
While expensive LiDAR and stereo camera rigs have enabled the development of successful 3D object detection methods, monocular RGB-only approaches lag far behind.
Training MOTSNet with our automatically extracted data leads to significantly improved sMOTSA scores on the novel KITTI MOTS dataset (+1.9%/+7.5% on cars/pedestrians), and MOTSNet improves by +4.1% over previously best methods on the MOTSChallenge dataset.
In this paper, we introduce a traffic sign benchmark dataset of 100K street-level images from around the world that encapsulates diverse scenes, wide coverage of geographical locations, and varying weather and lighting conditions, and covers more than 300 manually annotated traffic sign classes.
In this paper we propose an approach for monocular 3D object detection from a single RGB image, which leverages a novel disentangling transformation for 2D and 3D detection losses and a novel, self-supervised confidence score for 3D bounding boxes.
In this work we introduce a novel, CNN-based architecture that can be trained end-to-end to deliver seamless scene segmentation results.
We propose a method for predicting the 3D shape of a deformable surface from a single view.
Our approach is based on the introduction of two main components, which can be embedded into any existing CNN architecture: (i) a side branch that automatically computes the assignment of a source sample to a latent domain and (ii) novel layers that exploit domain membership information to appropriately align the distribution of the CNN internal feature representations to a reference distribution.
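The two components above can be sketched together in a few lines: a side branch produces soft latent-domain memberships, and features are then standardized with membership-weighted per-domain statistics. This is a minimal numpy illustration of the idea, assuming a simple softmax side branch and mean/variance alignment; it is not the paper's exact layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_domain_align(feats, domain_logits, eps=1e-5):
    """Align features using soft latent-domain assignments.

    feats:         (N, C) batch of features
    domain_logits: (N, D) output of a hypothetical side branch
    Each sample is standardized with a membership-weighted mixture of
    per-domain statistics (a sketch of domain-membership-conditioned
    normalization, not the published layer).
    """
    w = softmax(domain_logits)             # (N, D) soft memberships
    wsum = w.sum(axis=0) + eps             # (D,)
    mu = (w.T @ feats) / wsum[:, None]     # (D, C) per-domain means
    var = (w.T @ feats**2) / wsum[:, None] - mu**2
    mu_i = w @ mu                          # (N, C) per-sample mixture stats
    var_i = w @ var
    return (feats - mu_i) / np.sqrt(var_i + eps)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4))
out = latent_domain_align(feats, np.zeros((100, 1)))
```

With a single latent domain (as in the usage line) this reduces to ordinary batch standardization; with several domains, each sample is normalized toward the statistics of the domains it is softly assigned to.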
Also, we demonstrate how frequently used checkpointing approaches can be made computationally as efficient as InPlace-ABN.
In this paper we are interested in recognizing human actions from sequences of 3D skeleton data.
Here we take a different route, proposing to align the learned representations by embedding in any given network specific Domain Alignment Layers, designed to match the source and target feature distributions to a reference one.
The empirical fact that classifiers trained on given data collections perform poorly when tested on data acquired in different settings is explained theoretically in domain adaptation through a shift between the distributions of the source and target domains.