Multi-Task Learning (MTL) aims to improve model generalization, and thereby performance, by sharing representations between related tasks.
We benchmark these improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.
Self-supervised learning has recently shown great potential in vision tasks through contrastive learning, which aims to discriminate each image, or instance, in the dataset.
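Instance discrimination of this kind is typically trained with a contrastive (InfoNCE-style) objective that pulls two augmented views of the same image together and pushes all other instances away. As a minimal sketch (the embeddings and temperature here are illustrative, not from any specific paper):

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss for one anchor: maximize similarity to its
    positive view relative to all negative instances."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))

anchor = np.array([1.0, 0.0])          # embedding of one view
view = np.array([0.9, 0.1])            # embedding of another view of the same image
others = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]  # other instances

# Matching views give a low loss; a mismatched "positive" gives a high one.
loss_same = info_nce_loss(anchor, view, others)
loss_diff = info_nce_loss(anchor, others[0], [view] + others[1:])
```

Minimizing this loss over many anchors is what drives each instance to occupy its own region of the embedding space.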
Robust multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars.
Caricature is an artistic drawing created to abstract or exaggerate facial features of a person.
Existing weakly-supervised semantic segmentation methods using image-level annotations typically rely on initial responses to locate object regions.
Obtaining object response maps is one important step to achieve weakly-supervised semantic segmentation using image-level labels.
We learn both 3D point cloud reconstruction and pose estimation networks in a self-supervised manner, making use of a differentiable point cloud renderer to train with 2D supervision.
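The key property that makes 2D supervision usable is that the rendering step is differentiable: if projecting the 3D points to image coordinates is plain arithmetic, a 2D loss can backpropagate to the points and the pose. A hypothetical pinhole-projection sketch (the focal length and point values are made up for illustration):

```python
import numpy as np

def project_points(points_3d, focal=1.0):
    """Pinhole projection of an (N, 3) point cloud to (N, 2) image
    coordinates.  Since this is pure arithmetic, gradients from a 2D
    loss can flow back to the 3D points and camera parameters."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([focal * x / z, focal * y / z], axis=1)

cloud = np.array([[0.0, 0.0, 2.0],   # a point on the optical axis
                  [1.0, 1.0, 2.0]])  # an off-axis point
proj = project_points(cloud)
```

A full differentiable renderer additionally rasterizes the projected points into an image, but the projection above is the step that ties the 3D prediction to the 2D loss.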
This intermediate domain is constructed by translating the source images to mimic the ones in the target domain.
However, current state-of-the-art face parsing methods require large amounts of pixel-level labeled data, and collecting such annotations for caricatures is tedious and labor-intensive.
Parts provide a good intermediate representation of objects that is robust to camera, pose, and appearance variations.
Specifically, given a foreground image and a background image, our proposed method automatically generates a set of blended photos, with scores indicating their aesthetic quality produced by the proposed quality network and policy network.
In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation.
We propose a method for semi-supervised semantic segmentation using an adversarial network.
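In such adversarial setups, a discriminator is trained to tell predicted segmentation maps from ground-truth ones, while the segmentation network is trained to fool it; on unlabeled images, only the adversarial term supervises the segmenter. A minimal sketch of the two loss terms, assuming the discriminator outputs a "looks real" probability (the scores below are illustrative):

```python
import numpy as np

def adversarial_seg_losses(d_on_pred, d_on_gt, eps=1e-8):
    """Binary cross-entropy terms of an adversarial segmentation setup.
    d_on_pred: discriminator score on a predicted label map.
    d_on_gt:   discriminator score on a ground-truth label map.
    Returns (discriminator loss, segmenter's adversarial loss)."""
    d_loss = -np.log(d_on_gt + eps) - np.log(1.0 - d_on_pred + eps)
    g_loss = -np.log(d_on_pred + eps)  # segmenter wants D to say "real"
    return d_loss, g_loss

# A confident discriminator (prediction looks fake, GT looks real)
# has a low loss, while the segmenter's adversarial loss is high...
d_lo, g_hi = adversarial_seg_losses(d_on_pred=0.1, d_on_gt=0.9)
# ...and once the segmenter's maps fool D, the roles reverse.
d_hi, g_lo = adversarial_seg_losses(d_on_pred=0.9, d_on_gt=0.9)
```

In practice both scores come from a convolutional discriminator over the label maps, and the segmenter's total loss adds this adversarial term to the usual cross-entropy on labeled pixels.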
We present a scene parsing method that utilizes global context information based on both parametric and non-parametric models.
In addition, we apply a filter to the refined score map that identifies the best connected region using spatial and temporal consistency in the video.
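The spatial part of such a filter can be sketched as keeping only the largest connected region of the binarized score map, discarding small spurious responses. A self-contained sketch using 4-connectivity (the mask below is illustrative; the actual method also uses temporal consistency across frames):

```python
from collections import deque

def largest_connected_region(mask):
    """Keep only the largest 4-connected region of a binary mask
    (list of lists of 0/1); smaller spurious blobs are zeroed out."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                comp, queue = [], deque([(i, j)])
                seen[i][j] = True
                while queue:            # BFS over one connected component
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = 1
    return out

mask = [[1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1]]
clean = largest_connected_region(mask)  # keeps the 3-pixel left blob
```

Extending this to video amounts to requiring the surviving region to overlap consistently with the regions kept in neighboring frames.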