The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
In particular, annotation errors, dataset size, and level of challenge are addressed: new annotations for both datasets are created with extra attention to the reliability of the ground truth.
We present MorphNet, an approach to automate the design of neural network structures.
In this work, we establish dense correspondences between RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation.
#2 best model for Pose Estimation on DensePose-COCO
To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model.
SOTA for Image-to-Image Translation on RaFD
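The single-model, multi-domain claim above rests on conditioning the generator on a target-domain label rather than training one generator per domain pair. A minimal sketch of that conditioning step, assuming a one-hot label appended as constant spatial channels (function name and shapes are illustrative, not StarGAN's actual code):

```python
import numpy as np

def condition_on_domain(image, domain_label, num_domains):
    """Concatenate a one-hot target-domain label to an image as constant
    spatial channels, so a single generator can be steered toward any of
    `num_domains` output domains.
    image: (C, H, W) array; domain_label: int in [0, num_domains)."""
    c, h, w = image.shape
    onehot = np.zeros((num_domains, h, w), dtype=image.dtype)
    onehot[domain_label] = 1.0  # broadcast the label over all spatial positions
    return np.concatenate([image, onehot], axis=0)  # (C + num_domains, H, W)

# A 3-channel 4x4 image conditioned on domain 2 of 5 gains 5 label channels.
x = np.zeros((3, 4, 4), dtype=np.float32)
y = condition_on_domain(x, 2, 5)
```

The generator then sees the label at every spatial location, which is what lets one set of weights perform all of the dataset's translations.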
We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs).
#2 best model for Image-to-Image Translation on ADE20K-Outdoor Labels-to-Photos
The improvements in recent CNN-based object detection works, from R-CNN and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN and RetinaNet, mainly come from new network architectures, new frameworks, or novel loss designs.
In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent feature maps.
#2 best model for Semantic Segmentation on ADE20K
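The "selectively highlights class-dependent feature maps" step above amounts to input-dependent per-channel gating. A simplified sketch of that idea, assuming global average pooling stands in for the module's learned codeword encoding and `w`, `b` are hypothetical learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gating(features, w, b):
    """Reweight each feature-map channel by an input-dependent factor,
    mimicking how a context vector can highlight or suppress channels.
    features: (C, H, W); w: (C, C) weight; b: (C,) bias."""
    context = features.mean(axis=(1, 2))     # (C,) global context summary
    gamma = sigmoid(w @ context + b)         # (C,) per-channel gates in (0, 1)
    return features * gamma[:, None, None]   # gated feature maps, same shape

# With zero weights every gate is sigmoid(0) = 0.5, halving all channels.
f = np.ones((2, 3, 3), dtype=np.float32)
g = channel_gating(f, np.zeros((2, 2)), np.zeros(2))
```

Because the gates depend on a scene-level summary, channels tied to classes unlikely in the current scene can be down-weighted globally, which is the intuition the module builds on.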