We demonstrate this framework on 3D pose estimation by proposing a differentiable objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object.
Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks.
#3 best model for Semantic Segmentation on Cityscapes (using extra training data)
In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control.
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e. g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video.
This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit.
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly.
#3 best model for Visual Question Answering on VQA v2
Flow-based generative models (Dinh et al., 2014) are conceptually attractive due to tractability of the exact log-likelihood, tractability of exact latent-variable inference, and parallelizability of both training and synthesis.
#5 best model for Image Generation on ImageNet 64x64
Clone a voice in 5 seconds to generate arbitrary speech in real-time
SOTA for Text-To-Speech Synthesis on LJSpeech (using extra training data)
Our implementation based on Faster-RCNN with a ResNet-101 backbone obtains an mAP of 47. 6% on the COCO dataset for bounding box detection and can process 5 images per second during inference with a single GPU.
SOTA for Object Detection on PASCAL VOC 2007