Specifically, we show that a single transformer-based model, with a single set of weights, trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance.
We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks.
Ranked #21 on Self-Supervised Image Classification on ImageNet
The Predictive Information is the mutual information between the past and the future, I(X_past; X_future).
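For reference, this quantity can be written out using the standard definition of mutual information. The notation below (the joint density p and the entropy H) is assumed here for illustration and is not taken from the listing:

```latex
\begin{align}
I(X_{\text{past}}; X_{\text{future}})
  &= \mathbb{E}_{p(x_{\text{past}},\, x_{\text{future}})}
     \left[ \log \frac{p(x_{\text{past}}, x_{\text{future}})}
                      {p(x_{\text{past}})\, p(x_{\text{future}})} \right] \\
  &= H(X_{\text{future}}) - H(X_{\text{future}} \mid X_{\text{past}}).
\end{align}
```

The second line reads as the reduction in uncertainty about the future once the past is known; it is zero exactly when past and future are independent.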
To aid RL researchers and production users with the evaluation and improvement of reliability, we propose a set of metrics that quantitatively measure different aspects of reliability.
In this work, we investigate the problem of grounding language commands as reward functions using inverse reinforcement learning, and argue that language-conditioned rewards are more transferable than language-conditioned policies to new environments.
We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision.
Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image.
Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of an image.
Ranked #3 on Colorization on ImageNet val
We propose a new method for semantic instance segmentation, by first computing how likely two pixels are to belong to the same object, and then by grouping similar pixels together.
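The two-step idea (score pixel pairs, then group similar pixels) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each pixel already has an embedding vector, uses a hypothetical `similarity` function that maps squared embedding distance into (0, 1], and groups 4-connected neighbours with union-find above an assumed threshold.

```python
import math


def similarity(e_p, e_q):
    """Assumed score in (0, 1]: equals 1 for identical embeddings, near 0 for distant ones."""
    d2 = sum((a - b) ** 2 for a, b in zip(e_p, e_q))
    return 2.0 / (1.0 + math.exp(d2))


def group_pixels(embeddings, threshold=0.5):
    """Label an H x W grid of embeddings by merging similar 4-connected neighbours."""
    h, w = len(embeddings), len(embeddings[0])
    parent = list(range(h * w))  # union-find forest over flattened pixel indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for y in range(h):
        for x in range(w):
            # Compare each pixel with its right and bottom neighbours only,
            # so every adjacent pair is scored once.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    if similarity(embeddings[y][x], embeddings[ny][nx]) > threshold:
                        union(y * w + x, ny * w + nx)

    # Relabel union-find roots as consecutive instance ids in scan order.
    ids = {}
    return [
        [ids.setdefault(find(y * w + x), len(ids)) for x in range(w)]
        for y in range(h)
    ]


# Toy 2 x 3 grid with two clearly separated embedding clusters.
grid = [
    [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)],
    [(0.0, 0.0), (5.0, 5.0), (5.0, 5.0)],
]
labels = group_pixels(grid)
```

On the toy grid, the three pixels near the origin receive one label and the three pixels near (5, 5) receive another. A real system would learn the embeddings and likely use a more robust grouping step than thresholded neighbour merging.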
Finally, we show that our PG method can optimize any of the metrics, including the proposed SPIDEr metric; optimizing SPIDEr yields image captions that human raters strongly prefer over captions generated by the same model trained to optimize MLE or the COCO metrics.
At the opposite end, where accuracy is critical, we present a detector that achieves state-of-the-art performance on the COCO detection task.
Ranked #191 on Object Detection on COCO test-dev
We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories.
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.
Ranked #3 on Human Interaction Recognition on BIT
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories.
The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.