5 code implementations • 30 Jul 2021 • Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size.
Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric)
Self-supervised pretraining has been shown to yield powerful representations for transfer learning.
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset.
The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA-annotated Kinetics clips.
Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.
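Point (i) above, mapping words between languages through a shared embedding space, can be illustrated with a nearest-neighbor lookup by cosine similarity. This is a generic sketch, not the MUVE algorithm itself; the toy vectors are hypothetical:

```python
import numpy as np

def map_words(src_emb, tgt_emb):
    """For each source-language word vector, return the index of the
    most similar target-language word vector (cosine similarity) in a
    shared embedding space. src_emb: (S, d), tgt_emb: (T, d)."""
    a = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    b = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)   # (S,) indices into the target vocabulary

# Toy example: two 2-D "word" embeddings whose nearest neighbors are swapped.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.0, 1.0], [1.0, 0.0]])
print(map_words(src, tgt))  # [1 0]
```

In practice such a mapping is only as good as the shared space, which is why the paper uses it as an initialization for text-based refinement rather than as the final translation.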
The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.
We introduce the Action Transformer model for recognizing and localizing human actions in video clips.
Ranked #4 on Action Recognition on AVA v2.1
True video understanding requires making sense of non-Lambertian scenes, where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple media -- colored windows, dirty mirrors, smoke, or rain.
We introduce a simple baseline for action localization on the AVA dataset.
Ranked #10 on Action Recognition on AVA v2.1
Actions as simple as grasping an object or navigating around it require a rich understanding of that object's 3D shape from a given viewpoint.
We consider the problem of enriching current object detection systems with veridical object sizes and relative depth estimates from a single image.
Object reconstruction from a single image -- in the wild -- is a problem where we can make progress and get meaningful results today.