The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips.
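The two-part objective described above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the embedding shapes, the cosine-similarity form of both terms, and the toy data are all assumptions made for clarity.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-timestep embeddings for a short clip and the
# corresponding (temporally aligned) segment of a long clip.
rng = np.random.default_rng(0)
short_t = rng.normal(size=(4, 8))   # 4 timesteps, 8-dim features
long_t = short_t + 0.01             # nearly identical, for illustration

# Fine-grained temporal term: similarity of corresponding timestep pairs.
fine_grained = np.mean([cosine_sim(short_t[i], long_t[i]) for i in range(4)])

# Persistent temporal term: similarity of pooled global embeddings.
global_sim = cosine_sim(short_t.mean(axis=0), long_t.mean(axis=0))

# Maximizing both similarities = minimizing their negated sum.
loss = -(fine_grained + global_sim)
```

In a real training loop each term would be a contrastive loss over a batch (with negatives), but the structure — a per-timestep alignment term plus a global clip-level term — is the same.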
1 code implementation • 26 Apr 2021 • Yu-Chuan Su, Soravit Changpinyo, Xiangning Chen, Sathish Thoppay, Cho-Jui Hsieh, Lior Shapira, Radu Soricut, Hartwig Adam, Matthew Brown, Ming-Hsuan Yang, Boqing Gong
To enable progress on this task, we create a new dataset consisting of 220k human-annotated 2.5D relationships among 512K objects from 11K images.
We investigate the use of Neural Radiance Fields (NeRF) to learn high quality 3D object category models from collections of input images.
We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference.
Ranked #1 on Action Classification on Kinetics-600 (GFLOPs metric)
Ensembling is a simple and popular technique for boosting evaluation performance by training multiple models (e.g., with different initializations) and aggregating their predictions.
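The aggregation step can be sketched as probability averaging over independently initialized models. The fixed toy "models" below stand in for trained networks; the names and shapes are illustrative assumptions.

```python
import numpy as np

def ensemble_predict(models, x):
    """Average per-class probabilities across an ensemble of models."""
    probs = np.stack([m(x) for m in models])  # shape: (n_models, n_classes)
    return probs.mean(axis=0)

# Toy stand-ins for trained models: each returns class probabilities.
m1 = lambda x: np.array([0.7, 0.2, 0.1])
m2 = lambda x: np.array([0.5, 0.4, 0.1])
m3 = lambda x: np.array([0.6, 0.1, 0.3])

avg = ensemble_predict([m1, m2, m3], None)
pred = int(np.argmax(avg))  # the ensemble's predicted class
```

Averaging probabilities (rather than hard votes) lets confident models outweigh uncertain ones, which is a common reason ensembles beat their individual members.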
Object frequency in the real world often follows a power law, leading to a mismatch between datasets with long-tailed class distributions seen by a machine learning model and our expectation of the model to perform well on all classes.
Ranked #22 on Long-tail Learning on Places-LT
Autonomous drone racing is a challenging research problem at the intersection of computer vision, planning, state estimation, and control.
In this work, we look at the effect such non-identical data distributions have on visual classification via Federated Learning.
In the extreme, we observed that a model trained on patches extracted from just one scan, with each patch augmented 50 times, achieved a Dice score of 0.73 on a validation set of 40 cases.
Recent advances in video super-resolution have shown that convolutional neural networks combined with motion compensation are able to merge information from multiple low-resolution (LR) frames to generate high-quality images.
Ranked #5 on Video Super-Resolution on Vid4 - 4x upscaling
We call this process weight imprinting as it directly sets weights for a new category based on an appropriately scaled copy of the embedding layer activations for that training example.
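The imprinting step described above can be sketched as appending a normalized embedding as a new classifier weight row. This is a simplified sketch under assumed shapes (toy 3-dimensional embeddings, an identity weight matrix for the existing classes); the scaling used by the actual method is abstracted into L2 normalization here.

```python
import numpy as np

def imprint_weights(classifier_weights, embedding):
    """Add a new class by setting its weight vector to the L2-normalized
    embedding of a single example of that class (weight imprinting sketch)."""
    w_new = embedding / np.linalg.norm(embedding)
    return np.concatenate([classifier_weights, w_new[None, :]], axis=0)

W = np.eye(3)                     # 3 existing classes, 3-dim embeddings (toy)
emb = np.array([3.0, 4.0, 0.0])   # embedding of one example of the new class
W2 = imprint_weights(W, emb)      # W2 has 4 rows; the new row has unit norm
```

Because the new weight vector points in the direction of the example's embedding, a cosine-similarity classifier immediately scores that class highly for similar inputs, with no gradient updates required.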
Deep reinforcement learning yields great results for a large array of problems, but models are generally retrained anew for each new problem to be solved.
This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story.
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences.
In a more realistic environment, without the oracle keypoints, the proposed deep person instance segmentation model conditioned on human pose achieves 3.8% to 10.5% relative improvement compared with its strongest baseline of a deep network trained only for segmentation.
In this paper we present a dense ground truth dataset of nonrigidly deforming real-world scenes.
We demonstrate the success of our approach by showing significant error reduction on 6 popular optical flow algorithms applied to a range of real-world nonrigid benchmarks.
We present a systematic analysis of how to fuse conditional computation with representation learning and achieve a continuum of hybrid models with different ratios of accuracy vs. efficiency.
This paper addresses the segmentation of videos with arbitrary motion, including dynamic textures, using novel motion features and a supervised learning approach.