Many top-down architectures for instance segmentation achieve significant success when trained and tested on a pre-defined closed-world taxonomy.
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.
From learned pairwise affinities (PA) we construct a large set of pseudo-ground-truth instance masks; combined with human-annotated instance masks, we train GGNs and significantly outperform the SOTA on open-world instance segmentation on various benchmarks including COCO, LVIS, ADE20K, and UVO.
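As a rough illustration of how pseudo masks might be derived from affinities, the sketch below thresholds a simplified per-pixel affinity map and groups pixels by connected components. The map shape, threshold, and `min_area` filter are illustrative assumptions, not the paper's actual grouping module.

```python
import numpy as np
from scipy import ndimage

def pseudo_masks_from_affinity(affinity, threshold=0.5, min_area=100):
    """Group pixels into pseudo-ground-truth instance masks.

    `affinity` is assumed to be an (H, W) map where each value summarizes a
    pixel's predicted affinity to its neighbors (a simplification of dense
    pairwise affinities). Pixels above `threshold` are grouped into connected
    components, and small components are discarded.
    """
    foreground = affinity > threshold
    labeled, num = ndimage.label(foreground)
    masks = []
    for i in range(1, num + 1):
        mask = labeled == i
        if mask.sum() >= min_area:
            masks.append(mask)
    return masks

# Toy usage: two blobs of high affinity become two pseudo masks.
aff = np.zeros((64, 64))
aff[5:20, 5:20] = 0.9
aff[40:60, 30:60] = 0.8
print(len(pseudo_masks_from_affinity(aff, min_area=50)))  # -> 2
```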
Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
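A minimal sketch of the underlying idea, assuming a SimCLR-style InfoNCE objective between embeddings of a short clip and a longer clip drawn from the same video; the paper's actual setup may differ (e.g., in using momentum encoders), and the function name is ours:

```python
import torch
import torch.nn.functional as F

def long_short_infonce(z_short, z_long, temperature=0.1):
    """InfoNCE loss pairing each short-clip embedding with the embedding of a
    longer clip from the same video: matching videos sit on the diagonal of
    the similarity matrix, other videos in the batch act as negatives."""
    z_short = F.normalize(z_short, dim=1)
    z_long = F.normalize(z_long, dim=1)
    logits = z_short @ z_long.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_short.size(0), device=z_short.device)
    return F.cross_entropy(logits, targets)

# Toy usage: random embeddings stand in for the video transformer's outputs.
loss = long_short_infonce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```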
Current state-of-the-art object detection and segmentation methods work well under the closed-world assumption.
A majority of methods for video frame interpolation compute bidirectional optical flow between adjacent frames of a video, followed by a suitable warping algorithm to generate the output frames; a minimal warp-and-blend sketch follows the entry below.
Ranked #2 on Video Frame Interpolation on GoPro
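The sketch below shows the generic pipeline these methods share, assuming precomputed bidirectional flows and linear motion; real interpolators add occlusion reasoning and refinement networks, and both function names here are illustrative.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (N, C, H, W) by a dense flow field (N, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement in
    pixels, using bilinear sampling."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype),
        indexing="ij",
    )
    # Shift the sampling grid by the flow, then normalize to [-1, 1].
    x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1
    y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((x, y), dim=-1)  # (N, H, W, 2) in (x, y) order
    return F.grid_sample(frame, grid, align_corners=True)

def interpolate_midpoint(f0, f1, flow_0to1, flow_1to0):
    """Synthesize the frame halfway between f0 and f1 by warping each input
    half-way along the flows and averaging; assumes linear motion and
    ignores occlusions, which real methods handle explicitly."""
    return 0.5 * (backward_warp(f0, 0.5 * flow_1to0) +
                  backward_warp(f1, 0.5 * flow_0to1))

f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero motion as a sanity check
print(interpolate_midpoint(f0, f1, flow, -flow).shape)  # (1, 3, 64, 64)
```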
To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities (a toy aggregator is sketched below).
Ranked #23 on Action Recognition on UCF101
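A minimal sketch of learned clip aggregation, under the assumption that per-clip features from both a cheap and an expensive backbone have been projected to a common dimension and are fused by a recurrent model; the class name and the GRU choice are illustrative:

```python
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    """Fuse per-clip feature vectors from models of different capacities
    into a single video-level prediction via a recurrent aggregator."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):  # (B, num_clips, feat_dim)
        _, h = self.rnn(clip_feats)
        return self.fc(h[-1])       # classify from the final hidden state

# Toy usage: 8 clips per video; in the FASTER setting a few clips would
# come from an expensive model and the rest from a lightweight one.
feats = torch.randn(2, 8, 256)
print(ClipAggregator(256, 400)(feats).shape)  # torch.Size([2, 400])
```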
It consists of a shared 2D spatial convolution followed by two parallel point-wise convolutional layers, one devoted to images and the other to videos.
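Since the block is described concretely, here is a minimal PyTorch rendering of it; the module and argument names are ours, and video frames are assumed to be folded into the batch dimension before the 2D convolutions.

```python
import torch
import torch.nn as nn

class SharedSpatialBlock(nn.Module):
    """One 2D spatial convolution shared across modalities, followed by two
    parallel point-wise (1x1) convs: one for images, one for video frames."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.point_image = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.point_video = nn.Conv2d(out_ch, out_ch, kernel_size=1)

    def forward(self, x, modality="image"):
        x = self.spatial(x)  # weights shared between both modalities
        head = self.point_image if modality == "image" else self.point_video
        return head(x)

block = SharedSpatialBlock(3, 64)
img = torch.randn(2, 3, 224, 224)      # a batch of images
frames = torch.randn(2, 3, 224, 224)   # video frames folded into the batch
print(block(img, "image").shape, block(frames, "video").shape)
```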
To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.
Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.
Ranked #1 on Action Recognition In Videos on miniSports
Second, since frame-based models already perform quite well on action recognition, is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning?
Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55 (Actions Top-1 (S2) metric)
We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips; a selection sketch follows the entry below.
Ranked #1 on Action Recognition on miniSports
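A sketch of that select-then-recognize scheme, assuming a cheap saliency scorer and an expensive clip classifier; both callables here are stand-ins, not the paper's API.

```python
import torch

def classify_salient_clips(clips, sampler, recognizer, k=3):
    """Score every clip with a lightweight `sampler`, then run the expensive
    `recognizer` only on the top-k clips and average its predictions."""
    with torch.no_grad():
        saliency = sampler(clips)        # (num_clips,) cheap scores
    topk = saliency.topk(k).indices
    preds = recognizer(clips[topk])      # (k, num_classes) costly pass
    return preds.mean(dim=0)

# Toy usage with stand-in models.
clips = torch.randn(16, 3, 8, 112, 112)                   # 16 clips of 8 frames
sampler = lambda c: c.abs().mean(dim=(1, 2, 3, 4))        # fake saliency scores
recognizer = lambda c: torch.randn(c.shape[0], 400)       # fake classifier
print(classify_salient_clips(clips, sampler, recognizer).shape)  # (400,)
```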
It is natural to ask: 1) whether group convolution can help alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. A grouped, channel-separated block illustrating these trade-offs is sketched below.
Ranked #1 on Action Recognition on Sports-1M
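One common answer to question 1), sketched under our own naming: channel separation, i.e., pushing the spatiotemporal 3x3x3 convolution into a depthwise (fully grouped) form preceded by a point-wise channel-mixing convolution.

```python
import torch
import torch.nn as nn

class ChannelSeparated3DBlock(nn.Module):
    """A point-wise 1x1x1 conv mixes channels; a depthwise 3x3x3 conv
    (groups == channels) handles spatiotemporal structure far more cheaply
    than a full 3D convolution."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.depthwise = nn.Conv3d(
            out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.depthwise(self.pointwise(x))

x = torch.randn(2, 64, 8, 56, 56)
print(ChannelSeparated3DBlock(64, 128)(x).shape)  # (2, 128, 8, 56, 56)
```

The depthwise convolution costs roughly 1/C of a full 3D convolution's multiply-adds at the same kernel size, which is where the computation/accuracy trade-off comes from.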
In this work, we propose an alternative approach to learning video representations that requires no semantically labeled videos and instead leverages the years of effort in collecting and labeling large and clean still-image datasets.
Ranked #70 on Action Recognition on HMDB-51 (using extra training data)
Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A.
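A simplified sketch of such learned spatial sampling: predict a per-pixel offset field from both frames' features and bilinearly resample Frame B's features (the published model uses deformable convolutions; this `grid_sample`-based version and its names are our simplification).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWarper(nn.Module):
    """Predict a 2-channel offset field from the concatenated features of
    Frames A and B, then bilinearly sample Frame B's features at the
    offset locations."""

    def __init__(self, ch):
        super().__init__()
        self.offset = nn.Conv2d(2 * ch, 2, kernel_size=3, padding=1)

    def forward(self, feat_a, feat_b):  # both (N, C, H, W)
        n, _, h, w = feat_a.shape
        flow = self.offset(torch.cat([feat_a, feat_b], dim=1))
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=flow.dtype),
            torch.arange(w, dtype=flow.dtype),
            indexing="ij",
        )
        # Normalize offset sampling locations to [-1, 1] for grid_sample.
        x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1
        y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
        grid = torch.stack((x, y), dim=-1)
        return F.grid_sample(feat_b, grid, align_corners=True)

fa, fb = torch.randn(1, 32, 24, 24), torch.randn(1, 32, 24, 24)
print(FeatureWarper(32)(fa, fb).shape)  # torch.Size([1, 32, 24, 24])
```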
The videos retrieved by the search engines are then verified for correctness by human annotators.
There is a natural correlation between the visual and auditory elements of a video.
Ranked #7 on Self-Supervised Audio Classification on ESC-50
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
Ranked #7 on Keypoint Detection on COCO test-challenge
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition (one factorized form is sketched below).
Ranked #3 on Action Recognition on Sports-1M
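One of the studied forms factorizes each 3D convolution into a 2D spatial convolution followed by a 1D temporal one, with a nonlinearity in between; a minimal sketch with illustrative layer widths:

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A '(2+1)D' block: a full t x k x k 3D convolution is replaced by a
    2D spatial conv (1, k, k) followed by a 1D temporal conv (t, 1, 1)."""

    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, out_ch, (3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

x = torch.randn(2, 3, 16, 112, 112)
print(Conv2Plus1D(3, 64)(x).shape)  # (2, 64, 16, 112, 112)
```

Choosing `mid_ch` to match the parameter count of the full 3D kernel keeps the comparison fair while adding an extra nonlinearity.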
Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning.
Ranked #69 on Action Recognition on HMDB-51
Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description.
Over the last few years, deep learning methods have emerged as one of the most prominent approaches for video analysis.
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset (a toy variant is sketched below).
Ranked #8 on Action Recognition on Sports-1M
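For orientation, a toy homogeneous 3D ConvNet in this spirit, using uniformly small 3x3x3 kernels throughout; the layer sizes are illustrative and not the published architecture.

```python
import torch
import torch.nn as nn

# Minimal homogeneous 3D ConvNet: small 3x3x3 kernels in every conv layer,
# ending in global pooling and a linear classifier.
c3d_like = nn.Sequential(
    nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),               # pool space only in the first stage
    nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),                       # then pool space and time
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(128, 487),                   # e.g., Sports-1M has 487 classes
)
print(c3d_like(torch.randn(2, 3, 16, 112, 112)).shape)  # (2, 487)
```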
We show the generality of our approach by building our mid-level descriptors from two different low-level feature representations.