We design a multivariate search space comprising six search variables that captures a wide variety of choices in designing two-stream models.
Current state-of-the-art object detection and segmentation methods work well under the closed-world assumption.
This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks.
Furthermore, to search the multivariate space efficiently, we propose a coarse-to-fine strategy that begins with a factorized distribution, reducing the number of architecture parameters by over an order of magnitude.
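As a rough, hypothetical illustration of why factorizing helps (the concrete counts below are assumptions, not the paper's numbers): a joint categorical distribution over six search variables needs one parameter per point in the Cartesian product, while a factorized distribution needs only one independent categorical per variable.

```python
# Hypothetical illustration: parameter counts for a joint vs. a factorized
# categorical distribution over a multivariate architecture search space.
# The sizes (6 variables, 4 choices each) are assumptions for illustration.

def joint_param_count(num_choices_per_var):
    # One probability per point in the full Cartesian product of choices.
    total = 1
    for k in num_choices_per_var:
        total *= k
    return total

def factorized_param_count(num_choices_per_var):
    # One independent categorical distribution per search variable.
    return sum(num_choices_per_var)

choices = [4] * 6  # six search variables, four options each (assumed)
print(joint_param_count(choices))       # 4096
print(factorized_param_count(choices))  # 24
```

With these assumed sizes, factorization shrinks the distribution from 4096 parameters to 24, consistent with the "over an order of magnitude" reduction claimed above.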
To obtain the single-frame supervision, the annotators are asked to identify only a single frame within the temporal window of an action.
Ranked #4 on Weakly Supervised Action Localization on BEOID

Our key idea is to decorrelate feature representations of a category from its co-occurring context.
However, in current video datasets it has been observed that action classes can often be recognized without any temporal information from a single frame of video.
FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities.
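A minimal sketch of the idea, with assumed details: run an expensive model on a few clips and a cheap model on the redundant neighboring clips, then aggregate clip-level class scores into a video-level prediction. FASTER learns this aggregation; a plain average is used here as a stand-in, and the every-fourth-clip schedule is an assumption.

```python
# Sketch (not the actual FASTER implementation): mix predictions from
# models of different complexities across neighboring clips.

def aggregate_predictions(clip_scores):
    """Average per-clip class-score lists into one video-level score list."""
    n_clips = len(clip_scores)
    n_classes = len(clip_scores[0])
    return [sum(scores[c] for scores in clip_scores) / n_clips
            for c in range(n_classes)]

def classify_video(clips, expensive_model, cheap_model, period=4):
    # Assumed schedule: the expensive model runs on every `period`-th clip,
    # the cheap model handles the rest, exploiting inter-clip redundancy.
    scores = [expensive_model(c) if i % period == 0 else cheap_model(c)
              for i, c in enumerate(clips)]
    video_scores = aggregate_predictions(scores)
    return max(range(len(video_scores)), key=video_scores.__getitem__)
```

The compute saving comes from the schedule: with `period=4`, the expensive model runs on only a quarter of the clips.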
Ranked #20 on Action Recognition on UCF101

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.
Ranked #1 on Action Recognition on miniSports (Video hit@1 metric)

Second, frame-based models perform quite well on action recognition: is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning?
Ranked #1 on Egocentric Activity Recognition on EPIC-KITCHENS-55 (Actions Top-1 (S2) metric)

It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.
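On question 1), a back-of-the-envelope parameter count shows why grouping alleviates cost: splitting channels into G groups divides a 3D convolution's weight count by G. The channel and kernel sizes below are assumptions for illustration, and bias terms are omitted.

```python
# Rough parameter count for a 3D convolution layer, dense vs. grouped
# (bias omitted). Sizes below are assumed for illustration only.

def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    # Each group maps c_in/groups input channels to c_out/groups outputs
    # with its own (kt x kh x kw) kernels.
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * kt * kh * kw

dense = conv3d_params(256, 256, 3, 3, 3)              # 1769472
grouped = conv3d_params(256, 256, 3, 3, 3, groups=8)  # 221184
print(dense // grouped)  # 8: grouping divides parameters (and FLOPs) by G
```

The same factor-of-G saving applies to FLOPs, which is what makes the accuracy/computation trade-off in question 3) interesting: grouping buys compute at the cost of cross-group channel interactions.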
Ranked #1 on Action Recognition on Sports-1M

The videos retrieved by the search engines are then verified for correctness by human annotators.
As any generative model induces a probability density on its output domain, we propose studying this density directly.