Generalizing over temporal variations is a prerequisite for effective action recognition in videos. To address this challenge, we present a novel spatio-temporal convolution block that extracts spatio-temporal patterns at multiple temporal resolutions.
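As an illustration, the following is a minimal sketch of what such a block could look like, assuming parallel 3D convolution branches with different temporal kernel sizes; the class name, branch layout, and kernel sizes are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MultiTemporalConvBlock(nn.Module):
    """Hypothetical block: parallel 3D conv branches, one per temporal scale."""
    def __init__(self, in_channels, out_channels, temporal_kernels=(1, 3, 5)):
        super().__init__()
        # out_channels is assumed divisible by the number of branches
        branch_channels = out_channels // len(temporal_kernels)
        self.branches = nn.ModuleList([
            nn.Conv3d(in_channels, branch_channels,
                      kernel_size=(t, 3, 3),
                      padding=(t // 2, 1, 1))  # preserves (T, H, W)
            for t in temporal_kernels
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W)
        # Each branch sees a different temporal receptive field; concatenating
        # them mixes patterns extracted at multiple temporal resolutions.
        return self.relu(torch.cat([branch(x) for branch in self.branches], dim=1))
```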
We show that adding Class Regularization blocks to state-of-the-art CNN architectures for action recognition yields consistent gains of 1.8%, 1.2%, and 1.4% on the Kinetics, UCF-101, and HMDB-51 datasets, respectively.
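The excerpt does not spell out the block's internals, so the sketch below is purely a hypothetical reading: it re-weights feature channels with intermediate class estimates, in the spirit of squeeze-and-excitation, and should not be taken as the published design.

```python
import torch
import torch.nn as nn

class ClassRegularizationBlock(nn.Module):
    """Hypothetical: gate feature channels using intermediate class scores."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.to_classes = nn.Linear(channels, num_classes)   # pooled features -> class scores
        self.to_channels = nn.Linear(num_classes, channels)  # class scores -> channel gates

    def forward(self, x):  # x: (N, C, T, H, W)
        pooled = x.mean(dim=(2, 3, 4))                       # global average pooling
        scores = torch.softmax(self.to_classes(pooled), dim=1)
        gates = torch.sigmoid(self.to_channels(scores))
        # Amplify channels associated with the currently predicted classes.
        return x * gates[:, :, None, None, None]
```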
Motivated by the often distinctive temporal characteristics of actions in either the horizontal or the vertical direction, we introduce a novel convolution block for CNN architectures with video input.
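One plausible interpretation, assumed here for illustration, is to pair the temporal axis with each spatial axis separately: one branch convolves over time and height (vertical motion), the other over time and width (horizontal motion). The names and kernel shapes below are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class DirectionalSTConv(nn.Module):
    """Hypothetical: separate temporal-vertical and temporal-horizontal branches."""
    def __init__(self, channels, k=3):
        super().__init__()
        p = k // 2
        # (T x H x 1) kernel: vertical motion patterns over time
        self.vertical = nn.Conv3d(channels, channels, (k, k, 1), padding=(p, p, 0))
        # (T x 1 x W) kernel: horizontal motion patterns over time
        self.horizontal = nn.Conv3d(channels, channels, (k, 1, k), padding=(p, 0, p))

    def forward(self, x):  # x: (N, C, T, H, W)
        return torch.relu(self.vertical(x) + self.horizontal(x))
```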
We demonstrate the method on six state-of-the-art 3D convolutional neural networks (CNNs), evaluated on three action recognition datasets (Kinetics-400, UCF-101, and HMDB-51) and two egocentric action recognition datasets (EPIC-Kitchens and EGTEA Gaze+).
We employ this idea to tackle action recognition in egocentric videos by introducing additional supervised tasks.
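As a concrete illustration of adding supervised auxiliary tasks, here is a minimal multi-task sketch; the choice of auxiliary label (object classes), the head names, and the loss weight are assumptions, since the excerpt does not specify which extra tasks are used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Hypothetical: shared features feed an action head and an auxiliary head."""
    def __init__(self, feat_dim, num_actions, num_aux):
        super().__init__()
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.aux_head = nn.Linear(feat_dim, num_aux)  # e.g. object labels

    def forward(self, feats):  # feats: (N, feat_dim)
        return self.action_head(feats), self.aux_head(feats)

def multitask_loss(action_logits, aux_logits, y_action, y_aux, aux_weight=0.5):
    # The auxiliary supervision acts as a regularizer on the shared features.
    return (F.cross_entropy(action_logits, y_action)
            + aux_weight * F.cross_entropy(aux_logits, y_aux))
```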
Light field imaging presents an attractive alternative to RGB imaging because it also records the direction of the incoming light.
We observe that the presence of objects is significant both for the execution of actions by humans and, more generally, for the description of a scene.
Deep learning approaches have been established as the main methodology for video classification and recognition.
Our results demonstrate the power of linguistic analysis in real-world deception research when applied at the individual level, and provide evidence that factually incorrect tweets are not random mistakes by the sender.
The main challenges stem from the considerable variation in recording settings, in the appearance of the people depicted, and in the coordinated performance of their interactions.