A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
Ranked #1 on Audio Classification on EPIC-KITCHENS-100
Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video.
Ranked #1 on Early Action Prediction on Something-Something V2
Hierarchical feature extraction models variations between relatively similar classes in the same way as variations between very dissimilar classes.
Visual interpretability of Convolutional Neural Networks (CNNs) has gained significant popularity because of the great challenges that CNN complexity poses to understanding their inner workings.
To address this challenge, we present a novel spatio-temporal convolution block that is capable of extracting spatio-temporal patterns at multiple temporal resolutions.
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
Ranked #2 on Action Recognition on HACS
We show that using Class Regularization blocks in state-of-the-art CNN architectures for action recognition leads to systematic gains of 1.8%, 1.2% and 1.4% on the Kinetics, UCF-101 and HMDB-51 datasets, respectively.
Motivated by the often distinctive temporal characteristics of actions in either horizontal or vertical direction, we introduce a novel convolution block for CNN architectures with video input.
We demonstrate the method on six state-of-the-art 3D convolutional neural networks (CNNs) on three action recognition datasets (Kinetics-400, UCF-101, and HMDB-51) and two egocentric action recognition datasets (EPIC-Kitchens and EGTEA Gaze+).
Deep learning approaches have been established as the main methodology for video classification and recognition.
The main challenges stem from dealing with the considerable variation in recording settings, the appearance of the people depicted, and the coordinated performance of their interaction.