A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

CogME: A Novel Evaluation Metric for Video Understanding Intelligence

Then we propose a top-down evaluation system for VideoQA, based on the cognitive process of humans and story elements: Cognitive Modules for Evaluation (CogME).

Spatio-Temporal Context for Action Detection

Research in action detection has grown in recent years, as it plays a key role in video understanding.

An Image Classifier Can Suffice For Video Understanding

We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task.

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.

Long-Short Temporal Contrastive Learning of Video Transformers

Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.

$C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context.

Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition

In this paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset.

Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking

In this paper, we propose a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame, and hence may serve as a robust estimation even in challenging scenarios including occlusion.

Transformed ROIs for Capturing Visual Transformations in Videos

Modeling the visual changes that an action brings to a scene is critical for video understanding.

A Study On the Effects of Pre-processing On Spatio-temporal Action Recognition Using Spiking Neural Networks Trained with STDP

In this paper, we rely on the network architecture of a convolutional spiking neural network trained with STDP, and we test the performance of this network when challenged with action recognition tasks.

