Video Understanding

95 papers with code • 0 benchmarks • 25 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective
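A detection localised in both space and time is often represented as an action "tube": a class label, a frame range, and a per-frame bounding box. A minimal sketch of such a structure (the class and field names here are illustrative, not taken from any listed paper):

```python
from dataclasses import dataclass, field

@dataclass
class ActionInstance:
    """One action detection, localised in space and time."""
    label: str                # action class, e.g. "pedestrian crossing"
    start_frame: int          # first frame of the temporal extent
    end_frame: int            # last frame of the temporal extent (inclusive)
    # frame index -> (x1, y1, x2, y2) spatial box in that frame
    boxes: dict = field(default_factory=dict)

    def duration(self) -> int:
        """Number of frames the action spans."""
        return self.end_frame - self.start_frame + 1
```

A purely temporal task (e.g. action spotting) would drop the `boxes` field and keep only the frame range.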

Latest papers with code

Disentangle Your Dense Object Detector

zehuichen123/DDOD 7 Jul 2021

Extensive experiments on the MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra overhead.

Video Understanding

07 Jul 2021

Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection

baidu-research/vidpress-sports 28 Jun 2021

With rapidly evolving internet technologies and emerging tools, sports-related videos generated online are increasing at an unprecedented pace.

Action Recognition Action Spotting +2

28 Jun 2021

Video Swin Transformer

SwinTransformer/Video-Swin-Transformer 24 Jun 2021

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.

 Ranked #1 on Action Classification on Kinetics-400 (using extra training data)

Action Classification Action Recognition +3

24 Jun 2021

Towards Long-Form Video Understanding

chaoyuaw/lvu CVPR 2021

Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds.

Action Recognition Video Recognition +1

21 Jun 2021

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

airsplay/vimpac 21 Jun 2021

Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations.

Action Classification Action Recognition +2

21 Jun 2021
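The VIMPAC snippet above argues that masking individual video tokens is too easy, since a model can copy the nearly identical neighboring token; masking contiguous spatio-temporal blocks removes that shortcut. A rough sketch of block masking over a token grid (block shape and ratio are illustrative, not VIMPAC's actual settings):

```python
import random

def block_mask(t, h, w, block=(2, 4, 4), ratio=0.5):
    """Mask random contiguous (time, height, width) blocks of tokens
    until roughly `ratio` of the t*h*w grid is masked. Hiding whole
    blocks prevents trivially copying an adjacent visible token."""
    bt, bh, bw = block
    target = int(t * h * w * ratio)
    masked = set()
    while len(masked) < target:
        # pick a random block origin that fits (mostly) inside the grid
        t0 = random.randrange(max(t - bt + 1, 1))
        h0 = random.randrange(max(h - bh + 1, 1))
        w0 = random.randrange(max(w - bw + 1, 1))
        for dt in range(bt):
            for dh in range(bh):
                for dw in range(bw):
                    if t0 + dt < t and h0 + dh < h and w0 + dw < w:
                        masked.add((t0 + dt, h0 + dh, w0 + dw))
    return masked
```

During pre-training, the model would then predict the discrete tokens at the masked positions from the visible ones.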

Learning the Predictability of the Future

cvlab-columbia/hyperfuture CVPR 2021

We introduce a framework for learning from unlabeled video what is predictable in the future.

Hierarchical structure Representation Learning +2

19 Jun 2021

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

doc-doc/NExT-QA CVPR 2021

We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining the temporal actions.

Question Answering Video Question Answering +2

19 Jun 2021

End-to-end Temporal Action Detection with Transformer

xlliu7/TadTR 18 Jun 2021

Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.

Action Detection Video Understanding

18 Jun 2021
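A TAD system as defined above outputs (label, start, end) triples for an untrimmed video, and predictions are typically matched to ground truth by temporal intersection-over-union. A minimal sketch of that metric (standard practice, not TadTR's specific evaluation code):

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start, end) pairs, e.g. in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))   # overlapping duration
    union = (pe - ps) + (ge - gs) - inter          # combined duration
    return inter / union if union > 0 else 0.0
```

A prediction usually counts as correct only when it has the right label and its temporal IoU with a ground-truth instance exceeds a threshold such as 0.5.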

Discerning Generic Event Boundaries in Long-Form Wild Videos

rayush7/GEBD 18 Jun 2021

Detecting generic, taxonomy-free event boundaries in videos represents a major stride towards holistic video understanding.

Boundary Detection Video Understanding

18 Jun 2021

Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention

m-decoster/ChaLearn-2021-LAP Computer Vision and Pattern Recognition Workshops (CVPRW) 2021

However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition, the VTN cannot reach its full potential in this domain.

Action Recognition Sign Language Recognition +1

11 Jun 2021