TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that are fed to a Transformer encoder, analogous to the token features computed from words in NLP.
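The patch-embedding step described above can be sketched as follows. This is a minimal illustration with assumed shapes (frame size 224, patch size 16, embedding dimension 768) and random stand-ins for the learned projection and positional embeddings; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W, C = 8, 224, 224, 3   # frames, height, width, channels (assumed)
P, D = 16, 768                # patch size, embedding dimension (assumed)
N = (H // P) * (W // P)       # patches per frame: 14 * 14 = 196

clip = rng.standard_normal((T, H, W, C))

# Split each frame into non-overlapping P x P patches and flatten them:
# (T, H, W, C) -> (T * N, P * P * C)
patches = (clip.reshape(T, H // P, P, W // P, P, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(T * N, P * P * C))

# Linear projection to the embedding space plus positional embeddings
# (random placeholders for what would be learned parameters).
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
pos = rng.standard_normal((T * N, D)) * 0.02

tokens = patches @ W_embed + pos
print(tokens.shape)  # (1568, 768): T * N spatiotemporal tokens of dimension D
```

The resulting `(T * N, D)` token sequence is what the Transformer encoder attends over; the key design question in the paper is how to factorize self-attention across the temporal and spatial axes of this sequence.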
Source: Is Space-Time Attention All You Need for Video Understanding?
| Task | Papers | Share |
|---|---|---|
| Classification | 3 | 10.34% |
| Video Classification | 2 | 6.90% |
| Video Question Answering | 2 | 6.90% |
| Action Classification | 2 | 6.90% |
| Age Estimation | 1 | 3.45% |
| Automatic Speech Recognition (ASR) | 1 | 3.45% |
| Image Captioning | 1 | 3.45% |
| Language Modelling | 1 | 3.45% |
| Speech Recognition | 1 | 3.45% |