Generative Video Models

TimeSformer

Introduced by Bertasius et al. in *Is Space-Time Attention All You Need for Video Understanding?*

TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method extends the image model [Vision Transformer](https://paperswithcode.com/method/vision-transformer) (ViT) to video by applying the self-attention mechanism over the space-time 3D volume rather than the image plane alone. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that can be fed to a Transformer encoder, analogously to the token features computed from words in NLP.

Source: Is Space-Time Attention All You Need for Video Understanding?
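To make the idea concrete, below is a minimal sketch of joint space-time attention over frame-level patches: each patch is linearly embedded, given a learned positional embedding, and attended to jointly across all frames. This is an illustrative assumption, not the authors' released code (all module and variable names here are hypothetical), and the paper's best-performing variant actually factorizes this into separate temporal and spatial ("divided") attention steps rather than the single joint operation shown.

```python
# Hypothetical sketch of TimeSformer-style joint space-time attention (not the official code).
import torch
import torch.nn as nn

class SpaceTimeAttentionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, num_frames * num_patches, dim) -- one token per patch per frame
        h = self.norm(x)
        out, _ = self.attn(h, h, h)   # self-attention over the space-time 3D volume
        return x + out                # residual connection

batch, frames, channels, height, width = 2, 8, 3, 224, 224
patch_size, dim = 16, 768

clip = torch.randn(batch, frames, channels, height, width)

# Linear patch embedding, as in ViT, applied to every frame of the clip.
to_patches = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)

x = clip.flatten(0, 1)                # (batch*frames, C, H, W)
x = to_patches(x)                     # (batch*frames, dim, H/16, W/16)
x = x.flatten(2).transpose(1, 2)      # (batch*frames, num_patches, dim)
num_patches = x.shape[1]
x = x.reshape(batch, frames * num_patches, dim)

# Learned space-time positional embedding: one vector per patch per frame.
pos_embed = nn.Parameter(torch.zeros(1, frames * num_patches, dim))
x = x + pos_embed

tokens = SpaceTimeAttentionBlock(dim)(x)
print(tokens.shape)                   # torch.Size([2, 1568, 768])
```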
