TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that are fed to a Transformer encoder, analogous to the token features computed from words in NLP.
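The patch-embedding step described above can be sketched as follows. This is a minimal illustration with assumed shapes (frame size 224, patch size 16, embedding dimension 768) and random stand-ins for the learned projection and positional embeddings; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W, C = 8, 224, 224, 3   # frames, height, width, channels (assumed)
P, D = 16, 768                # patch size, embedding dimension (assumed)
N = (H // P) * (W // P)       # patches per frame: 14 * 14 = 196

clip = rng.standard_normal((T, H, W, C))

# Split each frame into non-overlapping P x P patches and flatten them:
# (T, H, W, C) -> (T * N, P * P * C)
patches = (clip.reshape(T, H // P, P, W // P, P, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(T * N, P * P * C))

# Linear projection to the embedding space plus positional embeddings
# (random placeholders for what would be learned parameters).
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
pos = rng.standard_normal((T * N, D)) * 0.02

tokens = patches @ W_embed + pos
print(tokens.shape)  # (1568, 768): T * N spatiotemporal tokens of dimension D
```

The resulting `(T * N, D)` token sequence is what the Transformer encoder attends over; the key design question in the paper is how to factorize self-attention across the temporal and spatial axes of this sequence.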
Source: Is Space-Time Attention All You Need for Video Understanding?
| Task | Papers | Share |
|---|---|---|
| Classification | 3 | 10.34% |
| Video Classification | 2 | 6.90% |
| Video Question Answering | 2 | 6.90% |
| Action Classification | 2 | 6.90% |
| Age Estimation | 1 | 3.45% |
| Automatic Speech Recognition (ASR) | 1 | 3.45% |
| Image Captioning | 1 | 3.45% |
| Language Modelling | 1 | 3.45% |
| Speech Recognition | 1 | 3.45% |