However, most standard neural networks have the same function type and fixed computation budget on different samples regardless of their nature and difficulty.
The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step.
Ranked #1 on
Action Detection
on Charades
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.
Ranked #2 on
Action Recognition
on EPIC-KITCHENS-100
(using extra training data)
In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
Ranked #1 on
Action Classification
on Charades
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #8 on
Action Classification
on Moments in Time
(Top 5 Accuracy metric, using extra
training data)
Pixel-level labels are particularly expensive to acquire.
Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.
Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond.