Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs from sets of negative samples.
Video action recognition is one of the representative tasks for video understanding.
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clip and sentences or videos and paragraphs, each with distinct semantics.
Ranked #4 on Video Captioning on ActivityNet Captions
In this paper, we present a network architecture for video generation that models spatio-temporal consistency without resorting to costly 3D architectures.
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Ranked #63 on Action Recognition on Something-Something V1
In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images.
With the help of a sample-variant multi-tasking architecture, the network is trained on different tasks depending on the availability of ground-truth.
In this paper, we show that the object orientation plays an important role in 3D recognition.