Video Alignment
22 papers with code • 2 benchmarks • 4 datasets
Latest papers
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sequential video understanding, an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature.
Learning a Grammar Inducer from Massive Uncurated Instructional Videos
While previous work focuses on building systems for inducing grammars on text that is well-aligned with video content, we investigate the scenario in which text and video are only in loose correspondence.
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations.
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner.
View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people.
View-Invariant Probabilistic Embedding for Human Pose
Depictions of similar human body configurations can vary with changing viewpoints.
Adversarial Skill Networks: Unsupervised Robot Skill Learning from Video
Our method learns a general skill embedding independently from the task context by using an adversarial loss.
Temporal Cycle-Consistency Learning
We introduce a self-supervised representation learning method based on the task of temporal alignment between videos.
Dynamic Temporal Alignment of Speech to Lips
This alignment is based on deep audio-visual features, mapping the lips video and the speech signal to a shared representation.
LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers
This paper considers a learnable approach for comparing and aligning videos.
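Several of the entries above align two sequences frame by frame in a shared feature space. As a rough illustration of what temporal alignment means, here is a minimal sketch of classic dynamic time warping (DTW) over per-frame feature vectors. It is not the method of any paper listed here; the function name `dtw_align`, the Euclidean frame distance, and the toy inputs are assumptions made for the example.

```python
import math


def dtw_align(a, b):
    """Align two sequences of feature vectors with classic DTW.

    a, b: non-empty lists of equal-dimension feature vectors (lists of floats),
          one vector per frame. These stand in for learned frame embeddings.
    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching frame i of `a` to frame j of `b`.
    """
    n, m = len(a), len(b)

    def dist(u, v):
        # Euclidean distance between two frame features (an assumption;
        # any frame-level dissimilarity could be used instead).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in `a` only
                cost[i][j - 1],      # advance in `b` only
            )

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the second "video" repeats its first frame once.
video_a = [[0.0], [1.0], [2.0]]
video_b = [[0.0], [0.0], [1.0], [2.0]]
total_cost, path = dtw_align(video_a, video_b)
```

In this toy case the path maps the first frame of `video_a` to both repeated frames of `video_b`, which is the kind of many-to-one correspondence a temporal alignment has to recover. The listed papers learn the features, the alignment, or both; this sketch only covers the alignment step on fixed features.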