Video Alignment
20 papers with code • 2 benchmarks • 4 datasets
Latest papers with no code
Scaling Up Video Summarization Pretraining with Large Language Models
Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem.
The Effects of Short Video-Sharing Services on Video Copy Detection
The experimental results at the segment level and video level reveal three effects: segment-level VCD is more difficult in short video-sharing services than in general video-sharing services; video-level VCD is easier in short video-sharing services than in general video-sharing services; and the video alignment component mainly suppresses detection performance in short video-sharing services.
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility.
FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time.
Towards A Better Metric for Text-to-Video Generation
Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world.
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment.
ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer
However, most previous works treat the live stream as a whole item and explore Click-Through-Rate (CTR) prediction at the item level, neglecting the dynamic changes that occur even within the same live room.
Learning to Ground Instructional Articles in Videos through Narrations
To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks.
Learning by Aligning 2D Skeleton Sequences in Time
This paper presents a self-supervised temporal video alignment framework that is useful for several fine-grained human activity understanding applications.
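The core operation behind temporal video alignment can be sketched with classical dynamic time warping over per-frame feature sequences. This is a minimal generic illustration, not the paper's self-supervised method; the `dtw_align` function and the toy feature arrays are hypothetical:

```python
import numpy as np

def dtw_align(x, y):
    """Align two per-frame feature sequences with dynamic time warping.

    x: (n, d) array, y: (m, d) array of frame features.
    Returns the total alignment cost and the warping path as (i, j) pairs.
    """
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between frames of the two sequences.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated-cost table with an infinite border as the base case.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match
                acc[i - 1, j],      # skip a frame in y
                acc[i, j - 1],      # skip a frame in x
            )
    # Backtrack from the corner to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]
```

Aligning a sequence with itself yields zero cost and a purely diagonal path; in practice the frame features would come from a learned encoder (e.g. over 2D skeletons) rather than raw pixels.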