Video Alignment
21 papers with code • 2 benchmarks • 4 datasets
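Video alignment seeks a frame-level correspondence between two videos of the same activity. As a rough, generic illustration of the task (not the method of any paper listed below), here is a minimal dynamic-time-warping sketch over per-frame embeddings; the feature arrays and function name are hypothetical:

```python
import numpy as np

def dtw_align(feats_a, feats_b):
    """Align two videos by dynamic time warping over per-frame features.

    feats_a: (n, d) array of frame embeddings for video A
    feats_b: (m, d) array of frame embeddings for video B
    Returns a list of matched (i, j) frame-index pairs.
    """
    n, m = len(feats_a), len(feats_b)
    # Pairwise Euclidean distances between all frames of A and B.
    cost = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    # Accumulated-cost table filled with the standard DTW recurrence.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Aligning a video with itself yields the diagonal path, i.e. frame i of one copy matches frame i of the other; real methods replace raw distances with learned embeddings.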
Latest papers
Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment
Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI
To establish a unified evaluation framework for video generation tasks, our benchmark includes 11 metrics spanning four dimensions to assess algorithm performance.
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
For video generation, various open-source models and publicly available services have been developed to generate high-quality videos.
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation.
A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference
In this paper, we present a solution for enhancing video alignment to improve multi-step inference.
Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
Both PAAT and PAAB surpass their respective backbone Transformers by up to 9.8% in real-world action recognition and 21.8% in multi-view robotic video alignment.
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Moreover, to fully unlock model capabilities for high-quality video generation and promote the development of the field, we curate a large-scale and open-source video dataset called HD-VG-130M.
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video.
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
In this paper, we consider a novel setting where such an alignment is between (i) instruction steps depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) segments from in-the-wild videos, which comprise an enactment of the assembly actions in the real world.
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sequential video understanding, as an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature.