In this paper, we propose Show-1, the first hybrid model to marry pixel-based and latent-based video diffusion models (VDMs) for text-to-video generation.
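As a rough illustration of how pixel-based and latent-based stages could be chained, the hedged sketch below generates low-resolution frames in pixel space and refines them in a compressed latent space. The stage objects and `sample` interfaces are hypothetical assumptions for illustration, not Show-1's actual API.

```python
# Hypothetical sketch of a hybrid pixel/latent text-to-video pipeline.
# Stage names, interfaces, and resolutions are illustrative assumptions.

import torch

def generate_video(prompt_emb: torch.Tensor,
                   pixel_vdm, latent_vdm, vae_decoder,
                   num_frames: int = 16) -> torch.Tensor:
    # Stage 1: a pixel-based VDM denoises directly in pixel space at low
    # resolution, where text-video alignment is cheap to enforce.
    low_res = pixel_vdm.sample(prompt_emb, num_frames=num_frames,
                               size=(64, 64))           # (T, 3, 64, 64)

    # Stage 2: a latent-based VDM refines/upsamples in a compressed latent
    # space, keeping memory and compute manageable at high resolution.
    latents = latent_vdm.sample(prompt_emb, cond_video=low_res,
                                size=(32, 32))          # (T, C, 32, 32)

    # Decode latents back to high-resolution RGB frames.
    return vae_decoder(latents)                         # (T, 3, 256, 256)
```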
Dataset condensation aims to condense a large dataset containing many training samples into a small set that retains comparable training utility.
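For intuition, one common formulation is gradient matching: the small synthetic set is optimized so that the gradients it induces mimic those of the real data. The sketch below assumes that formulation for illustration and is not necessarily this paper's specific algorithm.

```python
# Minimal gradient-matching sketch of dataset condensation (an assumed
# formulation for illustration). syn_x must be a learnable leaf tensor
# (requires_grad=True) registered in syn_opt.
import torch
import torch.nn.functional as F

def condense_step(model, real_x, real_y, syn_x, syn_y, syn_opt):
    # Gradients of the loss on a batch of real data.
    loss_real = F.cross_entropy(model(real_x), real_y)
    g_real = torch.autograd.grad(loss_real, model.parameters())

    # Gradients induced by the synthetic set; keep the graph so we can
    # backpropagate the matching loss into the synthetic images.
    loss_syn = F.cross_entropy(model(syn_x), syn_y)
    g_syn = torch.autograd.grad(loss_syn, model.parameters(),
                                create_graph=True)

    # Match the two gradient sets and update the synthetic images.
    match = sum(F.mse_loss(gs, gr.detach()) for gs, gr in zip(g_syn, g_real))
    syn_opt.zero_grad()
    match.backward()
    syn_opt.step()
    return match.item()
```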
Despite rapid advances in unsupervised visual representation learning, these methods require training on large-scale datasets, which entails costly data collection and poses additional challenges due to data privacy concerns.
Specifically, TL;DR can compress mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M ($\sim$24\%) and the noisy YFCC15M from 15M to 2.5M ($\sim$16.7\%).
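The reported ratios follow directly from the kept-to-original size quotients; a quick check:

```python
# Verify the reported compression ratios from the stated dataset sizes.
print(f"CC3M:    {0.67 / 2.82:.1%} kept")   # ~23.8%, i.e. ~24%
print(f"YFCC15M: {2.5 / 15:.1%} kept")      # ~16.7%
```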
In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be adapted to serve as a backbone for downstream tasks.
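One way to picture the efficiency gain: a handful of learnable "semantic tokens" summarize many patch tokens via cross-attention, so later blocks operate on far fewer tokens. The module below is a hedged sketch of that idea, not STViT's exact design; `SemanticTokenPool` and its parameters are illustrative.

```python
# Hedged sketch: K learnable semantic tokens attend over N patch tokens
# (K << N), compressing the sequence before subsequent transformer blocks.
import torch
import torch.nn as nn

class SemanticTokenPool(nn.Module):
    def __init__(self, dim: int, num_semantic_tokens: int = 16, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_semantic_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) -> semantic tokens: (B, K, dim)
        q = self.queries.expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)
        return out
```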
Remarkably, with only 25% of video frames annotated, our method still outperforms the base CL learners trained with 100% annotations on all video frames.
In this paper, we present DeVRF, a novel representation that accelerates the learning of dynamic radiance fields.
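One common recipe for such acceleration pairs a dense voxel grid with a learned deformation field that warps time-varying query points back to a canonical static frame, replacing a deep per-sample MLP with cheap trilinear lookups. The sketch below assumes that scheme for illustration; `deform_net`, the grid shapes, and normalized coordinates in [-1, 1] are assumptions, not DeVRF's exact formulation.

```python
# Hedged sketch: dynamic radiance field query via a deformation field plus
# static voxel grids (an assumed scheme for illustration).
import torch
import torch.nn.functional as F

def query_dynamic_field(xyz, t_emb, deform_net, density_grid, color_grid):
    # xyz: (M, 3) samples at time t; t_emb: (1, d_t) time embedding.
    # Warp each sample back into the canonical (static) frame.
    offset = deform_net(torch.cat([xyz, t_emb.expand(xyz.size(0), -1)], -1))
    canonical = xyz + offset                            # (M, 3) in [-1, 1]

    # Trilinear grid lookups replace a deep MLP per sample.
    coords = canonical.view(1, 1, 1, -1, 3)             # grid_sample layout
    sigma = F.grid_sample(density_grid, coords, align_corners=True)  # (1,1,1,1,M)
    rgb = F.grid_sample(color_grid, coords, align_corners=True)      # (1,3,1,1,M)
    return sigma.flatten(), rgb.view(3, -1).t()         # (M,), (M, 3)
```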
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition.
With such multi-dimension, multi-scale factorization, our MorphMLP block achieves a favorable accuracy-computation trade-off.
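To make the cost argument concrete, the hedged sketch below factorizes token mixing across the temporal and spatial axes instead of one dense mixing MLP over all T*N tokens, dropping the mixing cost from O((TN)^2) to O(T^2 + N^2) per channel. The dimension layout and module names are illustrative assumptions, not the paper's exact block design.

```python
# Hedged sketch of a factorized video MLP block (illustrative, not the
# paper's exact design): mixing is split across frames and spatial tokens.
import torch
import torch.nn as nn

class FactorizedMLPBlock(nn.Module):
    def __init__(self, dim: int, num_frames: int, num_tokens: int):
        super().__init__()
        self.temporal_mix = nn.Linear(num_frames, num_frames)
        self.spatial_mix = nn.Linear(num_tokens, num_tokens)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, tokens per frame, channels.
        y = self.norm1(x)
        # Mix along the frame axis T, then the spatial-token axis N.
        y = self.temporal_mix(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.spatial_mix(y.transpose(2, 3)).transpose(2, 3)
        x = x + y                        # factorized token mixing
        return x + self.channel_mlp(self.norm2(x))
```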