To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.
We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups.
To this end, we propose SchurVINS, a novel filter-based VINS framework that guarantees both high accuracy, by building a complete residual model, and low computational complexity, by using the Schur complement.
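As a generic illustration of why the Schur complement lowers cost, the sketch below solves a symmetric block linear system by first eliminating one block of unknowns and then back-substituting. It is a minimal example under stated assumptions, not the SchurVINS formulation: the function name `schur_solve`, the block names `A`, `B`, `D`, and the right-hand sides `a`, `b` are all hypothetical.

```python
import numpy as np

def schur_solve(A, B, D, a, b):
    """Solve [[A, B], [B.T, D]] @ [x, y] = [a, b] by eliminating y.

    Hypothetical helper for illustration; assumes D and the Schur
    complement S = A - B D^{-1} B.T are both invertible.
    """
    # Apply D^{-1} to B.T and b without forming the inverse explicitly.
    D_inv_BT = np.linalg.solve(D, B.T)
    D_inv_b = np.linalg.solve(D, b)

    # Reduced system in x only: S x = a - B D^{-1} b.
    S = A - B @ D_inv_BT
    x = np.linalg.solve(S, a - B @ D_inv_b)

    # Back-substitute: y = D^{-1} (b - B.T x).
    y = D_inv_b - D_inv_BT @ x
    return x, y
```

When `D` is much larger than `A` but cheap to invert (for example, block-diagonal), most of the work moves into the small reduced system in `x`, which is the usual motivation for this elimination.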
This study explores the role of cross-attention during inference in text-conditional diffusion models.
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Controllability plays a crucial role in video generation since it allows users to create desired content.
For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms that combine naturally with the Mamba architecture and fully exploit its attributes to achieve spatio-temporal interaction of multi-temporal features and obtain accurate change information.
Time series analysis is essential for comprehending the complexities inherent in various real-world systems and applications.
Tables convey factual and quantitative data with implicit conventions created by humans that are often challenging for machines to parse.
Recent advancements in subject-driven image generation have led to zero-shot generation, yet precisely selecting and focusing on crucial subject representations remains challenging.