To overcome the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs full-computation self-attention on local chunks and linearised, low-cost self-attention over the full sequence.
Ranked #1 on Speech Separation on WSJ0-2mix-16k (using extra training data)
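The joint local and global attention described for MossFormer can be illustrated with a minimal NumPy sketch. It is not the MossFormer implementation: the softmax scoring inside chunks, the `elu(x) + 1` feature map for the linearised branch, the plain sum of the two branches, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_chunk_attention(q, k, v, chunk_size):
    """Full (quadratic) attention computed independently inside each chunk."""
    n, d = q.shape
    assert n % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    c = n // chunk_size
    qc = q.reshape(c, chunk_size, d)
    kc = k.reshape(c, chunk_size, d)
    vc = v.reshape(c, chunk_size, v.shape[-1])
    scores = qc @ kc.transpose(0, 2, 1) / np.sqrt(d)  # (c, chunk, chunk)
    return (softmax(scores) @ vc).reshape(n, v.shape[-1])

def feature_map(x):
    # elu(x) + 1: a positive feature map (an assumption, not MossFormer's choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def global_linear_attention(q, k, v):
    """Linearised attention over the full sequence, O(n) in sequence length."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                    # (d, d_v): keys and values summarised once
    norm = qf @ kf.sum(axis=0)       # (n,): per-query normaliser
    return (qf @ kv) / norm[:, None]

def joint_attention(q, k, v, chunk_size):
    # Hypothetical combination: sum of the local and global branches.
    return local_chunk_attention(q, k, v, chunk_size) + global_linear_attention(q, k, v)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = joint_attention(q, k, v, chunk_size=4)  # shape (8, 4)
```

The local branch lets every element attend to its chunk neighbours at full cost, while the global branch gives every element a cheap, direct path to the whole sequence, which is the property the excerpt attributes to the joint design.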
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene.
We introduce a new model, Segment any Text (SaT), to solve this problem.
During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality.
Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency.
Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
With the explosive growth of available training data, single-image 3D human modeling is on the verge of a transition to a data-centric paradigm.
To reduce these gaps, this paper introduces Video Seal, a comprehensive framework for neural video watermarking and a competitive open-source model.
In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models.
Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions.