Experiments show that models trained with the added TextSSR-F data achieve higher accuracy than models trained on 4 million existing synthetic samples.
The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents.
Recent work on human animation usually relies on audio, pose, or movement-map conditions, thereby achieving vivid animation quality.
In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.
For inference, we propose a novel optimization based on the Hamilton-Jacobi-Bellman (HJB) equation to further enhance face quality.
Document content analysis has been a crucial research area in computer vision.
Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models.
Global medium-range weather forecasting is critical to decision-making across many social and economic domains.
The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects.
Ranked #1 on Visual Object Tracking on GOT-10k