In this paper, we discuss two effective approaches to improve the efficiency and robustness of CLIP training: (1) augmenting the training dataset while maintaining the same number of optimization steps, and (2) filtering out samples that contain text regions in the image.
Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability.
Through extensive analysis of SfM initialization in the frequency domain and analysis of a 1D regression task with multiple 1D Gaussians, we propose a novel optimization strategy dubbed RAIN-GS (Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting), that successfully trains 3D Gaussians from random point clouds.
We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction.
We present OOTDiffusion, a novel network architecture for realistic and controllable image-based virtual try-on (VTON).
Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications.
Despite the success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize to different scenarios, mainly due to dramatic differences in the observations and actions across scenarios.
Animating a still image offers an engaging visual experience.
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence.
Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness.