We trained a large, deep convolutional neural network to classify the 1. 3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes.
Ranked #4 on
Graph Classification
on HIV-fMRI-77
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX.
Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences.
Ranked #2 on
Video Generation
on UCF-101
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation.
Monocular 3D estimation is crucial for visual perception.
Ranked #2 on
Monocular Depth Estimation
on KITTI Eigen split
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e. g., keyboard and mouse operations).
In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge.
Reasoning large language models are rapidly evolving across various domains.
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.