Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities.
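To make the alignment mechanism concrete, the sketch below shows a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired image/text embeddings; the encoders, batch tensors, and fixed temperature are placeholder assumptions rather than CLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from placeholder encoders.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; CLIP learns the temperature,
    # a fixed value is used here for simplicity.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```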
Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains.
We propose MVSplat, an efficient feed-forward 3D Gaussian Splatting model learned from sparse multi-view images.
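As a rough illustration of what a feed-forward Gaussian Splatting predictor outputs, the toy module below maps per-pixel features to 3D Gaussian parameters (means, scales, rotations, opacity, color); the layer names, channel layout, and feature backbone are assumptions for illustration only, not MVSplat's actual architecture, which additionally builds cross-view cost volumes.

```python
import torch
import torch.nn as nn

class FeedForwardGaussianHead(nn.Module):
    """Toy head mapping per-pixel image features to 3D Gaussian parameters.

    Illustrative only: it shows the output parameterization of a feed-forward
    Gaussian predictor, not the full multi-view model.
    """

    def __init__(self, feat_dim=64):
        super().__init__()
        # 3 (mean) + 3 (log-scale) + 4 (rotation quaternion)
        # + 1 (opacity) + 3 (RGB color) = 14 channels per pixel.
        self.to_gaussians = nn.Conv2d(feat_dim, 14, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, feat_dim, H, W) features from some image backbone.
        out = self.to_gaussians(feats)
        means, log_scales, quats, opacity, color = torch.split(
            out, [3, 3, 4, 1, 3], dim=1
        )
        return {
            "means": means,                                   # 3D positions
            "scales": log_scales.exp(),                       # positive scales
            "rotations": nn.functional.normalize(quats, dim=1),
            "opacity": torch.sigmoid(opacity),
            "color": torch.sigmoid(color),
        }
```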
In this paper, we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
In recent years, the application of multimodal large language models (MLLMs) in various fields has achieved remarkable success.
Therefore, we release the first open-access decompilation LLMs, ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code.
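As a hedged sketch of how paired C/assembly training data can be produced with a standard compiler (the helper name, flags, and data format here are illustrative assumptions, not the released pipeline):

```python
import os
import subprocess
import tempfile

def c_to_assembly(c_source: str, opt_level: str = "-O0") -> str:
    """Compile a C snippet to assembly with gcc and return the .s text.

    Hypothetical helper for constructing (assembly -> C) training pairs;
    the actual dataset construction may differ.
    """
    with tempfile.TemporaryDirectory() as tmp:
        c_path = os.path.join(tmp, "snippet.c")
        s_path = os.path.join(tmp, "snippet.s")
        with open(c_path, "w") as f:
            f.write(c_source)
        # -S stops after generating assembly instead of producing an object file.
        subprocess.run(["gcc", opt_level, "-S", c_path, "-o", s_path], check=True)
        with open(s_path) as f:
            return f.read()

if __name__ == "__main__":
    asm = c_to_assembly("int add(int a, int b) { return a + b; }")
    print(asm)
```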
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks.
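As one concrete illustration of parameter-efficient fine-tuning, the sketch below adds a LoRA-style low-rank adapter to a frozen linear layer; LoRA is used only as a familiar example and is not necessarily the method proposed in this work.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA-style).

    Only A and B are updated during fine-tuning, so the trainable parameter
    count is r * (in_features + out_features) instead of in * out.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen

        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Base projection plus the low-rank correction (B @ A) applied to x.
        return self.base(x) + (x @ self.lora_a.t() @ self.lora_b.t()) * self.scaling
```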
Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences.
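A minimal sketch of the policy-gradient (REINFORCE) formulation implied here, where sampled sequences are scored by a sequence-level reward such as BLEU and the sequence log-likelihood is scaled by that reward; the tensor shapes and baseline are placeholder assumptions.

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=None):
    """REINFORCE objective for sequence generation.

    log_probs: (batch, seq_len) token log-probabilities of sampled sequences.
    rewards:   (batch,) sequence-level rewards, e.g. BLEU or a feedback score.
    baseline:  optional (batch,) variance-reduction baseline.
    """
    seq_log_prob = log_probs.sum(dim=1)               # log p(y | x) per sample
    advantage = (rewards - baseline) if baseline is not None else rewards
    # Minimizing this loss maximizes expected reward; the advantage is treated
    # as a constant (no gradient flows through the reward).
    return -(advantage.detach() * seq_log_prob).mean()
```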
We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s.