Based on this observation, we further sparsify the delta parameters of multiple homologous SFT models with DARE and then merge them into a single model by parameter averaging.
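The sparsify-then-average recipe can be sketched as follows. This is a minimal illustration, not the authors' implementation: DARE randomly drops a fraction of each model's delta parameters (fine-tuned minus base weights) and rescales the survivors to preserve the expected delta, after which the sparsified deltas are averaged onto the base model. The function names and the `drop_rate` default are hypothetical.

```python
import numpy as np

def dare_sparsify(base, finetuned, drop_rate=0.9, rng=None):
    """DARE sketch: Drop delta parameters At random and REscale the rest.

    base, finetuned: parameter arrays of identical shape.
    drop_rate: fraction of delta entries zeroed out (illustrative default).
    """
    rng = rng or np.random.default_rng(0)
    delta = finetuned - base
    mask = rng.random(delta.shape) >= drop_rate   # keep ~(1 - drop_rate) of entries
    return delta * mask / (1.0 - drop_rate)       # rescale to preserve expectation

def merge_models(base, finetuned_models, drop_rate=0.9):
    """Average the DARE-sparsified deltas of several SFT models onto the base."""
    deltas = [dare_sparsify(base, ft, drop_rate) for ft in finetuned_models]
    return base + np.mean(deltas, axis=0)
```

In practice this would be applied per weight tensor across a model's state dict; the 1/(1 - drop_rate) rescaling is what keeps the merged model close to the original fine-tuned behavior despite dropping most delta entries.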
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
Ranked #1 on Zero-Shot Video Question Answer on TGIF-QA
Transformer-like models have recently proven effective for a wide range of downstream vision tasks such as segmentation and detection.
Global medium-range weather forecasting is critical to decision-making across many social and economic domains.
Large language models (LLMs) have significantly advanced the field of language intelligence, as evidenced by their strong empirical performance across a spectrum of complex reasoning tasks.
Semantic, instance, and panoptic segmentation of 3D point clouds have been addressed using task-specific models of distinct design.
Ranked #1 on Panoptic Segmentation on ScanNetV2
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.
Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability.