Based on this observation, we further sparsify the delta parameters of multiple homologous SFT models with DARE and subsequently merge them into a single model by parameter averaging.
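A minimal PyTorch sketch of this drop-and-rescale merging recipe, assuming the usual formulation (randomly zero a fraction of each delta, i.e. fine-tuned minus base weights, rescale the survivors, then average the sparsified deltas onto the base model); the function names and the drop rate are illustrative assumptions, not a reference implementation.

import torch

def dare_sparsify(delta, drop_rate=0.9):
    # Randomly drop delta parameters with probability drop_rate and
    # rescale the survivors by 1 / (1 - drop_rate) to keep expectations unchanged.
    mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    return delta * mask / (1.0 - drop_rate)

def merge_models(base_state, finetuned_states, drop_rate=0.9):
    # Average DARE-sparsified deltas from several fine-tuned models onto the base weights.
    merged = {}
    for name, base_param in base_state.items():
        deltas = [dare_sparsify(ft[name] - base_param, drop_rate)
                  for ft in finetuned_states]
        merged[name] = base_param + torch.stack(deltas).mean(dim=0)
    return merged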
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM (a minimal projection sketch follows this entry).
Ranked #1 on Zero-Shot Video Question Answer on TGIF-QA
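One common way to realize such unification, sketched below as an assumption about the general approach rather than this paper's exact design, is to map frozen visual-encoder features into the language model's token-embedding space with a small trainable projector; the class name and dimensions are illustrative.

import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    # Projects visual features into the language model's embedding space
    # so image/video tokens can be consumed alongside text tokens.
    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats):
        # vision_feats: (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(vision_feats)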
Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection.
Global medium-range weather forecasting is critical to decision-making across many social and economic domains.
We introduce CogVLM, a powerful open-source visual language foundation model.
Ranked #3 on Visual Question Answering (VQA) on CORE-MM
Large language models (LLMs) have dramatically advanced the field of language intelligence, as evidenced by their strong empirical performance on a wide range of complex reasoning tasks.
To remedy this, we design a new training algorithm, Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training.
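A minimal sketch of the idea, assuming a linear layer whose cumulative update on top of the frozen initial weight is parameterized as a low-rank product that can be widened during training; the class and method names are illustrative, not the paper's reference implementation.

import torch
import torch.nn as nn

class IncrementalLowRankLinear(nn.Module):
    # W = W0 (frozen) + U @ V, where the rank of U @ V can grow over training.
    def __init__(self, in_features, out_features, init_rank=4):
        super().__init__()
        self.weight0 = nn.Parameter(torch.empty(out_features, in_features),
                                    requires_grad=False)
        nn.init.kaiming_uniform_(self.weight0)
        self.U = nn.Parameter(torch.zeros(out_features, init_rank))
        self.V = nn.Parameter(torch.randn(init_rank, in_features) * 0.01)

    def grow_rank(self, extra_rank):
        # Append new factors so the cumulative update can gain rank over time.
        # (The optimizer would need to be rebuilt to track the new parameters.)
        out_f, in_f = self.weight0.shape
        self.U = nn.Parameter(torch.cat(
            [self.U.data, torch.zeros(out_f, extra_rank)], dim=1))
        self.V = nn.Parameter(torch.cat(
            [self.V.data, torch.randn(extra_rank, in_f) * 0.01], dim=0))

    def forward(self, x):
        return x @ (self.weight0 + self.U @ self.V).t()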
Semantic, instance, and panoptic segmentation of 3D point clouds have been addressed using task-specific models of distinct design.
Ranked #1 on Panoptic Segmentation on ScanNetV2
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.
Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability.