1 code implementation • 6 Feb 2025 • Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Our training pipeline begins with the most distinct modalities, image and text, and then gradually expands the model's skill set with speech data that connects language and audio knowledge, and video data that connects all modalities.
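A minimal sketch of this progressive modality curriculum, assuming hypothetical stage names, loaders, and a stand-in training step (not the paper's actual pipeline):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Stage:
    name: str
    modalities: Tuple[str, ...]

# Start from the most distinct pair (image + text), then add speech,
# then video, which bridges all modalities.
CURRICULUM = [
    Stage("stage1_image_text", ("image", "text")),
    Stage("stage2_add_speech", ("image", "text", "speech")),
    Stage("stage3_connect_video", ("image", "text", "speech", "video")),
]

def run_curriculum(train_step, sample_batches):
    """Run each stage in order, widening the modality mix over time."""
    for stage in CURRICULUM:
        for batch in sample_batches(stage.modalities):
            train_step(batch)

if __name__ == "__main__":
    run_curriculum(
        train_step=lambda b: None,                    # stand-in optimizer step
        sample_batches=lambda m: [{"modalities": m}], # stand-in data loader
    )
```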
1 code implementation • 21 Nov 2024 • Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu
In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs).
1 code implementation • 19 Sep 2024 • Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours.
Ranked #1 on Video Question Answering on Perception Test
1 code implementation • 25 Jul 2024 • Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu
Less important neighboring caches are then merged into these anchors, preserving contextual information in the KV cache while allowing an arbitrary acceleration ratio.
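A hedged sketch of anchor-based KV-cache merging: the importance scores below are stand-ins (e.g., accumulated attention mass), and the nearest-anchor averaging rule is an illustrative assumption, not the paper's exact merge policy.

```python
import torch

def merge_kv_cache(keys, values, importance, keep_ratio=0.5):
    """keys/values: [seq, dim]; importance: [seq]. Returns compressed caches."""
    seq_len = keys.size(0)
    num_anchors = max(1, int(seq_len * keep_ratio))
    anchor_idx = importance.topk(num_anchors).indices.sort().values

    # Assign every position to its nearest anchor (by position).
    positions = torch.arange(seq_len)
    assign = (positions[:, None] - anchor_idx[None, :]).abs().argmin(dim=1)

    # Average-merge all caches assigned to the same anchor, preserving
    # context from pruned positions instead of dropping them outright.
    merged_k = torch.zeros(num_anchors, keys.size(1))
    merged_v = torch.zeros(num_anchors, values.size(1))
    counts = torch.zeros(num_anchors, 1)
    merged_k.index_add_(0, assign, keys)
    merged_v.index_add_(0, assign, values)
    counts.index_add_(0, assign, torch.ones(seq_len, 1))
    return merged_k / counts, merged_v / counts

k, v = torch.randn(128, 64), torch.randn(128, 64)
imp = torch.rand(128)                       # stand-in importance scores
ck, cv = merge_kv_cache(k, v, imp, 0.25)    # 4x cache reduction
```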
1 code implementation • 19 Mar 2024 • Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu
In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications.
Ranked #130 on Visual Question Answering on MM-Vet
no code implementations • 29 Jul 2023 • Zuyan Liu, Gaojie Lin, Congyi Wang, Min Zheng, Feida Zhu
Our approach adopts a unified, multi-granularity strategy that includes a pseudo-keypoint alignment module within a teacher-student framework for learning pose-aware semantic class tokens.
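A hedged sketch of aligning pose-aware tokens between a frozen teacher and a student; the cosine alignment loss and token shapes here are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def keypoint_alignment_loss(student_tokens, teacher_tokens):
    """Both: [batch, num_keypoints, dim]. Align student to (frozen) teacher."""
    teacher_tokens = teacher_tokens.detach()  # no gradients into the teacher
    # Cosine alignment per keypoint token, averaged over batch and keypoints.
    return 1.0 - F.cosine_similarity(student_tokens, teacher_tokens, dim=-1).mean()

s = torch.randn(8, 17, 256, requires_grad=True)  # e.g. 17 COCO keypoints
t = torch.randn(8, 17, 256)
keypoint_alignment_loss(s, t).backward()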
2 code implementations • ICCV 2023 • Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu
In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
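A schematic sketch of the idea: run a denoising UNet on the image latent under a text prompt and harvest its intermediate features for a downstream perception head. The modules below are lightweight stand-ins, not the actual pretrained diffusion networks.

```python
import torch
import torch.nn as nn

class TinyUNetBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)
    def forward(self, x, text_emb):
        # Stand-in for cross-attention conditioning on the text prompt.
        return torch.relu(self.conv(x)) + text_emb[:, :, None, None]

class TinyUNet(nn.Module):
    def __init__(self, dim=16, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(TinyUNetBlock(dim) for _ in range(depth))
    def forward(self, latent, text_emb):
        feats, x = [], latent
        for block in self.blocks:
            x = block(x, text_emb)
            feats.append(x)        # multi-scale features for the task head
        return feats

unet = TinyUNet()
latent = torch.randn(2, 16, 32, 32)   # stand-in VAE latent of the image
text_emb = torch.randn(2, 16)         # stand-in prompt embedding
features = unet(latent, text_emb)     # fed to a segmentation/depth head
```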
1 code implementation • CVPR 2023 • Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, Jiwen Lu
Unlike previous work, which relies on carefully designed network architectures and loss functions to fuse information from the source and target faces, we reformulate face swapping as a conditional inpainting task performed by a powerful diffusion model guided by the desired face attributes (e.g., identity and landmarks).
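A hedged sketch of diffusion-based conditional inpainting: at each denoising step, the region outside the face mask is reset to the target image while the masked face region is denoised under identity/landmark conditioning. The denoiser, schedule, and conditioning dict below are toy stand-ins following a generic DDPM-style recipe, not the paper's implementation.

```python
import torch

def inpaint_step(x_t, target_known, mask, denoiser, cond, t):
    """mask==1 marks the face region to synthesize; 0 keeps the target."""
    x_pred = denoiser(x_t, t, cond)            # predicted less-noisy sample
    return mask * x_pred + (1 - mask) * target_known

# Toy usage with stand-in tensors and a trivial "denoiser".
x = torch.randn(1, 3, 64, 64)
target = torch.randn(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64); mask[..., 16:48, 16:48] = 1.0
cond = {"identity": torch.randn(1, 512), "landmarks": torch.randn(1, 68, 2)}
for t in reversed(range(10)):
    x = inpaint_step(x, target, mask, lambda a, b, c: 0.9 * a, cond, t)
```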
1 code implementation • 4 Jul 2022 • Yongming Rao, Zuyan Liu, Wenliang Zhao, Jie Zhou, Jiwen Lu
We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation across spatial locations.
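A minimal sketch of dynamic spatial sparsification: a tiny predictor scores tokens and only the top fraction is kept for subsequent expensive blocks, with the keep ratio applied progressively with depth. The predictor and ratios are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token importance predictor
        self.keep_ratio = keep_ratio

    def forward(self, tokens):
        """tokens: [batch, n, dim] -> pruned [batch, k, dim]."""
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)   # [b, n]
        idx = scores.topk(k, dim=1).indices       # keep top-k tokens
        return tokens.gather(1, idx.unsqueeze(-1).expand(b, k, d))

tokens = torch.randn(2, 196, 384)                # e.g. ViT patch tokens
# Progressive sparsification: prune further as depth increases.
for stage in (TokenPruner(384, 0.7), TokenPruner(384, 0.7), TokenPruner(384, 0.7)):
    tokens = stage(tokens)                       # 196 -> 137 -> 95 -> 66
print(tokens.shape)
```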
1 code implementation • ICCV 2021 • Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou
In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion (a schematic sketch follows below).
Ranked #1 on Point Cloud Completion on ShapeNet (Chamfer Distance L2 metric)
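A schematic sketch of the set-to-set translation view: encode the partial cloud as a set, decode learnable queries for the missing part with a transformer, then expand each query into a group of points. All sizes and modules are illustrative stand-ins, not the PoinTr architecture itself.

```python
import torch
import torch.nn as nn

class CompletionTransformer(nn.Module):
    def __init__(self, dim=128, num_queries=64, pts_per_proxy=16):
        super().__init__()
        self.embed = nn.Linear(3, dim)                 # partial-cloud proxies
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.transformer = nn.Transformer(
            d_model=dim, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.to_points = nn.Linear(dim, 3 * pts_per_proxy)

    def forward(self, partial):
        """partial: [batch, n, 3] -> predicted missing points [batch, m, 3]."""
        b = partial.size(0)
        src = self.embed(partial)                      # input set
        tgt = self.queries.unsqueeze(0).expand(b, -1, -1)
        out = self.transformer(src, tgt)               # set-to-set translation
        return self.to_points(out).reshape(b, -1, 3)   # proxies -> dense points

model = CompletionTransformer()
pred = model(torch.randn(2, 256, 3))                   # -> [2, 1024, 3]
print(pred.shape)
```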