Search Results for author: Zuyan Liu

Found 10 papers, 9 with code

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

1 code implementation • 6 Feb 2025 • Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

Our training pipeline begins with the most distinct modalities: image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities.

Cross-Modal Alignment • Language Modeling • +1
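The abstract above describes a staged curriculum: start with the most distinct modality pair (image and text), then bring in speech to bridge language and audio, and finally video to connect all modalities. A minimal sketch of such a progressive schedule is shown below; the stage names, sample format, and filtering logic are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of progressive modality alignment: each training stage
# activates one more bridging modality, and only samples whose modalities
# are all active can be used in that stage. Details are assumptions.

def progressive_stages():
    """Return the curriculum: each stage adds a bridging modality."""
    return [
        ("stage1", {"image", "text"}),                      # most distinct pair
        ("stage2", {"image", "text", "speech"}),            # speech links language and audio
        ("stage3", {"image", "text", "speech", "video"}),   # video connects all modalities
    ]

def select_batches(dataset, active_modalities):
    """Keep only samples whose modalities are all active in this stage."""
    return [s for s in dataset if s["modalities"] <= active_modalities]

# Toy dataset: each sample declares which modalities it carries.
dataset = [
    {"id": 0, "modalities": {"image", "text"}},
    {"id": 1, "modalities": {"speech", "text"}},
    {"id": 2, "modalities": {"video", "speech", "text"}},
]

for name, active in progressive_stages():
    batch = select_batches(dataset, active)
    print(name, [s["id"] for s in batch])
```

Later stages subsume earlier ones, so the model never loses access to the data that shaped its earlier skills.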

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

1 code implementation • 21 Nov 2024 • Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu

In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) develop an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs).

Visual Reasoning

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

1 code implementation • 25 Jul 2024 • Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu

Surrounding less important caches are then merged with these anchors, enhancing the preservation of contextual information in the KV caches while yielding an arbitrary acceleration ratio.

Instruction Following • Text Generation
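The snippet above describes merging less important KV-cache entries into nearby high-importance anchors. A toy NumPy sketch of that idea follows; the importance scoring, the nearest-anchor assignment, and the averaging merge rule are illustrative assumptions, not the Elastic Cache implementation.

```python
# Toy sketch of importance-based KV cache compression in the spirit of
# Elastic Cache: keep high-importance entries as anchors and merge the
# surrounding, less important entries into their nearest anchor by
# averaging. Scoring and merge rule here are illustrative assumptions.
import numpy as np

def compress_kv(cache, importance, keep_ratio=0.5):
    """cache: (seq, dim) array; importance: (seq,) scores.
    Returns (merged, anchor_idx) with merged of shape (k, dim)."""
    seq = cache.shape[0]
    k = max(1, int(np.ceil(seq * keep_ratio)))
    # positions kept as anchors: the k highest-importance entries
    anchor_idx = np.sort(np.argsort(importance)[-k:])
    # assign every position to its nearest anchor by sequence distance
    assign = np.abs(np.arange(seq)[:, None] - anchor_idx[None, :]).argmin(axis=1)
    # merge each anchor's group by averaging, preserving local context
    merged = np.stack([cache[assign == j].mean(axis=0) for j in range(k)])
    return merged, anchor_idx

cache = np.arange(8, dtype=float)[:, None].repeat(2, axis=1)  # 8 tokens, dim 2
importance = np.array([0.9, 0.1, 0.2, 0.8, 0.1, 0.7, 0.2, 0.9])
merged, anchors = compress_kv(cache, importance, keep_ratio=0.5)
```

Because `keep_ratio` is a free parameter, the same mechanism yields an arbitrary acceleration ratio, as the snippet above notes.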

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

1 code implementation • 19 Mar 2024 • Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu

In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications.

Visual Instruction Following • Visual Question Answering

HandMIM: Pose-Aware Self-Supervised Learning for 3D Hand Mesh Estimation

no code implementations • 29 Jul 2023 • Zuyan Liu, Gaojie Lin, Congyi Wang, Min Zheng, Feida Zhu

Our approach involves a unified and multi-granularity strategy that includes a pseudo keypoint alignment module in the teacher-student framework for learning pose-aware semantic class tokens.

Pose Estimation • Regression • +2

Unleashing Text-to-Image Diffusion Models for Visual Perception

2 code implementations • ICCV 2023 • Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.

Denoising • Image Segmentation • +4

DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion

1 code implementation • CVPR 2023 • Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, Jiwen Lu

Unlike previous work that relies on carefully designed network architectures and loss functions to fuse the information from the source and target faces, we reformulate face swapping as a conditional inpainting task, performed by a powerful diffusion model guided by the desired face attributes (e.g., identity and landmarks).

Face Swapping

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

1 code implementation • 4 Jul 2022 • Yongming Rao, Zuyan Liu, Wenliang Zhao, Jie Zhou, Jiwen Lu

We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations.
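The core idea above, progressively dropping uninformative spatial locations so that later, more expensive stages process fewer tokens, can be sketched in a few lines. The scoring function and per-stage keep ratios below are assumptions for illustration, not the paper's learned prediction modules.

```python
# Illustrative sketch of progressive spatial sparsification: at each stage,
# score the remaining tokens and keep only the top fraction, so later
# (more expensive) layers operate on fewer spatial locations. The scoring
# function and keep ratios are assumptions, not the paper's method.
import numpy as np

def progressive_sparsify(tokens, score_fn, keep_ratios):
    """tokens: (n, dim); keep_ratios: per-stage fraction of tokens kept.
    Returns the indices of tokens surviving all stages, in spatial order."""
    idx = np.arange(tokens.shape[0])
    for r in keep_ratios:
        scores = score_fn(tokens[idx])
        k = max(1, int(len(idx) * r))
        top = np.argsort(scores)[-k:]
        idx = idx[np.sort(top)]          # keep original spatial order
    return idx

tokens = np.linspace(0.0, 1.0, 16)[:, None]  # 16 tokens, dim 1
score_fn = lambda t: t[:, 0]                 # toy informativeness score
kept = progressive_sparsify(tokens, score_fn, keep_ratios=[0.5, 0.5])
```

For hierarchical models with structured feature maps, the paper keeps all locations but computes asymmetrically; the index-dropping above is only the simplest, plain-ViT flavor of the idea.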

PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers

1 code implementation • ICCV 2021 • Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou

In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion.

Ranked #1 on Point Cloud Completion on ShapeNet (Chamfer Distance L2 metric)

Decoder • Inductive Bias • +2
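The ShapeNet ranking above is measured by the L2 Chamfer Distance. A minimal NumPy implementation of that metric, averaged nearest-neighbor squared distances in both directions between two point sets, is shown below; this is the standard metric definition, not code from the PoinTr repository.

```python
# Symmetric L2 Chamfer Distance between two point clouds: for each point
# in one set, find its nearest neighbor in the other set, average the
# squared distances, and sum the two directions.
import numpy as np

def chamfer_l2(a, b):
    """a: (n, 3), b: (m, 3) point clouds -> symmetric Chamfer Distance (L2)."""
    # pairwise squared Euclidean distances, shape (n, m)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

The metric is zero for identical clouds and grows with the average miss distance, which makes it a natural fit for scoring completed point sets against ground truth.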
