no code implementations • 27 May 2025 • Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, MingYu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen
However, despite the importance of active perception in embodied intelligence, little work has explored how MLLMs can be equipped with, or can learn, active perception capabilities.
1 code implementation • 26 May 2025 • Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, MingYu Liu, Wen Wang, Hao Chen, Chunhua Shen
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs.
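A minimal sketch of why these requirements conflict under a fixed visual-token budget. All numbers here (patch size, budget, resolutions) are illustrative assumptions, not figures from the paper:

```python
# Hypothetical illustration of the frame-count vs. resolution trade-off
# under a fixed visual-token budget (all numbers are assumptions).

PATCH = 14            # assumed ViT patch size
TOKEN_BUDGET = 16384  # assumed total visual tokens the model context can hold

def tokens_per_frame(height: int, width: int, patch: int = PATCH) -> int:
    """Visual tokens one frame contributes after patchification."""
    return (height // patch) * (width // patch)

# Dense temporal coverage: many low-resolution frames fit the budget.
low_res = tokens_per_frame(224, 224)      # 256 tokens/frame
print(TOKEN_BUDGET // low_res)            # -> 64 frames

# Precise pixel-level grounding: few high-resolution frames fit.
high_res = tokens_per_frame(1008, 1008)   # 5184 tokens/frame
print(TOKEN_BUDGET // high_res)           # -> 3 frames
```

The same budget buys roughly 64 low-resolution frames or only 3 high-resolution ones, which is the tension the abstract describes.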
1 code implementation • 24 Feb 2025 • Canyu Zhao, MingYu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images).
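A quick sanity check on the quoted ratio, using only the figures given in the abstract:

```python
# 600K annotated images vs. 1B (figures from the abstract).
ratio = 600_000 / 1_000_000_000
print(f"{ratio:.2%}")  # -> 0.06%
```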
no code implementations • 23 Jul 2024 • Canyu Zhao, MingYu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, Chunhua Shen
Our approach uses autoregressive models for global narrative coherence, predicting sequences of visual tokens that are subsequently transformed into high-quality video frames through diffusion rendering.
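A minimal sketch of the two-stage pipeline the abstract describes. The `ar_model` and `diffusion_renderer` interfaces are placeholders for illustration, not the authors' API:

```python
import torch

def generate_video(prompt_tokens: torch.Tensor,
                   ar_model,            # autoregressive token predictor (placeholder)
                   diffusion_renderer,  # token-conditioned diffusion decoder (placeholder)
                   num_frames: int = 16,
                   tokens_per_frame: int = 256) -> list:
    """Stage 1 plans a globally coherent visual-token sequence;
    stage 2 renders each frame from its tokens via diffusion."""
    # Stage 1: autoregressively predict visual tokens for the whole clip,
    # which is what provides global narrative coherence.
    visual_tokens = ar_model.generate(
        prompt_tokens, max_new_tokens=num_frames * tokens_per_frame)

    # Stage 2: render each frame's token block into pixels with diffusion.
    frames = []
    for i in range(num_frames):
        block = visual_tokens[:, i * tokens_per_frame:(i + 1) * tokens_per_frame]
        frames.append(diffusion_renderer.sample(condition=block))
    return frames
```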
2 code implementations • CVPR 2024 • Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, Chunhua Shen
Experiments show that the images produced by our method are consistent with the given concepts and better aligned with the input text.
1 code implementation • 19 Nov 2023 • Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen
We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches and keypoints, are suitable for generating high-quality image content.
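One way this finding could be staged in practice, sketched below. The `layout_planner` and `dense_generator` objects and their methods are hypothetical placeholders, not the paper's implementation:

```python
def two_stage_generation(text_prompt: str,
                         layout_planner,    # box-conditioned planner (placeholder)
                         dense_generator):  # sketch/keypoint-conditioned generator (placeholder)
    """Sparse conditions fix the global layout; dense conditions
    then guide high-quality content within that layout."""
    # Stage 1: sparse control (bounding boxes) plans where things go.
    boxes = layout_planner.plan(text_prompt)

    # Stage 2: dense control (e.g., sketches or keypoints derived from
    # the planned layout) steers the fine-grained image content.
    dense_maps = dense_generator.boxes_to_dense(boxes)
    return dense_generator.sample(prompt=text_prompt, condition=dense_maps)
```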