no code implementations • 18 Mar 2024 • Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos.
1 code implementation • 16 Oct 2023 • Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang
We also devise a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.
Ranked #1 on Visual Reasoning on Bongard-OpenWorld
1 code implementation • 22 Jul 2022 • Hang Ye, Wentao Zhu, Chunyu Wang, Rujie Wu, Yizhou Wang
While voxel-based methods have achieved promising results for multi-person 3D pose estimation from multiple cameras, they suffer from a heavy computational burden, especially for large scenes.
Ranked #5 on 3D Multi-Person Pose Estimation on Campus