no code implementations • 7 Jun 2024 • Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, Hao Dong
To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions.
no code implementations • CVPR 2024 • Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, Hao Dong
By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation.
no code implementations • 20 Sep 2023 • Yuxing Long, Xiaoqi Li, Wenzhe Cai, Hao Dong
The performances on the representative VLN task R2R show that our method surpasses the leading zero-shot VLN model by a large margin on all metrics.
no code implementations • 14 Sep 2023 • Yunshui Li, Binyuan Hui, Zhaochao Yin, Wanwei He, Run Luo, Yuxing Long, Min Yang, Fei Huang, Yongbin Li
Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation.
1 code implementation • 19 Aug 2023 • Fulong Ye, Yuxing Long, Fangxiang Feng, Xiaojie Wang
Referring Expression Generation (REG) aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with a dual task of Referring Expression Comprehension (REC) to locate the referred object.
1 code implementation • 26 May 2023 • Yuxing Long, Binyuan Hui, Caixia Yuan1, Fei Huang, Yongbin Li, Xiaojie Wang
Existing multimodal task-oriented dialog data fails to demonstrate the diverse expressions of user subjective preferences and recommendation acts in the real-life shopping scenario.
1 code implementation • 5 Jan 2023 • Yuxing Long, Binyuan Hui, Fulong Ye, Yanyang Li, Zhuoxin Han, Caixia Yuan, Yongbin Li, Xiaojie Wang
Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality.