no code implementations • 18 Mar 2025 • Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data.
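A rough sketch of what such a two-stage curriculum could look like in PyTorch follows; the stand-in modules, dummy datasets, and hyperparameters are illustrative assumptions, not the paper's actual training code.

```python
# Hedged sketch of a two-stage curriculum: stage 1 aligns the language model
# with the diffusion backbone on large-scale image-editing pairs, stage 2
# fine-tunes end-to-end on smaller, high-quality multitask video data.
# All module/dataset names below are illustrative, not the paper's code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

mllm = nn.Linear(768, 768)               # stands in for the multimodal LLM
video_diffusion = nn.Linear(768, 768)    # stands in for the video diffusion model
model = nn.Sequential(mllm, video_diffusion)

def run_stage(dataset, trainable, lr, epochs):
    """Train only the listed parameters on the given dataset."""
    opt = torch.optim.AdamW(trainable, lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: alignment on large-scale instructional image-editing data
image_edit_data = TensorDataset(torch.randn(256, 768), torch.randn(256, 768))
run_stage(image_edit_data, list(mllm.parameters()), lr=1e-4, epochs=1)

# Stage 2: end-to-end fine-tuning on high-quality multitask video data
video_data = TensorDataset(torch.randn(64, 768), torch.randn(64, 768))
run_stage(video_data, list(model.parameters()), lr=1e-5, epochs=2)
```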
1 code implementation • 11 Dec 2024 • Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, LiMin Wang
In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation.
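The flywheel described above is essentially an iterative generate-filter-retrain loop. Below is a schematic sketch of that loop; the generator and navigator interfaces and the success criterion are assumptions made for illustration, not the paper's implementation.

```python
# Schematic loop of a self-refining data flywheel: the instruction generator
# labels trajectories, the navigator keeps only pairs it can successfully
# follow, and both models are retrained on the growing filtered pool.
# Function names and the success criterion are illustrative assumptions.
def self_refining_flywheel(generator, navigator, trajectories,
                           rounds=3, success_threshold=0.9):
    data_pool = []
    for _ in range(rounds):
        # 1. Generator proposes an instruction for every trajectory.
        candidates = [(generator.describe(t), t) for t in trajectories]
        # 2. Navigator filters: keep pairs whose instruction it can follow
        #    back to the original trajectory with high fidelity.
        filtered = [(ins, t) for ins, t in candidates
                    if navigator.follow_score(ins, t) >= success_threshold]
        data_pool.extend(filtered)
        # 3. Both models are refined on the enlarged, cleaner pool.
        generator.train_on(data_pool)
        navigator.train_on(data_pool)
    return data_pool
```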
1 code implementation • 7 Dec 2024 • Gengze Zhou, Yicong Hong, Zun Wang, Chongyang Zhao, Mohit Bansal, Qi Wu
The academic field of learning instruction-guided visual navigation can be broadly categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of the language instruction: the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands.
no code implementations • 16 Oct 2024 • Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu
Unlike previous feed-forward models that are limited to processing 1~4 input images and can only reconstruct a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward step.
1 code implementation • 10 Oct 2024 • Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou
Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos.
1 code implementation • 17 Jul 2024 • Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu
Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction-following robotic navigation.
1 code implementation • 3 Jun 2024 • Bahram Mohammadi, Yicong Hong, Yuankai Qi, Qi Wu, Shirui Pan, Javen Qinfeng Shi
To enhance representation, we propose an augmented commonsense knowledge model (ACK) that leverages commonsense information as a spatio-temporal knowledge graph to improve agent navigation.
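A minimal sketch of one way such a spatio-temporal commonsense graph could be organized: observed objects at each (time step, viewpoint) are linked to related commonsense concepts, and the accumulated graph is queried when acting. The data structures and the toy commonsense lookup are assumptions, not the ACK model itself.

```python
# Hypothetical spatio-temporal commonsense graph for a navigation agent.
from collections import defaultdict

class SpatioTemporalKG:
    def __init__(self, commonsense):
        self.commonsense = commonsense            # e.g. {"sofa": ["living room", "tv"]}
        self.edges = defaultdict(set)             # (t, viewpoint) -> {(object, concept)}

    def observe(self, t, viewpoint, objects):
        """Add commonsense edges for objects detected at this step/viewpoint."""
        for obj in objects:
            for concept in self.commonsense.get(obj, []):
                self.edges[(t, viewpoint)].add((obj, concept))

    def query(self, target):
        """Return (step, viewpoint) keys whose commonsense neighborhood mentions the target."""
        return [key for key, rel in self.edges.items()
                if any(target in pair for pair in rel)]

kg = SpatioTemporalKG({"sofa": ["living room", "tv"], "sink": ["kitchen", "bathroom"]})
kg.observe(t=0, viewpoint="v1", objects=["sofa"])
print(kg.query("living room"))  # [(0, 'v1')]
```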
no code implementations • 24 Feb 2024 • Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang
Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions.
1 code implementation • 10 Nov 2023 • Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, Sai Bi
Text-to-3D with diffusion models has achieved remarkable progress in recent years.
1 code implementation • 8 Nov 2023 • Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds.
1 code implementation • ICCV 2023 • Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao
Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.
1 code implementation • ICCV 2023 • Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, Hao Tan
Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot.
2 code implementations • 26 May 2023 • Gengze Zhou, Yicong Hong, Qi Wu
Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling.
1 code implementation • 29 Mar 2023 • Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Gould
Composed image retrieval searches for a target image based on a multi-modal user query comprising a reference image and modification text describing the desired changes.
Ranked #7 on Image Retrieval on Fashion IQ
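For the task setup described in this entry, a generic late-fusion baseline can be sketched as follows: embed the reference image and the modification text, fuse them into a query vector, and rank gallery images by cosine similarity. This is only a simple illustration of the task, not the method proposed in the paper.

```python
# Naive additive-fusion scoring for composed image retrieval (illustrative).
import torch
import torch.nn.functional as F

def retrieve(ref_image_emb, text_emb, gallery_embs, top_k=5):
    """ref_image_emb, text_emb: (d,) tensors; gallery_embs: (N, d) tensor."""
    query = F.normalize(ref_image_emb + text_emb, dim=-1)     # fuse image + text
    gallery = F.normalize(gallery_embs, dim=-1)
    scores = gallery @ query                                   # cosine similarities
    return torch.topk(scores, k=top_k).indices                 # best-matching targets

# toy usage with random embeddings
d, n = 512, 100
print(retrieve(torch.randn(d), torch.randn(d), torch.randn(n, d)))
```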
no code implementations • IEEE Transactions on Pattern Analysis and Machine Intelligence 2023 • Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu
To address these problems, we present a history-enhanced and order-aware pre-training with the complementing fine-tuning paradigm (HOP+) for VLN.
1 code implementation • 23 Jun 2022 • Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, Jing Shao
Our model consists of three modules: the candidate waypoints predictor (CWP), the history-enhanced planner, and the tryout controller.
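One way to picture how the three modules could compose in a single decision step: the waypoint predictor proposes nearby candidate positions, the history-enhanced planner selects one given the instruction and past observations, and the tryout controller executes the low-level motion. The interfaces below are illustrative assumptions and do not mirror the paper's code.

```python
# Schematic composition of the three modules named above (illustrative).
def navigation_step(observation, instruction, history,
                    waypoint_predictor, planner, controller):
    candidates = waypoint_predictor(observation)              # CWP: candidate waypoints
    goal = planner(instruction, history + [observation], candidates)
    actions = controller(observation, goal)                   # tryout low-level control
    return goal, actions
```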
1 code implementation • CVPR 2022 • Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu
Pre-training has been adopted in a few recent works for Vision-and-Language Navigation (VLN).
Ranked #5 on Visual Navigation on R2R
1 code implementation • CVPR 2022 • Yicong Hong, Zun Wang, Qi Wu, Stephen Gould
To bridge the discrete-to-continuous gap, we propose a predictor to generate a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments.
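A hedged sketch of that bridging idea: predict a heatmap over (heading, distance) bins from the panoramic observation and keep its peaks as candidate waypoints, so a discrete, graph-style agent can choose among them in a continuous environment. The bin sizes and the tiny network below are assumptions for illustration.

```python
# Illustrative candidate waypoint predictor over (heading, distance) bins.
import torch
from torch import nn

class WaypointPredictor(nn.Module):
    def __init__(self, feat_dim=512, n_headings=120, n_distances=12):
        super().__init__()
        self.n_headings, self.n_distances = n_headings, n_distances
        self.head = nn.Linear(feat_dim, n_headings * n_distances)

    def forward(self, pano_feat, k=5):
        """pano_feat: (feat_dim,) panoramic feature -> k (heading, distance) pairs."""
        logits = self.head(pano_feat).view(self.n_headings, self.n_distances)
        flat = logits.flatten().topk(k).indices
        headings = (flat // self.n_distances) * (360.0 / self.n_headings)   # degrees
        distances = (flat % self.n_distances + 1) * 0.25                    # metres
        return torch.stack([headings.float(), distances.float()], dim=-1)

print(WaypointPredictor()(torch.randn(512)))  # 5 candidate (heading, distance) waypoints
```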
no code implementations • CVPR 2021 • Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould
In this paper we propose a recurrent BERT model that is time-aware for use in VLN.
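A minimal sketch of the recurrence idea this entry describes: a single state vector is fed into the transformer together with instruction and visual tokens at every step, and its output becomes the state at the next step, giving the otherwise stateless BERT a notion of time. The dimensions and toy encoder are illustrative assumptions, not the paper's architecture.

```python
# Illustrative recurrent-state transformer step for VLN.
import torch
from torch import nn

d = 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)

def vln_step(state, instr_tokens, visual_tokens):
    """state: (1, d); instruction/visual tokens: (L, d). Returns the updated state."""
    tokens = torch.cat([state, instr_tokens, visual_tokens], dim=0).unsqueeze(0)
    out = encoder(tokens).squeeze(0)
    return out[:1]                      # updated state carried to the next step

state = torch.zeros(1, d)
for _ in range(3):                      # unrolled navigation episode
    state = vln_step(state, torch.randn(20, d), torch.randn(36, d))
```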
1 code implementation • 15 Apr 2021 • Jiawei Liu, Jing Zhang, Yicong Hong, Nick Barnes
Within this pipeline, the class activation map (CAM) is obtained and further processed to serve as a pseudo label to train the semantic segmentation model in a fully-supervised manner.
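The CAM step this entry refers to is typically computed by projecting a classifier's final convolutional features through its class weights and thresholding the result into a pseudo mask. A minimal sketch follows; the toy classifier and threshold are illustrative assumptions.

```python
# Sketch of CAM-based pseudo-label generation for weakly supervised segmentation.
import torch
from torch import nn
import torch.nn.functional as F

backbone = nn.Conv2d(3, 64, 3, padding=1)   # stands in for a CNN feature extractor
classifier = nn.Linear(64, 20)              # 20 foreground classes

def cam_pseudo_label(image, class_idx, threshold=0.4):
    feats = backbone(image)                                   # (1, 64, H, W)
    weights = classifier.weight[class_idx].view(1, -1, 1, 1)  # class-specific weights
    cam = F.relu((feats * weights).sum(dim=1, keepdim=True))  # (1, 1, H, W) activation map
    cam = cam / (cam.max() + 1e-8)                            # normalize to [0, 1]
    return (cam > threshold).long()                           # binary pseudo mask

mask = cam_pseudo_label(torch.randn(1, 3, 32, 32), class_idx=3)
```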
1 code implementation • ICCV 2021 • Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton Van Den Hengel, Qi Wu
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
1 code implementation • 26 Nov 2020 • Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould
In this paper we propose a recurrent BERT model that is time-aware for use in VLN.
Ranked #8 on Visual Navigation on R2R
1 code implementation • NeurIPS 2020 • Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen Gould
From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment.
1 code implementation • EMNLP 2020 • Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, Stephen Gould
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.