1 code implementation • 31 Jan 2025 • Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Meanwhile, we design a corresponding Heima Decoder based on traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs.
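The entry above describes a decoder that expands a compressed hidden representation back into a variable-length textual sequence. A minimal sketch of that general idea, assuming a toy greedy unrolling loop (all weights, dimensions, and names here are hypothetical stand-ins, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HID, EOS = 12, 8, 0  # toy vocabulary; token 0 acts as end-of-sequence

# Hypothetical "hidden CoT" vector and decoder weights (random stand-ins).
W_step = rng.normal(size=(HID, HID)) * 0.5   # recurrent state update
W_out = rng.normal(size=(HID, VOCAB))        # projection to vocabulary logits

def decode(hidden, max_len=20):
    """Greedily unroll tokens from a single hidden vector until EOS,
    so different inputs naturally yield different output lengths."""
    tokens = []
    h = hidden.copy()
    for _ in range(max_len):
        logits = h @ W_out
        tok = int(np.argmax(logits))
        if tok == EOS:
            break
        tokens.append(tok)
        h = np.tanh(h @ W_step)  # advance the decoder state
    return tokens

seq = decode(rng.normal(size=HID))
```

The key property illustrated is that the output length is decided by the decoder itself (via the EOS token), not fixed in advance.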
no code implementations • 3 Dec 2024 • Zijiao Yang, Xiangxi Shi, Eric Slyman, Stefan Lee
Assistive embodied agents that can be instructed in natural language to perform tasks in open-world environments have the potential to significantly impact labor tasks like manufacturing or in-home care -- benefiting the lives of those who come to depend on them.
no code implementations • CVPR 2024 • Xiangxi Shi, Zhonghua Wu, Stefan Lee
In this paper, we investigate the significance of viewpoint information in 3D visual grounding -- introducing a model that explicitly predicts the speaker's viewpoint based on the referring expression and scene.
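Once a speaker viewpoint is predicted, egocentric spatial language ("left of the chair") only makes sense after the scene is expressed in that frame. A minimal geometric sketch of that step, assuming a 2D scene and a predicted yaw angle (function name and setup are hypothetical):

```python
import numpy as np

def to_speaker_frame(points, speaker_yaw):
    """Rotate scene points (N, 2) into a frame aligned with the
    predicted speaker viewpoint, so egocentric words like 'left'
    become geometrically consistent."""
    c, s = np.cos(-speaker_yaw), np.sin(-speaker_yaw)
    R = np.array([[c, -s],
                  [s,  c]])
    return points @ R.T

# A point directly ahead of a speaker facing +y (yaw = 90 degrees)...
pts = np.array([[0.0, 1.0]])
rotated = to_speaker_frame(pts, np.pi / 2)
# ...lands on the +x axis in the speaker-centric frame.
```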
1 code implementation • ICCV 2021 • Zhonghua Wu, Xiangxi Shi, Guosheng Lin, Jianfei Cai
To explicitly learn meta-class representations in the few-shot segmentation task, we propose a novel Meta-class Memory based few-shot segmentation method (MM-Net), where we introduce a set of learnable memory embeddings to memorize meta-class information during base-class training and transfer it to novel classes at inference.
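The core mechanism described is a set of learnable memory embeddings that features can read from. A common way to realize such a read is attention over the memory slots; a toy sketch under that assumption (slot count, dimensions, and random weights are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N_MEM, DIM = 4, 16  # hypothetical number of memory slots and feature size

# Learnable memory embeddings (random stand-ins for trained parameters).
memory = rng.normal(size=(N_MEM, DIM))

def read_memory(query):
    """Attend over the memory slots: softmax similarity -> weighted sum."""
    scores = memory @ query                   # (N_MEM,) slot similarities
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ memory                   # (DIM,) memory-enriched feature

feat = read_memory(rng.normal(size=DIM))
```

Because the memory is shared across classes, the same read operation applies unchanged to novel classes at inference, which is the point of the transfer.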
no code implementations • 27 Jul 2021 • Xiangxi Shi, Zhonghua Wu, Guosheng Lin, Jianfei Cai, Shafiq Joty
Therefore, in this paper, we propose a memory-based Image Manipulation Network (MIM-Net), where a set of memories learned from images is introduced to synthesize texture information under the guidance of the textual description.
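A sketch of the described idea: texture memories are retrieved by the text and then blended into the image feature. The retrieval and the sigmoid gate below are plausible stand-ins, not the paper's actual operators:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, N_MEM = 8, 5  # hypothetical feature size and memory count

# Texture memories learned from images (random stand-ins).
textures = rng.normal(size=(N_MEM, DIM))

def manipulate(img_feat, txt_feat):
    """Retrieve texture memories matching the text, then gate the
    retrieved texture into the image feature (toy illustration)."""
    scores = textures @ txt_feat              # text-to-memory similarity
    w = np.exp(scores - scores.max())
    w /= w.sum()
    retrieved = w @ textures                  # text-selected texture
    gate = 1.0 / (1.0 + np.exp(-(img_feat * txt_feat)))  # sigmoid gate
    return gate * retrieved + (1 - gate) * img_feat

out = manipulate(rng.normal(size=DIM), rng.normal(size=DIM))
```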
no code implementations • ECCV 2020 • Xiangxi Shi, Xu Yang, Jiuxiang Gu, Shafiq Joty, Jianfei Cai
In this paper, we propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task.
no code implementations • 21 Jul 2019 • Xiangxi Shi, Jianfei Cai, Shafiq Joty, Jiuxiang Gu
With the rapid growth of video data and the increasing demands of applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received significant attention in the computer vision and natural language processing fields.
no code implementations • 8 Jul 2018 • Xiangxi Shi, Jianfei Cai, Jiuxiang Gu, Shafiq Joty
In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder, working as a global (caption-level) language model, and a low-level GRU-based language decoder, working as a local (phrase-level) language model.
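The two-level scheme above can be sketched as a loop in which the low-level state runs every word while the high-level state updates only at phrase boundaries. This toy version replaces the real GRUs with a simple tanh recurrence and takes boundaries as given (all weights and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
HID = 6

def gru_like(h, x, W):
    """Toy recurrent update standing in for a real GRU cell."""
    return np.tanh(W @ np.concatenate([h, x]))

W_hi = rng.normal(size=(HID, 2 * HID)) * 0.4  # caption-level decoder weights
W_lo = rng.normal(size=(HID, 2 * HID)) * 0.4  # phrase-level decoder weights

def decode(word_embs, boundaries):
    """Advance the low-level state at every word; at a detected phrase
    boundary, fold the finished phrase into the high-level state and
    reset the low-level one."""
    h_hi = np.zeros(HID)
    h_lo = np.zeros(HID)
    outs = []
    for x, is_boundary in zip(word_embs, boundaries):
        if is_boundary:
            h_hi = gru_like(h_hi, h_lo, W_hi)  # absorb the finished phrase
            h_lo = np.zeros(HID)               # restart the local model
        h_lo = gru_like(h_lo, x, W_lo)
        outs.append(h_lo + h_hi)               # fuse global and local context
    return np.stack(outs)

out = decode(rng.normal(size=(5, HID)), [0, 0, 1, 0, 0])
```

The reset at each boundary is what lets the low-level decoder stay a purely local (phrase-level) model while the high-level state carries caption-level context.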