no code implementations • 18 Sep 2023 • Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
Composed image retrieval is an image retrieval task in which the user provides a reference image as a starting point together with a text specifying how to shift from that starting point to the desired target image.
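As a minimal sketch of this setup, the snippet below fuses a reference-image embedding with a modification-text embedding and ranks a gallery by cosine similarity; the additive fusion and the function names here are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def composed_retrieval(ref_img_emb, text_emb, gallery_embs):
    """Toy composed image retrieval: fuse a reference-image embedding with a
    modification-text embedding, then rank gallery images by similarity.
    The additive fusion is a placeholder, not the paper's approach."""
    query = F.normalize(ref_img_emb + text_emb, dim=-1)   # fused query
    gallery = F.normalize(gallery_embs, dim=-1)
    scores = gallery @ query                              # cosine similarities
    return scores.argsort(descending=True)                # ranked gallery indices

# usage with random stand-in embeddings
ranking = composed_retrieval(torch.randn(512), torch.randn(512), torch.randn(1000, 512))
print(ranking[:5])
```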
no code implementations • 1 Jun 2023 • Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, DaCheng Tao, Tat-Jen Cham
In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models.
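To illustrate the general idea of mixing several condition signals into one control embedding, here is a hypothetical mixer with learned per-modality weights; it is a stand-in sketch, not the paper's gControlNet/ControlNorm design.

```python
import torch
import torch.nn as nn

class ModalityMixer(nn.Module):
    """Illustrative mixer that fuses several condition feature maps (e.g.
    sketch, pose, segmentation) into one control embedding with learned
    softmax weights. A simplification of the paper's actual modules."""
    def __init__(self, num_modalities, channels):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_modalities))
        self.norm = nn.GroupNorm(8, channels)

    def forward(self, feats):                     # feats: (M, B, C, H, W)
        w = torch.softmax(self.weights, dim=0)    # per-modality mixing weights
        mixed = (w.view(-1, 1, 1, 1, 1) * feats).sum(dim=0)
        return self.norm(mixed)                   # one fused control embedding

mixer = ModalityMixer(num_modalities=3, channels=64)
out = mixer(torch.randn(3, 2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```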
no code implementations • 10 May 2023 • Jianbin Zheng, Daqing Liu, Chaoyue Wang, Minghui Hu, Zuopeng Yang, Changxing Ding, DaCheng Tao
To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS).
1 code implementation • 2 Mar 2023 • Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, DaCheng Tao
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
no code implementations • 1 Mar 2023 • Chao Xue, Wei Liu, Shuai Xie, Zhenfang Wang, Jiaxing Li, Xuyang Peng, Liang Ding, Shanshan Zhao, Qiong Cao, Yibo Yang, Fengxiang He, Bohua Cai, Rongcheng Bian, Yiyan Zhao, Heliang Zheng, Xiangyang Liu, Dongkai Liu, Daqing Liu, Li Shen, Chang Li, Shijin Zhang, Yukang Zhang, Guanpu Chen, Shixiang Chen, Yibing Zhan, Jing Zhang, Chaoyue Wang, DaCheng Tao
Automated machine learning (AutoML) seeks to build ML models with minimal human effort.
1 code implementation • 5 Feb 2023 • Zuopeng Yang, Tianshu Chu, Xin Lin, Erdun Gao, Daqing Liu, Jie Yang, Chaoyue Wang
The proposed model incorporates a Bias Elimination Cycle that consists of both a forward path and an inverted path, each featuring a Structural Consistency Cycle to ensure the preservation of image content during the editing process.
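The following sketch shows how a forward/inverted editing cycle with a content-preservation term could look; `edit_fwd` and `edit_inv` are assumed callables, and the L1 reconstruction penalty is an illustrative choice rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(edit_fwd, edit_inv, image, instruction):
    """Hypothetical forward/inverted editing cycle: apply the edit, undo it
    with the inverse path, and penalize deviation from the original image so
    that content is preserved during editing."""
    edited = edit_fwd(image, instruction)       # forward path: apply the edit
    restored = edit_inv(edited, instruction)    # inverted path: undo the edit
    return F.l1_loss(restored, image)           # structural consistency term

# usage with identity stand-ins for the two paths
img = torch.rand(1, 3, 64, 64)
loss = cycle_consistency_loss(lambda x, t: x, lambda x, t: x, img, "add a hat")
print(loss.item())  # 0.0 for the identity stand-ins
```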
1 code implementation • CVPR 2023 • Heng Zhang, Daqing Liu, Qi Zheng, Bing Su
Specifically, we enforce the embeddings of the frame sequence of interest to approximate a goal-oriented stochastic process, i.e., Brownian bridge, in the latent space via a process-based contrastive loss.
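A Brownian bridge pinned at endpoints z_0 and z_T has mean (1 - t/T) z_0 + (t/T) z_T and variance t(T - t)/T at step t. The sketch below scores how well intermediate frame embeddings follow this prior; it illustrates the goal-oriented bridge, not the paper's full contrastive formulation.

```python
import torch

def brownian_bridge_nll(z):
    """Minimal Brownian-bridge objective: given frame embeddings z[0..T],
    intermediate frames should stay close to the bridge mean
    (1 - t/T) * z[0] + (t/T) * z[T], with variance t * (T - t) / T."""
    T = z.shape[0] - 1
    t = torch.arange(1, T, dtype=z.dtype)            # interior time steps
    alpha = (t / T).unsqueeze(-1)
    mu = (1 - alpha) * z[0] + alpha * z[T]           # bridge mean per step
    var = (t * (T - t) / T).unsqueeze(-1)            # bridge variance per step
    return (((z[1:T] - mu) ** 2) / var).mean()       # scaled squared deviation

z = torch.randn(10, 128)  # 10 frame embeddings of dim 128
print(brownian_bridge_nll(z).item())
```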
no code implementations • ICCV 2023 • Heng Zhang, Daqing Liu, Zezhong Lv, Bing Su, DaCheng Tao
Paired video and language data are naturally temporally concurrent, which requires simultaneously modeling the temporal dynamics within each modality and the temporal alignment across modalities.
1 code implementation • 21 Nov 2022 • Qi Zheng, Chaoyue Wang, Daqing Liu, Dadong Wang, DaCheng Tao
For each positive pair, we regard the images from different graphs as negative samples and derive a multi-positive version of contrastive learning.
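A multi-positive contrastive loss treats all samples sharing a graph as positives for each other and everything else as negatives. The sketch below assumes a SupCon-style formulation, which may differ from the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_positive_nce(embs, graph_ids, tau=0.1):
    """Multi-positive contrastive loss: images sharing a scene graph are
    positives for one another; images from different graphs are negatives."""
    z = F.normalize(embs, dim=-1)
    sim = z @ z.t() / tau                                    # pairwise similarities
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool)
    pos = (graph_ids.unsqueeze(0) == graph_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float('-inf')),
                                     dim=1, keepdim=True)
    return -(log_prob * pos.float()).sum(1).div(pos.sum(1).clamp(min=1)).mean()

loss = multi_positive_nce(torch.randn(8, 256), torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
print(loss.item())
```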
1 code implementation • 21 Jun 2022 • Gang Li, Heliang Zheng, Daqing Liu, Chaoyue Wang, Bing Su, Changwen Zheng
In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and we integrate semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy.
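A toy version of semantic-guided masking is sketched below: whole semantic parts are dropped until roughly the target mask ratio is reached. How patches are assigned to parts (e.g. from attention maps) is paper-specific and assumed given here.

```python
import torch

def semantic_guided_mask(part_ids, mask_ratio=0.75):
    """Toy semantic-guided masking for MAE pre-training: instead of masking
    patches uniformly at random, drop whole semantic parts until roughly
    `mask_ratio` of patches are hidden."""
    num_patches = part_ids.numel()
    parts = part_ids.unique()
    parts = parts[torch.randperm(parts.numel())]      # random part order
    mask = torch.zeros(num_patches, dtype=torch.bool)
    for p in parts:                                   # mask part by part
        mask |= (part_ids == p)
        if mask.float().mean() >= mask_ratio:
            break
    return mask                                       # True = masked patch

part_ids = torch.randint(0, 6, (196,))                # 14x14 patches, 6 fake parts
mask = semantic_guided_mask(part_ids)
print(mask.float().mean())
```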
1 code implementation • 14 Jun 2022 • Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang
For another, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers.
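The sketch below shows one plausible reading of fusion-free conditioning: language tokens are concatenated to visual tokens at an intermediate layer and then processed by the same transformer blocks. Layer count and fusion depth are placeholder choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LanguageConditionedViT(nn.Module):
    """Illustrative fusion-free design: language tokens join the visual
    tokens mid-stream and pass through the same uni-modal blocks, so no
    external fusion module is needed."""
    def __init__(self, dim=256, depth=6, fuse_at=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(depth)])
        self.fuse_at = fuse_at

    def forward(self, vis_tokens, lang_tokens):
        x = vis_tokens
        for i, blk in enumerate(self.blocks):
            if i == self.fuse_at:                 # inject language mid-stream
                x = torch.cat([x, lang_tokens], dim=1)
            x = blk(x)                            # uni-modal blocks reused
        return x

model = LanguageConditionedViT()
out = model(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 208, 256])
```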
1 code implementation • CVPR 2022 • Zuopeng Yang, Daqing Liu, Chaoyue Wang, Jie Yang, DaCheng Tao
Compared to existing CNN-based and Transformer-based generation models, which entangle modeling at the pixel-patch level and the object-patch level respectively, the proposed focal attention predicts the current patch token by focusing only on its highly related tokens, as specified by the spatial layout, thereby achieving disambiguation during training.
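A minimal sketch of layout-restricted attention follows: each token attends only to the tokens its layout mask marks as related. The masking scheme shown is illustrative; the paper derives its mask from the spatial layout of objects.

```python
import torch
import torch.nn.functional as F

def focal_attention(q, k, v, layout_mask):
    """Layout-restricted attention: scores for unrelated tokens are set to
    -inf before softmax, so each patch attends only to its related tokens."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~layout_mask, float('-inf'))  # drop unrelated
    return F.softmax(scores, dim=-1) @ v

n, d = 16, 32
mask = torch.rand(n, n) > 0.5
mask |= torch.eye(n, dtype=torch.bool)        # always attend to self
out = focal_attention(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d), mask)
print(out.shape)  # torch.Size([16, 32])
```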
1 code implementation • 6 Jan 2022 • Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Meng Wang
In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed in parallel.
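One simple way to picture parallel bidirectional decoding is to stack a caption with its reversal into one batch so a single shared decoder processes both directions at once; the toy below shows only this batching idea, while the paper's interaction between the two streams is more involved.

```python
import torch

def bidirectional_batch(tokens):
    """Stack a caption and its reversal so one shared decoder can process
    the left-to-right and right-to-left streams in parallel."""
    reversed_tokens = torch.flip(tokens, dims=[1])    # right-to-left stream
    return torch.cat([tokens, reversed_tokens], dim=0)

caps = torch.tensor([[2, 5, 9, 7, 3]])                # one tokenized caption
print(bidirectional_batch(caps))
# tensor([[2, 5, 9, 7, 3],
#         [3, 7, 9, 5, 2]])
```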
1 code implementation • 17 Jul 2020 • Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha
However, existing visual reasoning methods designed for visual question answering are not appropriate for video captioning, which requires more complex visual reasoning over both space and time as well as dynamic module composition along the generation process.
1 code implementation • CVPR 2020 • Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang
Improving the grounding accuracy while retaining the captioning quality would require word-region alignment as strong supervision, which is expensive to collect.
no code implementations • 9 Jun 2019 • Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Meng Wang, Qianru Sun
In this paper, we alleviate the missing-annotation problem and enable joint reasoning by leveraging the language scene graph, which covers both the labeled referent and unlabeled contexts (other objects, attributes, and relationships).
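For concreteness, a language scene graph for the expression "the black dog on the left of the tree" (the running example from the grounding entry below) might be represented as follows; the dictionary layout is an illustrative assumption.

```python
# Toy language scene graph: nodes carry objects and attributes, edges carry
# relationships. Only "dog" is annotated as the referent; the rest is
# unlabeled context that joint reasoning can still exploit.
scene_graph = {
    "nodes": [
        {"id": 0, "object": "dog", "attributes": ["black"], "referent": True},
        {"id": 1, "object": "tree", "attributes": [], "referent": False},
    ],
    "edges": [{"subject": 0, "relation": "left of", "object": 1}],
}
print([n["object"] for n in scene_graph["nodes"] if not n["referent"]])  # context
```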
1 code implementation • 6 Jun 2019 • Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, Feng Wu
With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning.
no code implementations • 5 Jun 2019 • Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, Hanwang Zhang
Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained and compositional language space.
no code implementations • ICCV 2019 • Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha
In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed.
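The bottom-up accumulation can be pictured as a recursion over the dependency tree: each node scores its own word and adds its children's scores. In the real NMTree each node is a neural module computing visual attention; the `word_score` function below is only a stand-in.

```python
def bottom_up_score(node, word_score):
    """Bottom-up accumulation along a dependency tree: a node's grounding
    score is its own word score plus the scores of its subtrees."""
    return word_score(node["word"]) + sum(
        bottom_up_score(child, word_score) for child in node.get("children", []))

# toy dependency tree for "black dog on the left"
tree = {"word": "dog",
        "children": [{"word": "black"},
                     {"word": "left", "children": [{"word": "the"}]}]}
print(bottom_up_score(tree, lambda w: len(w) * 0.1))  # fake per-word scores
```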
1 code implementation • 16 Aug 2018 • Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, Feng Wu
To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning.
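Sequence-level captioning objectives are typically optimized with policy gradients over a whole sampled caption rather than per word. The snippet below is a minimal self-critical REINFORCE sketch under that assumption; CAVP's context-aware policy network itself is not reproduced here.

```python
import torch

def sequence_level_loss(log_probs, reward, baseline):
    """Self-critical policy-gradient sketch for sequence-level captioning:
    weight the sampled caption's log-probability by the advantage of its
    sequence reward (e.g. CIDEr) over a baseline."""
    advantage = reward - baseline                # sequence-level advantage
    return -(advantage * log_probs.sum())        # REINFORCE objective

log_probs = torch.log(torch.rand(12))            # per-token log-probs of a sample
print(sequence_level_loss(log_probs, reward=0.9, baseline=0.7).item())
```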