Search Results for author: Daqing Liu

Found 20 papers, 12 papers with code

Decompose Semantic Shifts for Composed Image Retrieval

no code implementations18 Sep 2023 Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang

Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image.

Image Retrieval Retrieval

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

no code implementations1 Jun 2023 Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, DaCheng Tao, Tat-Jen Cham

In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models.

Conditional Image Generation

MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

no code implementations10 May 2023 Jianbin Zheng, Daqing Liu, Chaoyue Wang, Minghui Hu, Zuopeng Yang, Changxing Ding, DaCheng Tao

To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i. e., composed multimodal conditional image synthesis (CMCIS).

Image Generation

ESceme: Vision-and-Language Navigation with Episodic Scene Memory

1 code implementation2 Mar 2023 Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, DaCheng Tao

Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.

Vision and Language Navigation

Eliminating Contextual Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion

1 code implementation5 Feb 2023 Zuopeng Yang, Tianshu Chu, Xin Lin, Erdun Gao, Daqing Liu, Jie Yang, Chaoyue Wang

The proposed model incorporates a Bias Elimination Cycle that consists of both a forward path and an inverted path, each featuring a Structural Consistency Cycle to ensure the preservation of image content during the editing process.

Text-to-Image Generation

Exploring Temporal Concurrency for Video-Language Representation Learning

no code implementations ICCV 2023 Heng Zhang, Daqing Liu, Zezhong Lv, Bing Su, DaCheng Tao

Paired video and language data is naturally temporal concurrency, which requires the modeling of the temporal dynamics within each modality and the temporal alignment across modalities simultaneously.

Dynamic Time Warping Metric Learning +6

Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning

1 code implementation CVPR 2023 Heng Zhang, Daqing Liu, Qi Zheng, Bing Su

Specifically, we enforce the embeddings of the frame sequence of interest to approximate a goal-oriented stochastic process, i. e., Brownian bridge, in the latent space via a process-based contrastive loss.

Contrastive Learning Representation Learning +3

Cross-Modal Contrastive Learning for Robust Reasoning in VQA

1 code implementation21 Nov 2022 Qi Zheng, Chaoyue Wang, Daqing Liu, Dadong Wang, DaCheng Tao

For each positive pair, we regard the images from different graphs as negative samples and deduct the version of multi-positive contrastive learning.

Contrastive Learning Question Answering +1

SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

1 code implementation21 Jun 2022 Gang Li, Heliang Zheng, Daqing Liu, Chaoyue Wang, Bing Su, Changwen Zheng

In this paper, we explore a potential visual analogue of words, i. e., semantic parts, and we integrate semantic information into the training process of MAE by proposing a Semantic-Guided Masking strategy.

Language Modelling Masked Language Modeling +1

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

1 code implementation14 Jun 2022 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

For another, we devise Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers.

Visual Grounding

Modeling Image Composition for Complex Scene Generation

1 code implementation CVPR 2022 Zuopeng Yang, Daqing Liu, Chaoyue Wang, Jie Yang, DaCheng Tao

Compared to existing CNN-based and Transformer-based generation models that entangled modeling on pixel-level&patch-level and object-level&patch-level respectively, the proposed focal attention predicts the current patch token by only focusing on its highly-related tokens that specified by the spatial layout, thereby achieving disambiguation during training.

Layout-to-Image Generation Object +1

Compact Bidirectional Transformer for Image Captioning

1 code implementation6 Jan 2022 Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Meng Wang

In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed parallelly.

Image Captioning Sentence

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

1 code implementation17 Jul 2020 Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha

However, existing visual reasoning methods designed for visual question answering are not appropriate to video captioning, for it requires more complex visual reasoning on videos over both space and time, and dynamic module composition along the generation process.

Question Answering Sentence +3

More Grounded Image Captioning by Distilling Image-Text Matching Model

1 code implementation CVPR 2020 Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang

To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect the word-region alignment as strong supervision.

Image Captioning Image-text matching +4

Joint Visual Grounding with Language Scene Graphs

no code implementations9 Jun 2019 Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Meng Wang, Qianru Sun

In this paper, we alleviate the missing-annotation problem and enable the joint reasoning by leveraging the language scene graph which covers both labeled referent and unlabeled contexts (other objects, attributes, and relationships).

Referring Expression Visual Grounding

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

1 code implementation6 Jun 2019 Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, Feng Wu

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i. e., the task of image captioning.

Image Captioning Image Paragraph Captioning +2

Learning to Compose and Reason with Language Tree Structures for Visual Grounding

no code implementations5 Jun 2019 Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, Hanwang Zhang

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained and compositional language space.

Visual Grounding Visual Reasoning

Learning to Assemble Neural Module Tree Networks for Visual Grounding

no code implementations ICCV 2019 Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha

In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction where as needed.

Dependency Parsing Natural Language Visual Grounding +5

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

1 code implementation16 Aug 2018 Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, Feng Wu

To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning.

Image Captioning Reinforcement Learning (RL)

Cannot find the paper you are looking for? You can Submit a new open access paper.