Search Results for author: Lele Cheng

Found 6 papers, 1 papers with code

Decouple Content and Motion for Conditional Image-to-Video Generation

no code implementations • 24 Nov 2023 • Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, Jinzhi Wang

The goal of conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i. e., one image and text. The previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity.

Image to Video Generation

Paper
Add Code

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

1 code implementation • 24 Nov 2023 • Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang

In this paper, we introduce an information-enriched diffusion model for paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.

Image Generation Language Modelling +1

Paper
Code

TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval

no code implementations • 28 Sep 2022 • Xiaohan Zou, Changqiao Wu, Lele Cheng, Zhongyuan Wang

Most existing methods in vision-language retrieval match two modalities by either comparing their global feature vectors which misses sufficient information and lacks interpretability, detecting objects in images or videos and aligning the text with fine-grained features which relies on complicated model designs, or modeling fine-grained interaction via cross-attention upon visual and textual tokens which suffers from inferior efficiency.

Retrieval Text Retrieval +1

Paper
Add Code

LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

no code implementations • 11 Jul 2022 • Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng

To overcome the above issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, which bridges the gap between visual and textual domains.

Ranked #11 on Zero-Shot Video Retrieval on MSVD