Search Results for author: Chenfei Wu

Found 28 papers, 15 papers with code

Using Left and Right Brains Together: Towards Vision and Language Planning

no code implementations 16 Feb 2024 Jun Cen, Chenfei Wu, Xiao Liu, Shengming Yin, Yixuan Pei, Jinglong Yang, Qifeng Chen, Nan Duan, JianGuo Zhang

Large Language Models (LLMs) and Large Multi-modality Models (LMMs) have demonstrated remarkable decision-making capabilities on a variety of tasks.

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

no code implementations 30 Jan 2024 Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan

To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, while disrupting the model's ability to capture the true semantic representation of visual scenes.

Vector Graphics

EIPE-text: Evaluation-Guided Iterative Plan Extraction for Long-Form Narrative Text Generation

no code implementations 12 Oct 2023 Wang You, Wenshan Wu, Yaobo Liang, Shaoguang Mao, Chenfei Wu, Maosong Cao, Yuzhe Cai, Yiduo Guo, Yan Xia, Furu Wei, Nan Duan

In this paper, we propose a new framework called Evaluation-guided Iterative Plan Extraction for long-form narrative text generation (EIPE-text), which extracts plans from the corpus of narratives and utilizes the extracted plans to construct a better planner.

In-Context Learning, Text Generation

LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models

1 code implementation 18 Sep 2023 Zecheng Tang, Chenfei Wu, Juntao Li, Nan Duan

Graphic layout generation, a growing research field, plays a significant role in user engagement and information perception.

Code Completion, Code Generation

ORES: Open-vocabulary Responsible Visual Synthesis

1 code implementation 26 Aug 2023 Minheng Ni, Chenfei Wu, Xiaodong Wang, Shengming Yin, Lijuan Wang, Zicheng Liu, Nan Duan

In this work, we formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts while allowing users to input any desired content.

Image Generation, Language Modelling

GameEval: Evaluating LLMs on Conversational Games

1 code implementation 19 Aug 2023 Dan Qiao, Chenfei Wu, Yaobo Liang, Juntao Li, Nan Duan

In this paper, we propose GameEval, a novel approach to evaluating LLMs through goal-driven conversational games, overcoming the limitations of previous methods.

Question Answering

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

no code implementations 16 Aug 2023 Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan

Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.

Trajectory Modeling, Video Generation

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

1 code implementation 31 May 2023 Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

With only 4M VLP data, ManagerTower achieves superior performances on various downstream VL tasks, especially 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K.

Representation Learning

Learning to Plan with Natural Language

1 code implementation 20 Apr 2023 Yiduo Guo, Yaobo Liang, Chenfei Wu, Wenshan Wu, Dongyan Zhao, Nan Duan

To obtain it, we propose the Learning to Plan method, which involves two phases: (1) In the first learning task plan phase, it iteratively updates the task plan with new step-by-step solutions and behavioral instructions, which are obtained by prompting LLMs to derive from training error feedback.

Transfer Learning

Low-code LLM: Visual Programming over LLMs

1 code implementation 17 Apr 2023 Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, Jonathan Tien, Nan Duan

The proposed Low-code LLM framework consists of a Planning LLM that designs a structured planning workflow for complex tasks, which can be correspondingly edited and confirmed by users through low-code visual programming operations, and an Executing LLM that generates responses following the user-confirmed workflow.

Prompt Engineering

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

no code implementations 29 Mar 2023 Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, Nan Duan

On the other hand, there are also many existing models and systems (symbolic-based or neural-based) that can do some domain-specific tasks very well.

Code Generation, Common Sense Reasoning, +1

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

2 code implementations 8 Mar 2023 Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan

To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only languages but also images and 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps.

ReCo: Region-Controlled Text-to-Image Generation

no code implementations CVPR 2023 Zhengyuan Yang, JianFeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang

Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.

Conditional Text-to-Image Synthesis, Position

HORIZON: High-Resolution Semantically Controlled Panorama Synthesis

no code implementations 10 Oct 2022 Kun Yan, Lei Ji, Chenfei Wu, Jian Liang, Ming Zhou, Nan Duan, Shuai Ma

Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds.

Vocal Bursts Intensity Prediction

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

1 code implementation 20 Jul 2022 Chenfei Wu, Jian Liang, Xiaowei Hu, Zhe Gan, JianFeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan

In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos.

Image Outpainting, Text-to-Image Generation, +1

BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

1 code implementation 17 Jun 2022 Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years.

Representation Learning

DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder

no code implementations 1 Jun 2022 Jie Shi, Chenfei Wu, Jian Liang, Xiang Liu, Nan Duan

Our work proposes a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.

Denoising, Image Generation

NÜWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN

no code implementations 10 Feb 2022 Minheng Ni, Chenfei Wu, Haoyang Huang, Daxin Jiang, WangMeng Zuo, Nan Duan

Language guided image inpainting aims to fill in the defective regions of an image under the guidance of text while keeping non-defective regions unchanged.

Image Inpainting

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

1 code implementation 24 Nov 2021 Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.

Text-to-Image Generation, Text-to-Video Generation, +2

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

1 code implementation Findings (NAACL) 2022 Yongfei Liu, Chenfei Wu, Shao-Yen Tseng, Vasudev Lal, Xuming He, Nan Duan

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning.

Knowledge Distillation, Object, +1

GEM: A General Evaluation Benchmark for Multimodal Tasks

1 code implementation Findings (ACL) 2021 Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but also labeled in multiple languages.

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

1 code implementation 30 Apr 2021 Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, Nan Duan

Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation.

Ranked #16 on Text-to-Video Generation on MSR-VTT (CLIPSIM metric)

Text-to-Video Generation, Video Generation

Deep Reason: A Strong Baseline for Real-World Visual Reasoning

no code implementations 24 May 2019 Chenfei Wu, Yanzhao Zhou, Gen Li, Nan Duan, Duyu Tang, Xiaojie Wang

This paper presents a strong baseline for real-world visual reasoning (GQA), which achieved 60.93% in the GQA 2019 challenge and won sixth place.

Visual Reasoning

Chain of Reasoning for Visual Question Answering

no code implementations NeurIPS 2018 Chenfei Wu, Jinlai Liu, Xiaojie Wang, Xuan Dong

A chain of reasoning (CoR) is constructed for supporting multi-step and dynamic reasoning on changed relations and objects.

Object, Question Answering, +3
