Search Results for author: Licheng Yu

Found 42 papers, 23 papers with code

Animated Stickers: Bringing Stickers to Life with Video Diffusion

no code implementations • 8 Feb 2024 • David Yan, Winnie Zhang, Luxin Zhang, Anmol Kalia, Dingkang Wang, Ankit Ramchandani, Miao Liu, Albert Pumarola, Edgar Schoenfeld, Elliot Blanchard, Krishna Narni, Yaqiao Luo, Lawrence Chen, Guan Pang, Ali Thabet, Peter Vajda, Amy Bearman, Licheng Yu

Our model is built on top of the state-of-the-art Emu text-to-image model, with the addition of temporal layers to model motion.

Paper
Add Code

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

no code implementations • 29 Dec 2023 • Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu

This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames.

Optical Flow Estimation Video-to-Video Synthesis

Paper
Add Code

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

no code implementations • 20 Dec 2023 • Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, Peter Vajda

In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications.

Data Augmentation Video Editing +1

Paper
Add Code

AVID: Any-Length Video Inpainting with Diffusion Model

1 code implementation • 6 Dec 2023 • Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu

Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact.

Image Inpainting Video Inpainting

Paper
Code

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

no code implementations • 4 Dec 2023 • YuChao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang

In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape.

Video Editing

Paper
Add Code

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

no code implementations • 17 Nov 2023 • Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan

Evaluation results show our method improves visual quality by 14%, prompt alignment by 16. 2% and scene diversity by 15. 3%, compared to prompt engineering the base Emu model for stickers generation.

Image Generation Prompt Engineering

Paper
Add Code

AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes

no code implementations • 24 May 2023 • Barry Menglong Yao, Yu Chen, Qifan Wang, Sijia Wang, Minqian Liu, Zhiyang Xu, Licheng Yu, Lifu Huang

We propose attribute-aware multimodal entity linking, where the input is a mention described with a text and image, and the goal is to predict the corresponding target entity from a multimodal knowledge base (KB) where each entity is also described with a text description, a visual image and a set of attributes and values.

Attribute Entity Linking

Paper
Add Code

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

1 code implementation • CVPR 2023 • Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, Yin Li

In this work, we propose to learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations, without using human annotations.

Paper
Code

Learning and Verification of Task Structure in Instructional Videos

no code implementations • 23 Mar 2023 • Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell

We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.

Activity Recognition

Paper
Add Code

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

1 code implementation • CVPR 2023 • Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning.

Cross-Modal Retrieval Image Captioning +4

Paper
Code

RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data

1 code implementation • 28 Feb 2023 • Sangwoo Mo, Jong-Chyi Su, Chih-Yao Ma, Mido Assran, Ishan Misra, Licheng Yu, Sean Bell

Semi-supervised learning aims to train a model using limited labels.

Density Estimation Image Classification +1

488

Paper
Code

Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook Marketplace

no code implementations • 21 Feb 2023 • Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, Arul Prakash

In this paper we presents Que2Engage, a search EBR system built towards bridging the gap between retrieval and ranking for end-to-end optimizations.

Retrieval

Paper
Add Code

CiT: Curation in Training for Effective Vision-Language Data

1 code implementation • ICCV 2023 • Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.

Paper
Code

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

1 code implementation • CVPR 2023 • Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction.

Ranked #3 on Video Prediction on BAIR Robot Pushing

Text-to-Video Generation Video Generation +1

Paper
Code

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

no code implementations • 26 Oct 2022 • Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, WenWen Jiang, Tao Xiang, Ning Zhang

Additionally, these works have mainly been restricted to multimodal understanding tasks.

Cross-Modal Retrieval Decoder +4

Paper
Add Code

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

1 code implementation • 17 Jul 2022 • Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang

We thus propose a Multi-View Contrastive Learning task for pulling closer the visual representation of one image to the compositional multimodal representation of another image+text.

Contrastive Learning Image Retrieval +2

Paper
Code

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

1 code implementation • 1 Apr 2022 • Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou

In this paper, we introduce a new dataset called Kinetic-GEB+.

Ranked #1 on Boundary Captioning on Kinetics-GEB+

Boundary Captioning Boundary Grounding +2

Paper
Code

LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval

no code implementations • 10 Mar 2022 • Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L. Berg, Licheng Yu

In this work, we propose LoopITR, which combines them in the same network for joint learning.

Retrieval Text Retrieval

Paper
Add Code

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

no code implementations • CVPR 2022 • Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+.

Retrieval Sentence +3

Paper
Add Code

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

no code implementations • 15 Feb 2022 • Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang

We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated to the given piece of content (image, text, image+text), and having the capability to generalize to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc.

Representation Learning Retrieval +1

Paper
Add Code

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

1 code implementation • 8 Jun 2021 • Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu

Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.

Multi-Task Learning Question Answering +5

Paper
Code

Connecting What to Say With Where to Look by Modeling Human Attention Traces

1 code implementation • CVPR 2021 • Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi, Vikas Singh, Amy Bearman

Learning the grounding of each word is challenging, due to noise in the human-provided traces and the presence of words that cannot be meaningfully visually grounded.

Caption Generation Image Captioning +1

Paper
Code

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

1 code implementation • EMNLP 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

Given a video with aligned dialogue, people can often infer what is more likely to happen next.

Paper
Code

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

no code implementations • ECCV 2020 • Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu

To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e. g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e. g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings).

coreference-resolution

Paper
Add Code

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

3 code implementations • EMNLP 2020 • Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu

We present HERO, a novel framework for large-scale video+language omni-representation learning.

Ranked #1 on Video Retrieval on TVR

Language Modelling Masked Language Modeling +8

228

Paper
Code

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

1 code implementation • CVPR 2020 • Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu

We propose a new task towards more practical application for image generation - high-quality image synthesis from salient object layout.

Generative Adversarial Network Hallucination +4

Paper
Code

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

1 code implementation • CVPR 2020 • Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu

We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.

154

Paper
Code

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

2 code implementations • ECCV 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and the methods that built on top of it.

Ranked #2 on Video Retrieval on TVR

Moment Retrieval Retrieval +2

148

Paper
Code

UNITER: Learning UNiversal Image-TExt Representations

no code implementations • 25 Sep 2019 • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are jointly processed for visual and textual understanding.

Image-text matching Language Modelling +10

Paper
Add Code

UNITER: UNiversal Image-TExt Representation Learning

7 code implementations • ECCV 2020 • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i. e., masked language/region modeling is conditioned on full observation of image/text).

Ranked #3 on Visual Question Answering (VQA) on VCR (Q-A) test

Image-text matching Language Modelling +12

762

Paper
Code

TVQA+: Spatio-Temporal Grounding for Video Question Answering

3 code implementations • ACL 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

Ranked #6 on Video Question Answering on TVQA

Question Answering Video Question Answering

120

Paper
Code

Multi-Target Embodied Question Answering

1 code implementation • CVPR 2019 • Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.

Embodied Question Answering Navigate +1

288

Paper
Code

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

1 code implementation • NAACL 2019 • Hao Tan, Licheng Yu, Mohit Bansal

Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.

Ranked #1 on Vision-Language Navigation on Room2Room

Navigate Translation +1

120

Paper
Code

TVQA: Localized, Compositional Video Question Answering

4 code implementations • EMNLP 2018 • Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

Ranked #4 on Video Question Answering on SUTD-TrafficQA

Video Question Answering

160

Paper
Code

MAttNet: Modular Attention Network for Referring Expression Comprehension

1 code implementation • CVPR 2018 • Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.

Ranked #7 on Generalized Referring Expression Segmentation on gRefCOCO

Generalized Referring Expression Segmentation Referring Expression +1

291

Paper
Code

A unified framework for manifold landmarking

no code implementations • 25 Oct 2017 • Hongteng Xu, Licheng Yu, Mark Davenport, Hongyuan Zha

Active manifold learning aims to select and label representative landmarks on a manifold from a given set of samples to improve semi-supervised manifold learning.

Paper
Add Code

Hierarchically-Attentive RNN for Album Summarization and Storytelling

no code implementations • EMNLP 2017 • Licheng Yu, Mohit Bansal, Tamara L. Berg

For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story.

Ranked #15 on Visual Storytelling on VIST (BLEU-3 metric)

Retrieval Visual Storytelling

Paper
Add Code

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

2 code implementations • CVPR 2017 • Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

Referring Expression Referring Expression Comprehension

Paper
Code

Detailed Garment Recovery from a Single-View Image

no code implementations • 3 Aug 2016 • Shan Yang, Tanya Ambert, Zherong Pan, Ke Wang, Licheng Yu, Tamara Berg, Ming C. Lin

Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web.

Semantic Parsing Virtual Try-on

Paper
Add Code

Modeling Context in Referring Expressions

4 code implementations • 31 Jul 2016 • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg

Humans refer to objects in their environments all the time, especially in dialogue with other people.

Referring Expression Referring expression generation +1

398

Paper
Code

Visual Madlibs: Fill in the Blank Description Generation and Question Answering

no code implementations • ICCV 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

In this paper, we introduce a new dataset consisting of 360, 001 focused natural language descriptions for 10, 738 images.

Multiple-choice Question Answering

Paper
Add Code

Visual Madlibs: Fill in the blank Image Generation and Question Answering

no code implementations • 31 May 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

In this paper, we introduce a new dataset consisting of 360, 001 focused natural language descriptions for 10, 738 images.

Image Generation Multiple-choice +1

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.