Search Results for author: Licheng Yu

Found 33 papers, 21 papers with code

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

1 code implementation 4 Mar 2023 Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning.

Cross-Modal Retrieval Image Captioning +4

Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook Marketplace

no code implementations 21 Feb 2023 Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, Arul Prakash

In this paper we present Que2Engage, a search EBR system built to bridge the gap between retrieval and ranking for end-to-end optimization.

Retrieval

CiT: Curation in Training for Effective Vision-Language Data

1 code implementation 5 Jan 2023 Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

1 code implementation 23 Nov 2022 Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction.

Text-to-Video Generation Video Generation +1

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

1 code implementation 17 Jul 2022 Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang

We thus propose a Multi-View Contrastive Learning task for pulling closer the visual representation of one image to the compositional multimodal representation of another image+text.

Contrastive Learning Image Retrieval +2
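
To make the Multi-View Contrastive Learning objective above concrete, here is a minimal PyTorch-style sketch that pulls an image-only embedding toward the compositional image+text embedding of its paired view with an InfoNCE loss; the encoder interfaces, batch pairing, and temperature value are illustrative assumptions, not FashionViL's actual implementation.

```python
import torch
import torch.nn.functional as F

def multi_view_contrastive_loss(img_emb, mm_emb, temperature=0.07):
    """InfoNCE-style loss pulling the visual embedding of one view toward the
    compositional image+text embedding of its paired view (illustrative only).
    img_emb: (B, D) outputs of an image-only encoder.
    mm_emb:  (B, D) outputs of an image+text encoder; row i is the positive
             pair of img_emb[i], all other rows in the batch are negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    mm_emb = F.normalize(mm_emb, dim=-1)
    logits = img_emb @ mm_emb.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy over image->multimodal and multimodal->image directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```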

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

no code implementations 15 Feb 2022 Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang

We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a given piece of content (image, text, image+text), and of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc.

Representation Learning Retrieval +1

Connecting What to Say With Where to Look by Modeling Human Attention Traces

1 code implementation CVPR 2021 Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi, Vikas Singh, Amy Bearman

Learning the grounding of each word is challenging, due to noise in the human-provided traces and the presence of words that cannot be meaningfully visually grounded.

Image Captioning Visual Grounding

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

1 code implementation EMNLP 2020 Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

Given a video with aligned dialogue, people can often infer what is more likely to happen next.

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

no code implementations ECCV 2020 Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu

To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings).

coreference-resolution Coreference Resolution

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

1 code implementation CVPR 2020 Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu

We propose a new task toward a more practical application of image generation: high-quality image synthesis from salient object layout.

Image Generation Retrieval

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

1 code implementation CVPR 2020 Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu

We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

2 code implementations ECCV 2020 Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and the methods built on top of it.

Moment Retrieval Retrieval +2

UNITER: UNiversal Image-TExt Representation Learning

6 code implementations ECCV 2020 Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).

Language Modelling Masked Language Modeling +11
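
The conditional masking idea above (corrupt one modality at a time, conditioning on the full other modality) can be pictured with a small sketch; the tensor shapes, mask token id, and masking probability are assumptions for illustration, not UNITER's exact implementation.

```python
import torch

def conditional_masking(text_ids, region_feats, mask_token_id=103, p=0.15):
    """Illustrative conditional masking: on each step, corrupt EITHER the text
    tokens (masked language modeling) OR the image regions (masked region
    modeling), never both, so the masked modality is always predicted from a
    fully observed partner modality.
    text_ids:     (B, T) token ids
    region_feats: (B, R, D) visual region features
    """
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if torch.rand(1).item() < 0.5:
        # Masked language modeling conditioned on all image regions
        mask = torch.rand_like(text_ids, dtype=torch.float) < p
        text_labels = torch.where(mask, text_ids, torch.full_like(text_ids, -100))
        text_ids[mask] = mask_token_id
        region_labels = None
    else:
        # Masked region modeling conditioned on the full text
        mask = torch.rand(region_feats.shape[:2], device=region_feats.device) < p
        region_labels = region_feats.clone()
        region_feats[mask] = 0.0
        text_labels = None
    return text_ids, region_feats, text_labels, region_labels
```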

UNITER: Learning UNiversal Image-TExt Representations

no code implementations 25 Sep 2019 Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding.

Language Modelling Masked Language Modeling +9

TVQA+: Spatio-Temporal Grounding for Video Question Answering

3 code implementations ACL 2020 Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

Question Answering Video Question Answering

Multi-Target Embodied Question Answering

1 code implementation CVPR 2019 Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.

Embodied Question Answering Navigate +1
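
As a rough picture of the modular design named above, the sketch below wires a program generator, controller, navigator, and VQA module into one answer loop; all module interfaces and the combination step are hypothetical placeholders, not the paper's components.

```python
def multi_target_eqa(question, program_generator, controller, navigator, vqa):
    """Illustrative composition of the four modules: decompose the question into
    sub-programs (one per target), navigate to each target, answer per target,
    then combine. All call signatures here are assumptions for illustration.
    """
    sub_programs = program_generator(question)          # e.g. one program per target object
    partial_answers = []
    for program in sub_programs:
        goal = controller.select_goal(program)          # which object/room to seek next
        observation = navigator.go_to(goal)             # egocentric views at the target
        partial_answers.append(vqa(observation, program))
    return controller.combine(partial_answers)          # e.g. compare attributes across targets
```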

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

1 code implementation NAACL 2019 Hao Tan, Licheng Yu, Mohit Bansal

Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.

Navigate Translation +1
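
A rough sketch of the back-translation loop described above, assuming a trained speaker model that can generate an instruction for a path sampled in a feature-dropped ("environmental dropout") environment; the function names and interfaces are hypothetical placeholders, not the paper's actual code.

```python
def augment_with_back_translation(environments, speaker, sample_path, feature_dropout=0.4):
    """Illustrative semi-supervised augmentation: apply environmental dropout
    (randomly removing visual features to mimic unseen environments), sample new
    navigation paths in the dropped-out environments, and have a speaker model
    back-translate each path into a synthetic instruction.
    """
    augmented = []
    for env in environments:
        dropped_env = env.drop_features(p=feature_dropout)   # hypothetical environment API
        path = sample_path(dropped_env)                       # e.g. a random walk or shortest path
        instruction = speaker.generate(dropped_env, path)     # back-translation step
        augmented.append((dropped_env, path, instruction))
    return augmented
```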

A unified framework for manifold landmarking

no code implementations 25 Oct 2017 Hongteng Xu, Licheng Yu, Mark Davenport, Hongyuan Zha

Active manifold learning aims to select and label representative landmarks on a manifold from a given set of samples to improve semi-supervised manifold learning.

Hierarchically-Attentive RNN for Album Summarization and Storytelling

no code implementations EMNLP 2017 Licheng Yu, Mohit Bansal, Tamara L. Berg

For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story.

Retrieval Visual Storytelling
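
A hedged sketch of the three-stage composition described above (encode album photos, select summary photos, compose the story), using generic PyTorch modules; the layer sizes, selection rule, and single-RNN decoder are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AlbumStoryteller(nn.Module):
    """Illustrative pipeline: encode album photos with an RNN, attend over them
    to pick summary photos, then decode a story over the selected photos."""
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, num_summary=5):
        super().__init__()
        self.photo_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.selector = nn.Linear(hidden, 1)             # attention score per photo
        self.story_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab_size)
        self.num_summary = num_summary

    def forward(self, album_feats):                      # (B, N_photos, feat_dim)
        enc, _ = self.photo_encoder(album_feats)         # (B, N, hidden)
        scores = self.selector(enc).squeeze(-1)          # (B, N)
        top = scores.topk(self.num_summary, dim=1).indices
        summary = torch.gather(enc, 1, top.unsqueeze(-1).expand(-1, -1, enc.size(-1)))
        dec, _ = self.story_decoder(summary)             # decode over the summary photos
        return self.word_head(dec)                       # (B, num_summary, vocab) word logits
```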

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

2 code implementations CVPR 2017 Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

Referring Expression Referring Expression Comprehension
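
One way to picture the reinforcer's role described above: sample candidate referring expressions from the speaker and use a reward, such as whether the listener grounds them to the correct object, to prefer more discriminative expressions. The sketch below is a hedged illustration with hypothetical speaker/listener interfaces, not the paper's training code.

```python
def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def rerank_by_listener_reward(speaker, listener, image, target_box, num_samples=10):
    """Illustrative reinforcer-style reranking: keep the sampled expression that
    the listener comprehends best, i.e. grounds closest to the target object.
    speaker.sample and listener.ground are hypothetical interfaces.
    """
    candidates = [speaker.sample(image, target_box) for _ in range(num_samples)]
    def reward(expression):
        predicted_box = listener.ground(image, expression)
        return iou(predicted_box, target_box)
    return max(candidates, key=reward)
```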

Detailed Garment Recovery from a Single-View Image

no code implementations 3 Aug 2016 Shan Yang, Tanya Ambert, Zherong Pan, Ke Wang, Licheng Yu, Tamara Berg, Ming C. Lin

Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web.

Semantic Parsing Virtual Try-on

Visual Madlibs: Fill in the blank Image Generation and Question Answering

no code implementations 31 May 2015 Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images.

Image Generation Multiple-choice +1
