1 code implementation • 4 Mar 2023 • Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning.
1 code implementation • 28 Feb 2023 • Sangwoo Mo, Jong-Chyi Su, Chih-Yao Ma, Mido Assran, Ishan Misra, Licheng Yu, Sean Bell
Semi-supervised learning aims to train a model using limited labels.
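The snippet above is about learning from limited labels. As a generic illustration only (not this paper's method), here is a minimal pseudo-labeling training step in PyTorch; the model, optimizer, and confidence threshold are placeholder assumptions.

```python
# Minimal pseudo-labeling sketch for semi-supervised learning (illustrative only,
# not the method of the paper above). `model` is any generic classifier.
import torch
import torch.nn.functional as F

def semi_supervised_step(model, optimizer, labeled_batch, unlabeled_batch,
                         threshold=0.95, unlabeled_weight=1.0):
    x_l, y_l = labeled_batch           # small labeled batch
    x_u = unlabeled_batch              # larger unlabeled batch

    # Supervised loss on the few available labels.
    logits_l = model(x_l)
    loss = F.cross_entropy(logits_l, y_l)

    # Pseudo-label confident unlabeled examples and train on them as well.
    with torch.no_grad():
        probs_u = F.softmax(model(x_u), dim=-1)
        conf, pseudo_y = probs_u.max(dim=-1)
        mask = conf >= threshold       # keep only high-confidence predictions

    if mask.any():
        logits_u = model(x_u[mask])
        loss = loss + unlabeled_weight * F.cross_entropy(logits_u, pseudo_y[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```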
no code implementations • 21 Feb 2023 • Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, Arul Prakash
In this paper, we present Que2Engage, a search EBR system built to bridge the gap between retrieval and ranking for end-to-end optimization.
1 code implementation • 5 Jan 2023 • Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.
1 code implementation • 23 Nov 2022 • Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell
Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction.
Ranked #3 on Video Prediction on BAIR Robot Pushing
no code implementations • 26 Oct 2022 • Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, WenWen Jiang, Tao Xiang, Ning Zhang
Additionally, these works have mainly been restricted to multimodal understanding tasks.
1 code implementation • 17 Jul 2022 • Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang
We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image+text pair.
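To make the objective concrete, below is a minimal InfoNCE-style sketch of a contrastive loss that pulls an image embedding toward the composed image+text embedding of its paired sample. The encoders producing `img_emb` and `fused_emb` are assumed to exist elsewhere, and this is a sketch of the general objective, not the paper's exact formulation.

```python
# Sketch of a contrastive objective between visual embeddings and composed
# image+text embeddings (illustrative placeholders, not the paper's exact loss).
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(img_emb, fused_emb, temperature=0.07):
    """img_emb:   (B, D) visual embeddings of images
       fused_emb: (B, D) compositional embeddings of the paired image+text."""
    img_emb = F.normalize(img_emb, dim=-1)
    fused_emb = F.normalize(fused_emb, dim=-1)

    # Cosine similarity between every image and every composed representation.
    logits = img_emb @ fused_emb.t() / temperature   # (B, B)

    # Matching pairs lie on the diagonal; use a symmetric InfoNCE loss.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```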
1 code implementation • 1 Apr 2022 • Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou
In this paper, we introduce a new dataset called Kinetic-GEB+.
Ranked #1 on Text to Video Retrieval on Kinetics-GEB+ (text-to-video R@1 metric)
no code implementations • 10 Mar 2022 • Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L. Berg, Licheng Yu
In this work, we propose LoopITR, which combines them in the same network for joint learning.
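For context, a common way to combine the two encoder types is to retrieve with the fast dual encoder and re-rank a short list with the slower cross encoder. The sketch below illustrates that generic retrieve-then-rerank pipeline with hypothetical `text_encoder` and `cross_scorer` callables; it does not reproduce LoopITR's specific joint-learning scheme.

```python
# Generic dual-encoder retrieval followed by cross-encoder re-ranking
# (hypothetical interfaces, for illustration only).
import torch

def retrieve_and_rerank(query_text, text_encoder, image_embs, images, cross_scorer, k=20):
    """image_embs: (N, D) dual-encoder image embeddings; images: the N raw images.
       text_encoder and cross_scorer are assumed callables."""
    # Stage 1: cheap dot-product retrieval with the dual encoder over the whole gallery.
    q = text_encoder(query_text)                        # (D,)
    sims = image_embs @ q                               # (N,)
    shortlist = torch.topk(sims, k=min(k, sims.numel())).indices.tolist()

    # Stage 2: expensive cross-encoder re-ranking on the short list only.
    rescored = [(i, float(cross_scorer(query_text, images[i]))) for i in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in rescored]
```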
no code implementations • CVPR 2022 • Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang
We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+.
no code implementations • 15 Feb 2022 • Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang
We introduce CommerceMM, a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a given piece of content (image, text, or image+text), and of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc.
1 code implementation • 8 Jun 2021 • Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
1 code implementation • CVPR 2021 • Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi, Vikas Singh, Amy Bearman
Learning the grounding of each word is challenging, due to noise in the human-provided traces and the presence of words that cannot be meaningfully visually grounded.
1 code implementation • EMNLP 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
Given a video with aligned dialogue, people can often infer what is more likely to happen next.
no code implementations • ECCV 2020 • Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu
To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings).
3 code implementations • EMNLP 2020 • Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Ranked #1 on Video Question Answering on Howto100M-QA
1 code implementation • CVPR 2020 • Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu
We propose a new task towards a more practical application of image generation: high-quality image synthesis from salient object layout.
1 code implementation • CVPR 2020 • Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
2 code implementations • ECCV 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
The queries are also labeled with query types that indicate whether each of them is more related to video, subtitle, or both, allowing for in-depth analysis of the dataset and the methods built on top of it.
Ranked #2 on Video Retrieval on TVR
6 code implementations • ECCV 2020 • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
Ranked #2 on Visual Question Answering on VCR (Q-A) test
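The conditional masking described in the entry above can be made concrete with a small sketch: at each step only one modality is corrupted, so masked language modeling sees every image region and masked region modeling sees the full text. The tensor shapes and masking probability below are placeholder assumptions, not the paper's exact procedure.

```python
# Sketch of conditional masking: corrupt only one modality per step.
import torch

def conditional_mask(text_ids, region_feats, mask_token_id, p=0.15):
    """text_ids: (L,) token ids; region_feats: (R, D) visual region features."""
    text_ids, region_feats = text_ids.clone(), region_feats.clone()
    if torch.rand(()) < 0.5:
        # Masked language modeling: corrupt text, keep every image region intact.
        text_mask = torch.rand(text_ids.shape) < p
        text_ids[text_mask] = mask_token_id
        region_mask = torch.zeros(region_feats.size(0), dtype=torch.bool)
    else:
        # Masked region modeling: corrupt regions, keep the full text intact.
        region_mask = torch.rand(region_feats.size(0)) < p
        region_feats[region_mask] = 0.0
        text_mask = torch.zeros(text_ids.shape, dtype=torch.bool)
    return text_ids, region_feats, text_mask, region_mask
```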
no code implementations • 25 Sep 2019 • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding.
3 code implementations • ACL 2020 • Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
Ranked #4 on Video Question Answering on TVQA
1 code implementation • CVPR 2019 • Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra
To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.
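A rough sketch of how such a modular pipeline could be wired together is shown below; every module and the environment interface are hypothetical callables used only to illustrate the flow from question to program to navigation to answer, not the paper's implementation.

```python
# Illustrative wiring of a modular embodied-QA pipeline (hypothetical interfaces).
def modular_eqa(question, env, program_generator, controller, navigator, vqa):
    """All arguments are placeholder callables/objects used only for illustration."""
    obs = env.reset()                                 # initial first-person observation
    program = program_generator(question)             # decompose the question into sub-goals
    for subgoal in program:
        # The controller decides when the current sub-goal has been reached.
        while not controller(subgoal, obs):
            action = navigator(subgoal, obs)          # pick the next navigation action
            obs = env.step(action)                    # hypothetical environment API
    return vqa(question, obs)                         # answer from the final observation
```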
1 code implementation • NAACL 2019 • Hao Tan, Licheng Yu, Mohit Bansal
Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.
Ranked #1 on Vision-Language Navigation on Room2Room
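The back-translation step mentioned in the entry above can be sketched as follows: sample trajectories in feature-dropped copies of the training environments and let a trained speaker model label them with instructions. The `drop_features`, `sample_path`, and `speaker.generate` interfaces below are hypothetical placeholders, not the paper's code.

```python
# Sketch of back-translation style data augmentation over dropped-out environments
# (hypothetical object interfaces, for illustration only).
import random

def back_translate(environments, sample_path, speaker, n_per_env=10, dropout_p=0.4):
    augmented = []
    for env in environments:
        env_aug = env.drop_features(p=dropout_p)            # "new" environment via feature dropout
        for _ in range(n_per_env):
            path = sample_path(env_aug)                     # sample an unlabeled trajectory
            instruction = speaker.generate(env_aug, path)   # back-translated instruction
            augmented.append((env_aug, path, instruction))
    random.shuffle(augmented)
    return augmented
```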
4 code implementations • EMNLP 2018 • Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.
Ranked #3 on Video Question Answering on SUTD-TrafficQA
1 code implementation • CVPR 2018 • Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Ranked #12 on Referring Expression Segmentation on RefCOCO+ testA
no code implementations • 25 Oct 2017 • Hongteng Xu, Licheng Yu, Mark Davenport, Hongyuan Zha
Active manifold learning aims to select and label representative landmarks on a manifold from a given set of samples to improve semi-supervised manifold learning.
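As a concrete but generic illustration of landmark selection, the sketch below uses greedy farthest-point sampling to pick well-spread landmarks from a sample set; this is a simple heuristic for intuition only, not the selection criterion proposed in the paper.

```python
# Greedy farthest-point sampling as a toy landmark-selection heuristic
# (illustrative only, not the paper's method).
import numpy as np

def farthest_point_landmarks(X, n_landmarks, seed=0):
    """X: (n, d) samples. Returns indices of landmarks spread across the data."""
    rng = np.random.default_rng(seed)
    landmarks = [int(rng.integers(len(X)))]           # start from a random sample
    dist = np.linalg.norm(X - X[landmarks[0]], axis=1)
    for _ in range(n_landmarks - 1):
        nxt = int(np.argmax(dist))                    # farthest point from current landmarks
        landmarks.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return landmarks
```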
no code implementations • EMNLP 2017 • Licheng Yu, Mohit Bansal, Tamara L. Berg
For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story.
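A minimal sketch of that three-stage structure (encode the album photos, attend over them to select summary photos, decode a story) is given below; the module sizes, greedy photo selection, and single-vector decoder input are simplifying assumptions, not the paper's exact architecture.

```python
# Toy encode-select-compose model for album storytelling (illustrative sizes only).
import torch
import torch.nn as nn

class AlbumStoryteller(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.photo_enc = nn.GRU(feat_dim, hidden, batch_first=True)   # encode album photos
        self.select_attn = nn.Linear(hidden, 1)                       # score photos for selection
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)       # compose the story
        self.word_out = nn.Linear(hidden, vocab_size)

    def forward(self, photo_feats, n_summary=5, story_len=40):
        # photo_feats: (B, n_photos, feat_dim) CNN features of the album photos.
        enc, _ = self.photo_enc(photo_feats)                          # (B, n_photos, hidden)
        scores = self.select_attn(enc).squeeze(-1)                    # (B, n_photos)
        topk = scores.topk(min(n_summary, enc.size(1)), dim=1).indices
        summary = torch.gather(enc, 1, topk.unsqueeze(-1).expand(-1, -1, enc.size(-1)))
        # Decode from a pooled summary vector repeated over the story length.
        dec, _ = self.decoder(summary.mean(1, keepdim=True).repeat(1, story_len, 1))
        return self.word_out(dec)                                     # (B, story_len, vocab)
```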
2 code implementations • CVPR 2017 • Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg
The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.
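A simplified sketch of how the three roles could interact in one training step is shown below; the `speaker`, `listener`, and `reinforcer` interfaces, the margin, and the unit loss weights are hypothetical assumptions, not the paper's exact formulation.

```python
# Simplified joint step combining speaker, listener, and reinforcer roles
# (hypothetical module interfaces, for illustration only).
import torch

def speaker_listener_reinforcer_step(speaker, listener, reinforcer,
                                     image, target_region, distractor_regions):
    # Speaker: supervised generation loss plus a sampled expression and its log-prob.
    gen_loss, sampled_expr, logprob = speaker(image, target_region)

    # Listener: the target region should outscore the distractors for the expression.
    scores = listener(sampled_expr, [target_region] + distractor_regions)  # (1 + K,)
    rank_loss = torch.relu(0.1 + scores[1:] - scores[0]).mean()            # margin ranking

    # Reinforcer: reward discriminative sampled expressions via REINFORCE.
    with torch.no_grad():
        reward = reinforcer(image, target_region, sampled_expr)
    rl_loss = -(reward * logprob)

    return gen_loss + rank_loss + rl_loss
```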
no code implementations • 3 Aug 2016 • Shan Yang, Tanya Ambert, Zherong Pan, Ke Wang, Licheng Yu, Tamara Berg, Ming C. Lin
Most recent garment capturing techniques rely on acquiring multiple views of clothing, which may not always be readily available, especially in the case of pre-existing photographs from the web.
3 code implementations • 31 Jul 2016 • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
Humans refer to objects in their environments all the time, especially in dialogue with other people.
no code implementations • ICCV 2015 • Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images.