no code implementations • 24 Mar 2025 • Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding.
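As a rough illustration of what token efficiency means here, the sketch below shows a generic SlowFast-style two-pathway token reduction: a slow pathway keeps a few frames at full spatial detail while a fast pathway keeps every frame but aggressively pools its spatial tokens. The strides, pooling factors, and shapes are illustrative assumptions, not SF-LLaVA-1.5's actual configuration.

```python
# Hypothetical SlowFast-style token reduction for a video LLM.
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor, slow_stride: int = 4,
                    fast_pool: int = 4) -> torch.Tensor:
    """frame_feats: (T, H, W, C) per-frame patch features from a vision encoder."""
    T, H, W, C = frame_feats.shape
    # Slow pathway: a few frames, full spatial detail.
    slow = frame_feats[::slow_stride].reshape(-1, C)
    # Fast pathway: all frames, spatially pooled to far fewer tokens.
    fast = F.avg_pool2d(frame_feats.permute(0, 3, 1, 2), kernel_size=fast_pool)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)
    return torch.cat([slow, fast], dim=0)  # token sequence fed to the LLM

tokens = slowfast_tokens(torch.randn(32, 24, 24, 1024))
print(tokens.shape)  # (5760, 1024), vs. 18432 tokens without reduction
```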
no code implementations • 11 Dec 2024 • Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev
We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on.
1 code implementation • 21 Nov 2024 • Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby
We introduce a novel method for pre-training of large-scale vision encoders.
Ranked #1 on Image Classification on iNaturalist
no code implementations • 24 Oct 2024 • Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan
Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation.
2 code implementations • 21 Oct 2024 • Ruohong Zhang, BoWen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
no code implementations • 9 Oct 2024 • Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, BoWen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour in length, based on human-annotated data.
no code implementations • 3 Oct 2024 • Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, BoWen Zhang, Yinfei Yang, Zhe Gan
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications.
1 code implementation • 3 Oct 2024 • Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, BoWen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats.
no code implementations • 30 Sep 2024 • Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, BoWen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, ZiRui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.
Ranked #55 on Visual Question Answering on MM-Vet
1 code implementation • 22 Jul 2024 • Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding.
no code implementations • 2 Jul 2024 • Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, ZiRui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch
A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information.
no code implementations • 1 Jul 2024 • Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.
1 code implementation • 11 Apr 2024 • Haotian Zhang, Haoxuan You, Philipp Dufter, BoWen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform well on broader tasks.
Ranked #154 on Visual Question Answering on MM-Vet
no code implementations • 8 Apr 2024 • Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks.
no code implementations • 14 Mar 2024 • Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, BoWen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, ZiRui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons.
Ranked #73 on Visual Question Answering on MM-Vet
1 code implementation • 20 Feb 2024 • Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan
The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges; in particular, they tend to produce hallucinated responses when handling deceptive information in prompts.
no code implementations • 21 Dec 2023 • Bingbing Wen, Zhengyuan Yang, JianFeng Wang, Zhe Gan, Bill Howe, Lijuan Wang
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content.
1 code implementation • 29 Nov 2023 • Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi
Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks.
no code implementations • 27 Nov 2023 • Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling.
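For readers less familiar with image tokenizers, here is a minimal sketch of the VQ-VAE quantization step that enables this: continuous encoder features are snapped to their nearest codebook entries, and the resulting indices act as discrete image tokens for a language-model-style decoder. Codebook size and dimensions are illustrative.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (N, D) encoder features; codebook: (K, D) learned embeddings."""
    dists = torch.cdist(z, codebook)   # (N, K) pairwise distances
    idx = dists.argmin(dim=1)          # nearest codebook entry per feature
    return idx, codebook[idx]          # discrete token ids + quantized features

codebook = torch.randn(1024, 256)      # K = 1024 visual "words"
z = torch.randn(16 * 16, 256)          # a 16x16 grid of patch features
tokens, z_q = quantize(z, codebook)
print(tokens.shape)                    # (256,) image token sequence
```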
1 code implementation • 11 Oct 2023 • Zhengfeng Lai, Haotian Zhang, BoWen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao
For example, VeCLIP achieves up to +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting.
2 code implementations • 11 Oct 2023 • Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, BoWen Zhang, ZiRui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions.
1 code implementation • 2 Oct 2023 • Ajay Jaiswal, Zhe Gan, Xianzhi Du, BoWen Zhang, Zhangyang Wang, Yinfei Yang
Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50-60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline.
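As a deliberately simplified illustration of this setting, the sketch below performs one-shot, training-free magnitude pruning of a weight matrix to a target sparsity; the methods studied in the paper are more sophisticated, but the interface is the same.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.5)
print((w_sparse == 0).float().mean())  # ~0.5
```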
2 code implementations • 29 Sep 2023 • Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan
Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
1 code implementation • 18 Sep 2023 • Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants.
1 code implementation • 13 Jun 2023 • Wentao Wu, Aleksei Timofeev, Chen Chen, BoWen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathon Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang
Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image.
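A minimal sketch of this pipeline, assuming off-the-shelf spaCy NER and Hugging Face CLIP checkpoints (the paper's actual models, prompts, and selection thresholds may differ):

```python
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def label_image(image: Image.Image, alt_text: str, top_k: int = 3):
    """Extract entities from alt-text, keep those CLIP matches to the image."""
    candidates = [ent.text for ent in nlp(alt_text).ents]
    if not candidates:
        return []
    inputs = proc(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_image[0]  # image-text similarity scores
    ranked = sims.argsort(descending=True)[:top_k]
    return [candidates[int(i)] for i in ranked]
```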
1 code implementation • 28 Apr 2023 • Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, Lijuan Wang
In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalities.
2 code implementations • 13 Apr 2023 • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
1 code implementation • CVPR 2023 • Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, JianFeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.
Ranked #4 on Instance Segmentation on ADE20K val (using extra training data)
1 code implementation • 1 Dec 2022 • Jialian Wu, JianFeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.
Ranked #2 on Dense Captioning on Visual Genome
no code implementations • CVPR 2023 • Zhengyuan Yang, JianFeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.
Conditional Text-to-Image Synthesis • Layout-to-Image Generation
1 code implementation • 21 Nov 2022 • Zixin Zhu, Yixuan Wei, JianFeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one.
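For context, the token-by-token decoding loop referred to here looks roughly as follows, for a hypothetical captioning model `model` that maps image features and a token prefix to next-token logits:

```python
import torch

def greedy_decode(model, image_feats, bos_id=1, eos_id=2, max_len=30):
    ids = [bos_id]
    for _ in range(max_len):
        logits = model(image_feats, torch.tensor([ids]))  # (1, len, vocab)
        next_id = logits[0, -1].argmax().item()           # most likely next token
        ids.append(next_id)
        if next_id == eos_id:                             # stop at end-of-sequence
            break
    return ids
```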
1 code implementation • 17 Oct 2022 • Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, JianFeng Wang, Jordan Boyd-Graber, Lijuan Wang
While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality.
1 code implementation • CVPR 2023 • Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei
Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts.
1 code implementation • 17 Oct 2022 • Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao
This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years.
1 code implementation • CVPR 2023 • Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu
Masked visual modeling (MVM) has been recently proven effective for visual pre-training.
Ranked #1 on Video Question Answering on LSMDC-MC
1 code implementation • 20 Jul 2022 • Chenfei Wu, Jian Liang, Xiaowei Hu, Zhe Gan, JianFeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan
In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos.
Ranked #1 on Image Outpainting on LHQC
1 code implementation • NeurIPS 2022 • Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, JianFeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann Lecun, Nanyun Peng, Jianfeng Gao, Lijuan Wang
Vision-language (VL) pre-training has recently received considerable attention.
Ranked #1 on Phrase Grounding on Flickr30k Entities Dev
1 code implementation • CVPR 2023 • Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang
In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
1 code implementation • 27 May 2022 • JianFeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Ranked #1 on Image Captioning on nocaps-XD near-domain
2 code implementations • 20 Apr 2022 • Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao
We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.
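A minimal sketch of the enrichment idea using NLTK's WordNet interface; exactly how K-LITE formats the enriched text is a detail of the paper, so the template below is an assumption:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def enrich(entity: str) -> str:
    """Append a WordNet gloss to an entity mention, if one exists."""
    synsets = wn.synsets(entity.replace(" ", "_"))
    if not synsets:
        return entity
    return f"{entity}, {synsets[0].definition()}"  # first-sense definition

print(enrich("mountain bike"))
# e.g. "mountain bike, a bicycle with a sturdy frame and fat tires; ..."
```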
1 code implementation • CVPR 2022 • Zhiyuan Fang, JianFeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan Wang, Yezhou Yang, Zicheng Liu
In this paper, we are concerned with a better-performing detector-free image captioning model, and propose a pure vision transformer-based image captioning model, dubbed as ViTCAP, in which grid representations are used without extracting the regional features.
1 code implementation • 8 Dec 2021 • Yixin Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael Zeng, Zicheng Liu, Mohit Bansal, Lijuan Wang
Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs?
1 code implementation • CVPR 2022 • Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, Lijuan Wang
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering).
1 code implementation • 24 Nov 2021 • Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu
Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling.
Ranked #21 on Zero-Shot Video Retrieval on DiDeMo
no code implementations • CVPR 2022 • Xiaowei Hu, Zhe Gan, JianFeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang
In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.
Ranked #3 on Image Captioning on nocaps-XD entire (using extra training data)
1 code implementation • 23 Nov 2021 • Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang
On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms the state of the art in both grounding and captioning evaluations.
no code implementations • 19 Nov 2021 • JianFeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.
1 code implementation • 4 Nov 2021 • Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, Bo Li
In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
Ranked #1 on Adversarial Robustness on AdvGLUE
3 code implementations • CVPR 2022 • Zi-Yi Dou, Yichong Xu, Zhe Gan, JianFeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks.
Ranked #20 on Cross-Modal Retrieval on COCO 2014 (using extra training data)
1 code implementation • 10 Sep 2021 • Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.
Ranked #21 on Visual Question Answering (VQA) on OK-VQA (using extra training data)
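A sketch of PICa-style prompt construction, assuming captions have already been generated for each image; the exact wording and separators used in the paper may differ:

```python
def build_pica_prompt(examples, caption, question):
    """examples: list of (caption, question, answer) in-context shots."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n" for c, q, a in examples
    )
    query = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return header + shots + query  # sent to GPT-3 as a text completion prompt

prompt = build_pica_prompt(
    [("A man riding a wave on a surfboard.", "What sport is this?", "surfing")],
    "A group of zebras grazing in a field.",
    "What continent are these animals native to?",
)
```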
1 code implementation • 2 Jul 2021 • Junya Chen, Zhe Gan, Xuan Li, Qing Guo, Liqun Chen, Shuyang Gao, Tagyoung Chung, Yi Xu, Belinda Zeng, Wenlian Lu, Fan Li, Lawrence Carin, Chenyang Tao
InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years.
1 code implementation • 8 Jun 2021 • Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
1 code implementation • NeurIPS 2021 • Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang
For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28%, while enjoying 49.32% FLOPs and 4.40% running-time savings.
Ranked #20 on Efficient ViTs on ImageNet-1K (with DeiT-T)
no code implementations • ICCV 2021 • Linjie Li, Jie Lei, Zhe Gan, Jingjing Liu
We hope our Adversarial VQA dataset can shed new light on robustness study in the community and serve as a valuable benchmark for future work.
no code implementations • 23 Apr 2021 • Zhe Gan, Yen-Chun Chen, Linjie Li, Tianlong Chen, Yu Cheng, Shuohang Wang, Jingjing Liu, Lijuan Wang, Zicheng Liu
However, we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy.
no code implementations • 1 Apr 2021 • Luowei Zhou, Jingjing Liu, Yu Cheng, Zhe Gan, Lei Zhang
This work concerns video-language pre-training and representation learning.
1 code implementation • NeurIPS 2021 • Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Jingjing Liu, Zhangyang Wang
Based on these results, we articulate the Elastic Lottery Ticket Hypothesis (E-LTH): by mindfully replicating (or dropping) and re-ordering layers for one network, its corresponding winning ticket could be stretched (or squeezed) into a subnetwork for another deeper (or shallower) network from the same family, whose performance is nearly as competitive as that network's winning ticket directly found by IMP.
1 code implementation • 22 Mar 2021 • Tianlong Chen, Yu Cheng, Zhe Gan, JianFeng Wang, Lijuan Wang, Zhangyang Wang, Jingjing Liu
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
1 code implementation • ICLR 2021 • Siyang Yuan, Pengyu Cheng, Ruiyi Zhang, Weituo Hao, Zhe Gan, Lawrence Carin
Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker.
1 code implementation • NeurIPS 2021 • Tianlong Chen, Yu Cheng, Zhe Gan, Jingjing Liu, Zhangyang Wang
Training generative adversarial networks (GANs) with limited real image data generally results in deteriorated performance and collapsed models.
1 code implementation • CVPR 2021 • Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Ranked #27 on Visual Question Answering (VQA) on MSRVTT-QA (using extra training data)
no code implementations • Findings (ACL) 2021 • Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu
Transformer has become ubiquitous in the deep learning field.
no code implementations • 1 Jan 2021 • Tianlong Chen, Yu Cheng, Zhe Gan, Yu Hu, Zhangyang Wang, Jingjing Liu
Adversarial training is an effective method to combat adversarial attacks in order to create robust neural networks.
no code implementations • 1 Jan 2021 • Minhao Cheng, Zhe Gan, Yu Cheng, Shuohang Wang, Cho-Jui Hsieh, Jingjing Liu
By incorporating different feature maps after the masking, we can distill better features to help model generalization.
1 code implementation • ACL 2021 • Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, Jingjing Liu
Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks.
no code implementations • CVPR 2021 • Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, Lawrence Carin
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former.
Ranked #12 on Knowledge Distillation on CIFAR-100
no code implementations • 15 Dec 2020 • Linjie Li, Zhe Gan, Jingjing Liu
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level.
no code implementations • 6 Dec 2020 • Dong Wang, Yuewei Yang, Chenyang Tao, Zhe Gan, Liqun Chen, Fanjie Kong, Ricardo Henao, Lawrence Carin
Deep neural networks excel at comprehending complex visual signals, delivering performance on par with or even superior to that of human experts.
1 code implementation • EMNLP 2020 • Shuohang Wang, Yuwei Fang, Siqi Sun, Zhe Gan, Yu Cheng, Jing Jiang, Jingjing Liu
In this paper, we propose Cross-Thought, a novel approach to pre-training a sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering.
no code implementations • EMNLP 2020 • Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, Jingjing Liu
Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE.
2 code implementations • ICLR 2021 • Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, Jingjing Liu
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Ranked #3 on Natural Language Inference on ANLI test (using extra training data)
no code implementations • 3 Oct 2020 • Yi Wei, Zhe Gan, Wenbo Li, Siwei Lyu, Ming-Ching Chang, Lei Zhang, Jianfeng Gao, Pengchuan Zhang
We present Mask-guided Generative Adversarial Network (MagGAN) for high-resolution face attribute editing, in which semantic facial masks from a pre-trained face parser are used to guide the fine-grained image editing process.
1 code implementation • 3 Oct 2020 • Jinghui Chen, Yu Cheng, Zhe Gan, Quanquan Gu, Jingjing Liu
In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem.
1 code implementation • EMNLP 2020 • Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, Jingjing Liu
Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one.
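For reference, the simple L2 intermediate-representation loss this sentence refers to can be sketched as below (the paper itself argues for a contrastive alternative); the projection layer, a common ingredient when student and teacher widths differ, maps student states into the teacher's hidden size:

```python
import torch
import torch.nn.functional as F

def l2_distill(student_hidden, teacher_hidden, proj):
    """student_hidden: (B, T, d_s); teacher_hidden: (B, T, d_t)."""
    return F.mse_loss(proj(student_hidden), teacher_hidden)

proj = torch.nn.Linear(384, 768)          # student dim -> teacher dim
loss = l2_distill(torch.randn(2, 8, 384), torch.randn(2, 8, 768), proj)
```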
no code implementations • 13 Sep 2020 • Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng, Jingjing Liu
Transformer has become ubiquitous in the deep learning field.
Ranked #1 on Open-Domain Question Answering on SearchQA
no code implementations • 10 Sep 2020 • Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu, Chenguang Zhu
Although deep neural networks have achieved tremendous success for question answering (QA), they are still suffering from heavy computational and energy cost for real product deployment.
1 code implementation • 10 Sep 2020 • Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Ranked #18 on Zero-Shot Cross-Lingual Transfer on XTREME
1 code implementation • ICML 2020 • Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu
In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities into a dynamically-constructed graph.
2 code implementations • ICML 2020 • Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, Lawrence Carin
In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information.
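A minimal PyTorch sketch of the sampled CLUB estimator, assuming a diagonal-Gaussian variational network q(y|x): the bound is the gap between the conditional log-likelihood on paired samples and on shuffled (unpaired) samples. Network sizes are illustrative.

```python
import torch
import torch.nn as nn

class CLUB(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim))

    def forward(self, x, y):
        mu, logvar = self.mu(x), self.logvar(x)
        # log q(y|x) up to shared constants, on paired and shuffled batches;
        # the log-variance terms cancel in the difference.
        pos = -((y - mu) ** 2 / logvar.exp()).sum(dim=1)
        neg = -((y[torch.randperm(x.size(0))] - mu) ** 2 / logvar.exp()).sum(dim=1)
        return (pos - neg).mean() / 2.0  # sampled MI upper-bound estimate
```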
1 code implementation • 21 Jun 2020 • Chen Zhu, Yu Cheng, Zhe Gan, Furong Huang, Jingjing Liu, Tom Goldstein
Adaptive gradient methods such as RMSProp and Adam use an exponential moving estimate of the squared gradient to compute adaptive step sizes, achieving better convergence than SGD in the face of noisy objectives.
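The moving estimate in question takes only a couple of lines; below is a plain RMSProp-style update (Adam additionally keeps a first-moment average and applies bias correction):

```python
def rmsprop_step(w, grad, state, lr=1e-3, beta=0.999, eps=1e-8):
    state["v"] = beta * state["v"] + (1 - beta) * grad**2  # EMA of squared grad
    return w - lr * grad / (state["v"] ** 0.5 + eps)       # adaptive step size

w, state = 1.0, {"v": 0.0}
for g in (0.5, -0.3, 0.1):  # a toy gradient sequence
    w = rmsprop_step(w, g, state)
```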
2 code implementations • NeurIPS 2020 • Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
Ranked #7 on Visual Entailment on SNLI-VE val (using extra training data)
no code implementations • ECCV 2020 • Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu
To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings).
no code implementations • ACL 2020 • Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Dinghan Shen, Guoyin Wang, Zheng Wen, Lawrence Carin
Auto-regressive text generation models usually focus on local fluency, and may cause inconsistent semantic meaning in long text generation.
1 code implementation • EMNLP 2020 • Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, Bill Dolan
Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation.
3 code implementations • EMNLP 2020 • Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Ranked #1 on Video Retrieval on TVR
no code implementations • Findings of the Association for Computational Linguistics 2020 • Yu Cheng, Zhe Gan, Yizhe Zhang, Oussama Elachqar, Dianqi Li, Jingjing Liu
To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context.
no code implementations • NAACL 2021 • Shuyang Dai, Zhe Gan, Yu Cheng, Chenyang Tao, Lawrence Carin, Jingjing Liu
In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations.
1 code implementation • CVPR 2020 • Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu
We propose a new task towards more practical application for image generation - high-quality image synthesis from salient object layout.
1 code implementation • CVPR 2020 • Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
no code implementations • 20 Jan 2020 • Ruiyi Zhang, Changyou Chen, Zhe Gan, Zheng Wen, Wenlin Wang, Lawrence Carin
Reinforcement learning (RL) has been widely studied for improving sequence-generation models.
no code implementations • 20 Nov 2019 • Wenlin Wang, Hongteng Xu, Zhe Gan, Bai Li, Guoyin Wang, Liqun Chen, Qian Yang, Wenqi Wang, Lawrence Carin
We propose a novel graph-driven generative model, that unifies multiple heterogeneous learning tasks into the same framework.
2 code implementations • ACL 2020 • Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, Jingjing Liu
Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization.
1 code implementation • EMNLP 2020 • Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, Jingjing Liu
In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering.
Ranked #32 on Question Answering on HotpotQA
1 code implementation • ACL 2020 • Jiacheng Xu, Zhe Gan, Yu Cheng, Jingjing Liu
Recently BERT has been adopted for document encoding in state-of-the-art text summarization models.
1 code implementation • 8 Oct 2019 • Wenhu Chen, Zhe Gan, Linjie Li, Yu Cheng, William Wang, Jingjing Liu
To design a more powerful NMN architecture for practical use, we propose Meta Module Network (MMN) centered on a novel meta module, which can take in function recipes and morph into diverse instance modules dynamically.
1 code implementation • NeurIPS 2019 • Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian Yang, Ricardo Henao, Lawrence Carin
This paper considers a novel variational formulation of network embeddings, with special focus on textual networks.
2 code implementations • ICLR 2020 • Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, Jingjing Liu
Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models.
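A single PGD-style inner step of embedding-space adversarial training, sketched for a Hugging Face-style model that accepts `inputs_embeds` and `labels`; FreeLB itself accumulates gradients over several such steps, so this is a simplification:

```python
import torch

def adversarial_loss(model, embeds, labels, eps=1e-2, step=1e-3):
    delta = torch.zeros_like(embeds, requires_grad=True)
    loss = model(inputs_embeds=embeds + delta, labels=labels).loss
    grad, = torch.autograd.grad(loss, delta)
    # Ascend the loss surface, then project back into the eps-ball.
    delta = (delta + step * grad / (grad.norm() + 1e-12)).clamp(-eps, eps).detach()
    return model(inputs_embeds=embeds + delta, labels=labels).loss
```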
7 code implementations • ECCV 2020 • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
Ranked #3 on Visual Question Answering (VQA) on VCR (Q-A) test
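A sketch of the conditional-masking idea: each training instance corrupts only one modality, so the masked positions can always be reconstructed from a full view of the other. Masking rate and token ids are illustrative assumptions.

```python
import torch

def conditional_mask(text_ids, region_feats, mask_id=103, p=0.15):
    if torch.rand(1).item() < 0.5:  # mask language, keep all regions
        m = torch.rand_like(text_ids, dtype=torch.float) < p
        text_ids = text_ids.masked_fill(m, mask_id)
    else:                           # mask regions, keep all text
        m = torch.rand(region_feats.size(0)) < p
        region_feats = region_feats.clone()
        region_feats[m] = 0.0       # zero out masked region features
    return text_ids, region_feats
```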
no code implementations • 25 Sep 2019 • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are jointly processed for visual and textual understanding.
no code implementations • 11 Sep 2019 • Shuyang Dai, Yu Cheng, Yizhe Zhang, Zhe Gan, Jingjing Liu, Lawrence Carin
Recent unsupervised approaches to domain adaptation primarily focus on minimizing the gap between the source and the target domains through refining the feature generator, in order to learn a better alignment between the two domains.
1 code implementation • 11 Sep 2019 • Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, Graham Neubig
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr.
Ranked #10 on Visual Storytelling on VIST
1 code implementation • IJCNLP 2019 • Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao
This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems.
1 code implementation • IJCNLP 2019 • Dianqi Li, Yizhe Zhang, Zhe Gan, Yu Cheng, Chris Brockett, Ming-Ting Sun, Bill Dolan
These data may demonstrate domain shift, which impedes the benefits of utilizing such data for training.
5 code implementations • IJCNLP 2019 • Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks.
no code implementations • IJCNLP 2019 • Huazheng Wang, Zhe Gan, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Hongning Wang
In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain.
no code implementations • NAACL 2019 • Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, Lawrence Carin
We propose a topic-guided variational auto-encoder (TGVAE) model for text generation.
1 code implementation • ICCV 2019 • Linjie Li, Zhe Gan, Yu Cheng, Jingjing Liu
In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects.
no code implementations • 17 Mar 2019 • Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, Lawrence Carin
We propose a topic-guided variational autoencoder (TGVAE) model for text generation.
1 code implementation • CVPR 2019 • Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, Siddhartha Srinivasa
We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the Room-to-Room (R2R) Vision-and-Language navigation challenge of Anderson et al.
Ranked #4 on Vision-Language Navigation on Room2Room
no code implementations • ACL 2019 • Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, Jianfeng Gao
This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image.
no code implementations • ICLR 2019 • Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, Lawrence Carin
Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE).
no code implementations • 20 Dec 2018 • Yu Cheng, Zhe Gan, Yitong Li, Jingjing Liu, Jianfeng Gao
The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session.
1 code implementation • CVPR 2019 • Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, Jianfeng Gao
We therefore propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework.
no code implementations • 2 Nov 2018 • Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, Lawrence Carin
Sequence generation with reinforcement learning (RL) has received significant attention recently.
1 code implementation • NeurIPS 2018 • Liqun Chen, Shuyang Dai, Chenyang Tao, Dinghan Shen, Zhe Gan, Haichao Zhang, Yizhe Zhang, Lawrence Carin
However, the discrete nature of text hinders the application of GAN to text-generation tasks.
4 code implementations • NeurIPS 2018 • Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan
Responses generated by neural conversational models tend to lack informativeness and diversity.
2 code implementations • ICML 2018 • Yunchen Pu, Shuyang Dai, Zhe Gan, Wei-Yao Wang, Guoyin Wang, Yizhe Zhang, Ricardo Henao, Lawrence Carin
Distinct from most existing approaches, that only learn conditional distributions, the proposed model aims to learn a joint distribution of multiple random variables (domains).
no code implementations • 21 May 2018 • Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jian-Feng Wang, Xiaodong He
We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task.
Ranked #24 on Visual Storytelling on VIST
no code implementations • 15 Jan 2018 • Xinyuan Zhang, Ricardo Henao, Zhe Gan, Yitong Li, Lawrence Carin
Since diagnoses are typically correlated, a deep residual network is employed on top of the CNN encoder, to capture label (diagnosis) dependencies and incorporate information directly from the encoded sentence vector.
no code implementations • 28 Dec 2017 • Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, Lawrence Carin
The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence.
20 code implementations • CVPR 2018 • Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
Ranked #1 on Text-to-Image Generation on MS-COCO
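At the heart of AttnGAN is a word-level attention in which each image sub-region attends over the word embeddings of the caption, yielding a word-context vector that conditions the next refinement stage. A minimal sketch (dimensions illustrative):

```python
import torch
import torch.nn.functional as F

def word_attention(regions, words):
    """regions: (N, D) sub-region features; words: (T, D) word embeddings."""
    attn = F.softmax(regions @ words.t(), dim=1)  # each region attends over words
    return attn @ words                           # (N, D) word-context vectors

ctx = word_attention(torch.randn(64, 256), torch.randn(12, 256))
print(ctx.shape)  # torch.Size([64, 256])
```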
no code implementations • NeurIPS 2017 • Yunchen Pu, Wei-Yao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li, Lawrence Carin
A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: (i) from observed data fed through the encoder to yield codes, and (ii) from latent codes drawn from a simple prior and propagated through the decoder to manifest data.
1 code implementation • NeurIPS 2017 • Zhe Gan, Liqun Chen, Wei-Yao Wang, Yunchen Pu, Yizhe Zhang, Hao liu, Chunyuan Li, Lawrence Carin
The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function, which is trained to distinguish real data pairs and two kinds of fake data pairs.
4 code implementations • NeurIPS 2017 • Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, Lawrence Carin
Learning latent representations from long text sequences is an important first step in many natural language processing applications.
no code implementations • CVPR 2017 • Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng
We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles.
1 code implementation • ICML 2017 • Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin
We propose a framework for generating realistic text via adversarial training.
no code implementations • ICML 2017 • Yizhe Zhang, Changyou Chen, Zhe Gan, Ricardo Henao, Lawrence Carin
A framework is proposed to improve the sampling efficiency of stochastic gradient MCMC, based on Hamiltonian Monte Carlo.
no code implementations • NeurIPS 2017 • Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, Lawrence Carin
A new method for learning variational autoencoders (VAEs) is developed, based on Stein variational gradient descent.
2 code implementations • 8 Feb 2017 • Zhe Gan, P. D. Singh, Ameet Joshi, Xiaodong He, Jianshu Chen, Jianfeng Gao, Li Deng
Connecting different text attributes associated with the same entity (conflation) is important in business data analytics since it could help merge two different tables in a database to provide a more comprehensive profile of an entity.
1 code implementation • 13 Dec 2016 • Yin Xian, Yunchen Pu, Zhe Gan, Liang Lu, Andrew Thompson
Its output feature is related to Cohen's class of time-frequency distributions.
Sound
1 code implementation • CVPR 2017 • Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng
The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag.
no code implementations • ACL 2017 • Zhe Gan, Chunyuan Li, Changyou Chen, Yunchen Pu, Qinliang Su, Lawrence Carin
Recurrent neural networks (RNNs) have shown promising performance for language modeling.
no code implementations • EMNLP 2017 • Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, Lawrence Carin
We propose a new encoder-decoder approach to learn distributed sentence representations that are applicable to multiple purposes.
no code implementations • 23 Nov 2016 • Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin
Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features.
no code implementations • 15 Nov 2016 • Qinliang Su, Xuejun Liao, Chunyuan Li, Zhe Gan, Lawrence Carin
Gaussian graphical models (GGMs) are widely used for statistical modeling, because of ease of inference and the ubiquitous use of the normal distribution in practical approximations.
no code implementations • NeurIPS 2016 • Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, Lawrence Carin
A novel variational autoencoder is developed to model images, as well as associated labels or captions.
no code implementations • CVPR 2016 • Chunyuan Li, Andrew Stevens, Changyou Chen, Yunchen Pu, Zhe Gan, Lawrence Carin
Learning the representation of shape cues in 2D & 3D objects for recognition is a fundamental task in computer vision.
no code implementations • 22 May 2016 • Jiaming Song, Zhe Gan, Lawrence Carin
Deep conditional generative models are developed to simultaneously learn the temporal dependencies of multiple sequences.
1 code implementation • 25 Dec 2015 • Changyou Chen, David Carlson, Zhe Gan, Chunyuan Li, Lawrence Carin
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied.
no code implementations • NeurIPS 2015 • Ricardo Henao, Zhe Gan, James Lu, Lawrence Carin
We propose a new deep architecture for topic modeling, based on Poisson Factor Analysis (PFA) modules.
1 code implementation • NeurIPS 2015 • Zhe Gan, Chunyuan Li, Ricardo Henao, David Carlson, Lawrence Carin
Deep dynamic generative models are developed to learn sequential dependencies in time-series data.