Vision-language (VL) pre-training has recently received considerable attention.
In this work, we explore LAVENDER, a unified video-and-language (VidL) framework in which Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
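A minimal sketch of that MLM-as-interface idea, assuming a HuggingFace-style tokenizer and a hypothetical multimodal `model` with an MLM head (`video_feats` and the call signature are illustrative, not LAVENDER's actual API): a VideoQA answer is read out by filling in [MASK] slots with the same head used during pre-training, with no task-specific classifier.

```python
def answer_via_mlm(model, tokenizer, video_feats, question, num_answer_slots=1):
    # Cast the downstream task as MLM: the answer is a sequence of [MASK] tokens.
    text = question + " " + " ".join([tokenizer.mask_token] * num_answer_slots)
    inputs = tokenizer(text, return_tensors="pt")           # HuggingFace-style tokenizer
    logits = model(video_feats=video_feats, **inputs).logits # hypothetical multimodal model
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
    return logits[0, mask_positions].argmax(dim=-1)          # vocabulary ids of the answer
```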
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Ranked #1 on Image Captioning on nocaps-XD out-of-domain
no code implementations • 20 Apr 2022 • Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Jianfeng Gao
In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge to build transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that can understand both visual concepts and their knowledge; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models.
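A rough sketch of the knowledge-augmentation step using NLTK's WordNet interface; the prompt template is illustrative rather than K-LITE's exact format.

```python
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def knowledge_augment(class_name: str) -> str:
    """Append an external-knowledge definition to a class name before text encoding."""
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}"
    return f"a photo of a {class_name}, {synsets[0].definition()}"
```

The same augmentation is applied at evaluation time, so rare or novel concepts can be referenced by their definitions even when the raw class name was never seen during training.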
In this paper, we are concerned with a better-performing detector-free image captioning model and propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs?
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e.g., video question answering).
In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.
Ranked #1 on Image Captioning on nocaps-val-overall
Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling.
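A hedged sketch of what such an MVM objective could look like; `visual_tokenizer` (e.g., a discrete VAE mapping patches to codebook indices) and `model` are assumed components, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def mvm_loss(model, visual_tokenizer, video_patches, mask_ratio=0.15):
    with torch.no_grad():
        targets = visual_tokenizer(video_patches)         # (B, N) discrete visual-token ids
    mask = torch.rand(targets.shape, device=targets.device) < mask_ratio
    logits = model(video_patches, patch_mask=mask)        # (B, N, vocab_size) predictions
    return F.cross_entropy(logits[mask], targets[mask])   # recover ids of masked patches
```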
In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture.
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.
In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
Ranked #1 on Adversarial Robustness on AdvGLUE
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks.
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.
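A simplified sketch of the prompting idea: the image is first converted into a caption, and GPT-3 answers the question from a few-shot textual prompt. The template below is illustrative, not PICa's exact format.

```python
def build_prompt(caption, question, in_context_examples):
    """in_context_examples: list of (caption, question, answer) triples."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n"
        for c, q, a in in_context_examples
    )
    return header + shots + f"Context: {caption}\nQuestion: {question}\nAnswer:"
```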
InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years.
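For reference, a compact one-directional variant of the InfoNCE (NT-Xent) objective used by SimCLR-style learners; the full SimCLR loss is symmetrized and also includes same-view negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) projections of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)             # positives lie on the diagonal
```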
For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28% while saving 49.32% of FLOPs and 4.40% of running time.
1 code implementation • 8 Jun 2021 • Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
We hope our Adversarial VQA dataset can shed new light on robustness study in the community and serve as a valuable benchmark for future work.
However, we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy.
This work concerns video-language pre-training and representation learning.
Based on these results, we articulate the Elastic Lottery Ticket Hypothesis (E-LTH): by mindfully replicating (or dropping) and re-ordering layers for one network, its corresponding winning ticket can be stretched (or squeezed) into a subnetwork for another deeper (or shallower) network from the same family, whose performance is nearly as competitive as the latter's winning ticket directly found by IMP.
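An illustrative sketch of the stretch/squeeze operation on per-layer ticket masks; the nearest-layer mapping below is an assumption for illustration, not the paper's exact replication rule.

```python
def stretch_ticket(layer_masks, target_depth):
    """Map a winning-ticket mask found on a len(layer_masks)-layer network onto a
    deeper (replicate layers) or shallower (drop layers) network of the same family."""
    src_depth = len(layer_masks)
    return [
        layer_masks[round(i * (src_depth - 1) / max(target_depth - 1, 1))]
        for i in range(target_depth)
    ]
```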
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker.
Training generative adversarial networks (GANs) with limited real image data generally results in deteriorated performance and collapsed models.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Ranked #4 on Visual Question Answering on MSRVTT-QA (using extra training data)
Adversarial training is an effective method to combat adversarial attacks in order to create robust neural networks.
By incorporating different feature maps after the masking, we can distill better features to help model generalization.
Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks.
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former.
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level.
Deep neural networks excel at comprehending complex visual signals, delivering performance on par with, or even superior to, that of human experts.
In this paper, we propose Cross-Thought, a novel approach to pre-training sequence encoders, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering.
Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE.
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Ranked #1 on Natural Language Inference on ANLI test (using extra training data)
In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem.
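A minimal sketch of the single-step adversarial training step with a random start that the paper analyzes; the hyperparameters are commonly used CIFAR-10 values, not necessarily the paper's.

```python
import torch

def fgsm_random_start(model, loss_fn, x, y, eps=8/255, alpha=10/255):
    delta = torch.empty_like(x).uniform_(-eps, eps)    # random initialization
    delta.requires_grad_(True)
    loss = loss_fn(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).clamp(0, 1)                     # adversarial batch used for training
```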
We present Mask-guided Generative Adversarial Network (MagGAN) for high-resolution face attribute editing, in which semantic facial masks from a pre-trained face parser are used to guide the fine-grained image editing process.
Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one.
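For context, a minimal sketch of that L2 baseline: mean-squared error between the student's hidden states and evenly spaced teacher layers, assuming matching hidden sizes (otherwise a learned projection is typically inserted).

```python
import torch.nn.functional as F

def intermediate_l2_loss(student_hiddens, teacher_hiddens):
    """student_hiddens / teacher_hiddens: lists of (B, T, H) layer outputs."""
    step = len(teacher_hiddens) // len(student_hiddens)
    losses = [
        F.mse_loss(s, teacher_hiddens[(i + 1) * step - 1])
        for i, s in enumerate(student_hiddens)
    ]
    return sum(losses) / len(losses)
```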
Transformer has become ubiquitous in the deep learning field.
Ranked #1 on Question Answering on Quasart-T
Although deep neural networks have achieved tremendous success in question answering (QA), they still suffer from heavy computational and energy costs in real product deployment.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Ranked #15 on Zero-Shot Cross-Lingual Transfer on XTREME
In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities as nodes in a dynamically constructed graph.
Adaptive gradient methods such as RMSProp and Adam use an exponential moving estimate of the squared gradient to compute adaptive step sizes, achieving better convergence than SGD in the face of noisy objectives.
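Concretely, the exponential moving estimate of the squared gradient and the resulting adaptive step, written here as Adam's update with bias correction omitted for brevity:

```latex
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
\theta_t = \theta_{t-1} - \frac{\eta\, m_t}{\sqrt{v_t} + \epsilon}
```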
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
Ranked #3 on Referring Expression Comprehension on RefCOCOg-test
To reveal the secrets behind the scenes of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings).
Auto-regressive text generation models usually focus on local fluency, and may cause inconsistent semantic meaning in long text generation.
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Ranked #1 on Video Question Answering on Howto100M-QA
Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation.
In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations.
To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context.
We propose a new task toward more practical applications of image generation: high-quality image synthesis from salient object layout.
Reinforcement learning (RL) has been widely studied for improving sequence-generation models.
We propose a novel graph-driven generative model that unifies multiple heterogeneous learning tasks into the same framework.
Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization.
In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering.
Ranked #33 on Question Answering on HotpotQA
To design a more powerful NMN architecture for practical use, we propose Meta Module Network (MMN) centered on a novel meta module, which can take in function recipes and morph into diverse instance modules dynamically.
This paper considers a novel variational formulation of network embeddings, with special focus on textual networks.
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
Ranked #2 on Visual Question Answering on VCR (Q-A) test
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding.
Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models.
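The objective being referred to, written out, with $\delta$ a label-preserving perturbation constrained to an $\ell_p$ ball of radius $\epsilon$:

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[ \max_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(f_\theta(x + \delta),\, y\big) \right]
```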
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr.
Recent unsupervised approaches to domain adaptation primarily focus on minimizing the gap between the source and the target domains through refining the feature generator, in order to learn a better alignment between the two domains.
This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems.
These data may demonstrate domain shift, which impedes the benefits of utilizing such data for training.
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks.
In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain.
We propose a topic-guided variational auto-encoder (TGVAE) model for text generation.
In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects.
We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the Room-to-Room (R2R) Vision-and-Language Navigation challenge of Anderson et al.
Ranked #3 on Vision-Language Navigation on Room2Room
This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image.
Sequence-to-sequence models are commonly trained via maximum likelihood estimation (MLE).
The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session.
We therefore propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework.
Sequence generation with reinforcement learning (RL) has received significant attention recently.
However, the discrete nature of text hinders the application of GAN to text-generation tasks.
Responses generated by neural conversational models tend to lack informativeness and diversity.
Distinct from most existing approaches, which only learn conditional distributions, the proposed model aims to learn a joint distribution of multiple random variables (domains).
We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task.
Since diagnoses are typically correlated, a deep residual network is employed on top of the CNN encoder, to capture label (diagnosis) dependencies and incorporate information directly from the encoded sentence vector.
The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence.
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
Ranked #5 on Text-to-Image Generation on Multi-Modal-CelebA-HQ
A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: (i) from observed data fed through the encoder to yield codes, and (ii) from latent codes drawn from a simple prior and propagated through the decoder to manifest data.
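In generic notation (ours, not necessarily the paper's), with q(x) the empirical data distribution and p(z) the prior, the two symmetric factorizations of the joint over data x and codes z are:

```latex
p_e(x, z) = q(x)\, q_\phi(z \mid x)
\qquad \text{and} \qquad
p_d(x, z) = p(z)\, p_\theta(x \mid z)
```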
The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function, which is trained to distinguish real data pairs and two kinds of fake data pairs.
Learning latent representations from long text sequences is an important first step in many natural language processing applications.
We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles.
Connecting different text attributes associated with the same entity (conflation) is important in business data analytics since it could help merge two different tables in a database to provide a more comprehensive profile of an entity.
The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag.
Recurrent neural networks (RNNs) have shown promising performance for language modeling.
We propose a new encoder-decoder approach to learn distributed sentence representations that are applicable to multiple purposes.
Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features.
Gaussian graphical models (GGMs) are widely used for statistical modeling, because of ease of inference and the ubiquitous use of the normal distribution in practical approximations.
A novel variational autoencoder is developed to model images, as well as associated labels or captions.
Learning the representation of shape cues in 2D & 3D objects for recognition is a fundamental task in computer vision.
Deep conditional generative models are developed to simultaneously learn the temporal dependencies of multiple sequences.
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied.