Despite the recent surge of interest in applying Vision Transformers (ViT) to vision tasks, the capability of ViT to adapt cross-domain knowledge remains unexplored in the literature.
1 code implementation • 8 Jun 2021 • Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
We hope our Adversarial VQA dataset can shed new light on robustness research in the community and serve as a valuable benchmark for future work.
In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models.
Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language.
This work concerns video-language pre-training and representation learning.
Based on these results, we articulate the Elastic Lottery Ticket Hypothesis (E-LTH): by mindfully replicating (or dropping) and re-ordering layers for one network, its corresponding winning ticket can be stretched (or squeezed) into a subnetwork for another deeper (or shallower) network from the same family, whose performance is nearly as competitive as that of the latter's winning ticket found directly by IMP.
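To make the stretch/squeeze operation concrete, here is a minimal Python sketch, assuming the ticket is stored as a list of per-block binary masks; the simple tiling and even-stride dropping shown are illustrative stand-ins, not the paper's exact replication and re-ordering rules:

    # Hypothetical sketch: stretch or squeeze a winning ticket's per-block masks.
    def stretch_ticket(masks, target_depth):
        # Replicate per-block masks so a ticket found for len(masks) blocks
        # can initialize a deeper network with target_depth blocks.
        return [masks[i % len(masks)] for i in range(target_depth)]

    def squeeze_ticket(masks, target_depth):
        # Drop blocks at an even stride to fit a shallower network.
        stride = len(masks) / target_depth
        return [masks[int(i * stride)] for i in range(target_depth)]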
Recent advances in computer vision take advantage of adversarial data augmentation to improve the generalization ability of classification models.
Multimodal pre-training has propelled great advancement in vision-and-language research.
Treating this as an inductive prior, we suggest a brand-new angle towards data-efficient GAN training: first identify the lottery ticket from the original GAN using the small training set of real images, and then focus on training that sparse subnetwork by re-using the same set.
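For reference, the lottery-ticket identification step typically follows iterative magnitude pruning (IMP); the sketch below is a generic IMP loop, with `train` and the pruning rate as assumed placeholders rather than this paper's exact procedure:

    import copy
    import torch

    def find_ticket(model, train, data, rounds=5, prune_rate=0.2):
        # Generic IMP sketch: train, prune the smallest surviving weights,
        # rewind to the initial weights, and repeat.
        init_state = copy.deepcopy(model.state_dict())
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train(model, data, masks)  # assumed trainer that applies the masks
            for n, p in model.named_parameters():
                alive = masks[n].bool()
                k = int(prune_rate * alive.sum())
                if k > 0:
                    thresh = p.detach().abs()[alive].kthvalue(k).values
                    masks[n] = (p.detach().abs() > thresh).float() * masks[n]
            model.load_state_dict(init_state)  # rewind to initialization
        return masks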
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Ranked #2 on Visual Question Answering on MSRVTT-QA (using extra training data)
The Transformer has become ubiquitous in the deep learning field.
By incorporating different feature maps after masking, we can distill better features that help model generalization.
Adversarial training is an effective method for defending against adversarial attacks and building robust neural networks.
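As a concrete reference, a standard PGD-based adversarial training step looks like the following sketch (the common recipe, not necessarily this paper's exact variant); `model`, `x`, and `y` are placeholders:

    import torch
    import torch.nn.functional as F

    def adv_train_step(model, x, y, eps=8/255, alpha=2/255, steps=7):
        # Inner maximization: find a label-preserving perturbation delta.
        delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
        for _ in range(steps):
            loss = F.cross_entropy(model(x + delta), y)
            grad, = torch.autograd.grad(loss, delta)
            delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
            delta = delta.detach().requires_grad_(True)
        # Outer minimization: return the robust loss to backpropagate.
        return F.cross_entropy(model(x + delta), y)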
Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks.
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former.
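The standard formulation combines a temperature-scaled soft-target term with the usual hard-label loss; the sketch below shows the classic Hinton-style objective, a common baseline assumed here rather than taken from this paper:

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Soft targets: KL between temperature-scaled distributions,
        # rescaled by T*T to keep gradient magnitudes comparable.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)  # hard labels
        return alpha * soft + (1 - alpha) * hard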
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level.
In this paper, we propose Cross-Thought, a novel approach to pre-training a sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering.
Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE.
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Ranked #1 on Natural Language Inference on ANLI test (using extra training data)
Existing language model compression methods mostly use a simple L2 loss to distill knowledge from the intermediate representations of a large BERT model into a smaller one.
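That L2 objective amounts to a mean-squared error between teacher and student hidden states, with a learned projection when their widths differ; the layer pairing and dimensions in this sketch are assumptions:

    import torch.nn as nn

    class HiddenDistill(nn.Module):
        # MSE between a student layer's hidden states and a teacher layer's,
        # projecting the student up when hidden sizes differ (e.g., 384 -> 768).
        def __init__(self, d_student=384, d_teacher=768):
            super().__init__()
            self.proj = nn.Linear(d_student, d_teacher)
            self.mse = nn.MSELoss()

        def forward(self, h_student, h_teacher):
            return self.mse(self.proj(h_student), h_teacher)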
Ranked #1 on Question Answering on Quasar-T
Although deep neural networks have achieved tremendous success in question answering (QA), they still suffer from heavy computational and energy costs in real product deployment.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Ranked #8 on Zero-Shot Cross-Lingual Transfer on XTREME
In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities in a dynamically-constructed graph.
Adaptive gradient methods such as RMSProp and Adam use an exponential moving average of the squared gradient to compute adaptive step sizes, achieving better convergence than SGD in the face of noisy objectives.
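Concretely, the moving averages enter the update as follows; this is a bare-bones sketch of the Adam rule for a single parameter tensor, not a substitute for torch.optim.Adam:

    import torch

    def adam_step(p, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # state starts as {"t": 0, "m": torch.zeros_like(p), "v": torch.zeros_like(p)}.
        state["t"] += 1
        state["m"] = b1 * state["m"] + (1 - b1) * grad         # moving avg of gradient
        state["v"] = b2 * state["v"] + (1 - b2) * grad * grad  # moving avg of squared gradient
        m_hat = state["m"] / (1 - b1 ** state["t"])            # bias correction
        v_hat = state["v"] / (1 - b2 ** state["t"])
        with torch.no_grad():
            p -= lr * m_hat / (v_hat.sqrt() + eps)             # per-coordinate adaptive step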
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
Ranked #2 on Referring Expression Comprehension on RefCOCOg-val
To reveal the secrets behind the scenes of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings).
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Ranked #1 on Video Retrieval on TVR
To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context.
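As a rough illustration of the two-encoder design (all module choices and dimensions below are assumptions for the sketch, not the paper's architecture):

    import torch
    import torch.nn as nn

    class TwoEncoderSketch(nn.Module):
        # One encoder for the input sentence, one for its surrounding context,
        # fused into a single representation for the style-transfer decoder.
        def __init__(self, d=512):
            super().__init__()
            self.sent_enc = nn.GRU(d, d, batch_first=True)
            self.ctx_enc = nn.GRU(d, d, batch_first=True)
            self.fuse = nn.Linear(2 * d, d)

        def forward(self, sent_emb, ctx_emb):
            _, h_s = self.sent_enc(sent_emb)  # encode the sentence
            _, h_c = self.ctx_enc(ctx_emb)    # encode its context
            return self.fuse(torch.cat([h_s[-1], h_c[-1]], dim=-1))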
In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations.
We propose a new task towards more practical applications of image generation: high-quality image synthesis from a salient object layout.
In this work, we propose a Self-Guided Adaptation (SGA) model, targeted at aligning feature representations and transferring object detection models across domains, while considering the instantaneous alignment difficulty.
Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization.
In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering.
Ranked #28 on Question Answering on HotpotQA
We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer).
To design a more powerful NMN architecture for practical use, we propose Meta Module Network (MMN) centered on a novel meta module, which can take in function recipes and morph into diverse instance modules dynamically.
Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models.
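In symbols, the objective referred to here is the standard min-max formulation (written generically; for language models the perturbation is typically applied in embedding space):

    \min_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|\delta\| \le \epsilon} L\big(f_\theta(x + \delta),\, y\big) \Big]

where the inner maximization searches for the worst-case label-preserving perturbation within an epsilon-ball, and the outer minimization trains the parameters against it.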
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
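The distinction can be illustrated with a small sketch: under conditional masking, only one modality is corrupted per example while the other stays fully observed (the mask rate and data representations here are assumptions):

    import random

    def conditional_mask(text_tokens, image_regions, p=0.15):
        # Mask exactly one modality per example, conditioning on the other.
        if random.random() < 0.5:
            # Masked language modeling: full image observation kept.
            text = [t if random.random() > p else "[MASK]" for t in text_tokens]
            return text, image_regions
        # Masked region modeling: full text observation kept.
        regions = [r if random.random() > p else None for r in image_regions]
        return text_tokens, regions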
Ranked #1 on Visual Reasoning on NLVR2 Test
Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr.
Recent unsupervised approaches to domain adaptation primarily focus on minimizing the gap between the source and the target domains through refining the feature generator, in order to learn a better alignment between the two domains.
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks.
In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain.
In this paper, we propose a hybrid neural conversation model that combines the merits of both response retrieval and generation methods.
Commonsense reasoning is fundamental to natural language understanding.
Ranked #3 on Natural Language Understanding on PDP60
In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects.
We present the Frontier Aware Search with backTracking (FAST) Navigator, a general framework for action decoding, that achieves state-of-the-art results on the Room-to-Room (R2R) Vision-and-Language navigation challenge of Anderson et al.
Ranked #3 on Vision-Language Navigation on Room2Room
This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image.
The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session.
We therefore propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework.
Training task-completion dialogue agents with reinforcement learning usually requires a large number of real user experiences.
We present a large-scale dataset, ReCoRD, for machine reading comprehension requiring commonsense reasoning.
We propose a multi-task learning framework to learn a joint Machine Reading Comprehension (MRC) model that can be applied to a wide range of MRC tasks in different domains.
This paper presents a Discriminative Deep Dyna-Q (D3Q) approach to improving the effectiveness and robustness of Deep Dyna-Q (DDQ), a recently proposed framework that extends the Dyna-Q algorithm to integrate planning for task-completion dialogue policy learning.
This proposal introduces a Dialogue Challenge for building end-to-end task-completion dialogue systems, with the goal of encouraging the dialogue research community to collaborate and benchmark on standard datasets in a unified experimental environment.
During dialogue policy learning, the world model is constantly updated with real user experience to approach real user behavior, and in turn, the dialogue agent is optimized using both real experience and simulated experience.
First, we introduce a synthetic dataset, called CoSaL, to evaluate the end-to-end performance of our LBIE system.
This paper presents a novel neural model, the Dynamic Fusion Network (DFN), for machine reading comprehension (MRC).
This paper presents a new method, adversarial advantage actor-critic (Adversarial A2C), which significantly improves the efficiency of dialogue policy learning in task-completion dialogue systems.
To protect image contents, most existing encryption algorithms are designed to transform an original image into a texture-like or noise-like image; this, however, is an obvious visual sign indicating the presence of an encrypted image, and it consequently invites a significantly larger number of attacks.
Multispectral pedestrian detection is essential for around-the-clock applications, e.g., surveillance and autonomous driving.
Essential grammatical information is conveyed in signed languages by clusters of events involving facial expressions and movements of the head and upper body.