Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text.
For untranscribed speech data, the hypothesis produced by an ASR system must be used as the label.
Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs?
In particular, we focus on the task of Commonsense Reasoning, demonstrating that the proposed external attention mechanism can augment existing transformer models and significantly improve the model's reasoning capabilities.
no code implementations • 22 Nov 2021 • Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.
Ranked #1 on Action Recognition In Videos on Kinetics-400
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks.
3 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei
WavLM extends the HuBERT framework to denoising masked speech modeling, where the target is to predict pseudo-labels of simulated noisy speech on masked regions.
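A minimal sketch of the denoising masked-prediction setup described above, using toy NumPy arrays. The function names, SNR-based mixing, and masking scheme are illustrative assumptions, not WavLM's actual implementation:

```python
import numpy as np

def simulate_noisy_speech(clean, noise, snr_db=5.0, rng=None):
    """Mix clean speech with noise at a target SNR (illustrative)."""
    rng = rng or np.random.default_rng(0)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def masked_prediction_targets(pseudo_labels, mask_prob=0.3, rng=None):
    """Pick masked frame positions; the model is trained to predict
    pseudo-labels only on those masked regions."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(len(pseudo_labels)) < mask_prob
    return mask, pseudo_labels[mask]
```

The key point is that the prediction loss is computed only where `mask` is true, so the model must infer the underlying clean content of masked regions from noisy context.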
Then we utilize a diverse set of 4 English knowledge sources to provide more comprehensive coverage of knowledge in different formats.
In this paper, we bring a new way of digesting news content by introducing the task of segmenting a news article into multiple sections and generating a corresponding summary for each section.
In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word- and sentence-level alignment between the input text sequence and rare word definitions, enhancing language model representations with dictionary knowledge.
The recently proposed Fusion-in-Decoder (FiD), which is built on top of the pretrained generative model T5, achieves state-of-the-art performance in the reading module.
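The fusion step in FiD can be sketched as follows: each retrieved passage is encoded independently together with the question, and the encoder states are concatenated so the decoder attends over all passages jointly. The `toy_encode` function below is a stand-in for a real encoder such as T5's; it is purely illustrative:

```python
import numpy as np

def fid_encode(question, passages, encode):
    """Fusion-in-Decoder style encoding (sketch): encode each
    (question, passage) pair independently, then concatenate all
    encoder hidden states for joint decoding."""
    states = [encode(f"question: {question} context: {p}") for p in passages]
    return np.concatenate(states, axis=0)

def toy_encode(text, dim=4):
    """Toy encoder: one dummy hidden-state vector per whitespace token."""
    tokens = text.split()
    return np.ones((len(tokens), dim))
```

Because encoding is per-passage, the cost of self-attention stays linear in the number of passages, while the decoder still sees evidence from all of them at once.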
For a dialogue, it corrupts a window of text with dialogue-inspired noise, and guides the model to reconstruct this window based on the content of the remaining conversation.
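The window-corruption objective described above can be sketched with a simple text-level noising function. The function name, mask token, and turn-level granularity are assumptions for illustration, not the paper's exact noising scheme:

```python
import random

def corrupt_window(turns, window=2, mask_token="[MASK]", rng=None):
    """Replace a contiguous window of dialogue turns with a mask token;
    the model must reconstruct the window from the remaining turns."""
    rng = rng or random.Random(0)
    start = rng.randrange(0, max(1, len(turns) - window + 1))
    target = turns[start:start + window]
    corrupted = turns[:start] + [mask_token] + turns[start + window:]
    return corrupted, target
```

During pre-training, `corrupted` is the model input and `target` is the reconstruction objective, forcing the model to use long-range conversational context.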
It is often observed in knowledge-centric tasks (e.g., commonsense question answering, relation classification) that integrating external knowledge such as entity representations into language models can provide useful information that boosts performance.
In this paper, we attempt to bridge these two lines of research and propose a joint and domain adaptive approach to SLU.
Commonsense generation is a challenging task of generating a plausible sentence describing an everyday scenario using provided concepts.
Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR system alike will be propagated to the next task in the pipeline.
However, the performance of using multiple encoders and decoders on zero-shot translation still lags behind universal NMT.
End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module.
In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner.
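The multi-task combination described above can be sketched as a weighted loss: labeled batches contribute the supervised CTC term, unlabeled batches only the contrastive term. The weighting scheme and `alpha` hyperparameter here are assumptions for illustration, not UniSpeech's exact formulation:

```python
def multitask_loss(ctc_loss, contrastive_loss, labeled, alpha=0.5):
    """Sketch of a UniSpeech-style multi-task objective: mix the
    supervised CTC loss with the self-supervised contrastive loss
    when labels are available; otherwise use only the contrastive loss."""
    if labeled:
        return alpha * ctc_loss + (1 - alpha) * contrastive_loss
    return contrastive_loss
```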
However, although a KG contains rich structural information, it lacks the context to provide a more precise understanding of the concepts.
The context information is captured by the hidden states of LSTM-LMs across utterances and can be used to guide the first-pass search effectively.
Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text.
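One simple way to align the two modules' outputs in a shared latent space, as described above, is to mean-pool each sequence and penalize the distance between pooled vectors for paired speech and text. This L2 formulation is an illustrative assumption; SPLAT's actual alignment objective may differ:

```python
import numpy as np

def alignment_loss(speech_repr, text_repr):
    """Sketch of a shared-latent-space alignment loss: mean-pool each
    (sequence_length, dim) representation and return the squared L2
    distance between the pooled speech and text vectors."""
    s = speech_repr.mean(axis=0)
    t = text_repr.mean(axis=0)
    return float(np.sum((s - t) ** 2))
```

Minimizing this loss on a small amount of paired data pulls the speech and text encoders toward a common embedding space, while the unpaired data is still used for the per-module masked language modeling tasks.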
Knowledge graphs (KGs) contain rich information about world knowledge, entities and relations.
In this work, we propose a novel architecture that extends Transformer encoder-decoder architecture in order to improve on these shortcomings.
The training of spoken language understanding (SLU) models often faces the problem of data scarcity.
In this work, we propose a novel NLP task called ASR post-processing for readability (APR) that aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
With the abundance of automatic meeting transcripts, meeting summarization is of great interest to both participants and other parties.
Automatic abstractive summaries are found to often distort or fabricate facts in the article.
It is pre-trained on a large annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains.
Ranked #4 on Data-to-Text Generation on MULTIWOZ 2.1
Text summarization aims to extract essential information from a piece of text and transform the text into a concise version.
A typical journalistic convention in news articles is to deliver the most salient information in the beginning, also known as the lead bias.
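Lead bias can be turned into a self-supervision signal: treat the first few sentences of an article as a pseudo-summary target and the remainder as the source document. The function below is a minimal sketch of that idea, assuming pre-split sentences and a fixed lead size `k`:

```python
def lead_bias_pair(article_sentences, k=3):
    """Build a (source, target) pre-training pair from lead bias:
    the first k sentences act as the pseudo-summary target and the
    rest of the article acts as the source document."""
    target = " ".join(article_sentences[:k])
    source = " ".join(article_sentences[k:])
    return source, target
```

This requires no human annotation, so pre-training pairs can be harvested at scale from raw news articles.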
In this paper, we propose a novel multi-task learning framework, NLG-LM, for natural language generation.
In this paper, we put forward a slot-independent neural model (SIM) to track dialogue states while keeping the model complexity invariant to the number of dialogue slots.
For example, the pretrained model without finetuning outperforms the pointer-generator network on the CNN/DailyMail dataset.
The speaker-attributed WER (SAWER) is 26.7%.
Conversational question answering (CQA) is a novel QA task that requires understanding of dialogue context.
Ranked #2 on Question Answering on CoQA