A common training technique for language models is teacher forcing (TF).
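As a minimal illustration (not taken from the paper), teacher forcing feeds the ground-truth previous token to the decoder at each step instead of the model's own prediction. The sketch below assumes a toy autoregressive PyTorch decoder; all module and tensor names are hypothetical.

```python
# Minimal sketch of teacher forcing (illustrative only, not the paper's code).
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)
proj = nn.Linear(hidden, vocab_size)

targets = torch.randint(0, vocab_size, (4, 10))   # (batch, seq_len) gold tokens
decoder_inputs = targets[:, :-1]                  # feed ground truth, shifted right
hidden_states, _ = rnn(embed(decoder_inputs))
logits = proj(hidden_states)                      # (batch, seq_len - 1, vocab)

# Cross-entropy against the gold next tokens; the model's own predictions
# are never fed back in during training.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets[:, 1:].reshape(-1)
)
```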
Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation (ST) and spoken language understanding (SLU).
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.
In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.
During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments.
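A rough sketch of what such an auxiliary loss could look like is given below: a cosine-similarity term that pulls a segment's context embedding toward the context embeddings of its neighbors. The function and variable names are hypothetical, and the paper's exact formulation may differ.

```python
# Hedged sketch of an auxiliary similarity loss between a segment's context
# embedding and the context embeddings of surrounding segments.
import torch
import torch.nn.functional as F

def context_similarity_loss(ctx, left_ctx, right_ctx):
    """Encourage ctx (batch, dim) to be close to neighboring segments' contexts."""
    sim_left = F.cosine_similarity(ctx, left_ctx, dim=-1)
    sim_right = F.cosine_similarity(ctx, right_ctx, dim=-1)
    # Maximizing similarity is equivalent to minimizing (1 - similarity).
    return ((1 - sim_left) + (1 - sim_right)).mean()

# Example: total loss = task loss + weight * auxiliary loss.
ctx, left_ctx, right_ctx = (torch.randn(8, 256) for _ in range(3))
aux_loss = context_similarity_loss(ctx, left_ctx, right_ctx)
```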
Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR).
Ranked #8 on Speech Recognition on LibriSpeech test-other
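A minimal sketch of the Conformer-style block ordering described above, with self-attention (global context) followed by a depthwise convolution module (local context). This is illustrative only, not the reference implementation; the feed-forward "macaron" halves, gating, and normalization details of the real architecture are omitted.

```python
# Hedged sketch of a Conformer-style block: self-attention then convolution.
import torch
import torch.nn as nn

class MiniConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel_size=15):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, dim, 1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise_out = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                      # x: (batch, time, dim)
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h)              # global context via self-attention
        x = x + h
        h = self.conv_norm(x).transpose(1, 2)  # (batch, dim, time) for Conv1d
        h = self.pointwise_out(torch.relu(self.depthwise(self.pointwise_in(h))))
        return x + h.transpose(1, 2)           # local context via convolution

y = MiniConformerBlock()(torch.randn(2, 50, 256))   # -> (2, 50, 256)
```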
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
Ranked #3 on Named Entity Recognition (NER) on SLUE
In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task?
Historically, these benchmarks have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks.
Ranked #1 on Named Entity Recognition (NER) on SLUE
This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR).
A Multi-mode ASR model can fulfill various latency requirements during inference: when a larger latency is acceptable, the model can process longer future context to achieve higher accuracy; when the latency budget is not flexible, the model can rely less on future context while still achieving reliable accuracy.
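As a hedged illustration of the idea (not the paper's implementation), a streaming encoder can control how much it depends on future context with an attention mask whose right-context width is chosen at inference time; the function below is hypothetical.

```python
# Hedged sketch: an attention mask whose allowed right (future) context is a
# runtime knob, so one model can trade latency for accuracy.
import torch

def lookahead_mask(seq_len, right_context):
    """True = blocked. Each frame may attend to all past frames and to at most
    `right_context` future frames."""
    idx = torch.arange(seq_len)
    return idx[None, :] > (idx[:, None] + right_context)

print(lookahead_mask(5, 0))   # strictly causal: lowest latency
print(lookahead_mask(5, 2))   # 2 frames of lookahead: higher accuracy, higher latency
```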
We develop a novel approach for paper bidding and assignment that is much more robust against such attacks.
Here we propose "Attention-based Quantum Tomography" (AQT), a quantum state reconstruction method that uses an attention-based generative network to learn the mixed-state density matrix of a noisy quantum state.
The moments (a.k.a. mean and standard deviation) of latent features are often removed as noise when training image recognition models, to increase stability and reduce training time.
Ranked #31 on Domain Generalization on ImageNet-A
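As a hedged illustration of what removing the moments of latent features means in practice, the sketch below normalizes each example's feature map by its own per-channel mean and standard deviation (instance-normalization style). The dimensions are assumed for illustration, not taken from the paper.

```python
# Hedged sketch: removing the moments (mean, std) of latent features by
# per-example, per-channel normalization.
import torch

feats = torch.randn(8, 64, 16, 16)               # (batch, channels, H, W) latent features
mean = feats.mean(dim=(2, 3), keepdim=True)      # per-example, per-channel mean
std = feats.std(dim=(2, 3), keepdim=True)        # per-example, per-channel std
normalized = (feats - mean) / (std + 1e-5)       # moments removed as "noise"
```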
This paper introduces Integrated Triaging, a framework that prunes almost all context in early layers of a network, leaving the remaining (deep) layers to scan only a tiny fraction of the full corpus.
In this technical report, we introduce FastFusionNet, an efficient variant of FusionNet.
Graph Convolutional Networks (GCNs) and their variants have received significant attention and have become the de facto methods for learning graph representations.
Ranked #3 on Text Classification on Ohsumed
We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements.
Ranked #1 on Machine Translation on WMT 2017 English-Chinese
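A rough sketch of the mechanism described above: a small convolution kernel is predicted from the current time-step's representation alone, softmax-normalized, and applied over the local context window. The class name and dimensions are hypothetical; this is not the paper's implementation.

```python
# Hedged sketch of a dynamic convolution whose kernel is predicted from the
# current time-step only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.kernel_proj = nn.Linear(dim, kernel_size)   # kernel from current step only

    def forward(self, x):                                # x: (batch, time, dim)
        k = self.kernel_size
        weights = F.softmax(self.kernel_proj(x), dim=-1)         # (b, t, k)
        x_pad = F.pad(x, (0, 0, k // 2, k // 2))                 # pad along time
        windows = x_pad.unfold(1, k, 1).permute(0, 1, 3, 2)      # (b, t, k, dim)
        # Weighted sum of each local window with its predicted kernel.
        return torch.einsum('btk,btkd->btd', weights, windows)

y = DynamicConv()(torch.randn(2, 10, 256))               # -> (2, 10, 256)
```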
The machine learning community has become increasingly concerned with the potential for bias and discrimination in predictive models.
In this paper we investigate image classification with computational resource limits at test time.