no code implementations • • Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, Katharina Kann
This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas.
This paper describes the submission to the IWSLT 2021 Low-Resource Speech Translation Shared Task by the IMS team.
no code implementations • • Nadja Schauffler, Toni Bernhart, Andre Blessing, Gunilla Eschenbach, Markus Gärtner, Kerstin Jung, Anna Kinder, Julia Koch, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu, Lorenz Wesemann, Jonas Kuhn
We present the steps taken towards an exploration platform for a multi-modal corpus of German lyric poetry from the Romantic era developed in the project »textklang«.
We conclude that creating shared mental models between users and AI systems is important to achieving successful dialogs.
Based on statistical and qualitative analysis of the responses, we found that language style played an important role both in how human-like participants perceived a dialog agent and in how likable they found it.
In this paper, we propose a novel two-stage model architecture that can be trained with only a few in-domain hand-labeled examples.
We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic-English Speech Translation Corpus.
While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages.
In this paper, we exploit the advantages from both inter-domain loss and CycleGAN to achieve better shared representation of unpaired speech and text inputs and thus improve the speech-to-text mapping.
In particular, we (i) formulate desired characteristics of explanation quality that apply across tasks and domains, (ii) point out how current evaluation practices violate those characteristics, and (iii) propose actionable guidelines to overcome obstacles that limit today's evaluation of explanation quality and to enable the development of explainable systems that provide tangible benefits for human users.
In order to protect the privacy of speech data, speaker anonymization aims to hide the identity of a speaker by changing the voice in speech recordings.
For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best.
Given that the factors giving rise to CS vary from one country to another, as well as from one person to another, CS is found to be a speaker-dependent behaviour, where the frequency with which the foreign language is embedded differs across speakers.
In this work, we propose a speaker anonymization pipeline that leverages high quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings.
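One common strategy for the embedding-anonymization step in such a pipeline is to replace the original speaker embedding with an average of embeddings that are far from it in a pool of other speakers. The sketch below is a minimal illustration of that idea only; the function names, the cosine-distance selection, and the pool-averaging strategy are assumptions for illustration, not the authors' exact method, and the ASR and synthesis stages are omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def anonymize_embedding(orig, pool, k=2):
    """Replace `orig` with the average of the k pool embeddings
    least similar to it, hiding the original speaker identity."""
    ranked = sorted(pool, key=lambda e: cosine(orig, e))  # least similar first
    chosen = ranked[:k]
    dim = len(orig)
    return [sum(e[i] for e in chosen) / k for i in range(dim)]
```

The anonymized embedding would then condition the TTS stage together with the phonetic transcription from the ASR stage.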
Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech.
The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods.
Our best models achieve a 33.6% improvement in perplexity, +3.2-5.6 BLEU points on the MT task, and a 7% relative improvement in WER on the ASR task.
Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation.
While neural text-to-speech systems perform remarkably well in high-resource scenarios, they cannot be applied to the majority of the over 6,000 spoken languages in the world due to a lack of appropriate training data.
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking.
We investigate densely connected convolutional networks (DenseNets) and their extension with domain adversarial training for noise robust speech recognition.
Multilingual speakers tend to alternate between languages within a conversation, a phenomenon referred to as "code-switching" (CS).
This paper presents our latest effort on improving Code-switching language models that suffer from data scarcity.
This paper presents our latest investigations on improving automatic speech recognition for noisy speech via speech enhancement.
2 code implementations • 29 Nov 2021 • Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe
However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks.
On the way towards general Visual Question Answering (VQA) systems that are able to answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets.
For this, we investigate different sources of external knowledge and evaluate the performance of our models on in-domain data as well as on special transfer datasets that are designed to assess fine-grained reasoning capabilities.
In this paper, we present our work on code-switched Egyptian Arabic-English automatic speech recognition (ASR).
Concretely, we propose a correction module that is trained to estimate the model's correctness as well as an iterative prediction update based on the prediction's gradient.
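The iterative update can be pictured as gradient ascent on an estimated-correctness score: the prediction is nudged in the direction that increases the correction module's confidence. The sketch below is a hedged stand-in, assuming a generic callable correctness estimator and using finite differences in place of backpropagation; it is not the authors' implementation.

```python
def refine_prediction(y, correctness, steps=10, lr=0.1, eps=1e-5):
    """Iteratively move prediction `y` uphill on the (hypothetical)
    correctness estimate, approximating the gradient numerically."""
    y = list(y)
    for _ in range(steps):
        grad = []
        for i in range(len(y)):
            y_hi = y[:]; y_hi[i] += eps
            y_lo = y[:]; y_lo[i] -= eps
            grad.append((correctness(y_hi) - correctness(y_lo)) / (2 * eps))
        y = [yi + lr * g for yi, g in zip(y, grad)]
    return y
```

With a differentiable correction module, the finite-difference loop would be replaced by a single backward pass through the estimator.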
1 code implementation • • Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan Meza-Ruiz, Gustavo A. Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando Coto-Solano, Ngoc Thang Vu, Katharina Kann
Continued pretraining offers improvements, with an average accuracy of 44.05%.
In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features.
We propose a new method of generating meaningful embeddings for speech, changes to four commonly used meta learning approaches to enable them to perform keyword spotting in continuous signals and an approach of combining their outcomes into an end-to-end automatic speech recognition system to improve rare word recognition.
Recently, leveraging pre-trained Transformer-based language models in downstream, task-specific models has advanced state-of-the-art results in natural language understanding tasks.
The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users.
We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural network (CNN) and XLNet Transformer architectures.
Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps.
We present an iterative data augmentation framework, which trains and searches for an optimal ensemble and simultaneously annotates new training data in a self-training style.
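The train-then-annotate cycle of such a framework can be sketched as a plain self-training skeleton: train on the labeled pool, move high-confidence predictions on unlabeled data into the pool, and repeat. Everything below is illustrative; the `train`, `predict`, and `confidence` callables are placeholders, and the ensemble search from the paper is omitted.

```python
def self_training_loop(labeled, unlabeled, train, predict, confidence,
                       rounds=3, threshold=0.9):
    """Illustrative self-training skeleton (assumes rounds >= 1).

    labeled:   list of (x, y) pairs
    unlabeled: list of x
    The callables are placeholders for a real model's API.
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        keep = []
        for x in unlabeled:
            y = predict(model, x)
            if confidence(model, x, y) >= threshold:
                labeled.append((x, y))  # self-annotated example
            else:
                keep.append(x)          # stays unlabeled for now
        unlabeled = keep
        if not keep:
            break
    return model, labeled
```

In the paper's setting, `train` would also search for the best ensemble at each round rather than fit a single model.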
1 code implementation • • Chia-Yu Li, Daniel Ortega, Dirk Väth, Florian Lux, Lindsey Vanderlyn, Maximilian Schmidt, Michael Neumann, Moritz Völkel, Pavel Denisov, Sabrina Jenne, Zorica Kacarevic, Ngoc Thang Vu
We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents.
In this paper, we first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language.
We present a dependency tree linearization model with two novel components: (1) a tree-structured encoder based on bidirectional Tree-LSTM that propagates information first bottom-up then top-down, which allows each token to access information from the entire tree; and (2) a linguistically motivated head-first decoder that emphasizes the central role of the head and linearizes the subtree by incrementally attaching the dependents on both sides of the head.
Code-switching (CS) is a widespread phenomenon among bilingual and multilingual societies.
In this paper, we explore state-of-the-art deep reinforcement learning methods for dialog policy training such as prioritized experience replay, double deep Q-Networks, dueling network architectures and distributional learning.
This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech.
We present IMS-Speech, a web-based tool for German and English speech transcription, aiming to facilitate research in various disciplines that require access to lexical information in spoken language materials.
Ranked #4 on Speech Recognition on TUDA (using extra training data)
In this paper, we present ADVISER - an open source dialog system framework for education and research purposes.
This paper presents our latest investigations on dialog act (DA) classification on automatically generated transcriptions.
We present a general approach with reinforcement learning (RL) to approximate dynamic oracles for transition systems where exact dynamic oracles are difficult to derive.
We present a comparison of word-based and character-based sequence-to-sequence models for data-to-text natural language generation, which generate natural language descriptions for structured inputs.
We propose a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieves state-of-the-art results on the MovieQA question answering dataset.
This paper presents our latest investigation on Densely Connected Convolutional Networks (DenseNets) for acoustic modelling (AM) in automatic speech recognition.
In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically, single microphone far-field speech.
Deep learning techniques have recently been shown to be successful in many natural language processing tasks, forming state-of-the-art systems.
Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words.
Pitch accent detection often makes use of both acoustic and lexical features based on the fact that pitch accents tend to correlate with certain words.
Audiovisual speech recognition (AVSR) is a method to alleviate the adverse effect of noise in the acoustic signal.
We present two novel datasets for the low-resource language Vietnamese to assess models of semantic similarity: ViCon comprises pairs of synonyms and antonyms across word classes, thus offering data to distinguish between similarity and dissimilarity.
Research on multilingual speech emotion recognition faces the problem that most available speech corpora differ from each other in important ways, such as annotation methods or interaction scenarios.
The experimental results reveal that Brown word clusters, part-of-speech tags and open-class words are the most effective at reducing the perplexity of factored language models on the Mandarin-English Code-Switching corpus SEAME.
Adding manually annotated prosodic information, specifically pitch accents and phrasing, to the typical text-based feature set for coreference resolution has previously been shown to have a positive effect on German data.
We present a novel neural model HyperVec to learn hierarchical embeddings for hypernymy detection and directionality.
This paper presents our novel method to encode word confusion networks, which can represent a rich hypothesis space of automatic speech recognition systems, via recurrent neural networks.
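A word confusion network is a sequence of slots, each holding competing word hypotheses with posterior probabilities. One simple way to feed such a structure to a recurrent network is to collapse each slot into a posterior-weighted average of its hypotheses' embeddings; the sketch below shows only that weighting idea, with an assumed embedding lookup, and leaves out the RNN itself and whatever exact encoding the paper uses.

```python
def encode_confusion_network(wcn, embed, dim=2):
    """Collapse each confusion slot into one vector.

    wcn:   list of slots; each slot is a list of (word, posterior)
    embed: callable mapping a word to a `dim`-sized vector
    """
    encoded = []
    for slot in wcn:
        vec = [0.0] * dim
        for word, posterior in slot:
            e = embed(word)
            vec = [v + posterior * ei for v, ei in zip(vec, e)]
        encoded.append(vec)
    return encoded
```

The resulting sequence of slot vectors can then be consumed by any standard sequence model, preserving alternatives that a 1-best transcript would discard.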
We present a general-purpose tagger based on convolutional neural networks (CNN), used for both composing word vectors and encoding context information.
This paper demonstrates the potential of convolutional neural networks (CNN) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features.
Speech emotion recognition is an important and challenging task in the realm of human-computer interaction.
We present a transition-based dependency parser that uses a convolutional neural network to compose word representations from characters.
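Character-level composition of this kind slides convolution filters over a word's character embeddings and max-pools over positions. The toy sketch below shows that mechanic in pure Python under assumed shapes (each filter is a list of `width` weight vectors); a real model would use a tensor library, multiple filter widths, and learned parameters.

```python
def char_cnn_word_vector(word, char_embed, filters, width=3):
    """Compose a word vector from characters: apply each filter at
    every position and max-pool over positions (0.0 if the word is
    shorter than the filter width)."""
    chars = [char_embed[c] for c in word]
    out = []
    for f in filters:  # f: list of `width` weight vectors
        scores = []
        for i in range(len(chars) - width + 1):
            s = sum(w * x
                    for wvec, xvec in zip(f, chars[i:i + width])
                    for w, x in zip(wvec, xvec))
            scores.append(s)
        out.append(max(scores) if scores else 0.0)
    return out
```

Each filter contributes one dimension of the word vector, so unseen words still get representations as long as their characters are known.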
This paper addresses challenges of Natural Language Processing (NLP) on non-canonical multilingual data in which two or more languages are mixed.
We investigate the usage of convolutional neural networks (CNNs) for the slot filling task in spoken language understanding.
We propose a novel vector representation that integrates lexical contrast into distributional vectors and strengthens the most salient features for determining degrees of word similarity.
This paper investigates two different neural architectures for the task of relation classification: convolutional neural networks and recurrent neural networks.