Metric-based learning is a well-known family of methods for few-shot learning, especially in computer vision.
In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features.
We propose a new method for generating meaningful speech embeddings, modifications to four commonly used meta-learning approaches that enable them to perform keyword spotting in continuous signals, and an approach for combining their outputs into an end-to-end automatic speech recognition system to improve rare-word recognition.
Recently, leveraging pre-trained Transformer-based language models in downstream, task-specific models has advanced the state of the art in natural language understanding tasks.
The user study shows that our models improve users' ability to judge the correctness of the system, and that scores such as F1 are not sufficient to estimate the usefulness of a model in a practical setting with human users.
We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural networks (CNN) and XLNet Transformer architectures.
Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps.
We present an iterative data augmentation framework, which trains and searches for an optimal ensemble and simultaneously annotates new training data in a self-training style.
We propose a graph-based method to tackle the dependency tree linearization task.
We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents.
Code-switching has become a prevalent phenomenon across many communities.
In this paper, we first discuss the CS phenomenon in Egypt and the factors that gave rise to the current linguistic situation.
We introduce the IMS contribution to the Surface Realization Shared Task 2019.
We present a dependency tree linearization model with two novel components: (1) a tree-structured encoder based on bidirectional Tree-LSTM that propagates information first bottom-up then top-down, which allows each token to access information from the entire tree; and (2) a linguistically motivated head-first decoder that emphasizes the central role of the head and linearizes the subtree by incrementally attaching the dependents on both sides of the head.
Code-switching (CS) is a widespread phenomenon among bilingual and multilingual societies.
In this paper, we explore state-of-the-art deep reinforcement learning methods for dialog policy training such as prioritized experience replay, double deep Q-Networks, dueling network architectures and distributional learning.
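One of the listed methods, double deep Q-networks, can be illustrated with toy Q-tables in place of networks; the function names and values below are hypothetical, not the paper's implementation:

```python
# Comparison of the standard DQN target with the double-DQN target.
# Toy Q-value lists stand in for the online and target networks.

def dqn_target(q_target, reward, gamma):
    # Standard DQN: the target network both selects and evaluates
    # the next action, which tends to overestimate values.
    return reward + gamma * max(q_target)

def double_dqn_target(q_online, q_target, reward, gamma):
    # Double DQN: the online network selects the action, the target
    # network evaluates it, decoupling selection from evaluation.
    a = max(range(len(q_online)), key=q_online.__getitem__)
    return reward + gamma * q_target[a]

q_online = [1.0, 5.0]   # online net prefers action 1
q_target = [4.0, 2.0]   # target net disagrees about the values

print(dqn_target(q_target, 0.0, 0.9))               # 0.9 * 4.0 = 3.6
print(double_dqn_target(q_online, q_target, 0.0, 0.9))  # 0.9 * 2.0 = 1.8
```

The lower double-DQN target in this toy case shows the overestimation-reduction effect that motivates the method in dialog policy training.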
This paper presents our latest investigation into end-to-end automatic speech recognition (ASR) for overlapped speech.
The generalized Dyck language has been used to analyze the ability of Recurrent Neural Networks (RNNs) to learn context-free grammars (CFGs).
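The membership decision that such RNN probes are trained to make can be sketched with a minimal stack-based recognizer for generalized Dyck strings (illustrative, not taken from the study):

```python
# Minimal recognizer for the generalized Dyck language: balanced,
# properly nested strings over several bracket pairs. Recognizing it
# requires a stack, which is why it is used to probe whether RNNs
# can learn context-free structure.

PAIRS = {')': '(', ']': '[', '}': '{'}

def is_dyck(s):
    stack = []
    for ch in s:
        if ch in PAIRS.values():      # opening bracket: push
            stack.append(ch)
        elif ch in PAIRS:             # closing bracket: must match top
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:                         # non-bracket symbol: reject
            return False
    return not stack                  # accept only if fully balanced

print(is_dyck('([]{})'))   # True: properly nested
print(is_dyck('([)]'))     # False: crossing brackets
```

An RNN that truly learned the grammar would have to emulate this unbounded stack in its hidden state, which is exactly what such analyses test.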
In this paper, we present ADVISER - an open-source dialog system framework for education and research purposes.
We present a general approach with reinforcement learning (RL) to approximate dynamic oracles for transition systems where exact dynamic oracles are difficult to derive.
We present a comparison of word-based and character-based sequence-to-sequence models for data-to-text natural language generation, which generate natural language descriptions for structured inputs.
We propose a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieves state-of-the-art results on the MovieQA question answering dataset.
This paper presents our latest investigation into Densely Connected Convolutional Networks (DenseNets) for acoustic modelling (AM) in automatic speech recognition.
In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically, single microphone far-field speech.
Deep learning techniques have recently been shown to be successful in many natural language processing tasks, forming state-of-the-art systems.
Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words.
Pitch accent detection often makes use of both acoustic and lexical features based on the fact that pitch accents tend to correlate with certain words.
We present two novel datasets for the low-resource language Vietnamese to assess models of semantic similarity: ViCon comprises pairs of synonyms and antonyms across word classes, thus offering data to distinguish between similarity and dissimilarity.
We propose a neural model that processes both lexical and acoustic features for classification.
Research on multilingual speech emotion recognition faces the problem that most available speech corpora differ from each other in important ways, such as annotation methods or interaction scenarios.
The experimental results reveal that Brown word clusters, part-of-speech tags and open-class words are the most effective at reducing the perplexity of factored language models on the Mandarin-English Code-Switching corpus SEAME.
Adding manually annotated prosodic information, specifically pitch accents and phrasing, to the typical text-based feature set for coreference resolution has previously been shown to have a positive effect on German data.
We present a novel neural model HyperVec to learn hierarchical embeddings for hypernymy detection and directionality.
This paper presents our novel method for encoding word confusion networks, which can represent the rich hypothesis space of automatic speech recognition systems, via recurrent neural networks.
We present a general-purpose tagger based on convolutional neural networks (CNN), used for both composing word vectors and encoding context information.
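The character-composition step of such a tagger can be sketched as a 1-D convolution over character embeddings with max-over-time pooling; the dimensions, random weights, and names below are toy assumptions, not the tagger's actual configuration:

```python
# Sketch: compose a word vector from its characters with a 1-D
# convolution and max-over-time pooling. Filter weights and character
# embeddings are random toy values.

import random
random.seed(0)

EMB_DIM, WIN, N_FILTERS = 4, 3, 5
char_emb = {}

def embed(ch):
    # Lazily assign a random embedding to each character.
    if ch not in char_emb:
        char_emb[ch] = [random.uniform(-1, 1) for _ in range(EMB_DIM)]
    return char_emb[ch]

# Each filter spans WIN characters, i.e. WIN * EMB_DIM weights.
filters = [[random.uniform(-1, 1) for _ in range(WIN * EMB_DIM)]
           for _ in range(N_FILTERS)]

def word_vector(word):
    chars = [embed(c) for c in word]
    outputs = []
    for f in filters:
        # Slide the filter over character windows; max-over-time
        # pooling keeps the strongest response per filter.
        best = float('-inf')
        for i in range(len(chars) - WIN + 1):
            window = [x for e in chars[i:i + WIN] for x in e]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        outputs.append(best)
    return outputs

vec = word_vector("tagger")
print(len(vec))  # one pooled feature per filter -> 5
```

The resulting fixed-size vector is the same length for any word of at least WIN characters, which is what lets the composed representation feed a downstream tagging layer.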
Speech emotion recognition is an important and challenging task in the realm of human-computer interaction.
This paper demonstrates the potential of convolutional neural networks (CNN) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features.
We present a transition-based dependency parser that uses a convolutional neural network to compose word representations from characters.
Distinguishing between antonyms and synonyms is a key task to achieve high performance in NLP systems.
We propose a novel vector representation that integrates lexical contrast into distributional vectors and strengthens the most salient features for determining degrees of word similarity.
This paper investigates two different neural architectures for the task of relation classification: convolutional neural networks and recurrent neural networks.