However, the performance of architecture is limited by the type of operations and prior knowledge.
In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting.
At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
In this paper, an across-task neural architecture search (AT-NAS) is proposed to address the problem through combining gradient-based meta-learning with EA-based NAS to learn over the distribution of tasks.
The proposed VisualTTS adopts two novel mechanisms that are 1) textual-visual attention, and 2) visual fusion strategy during acoustic decoding, which both contribute to forming accurate alignment between the input text content and lip motion in input lip sequence.
A novel multi-scale temporal convolutional network (TCN) and long short-term memory network (LSTM) based magnetic localization approach is proposed.
However, it is still challenging to search for efficient networks due to the gap between the searched constraint and real inference time exists.
To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models.
Change detection in heterogeneous remote sensing images is a challenging problem because it is hard to make a direct comparison in the original observation spaces, and most methods rely on a set of manually labeled samples.
text, table, image) and propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document.
We investigate the privacy and utility implications of applying dx-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications.
To overcome the unfull training, a stage-wise pruning(SWP) method is proposed, which splits a deep supernet into several stage-wise supernets to reduce the candidate number and utilize inplace distillation to supervise the stage training.
A hybrid approach, which leverages both semantic (deep neural network-based) and lexical (keyword matching-based) retrieval models, is proposed.
We consider that there is a common code between speakers for emotional expression in a spoken language, therefore, a speaker-independent mapping between emotional states is possible.
In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT.
Different from previous work, we take the node features from a well-trained graph aggregator instead of the hand-craft features, as the states in reinforcement learning.
Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.
We propose using an extended model architecture of Tacotron, that is a multi-source sequence-to-sequence model with a dual attention mechanism as the shared model for both the TTS and VC tasks.