Here, we address these issues from two aspects: (1) we enhance the prior selection module with the necessary posterior information obtained from the specially designed Posterior Information Prediction Module (PIPM); (2) we propose a Knowledge Distillation Based Training Strategy (KDBTS) to train the decoder with knowledge selected from the prior distribution, removing the exposure bias of knowledge selection.
In this paper, we propose a Virtual Multi-modality Foreground Matting (VMFM) method to learn the human-object interactive foreground (a human and the objects he or she interacts with) from a raw RGB image.
In the middle of query execution, AdaptNN collects a number of runtime features and predicts the termination condition for each individual query, thereby attaining better end-to-end latency.
In this paper, we propose a simple yet effective anomaly detection framework for deep RL algorithms that simultaneously considers random, adversarial, and out-of-distribution (OOD) state outliers.
Most of the recent state-of-the-art results for speaker verification are achieved by X-vector and its subsequent variants.
Based on a hybrid learning framework, where a spike actor-network infers actions from states and a deep critic network evaluates the actor, we propose a Population-coding and Dynamic-neurons improved Spiking Actor Network (PDSAN) for efficient state representation at two different scales: input coding and neuronal coding.
In the speaker extraction problem, additional information about the target speaker, including voiceprint, lip movement, facial expression, and spatial information, has been found to aid the tracking and extraction of the target speaker.
Specifically, we first structure the EMR sequence into a hierarchical graph network and then obtain the causal relationships between multi-granularity features and diagnosis results through counterfactual intervention on the graph.
The spatial self-attention module is designed to attend to the cross-channel correlations in the covariance matrices.
Network representation learning (NRL) is an effective graph analytics technique that helps users deeply understand the hidden characteristics of graph data.
Experimental results on segmented speech data show that the proposed MTL framework outperforms the baseline single-task learning (STL) framework on the ASR task.
In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
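The exact MixSpeech recipe is not detailed here; as a generic illustration of the underlying mixup interpolation applied to acoustic features (function name, shapes, and the Beta parameter are illustrative assumptions, not the paper's specification), a minimal sketch might look like:

```python
import numpy as np

def mixup(x1, x2, alpha=0.5):
    """Interpolate two equal-shape feature matrices (e.g. log-mel
    spectrograms) with a Beta-distributed mixing weight, as in mixup.
    Returns the mixed features and the weight given to x1."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

# Two toy "spectrograms" (frames x mel bins); shapes must match,
# so in practice the shorter utterance would be padded or cropped.
a = np.ones((4, 3))
b = np.zeros((4, 3))
mixed, lam = mixup(a, b)
```

The corresponding label losses are typically weighted by `lam` and `1 - lam` rather than mixing the transcripts themselves.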
We extend the classical tracking-by-detection paradigm to this tracking-any-object task.
Ranked #1 on Multi-Object Tracking on TAO (using extra training data)
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
To verify its universality over languages, we apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks and have shown the potential to become the mainstream.
For one thing, speakers often rely on context and commonsense knowledge to express emotions; for another, most utterances in conversations carry neutral emotion, so the confusion between a few non-neutral utterances and many more neutral ones restrains emotion recognition performance.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
The performance of the proposed BRP-SNN is further verified on spatial (MNIST and Cifar-10) and temporal (TIDigits and DvsGesture) tasks, where the SNN using BRP reaches accuracy similar to other state-of-the-art BP-based SNNs while saving over 50% of the computational cost compared to ANNs.
Spiking Neural Networks (SNNs) incorporate more biologically plausible structures and learning principles, and hence play a critical role in bridging the gap between artificial and natural neural networks.
Can we build a system to fully utilize signals in a parallel ST corpus?
The key idea is to generate source transcript and target translation text with a single decoder.
The advisor-advisee relationship represents direct knowledge heritage, yet such relationships may not be readily available from academic libraries and search engines.
Empirically, we evaluate DINE over three networks on multi-label classification and link prediction tasks.
This model additionally has a simple and efficient stop criterion for the end of the transduction, enabling it to infer a variable number of output sequences.
Ranked #2 on Speech Separation on WSJ0-5mix
With speaker information predicted from the whole observation, our model helps solve the problems of conventional speech separation and speaker extraction for multi-round long recordings.
End-to-end models are gaining wider attention in the field of automatic speech recognition (ASR).
For points that belong to the same roof shape, a multi-cue, hierarchical RANSAC approach is proposed for efficiently and reliably segmenting and reconstructing the building point cloud.
Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments, where the performance of the audio modality alone deteriorates significantly.
Ranked #6 on Lipreading on Lip Reading in the Wild
Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image.
Compared with previous non-learning approaches to adversarial example attacks, a GAN-based approach can, once trained, quickly generate an adversarial sample for each new input using the GAN architecture; however, it needs to perturb the attack samples heavily, which makes it impractical in real applications.
The unsupervised pre-training is performed on the AISHELL-2 dataset, and we apply the pre-trained model to multiple paired-data ratios of AISHELL-1 and HKUST.
In this method, iterative updates greatly alleviate the nonstationarity of the environment, while the unified representation speeds up interaction with the environment and avoids linear growth of memory usage.
Our WMM2Seq adopts a working memory to interact with two separate long-term memories: the episodic memory for memorizing dialog history and the semantic memory for storing KB tuples.
Visual Dialog is a multi-modal task that requires a model to participate in a multi-turn human dialog grounded in an image, and generate correct, human-like responses.
By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions.
The residuals are computed by differencing the sparse-sampled reflectance function with a dictionary of pre-defined dense-sampled reflectance functions.
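Under the stated setup, the residual computation amounts to subtracting each pre-defined dictionary entry, evaluated at the sparse sample points, from the observed sparse-sampled reflectance; a minimal numpy sketch (function and variable names are illustrative, not the authors'):

```python
import numpy as np

def reflectance_residuals(observed, dictionary, sample_idx):
    """Residuals between a sparse-sampled reflectance vector and each
    dense-sampled dictionary entry, compared at the sparse sample points.
    observed: (S,), dictionary: (K, D) dense entries, sample_idx: (S,) ints."""
    sparse_dict = dictionary[:, sample_idx]      # (K, S) dictionary at samples
    return observed[None, :] - sparse_dict       # (K, S) residuals

dense = np.linspace(0.0, 1.0, 10)                # one toy dense reflectance
dictionary = np.stack([dense, dense ** 2])       # K=2 pre-defined functions
idx = np.array([0, 4, 9])                        # sparse sample positions
obs = dense[idx]                                 # observation matches entry 0
res = reflectance_residuals(obs, dictionary, idx)
```

A small residual norm for some dictionary entry then indicates which pre-defined reflectance function best explains the sparse observation.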
Experiments on two Mandarin ASR datasets show that replacing RNNs with self-attention networks yields an 8.4%-10.2% relative character error rate (CER) reduction.
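The abstract does not spell out the network details; as a generic sketch of the scaled dot-product self-attention layer such models substitute for recurrent layers (shapes and weight initialization here are illustrative assumptions):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a sequence.
    x: (T, d) frame features; wq/wk/wv: (d, d) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])          # (T, T) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (T, d) context vectors

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
```

Unlike an RNN step, every output frame here attends to all input frames at once, which is what allows the sequence to be processed in parallel.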
By combining the sym-STDP rule with bio-plausible synaptic scaling and intrinsic plasticity of the dynamic threshold, our SNN model implemented SL well and achieved good performance in the benchmark recognition task (MNIST dataset).
In this paper, we present a memory-augmented neural network which is motivated by the process of human concept learning.
In this work, we first use WordNet to understand and expand word embeddings to settle the polysemy of homographic puns, and then propose a WordNet-Encoded Collocation-Attention network model (WECA), which incorporates context weights for recognizing the puns.
While disfluency detection has achieved notable success in recent years, it still suffers severely from data scarcity.
However, there is little research on the construction of metaphor corpora annotated with emotion for the analysis of emotionality of metaphorical expressions.
In this paper, we propose a single-channel speech dereverberation system (DeReGAT) based on convolutional, bidirectional long short-term memory and deep feed-forward neural network (CBLDNN) with generative adversarial training (GAT).
End-to-end models have shown superiority in Automatic Speech Recognition (ASR).
Experiments on CALLHOME datasets demonstrate that the multilingual ASR Transformer with the language symbol at the end performs better, obtaining a relative 10.5% average word error rate (WER) reduction compared to SHL-MLSTM with residual learning.
Considering that scene images have large variations in text and background, we further design a modality-transform block to effectively transform 2D input images into 1D sequences, combined with the encoder to extract more discriminative features.
Experiments on the HKUST dataset demonstrate that lexicon-free modeling units can outperform lexicon-related modeling units in terms of character error rate (CER).
Furthermore, we compare a syllable-based model with a context-independent phoneme (CI-phoneme) based model using the Transformer in Mandarin Chinese.
Unsupervised neural machine translation (NMT) is a recently proposed approach for machine translation which aims to train the model without using any labeled data.
Ranked #5 on Machine Translation on WMT2016 German-English
The first is that they heavily rely on manually designed bigram features, i.e., they are not good at capturing n-gram features automatically.
Neural Machine Translation (NMT) imposes a heavy burden on computation and memory.
Joint extraction of entities and relations is an important task in information extraction.
Ranked #3 on Relation Extraction on NYT-single
During training, both the dynamic discriminator and the static BLEU objective are employed to evaluate the generated sentences and feed the evaluations back to guide the learning of the generator.
Short text clustering is a challenging problem due to the sparseness of text representations.
Ranked #1 on Short Text Clustering on Stackoverflow
This article proposes a novel character-aware neural machine translation (NMT) model that views the input sequences as sequences of characters rather than words.
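As a minimal illustration of viewing input as character sequences, word boundaries must be kept recoverable, here via an explicit marker symbol (the `▁` marker and function name are assumptions for the sketch, not the article's notation):

```python
def to_char_sequence(sentence, space_symbol="▁"):
    """View a sentence as a sequence of characters, replacing spaces
    with an explicit boundary marker so words can be recovered."""
    return [space_symbol if ch == " " else ch for ch in sentence]

tokens = to_char_sequence("a cat")
```

Character-level vocabularies are tiny compared with word vocabularies, which is one motivation for this view, at the cost of much longer input sequences.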
As the evaluation results show, our solution provides a valuable and concise model that could be used in modelling question answering or sentence semantic relevance.
To integrate features along both dimensions of the matrix, this paper explores applying a 2D max pooling operation to obtain a fixed-length representation of the text.
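One simplified way to realize such pooling is adaptive 2D max pooling, which splits the (words x dims) matrix into a fixed grid of cells and takes the maximum in each, so variable-length sentences map to a fixed-length representation; a minimal numpy sketch (the adaptive-grid variant is an assumption, not necessarily the paper's exact operator):

```python
import numpy as np

def adaptive_max_pool_2d(m, out_rows, out_cols):
    """Reduce a (words x dims) matrix to a fixed (out_rows x out_cols)
    grid by taking the max inside each cell."""
    rows = np.array_split(np.arange(m.shape[0]), out_rows)
    cols = np.array_split(np.arange(m.shape[1]), out_cols)
    return np.array([[m[np.ix_(r, c)].max() for c in cols] for r in rows])

m = np.arange(24, dtype=float).reshape(6, 4)   # 6 words, 4 embedding dims
pooled = adaptive_max_pool_2d(m, 2, 2)         # always a 2x2 output
```

Because both dimensions are pooled, the output size is independent of sentence length, unlike 1D max-over-time pooling, which only collapses the word dimension.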
Ranked #4 on Text Classification on TREC-6
Recently, end-to-end memory networks have shown promising results on the Question Answering task; they encode past facts into an explicit memory and perform reasoning by making multiple computational steps on the memory.
Ranked #17 on Relation Extraction on SemEval-2010 Task 8
However, topics of certain granularity are not adequate to represent the intrinsic semantic information.