We confirm that our method with a Transformer decoder outperforms all relevant methods on HumanAct12, NTU-RGBD, and UESTC datasets in terms of realism and diversity of generated motions.
Current works train deep learning models on low-level data representations to solve the emotion recognition task.
To overcome this problem, we propose a method called minimum stable rank DARTS (MSR-DARTS), for finding a model with the best generalization error by replacing architecture optimization with the selection process using the minimum stable rank criterion.
Ranked #24 on Neural Architecture Search on CIFAR-10
We propose a non-invasive and cost-effective method to automatically detect dementia by utilizing solely speech audio data.
(4) Fast, it is observed that the number of training epochs required by MaskConvNet is close to training a baseline without pruning.
no code implementations • 16 Apr 2019 • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, Trung Ngo Trong, Md Sahidullah, Fan Lu, Yun Tang, Ming Tu, Kah Kuan Teh, Huy Dat Tran, Kuruvachan K. George, Ivan Kukanov, Florent Desnous, Jichen Yang, Emre Yilmaz, Longting Xu, Jean-Francois Bonastre, Cheng-Lin Xu, Zhi Hao Lim, Eng Siong Chng, Shivesh Ranjan, John H. L. Hansen, Massimiliano Todisco, Nicholas Evans
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE).
We investigate the feasibility of sequence-level knowledge distillation of Sequence-to-Sequence (Seq2Seq) models for Large Vocabulary Continuous Speech Recognition (LVSCR).
With the widespread use of intelligent systems, such as smart speakers, addressee recognition has become a concern in human-computer interaction, as more and more people expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and hospitals.
Few-shot adaptation provides robust parameter estimation with few training examples, by optimizing the parameters of zero-shot learning and supervised many-shot learning simultaneously.
This paper presents a new framework for human action recognition from a 3D skeleton sequence.
Ranked #94 on Skeleton Based Action Recognition on NTU RGB+D
I-vector based text-independent speaker verification (SV) systems often have poor performance with short utterances, as the biased phonetic distribution in a short utterance makes the extracted i-vector unreliable.