Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i. e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions.
Are end-to-end text-to-speech (TTS) models over-parametrized?
In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
The key challenges include (1) the dilemma of whether to explode the memory consumption due to the huge joint space or achieve sub-optimal designs, (2) the discrete nature of the accelerator design space that is coupled yet different from that of the networks and bitwidths, and (3) the chicken and egg problem associated with network-accelerator co-search, i. e., co-search requires operation-wise hardware cost, which is lacking during search as the optimal accelerator depending on the whole network is still unknown during search.
We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results.
Current state-of-the-art object recognition models are largely based on convolutional neural network (CNN) architectures, which are loosely inspired by the primate visual system.
1 code implementation • 9 Jul 2020 • Chuang Gan, Jeremy Schwartz, Seth Alter, Damian Mrowca, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, Kuno Kim, Elias Wang, Michael Lingelbach, Aidan Curtis, Kevin Feigelis, Daniel M. Bear, Dan Gutfreund, David Cox, Antonio Torralba, James J. DiCarlo, Joshua B. Tenenbaum, Josh H. McDermott, Daniel L. K. Yamins
We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation.
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
Understanding the structure of complex activities in untrimmed videos is a challenging task in the area of action recognition.
Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets.
Ranked #49 on Action Recognition on Something-Something V2 (using extra training data)
At test time, the stereo-sound student network can work independently to perform object localization us-ing just stereo audio and camera meta-data, without any visual input.
In this paper, we propose a zeroth-order AdaMM (ZO-AdaMM) algorithm, that generalizes AdaMM to the gradient-free regime.
The validation and deployment of novel research ideas in the field of Deep Learning is often limited by the availability of efficient compute kernels for certain basic primitives.
Deep learning approaches to breast cancer detection in mammograms have recently shown promising results.
Interestingly, recent work has shown that deep convolutional neural networks (CNNs) trained on large-scale image recognition tasks can serve as strikingly good models for predicting the responses of neurons in visual cortex to visual stimuli, suggesting that analogies between artificial and biological neural networks may be more than superficial.
Many real-world data sets, especially in biology, are produced by complex nonlinear dynamical systems.
Screening mammography is an important front-line tool for the early detection of breast cancer, and some 39 million exams are conducted each year in the United States alone.
First, subjects robustly recognized objects even when rendered <15% visible, but recognition was largely impaired when processing was interrupted by backward masking.
The effectiveness of this approach points to a path forward for a new class of hybrid machine learning algorithms which take both inspiration and direct constraints from neuronal data.
We present a novel neural network algorithm, the Tensor Switching (TS) network, which generalizes the Rectified Linear Unit (ReLU) nonlinearity to tensor-valued hidden units.
Delta Epsilon Alpha Star is a minimal coverage, real-time robotic search algorithm that yields a moderately aggressive search path with minimal backtracking.
Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world.
Our models outperform the state of the art on MIT1003, on which features and classifiers are learned.