Since the generated samples simulate the low-density areas of each mode of the data distribution, the discriminator can directly distinguish anomalies from normal data.
Quantization of neural network parameters and activations has emerged as a successful approach to reducing the model size and inference time on hardware that supports native low-precision arithmetic.
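As a minimal sketch of the idea, the following shows symmetric per-tensor int8 post-training quantization in NumPy; this is a generic illustration, not the specific scheme referred to above, and the tensor shape and scale convention are assumptions.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map floats to int8 in [-127, 127]."""
    scale = np.abs(x).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for a layer's parameters.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()            # bounded by half a quantization step
```

The int8 codes take 4x less memory than float32 weights, and the reconstruction error is at most half the scale, which is the trade-off low-precision hardware exploits.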
This work aims to study several key questions in multimodal learning through the lens of adversarial noise: 1) how does the choice of early/middle/late fusion affect robustness and accuracy, and 2) how do different frequency- and time-domain features contribute to robustness?
In this work, we target our attack on the wake-word detection system, jamming the model with inconspicuous background music that deactivates the VAs while our audio adversary is present.
On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and i-vector.
Our CNNs, with up to 34 weight layers, are efficient to optimize over the very long sequences (e.g., vectors of size 32000) necessary for processing acoustic waveforms.
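To illustrate why such architectures are feasible over raw waveforms, the sketch below applies a single strided 1-D convolution to a length-32000 signal; the 80-tap filter and stride of 4 are hypothetical choices for illustration, not the paper's actual layer configuration.

```python
import numpy as np

def conv1d(signal, kernel, stride):
    """Valid-mode strided 1-D convolution over a single channel."""
    k = len(kernel)
    n_out = (len(signal) - k) // stride + 1
    out = np.empty(n_out)
    for i in range(n_out):
        window = signal[i * stride : i * stride + k]
        out[i] = np.dot(window, kernel)     # one filter response per position
    return out

x = np.random.randn(32000)      # raw acoustic waveform, as in the text
kernel = np.random.randn(80)    # hypothetical learned filter
y = conv1d(x, kernel, stride=4) # stride > 1 shortens the sequence 4x
```

Stacking such strided layers shrinks the sequence geometrically, which is what makes optimizing deep stacks over 32000-sample inputs tractable.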