Classifying EEG responses to naturalistic acoustic stimuli is of theoretical and practical importance, but standard approaches are limited in that they process individual channels separately and operate only on very short sound segments (a few seconds or less).
On the other hand, proxy-based loss functions often yield significantly faster convergence during training, but they do not fully exploit the rich relations among data points.
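As a minimal sketch of what a proxy-based loss looks like (assuming a PyTorch setting; the Proxy-NCA formulation is a standard example, and the class count and embedding size are illustrative), each embedding is compared only against a small set of learnable class proxies rather than against every other sample:

```python
import torch
import torch.nn.functional as F

class ProxyNCALoss(torch.nn.Module):
    """Proxy-NCA sketch: each sample is attracted to its class proxy and
    repelled from all other proxies (no pairwise sample comparisons)."""
    def __init__(self, num_classes, embed_dim):
        super().__init__()
        # One learnable proxy vector per class.
        self.proxies = torch.nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, embeddings, labels):
        # Squared distances between L2-normalized embeddings and proxies.
        e = F.normalize(embeddings, dim=1)
        p = F.normalize(self.proxies, dim=1)
        dists = torch.cdist(e, p) ** 2          # (batch, num_classes)
        # Softmax over negative distances; NLL of the true-class proxy.
        return F.cross_entropy(-dists, labels)
```

Because each of the B samples in a batch interacts only with C proxies instead of with every other sample, the per-batch cost drops from O(B^2) pairwise terms to O(B * C), which is the source of the faster convergence noted above; the price is exactly the loss of sample-to-sample relations.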
Machine learning approaches to auditory object recognition are traditionally based on engineered features such as those derived from the spectrum or cepstrum.
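For concreteness, a hypothetical sketch of such engineered features (using librosa, which the text does not name, with an illustrative mean-pooling summary) might look like this:

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=13):
    """Compute classic engineered audio features: a log mel-spectrogram
    (spectral) and MFCCs (cepstral), mean-pooled over time."""
    y, sr = librosa.load(path, sr=None)              # waveform, sample rate
    # Spectral feature: log mel-spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    log_mel = librosa.power_to_db(mel)
    # Cepstral feature: MFCCs (DCT of the log mel-spectrogram).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Mean over time gives a fixed-length vector for a standard classifier.
    return np.concatenate([log_mel.mean(axis=1), mfcc.mean(axis=1)])
```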
The coarse functional distinction between these streams is that one performs object recognition -- the "what" of the signal -- while the other extracts location-related information -- the "where" of the signal.
Inspired by this structure, we propose an object detection framework that integrates a "What Network" and a "Where Network".
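A minimal sketch of how such a framework could be organized (layer sizes, the shared encoder, and the gradient-based "where" signal are assumptions for illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class WhatWhereNet(nn.Module):
    """Sketch: a shared encoder feeds a 'What' head that outputs class
    scores; the 'Where' signal is then derived from input sensitivity."""
    def __init__(self, num_classes, in_ch=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
        )
        self.what_head = nn.Linear(16 * 8 * 8, num_classes)

    def forward(self, x):
        return self.what_head(self.encoder(x))

def sensitivity_map(model, x, class_idx):
    """'Where' signal: gradient of the chosen class score w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    model(x)[:, class_idx].sum().backward()
    return x.grad.abs().sum(dim=1)   # (batch, H, W)
```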
We demonstrate that a simple linear mapping from sensitivity maps to bounding box coordinates can be learned, localizing the recognized object.
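A minimal sketch of such a mapping, fit by ordinary least squares (the array shapes, bias term, and solver are assumptions; the paper's exact setup is not shown here):

```python
import numpy as np

def fit_bbox_mapping(sens_maps, boxes):
    """Least-squares linear map from flattened sensitivity maps to
    bounding boxes.

    sens_maps: (n_samples, H, W) sensitivity maps
    boxes:     (n_samples, 4) targets, e.g. (x_min, y_min, x_max, y_max)
    """
    X = sens_maps.reshape(len(sens_maps), -1)       # flatten each map
    X = np.hstack([X, np.ones((len(X), 1))])        # append bias column
    W, *_ = np.linalg.lstsq(X, boxes, rcond=None)   # (H*W + 1, 4)
    return W

def predict_bbox(W, sens_map):
    """Apply the learned linear map to one sensitivity map."""
    x = np.append(sens_map.ravel(), 1.0)
    return x @ W                                    # predicted (4,) box
```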
Ideally, attention maps predicted by captioning models should be consistent with the intrinsic attention of the underlying visual models for any given visual concept.
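One way to encourage such consistency (a sketch under the assumption that both maps are non-negative spatial maps normalized into distributions; a KL objective is a common choice here, not necessarily the authors' method):

```python
import torch

def attention_consistency_loss(caption_attn, visual_attn, eps=1e-8):
    """KL divergence between a captioning model's attention map and a
    visual model's intrinsic attention (e.g., a Grad-CAM style map) for
    the same concept. Both inputs: (batch, H, W), non-negative."""
    # Normalize each map into a spatial probability distribution.
    p = caption_attn.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    q = visual_attn.flatten(1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    # KL(p || q), averaged over the batch.
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=1).mean()
```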