Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence.
Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames.
While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world.
Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance.
Urban areas consume over two-thirds of the world's energy and account for more than 70 percent of global CO2 emissions.
Automatic video captioning aims to train models to generate text descriptions for all segments in a video; however, the most effective approaches require large amounts of manual annotation, which is slow and expensive.
Research in developmental psychology consistently shows that children explore the world thoroughly and efficiently and that this exploration allows them to learn.
Finally, we show that diagnostic visualization using LDAM leads to a novel insight into the parameter-averaging method for deep net training.
Modern datasets and models are notoriously difficult to explore and analyze due to their inherent high dimensionality and massive numbers of samples.