Model compression techniques for deep neural networks (DNNs) have been widely acknowledged as an effective way to achieve acceleration on a variety of platforms, and DNN weight pruning is a straightforward and effective such method.
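As a concrete illustration of weight pruning (a generic sketch, not the specific scheme of any method discussed here), the following PyTorch snippet performs one-shot magnitude-based pruning; the layer shape and sparsity level are hypothetical.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

# Example: prune 90% of a hypothetical fully-connected layer.
w = torch.randn(256, 512)
w_pruned = magnitude_prune(w, 0.9)
print(f"achieved sparsity: {(w_pruned == 0).float().mean():.2f}")
```

In practice, pruning is usually followed by fine-tuning or applied iteratively to recover accuracy.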
To mitigate these challenges, the memristor crossbar array has emerged as an intrinsically suitable framework for matrix computation and low-power acceleration in DNN applications.
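To make the suitability claim concrete: in an ideal crossbar, applying input voltages along the rows yields, via Ohm's and Kirchhoff's laws, output currents equal to a matrix-vector product in a single analog step. The following is a simplified digital simulation under idealized assumptions (non-negative conductances, no device noise or quantization); the dimensions are hypothetical.

```python
import numpy as np

def crossbar_mvm(G: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Idealized crossbar: row voltages v and cell conductances G give
    column currents i = G^T v in one analog step."""
    return G.T @ v

# Hypothetical 4x3 crossbar; real designs use differential conductance
# pairs to represent signed weights.
G = np.random.uniform(0.0, 1.0, size=(4, 3))
v = np.random.randn(4)
print(crossbar_mvm(G, v))
```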
Despite their ubiquity in NLP tasks, Long Short-Term Memory (LSTM) networks suffer from computational inefficiencies caused by their inherently unparallelizable recurrences, a problem that is further aggravated as LSTMs require more parameters for larger memory capacity.
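The recurrence can be seen directly in an unrolled LSTM: step t cannot begin until step t-1 has produced its hidden and cell states. A minimal PyTorch sketch follows (all sizes hypothetical).

```python
import torch
import torch.nn as nn

def lstm_unroll(cell: nn.LSTMCell, x: torch.Tensor,
                h: torch.Tensor, c: torch.Tensor):
    """Sequentially unrolled LSTM; the loop-carried dependence on (h, c)
    is the unparallelizable recurrence over the time dimension."""
    outputs = []
    for t in range(x.size(0)):        # x: (seq_len, batch, input_size)
        h, c = cell(x[t], (h, c))     # depends on the previous step
        outputs.append(h)
    return torch.stack(outputs), (h, c)

cell = nn.LSTMCell(input_size=32, hidden_size=64)
x = torch.randn(10, 8, 32)            # 10 time steps, batch of 8
h0 = torch.zeros(8, 64)
c0 = torch.zeros(8, 64)
out, _ = lstm_unroll(cell, x, h0, c0)
```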
Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks.
A surge in artificial intelligence and autonomous technologies has increased the demand for enhanced edge-processing capabilities.
Therefore, we propose Network of Neural Networks (NoNN), a new distributed IoT learning paradigm that compresses a large pre-trained 'teacher' deep network into several disjoint and highly compressed 'student' modules without loss of accuracy.
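The partitioning idea can be sketched as follows. NoNN partitions the teacher's final-layer knowledge (in the original work, based on filter activation patterns); the simplified stand-in below merely splits the teacher's feature channels into disjoint slices, and the student architecture shown is hypothetical.

```python
import torch
import torch.nn as nn

def partition_teacher_features(teacher_feat: torch.Tensor, num_students: int):
    # Disjoint split of the teacher's features: each slice becomes the
    # target for one student module, so students can run independently
    # on separate edge devices and are fused only at the classifier.
    return torch.chunk(teacher_feat, num_students, dim=-1)

class StudentModule(nn.Module):
    """Hypothetical highly compressed student mimicking one partition."""
    def __init__(self, in_dim: int, part_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, part_dim),
        )

    def forward(self, x):
        return self.net(x)
```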
In the distillation process, we propose a fidelity loss that enables the student network to maintain the representational capability of the teacher network.
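The exact form of the fidelity loss is not spelled out here, so the following is only a plausible sketch under assumptions: a representation-matching MSE term standing in for the fidelity loss, combined with standard temperature-scaled distillation; T and alpha are hypothetical hyperparameters.

```python
import torch.nn.functional as F

def fidelity_loss(student_feat, teacher_feat):
    # Assumed instantiation: match intermediate representations with an
    # MSE term, keeping the teacher's features fixed.
    return F.mse_loss(student_feat, teacher_feat.detach())

def total_distillation_loss(student_logits, teacher_logits,
                            student_feat, teacher_feat,
                            T: float = 4.0, alpha: float = 0.5):
    # Temperature-scaled KL term (standard knowledge distillation)
    # plus the representation-fidelity term.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return kd + alpha * fidelity_loss(student_feat, teacher_feat)
```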