Reinforcement Learning Based Asymmetrical DNN Modularization for Optimal Loading

1 Jan 2021  ·  Brijraj Singh, Yash Jain, Mayukh Das, Praveen Doreswamy Naidu ·

The latency of DNN (Deep Neural Network)-based prediction is the sum of model loading latency and inference latency. Model loading latency affects the first response from an application, whereas inference latency affects subsequent responses. Since model loading latency is directly proportional to model size, this work improves the response time of an intelligent app by reducing loading latency. The speedup is gained by asymmetrically modularizing the given DNN model into several small child models and loading them in parallel. The decision about the number of feasible child models and their corresponding split positions is made by a reinforcement learning unit (RLU). The RLU takes into account the hardware resources available on-device and provides the best number of splits $k$ and their positions $\vec{p}$, specific to the DNN model and device, where $\vec{p}=(p_1, p_2, ..., p_k)$ and $p_i$ is the end position of the $i^{th}$ child model $M_i$. The proposed method shows significant loading improvement (up to 7X) on popular DNNs used in camera use-cases and can thus speed up app response. In addition, the RLU-driven approach facilitates on-device personalization by isolating the trainable layers in a single module and loading only that module during on-device training.
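A minimal sketch of the modularization-and-parallel-load idea described above, assuming the network can be treated as a flat list of layers. The helper names (`split_layers`, `load_child`, `parallel_load`) and the simulated loader are illustrative assumptions, not the paper's actual implementation; the RLU that chooses $k$ and $\vec{p}$ is taken as given.

```python
from concurrent.futures import ThreadPoolExecutor


def split_layers(layers, positions):
    """Split a flat list of layers into child models M_1..M_k.

    positions: the vector p = (p_1, ..., p_k), where p_i is the
    end index (inclusive) of the i-th child model. Any layers
    after p_k form a final trailing child.
    """
    children, start = [], 0
    for end in positions:
        children.append(layers[start:end + 1])
        start = end + 1
    if start < len(layers):
        children.append(layers[start:])
    return children


def load_child(child):
    # Stand-in for deserializing one child model from storage;
    # a real implementation would invoke the framework's model
    # loader here (this is where the parallel speedup comes from).
    return [f"loaded:{layer}" for layer in child]


def parallel_load(layers, positions):
    """Load all child models concurrently, one thread per child."""
    children = split_layers(layers, positions)
    with ThreadPoolExecutor(max_workers=len(children)) as pool:
        return list(pool.map(load_child, children))


# Example: a 7-layer network split at p = (2, 5) -> three children
# of sizes 3, 3, and 1, loaded in parallel.
layers = [f"layer{i}" for i in range(7)]
loaded = parallel_load(layers, (2, 5))
```

Because loading is I/O-bound, threads suffice here; the asymmetry in child sizes (the "asymmetrical" modularization) is exactly what the RLU tunes so that no single child dominates the parallel load time.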

