Furthermore, we explore the requirements for end-to-end inference of a full mobile-grade DNN (MobileNetV2) in terms of IMC array resources, by scaling up our heterogeneous architecture to a multi-array accelerator.
Motor imagery brain--machine interfaces enable us to control machines by merely thinking of performing a motor action.
In this work, we introduce a HW/SW platform for end-to-end continual learning (CL) based on a 10-core FP32-enabled parallel ultra-low-power (PULP) processor.
no code implementations • 18 Oct 2021 • Davide Rossi, Francesco Conti, Manuel Eggimann, Alfio Di Mauro, Giuseppe Tagliavini, Stefan Mach, Marco Guermandi, Antonio Pullini, Igor Loi, Jie Chen, Eric Flamand, Luca Benini
Vega achieves state-of-the-art efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration).
The increasing complexity of Internet-of-Things (IoT) applications and near-sensor processing algorithms is pushing up the computational demands on low-power, battery-operated end-node systems.
By leveraging interkernel data dependencies, these energy-bounded execution cycles minimize the number of system activations and nonvolatile data transfers, and thus the total energy overhead.
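To make the idea concrete, here is a minimal sketch (in Python, with hypothetical kernel names, energy costs, and budget) of greedily packing a dependency-ordered kernel pipeline into energy-bounded execution cycles; a real scheduler would also weigh the non-volatile transfer cost of each cut.

```python
def plan_cycles(kernels, energy_budget):
    """Greedily pack a dependency-ordered kernel pipeline into
    energy-bounded execution cycles: each cycle must fit the energy
    budget, and every cycle boundary implies one system activation
    plus a non-volatile checkpoint of the intermediate data."""
    cycles, current, used = [], [], 0.0
    for name, energy in kernels:  # kernels in topological order
        if current and used + energy > energy_budget:
            cycles.append(current)  # close the cycle: checkpoint to NVM
            current, used = [], 0.0
        current.append(name)
        used += energy
    if current:
        cycles.append(current)
    return cycles

# Toy pipeline: fewer cycles mean fewer activations and NVM transfers.
kernels = [("conv1", 3.0), ("pool1", 1.0), ("conv2", 4.0), ("fc", 2.0)]
print(plan_cycles(kernels, energy_budget=5.0))  # [['conv1', 'pool1'], ['conv2'], ['fc']]
```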
The record-breaking achievements of deep neural networks (DNNs) in image classification and detection tasks have resulted in a surge of new computer vision applications in recent years.
This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance.
We present the implementation of seizure detection algorithms based on a minimal number of EEG channels on a parallel ultra-low-power embedded platform.
State-of-the-art approaches are based on Machine Learning methods and exploit the fusion of time- and frequency-domain features from current and voltage sensors.
This document presents implementations of fundamental convolutional neural network (CNN) layers on the Manticore cluster-based many-core architecture and discusses their characteristics and trade-offs.
With 9.91 GMAC/s/W, it is 23.0 times more energy-efficient and 46.85 times faster than an implementation on the ARM Cortex-M4F (0.43 GMAC/s/W).
Artificial intelligence (AI) technologies have dramatically advanced in recent years, resulting in revolutionary changes in people's lives.
With Motor-Imagery (MI) Brain--Machine Interfaces (BMIs), we can control machines by merely thinking of performing a motor action.
Hyperdimensional computing (HDC) is a brain-inspired computing paradigm based on high-dimensional holistic representations of vectors.
This BNN reaches a 77.9% accuracy, just 7% lower than the full-precision version, with 58 kB (7.2 times less) for the weights and 262 kB (2.4 times less) memory in total.
We present a 3.1 POp/s/W fully digital hardware accelerator for ternary neural networks.
Our first method, based on sparse bipolar random projection, projects a large number of real-valued Riemannian covariance features to a binary space, where a linear SVM classifier can be learned with binary weights too.
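A minimal sketch of this first step, assuming a sparse {-1, 0, +1} projection matrix and sign binarization; the feature count, dimensionality, and density below are illustrative rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_bipolar_projection(n_features, n_dims, density=0.1):
    """Sparse bipolar random projection matrix with entries in {-1, 0, +1}."""
    return rng.choice([-1, 0, 1], size=(n_dims, n_features),
                      p=[density / 2, 1 - density, density / 2])

def binarize(x):
    """Map real values into the binary space via the sign function."""
    return np.where(x >= 0, 1, -1)

# Toy example: project 100 real-valued covariance features to 10,000 bits.
features = rng.standard_normal(100)
P = sparse_bipolar_projection(n_features=100, n_dims=10_000)
hd_vector = binarize(P @ features)  # input to a linear SVM with binary weights
```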
Traditional neural networks require enormous amounts of data to build their complex mappings during a slow training procedure that hinders their ability to relearn and adapt to new data.
The severe on-chip memory limitations are currently preventing the deployment of the most accurate Deep Neural Network (DNN) models on tiny MicroController Units (MCUs), even when leveraging an effective 8-bit quantization scheme.
While the accuracy of convolutional neural networks has improved vastly through larger and deeper network architectures, their memory footprint for storing parameters and activations has increased as well.
On a prototype in 22 nm FDX technology, we demonstrate that both the logic and SRAM voltages can be dropped to 0.5 V without any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2x w.r.t. nominal voltage operation.
Clock generators are an essential and critical building block of any communication link, whether it be wired or wireless, and they are increasingly critical given the push for lower I/O power and higher bandwidth in Systems-on-Chip (SoCs) for the Internet-of-Things (IoT).
Furthermore, the gesture recognition classifier has been implemented on a Parallel Ultra-Low Power Processor, demonstrating that real-time prediction is feasible at only 21 mW for the full TCN sequence-prediction network, with a system-level power consumption below 100 mW.
The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimized solution that speeds up inference and reduces memory usage on embedded CPU platforms.
Experimental results on the BCI Competition IV-2a dataset show that EEG-TCNet achieves 77.35% classification accuracy in 4-class MI.
Furthermore, it can perform inference on a binarized ResNet-18 trained with 8-bases Group-Net to achieve a 67.5% Top-1 accuracy with only 3.0 mJ/frame -- at an accuracy drop of merely 1.8% from the full-precision ResNet-18.
Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation.
We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr. Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster.
The method -- called pAElla -- targets real-time Malware Detection (MD); it runs on an out-of-band IoT-based monitoring system for data centers/supercomputers (DCs/SCs) and combines the Power Spectral Density of power measurements with AutoEncoders.
These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs exist, no IR today can be used through the entire circuit design flow.
Our novel method further scales down the standard EEGNet at a negligible accuracy loss of 0.31% with 7.6x memory footprint reduction and a small accuracy loss of 2.51% with 15x reduction.
Radio Resource Management (RRM) in 5G mobile communication is a challenging problem for which Recurrent Neural Networks (RNN) have shown promising results.
The ML model learns the relation between variable precision and output error; this information is then embedded in the mathematical program (MP), which is focused on minimizing the number of bits.
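As a rough illustration of this precision-vs-error trade-off, the sketch below replaces the mathematical program with a greedy heuristic: given a learned error model (a hypothetical closed form here), it lowers each variable's bitwidth as far as the predicted output error allows.

```python
import numpy as np

def min_bits(error_model, n_vars, error_budget, bit_choices=(32, 16, 8, 4)):
    """Greedily lower each variable's precision while the predicted
    output error (from a learned error model) stays within budget."""
    bits = np.full(n_vars, max(bit_choices))
    for i in range(n_vars):
        for b in sorted(bit_choices):  # try the smallest width first
            trial = bits.copy()
            trial[i] = b
            if error_model(trial) <= error_budget:
                bits[i] = b
                break
    return bits

# Hypothetical error model: error grows as variables lose mantissa bits.
error_model = lambda bits: float(np.sum(2.0 ** (-bits)))
print(min_bits(error_model, n_vars=5, error_budget=0.02))  # e.g. [8 8 8 8 8]
```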
We present Random Partition Relaxation (RPR), a method for strong quantization of neural network weights to binary (+1/-1) and ternary (+1/0/-1) values.
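A minimal sketch of one RPR iteration as the name suggests: quantize all weights, randomly relax a partition of them back to continuous values, and retrain only those. Here `train_fn` is a placeholder for the actual SGD step, and the threshold and schedule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_ternary(w, thresh=0.05):
    """Project weights onto {-1, 0, +1}."""
    return np.sign(w) * (np.abs(w) > thresh)

def rpr_step(w_cont, relax_fraction, train_fn):
    """One RPR iteration (sketch): quantize everything, relax a random
    partition of weights back to continuous values, and retrain only those."""
    w = quantize_ternary(w_cont)
    relaxed = rng.random(w.shape) < relax_fraction  # random partition
    w[relaxed] = w_cont[relaxed]                    # relax to continuous
    return train_fn(w, relaxed)                     # SGD on relaxed weights only

# Placeholder train_fn: a no-op standing in for retraining the relaxed subset.
train_fn = lambda w, mask: w
w = rng.standard_normal((4, 4))
for frac in (0.5, 0.25, 0.1):  # anneal the relaxed fraction toward zero
    w = rpr_step(w, frac, train_fn)
print(quantize_ternary(w))
```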
Synthetic aperture radar (SAR) data is becoming increasingly available to a wide range of users through commercial service providers, with resolutions reaching 0.5 m/px.
The narrow-space search of floating-point models improves the accuracy on CIFAR-10 of an established IoT model from 70.64% to 74.87% within the same memory constraints.
The growing number of low-power smart devices in the Internet of Things is coupled with the concept of "Edge Computing", that is, moving some of the intelligence, especially machine learning, towards the edge of the network.
We further improve the accuracy to 82.07% by including 16-bit half-precision types, and we obtain the best accuracy of 83.45% by extending the search with model-optimized IEEE 754 reduced types.
In the wake of the success of convolutional neural networks in image classification, object recognition, speech recognition, etc., the demand for deploying these compute-intensive ML models on embedded and mobile systems with tight power and energy constraints at low cost, as well as for boosting throughput in data centers, is growing rapidly.
We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors.
We show that this communication fabric facilitates the pipelined execution of all state-of-the-art CNNs by proving the existence of a homomorphism between one graph representation of these networks and the proposed graph topology.
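For reference, the homomorphism in question is the standard edge-preserving vertex map: with G the dataflow graph of a CNN and H the proposed communication topology, the claim is that there exists

```latex
\varphi : V(G) \to V(H)
\quad\text{such that}\quad
(u, v) \in E(G) \;\implies\; \bigl(\varphi(u), \varphi(v)\bigr) \in E(H)
```

so every producer-consumer transfer between layers maps onto a physical link of the fabric.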
Hyperdimensional computing (HDC) is an emerging computational framework that takes inspiration from attributes of neuronal circuits such as hyperdimensionality, fully distributed holographic representation, and (pseudo)randomness.
In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology.
To fit the memory and computational limitations of resource-constrained edge-devices, we exploit mixed low-bitwidth compression, featuring 8, 4 or 2-bit uniform quantization, and we model the inference graph with integer-only operations.
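A minimal sketch of the uniform quantizer behind such 8/4/2-bit compression (ranges and inputs illustrative); for integer-only inference, the floating-point scales are subsequently folded into fixed-point multiplications.

```python
import numpy as np

def quantize_uniform(x, n_bits, x_min, x_max):
    """Uniform affine quantization to n_bits: returns integer codes and scale."""
    n_levels = 2 ** n_bits
    scale = (x_max - x_min) / (n_levels - 1)
    q = np.clip(np.round((x - x_min) / scale), 0, n_levels - 1).astype(np.int32)
    return q, scale

def dequantize(q, scale, x_min):
    return q * scale + x_min

x = np.random.default_rng(0).uniform(-1, 1, 8)
for bits in (8, 4, 2):  # the mixed-precision choices mentioned above
    q, s = quantize_uniform(x, bits, -1.0, 1.0)
    err = np.abs(x - dequantize(q, s, -1.0)).max()
    print(f"{bits}-bit max error: {err:.4f}")
```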
We present a theoretical and experimental investigation of the quantization problem for artificial neural networks.
Nano-size unmanned aerial vehicles (UAVs), with a few centimeters of diameter and a sub-10 W total power budget, have so far been considered incapable of running sophisticated visual-based autonomous navigation software without external aid from base stations, ad-hoc local positioning infrastructure, and powerful external computation servers.
Embedded inference engines for convolutional networks must be parsimonious in memory bandwidth and buffer sizing to meet power and cost constraints.
no code implementations • 15 Jan 2019 • Miguel de Prado, Jing Su, Rabia Saeed, Lorenzo Keller, Noelia Vallez, Andrew Anderson, David Gregg, Luca Benini, Tim Llewellynn, Nabil Ouerhani, Rozenn Dahyot, Nuria Pazos
In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms, and deployment tools together.
Varying muscle contraction levels are a major challenge in electromyography-based gesture recognition.
All these methods, differing in complexity, aim to represent EEG signals in binary HD space, e.g., with 10,000 bits.
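A minimal sketch of such a binary HD representation, assuming the classic XOR-binding and majority-bundling operators with nearest-prototype classification by Hamming distance; the symbols and data below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality, as in the abstract

# Item memory: one random binary hypervector per symbol (e.g., per EEG channel).
item_memory = {ch: rng.integers(0, 2, D, dtype=np.uint8) for ch in range(4)}

def bind(a, b):
    """Binding via XOR: associates two hypervectors."""
    return a ^ b

def bundle(vectors):
    """Bundling via majority vote: superimposes a set of hypervectors."""
    return (np.sum(vectors, axis=0) > len(vectors) / 2).astype(np.uint8)

def hamming(a, b):
    """Normalized Hamming distance for nearest-prototype classification."""
    return np.count_nonzero(a != b) / D

# Encode a toy record of per-channel values and compare to a class prototype.
values = {ch: rng.integers(0, 2, D, dtype=np.uint8) for ch in range(4)}
query = bundle([bind(item_memory[ch], values[ch]) for ch in range(4)])
prototype = rng.integers(0, 2, D, dtype=np.uint8)
print(hamming(query, prototype))
```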
In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices.
Anomaly detection in supercomputers is a very difficult problem due to the large scale of these systems and their high number of components.
However, we also show that: 1) not all real workloads allow for the identification of a good model; and 2) starting from the theory of system identification, it is very difficult to evaluate whether a trace of data leads to a good estimated model.
After the tremendous success of convolutional neural networks in image classification, object detection, speech recognition, etc., there is now a rising demand for deploying these compute-intensive ML models on tightly power-constrained embedded and mobile systems at low cost, as well as for increasing throughput in data centers.
This paper presents an efficient binarized algorithm for both learning and classification of human epileptic seizures from intracranial electroencephalography (iEEG).
The last few years have brought advances in computer vision at an amazing pace, grounded on new findings in deep neural network construction and training as well as the availability of large labeled datasets.
In this paper, we propose hardware techniques for optimizations of HD computing, in a synthesizable VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx(R) UltraScale(TM) FPGAs: (1) We propose simple logical operations to rematerialize the hypervectors on the fly rather than loading them from memory.
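The rematerialization idea, sketched in software with a seeded PRNG standing in for the paper's on-chip logical operations: hypervectors are regenerated deterministically from a symbol id rather than stored in item memory.

```python
import numpy as np

D = 10_000

def rematerialize(symbol_id, D=D):
    """Regenerate a symbol's binary hypervector on the fly from a seed
    instead of loading it from stored item memory (trading logic for SRAM)."""
    return np.random.default_rng(symbol_id).integers(0, 2, D, dtype=np.uint8)

# The same id always yields the same hypervector -- no stored item memory needed.
assert np.array_equal(rematerialize(42), rematerialize(42))
```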
Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to conventional deep neural networks at a fraction of the cost in terms of memory and energy.
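The cost advantage comes from replacing multiply-accumulates with XNOR and popcount on bit-packed {-1, +1} vectors; a minimal sketch (the packing scheme is illustrative):

```python
import numpy as np

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors packed as bits:
    XNOR the words, popcount, then rescale -- no multiplies needed."""
    xnor = ~(a_bits ^ b_bits)                   # 1 where signs agree
    matches = bin(xnor & ((1 << n) - 1)).count("1")
    return 2 * matches - n                      # +1 per match, -1 per mismatch

# Check against the floating-point dot product for a toy 8-element vector.
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], 8)
b = rng.choice([-1, 1], 8)
pack = lambda v: int("".join("1" if x > 0 else "0" for x in v), 2)
assert binary_dot(pack(a), pack(b), 8) == int(a @ b)
print(binary_dot(pack(a), pack(b), 8))
```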
Power consumption is a looming threat to today's computing progress.
Accurate, fast, and reliable multiclass classification of electroencephalography (EEG) signals is a challenging task towards the development of motor imagery brain-computer interface (MI-BCI) systems.
As part of our general methodology, we discuss the software mapping techniques that enable the state-of-the-art deep convolutional neural network presented in prior work to be fully executed on-board within a strict 6 fps real-time constraint with no compromise in terms of flight results, while all processing is done with only 64 mW on average.
In the deep-learning community, new algorithms are published at an incredible pace.
Deploying state-of-the-art CNNs requires power-hungry processors and off-chip memory.
1 code implementation • 28 Feb 2018 • Ali Moin, Andy Zhou, Abbas Rahimi, Simone Benatti, Alisha Menon, Senam Tamakloe, Jonathan Ting, Natasha Yamamoto, Yasser Khan, Fred Burghardt, Luca Benini, Ana C. Arias, Jan M. Rabaey
We present an end-to-end system combating this variability using a large-area, high-density sensor array and a robust classification algorithm.
Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far.
Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host processor with programmable manycore accelerators (PMCAs) to combine general-purpose computing with domain-specific, efficient processing capabilities.
Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition.
Design automation in general, and in particular logic synthesis, can play a key role in enabling the design of application-specific Binarized Neural Networks (BNN).
Recurrent neural networks (RNNs) are state-of-the-art in voice awareness/understanding and speech recognition.
Extracting per-frame features using convolutional neural networks for real-time processing of video data is currently mainly performed on powerful GPU-accelerated workstations and compute clusters.
We present a new approach to learn compressible representations in deep architectures with an end-to-end training strategy.
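One common way to make the quantization step differentiable for such end-to-end training is a soft-to-hard relaxation: softmax-weighted assignments to quantization centers, annealed toward hard assignments. A minimal sketch with illustrative centers and schedule follows.

```python
import numpy as np

def soft_quantize(z, centers, sigma):
    """Differentiable 'soft' assignment of each value to quantization centers;
    annealing sigma toward infinity recovers hard nearest-center quantization."""
    d = (z[:, None] - centers[None, :]) ** 2   # squared distances to centers
    w = np.exp(-sigma * d)
    w /= w.sum(axis=1, keepdims=True)          # softmax over centers
    return w @ centers                         # soft-quantized values

z = np.array([-0.9, -0.2, 0.1, 0.7])
centers = np.array([-1.0, 0.0, 1.0])
for sigma in (1.0, 10.0, 100.0):               # anneal toward hard assignment
    print(sigma, soft_quantize(z, centers, sigma))
```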
Our codesign approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster.
4 code implementations • 18 Dec 2016 • Francesco Conti, Robert Schilling, Pasquale Davide Schiavone, Antonio Pullini, Davide Rossi, Frank Kagan Gürkaynak, Michael Muehlberghuber, Michael Gautschi, Igor Loi, Germain Haugou, Stefan Mangard, Luca Benini
Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load -- but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline.
Lossy image compression algorithms are pervasively used to reduce the size of images transmitted over the web and recorded on data storage media.
The required communication links and archiving of the video data are still expensive, and this setup precludes preemptive actions in response to imminent threats.
We propose a highly structured neural network architecture for semantic segmentation with an extremely small model size, suitable for low-power embedded and mobile platforms.
Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy.
An ever-increasing number of computer vision and image/video processing challenges are being approached using deep convolutional neural networks, obtaining state-of-the-art results in object recognition and detection, semantic segmentation, action recognition, optical flow, and super-resolution.