In this work, we introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
Our methodology involves storing the whole dataset directly in INR format on a GPU, mitigating the significant data communication overhead between the CPU and GPU during training.
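To make these two ideas concrete, here is a minimal PyTorch sketch (our own illustration, not the papers' actual implementations): a small coordinate MLP stands in for the INR stored on the GPU, and repeated `torch.autograd.grad` calls build the kind of nth-order gradient graph that would be mapped to a dataflow architecture. All names, sizes, and activations are illustrative.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A small coordinate MLP standing in for an INR: the "dataset" is its
# weights, which live entirely on the GPU, so no raw samples cross the
# CPU-GPU boundary during training.
inr = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
).to(device)

# Query a batch of coordinates generated directly on the GPU.
coords = torch.empty(1024, 2, device=device).uniform_(-1, 1).requires_grad_(True)
values = inr(coords)

# Build an nth-order gradient graph: each autograd.grad call with
# create_graph=True stacks another level of derivative nodes -- the kind
# of computation graph that gets compiled to a dataflow architecture.
g = values.sum()
for _ in range(2):  # n = 2: second-order gradients w.r.t. the coordinates
    (grads,) = torch.autograd.grad(g, coords, create_graph=True)
    g = grads.sum()
print(g.item())
```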
Computer vision researchers are embracing two promising paradigms, Vision Transformers (ViTs) and Multi-task Learning (MTL), which both deliver strong performance but are computation-intensive: self-attention in ViTs has quadratic complexity, and an entire large MTL model must be activated even for a single task.
The experiment results demonstrate that DGNN-Booster can achieve a speedup of up to 5.6x compared to the CPU baseline (6226R), 8.4x compared to the GPU baseline (A6000), and 2.1x compared to the FPGA baseline without the optimizations proposed in this paper.
Therefore, in this work, we propose GNNBuilder, the first automated, generic, end-to-end GNN accelerator generation framework.
We further propose a Discrete Proxy Search (DPS) method that finds optimized training settings for Eproxy using only a handful of benchmarked architectures on the target tasks.
However, when deploying MTL onto those real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to just execute a single task.
In this paper, we propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on High-Definition (HD) video streams.
A variety of dynamic graph neural networks designed from algorithmic perspectives have succeeded in incorporating temporal information into graph processing.
Using machine learning to solve combinatorial optimization (CO) problems is challenging, especially when the data is unlabeled.
Accurately segmenting temporal frames of cine magnetic resonance imaging (MRI) is a crucial step in various real-time MRI guided cardiac interventions.
Deep Neural Networks (DNNs) have achieved great success in a variety of machine learning (ML) applications, delivering high-quality inference solutions in computer vision, natural language processing, virtual reality, and beyond.
The heterogeneity in ML models comes from multi-sensor perception and multi-task learning, i.e., multi-modality multi-task (MMMT) workloads, resulting in diverse deep neural network (DNN) layers and computation patterns.
First, we propose a novel and scalable dataflow architecture that generally supports a wide range of GNN models built on the message-passing mechanism.
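For reference, a message-passing GNN layer follows the message/aggregate/update pattern sketched below in plain PyTorch; this is an illustrative software view, not the paper's hardware mapping, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """Generic message-passing GNN layer: message -> aggregate -> update."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message from (src, dst) features
        self.upd = nn.Linear(2 * dim, dim)   # update from (node, aggregate)

    def forward(self, x, edge_index):
        src, dst = edge_index                              # [E], [E]
        m = self.msg(torch.cat([x[src], x[dst]], dim=-1))  # per-edge messages
        agg = torch.zeros_like(x)
        agg.index_add_(0, dst, m)                          # sum messages at each dst
        return torch.relu(self.upd(torch.cat([x, agg], dim=-1)))

x = torch.randn(5, 16)                                     # 5 nodes, 16-dim features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])    # 4 directed edges
out = MessagePassingLayer(16)(x, edge_index)
```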
Second, we aim to support a diverse set of GNN models with the extensibility to flexibly adapt to new models.
Despite the strides made by machine learning (ML) based performance modeling, two major concerns that may impede production-ready ML applications in EDA are stringent accuracy requirements and generalization capability.
As a pioneering effort in this direction, we expect more GNN endeavors to revolutionize this high-demand Program-to-Circuit problem and to enrich the expressiveness of GNNs on programs.
Such a self-supervised regression task can effectively evaluate the intrinsic power of an architecture to capture and transform input signal patterns, and allows more thorough use of training samples.
We implement our proposed accelerator on multiple FPGAs, where it outperforms state-of-the-art designs in terms of both throughput and DSP efficiency.
Self-supervised learning of graph neural networks (GNNs) is in great demand because of the widespread label scarcity issue in real-world graph/network data.
This paper emphasizes the importance and efficacy of training, quantization and accelerator design, and calls for more research breakthroughs in the area for AI on the edge.
We formulate the MMMT model and heterogeneous hardware implementation co-design as a differentiable optimization problem, with the objective of improving the solution quality and reducing the overall power consumption and critical path latency.
Artificial intelligence (AI) technologies have dramatically advanced in recent years, resulting in revolutionary changes in people's lives.
Despite the great success of High-Level Synthesis (HLS) tools, we observe several unresolved challenges: 1) the high-level abstraction of programming styles in HLS sometimes conceals optimization opportunities; 2) existing HLS tools do not provide flexible trade-off (Pareto) solutions among different objectives and constraints; 3) the actual quality of the resulting RTL designs is hard to predict.
This raises the question of whether we can find an effective proxy search space (PS) that is only a small subset of GS to dramatically improve RandomNAS’s search efficiency while at the same time keeping a good correlation for the top-performing architectures.
To design HW-NAS-Bench, we carefully collected the measured/estimated hardware performance (e.g., energy cost and latency) of all the networks in the search spaces of both NAS-Bench-201 and FBNet, considering six hardware devices that fall into three categories (i.e., commercial edge devices, FPGA, and ASIC).
High quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators.
Quantization has been proven to be an effective method for reducing the computing and/or storage cost of DNNs.
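As a simplified illustration of how quantization cuts this cost, the sketch below applies textbook symmetric uniform quantization to a weight tensor; it is not tied to any specific paper's quantizer.

```python
import torch

def quantize_symmetric(w, bits=8):
    """Map float weights to signed integers with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1            # e.g., 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

w = torch.randn(64, 64)
q, scale = quantize_symmetric(w, bits=8)  # 4x smaller than fp32 storage
w_hat = q.float() * scale                 # dequantize to check the error
print((w - w_hat).abs().max())            # worst-case error <= scale / 2
```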
We formulate the co-search problem by fusing DNN search variables and hardware implementation variables into one solution space, and maximize both algorithm accuracy and hardware implementation quality.
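A hedged sketch of what such a fused, differentiable solution space can look like (the loss shape, cost model, and trade-off weight are illustrative, not the paper's exact formulation): both DNN and hardware choices are relaxed into softmax weights and optimized jointly.

```python
import torch

# Relaxed (continuous) selection variables: 3 candidate DNN ops and
# 4 candidate hardware configurations, fused into one solution space.
alpha = torch.zeros(3, requires_grad=True)       # DNN search variables
beta = torch.zeros(4, requires_grad=True)        # hardware implementation variables

op_loss = torch.tensor([0.9, 0.6, 0.8])          # assumed per-op task loss
hw_latency = torch.tensor([1.0, 0.7, 0.5, 0.4])  # assumed per-config latency

# Expected task loss and latency under the softmax relaxation; the 0.1
# weight balancing accuracy against hardware quality is illustrative.
task_loss = (torch.softmax(alpha, 0) * op_loss).sum()
latency = (torch.softmax(beta, 0) * hw_latency).sum()
loss = task_loss + 0.1 * latency

loss.backward()   # gradients flow to both alpha and beta simultaneously
print(alpha.grad, beta.grad)
```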
Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balancing, etc.).
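To give a flavor of the kind of estimate such a predictor produces, here is a minimal roofline-style latency model; this is our own simplification under assumed device numbers, not AutoDNNchip's actual graph-based Chip Predictor.

```python
def predict_latency(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline-style estimate: a layer is bound by compute or memory."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)

# A 3x3, 64->64-channel conv on a 56x56 feature map, run on a hypothetical
# 1 TFLOP/s, 25 GB/s edge accelerator with fp16 (2-byte) tensors.
latency = predict_latency(
    flops=2 * 3 * 3 * 64 * 64 * 56 * 56,                           # 2 * MACs
    bytes_moved=(64 * 56 * 56 + 64 * 56 * 56 + 3 * 3 * 64 * 64) * 2,
    peak_flops=1e12,
    peak_bw=25e9,
)
print(f"estimated layer latency: {latency * 1e3:.2f} ms")
```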
The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations.
Object detection and tracking are challenging tasks for resource-constrained embedded systems.
Developing artificial intelligence (AI) at the edge is always challenging, since edge devices have limited computation capability and memory resources but need to meet demanding requirements, such as real-time processing, high throughput performance, and high inference accuracy.
Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and quality of service (QoS), such as inference latency, throughput, and power consumption.
While embedded FPGAs are attractive platforms for DNN acceleration on edge devices due to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices also makes DNN deployment challenging.