An OpenCL Framework for Real-time Inference of Next-generation Convolutional Neural Networks on FPGAs PDF Download


An OpenCL Framework for Real-time Inference of Next-generation Convolutional Neural Networks on FPGAs

An OpenCL Framework for Real-time Inference of Next-generation Convolutional Neural Networks on FPGAs PDF Author: Sachin Kumawat
Publisher:
ISBN: 9780355764413
Category :
Languages : en
Pages :

Book Description
Modern Convolutional Neural Networks (CNNs) consist of billions of multiplications and additions, which require parallel computing units such as GPUs, FPGAs and other DSP processors. Consequently, General-Purpose GPU (GPGPU) computing has taken this field by storm. At the same time, there has been increased interest in FPGA-based acceleration of CNN inference. In this work, we present FICaffe, a framework for FPGA-based Inference with Caffe, which provides fully automated generation and mapping of CNN accelerators on FPGAs. We target applications with critical latency requirements and design high-processing-efficiency accelerators for CNNs. The architecture is structured as a highly concurrent OpenCL library, which enables High-Level Synthesis tools to effectively exploit data, task and pipeline parallelism. We propose a unified memory model that drives exploration of optimal designs by matching the on-chip and off-chip memory bandwidths available on FPGA platforms. We also identify the origins of all clock-cycle stalls and overheads inherent to CNN acceleration designs and provide a detailed model that predicts runtime latency with less than 4% error against on-board tests. Furthermore, FICaffe supports cross-network synthesis, so that it can process a variety of CNNs with reasonable efficiency without hours of re-compilation. FICaffe is integrated with the popular deep learning framework Caffe and is deployable to a wide variety of CNNs. FICaffe's efficacy is shown by mapping to a 28nm Stratix V GXA7 chip, and both network-specific and cross-network performance are reported for AlexNet, VGG, SqueezeNet and GoogLeNet. We show a processing efficiency of 95.8% for the widely reported VGG benchmark, which outperforms prior work. To the best of our knowledge, FICaffe also achieves more than 2X speedup on the Stratix V GXA7 compared with the best published results on this chip.
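The kind of cycle-accurate latency model this abstract describes can be sketched in a few lines: the total multiply-accumulate count divided by the compute parallelism gives an ideal cycle count, to which explicit stall and overhead terms are added. All function names, the overhead breakdown, and the example layer below are illustrative assumptions, not FICaffe's published equations.

```python
# Hypothetical sketch of a stall-aware latency estimate for one conv layer.
# The cost model (MACs / parallelism + stalls + overhead) is an assumption
# made for illustration; FICaffe's actual model is more detailed.

def conv_layer_latency_cycles(out_h, out_w, out_c, in_c, k,
                              pe_parallelism, stall_cycles=0, overhead_cycles=0):
    """Estimate clock cycles for one convolutional layer."""
    macs = out_h * out_w * out_c * in_c * k * k   # total multiply-accumulates
    ideal = macs // pe_parallelism                # cycles if PEs never stall
    return ideal + stall_cycles + overhead_cycles

def latency_ms(total_cycles, fclk_mhz):
    return total_cycles / (fclk_mhz * 1e3)        # cycles / (cycles per ms)

# Example: a VGG conv3_1-like layer, 256 parallel MACs at 200 MHz
cycles = conv_layer_latency_cycles(56, 56, 256, 128, 3,
                                   pe_parallelism=256,
                                   stall_cycles=10_000)
print(f"{latency_ms(cycles, 200):.2f} ms")
```

Summing such per-layer estimates, with measured stall terms, is one way a model can stay within a few percent of on-board latency.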

Framework for Mapping Convolutional Neural Networks on FPGAs

Framework for Mapping Convolutional Neural Networks on FPGAs PDF Author: Masoud Shahshahani
Publisher:
ISBN:
Category : Artificial intelligence
Languages : en
Pages : 0

Book Description
Artificial Intelligence (AI) applications are on the rise. Recent advances in machine learning and deep learning have created various applications in medicine/healthcare, financial markets, security, entertainment, and the social sciences. Deep learning, especially, has demonstrated tremendous opportunities in computer vision, autonomous driving, natural language processing, and many more areas. Deep learning allows machines to solve complex problems using Artificial Neural Networks (ANNs), and the learning itself can be supervised or semi-supervised. Multilayered artificial neural networks are called Deep Neural Networks (DNNs). These deep computational models are composed of multiple sequential processing layers that learn the representations within a given data set. Convolutional Neural Networks (CNNs) are a particular class of deep networks that use convolution to extract features from (usually time-domain or frequency-domain) data and then use the extracted features to classify that data for final inference. Several software tools and frameworks are available to support the deep learning community with fast development and high-performance execution of DNNs. Frameworks such as PyTorch, Caffe, Theano, and TensorFlow aim to increase the productivity of CNN software developers by providing a pathway for implementing deep networks on high-performance multi-core CPUs, GPUs, and DSPs. GPUs, especially, provide easy access to floating-point operations and very high memory bandwidth. Some of the latest Nvidia GPUs (e.g., the Nvidia GeForce RTX 2080) consume as much as 300 watts of power. Excessive power dissipation can make GPUs an unfavorable candidate for implementing CNNs in a variety of applications. Field Programmable Gate Arrays (FPGAs) provide a high degree of customized parallelization and offer far superior performance per watt.
We believe that FPGA-based accelerators are ideal platforms for implementing convolutional neural networks for computer vision and related applications. Software engineers with minimal hardware design skills demand tremendous support within the tool flows, and FPGA vendors are fully embracing new methodologies like high-level synthesis, where designs can be described as programs written in languages like C/C++. However, commercial FPGAs are resource-scarce, the CNN mapping design space is enormous, and efficient mapping of a CNN can quickly become a challenging task. FPGA resource requirements, latency, and power are affected by many parameters, including the CNN architecture and the level of computational parallelism. In practice, a software designer first explores various CNN architectures in software to improve the architecture's validation accuracy. Once an architecture has been finalized, the designer ports the design to an FPGA for inference acceleration. The mapping process undergoes performance optimization by tweaking many design-related parameters during design space exploration and by changing operating frequencies. The entire process is highly time-consuming. This dissertation describes a fully automated end-to-end design framework for implementing CNNs on FPGAs. The framework allows a designer to express CNNs in commonly preferred Python-language descriptions and provides a guided tool flow to generate a custom Intellectual Property (IP) block. In addition, the framework allows easy and complete exploration for selecting final design implementations based on optimization parameters that include Performance, Power, and Area (PPA).
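The design-space exploration step described above can be sketched as a simple loop: enumerate candidate parallelism factors, discard configurations that exceed the FPGA's resource budget, and rank the survivors by an estimated metric. The resource and latency models below (one DSP per MAC lane, a fixed MAC count and clock) are hypothetical placeholders, not the dissertation's actual cost models.

```python
# Illustrative design-space exploration over two hypothetical parallelism
# knobs (pe = processing elements, simd = lanes per PE). The DSP budget,
# MAC count, and 1-DSP-per-lane cost model are assumptions for illustration.

from itertools import product

DSP_BUDGET = 1500          # assumed DSP blocks available on the target FPGA
TOTAL_MACS = 1.0e9         # assumed MAC count of the network

def explore(pe_options, simd_options, fclk_mhz=200):
    best = None
    for pe, simd in product(pe_options, simd_options):
        dsps = pe * simd                        # crude cost: 1 DSP per lane
        if dsps > DSP_BUDGET:
            continue                            # violates resource budget
        latency = TOTAL_MACS / (pe * simd * fclk_mhz * 1e6)  # seconds
        if best is None or latency < best[2]:
            best = (pe, simd, latency)
    return best

pe, simd, lat = explore(pe_options=[8, 16, 32, 64], simd_options=[4, 8, 16])
print(f"pe={pe} simd={simd} latency={lat*1e3:.2f} ms")
```

A real flow would add power and area models and tie-breaking rules, but the filter-then-rank structure is the same.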

Hardware Acceleration of Video Analytics on FPGA Using OpenCL

Hardware Acceleration of Video Analytics on FPGA Using OpenCL PDF Author: Akshay Dua
Publisher:
ISBN:
Category : Gate array circuits
Languages : en
Pages : 44

Book Description
With the exponential growth in video content over the last few years, video analysis is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis applications use deep learning algorithms such as convolutional neural networks (CNNs) because of their high accuracy in object detection; enhancing the performance of CNN models is therefore crucial for video analysis. CNN inference is computationally expensive and often requires high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in energy- and thermally-constrained environments such as traffic management, GPUs are less preferred because of their high power consumption and limited energy efficiency, and they are difficult to fit in a small footprint. To enable real-time video analytics in emerging large-scale Internet of Things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion; thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architectural adaptiveness, high computational throughput for streaming processing, and high energy efficiency. This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics on the edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which exploits both spatial and temporal parallelism for energy-efficient CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor that causes scalability issues in existing designs, and solutions are proposed to resolve the issue with compiler automation.
The proposed CNN kernel is highly scalable and is parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured with the proposed CNN kernel on an Intel Arria 10 GX1150 FPGA. The measurement results show that the proposed CNN kernel, when mapped with 100% utilization of computation resources, achieves latencies of 11 ms, 84 ms, 1614.9 ms, and 990.34 ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet, respectively, when the input feature maps and weights are represented using a 32-bit floating-point data type.
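The idea of sizing architecture parameters to saturate a device's DSP blocks can be illustrated with a toy calculation. The cost model below (one DSP per vectorized MAC lane, pe_num chosen to fill the device for a fixed vec_fac) is an assumption for illustration and not the thesis's actual parameter-selection procedure; reuse_fac, which trades data reuse against memory bandwidth, is omitted.

```python
# Hypothetical sketch: choose pe_num so that pe_num * vec_fac MAC lanes
# consume (nearly) all DSP blocks on the target device. The 1-DSP-per-lane
# cost model is an illustrative assumption.

def size_kernel(dsp_total, vec_fac=8, dsps_per_lane=1):
    """Return (pe_num, utilization) for a given DSP budget and vector width."""
    pe_num = dsp_total // (vec_fac * dsps_per_lane)
    used = pe_num * vec_fac * dsps_per_lane
    return pe_num, used / dsp_total

# Intel Arria 10 GX 1150 provides 1518 DSP blocks (per its public datasheet)
pe_num, util = size_kernel(1518)
print(pe_num, f"{util:.1%}")
```

With vec_fac = 8 this yields 189 PEs and roughly 99.6% DSP utilization; in practice one would sweep vec_fac and reuse_fac jointly to hit the best achievable fit.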

Caffeinated FPGAs

Caffeinated FPGAs PDF Author: Roberto DiCecco
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
This thesis presents a framework for performing training and inference of Convolutional Neural Networks (CNNs) with reduced-precision floating-point arithmetic. This work aims to provide a means for FPGA and machine learning researchers to use the customizability of FPGAs to explore the precision requirements of training CNNs with an open-source framework. This is accomplished through the creation of a High-Level Synthesis library with a custom-precision floating-point data type that is configurable in both exponent and mantissa widths, with several standard operators and rounding modes supported. With this library, an FPGA CNN Training Engine (FCTE) has been created, along with an FPGA CNN framework, FPGA Caffe, which is built on Caffe. FCTE has a peak performance of approximately 350 GFLOPs and has been used to show that a mantissa width of 5 and an exponent width of 6 are sufficient for training several models targeting the MNIST and CIFAR-10 datasets.
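The effect of a configurable mantissa/exponent width can be simulated in software by rounding ordinary doubles to the nearest value representable at the target precision. The sketch below is a crude software emulation (round-to-nearest only, subnormals omitted, a simplistic exponent-range check), not the thesis's HLS library.

```python
# Crude emulation of a reduced-precision float: keep man_bits fractional
# bits of the frexp mantissa and saturate on exponent overflow. This is an
# illustrative model, not the custom-precision HLS data type itself.

import math

def quantize(x, exp_bits=6, man_bits=5):
    """Round x to the nearest value representable at the given widths."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    bias_max = 2 ** (exp_bits - 1)       # crude symmetric exponent range
    if e > bias_max:
        return math.copysign(math.inf, x)   # overflow saturates to infinity
    scale = 2 ** man_bits
    m_q = round(m * scale) / scale       # keep man_bits fractional bits
    return math.ldexp(m_q, e)

print(quantize(math.pi))   # pi rounded to a 5-bit mantissa -> 3.125
```

Running a training loop with every intermediate value passed through such a function is one software-side way to probe whether, say, e6m5 precision suffices for a model.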

Automated Customization of ML Inference on FPGAs

Automated Customization of ML Inference on FPGAs PDF Author: Mohammad Ghasemzadeh
Publisher:
ISBN:
Category :
Languages : en
Pages : 98

Book Description
This thesis introduces novel frameworks for automated customization of two classes of machine learning algorithms: deep neural networks and causal Bayesian analysis. High computational complexity often prohibits the deployment of ML models on resource-constrained embedded devices where memory and energy budgets are strictly limited. FPGAs offer a flexible substrate that can be configured to maximally exploit the parallel nature of computations in different ML algorithms to deliver high-throughput and power-efficient accelerators. To make FPGAs a ubiquitous platform for ML inference, automated frameworks that can customize ML models to the constraints of the underlying hardware and pertinent application requirements are necessary. My work proposes hardware-algorithm co-design approaches to customize ML inference on FPGA platforms and provides end-to-end automated frameworks to generate optimized hardware accelerators that can be used by a broad range of ML developers without requiring any hardware design knowledge. My key contributions include: (i) an end-to-end framework to customize execution of deep neural networks on FPGAs using a reconfigurable encoding approach for the parameters of the model, which results in a 9-fold reduction in memory footprint and a 15-fold improvement in throughput without any loss in accuracy; (ii) CausaLearn, the first automated framework that enables real-time and scalable approximation of probability density functions in the context of causal Bayesian analysis, offering up to two orders-of-magnitude runtime and energy improvements over the best-known prior solution; and (iii) ReBNet, an end-to-end framework for training reconfigurable binary neural networks in software and developing efficient accelerators for execution on FPGAs.
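The reconfigurable parameter encoding above is not detailed in this description; a common technique in this spirit is codebook (cluster-based) encoding, where each 32-bit weight is replaced by a small index into a shared table of centroids. The sketch below uses evenly spaced centroids for simplicity and is an illustration of the general idea, not the thesis's actual scheme.

```python
# Illustrative codebook encoding of DNN weights: store 3-bit indices into a
# shared 8-entry centroid table instead of 32-bit floats. Evenly spaced
# centroids are used here for brevity (k-means would fit the data better).

import numpy as np

def encode(weights, bits=3):
    """Quantize weights to 2**bits shared centroids; return (indices, codebook)."""
    k = 2 ** bits
    codebook = np.linspace(weights.min(), weights.max(), k)
    idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), codebook

def decode(idx, codebook):
    return codebook[idx]            # reconstruct approximate weights

w = np.random.randn(1024).astype(np.float32)
idx, cb = encode(w, bits=3)
w_hat = decode(idx, cb)             # lossy round-trip
# 32-bit floats -> 3-bit indices plus a tiny codebook: roughly 10x smaller
ratio = (w.size * 32) / (idx.size * 3 + cb.size * 32)
print(f"compression ~{ratio:.1f}x")
```

Reductions on the order of the 9-fold figure quoted above are plausible with index widths in this range, though the actual framework's encoding and accuracy-preservation mechanism are its own.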

Hardware Accelerators in Data Centers

Hardware Accelerators in Data Centers PDF Author: Christoforos Kachris
Publisher: Springer
ISBN: 3319927922
Category : Technology & Engineering
Languages : en
Pages : 280

Book Description
This book provides readers with an overview of the architectures, programming frameworks, and hardware accelerators for typical cloud computing applications in data centers. The authors present the most recent and promising solutions, using hardware accelerators to provide high throughput, reduced latency and higher energy efficiency compared to current servers based on commodity processors. Readers will benefit from state-of-the-art information regarding application requirements in contemporary data centers, computational complexity of typical tasks in cloud computing, and a programming framework for the efficient utilization of the hardware accelerators.

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays PDF Author: Jonathan Greene
Publisher:
ISBN: 9781450343541
Category :
Languages : en
Pages :

Book Description
FPGA '17: The 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb 22-24, 2017, Monterey, CA, USA. You can view more information about this proceeding and all of ACM's other published conference proceedings in the ACM Digital Library: http://www.acm.org/dl.

High-Performance Computing Using FPGAs

High-Performance Computing Using FPGAs PDF Author: Wim Vanderbauwhede
Publisher: Springer Science & Business Media
ISBN: 1461417910
Category : Technology & Engineering
Languages : en
Pages : 798

Book Description
High-Performance Computing Using FPGAs covers the area of high-performance reconfigurable computing (HPRC), providing an overview of architectures, tools and applications for HPRC. FPGAs offer very high I/O bandwidth and fine-grained, custom and flexible parallelism; ever-increasing computational needs coupled with the frequency/power wall, the growing maturity and capabilities of FPGAs, and the advent of multicore processors have driven the acceptance of parallel computational models. The part on architectures introduces different FPGA-based HPC platforms: attached co-processor HPRC architectures such as CHREC's Novo-G and EPCC's Maxwell systems; tightly coupled HPRC architectures, e.g. the Convey hybrid-core computer; reconfigurably networked HPRC architectures, e.g. the QPACE system; and standalone HPRC architectures such as EPFL's CONFETTI system. The part on tools focuses on high-level programming approaches for HPRC, with chapters on C-to-Gate tools (such as Impulse-C, AutoESL, Handel-C, MORA-C++); graphical tools (MATLAB-Simulink, NI LabVIEW); and domain-specific languages and languages for heterogeneous computing (for example OpenCL and Microsoft's Kiwi and Alchemy projects). The part on applications presents cases from several application domains where HPRC has been used successfully, such as bioinformatics and computational biology; financial computing; stencil computations; information retrieval; Lattice QCD; astrophysics simulations; and weather and climate modeling.

2016 26th International Conference on Field Programmable Logic and Applications (FPL)

2016 26th International Conference on Field Programmable Logic and Applications (FPL) PDF Author: IEEE Staff
Publisher:
ISBN: 9781509008513
Category :
Languages : en
Pages :

Book Description
The International Conference on Field Programmable Logic and Applications (FPL) is the first and largest conference covering the rapidly growing area of field-programmable logic. During the past 26 years, many of the advances achieved in reconfigurable system architectures, applications, embedded processors, design automation methods (EDA) and tools have been first published in the proceedings of the FPL conference series. FPL 2016 will offer the following five conference tracks: Architectures and Technology; Applications and Benchmarks; Design Methods and Tools; Self-aware and Adaptive Systems; and Surveys, Trends and Education.

Robotic Computing on FPGAs

Robotic Computing on FPGAs PDF Author: Shaoshan Liu
Publisher: Morgan & Claypool Publishers
ISBN: 1636391664
Category : Computers
Languages : en
Pages : 220

Book Description
This book provides a thorough overview of the state-of-the-art field-programmable gate array (FPGA)-based robotic computing accelerator designs and summarizes their adopted optimized techniques. This book consists of ten chapters, delving into the details of how FPGAs have been utilized in robotic perception, localization, planning, and multi-robot collaboration tasks. In addition to individual robotic tasks, this book provides detailed descriptions of how FPGAs have been used in robotic products, including commercial autonomous vehicles and space exploration robots.