Design and development of deep learning convolutional neural network on an
field programmable gate array
Article · October 2018
Design and Development of Deep Learning
Convolutional Neural Network on an Field
Programmable Gate Array
Y.C.Wong, Y.Q.Lee
Micro and Nano Electronic (MINE) Research Group, Centre for Telecommunication Research & Innovation (CeTRI)
Faculty of Electronic and Computer Engineering, Universiti Teknikal Malaysia Melaka,
Hang Tuah Jaya, 76100 Durian Tunggal, Melaka, Malaysia
ycwong@utem.edu.my
Abstract: This paper presents the design and development of a Convolutional Neural Network (CNN) on a Field Programmable Gate Array (FPGA). In recent work on deep learning, CNN is a challenging research area in both software and hardware implementation. Software implementations tend to be prohibitively slow because most neural networks run on sequentially operating architectures. Thus, the objective of this work is to design and develop a deep learning CNN on an FPGA, based on the premise that hardware implementations that compute the neurons of each layer in parallel can be made faster. This work focuses on handwriting recognition, in which the machine receives and interprets intelligible handwritten input from its sources. The speed of the CNN implemented on an FPGA was analyzed. Digits and numbers were successfully recognized by the developed system.

Index Terms: Convolutional Neural Network, Field Programmable Gate Array.

I. INTRODUCTION

A Convolutional Neural Network (CNN) consists of one or more convolutional layers, followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the two-dimensional structure of an input image, using local connections and tied weights followed by some form of pooling, which results in translation-invariant features. A CNN is easy to train and has fewer parameters than a fully connected network with the same number of hidden units. A typical CNN has three types of layer arranged in a feed-forward structure, namely the convolutional layer, the subsampling layer and the fully connected layer. A CNN extracts simple features at higher resolution and converts them to complex features at lower resolution [1].

Considering the slow process involved in implementing a CNN in software, a hardware implementation of CNN on an FPGA is introduced in order to speed up the process. An FPGA is built from programmable logic, which is not only erasable but also flexible for design. Deep learning CNN on FPGA can be applied to many applications, such as handwritten digit recognition and handwritten document recognition. It can also be applied as a facial recognition system on chip, in which the design methodology integrates all the components of a target system into a single chip, giving a one-chip implementation of face recognition for wearable or mobile applications with compact size and weight. Facial recognition on chip for wearable or mobile applications can allow users to authenticate themselves by looking at the camera, for example to authorize financial transactions. Besides that, a policeman who wears the device around the neck can automatically check the information of the person in front of him by accessing a registered database. A complete real-time face recognition system consists of face detection, recognition and down-sampling modules using an FPGA [2]. According to [3], deep learning, an emerging field of machine learning, shows good ability in solving complex learning problems. Unfortunately, the networks are becoming increasingly large due to the demands of practical applications, which poses a significant challenge for constructing high-performance implementations of deep learning neural networks. Recently, significant research has been carried out on the implementation of CNN on an FPGA. The work in [8] proposed a complete FPGA-based real-time face recognition system that runs at 45 frames per second on a Virtex-5 FPGA.

A. Project Application

The design and development of CNN on an FPGA can be applied to many applications, such as the recognition of handwritten digits and handwritten documents. Other than digit recognition, the implementation of CNN on an FPGA has been widely used in tasks such as image classification and object detection [4]. The proposed design is suitable for low-power embedded system applications with limited memory. The general application for this work is digit recognition implemented on an FPGA. As the FPGA is portable, it can be fitted to an autonomous car to detect the digit or number of an available parking slot in a car park. This work contributes to a sustainable and friendly environment because of the low power consumption needed to operate an FPGA. FPGAs offer highly optimized reconfigurable architectures in which speed-up can be obtained by exploiting wide parallelism, deep pipelines, and fast and efficient data paths. CNN frameworks require very high computational power and large amounts of memory. While the GPU performs well on expensive machines, it is not suitable for portable devices and embedded systems [5].

Figure 1 shows the block diagram of the DE1-SoC computer, while Figure 2 shows the DE1-SoC board. An Ethernet cable and a mini-USB cable are needed for
ISSN: 2180 – 1843 e-ISSN: 2289-8131 Vol. 10 No. 4 October – December 2018 25
Journal of Telecommunication, Electronic and Computer Engineering
connecting the DE1-SoC board. The Altera Monitor Program is a good way to begin working with the DE1-SoC Computer and the ARM A9 processor.

Figure 1: Block diagram of the DE1-SoC computer.

Figure 2: The DE1-SoC board.

II. METHODOLOGY

The method used in this work is illustrated in Figure 3. This work focused on the design and development of deep learning CNN on an FPGA. First, Linux Ubuntu 16.04 LTS was installed and Linux was started on the DE1-SoC board. The CNN developed in C code was analyzed and enhanced. The purpose of the C code was the training and inference of CNN in a general way. As such, it could be used for any traditional computer vision task, such as object classification or detection and handwritten digit recognition. The long-term goal of this code is to provide a low-level, efficient and very lightweight deep learning framework that is easy to deploy in constrained environments. The currently implemented layers include convolution, transposed convolution, fully connected, max-pooling and batch normalization. The neural network was fitted into an FPGA-implemented circuit. The tools required were the Altera DE1-SoC development and education board, a host computer, an Ethernet cable, a mini-USB cable for connecting the DE1-SoC board to the host computer, and a microSD card. The host computer was used for developing software programs that run under Linux on the DE1-SoC board. The CNN code was simulated and ran successfully. After that, the DE1-SoC board was configured with Linux and connected to the host computer. After the FPGA had been configured from Linux, Linux applications that use the FPGA hardware devices were developed. Lastly, Linux drivers were developed for the FPGA hardware. The programs were compiled on the host computer, and the resulting executable was then transferred onto the Linux filesystem on the microSD card.

Figure 3: CNN on FPGA design flow

Figure 4: The steps of training data

Figure 4 shows the steps of training on the data. Machine learning can be divided into two phases. The first phase is devoted to learning and the second phase to prediction. Machine recognition, description, classification and image processing are significant problems in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision and artificial intelligence. Handwriting recognition is the ability of a machine to receive and interpret intelligible handwritten input from its sources. Neural networks are a common way to realize pattern classification and image recognition. Handwriting recognition systems have mostly been implemented in software.

Once the model had been trained, the validation and testing subsets were used to predict the classification and recognition result. The prediction process was implemented to enhance the performance of the classification and detection tasks, as shown in Figure 5.
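The two phases described above rely on keeping the training, validation and testing subsets separate. The following is a minimal sketch of such a split, not the authors' code: the function name `split_dataset`, the 80/10/10 percentages and the fixed seed are assumptions chosen only for illustration, since the paper does not state how the subsets were formed.

```python
import random


def split_dataset(samples, train_pct=80, val_pct=10, seed=0):
    """Shuffle samples and split them into training, validation and testing subsets.

    The percentages are hypothetical defaults; whatever remains after the
    training and validation shares becomes the testing subset.
    """
    if train_pct <= 0 or val_pct < 0 or train_pct + val_pct >= 100:
        raise ValueError("percentages must leave room for a testing subset")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Example with 100 placeholder sample indices.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The model would be fitted on `train` only, with `val` and `test` held back for the prediction phase, so that the reported error reflects data the model has not seen.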
Figure 5: The steps of prediction

III. RESULT

There are two ways of implementing computations: in hardware or in software [6-8]. The software approach is the most straightforward, and the development skills are widely available. Meanwhile, the hardware approach involves the custom design of a circuit dedicated to a particular need of the application. The FPGA-based acceleration solution for DNN inference in [6] is realized on an SoC device in which software controls the execution and offloads compute-intensive operations to the hardware accelerator.

The MNIST database [7] contains 70000 standardized images of handwritten digits. The idea was to train the neural network first using the training set. After the training ended successfully, it was switched off and the effectiveness of the trained network was tested using the testing set. Each MNIST image has a size of 28 x 28 = 784 pixels. Each pixel was provided as a number between 0 and 255 indicating its intensity. Each pixel was treated as either 'ON' or 'OFF', that is, black or white.

In CNN terminology, a small matrix such as a 3x3 matrix that slides over the image is called a filter, kernel or feature detector. The matrix formed by sliding the filter over the image and computing the dot product at each position is called the 'Convolved Feature' or 'Feature Map'. Filters act as feature detectors on the original input image. The CNN learns the values of these filters on its own during the training process. The more filters were used, the more image features were extracted and the better the network became at recognizing patterns in unseen images.

Figure 6 shows the training data in the training process. Batch size is the total number of training examples presented in a single batch. The maximum batch size of this work was 400000. Batch size and number of batches are two different things. The number of iterations is the number of batches needed to complete one epoch. In this design, the dataset of 400000 examples was divided into batches of 200, so it took 2000 iterations to complete one epoch.

Figure 6: Part of the process of training data

Figure 7 shows the training error against the iteration. The training error is the error that emerges when the trained model is run back on the training data. According to Figure 7, the training error continued to decrease as the number of iterations increased.

Figure 7: Training error against the iteration

Figure 8 shows the test error against the iteration. The test error is the error obtained when the trained model is run on a set of data that it has never been exposed to. This data is usually used to measure the accuracy of the model before it is deployed for prediction.

Figure 8: Test error against the iteration

Figure 9 shows the process of testing images. The Python script was run to pack all the pictures and their categories into single binary files. These then appeared as ubyte files ready to be archived with tar. Figure 10 shows the selected testing images in portable network graphics (PNG) files.

Figure 9: The process of testing images
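The feature-map computation described earlier in this section (sliding a filter over the image and taking the dot product at each position) can be sketched in a few lines. This is an illustrative sketch only, not the C implementation used in this work: the function name `convolve2d_valid`, the 5x5 binary image and the 3x3 filter values are all assumptions made for the example. As is usual in CNNs, the filter is applied without flipping, with stride 1 and no padding.

```python
def convolve2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and return the feature map.

    Both arguments are lists of rows. Each output value is the dot product
    of the kernel with the image window underneath it.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 binary image ('ON' = 1, 'OFF' = 0) and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
for row in convolve2d_valid(image, kernel):
    print(row)
# prints:
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

A 5x5 input and a 3x3 filter give a 3x3 feature map; by the same rule a 28 x 28 MNIST image and a 3x3 filter would give a 26 x 26 feature map.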
Figure 10: The selected testing images in portable network graphics (PNG) files

As shown in Figure 11, each column represents one image file, so the prediction file has five columns. Each row represents one digit; there are ten rows, representing zero to nine. The testing images that matched the results are shown in the prediction file. The accuracy for each digit is different. The number of training examples can be increased to improve the accuracy.

The time taken to process all the images on the GPU was 38 ms, as shown in Figure 12, while the time taken to process all the images on the ARM Cortex A9 in the DE1-SoC board was 967 ms, as shown in Figure 13. FPGA technology evolves quickly. Theoretically, an FPGA with lower power consumption requires fewer thermal dissipation countermeasures; hence, it can implement the solution in smaller dimensions.

Figure 13: The speed for processing all the images in ARM Cortex A9 in DE1-SoC board.

Even a very basic FPGA is more flexible than most microcontrollers. The term field programmable means that the FPGA can be reprogrammed to do any task that fits into the number of its gates. An FPGA typically consumes more power than a microcontroller.
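The prediction-file layout described in this section (one column per image, one row per digit) can be read back by picking, for each column, the row with the largest value. The sketch below assumes that each cell holds a score for that digit; the paper does not spell out the cell contents, so this interpretation, the function name `predicted_digits` and the score values are all hypothetical.

```python
def predicted_digits(scores):
    """Return, for each column (image), the row index (digit) with the highest score.

    scores is a list of rows, one row per digit 0-9, and each row holds
    one value per image.
    """
    n_rows = len(scores)
    n_cols = len(scores[0])
    result = []
    for col in range(n_cols):
        best_row = max(range(n_rows), key=lambda row: scores[row][col])
        result.append(best_row)
    return result


# Hypothetical 10 x 5 score table: five images whose top-scoring digits
# are 7, 2, 1, 0 and 4.
top_digits = [7, 2, 1, 0, 4]
scores = [[0.01] * 5 for _ in range(10)]
for col, digit in enumerate(top_digits):
    scores[digit][col] = 0.91

print(predicted_digits(scores))  # prints [7, 2, 1, 0, 4]
```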
Figure 11: The results in prediction file

IV. DISCUSSION

Recognizing digits is not an easy task. Deep learning CNN performed better than the other methods as it achieved higher accuracy. The idea is to take a large number of handwritten digits, known as training examples, and then develop a system derived from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. The accuracy for each digit is different. Furthermore, by increasing the number of training examples, the network can learn more and the accuracy could be improved.
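The observation that the accuracy differs from digit to digit can be quantified by counting correct predictions separately for each class. The following is a minimal sketch under stated assumptions: the function name `per_digit_accuracy` and the label lists are made up for illustration and are not results from this work.

```python
def per_digit_accuracy(true_labels, predicted_labels, num_classes=10):
    """Return a list whose entry d is the fraction of digit-d samples predicted correctly.

    Entries are None for digits that do not occur in true_labels.
    """
    totals = [0] * num_classes
    correct = [0] * num_classes
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return [
        correct[d] / totals[d] if totals[d] else None
        for d in range(num_classes)
    ]


# Hypothetical labels: both 0s are right, one of three 1s is read as a 7,
# and one of four 7s is read as a 1.
true_labels = [0, 0, 1, 1, 1, 7, 7, 7, 7]
predicted_labels = [0, 0, 1, 7, 1, 7, 7, 1, 7]
accuracy = per_digit_accuracy(true_labels, predicted_labels)
print(accuracy[0], accuracy[7])  # prints 1.0 0.75
```

Digits with low per-class accuracy are natural candidates for additional training examples.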
The code used in this work is written in C. It can convert a set of JPG or PNG images into the MNIST binary format, and it can rescale all the JPG and PNG images in the folders to the MNIST standard 28 x 28-pixel size. The Python script is run to pack all the pictures and categories into single binary files, which then appear as ubyte files ready to be archived with tar. The long-term goal of this work is to provide a low-level, efficient and very lightweight deep learning framework that is easy to deploy in constrained environments.
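The MNIST binary (ubyte) files mentioned above use the IDX layout: a big-endian header with a magic number and the dimensions, followed by the raw unsigned bytes. The sketch below shows that layout only and is not the script used in this work. The function names `pack_idx_images` and `pack_idx_labels` are hypothetical, and decoding and rescaling of the JPG or PNG files is assumed to have happened already, so each image is a list of 28 rows of 28 integers in the range 0-255.

```python
import struct


def pack_idx_images(images):
    """Serialize equally sized grayscale images as an IDX3 unsigned-byte blob."""
    if not images:
        raise ValueError("at least one image is required")
    rows, cols = len(images[0]), len(images[0][0])
    # Magic 0x00000803: unsigned-byte data with three dimensions.
    header = struct.pack(">IIII", 0x00000803, len(images), rows, cols)
    body = bytearray()
    for image in images:
        if len(image) != rows or any(len(r) != cols for r in image):
            raise ValueError("all images must share the same size")
        for row in image:
            body.extend(row)  # raises ValueError for values outside 0-255
    return header + bytes(body)


def pack_idx_labels(labels):
    """Serialize labels as an IDX1 unsigned-byte blob."""
    # Magic 0x00000801: unsigned-byte data with one dimension.
    header = struct.pack(">II", 0x00000801, len(labels))
    return header + bytes(labels)


# Two blank 28 x 28 placeholder images with made-up labels 3 and 8.
blank = [[0] * 28 for _ in range(28)]
image_blob = pack_idx_images([blank, blank])
label_blob = pack_idx_labels([3, 8])
print(len(image_blob), len(label_blob))  # prints 1584 10
```

Writing `image_blob` and `label_blob` to disk in binary mode would give files with the same layout as the MNIST distribution files.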
The design and development of CNN on an FPGA can be applied to many applications, such as the recognition of handwritten digits and handwritten documents. The general application for this work is digit or number recognition implemented on an FPGA, which can be used in an autonomous car to detect the number of an available parking slot in a car park.
Figure 12: The speed for processing all the images in GPU.
V. CONCLUSION

In conclusion, the digits were successfully recognized by the system, as can be observed in the prediction file. However, processing all the images on the ARM Cortex A9 in the DE1-SoC board was slower than processing them on the GPU: the GPU needed only 38 ms, whereas the DE1-SoC board needed 967 ms. Nevertheless, the FPGA is portable, so it is suitable for autonomous cars, where it can detect numbers or digits to find an available parking slot in a car park. This work can be further extended to object detection, to detect surrounding objects, humans and even animals.

ACKNOWLEDGEMENT

The authors acknowledge the technical and financial support by Universiti Teknikal Malaysia Melaka (UTeM) and Ministry of Science, Technology and Innovation Malaysia’s grant no. 01-01-14-SF0133//L00029.

REFERENCES

[1] J. Wang, J. Lin and Z. Wang, "Efficient Hardware Architectures for Deep Convolutional Neural Network", IEEE Transactions on Circuits and Systems I, vol. 65, no. 6, 2018, pp. 1941-1953.
[2] J. Matai, A. Irturk and R. Kastner, "Design and Implementation of an FPGA-based Real Time Face Recognition System", 19th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2011, pp. 97-100.
[3] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie and X. Zhou, "DLAU: A Scalable Deep Learning Accelerator Unit on FPGA", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 3, 2017, pp. 513-517.
[4] Y. Zhou, S. Redkar and X. Huang, "Deep Learning Binary Neural Network on an FPGA", 60th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2017, pp. 281-284.
[5] F. Yi, H. Xiao and S. Yongjie, "FPGA Accelerating Core Design Based on XNOR Neural Network algorithm", MATEC Web of Conference (SMIMA), 2018, pp. 1-5.
[6] L. Ruo, "A framework for FPGA-Based Acceleration of Neural Network Inference with Limited Precision via High-Level Synthesis with Streaming Functionality", M.S. thesis, University of Toronto, 2016.
[7] Y. LeCun, C. Cortes and C.J.C. Burges, "MNIST handwritten digit database", [Online]. Available: http://yann.lecun.com/exdb/mnist/. [Accessed: 24-Aug-2018].
[8] J. Matai, A. Irturk and R. Kastner, "Design and Implementation of an FPGA-based Real Time Face Recognition System", 19th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2011, pp. 97-100.