Master Thesis

Interfacing a Neuromorphic Coprocessor with a RISC-V Architecture

Supervisors
PhD Gianvito Urgese
Dr. Evelina Forno

Candidate
Andrea Spitale
Contents
1 Introduction  9
2 Technical Background  17
  2.1 Machine Learning and Applications  17
      2.1.1 Machine Learning  17
      2.1.2 Neural Networks  17
  2.2 The RISC-V Instruction Set Architecture (ISA)  21
      2.2.1 RV32I - Base Integer ISA  24
      2.2.2 Choices and Consequences  25
  2.3 ODIN: A Spiking Neural Network Coprocessor  28
      2.3.1 Supported Neuron Models  30
      2.3.2 SPI Slave  31
      2.3.3 Controller  36
      2.3.4 AER Output  38
      2.3.5 Scheduler  40
      2.3.6 Neuron Core  41
      2.3.7 Synaptic Core  41
  2.4 Chipyard Hardware Design Framework  45
      2.4.1 Rocket Core  46
      2.4.2 Tools & Toolchains  48
      2.4.3 Simulators  49
  2.5 Parallel Ultra-Low-Power Platform (PULP)  50
      2.5.1 PULPino  50
      2.5.2 PULPissimo  52
3 Methods  55
  3.1 Setting up ODIN and Chipyard Environments  56
  3.2 ODIN Integration Inside Chipyard  57
  3.3 ODIN Parameters Definition  67
4 Results & Discussion  71
  4.1 RTL Simulation  71
  4.2 Synthesis Results  82
      4.2.1 Area  83
5 Conclusions  85
Bibliography  88
Chapter 1
Introduction
[Figure 1.1 histogram rows, top to bottom: 2015 (150), 2014 (123), 2013 (127), 2012 (102), 2010-2011 (104), 2006-2009 (112), 1998-2005 (108), 1988-1997 (103); x axis: percentage of papers in each field, from 0 to 100.]
Figure 1.1: A graph showing the 10 main motivations for which neuromorphic computing research was conducted. On the bottom, the percentage of papers citing a particular motivation is shown. On the left side, the year range to which each histogram refers is reported. Finally, the right side gives, within round parentheses, the number of papers on neuromorphic computing published during the years that row refers to. Picture taken from [50].
   as the Von Neumann Bottleneck. Overall throughput is reduced and the CPU spends most of its operating time idle, wasting resources and power. The problem is quite evident nowadays, as CPU operating frequencies and memory sizes have increased much faster than the speed at which memory and CPU communicate.
5. Low Power. This stands firm as the main reason behind neuromorphic hardware development. Nowadays, Internet of Things (IoT) and many embedded devices are required to perform complex computations, yet their power resources are limited. This is strictly related to the apparent fading of Moore's law and to Dennard scaling. Dennard observed that the power density of a given device stays the same as transistors shrink, because current and operating voltage decrease accordingly; thus, the number of transistors on a given area could be doubled. This allowed the semiconductor industry to produce microprocessors operating at ever-increasing frequencies over the years, while maintaining acceptable power consumption and density. However, two issues arose in the past 20 years, namely dark silicon and the impact of the leakage current in the refined dynamic power consumption equation $P_{dyn} = \alpha C f V^2 + V I_{leak}$, which ultimately led modern high-performance processors to stop at around 4 GHz operating frequency and made designers move to a multi-core approach.
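   To get a feel for the magnitudes involved, here is a worked example with illustrative values (assumed for the sake of the arithmetic, not taken from any cited source): with an activity factor $\alpha = 0.1$, a switched capacitance $C = 1\,\text{nF}$, $f = 1\,\text{GHz}$, $V = 1\,\text{V}$ and $I_{leak} = 10\,\text{mA}$,
   \[ P_{dyn} = 0.1 \cdot 10^{-9} \cdot 10^{9} \cdot 1^2 + 1 \cdot 0.01 = 0.1\,\text{W} + 0.01\,\text{W} = 0.11\,\text{W}, \]
   which shows how the leakage term adds a frequency-independent contribution that voltage scaling can no longer hide.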
      This is due to the exploited information encoding and the self-healing capability of such devices. Moreover, they may help in reducing the impact of process variation in device fabrication, which often leads to degraded performance or unacceptable errors.
Neuromorphic computing and devices are widely adopted to perform tasks belonging to various domains, as illustrated in Figure 1.2. A few hints are given in Table 1.1 and Table 1.2.
There is a need for technologies and algorithms enabling the development of more powerful edge computing devices, while keeping power consumption as low as possible. To this extent, since neuromorphic computing seems to represent a key factor in enabling the transition from a cloud-oriented computing environment to one which exploits small devices to perform intensive tasks, this thesis focuses on the development and validation of an architecture that couples a RISC-V based System on Chip (SoC) with a Spiking Neural Network (SNN) oriented hardware accelerator.
Application Domain | Category | Description

Imaging | Edge Detection | Identify edges in a digital image by identifying points at which brightness changes rapidly. An example is described in [24].
Imaging | Compression | Minimize image size in bytes, possibly without reducing overall image quality. An example is presented in [19].
Imaging | Filtering | Change an image's properties to highlight certain features or hide some others. An example is presented in [35].
Imaging | Segmentation | Typically used in medical imaging, e.g. to search for tumors or similar illnesses; it aims at changing the image representation according to new schemes which make it easier to analyze. An example is analyzed in [11].
Imaging | Feature Extraction | Algorithms used to reduce the number of variables or characteristics associated with an input dataset, in order to speed up the processing phase without losing meaningful information [14].
Imaging | Classification or Detection | Probably the most widespread application; it consists in analyzing an image to detect a certain class of objects and/or classify the image into a precise category, according to the objects being detected. An example is described in [40].
Speech | Word Recognition | Technologies and algorithms devoted to analyzing spoken language to recognize words and translate them into machine-readable format and text. Some recent advancements have been made, like algorithms that can generate images from a given text-based description, as reported in [41].
Data Analysis | Classification | The scope of this category of algorithms is to analyze input data and associate them with a specific class through the assignment of a label; an example could be a system that determines whether a given email is spam or not. An example is given in [49].
Imaging, Control, Security | Anomaly Detection | Algorithms to identify events or data that deviate from the normal behaviour of a given dataset. An example is presented in [9].
Neuroscience Research | Simulation | Neuromorphic architectures are far better suited to provide scientists with meaningful simulations that can be conducted in a reasonable amount of time. An example is reported in [10].
Visual Systems | - | Architectures tailored to run algorithms that improve or repair tissues associated with the sense of sight. An example is presented in [3].
Auditory Systems | - | Architectures tailored to run algorithms that improve or repair tissues associated with the sense of hearing. An example is described in [34].
Olfactory Systems | - | Architectures tailored to run algorithms that improve or repair tissues associated with the sense of smell. An example is given in [27].
Somatosensory Systems | - | Architectures tailored to run algorithms that improve or repair tissues associated with the sense of touch. An example is analyzed in [33].
Figure 1.2: A graph showing application domains for which neuromorphic devices
have been developed. The size of boxes is proportional to the number of papers
that have been published for that domain. Picture taken from [50].
Chapter 2
Technical Background
This chapter covers all concepts and aspects needed to understand the thesis workflow: what machine learning is, how and why it is exploited nowadays, the employed Instruction Set Architecture, and the digital hardware architectures involved.
Figure 2.1: McCulloch & Pitts Neuron Model. Picture taken from [8].
f, that will determine the outcome y, according to Equations 2.1 and 2.2:
\[ g(x) = g(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} x_i \tag{2.1} \]
\[ f(g(x)) = \begin{cases} 1, & \text{if } g(x) \ge \theta \\ 0, & \text{if } g(x) < \theta \end{cases} \tag{2.2} \]
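To make the threshold behaviour concrete, here is a minimal Scala sketch of the McCulloch & Pitts unit defined by Equations 2.1 and 2.2 (function and parameter names are illustrative, not part of any cited implementation):

  // McCulloch & Pitts neuron: sum the binary inputs (Eq. 2.1) and
  // fire, i.e. output 1, when the sum reaches the threshold theta (Eq. 2.2).
  def mcCullochPittsNeuron(inputs: Seq[Int], theta: Int): Int = {
    val g = inputs.sum           // g(x) = x1 + x2 + ... + xn
    if (g >= theta) 1 else 0     // f(g(x)) thresholding
  }

For instance, with theta = 2 the unit behaves as a two-input AND gate: mcCullochPittsNeuron(Seq(1, 1), 2) returns 1, while mcCullochPittsNeuron(Seq(1, 0), 2) returns 0.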
   The aforementioned inputs are to be intended as parameters that function f needs in order to determine whether the neuron has to "fire", where firing is nowadays intended as informing all the neurons connected to the y output that something has happened.
   • supervised : the system feeds in data that is known, in order to let the neural network output the data the system would like to predict, transforming one dataset into another.
   • unsupervised : transforms one dataset into another, as happens in the aforementioned supervised learning, but in this case input data or their characteristics are unknown. This implies that an unsupervised learning algorithm typically clusters input data into groups, according to specific features it finds during the processing phase.
   • offline : the input example set is fixed in size, so the neural network is fed one example at a time and synaptic weights are changed according to a cost function. Once the global cost function is minimized, learning is interrupted and the neural network is deployed. It will just perform inference on the incoming input data, leaving synaptic weights fixed for the whole operational time of the device.
   • online : the neural network model learns and adjusts its interconnections after each input datum is processed, so the model continuously learns and adapts. Thus, changes made to synaptic weights solely depend on the actual input data and possibly on the current model state.
Let's make an example ([54]). Imagine having a set of objects to be fed into geometrically shaped holes (e.g. squares, triangles). Babies would probably take an object and try to put it inside any hole until it perfectly fits; this is an example of parameterized learning. Teenagers, instead, would probably count the number of sides of each object and look for a hole with that number of sides before trying to feed the object into it; this is unparameterized learning. In most cases, one could say that parameterized learning is all about trial and error, whereas unparameterized learning is about counting features. Finally, let's have a little digression on the evolution history of Artificial Neural Networks (ANNs).
Still, there are a few advantages that make RISC-V stand out among the available ISAs:
  1. completely open, meaning that the specifications are publicly available and no royalty fees are due to the RISC-V Foundation, a non-profit foundation which shall keep the ISA stable, preventing it from being abandoned, as happened to other closed-source ISAs [42].
  3. easily extensible to support variants that are more suitable to the desired application domain, starting from a basic yet complete integer ISA (RV32I/RV64I), which is usable standalone for customized accelerators or educational purposes. This means it should suit all kinds of hardware, be it a small microcontroller or a powerful supercomputer.
Figure 2.4: RISC-V Instruction Length Encoding Schemes. Picture taken from
[47].
   Base ISAs use either little-endian or big-endian memory systems, with the privileged architecture further defining bi-endian operation. However, the IRAM is filled with 16-bit little-endian parcels, regardless of the memory system endianness, to ensure that the length-encoding bits always appear first in halfword address order; this allows the length of a variable-length instruction to be quickly determined by an Instruction Fetch Unit (IFU) that examines only the first few bits of the first 16-bit parcel.
    RV32I is the most basic instruction set provided, and it must be implemented in any case. It was designed to reduce the hardware needed by the most basic applications and to support modern operating systems. It provides 32 32-bit registers, from x0, which always contains the value 0 and can be used to discard an instruction result, to x31, plus the program counter pc, all of which are XLEN bits wide, with XLEN representing the data width in a given ISA version (32 bits in this case). Although there is no strict requirement on which register must hold the return address of a given subroutine or the stack pointer (the address of the stack holding the passed subroutine variables, growing downwards), the standard software calling convention uses x1 as the link/return register, x5 as the alternative one, and x2 as the stack pointer. The instruction formats were designed to ease the decoding phase as much as possible by:
   • keeping the positions of the source (rs1, rs2) and destination (rd) register fields fixed from one instruction class to another.
     seen in RISC-IV, also known as SPUR. Zero extension for immediate values is not available, as the authors did not find any real advantage in providing it for today's applications.
   There are a total of four main formats, R, I, S and U, plus two variants named SB and UJ, which differ from S and U in the immediate encoding scheme, respectively.
   • Cost : history showed that most companies chose to make their Instruction Set Architectures grow over time, following an approach known as incremental ISA: the instruction set increases its instruction count over time, always keeping instructions that had previously been introduced. The reason is the eagerness to maintain binary compatibility, that is, the possibility of running very old software on modern processors, no matter the consequences and the strain deriving from such a choice. As [42] reports, the consequence for Intel has been a tremendous number of assembly instructions (about 3000), including a few that support the old-fashioned and long-abandoned Binary Coded Decimal (BCD) arithmetic, at the expense of power and occupied area, the latter influencing the die cost quadratically. Moreover, one should consider that yield, that is the fraction of working dies per wafer, decreases as the die area increases. On the contrary, the RISC-V architects decided to have a single basic integer ISA, yet complete for most applications, called RV32I, which is guaranteed to be stable, so it won't ever change in the future. To make the ISA suitable for other applications, be they power-constrained or requiring very high computing capacity, they provide other modular extensions that can be attached on top of RV32I, keeping them optional and labeling the target ISA according to the extensions being used (e.g. RV32IM indicates support for multiply instructions). This choice makes it possible for compilers to produce efficient code that better suits the hardware it runs on, and guarantees that new instructions will be added by the RISC-V Foundation only if there are valid technical reasons justifying their introduction, after discussion by a dedicated commission.
   authors note, this feature helps when there is the need for loading multiple values from memory in common single-instruction-fetch processors, but it could impact the performance of superscalar ones, which rely on scheduling such load instructions in parallel to improve the throughput of the system.
   • Ease of Programming : variables and temporary data are stored in registers, as they are faster to access than main memory. To this extent, RISC-V provides 32 integer registers, whereas proprietary ISAs such as ARM-32 and x86-32 have 16 and 8, respectively. Having 32 registers has proven to be enough for modern applications, although one must note that the cost associated with a higher number of registers, back in the days when MIPS and ARM were born, was too high to make 32 registers affordable. This makes it easier for programmers and compilers to handle complex programs. Moreover, RISC-V instructions typically take one clock cycle, if one ignores cache misses and some other subtleties, whereas complex instructions from Intel x86-32 may take many more cycles; this raises an issue when programming embedded devices, as programmers may want more or less precise timing in such applications. Last but not least, modern applications benefit a lot from support for Position Independent Code (PIC). PIC means having a block of code that can be correctly executed regardless of its absolute address, that is, without the compiler assigning a specific and explicit starting address to the block of code. This is generally possible thanks to Program Counter (PC) relative addressing, and it is often used to support shared libraries.
   • Room for Growth : back in the 70s, Gordon Moore, co-founder of Intel and Fairchild Semiconductor, predicted that the number of transistors integrated on a single chip would double every year, and this remained true until very recently. Back in those days, designers were concerned with squeezing down the number of instructions required per program, so as to reduce the execution time of that program according to the equation
     \[ T_{exec} = \frac{\text{instructions}}{\text{program}} \cdot CPI \cdot \frac{1}{f_{ck}}, \]
     with $f_{ck}$ being the processor clock frequency and $CPI$ the average number of clock cycles per instruction. Today, Moore's law is slowly fading, mainly due to technological issues that hinder the die manufacturing process. Given this, the RISC-V architects chose to make the ISA as modular as possible, leaving enough room for optional and custom extensions that ease the implementation and usage of domain-specific accelerators, such as the one involved in this work. In this regard, a large part of the opcode space of the base RV32I ISA has been reserved for the integration of specialized coprocessors, which may not need any extension beyond the basic RV32I plus a few specialized instructions to perform the tasks they were designed for [46].
one of the few state-of-the-art technologies constantly growing in complexity and efficiency to overcome new challenges, providing fascinating services and features, ranging from image classification and manipulation up to speech recognition and many more that are still being explored. A neural network is a computing system which takes inspiration from the human brain, exploiting its parallel interconnections to solve complex data problems, modifying its internal parameters (training phase) in order to recognize unknown input data with higher accuracy (inference phase). Spiking Neural Networks (SNNs) represent an emerging class of neural networks, coming from neuroscience research and aiming at accurately reproducing both static and dynamic behaviours of human brain neurons. Initially conceived to help neuroscientists enrich their knowledge of the human brain's structure and working principles, SNNs have gathered attention in the computer science field.
   ODIN provides its 256 neurons with all-to-all synaptic interconnections, for a total of $2^8 \cdot 2^8 = 64$k synapses, thus emulating a crossbar which allows every neuron to be connected to every other.
   • REQ : a single line asserted once ADDR is ready and stable, in order to request attention for that event from the module it is connected to. When the ACK signal goes to logic 1, the REQ line is lowered to logic 0.
   • ACK : a single line asserted once the requested event has been received and completely processed. As soon as the REQ signal is de-asserted, this line can be driven to logic 0 as well.
The lines can be an input or an output, depending on whether they are used to
receive data from external devices or to provide it to other modules, as shown in
Figure 2.6.
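To clarify the four-phase protocol carried by these lines, the following Scala sketch models the sender's side of a single AER transaction; the AerBus trait is a hypothetical stand-in for the physical ADDR/REQ/ACK wires, not part of ODIN's RTL:

  // Hypothetical wire-level view of an AER link, sender side.
  trait AerBus {
    var addr: Int = 0          // ADDR lines
    var req: Boolean = false   // REQ line, driven by the sender
    def ack: Boolean           // ACK line, driven by the receiver
  }

  def sendAerEvent(bus: AerBus, eventAddr: Int): Unit = {
    bus.addr = eventAddr       // 1. drive ADDR and keep it stable
    bus.req = true             // 2. assert REQ: a valid event is available
    while (!bus.ack) {}        // 3. wait for the receiver to assert ACK
    bus.req = false            // 4. de-assert REQ
    while (bus.ack) {}         // 5. wait for ACK to drop: link is free again
  }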
Although the standard [55] dictates that every involved unit must be identified through a unique address, in this work ODIN is the only AER-compatible device, so there is no real need to define such an address. An address event consists in packing up the information of a given spiking event and transmitting it over to ODIN. The message packet contains different information depending on the event being represented and transmitted, as will be detailed in the subsections concerning the AER OUTPUT and CONTROLLER modules. The coprocessor architecture is illustrated in Figure 2.6 and consists of the following modules:
   • Master Out Slave In (MOSI), that is, the wire carrying data from the master to the slaves.
   • Master In Slave Out (MISO), that is, the wire carrying data from the slaves to the master. There is only one MISO, so there must be a way to avoid conflicts caused by multiple drivers.
   • SPISS, that is, the slave select wire, connected to the Chip Select input of a slave and driven by the master. Of course there might be multiple SPISS wires, one per slave, or they may be absent in case of daisy-chain connections. A daisy-chain network has a master and all slaves connected in series, so that the MISO of one slave becomes the MOSI of the following one, up to the point where the last slave's MISO is connected to the master. An example of daisy-chain configuration is depicted in Figure 2.8.
Note that master and slaves are always both transmitters and receivers; indeed, a master is just the block that initiates the transmission, which doesn't mean it cannot act as a receiver. Furthermore, if one data-direction wire of a block is not needed, either MOSI or MISO can be removed, but the internal registers will need to have their input tied to a fixed high or low logic value. The slave select of the target slave is active during the whole transmission, while the other slaves must have their SPISS deactivated and their MISO in high impedance, ignoring whatever happens on SPICLK and MOSI. The master generates n clock cycles, transmitting data starting from the MSB; data is generated on a precise clock edge and is sampled on the opposite edge, according to the following parameter:
   • CPHA=0 → data is sampled on the leading (first) edge and changes on the trailing (second) one;
   • CPHA=1 → data changes on the leading (first) edge and is sampled on the trailing (second) one.
Once the LSB is received, the slave's SPISS is deactivated, and SPICLK returns to its idle level.
    Master and slaves must agree on the number n of bits to be transferred, which is typically 8. One bit is transferred per clock cycle, exploiting NRZ-L encoding. SPICLK is idle while there is nothing to be transmitted; its idle level depends on the value of the parameter CPOL.
Figure 2.9: CPOL and CPHA define three parameters of the connection
So in the end, before setting up an SPI interconnection, master and slaves shall agree on the number n of transmitted bits, the clock frequency, CPOL and CPHA. CPOL and CPHA define the idle clock level, the active edge for data output and the active edge for data sampling, according to the table in Figure 2.9. SPI doesn't need any arbitration mechanism, as there is only one master; addressing is obtained through chip select signals, and bit synchronization is guaranteed since the clock is generated by the master only. However, a slow slave cannot stall the master, so SPICLK must be set to a lower frequency in order to make things work, and there is no error-checking protocol. Furthermore, the only shared wire is MISO, which can be driven by multiple slaves, so conflicts must be avoided through careful addressing, making sure that only one SPISS is active at a given time. Finally, performance is in the order of tens of Mbit/s, and SPI can only be implemented over interconnections up to tens of centimeters long (e.g. SD card memories).
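As a summary of the shifting mechanism described above, the following Scala sketch models one byte-wide SPI transfer in mode 0 (CPOL=0, CPHA=0); it is a behavioural illustration only, not a model of ODIN's spi_slave module:

  // One full-duplex SPI byte exchange, MSB first (mode 0 behaviour).
  // Returns (byte received by the master, byte received by the slave).
  def spiTransferByte(masterByte: Int, slaveByte: Int): (Int, Int) = {
    var mosiShift = masterByte & 0xFF
    var misoShift = slaveByte & 0xFF
    var masterIn = 0
    var slaveIn = 0
    for (_ <- 0 until 8) {                                // n = 8 clock cycles
      slaveIn = (slaveIn << 1) | ((mosiShift >> 7) & 1)   // slave samples MOSI
      masterIn = (masterIn << 1) | ((misoShift >> 7) & 1) // master samples MISO
      mosiShift = (mosiShift << 1) & 0xFF                 // shift out next bit
      misoShift = (misoShift << 1) & 0xFF
    }
    (masterIn, slaveIn)
  }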
2.3.3    Controller
The whole system is administered by a controller. It is a Moore Finite State Machine (FSM), as its outputs solely depend on the current state, and it moves from one state to another depending on the ongoing operations. The states are listed hereby:
   • WAIT : this state makes ODIN wait until the AEROUT bus is freed, while it is transmitting a number of AER transactions.
     Indeed, each of these operations lasts 40 SPI clock cycles, so the controller leaves the state only once the last bit of data on the SPI bus has been correctly loaded.
   • WAIT REQDN : the state the controller moves to after an input AER event has either been loaded into the scheduler (i.e. the previous state is PUSH) or has just ended (e.g. a BIST event); the controller shall wait for the input AER REQ signal to be driven low before moving on.
The controller also handles AER requests coming from the input AER signals,
according to events listed in Table 2.4 and Table 2.5.
 • Single Synapse (Address<16> = 1, Address<15:8> = pre_neur<7:0>, Address<7:0> = post_neur<7:0>; 2 cycles): stimulates the neuron at address post_neur<7:0> with the synaptic weight associated to the pre-synaptic neuron address pre_neur<7:0>. Ignores the value of the mapping table bit.
 • Single Neuron Time Reference (Address<16> = 0, Address<15:8> = neur<7:0>, Address<7:0> = 0xFF; 2 cycles): activates a time reference event for neuron neur<7:0> only.
 • All Neurons Time Reference (Address<16> = 0, Address<15:8> = Don't Care, Address<7:0> = 0x7F; 2 * 256 cycles): activates a time reference event for all neurons.
 • Single Neuron Bistability (Address<16> = 0, Address<15:8> = neur<7:0>, Address<7:0> = 0x80; 128 cycles): activates a bistability event for all synapses in the dendritic tree of neuron neur<7:0>.
 • All Neurons Bistability (Address<16> = 0, Address<15:8> = Don't Care, Address<7:0> = 0x00; 128 * 256 = 32k cycles): activates a bistability event for all synapses in the crossbar array.
2.3.5 Scheduler
    The scheduler represents one of the two crucial ODIN components, the other one being the FSM-based controller, and is sketched in Figure 2.10. It is inspired by priority-based ordered First In First Out (FIFO) memory structures [39]. The scheduler is responsible for handling spiking and bursting (i.e. multiple spikes in sequence) events, distinguishing whether they come from ODIN neurons or from other neuromorphic devices through the input AER interface. Every spiking event is encoded into a 14-bit-wide packet, consisting of:
   • the spiking neuron address;
   • the number of spikes - 1.
FIFOs, each one having space for 4 events, each storing the corresponding neuron address. For an exhaustive description, please refer to Section 2.2.1.3 of [21].
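As an illustration of such a packet, the Scala sketch below packs a neuron address and a burst length into 14 bits, using an assumed layout (8-bit neuron address in the low bits, number of spikes - 1 in the upper 6 bits); the exact field order used by ODIN's scheduler is the one specified in [21]:

  // Assumed 14-bit event layout: | spikes-1 (6 bits) | neuron address (8 bits) |
  def packSchedulerEvent(neuronAddr: Int, numSpikes: Int): Int =
    (((numSpikes - 1) & 0x3F) << 8) | (neuronAddr & 0xFF)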
Figure 2.11: ODIN Neurons SRAM Organization. Picture taken from [7].
    The neuron core handles events related to neuron state changes. In particular, it provides information taken from the neuron state SRAM to the neuron update logic blocks, so that they can determine whether the neuron has to fire or not, and it updates the SRAM accordingly. The neuron state SRAM is made of $2^8 = 256$ words, each 128 bits wide, for a total of 4 kB of memory. The memory layout is depicted in Figure 2.11. The memory input address corresponds to the 8-bit-wide neuron address, whereas the byte to be written or read is selected according to a 4-bit-wide selection address which is given through SPI when an operation on the neuron memory is requested.
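As a quick sanity check, the stated capacity follows directly from the memory geometry:
\[ 2^{8}\,\text{words} \times 128\,\tfrac{\text{bit}}{\text{word}} = 32768\,\text{bit} = 4096\,\text{B} = 4\,\text{kB}, \]
and the 4-bit selection address is consistent with the 128/8 = 16 bytes contained in each word.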
Figure 2.12: ODIN Synapses SRAM Organization. Picture taken from [7].
determines whether the upper or lower half of the selected byte is to be written or read, according to the requested memory operation. However, this does not seem to hold in the implemented architecture, as the mechanism choosing either the upper or lower half of the selected byte is based on a byte of masking bits, as described in Section 2.3.2. In order to benefit from the usage of high-density SRAMs, each word consists of 32 bits, storing a total of 8 synapse data blocks. Indeed, each synapse is characterized by a 3-bit-wide weight value and a so-called mapping table bit, which serves the purpose of allowing a given synapse to either exploit the online SDSP learning mechanism or keep its value static.
    In the neuroscience domain, the term plasticity refers to the possibility for
biological systems to modify synaptic strengths from time to time, according to
\[ \begin{cases} w \to w + A^{+}, & \text{if } V_{mem}(t_{presynaptic}) \ge \theta_m \text{ and } \theta_1 \le Ca(t_{presynaptic}) < \theta_3 \\ w \to w - A^{-}, & \text{if } V_{mem}(t_{presynaptic}) < \theta_m \text{ and } \theta_1 \le Ca(t_{presynaptic}) < \theta_2 \end{cases} \tag{2.4} \]
The equations involving the Ca variable implement a rule that restricts the learning process to well-established conditions, as described in [5]. Indeed, the Ca variable indicates the calcium concentration in the neuron and gives some insight into its recent firing activity. If it exceeds the ranges delimited by θ1, θ2 and θ3, the learning process could lead to a phenomenon called overfitting: the neural network state evolves so that it correctly predicts the input samples it was trained on, but its prediction accuracy drops considerably when it handles previously unseen data samples. This implies that the model has actually memorized the input examples, rather than the correlation existing between inputs and outputs. The rule formulated in [5] belongs to the early stopping methods, a modern approach to prevent overfitting. Last but not least, the author implemented a set of rules to handle bistable synapses in such networks. The solution is originally described in [29] and further modified in [5]. Indeed, Complementary Metal Oxide Semiconductor (CMOS) technology is ubiquitous in digital design, but it is not well suited to storing analog values for relatively long amounts of time, due to the parasitic capacitances and leakage currents affecting such structures; stored values may substantially change over time, which is unacceptable. The adopted solution provides a comparator circuit that matches the actual synaptic weight against a fixed threshold and either increases or decreases the weight, depending on whether the actual value is above or below that threshold, respectively. Please note that the bistability and membrane potential leakage mechanisms (the latter commonly referred to simply as the leakage mechanism in [7]) are not automatically handled by any synapse-related logic: specific events have to be generated and sent to ODIN's AER input interface, according to Table 2.4 and Table 2.5.
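The drift behaviour described above can be summarized by the following Scala sketch, where the weight bounds and the unit step size are illustrative assumptions rather than ODIN's actual parameters:

  // On each bistability event, the weight drifts one step toward a stable
  // high or low state, depending on its position relative to the threshold.
  def bistabilityStep(w: Int, threshold: Int, wLow: Int, wHigh: Int): Int =
    if (w >= threshold) math.min(w + 1, wHigh)  // drift toward the high state
    else math.max(w - 1, wLow)                  // drift toward the low state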
Figure 2.15: An example of two Rocket tiles, one with a Berkeley Out-of-Order Machine (BOOM) core and the other with a standard Rocket core, inside a complete SoC. Picture taken from [2].
Figure 2.17: MESI Cache Coherence Protocol - State Diagram. Picture taken from
[32].
   • Modified : the cache line is present only in the current cache, and its content differs from the value stored in the main shared memory, so it holds a so-called dirty value. The value will have to be written back to main memory sooner or later, and the cache line holding it will be labeled Exclusive once that happens.
   • Exclusive : the cache line is present only in the current cache, and the stored value matches the one held by the shared main memory, so it is a so-called clean value. The state of the cache line may switch to Modified if the core holding that cache modifies the line, or to Shared if it detects that another core requests that data.
   • Shared : the cache line holds a clean value which might be stored in other cores' caches as well, so the data can also be retrieved elsewhere, and it matches the value stored in the main shared memory. The line may switch to Invalid once the same data is modified by another core.
   • Invalid : the line value is no longer valid, as some other core modified the data elsewhere.
A state diagram illustrating the MESI states is depicted in Figure 2.17 to give a
clearer overview.
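For reference, the transitions of Figure 2.17 can be condensed into the following Scala sketch; it covers the common cases only and abstracts away write-backs and bus signalling details:

  sealed trait MesiState
  case object Modified extends MesiState
  case object Exclusive extends MesiState
  case object Shared extends MesiState
  case object Invalid extends MesiState

  sealed trait BusEvent
  case object LocalRead extends BusEvent
  case object LocalWrite extends BusEvent
  case object RemoteRead extends BusEvent
  case object RemoteWrite extends BusEvent

  def nextState(s: MesiState, e: BusEvent): MesiState = (s, e) match {
    case (Invalid, LocalRead)    => Shared    // Exclusive if no other sharer
    case (Invalid, LocalWrite)   => Modified
    case (Exclusive, LocalWrite) => Modified
    case (Exclusive, RemoteRead) => Shared
    case (Shared, LocalWrite)    => Modified  // other copies are invalidated
    case (Modified, RemoteRead)  => Shared    // after writing the line back
    case (_, RemoteWrite)        => Invalid   // another core modified the data
    case (state, _)              => state     // all other cases keep the state
  }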
   • Dromajo: a RV64GC emulator primarily used for co-simulation, originally developed by Esperanto Technologies.
2.4.3    Simulators
The Chipyard environment can take advantage of three different simulators.
Verilator
VCS
VCS is a functional simulator made by Synopsys®. Provided that the user has a valid product license, Chipyard offers wrappers to build VCS-based simulators from the given Scala and Chisel files, together with all the features expected from such tools, including faster compilation times and VCD trace support.
FireSim
FireSIM is an open source hardware simulation platform that runs on EC2 F1
FPGAs hosted on the Amazon Web Services (AWS). Its primary target is allowing
anyone to evaluate the performances of an hardware design at FPGA specific speeds,
among other possibilities, such as simulating datacenters and having profiling tools.
• Main : PULPv1, PULPv2, PULPv3, PULPv4
• Mixed-signal : VivoSoC, EdgeSoc
Since the design space they are exploring is pretty huge, the open-source list started with a simple yet effective single-core platform named PULPino.
2.5.1     PULPino
PULPino is a small microcontroller with no caches, no memory hierarchy and no DMA, built from IPs of the PULP project and working on FPGA. As one can see from Figure 2.18, the core is connected to two separate single-port instruction and data RAMs, each one accessible in a single clock cycle with no wait states, which all contributes to the small power consumption. An AXI4 interconnect provides the connection to the two RAMs and to the other peripherals, through an APB bridge, which allows for high flexibility as long as one uses components suited for
it or AXI. The peripherals on the bottom left, namely GPIOs, UART, I2C and the SPI master, are used to communicate with external devices and are fine-grained clock gated, meaning that they can be shut down whenever not needed. The core is in a low-power state whenever there is nothing to be done, and a simple event unit, which waits for an interrupt or an event caused by a peripheral, inhibits the clock gating of the core and wakes it up. Furthermore, an SPI slave is provided to allow external devices to access the entire memory map of PULPino, and an advanced debug unit, accessible via JTAG, allows for debugging. Finally, a boot ROM has been integrated in order to allow users to employ PULPino as a standalone system, simply loading a bootloader into the core through an external SPI flash memory. Moving from OpenRISC to RISC-V was justified by the need for an easily extensible ISA and the possibility of having fewer and possibly compressed instructions, which overall helps in reducing power consumption. A couple of extensions worth mentioning are:
  2. Post Incrementing Load and Store : load and store are often part of pat-
     terns where the target addresses are repeatedly incremented, but this is done
2.5.2    PULPissimo
PULPissimo is the most recent and advanced single-core platform of the PULP project. Although it hosts only one core, it is used as the main controller for applications in which there are multiple cores, thereby providing:
   • a new memory subsystem devised to improve both performance and power consumption;
   • a new Software Development Kit (SDK), containing tools and runtime support for all PULP microcontrollers, with procedures for setting up cores and peripherals, so that application developers can exploit their full potential.
RI5CY
RI5CY is an in-order single-issue core with 4 pipeline stages and full support for the RV32I, RV32C and RV32M RISC-V instruction sets. It can be configured to support single-precision floating-point instructions, that is, the RV32F extension. Moreover, it provides a number of specific features:
• interrupts;
• events, allowing the core to sleep and wait for an event (as seen for PULPino);
• exceptions.
ZERO-RISCY
ZERO-RISCY has been designed to target application domains with strong area and power constraints. Further details are available at [13]. ZERO-RISCY has recently been taken over by the lowRISC foundation, which provides full documentation at [17].
Chapter 3
Methods
This chapter provides a deep dive into the thesis workflow, giving details on every step needed to obtain a working simulation environment and a properly set-up architecture consisting of ODIN and Rocket Chip interfaced by means of SPI. Chipyard has been selected as the RISC-V SoC generator for this thesis, because it felt easier to use and offered a very high degree of configuration choices. First, ODIN and Chipyard will be properly downloaded and configured; then the accelerator will be integrated inside the Chipyard compilation flow, so that it can be instantiated in the final architecture. Finally, the configuration procedure of ODIN will be explained and applied to a specific neural network. The steps of the workflow are summarized in the diagram of Figure 3.1. All steps up to ODIN Configuration Setup are outlined in this chapter, whereas RTL Simulation and Synthesis are carried out in Chapter 4.
Let's analyze the code contained in ODIN.scala. In Listing 3.1, the parameters needed to build ODIN are declared. In this simple case there are only three: N and M indicate the number of neurons and the number of bits needed to represent a neuron address, respectively, whereas address is the base address associated to ODIN in the memory map of the SoC.
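Since Listing 3.1 itself is not reproduced here, the following sketch shows a plausible shape of that case class, reconstructed from the description above; the default base address is an assumption, while N = 256 and M = 8 are the defaults used later in the WithODIN fragment (Listing 3.10):

  // Plausible reconstruction of ODINParams (Listing 3.1); the address
  // default is an assumption, not necessarily the value of the actual design.
  case class ODINParams(
    address: BigInt = 0x2000, // base address in the SoC memory map (assumed)
    N: Int = 256,             // number of neurons
    M: Int = 8                // bits per neuron address, log2(N)
  )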
 1   case object ODINKey extends Field[Option[ODINParams]](None)
                                    Listing 3.2: ODINKey
In Listing 3.2, a key is declared with Option type; a key is used by Chisel to specify which parameters are needed to customize the functionalities of a given module. Keys are declared as Option, with default value None, meaning that if the user doesn't explicitly set ODINKey with parameters, the default values specified in ODINParams are selected.
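A typical way such a key is consumed elsewhere in the generator, sketched here for illustration only (p is the implicit Parameters instance available in generator code), is to pattern-match on the Option so that ODIN is only instantiated when the key is set:

  // Illustrative consumption of ODINKey inside a generator body.
  p(ODINKey) match {
    case Some(params) => // instantiate ODIN using params.N, params.M, params.address
    case None         => // this configuration contains no ODIN instance
  }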
 1   class ODINIO(val N: Int, val M: Int) extends Bundle {
 2     val CLK = Input(Clock())
 3     val RST = Input(Bool())
 4     val SCK = Input(Clock())
 5     val MOSI = Input(UInt(1.W))
 6     val MISO = Output(UInt(1.W))
 7     val AERIN_ADDR = Input(UInt((2*M+1).W))
 8     val AERIN_REQ = Input(UInt(1.W))
 9     val AERIN_ACK = Output(UInt(1.W))
10     val AEROUT_ADDR = Output(UInt(M.W))
11     val AEROUT_REQ = Output(UInt(1.W))
12     val AEROUT_ACK = Input(UInt(1.W))
13   }
                                    Listing 3.3: ODINIO
In Listing 3.3, the top-level signals of ODIN are declared, with single-bit signals being of Bool or 1-bit UInt type and multi-bit ones being declared as unsigned integers whose width is parameterized according to the value of M. The ".W" expression serves the purpose of converting the Scala integer value inside the parentheses to the Chisel Width type.
 1   trait   HasODINIO extends BaseModule {
 2     val   N: Int
 3     val   M: Int
 4     val   io = IO(new ODINIO(N,M))
 5   }
Listing 3.4: HasODINIO
In Listing 3.4 the IO bundle required by the blackbox integration is declared. No-
tice that top level parameters are indicated as well, as they are needed for the
input/output definition.
 1   class ODINBlackBox(val N: Int, val M: Int) extends BlackBox(Map("N" -> IntParam(N), "M" -> IntParam(M))) with HasBlackBoxResource
 2     with HasODINIO
 3   {
 4       addResource("/vsrc/ODINBlackBox.v")
 5       addResource("/vsrc/aer_out.v")
 6       addResource("/vsrc/controller.v")
 7       addResource("/vsrc/fifo.v")
 8       addResource("/vsrc/izh_calcium.v")
 9       addResource("/vsrc/izh_effective_threshold.v")
10       addResource("/vsrc/izh_input_accumulator.v")
11       addResource("/vsrc/izh_neuron_state.v")
12       addResource("/vsrc/izh_neuron.v")
13       addResource("/vsrc/izh_stimulation_strength.v")
14       addResource("/vsrc/lif_calcium.v")
15       addResource("/vsrc/lif_neuron_state.v")
16       addResource("/vsrc/lif_neuron.v")
17       addResource("/vsrc/neuron_core.v")
18       addResource("/vsrc/sdsp_update.v")
19       addResource("/vsrc/spi_slave.v")
20       addResource("/vsrc/synaptic_core.v")
21       addResource("/vsrc/scheduler.v")
22   }
                                   Listing 3.5: ODINBlackBox
In Listing 3.5, the blackbox class is declared. Notice that the parameters are mapped as integers, and that the traits HasBlackBoxResource and HasODINIO are appended to the class declaration. The body of the class just lists all the Verilog files needed for the design description.
 1   trait ODINModule extends HasRegMap {
 2
 3     val io: ODINTopIO
 4     implicit val p: Parameters
 5     def params: ODINParams
 6     val clock: Clock
 7     val reset: Reset
 8
 9     val impl = Module(new ODINBlackBox(params.N, params.M))
10
11     val aerin_ack = RegInit(0.U(1.W))
12     val aerout_ack = RegInit(0.U(1.W))
13     val aerin_req = RegInit(0.U(1.W))
14     val aerout_req = RegInit(0.U(1.W))
15     val aerin_addr = RegInit(0.U((2*params.M+1).W))
16     val aerout_addr = RegInit(0.U((params.M).W))
17     impl.io.CLK := clock
18     impl.io.RST := reset.asBool
19     impl.io.SCK := clock
20
21     impl.io.AEROUT_ACK := aerout_ack
22     aerout_addr := impl.io.AEROUT_ADDR
23     aerout_req := impl.io.AEROUT_REQ
In Listing 3.6, a trait is needed to describe the interconnections of ODIN's top-level signals. Let's analyze this trait. An impl val is declared to instantiate the ODIN module, then all required input and output signals are declared as well, making sure that those related to the input and output AER interfaces are declared as registers. Every impl signal is configured so that it appears on the left side of the := assignment operator if the signal is an input to ODIN, whereas it appears on the right side of the operator if it is a signal coming out of the ODIN module. The AER-specific signals must be declared as registers because ODIN will be integrated as a MMIO peripheral, that is, a device the processor communicates with through memory-mapped registers, exploiting TileLink interconnects.
Once ODIN's signals are properly interconnected, it is necessary to inform the Chipyard toolchain on how those registers are to be inserted into the SoC memory map. In particular, aerout_req will be a read-only register, 1 bit wide, placed at offset 0x00 with respect to the base address of the ODIN module, which is indicated by the address parameter in the ODINParams class, whereas aerout_ack will be a write-only register, 1 bit wide, placed at offset 0x0C. Similar considerations apply to the other ODIN signals, as illustrated in Listing 3.6 at lines 31-43.
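A condensed sketch of the corresponding register-map description is shown below; only the two offsets discussed above are included (the remaining AER registers at lines 31-43 of Listing 3.6 follow the same pattern), using the regmap helper provided by HasRegMap and RegField from freechips.rocketchip.regmapper:

  // Sketch of the regmap call inside ODINModule (offsets from the text).
  regmap(
    0x00 -> Seq(RegField.r(1, aerout_req)), // read-only: ODIN output request
    0x0C -> Seq(RegField.w(1, aerout_ack))  // write-only: CPU acknowledge
  )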
 1   class ODINTL(params: ODINParams, beatBytes: Int)(implicit p: Parameters)
 2     extends TLRegisterRouter(
 3      params.address, "odin", Seq("ucvlouvain,odin"),
 4      beatBytes = beatBytes)(
10       }
11   }
                        Listing 3.9: ODIN Top level implementation
Listing 3.9 shows the top-level trait needed to make the top-level ports visible. A few things still need to be addressed before compiling the Chipyard project and having the synthesizable Verilog ready.
 1   class WithODIN extends Config((site, here, up) => {
 2     case ODINKey => Some(ODINParams(N = 256, M = 8))
 3   })
                            Listing 3.10: ODIN Config Fragment
Listing 3.10 describes the config fragment needed to put ODIN in whatever config-
uration class. A more configurable version would be
 1   class WithODIN(N: Int, M: Int) extends Config((site, here, up) => {
 2     case ODINKey => Some(ODINParams(N = N, M = M))
 3   })
but the one shown in Listing 3.10 is used in order to have ODIN instantiated with the default values suggested by its author, as a note in the ODIN documentation states that the crossbar scales according to N and M, but other ODIN modules do not automatically scale, and proper modifications are needed if one changes M, N or both.
 1   class DigitalTop(implicit p: Parameters) extends System
 2     with testchipip.CanHaveTraceIO                  // Enables optionally adding trace IO
 3     with testchipip.CanHaveBackingScratchpad        // Enables optionally adding a backing scratchpad
 4     with testchipip.CanHavePeripheryBlockDevice     // Enables optionally adding the block device
 5     with testchipip.CanHavePeripherySerial          // Enables optionally adding the TSI serial-adapter and port
 6     with sifive.blocks.devices.uart.HasPeripheryUART        // Enables optionally adding the sifive UART
 7     with sifive.blocks.devices.gpio.HasPeripheryGPIO        // Enables optionally adding the sifive GPIOs
 8     with sifive.blocks.devices.spi.HasPeripherySPIFlash     // Enables optionally adding the sifive SPI flash controller
 9     with icenet.CanHavePeripheryIceNIC              // Enables optionally adding the IceNIC for FireSim
10     with chipyard.example.CanHavePeripheryInitZero  // Enables optionally adding the initzero example widget
11     with chipyard.example.CanHavePeripheryGCD       // Enables optionally adding the GCD example widget
12     with chipyard.example.CanHavePeripheryODIN      // Enables optionally adding the ODIN example widget
13     with chipyard.example.CanHavePeripheryStreamingFIR      // Enables optionally adding the DSPTools FIR example widget
14     with chipyard.example.CanHavePeripheryStreamingPassthrough  // Enables optionally adding the DSPTools streaming-passthrough example widget
Listing 3.11: Chipyard DigitalTop part 1
There are still two steps to take before ODIN can be used in the Chipyard framework. The first one consists in locating the file DigitalTop.scala in generators/chipyard/src/main/scala and modifying the class DigitalTop as shown in Listing 3.11, adding the line with chipyard.example.CanHavePeripheryODIN.
 1   class DigitalTopModule[+L <: DigitalTop](l: L) extends SystemModule(l)
 2     with testchipip.CanHaveTraceIOModuleImp
 3     with testchipip.CanHavePeripheryBlockDeviceModuleImp
 4     with testchipip.CanHavePeripherySerialModuleImp
 5     with sifive.blocks.devices.uart.HasPeripheryUARTModuleImp
 6     with sifive.blocks.devices.gpio.HasPeripheryGPIOModuleImp
 7     with sifive.blocks.devices.spi.HasPeripherySPIFlashModuleImp
 8     with icenet.CanHavePeripheryIceNICModuleImp
 9     with chipyard.example.CanHavePeripheryGCDModuleImp
10     with chipyard.example.CanHavePeripheryODINModuleImp
11     with freechips.rocketchip.util.DontTouch
                           Listing 3.12: Chipyard DigitalTop part 2
The last step consists in modifying the very same file, this time adding the line with chipyard.example.CanHavePeripheryODINModuleImp to the class DigitalTopModule, as depicted in Listing 3.12.
 1   class ODINRocketConfig extends Config(
 2     new chipyard.iobinders.WithUARTAdapter ++       // display UART with a SimUARTAdapter
 3     new chipyard.iobinders.WithTieOffInterrupts ++  // tie off top-level interrupts
 4     new chipyard.iobinders.WithBlackBoxSimMem ++    // drive the master AXI4 memory with a blackbox DRAMSim model
 5   //new chipyard.iobinders.WithSimAXIMem ++
 6     new chipyard.iobinders.WithTiedOffDebug ++      // tie off debug (since we are using SimSerial for testing)
 7   //new chipyard.iobinders.WithSimDebug ++
 8     new chipyard.iobinders.WithSimSerial ++         // drive TSI with SimSerial for testing
 9     new testchipip.WithTSI ++                       // use testchipip serial offchip link
10     new chipyard.example.WithODIN ++
11     new chipyard.config.WithBootROM ++              // use default bootrom
12     new chipyard.config.WithUART ++                 // add a UART
13     new chipyard.config.WithL2TLBs(1024) ++         // use L2 TLBs
14
Indeed, one should modify the module ODINTL as illustrated in Listing 3.14, then look for impl SCK and impl MOSI, which lie inside the ODINTL module, and connect them according to Listing 3.15. Yet another modification should be applied to the instantiated ODIN module: look for "ODINTL odin" inside the file and modify it according to Listing 3.16. These changes are necessary to properly interface the SPIFlash master with ODIN. Finally, run make CONFIG=SmallSPIFlashODINRocketConfig BINARY=../../tests/spiflashread.riscv run-binary-debug -j4 again on the terminal, still with sims/verilator as the working directory, in order to recompile the whole architecture and produce the simulation executable. You'll notice that an error is detected, as shown in the following code snippet:
1   [0] %Error: plusarg_file_mem.sv:50: Assertion failed in TOP.TestHarness.spi_mem_0.memory
    This error arises because the simulator doesn't know where to get the file that the SPIFlash module should read to feed ODIN with the configuration patterns. To fix the error, one should issue the following command:
1   /home/andrea/Documents/SNN_Gianvito/Papers/RocketChip/chipyard/sims/verilator/simulator-chipyard-SmallSPIFlashODINRocketConfig-debug +permissive +dramsim +spiflash0=/home/andrea/Documents/SNN_Gianvito/Papers/RocketChip/chipyard/tests/spiflash_odin.img +verbose -v spiflashread.chipyard.TestHarness.SmallSPIFlashODINRocketConfig.vcd +permissive-off ../../tests/spiflashread.riscv </dev/null 2> >(spike-dasm > spiflashread.chipyard.TestHarness.SmallSPIFlashODINRocketConfig.out) | tee spiflashread.chipyard.TestHarness.SmallSPIFlashODINRocketConfig.log
            Listing 3.17: Command to run ODIN + ROCKET simulation.
You will notice this is almost the same command issued before the error occurred:
the binary file to be read by the SPIFlash master module is now specified
with a path given through the +spiflash0 plusarg, as reported in Listing 3.17. Launching
this command starts the software RTL simulation and, after a given amount of time,
the process ends, providing three files:
   • spiflashread.chipyard.TestHarness.SmallSPIFlashODINRocketConfig.out: a long
     listing of all RISC-V assembly instructions being executed, starting
     from the base address of the bootrom, up to the end of the spiflashread program.
     An excerpt is shown in Listing 3.18. Each line reports the instruction being
     executed (e.g. auipc, 0x0), the simulation cycle it refers to (here the 19th) and
     the core hart (C0 for hart #0), the destination register being written (here r10)
     and, from a few cycles earlier, when the instruction was in the decode stage, the
     two source registers read by the processor (here both r0). Indeed,
     each line in this file refers to the write-back stage of the 5-stage pipeline of
     RocketChip, that is, the last stage of execution of a given instruction, during
     which the result, if any, is written back into the register file.
   • spiflashread.chipyard.TestHarness.SmallSPIFlashODINRocketConfig.log: a log
     of the entire simulation, listing the output of the modules involved. In
     this work it just contains the debug information printed by the program
     being simulated, spiflashread.riscv, which calls printf at specific points
     to inform the user about the status of the simulation.
   • spiflashread.chipyard.TestHarness.SmallSPIFlashODINRocketConfig.vcd: the
     Value Change Dump (VCD) file, that is, a dump of the waveforms involved
     in the simulation, which can be read by any VCD viewer, such as the
     open-source GTKWave.
  2. Add Synapse. Once selected, the routine asks for the presynaptic and postsynap-
     tic neuron numbers, that is, the neurons one wants to connect through the
     synapse being set up. Then it asks whether to set the mapping table bit
     and which weight value to assign to the established synapse (a minimal sketch
     of how such a value might be packed is given after this list).
     This process is shown in Figure 3.3 and Figure 3.4.
  3. Add Neuron. When this option is selected, all parameters listed in Figure
     3.2 have to be tuned. First the routine asks for the number of the neuron one
     wants to customize, then all LIF-specific parameters can be specified. Once
     determined, the program echoes back the values that were specified, together
     with a brief indication of the parameter each byte refers to.
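To make the Add Synapse step concrete, the following minimal C sketch shows how
a single synapse value could be packed before being shipped to ODIN. The 4-bit
layout (a mapping-table bit plus a 3-bit weight) follows the parameters the routine
asks for; the exact bit order and the helper name pack_synapse are illustrative
assumptions, not the actual odin_configurator code.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: pack one ODIN synapse value from the two
 * quantities the Add Synapse routine asks for (mapping-table bit
 * and 3-bit weight). Bit order inside the nibble is an assumption. */
static uint8_t pack_synapse(uint8_t mapping_bit, uint8_t weight)
{
    return (uint8_t)(((mapping_bit & 0x1) << 3) | (weight & 0x7));
}

int main(void)
{
    /* Example: mapping bit set, weight 2. */
    uint8_t syn = pack_synapse(1, 2);
    printf("synapse nibble: 0x%X\n", syn); /* prints 0xA with this bit order */
    return 0;
}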
Figure 3.6: List of SPI configuration registers that can be set up through the
odin_configurator application.
Figure 3.7: Configuration of SPI_SYN_SIGN for neurons in range [15,0]. Value 4
means that neuron number 2 will have inhibitory synapses, whereas all others in
that range will have excitatory synapses.
One may notice that some pictures concerning the odin_configurator application
report the string "little-endian" on the terminal. This is because the program
can handle both little and big endian host systems, even if most desktop systems
run little endian processors. Please note that, at the time of writing, little and
big endian support is available for synapse and neuron configurations only. A
minimal sketch of such a runtime endianness check is given below.
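The following is a minimal, self-contained C sketch of how a configurator like this
one might detect the host byte order at run time; it is an illustrative assumption,
not the actual odin_configurator implementation.

#include <stdint.h>
#include <stdio.h>

/* Detect host endianness at run time by inspecting the first byte
 * of a known 16-bit pattern. */
static int host_is_little_endian(void)
{
    uint16_t probe = 0x0001;
    return *(const uint8_t *)&probe == 0x01;
}

int main(void)
{
    printf("%s-endian\n", host_is_little_endian() ? "little" : "big");
    return 0;
}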
Chapter 4
Results & Discussion
The RTL simulation verifies whether the synfire chain is correctly stimulated and working. A synfire chain is
a feed-forward network (i.e. the neurons are arranged so that there are no
cycles) consisting of many layers of neurons. All synapses are excitatory, meaning
that the postsynaptic neuron membrane potential increases whenever one or more
presynaptic neurons fire. Once the first layer of neurons shows some
firing activity, all the subsequent layers are excited and fire, giving birth to a volley
of spikes synchronously propagating from one layer to the next [1]. The synfire chain
has been chosen as a benchmark to validate the proposed architecture because of its
simple and predictable behaviour. An example of a synfire chain comprising 8 neu-
rons, which constitutes the network configured in the simulation,
is shown in Figure 4.3.
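The ring topology of Figure 4.3, with each neuron driving the next and neuron 7
closing the loop back to neuron 0, can be enumerated compactly. The following C
sketch lists the excitatory connections of the 8-neuron chain; it is illustrative only,
mirroring the network that is configured later via SPI.

#include <stdio.h>

#define SYNFIRECHAIN_NEURONS 8

int main(void)
{
    /* Each neuron i excites neuron (i + 1) % 8; neuron 7 closes the
     * ring back to neuron 0, so the spike volley keeps circulating. */
    for (int pre = 0; pre < SYNFIRECHAIN_NEURONS; pre++) {
        int post = (pre + 1) % SYNFIRECHAIN_NEURONS;
        printf("excitatory synapse: neuron %d -> neuron %d\n", pre, post);
    }
    return 0;
}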
Figure 4.2: ODIN SPI Slave - Transmission Details. Every write or read operation
consists of 40 clock cycles: the first 20 are devoted to transmitting the operation
address, while the subsequent 20 transfer either the data to be written, according
to Table 4.2, or the data requested by the external master.
Figure 4.3: Synfire Chain Network composed of 8 neurons. This is the network
that serves as the validation example for the final architecture comprising ODIN
and RocketChip.
The program below runs on RocketChip to drive the SPI master, let it configure ODIN,
and gather the results from the output AER interface of the
ODIN coprocessor. The original program was developed by the authors of the Chipyard
framework and retained its original filename. Initially conceived to configure the
SPI master and tinker with its parameters to read data from a specific area of
memory, it has been modified to suit the needs of ODIN.
 1   #include <stdlib.h>
 2   #include <stdio.h>
 3
 4   #include "mmio.h"
 5   #include "spiflash.h"
 6   #define ODIN_AEROUT_REQ 0x2000
 7   #define ODIN_AEROUT_ADDR 0x2004
 8   #define ODIN_AEROUT_ACK 0x2008
 9   #define ODIN_AERIN_REQ 0x200C
10   #define ODIN_AERIN_ADDR 0x2010
11   #define ODIN_AERIN_ACK 0x2014
12   #define SYNFIRECHAIN_NEURONS 8
13   int main(void)
14   {
15     int i;
16
17     spiflash_ffmt ffmt;
18     uint8_t neurons[SYNFIRECHAIN_NEURONS];
19
20     ffmt.fields.cmd_en = 1;
21     ffmt.fields.addr_len = 4;   // Valid options are 3 or 4 for our model
22     ffmt.fields.pad_cnt = 0;    // Our SPI flash model assumes 8 dummy cycles for fast reads, 0 for slow
23     ffmt.fields.cmd_proto = SPIFLASH_PROTO_SINGLE;  // Our SPI flash model only supports single-bit commands
24     ffmt.fields.addr_proto = SPIFLASH_PROTO_SINGLE; // We support both single and quad
25     ffmt.fields.data_proto = SPIFLASH_PROTO_SINGLE; // We support both single and quad
26     ffmt.fields.cmd_code = 0x13; // Slow read, 4-byte address
27     ffmt.fields.pad_code = 0x00; // Not used by our model
28
29
30     printf("Initiating ODIN configuration...\n");
31     configure_spiflash(ffmt);
32     test_spiflash(0x0,0x2c6,0);
33     //test_spiflash(0x0,0xabe,0);  // 32 neurons
34     //test_spiflash(0x0,0x551e,0); // 256 neurons
35     //test_spiflash(0x0,0x36B,0);  // 10 neurons
36     /* Adapt according to how many neurons and synapses are configured */
37     printf("ALL NEURONS CONFIGURED!\n");
38     reg_write32(ODIN_AERIN_ADDR,0x0021);
39
40     reg_write32(ODIN_AERIN_REQ,1);
  4. line 31 calls the function devoted to setting the internal SPI flash master
     module parameters according to the values in the ffmt structure, whereas line 32
     effectively starts the SPI transmission for all bytes in the range 0x0-0x2c6
     (710 bytes).
  5. lines 38-42 set the memory-mapped input AER registers to the values needed to
     request a VIRTUAL SYNAPSE event targeting neuron 0. Note that the
     AERIN_REQ signal is lowered only after AERIN_ACK is asserted.
  6. lines 45-52 handle output AER events as soon as neuron i fires. In particular,
     the software continuously reads (a polling technique, since no interrupts are
     available for this peripheral) the memory-mapped ODIN_AEROUT_REQ
     register until its value becomes logic 1. This means that a neuron fired, and
     the neuron address is stored into neurons[i]. Then the software sends an ac-
     knowledge through the ODIN_AEROUT_ACK register and waits for
     ODIN_AEROUT_REQ to be lowered in response to the acknowledge. Fi-
     nally, the acknowledge register is cleared so that ODIN can handle subse-
     quent events. These operations are repeated for all neurons in the range
     [0, SYNFIRECHAIN_NEURONS); a minimal sketch of this handshake is given
     after this list.
  7. lines 56-60 store the addresses of the neurons that fired into the external DRAM,
     which covers the 0x80000000-0x8FFFFFFF range in the SoC memory map,
     as stated in the sims/verilator/generated-src
     /chipyard.TestHarness.SmallSPIFlashODINRocketConfig
     /chipyard.TestHarness.SmallSPIFlashODINRocketConfig.dts device tree source
     file. Thus, the default configuration makes the external DRAM able to store
     up to 256 Mebibytes (MiB) of data, since 0x10000000 = 2^28 possible byte
     addresses correspond to 256 MiB.
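As referenced in item 6, the following is a minimal, self-contained sketch of the AER
output polling handshake described above. The register offsets and the
reg_read32/reg_write32 helpers mirror those used in the listing (mmio.h); the
surrounding function is illustrative, not the verbatim thesis code.

#include <stdint.h>
#include "mmio.h" /* provides reg_read32 / reg_write32 */

#define ODIN_AEROUT_REQ  0x2000
#define ODIN_AEROUT_ADDR 0x2004
#define ODIN_AEROUT_ACK  0x2008
#define SYNFIRECHAIN_NEURONS 8

/* Collect one AER output event per neuron, using the polling
 * handshake: wait for REQ, latch ADDR, raise ACK, wait for REQ
 * to drop, then clear ACK so ODIN can issue the next event. */
static void collect_aer_events(uint8_t neurons[SYNFIRECHAIN_NEURONS])
{
    for (int i = 0; i < SYNFIRECHAIN_NEURONS; i++) {
        while (reg_read32(ODIN_AEROUT_REQ) != 1)
            ;                                /* poll: no interrupts available */
        neurons[i] = (uint8_t)reg_read32(ODIN_AEROUT_ADDR);
        reg_write32(ODIN_AEROUT_ACK, 1);     /* acknowledge the event */
        while (reg_read32(ODIN_AEROUT_REQ) != 0)
            ;                                /* wait for REQ to be lowered */
        reg_write32(ODIN_AEROUT_ACK, 0);     /* clear ACK for the next event */
    }
}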
       1 mem_req_valid indicates whether the SPI transaction is valid. Every
     low-to-high transition signals that the transmission is being executed and data
     is being sent to ODIN.
Table 4.1: SPI Flash Module Commands. Note that cmd_proto is always set to
SPIFLASH_PROTO_SINGLE and pad_code is always set to 0, as it is unused.
Figure 4.4: Synfire chain with 8 neurons: setup of ODIN SPI slave configuration
registers.
       2 mem_req_addr indicates the byte, coming from the configuration file,
     that the SPI master is reading and sending to ODIN.
       3 spi_addr consists of the first 20 bits sent through the SPI master. The
     address determines the operation to be executed by the ODIN SPI slave module,
     as summarized in Table 4.2.
       4 spi_cnt counts the internal clock cycles, in order to distinguish incoming
     address bits from data-related ones. Indeed, the first 20 SPI clock cycles are
     devoted to the transmission of the operation address, whereas the remaining
     20 are dedicated to data transmission, as depicted in Figure 4.2 and sketched
     in the frame-packing example below.
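To make the 20+20 split concrete, this illustrative C sketch packs the 40-bit frame
of one ODIN SPI operation (20 address bits followed by 20 data bits) in the order
in which the bits would be shifted out; the helper name odin_spi_frame and the
MSB-first ordering are assumptions, not taken from the RTL.

#include <stdint.h>
#include <stdio.h>

/* Build the 40-bit frame of one ODIN SPI operation: 20 address bits
 * followed by 20 data bits (MSB-first ordering assumed). */
static uint64_t odin_spi_frame(uint32_t addr20, uint32_t data20)
{
    return ((uint64_t)(addr20 & 0xFFFFF) << 20) | (data20 & 0xFFFFF);
}

int main(void)
{
    /* Example: a write whose 20-bit address selects the target
     * operation and whose 20-bit payload carries the value. */
    uint64_t frame = odin_spi_frame(0x64060, 0x000A2);
    for (int bit = 39; bit >= 0; bit--)   /* cycles 0..39, as counted by spi_cnt */
        putchar(((frame >> bit) & 1) ? '1' : '0');
    putchar('\n');
    return 0;
}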
Figure 4.6: Synfire chain with 8 neurons: configuration of the least significant byte
of neuron 0, by performing a write operation into the neurons SRAM.
Figure 4.7: Synfire chain with 8 neurons: the configuration phase ends as soon as
SPI_GATE_ACTIVITY is lowered. This lets ODIN run the synfire chain.
Figure 4.5 shows the second step. In this picture the signals from A[12:0] down to
WE refer to the synapses SRAM.
       1 A[12:0] is the address referring to the synapses SRAM. Indeed, spi_addr[19:0]
     contains the value 0x64060, which means that the synapse memory must be written.
     In particular, the synapse between neuron 3 and neuron 4 must be modified, as
     the byte address is 0x010.
       2 once the synapses SRAM Chip Select (CS) signal is asserted, the target
     address is latched into A[12:0]. In the subsequent clock cycle the Write Enable
     (WE) is asserted as well, and the data provided through D[31:0], which is ready
     once spi_cnt is equal to 0x27 (39 in decimal), is written into the memory.
Figure 4.6 shows the third step. In this picture the signals from A[7:0] down to WE
refer to the neurons SRAM.
       1 A[7:0] is the address referring to the neurons SRAM. Indeed, spi_addr[19:0]
     contains the value 0x50000, which means that the neuron memory must be writ-
     ten; in particular, byte 0 of neuron 0 must be modified. Once the neurons SRAM
     Chip Select (CS) signal is asserted, the target address is latched into A[7:0].
       2 In the subsequent clock cycle the Write Enable (WE) is asserted as well,
     and the data provided through D[127:0], which is ready once spi_cnt is equal to
     0x27 (39 in decimal), is written into the memory.
Figure 4.7 shows the fourth step: once all neurons and synapses have been config-
ured, SPI_GATE_ACTIVITY is lowered, ending the configuration phase and letting
ODIN run the synfire chain.
       1 AERIN_REQ is asserted to trigger the ODIN controller and let it accom-
     modate the incoming AER event request if AEROUT_CTRL_BUSY is low.
       2 AERIN_ADDR indicates the type of AER event being requested, together
     with its payload. In this case a VIRTUAL SYNAPSE EVENT, for
     which details can be found in 2.5, is being handled and is targeted at neuron
     0.
       3 AERIN_ACK is asserted once the virtual synapse event has been cor-
     rectly handled and operations are complete.
       5 AEROUT_ADDR is filled with the number of the neuron that has just
     fired and generated the event.
In the end, all neurons in the range [0,7] fire in sequence, their output
AER events are correctly received, and an acknowledge is sent to ODIN for each one
of them. As soon as neuron 7 fires and its event is correctly handled, the firing
sequence starts again from neuron 0, since neuron 7 is connected to it, and the
simulation is stopped. Other simulation experiments, using larger synfire chains,
have been conducted, but the one presented here was selected for its simplicity
and shorter simulation time; simulations of larger networks, such as 256-neuron
synfire chains, behave similarly.
    • The SPI Flash master has been changed to a read-only peripheral, meaning
      that it can only read from a given file. This is done by passing the true ar-
      gument to the chipyard.iobinders.WithSimSPIFlashModel(true) config fragment.
      The chipyard.config.WithSPIFlash(0x100000) fragment should be inserted as
      well, its parameter being the size of the addressable space for the SPI
      controller.
    • The external DRAM has been removed (there is no need to save ODIN results
      off-chip, as was done in the Verilator simulation). To do so, remove the
      chipyard.iobinders.WithBlackBoxSimMem fragment.
    • The smallest RISC-V RocketChip core available has been put in place of the
      standard core. To achieve this, remove the freechips.rocketchip.system.BaseConfig
      and freechips.rocketchip.subsystem.WithNBigCores config fragments, then add
      freechips.rocketchip.system.TinyConfig, which instantiates the smallest core
      available, uses an incoherent bus interconnect rather than a coherent one, and
      removes all interconnects needed for external memories.
The resulting configuration is reported in Listing 4.2.
 6    // tiny config
 7    // no tlmonitors
 8    // no external DRAM
 9    // bufferless broadcast
10    new chipyard.iobinders.WithTieOffInterrupts ++
11    //new chipyard.iobinders.WithBlackBoxSimMem ++
12    //new chipyard.iobinders.WithTiedOffDebug ++
13    //new chipyard.iobinders.WithSimSerial ++
14    new chipyard.iobinders.WithSimSPIFlashModel(true) ++      // add the SPI flash model in the harness (read-only)
15    new chipyard.example.WithODIN ++
16    //new testchipip.WithTSI ++
17
18    new chipyard.config.WithBootROM ++
19    new chipyard.config.WithSPIFlash(0x100000) ++              // add the SPI flash controller (1 MiB)
20    new chipyard.config.WithL2TLBs(1024) ++
21    //new freechips.rocketchip.subsystem.WithBufferlessBroadcastHub ++
22    new freechips.rocketchip.subsystem.WithNoMMIOPort ++
23    new freechips.rocketchip.subsystem.WithoutTLMonitors ++
24    new freechips.rocketchip.subsystem.WithNoSlavePort ++
25    new freechips.rocketchip.subsystem.WithNExtTopInterrupts(0) ++
26    //new freechips.rocketchip.subsystem.WithNSmallCores(1) ++
27    //new freechips.rocketchip.subsystem.WithCoherentBusTopology ++
28    new freechips.rocketchip.system.TinyConfig)
Listing 4.2: Configuration used for synthesis, applying the modifications listed
above.
4.2.1       Area
The synthesis results are shown in Table 4.3 and Table 4.4. As one can see, the
resource utilization on the PYNQ-Z2 board is really low, accounting for 15.99% of
the LUT slices and 11.07% of the Block RAMs (BRAMs), the latter being instantiated
for the ODIN neuron and synapse state memories and for the RocketChip data and
instruction caches. The number of I/O pins is 8:
     1. clock: main clock source
     2. reset: global synchronous reset
     3. SCK: SPIFlash master and ODIN SPI slave clock source
     4. CS: SPIFlash master Chip Select
     5. 4 in/out pins for quad SPI data transmission
This is the minimum needed for a working design. The PYNQ-Z2 provides up
to 125 user-programmable I/O pins (known as IOBs), so there is more than enough
room to integrate other peripherals or systems.
Chapter 5
Conclusions
This work was intended as a first contribution towards the seamless integration
of neuromorphic technologies with state-of-the-art processors, which cannot efficiently
handle the operations that deep learning algorithms require to achieve the results they
were conceived for. The architecture should take the best from both domains. On one
side, there are the advantages of using a "standard" SoC: the principles behind
programming such devices are well known, and doing so is certainly easier than
directly interacting with a neuromorphic architecture like the one ODIN is built on;
moreover, it exploits an open-source ISA that is gaining more and more attention
thanks to its modularity and simplicity, as the base ISA is stable, minimal, and will
not be discontinued in terms of support, plus all the positive points cited in 2.2. On
the other side, the architecture allows for efficient execution of the above-mentioned
algorithms due to the nature of the networks involved, which further pushes down
power consumption, a mandatory feature when dealing with IoT devices. All in all,
there is ample room for improvement, a few directions for which are provided in the
following:
  1. make RocketChip able to reconfigure ODIN at run time. This means that
     proper hardware and software (API) support should be implemented and
     provided to the end user.
  2. FPGA porting. Chipyard leverages a boot ROM containing the instructions
     to run when the SoC is powered on, together with all details concerning the
     SoC modules in the Device Tree Binary structure. The SoC runs those
     instructions, then executes a wait-for-interrupt (WFI) instruction, waiting for
     the RISC-V Frontend Server (FESVR) to load the program and wake up the core
     through an external interrupt. If one wants to deploy and run the system on
     an FPGA, the booting process should be changed, removing the need for an
     external interrupt and letting the user program run as soon as the boot loader
     has completely set up the SoC modules.
  3. exploit the Rocket Custom Coprocessor (RoCC) interface instead of MMIO/SPI,
     first evaluating the pros and cons deriving from the possibility of exploiting
     custom instructions.
  4. change the SPIFlash module to allow the user to specify separate input and
     output data files/streams, as at the moment it can only read from and write to
     a single file, depending on the configuration parameters.
  5. change the SPI slave interface of ODIN to support quad SPI, thus accelerating
     the configuration phase. This would imply not only modifying the external
     interface of the SPI slave module, but also changing the internal behaviour so
     that the controller can process 4 bits at a time rather than one.
  6. have a cluster of ODIN modules talking to each other, so as to increase the
     computing capabilities of such systems.
Acknowledgements
This work established the first step towards the start of my professional career.
Lots of difficulties showed up, and I was surrounded by a multitude of people who
supported me without my even asking. I'd start with a few words on my parents,
Marco and Milena. They truly believed in my desire to pursue a career in engi-
neering, although my dad initially thought I'd be better off studying law, and
provided me with everything a son might ask for, love above all; I felt encour-
aged to leave home and move to Turin. This city changed my way of being and
acting, and fortified my personality while providing me with opportunities to grow.
It is also the city that gave me friendships I'd never live without, and they are
countless, yet I'd like to list a few of them: Vito, Sofia, Cosimo, Samuel, Marco. A
special mention goes to Francesco, who spent hundreds of hours in the past months
listening to my concerns and providing hints on how I could face certain problems. I
could not forget to mention my best friend Aurelio, who has always been present,
no matter the distance between us or the time that passed since we last heard from
each other. I'd like to thank my uncle Carlo, the man who stands as the lighthouse
of my engineering path. Last but not least, I don't know how to express the
gratitude I have towards Giorgia, my girlfriend of two years, who truly understood
the malaise I went through and always had kind words to relieve it.
Bibliography
 [9] Q. Chen, Q. Qiu, H. Li, and Q. Wu. A neuromorphic architecture for anomaly
     detection in autonomous large-area traffic monitoring. In 2013 IEEE/ACM
     International Conference on Computer-Aided Design (ICCAD), pages 202–
     205, 2013. doi: 10.1109/ICCAD.2013.6691119.
[17] ETH Zurich and University of Bologna. Ibex documentation, 2021. URL https:
     //ibex-core.readthedocs.io/en/latest/index.html.
[20] C. Frenkel. ODIN spiking neural network (SNN) processor. URL https://
     github.com/ChFrenkel/ODIN.
[23] W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Popula-
     tions, Plasticity. Cambridge University Press, 2002. URL http://lcn.epfl.
     ch/~gerstner/SPNM/SPNM.html.
[47] RISC-V Foundation. The RISC-V instruction set manual, Volume I: Unprivi-
     leged ISA, 2019. URL https://riscv.org/technical/specifications/.
[52] M. Sorbaro, Q. Liu, M. Bortone, and S. Sheik. Optimizing the energy con-
     sumption of spiking neural networks for neuromorphic applications, 2020.
[60] ETH Zurich and University of Bologna. PULP silicon proven designs, 2021.
     URL https://pulp-platform.org/implementation.html.