The Berkeley Out-of-Order Machine (BOOM): An Industry-
Competitive, Synthesizable, Parameterized RISC-V
                       Processor
                 Christopher Celio
                 David A. Patterson
                 Krste Asanović
                 Electrical Engineering and Computer Sciences
                 University of California at Berkeley
                 Technical Report No. UCB/EECS-2015-167
                 http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html
                 June 13, 2015
                  Copyright © 2015, by the author(s).
                          All rights reserved.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission.
                  The Berkeley Out-of-Order Machine (BOOM):
     An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor
                             Christopher Celio, David Patterson, and Krste Asanović
                            University of California, Berkeley, California 94720–1770
                                            celio@eecs.berkeley.edu
  BOOM is a work-in-progress. Results shown are prelimi-
nary and subject to change as of 2015 June.                              I$           L1 D$ (32k)                L2 data
1. The Berkeley Out-of-Order Machine
BOOM is a synthesizable, parameterized, superscalar out-                  exe
of-order RISC-V core designed to serve as the prototypical
                                                                                       uncore
baseline processor for future micro-architectural studies of                                                regfile
out-of-order processors. Our goal is to provide a readable,                   issue
open-source implementation for use in education, research,
                                                                                             exe
and industry.                                                                                    uncore
   BOOM is written in roughly 9,000 lines of the hardware
construction language Chisel. We leveraged Berkeley’s open-                    ROB                           L2 data (256k)
                                                                                        rename
source Rocket-chip SoC generator, allowing us to quickly bring                bp                   L1 I$ (16k)
up an entire multi-core processor system (including caches
                                                                          Figure 1: 2-wide BOOM, 1.7 mm2 total in TSMC 45 nm
and uncore) by replacing the in-order Rocket core with an
out-of-order BOOM core. BOOM supports atomics, IEEE
754-2008 floating-point, and page-based virtual memory. We            Verilog targeting ASIC tool-flows.[2]
have demonstrated BOOM running Linux, SPEC CINT2006,
                                                                         UC Berkeley also provides the open-source Rocket-chip
and CoreMark.
                                                                      SoC generator, which has been successfully taped out seven
   BOOM, configured similarly to an ARM Cortex-A9,
                                                                      times in two different, modern technologies.[6, 10] BOOM
achieves 3.91 CoreMarks/MHz with a core size of 0.47 mm2
                                                                      makes significant use of Rocket-chip as a library – the caches,
in TSMC 45 nm excluding caches (and 1.1 mm2 with 32 kB
                                                                      the uncore, and functional units all derive from Rocket. In
L1 caches). The in-order Rocket core has been successfully
                                                                      total, over 11,500 lines of code is instantiated by BOOM.
demonstrated to reach over 1.5 GHz in IBM 45 nm SOI, with
the SRAM access being the critical path. As BOOM instan-
                                                                      3. Methodology: What We Plan to Do
tiates the same caches as Rocket, BOOM should be similarly
constrained to 1.5 GHz. So far we have not found it necessary         The typical methodology for single-core studies, as gathered
to deeply pipeline BOOM to keep the logic faster than the             from an informal sampling of ISCA 2014 papers, is to use
SRAM access. With modest resource sizes matching the syn-             CPU2006 coupled with a SimPoints[12]-inspired methodology
thesizable MIPS32 74K, the the worst case path for BOOM’s             to choose the most representative section of the reference input
logic is ∼2.2 GHz in TSMC 45 nm (∼30 FO4).                            set. Each sampling point is typically run for around 10-100
                                                                      million instructions of detailed software-based simulation.
2. Leveraging New Infrastructure
                                                                         The average CPU2006 benchmark is roughly 2.2 trillion
The feasibility of BOOM is in large part due to the available         instructions, with many of the benchmarks exhibiting multiple
infrastructure that has been developed in parallel at Berkeley.       phases of execution.[9] While completely untenable for soft-
   BOOM implements the open-source RISC-V ISA, which                  ware simulators, FPGA-based simulators can bring runtimes
was designed from the ground-up to enable VLSI-driven com-            to within reason – a 50 MHz FPGA simulation can take over
puter architecture research. It is clean, realistic, and highly       12 hours for a single benchmark. Moreover, we hope to utilize
extensible. Available software includes the GCC and LLVM              an FPGA cluster to run all the SPEC workloads in parallel.
compilers and a port of the Linux operating system.[6]
   BOOM is written in Chisel, an open-source hardware con-            4. Comparison to Commercial Designs
struction language developed to enable advanced hardware
design using highly parameterized generators. Chisel allows           Table 1 shows preliminary results of BOOM and Rocket for the
designers to utilize concepts such as object orientation, func-       CoreMark EEMBC benchmark (we use CoreMark because
tional programming, parameterized types, and type inference.          ARM does not offer SPEC results for the A9 and A15 cores).
From a single Chisel source, Chisel can generate a cycle-             Our aim is to be competitive in both performance and area
accurate C++ simulator, Verilog targeting FPGA designs, and           against low-power, embedded out-of-order cores.
                                                                  1
                              Table 1: CoreMark results.
              Processor                     Core Area     CoreMark/      Freq     CoreMark/    IPC
                                                                                                           And as the Rocket-chip SoC evolves, BOOM inherits the new
                                            (core+L1s)    MHz/Core      (MHz)       Core                   improvements.
  Intel Xeon E5 2687W (Sandy)†           ≈18 mm2 @32nm      7.36         3,400     25,007        -
     Intel Xeon E5 2667 (Ivy)*           ≈12 mm2 @22nm      5.60         3,300     18,474      1.96        Benchmarks are harder to use than they should be. Bench-
         ARM Cortex-A15*                 2.8 mm2 @28nm      4.72         2,116      9,977      1.50
     RV64 BOOM four-wide*                1.4 mm2 @45nm      4.70         1,500      7,050      1.50        marks can be difficult to work with and exhibit poor perfor-
     RV64 BOOM two-wide*                 1.1 mm2 @45nm      3.91         1,500      5,865      1.25        mance portability across different processors, address modes,
 ARM Cortex-A9 (Kayla Tegra 3)*          ≈2.5 mm2 @40nm     3.71         1,400      5,189      1.19
            MIPS 74K‡                    2.5 mm2 @65nm      2.50         1,600      4,000        -         and ISAs. Many benchmarks (like CoreMark) are written to
           RV64 Rocket*                  0.5 mm2 @45nm      2.32         1,500      3,480      0.76
          ARM Cortex-A5‡                 0.5 mm2 @40nm      2.13         1,000      2,125        -
                                                                                                           target 32-bit addresses, which can cause poor code generation
Results collected from *the authors (using gcc51 -O3 and perf), † [3],                                     for 64-bit processors. We built a histogram generator into the
or ‡ [1]. The Intel core areas include the L1 and L2 caches.                                               RISC-V ISA simulator to help direct us to potential problem
                                                                                                           areas. However, additional compiler optimizations are needed
    Table 2: A sample of academic out-of-order processors.
                          IVM[13]        SCOORE[7]   FabScalar[8, 11]   Sharing[14]      BOOM
                                                                                                           to improve 64-bit RISC-V code generation.
                            √                              √                √              √
 fully synthesizable
                                             √             √                               √
                                                                                                              We were also surprised to find that SPECINT contains sig-
        FPGA
    parameterized
                                                           √                               √               nificant floating point code – an integer core may spend over
                                             √                                             √
    floating point
   atomic support
                                                                                           √               half its time executing software FP routines. As academic
                             √               √             √                √              √
       L1 cache
                                                           √                √              √
                                                                                                           SPEC results are typically reported in terms of CPI, we must
       L2 cache
   virtual memory
                                                                                           √               be careful to not optimize for the wrong cases. We added
                                                                                           √
      boots Linux
                                                           √                               √               hardware FP support to BOOM to address this issue.
       multi-core
          ISA          Alpha (sub-set)    SPARCv8    PISA (sub-set)†        ?            RISC-V               Finally, we found SPEC difficult to work with, especially
     lines of code            ?              ?          65,000†             ?         9,000 + 11,500
† Information                                                                                              in non-native environments. We created the Speckle wrap-
                       was gathered from publicly available code at [4].
                                                                                                           per to help facilitate cross-compiling and generating portable
5. Related Work                                                                                            directories to run on simulators and FPGAs.[5]
                                                                                                           Diagnosing bugs that occur billions of cycles into a pro-
There have been many academic efforts to implement out-of-                                                 gram is hard. We mostly rely on a Chisel-generated C++
order cores. The Illinois Verilog Model (IVM) is a 4-issue, out-                                           simulator for debugging, but at roughly 30 KIPS, 1 billion
of-order core designed to study transient faults.[13] The Santa                                            cycles takes 8 hours. A torture-test generator (and a suite of
Cruz Out-of-Order RISC Engine (SCOORE) was designed to                                                     small test codes) is invaluable.
efficiently target both ASIC and FPGA generation. However,
SCOORE lacks a synthesizable fetch unit.                                                                   Acknowledgments
   FabScalar is a tool for composing synthesizable out-of-order
cores. It searches through a library of parameterized compo-                                               Research partially funded by DARPA Award Number HR0011-
nents of varying width and depth, guided by performance                                                    12-2-0016, the Center for Future Architecture Research, a
constraints given by the designer. FabScalar has been demon-                                               member of STARnet, a Semiconductor Research Corporation
strated on an FPGA,[8] however, as FabScalar did not imple-                                                program sponsored by MARCO and DARPA, and ASPIRE
ment caches, all memory operations were treated as cache                                                   Lab industrial sponsors and affiliates Intel, Google, Huawei,
hits. Later work incorporated the OpenSPARC T2 caches in a                                                 Nokia, NVIDIA, Oracle, and Samsung. Any opinions, find-
tape-out of FabScalar.[11]                                                                                 ings, conclusions, or recommendations in this paper are solely
   The Sharing Architecture is composed of a two-wide out-of-                                              those of the authors and does not necessarily reflect the posi-
order core (or “slice”) that can be combined with other slices                                             tion or the policy of the sponsors.
to form a single, larger out-of-order core. By implementing
                                                                                                           References
a slice in RTL, they were able to accurately demonstrate the
area costs associated with reconfigurable, virtual cores.[14]                                                [1] “ARM Outmuscles Atom on Benchmark,” http://parisbocek.typepad.
                                                                                                                 com/blog/2011/04/arm-outmuscles-atom-on-benchmark-1.html/.
                                                                                                             [2] “Chisel: Constructing Hardware in a Scala Embedded Language,”
6. Lessons Learned                                                                                               https://chisel.eecs.berkeley.edu/.
                                                                                                             [3] “Coremark EEMBC Benchmark,” https://www.eembc.org/coremark/.
Single-board FPGAs have gotten more capable of han-                                                          [4] “FabScalar pre-release tools,” http://people.engr.ncsu.edu/ericro/
dling mobile processor designs. Chisel provides a back-end                                                       research/fabscalar/pre-release.htm.
mechanism to generate memories optimized for FPGAs, but                                                      [5] “Speckle: A wrapper for the SPEC CPU2006 benchmark suite.” https:
                                                                                                                 //github.com/ccelio/Speckle.
requires no changes to the processor’s source code. While                                                    [6] “The RISC-V Instruction Set Architecture,” http://riscv.org/.
some coding patterns map poorly to FPGAs (e.g., large vari-                                                  [7] W. Ashmawi et al., “Implementation of a power efficient high per-
able shifters), generally techniques that map well to ASICs                                                      formance fpu for scoore,” in Workshop on Architectural Research
                                                                                                                 Prototyping (WARP), held in conjunction with ISCA-35, 2008.
also map well to FPGAs.
                                                                                                             [8] B. H. Dwiel et al., “Fpga modeling of diverse superscalar processors,”
Re-use is critical. Some of the most difficult parts of build-                                                   in Performance Analysis of Systems and Software (ISPASS), 2012 IEEE
ing a processor – for example the cache coherency system,                                                        International Symposium on. IEEE, 2012, pp. 188–199.
                                                                                                             [9] A. Jaleel, “Memory characterization of workloads using
the privileged ISA support, and the FPGA and ASIC flows –                                                        instrumentation-driven simulation,” Web Copy:             http://www.
came to BOOM “for free” via the Rocket-chip SoC generator.                                                       glue. umd. edu/ajaleel/workload, 2010.
                                                                                                       2
[10] Y. Lee et al., “A 45nm 1.3 ghz 16.7 double-precision gflops/w risc-v
     processor with vector accelerators,” in European Solid State Circuits
     Conference (ESSCIRC), ESSCIRC 2014-40th. IEEE, 2014, pp. 199–
     202.
[11] E. Rotenberg et al., “Rationale for a 3d heterogeneous multi-core
     processor,” in Computer Design (ICCD), 2013 IEEE 31st International
     Conference on, Oct 2013, pp. 154–168.
[12] T. Sherwood et al., “Automatically characterizing large scale program
     behavior,” ACM SIGARCH Computer Architecture News, vol. 30, no. 5,
     pp. 45–57, 2002.
[13] N. J. Wang et al., “Characterizing the effects of transient faults on
     a high-performance processor pipeline,” in Dependable Systems and
     Networks, 2004 International Conference on. IEEE, 2004, pp. 61–70.
[14] Y. Zhou and D. Wentzlaff, “The sharing architecture: sub-core con-
     figurability for iaas clouds,” in Proceedings of the 19th international
     conference on Architectural support for programming languages and
     operating systems. ACM, 2014, pp. 559–574.